JP2020057129A

JP2020057129A - Program, device and method for pronunciation assessment using language identification model

Info

Publication number: JP2020057129A
Application number: JP2018186432A
Authority: JP
Inventors: Panikos Heracleous; Koichi Takai; Keishi Yasuda
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2018-10-01
Filing date: 2018-10-01
Publication date: 2020-04-09
Anticipated expiration: 2038-10-01
Also published as: JP7064413B2

Abstract

PROBLEM TO BE SOLVED: To provide a program capable of assessing pronunciation of a prescribed language by an assessment object without relying on speech recognition techniques, which impose a heavy processing load because of their high accuracy. SOLUTION: The program for assessing pronunciation of a prescribed language by an assessment object causes a pronunciation assessment device 1 to function as: reference score distribution acquisition means which acquires score distribution information of an assessment reference object, determined by acquiring a plurality of scores for pronunciation of the prescribed language by the assessment reference object, each score being acquired using a language identification model which outputs a degree of certainty that input pronunciation is pronunciation in the prescribed language; object score distribution determination means which uses the same language identification model to acquire a plurality of scores for pronunciation of the prescribed language by the assessment object and determines score distribution information of the assessment object; and assessment score determination means which determines an assessment score for pronunciation of the prescribed language by the assessment object on the basis of a value relating to a distribution parameter of the distribution of the difference between the score distribution information of the assessment object and the score distribution information of the assessment reference object. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for evaluating pronunciation in a predetermined language.

In recent years, with advances in automatic speech recognition (ASR) based on deep neural network (DNN) algorithms, techniques for automatically performing speech evaluation and pronunciation scoring have attracted attention.

For example, Non-Patent Document 1 discloses an automatic pronunciation scoring method based on a DNN and a Gaussian mixture model (GMM). This method takes a student's utterance and a teacher's reference utterance that have the same wording, and performs scoring using phoneme sequences and likelihood ratios. The pronunciation score is given at the phoneme level and indicates how closely the student's utterance imitates the teacher's.

This method also extracts bottleneck features from the DNN, builds a GMM-HMM (hidden Markov model) triphone acoustic model from those features, and uses the model for phoneme alignment. Non-Patent Document 1 compares the scores obtained in this way with scores given by human evaluators and reports that the correlation coefficient between the two reached 0.717. It also compares the method with a baseline method and reports better results.

Non-Patent Document 2, for example, discloses a pronunciation scoring method using ASR. This method uses a large corpus (800 hours, as one example) of non-native speech (speech by speakers whose native language is not the target language) to build a GMM-based ASR, a DNN-based ASR, and an ASR based on tandem bottleneck features. These three front-end ASR systems are followed by an automatic scoring engine that evaluates the English proficiency of non-native speakers in order to assign pronunciation scores to the students' input utterances; the engine extracts scoring features and estimates a score for each spoken response.

Non-Patent Document 2 evaluates this method on a scoring corpus and reports scoring results close to those of human evaluators. Because the method performs scoring based on deep learning, it reportedly gives better results than the GMM-based method; moreover, adopting the tandem configuration with bottleneck features reportedly achieved very high correlation coefficients, for example 0.58 at the item level and 0.78 at the speaker level.

Non-Patent Document 1: M. Nicolao, A. V. Beeston, and T. Hain, "Automatic Assessment of English Learner Pronunciation Using Discriminative Classifiers", in Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech and Signal Processing) 2015, pp. 5351-5355, 2015
Non-Patent Document 2: J. Tao, S. Ghaffarzadegan, L. Chen, and K. Zechner, "Exploring Deep Learning Architectures for Automatically Grading Non-native Spontaneous Speech", in Proceedings of IEEE ICASSP (International Conference on Acoustics, Speech and Signal Processing) 2016, pp. 6140-6144, 2016

However, the conventional techniques described in Non-Patent Documents 1 and 2 rely on forced-alignment phoneme alignment and likelihood score calculation, and this gives rise to several problems.

For example, these conventional techniques need a high-accuracy acoustic model for native speakers (speakers whose native language is the target language); building such a model requires a large training corpus as well as transcriptions of the utterances. Furthermore, when forced-alignment phoneme alignment and likelihood ratio calculation are performed as in Non-Patent Document 1, a non-native acoustic model is also needed, and building it requires a large non-native corpus. Preparing and using corpora of this size takes a great deal of time and is impractical in terms of cost.

In addition, methods based on forced-alignment phoneme alignment as described above require teacher pronunciation data that corresponds to the pronunciation data being evaluated. Moreover, the spoken text is always fixed and given a priori. As a result, methods based on forced-alignment phoneme alignment and likelihood score calculation adapt poorly: applying them to a different language, for example, means repeating the same processing with data for that language.

Furthermore, using a DNN-based ASR such as the technique described in Non-Patent Document 2 makes the system considerably more complex. That is, building a high-accuracy DNN-based ASR requires an enormous training corpus.

Also, when a pronunciation score is assigned to the output of an ASR operating on non-native speech, it is presupposed that the ASR makes no speech recognition errors. Such errors, however, can arise not only from the non-native speech recognition itself but also, in various cases, from failure to recognize specific words or phrases, and they are difficult to suppress.

In addition, building a DNN-based ASR uses an enormous speech corpus, so the amount of computation is very large, a very large memory capacity is required, and a large number of parameters must be set and tuned.

Accordingly, an object of the present invention is to provide a program, a device, and a method capable of evaluating the pronunciation of a predetermined language by an evaluation target without relying on speech recognition techniques, which impose a heavy processing load because of their high accuracy.

According to the present invention, there is provided a pronunciation evaluation program that causes a computer mounted on a device for evaluating pronunciation of a predetermined language by an evaluation target to function as:
reference score distribution acquisition means for acquiring score distribution information of an evaluation reference target, the score distribution information having been determined by acquiring a plurality of scores for pronunciation of the predetermined language by the evaluation reference target, each score being acquired using a language identification model that outputs a score relating to the likelihood that an input pronunciation is pronunciation in the predetermined language;
target score distribution determination means for acquiring, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation target, and determining score distribution information of the evaluation target; and
evaluation score determination means for determining an evaluation score for the pronunciation of the predetermined language by the evaluation target on the basis of a value relating to a distribution parameter of the distribution of the difference between the score distribution information of the evaluation target and the score distribution information of the evaluation reference target.

Specifically, in the pronunciation evaluation program according to the present invention, the evaluation score determination means preferably calculates a value relating to variance as the value relating to the distribution parameter and determines the evaluation score on the basis of that value. It is also preferable that the evaluation score determination means calculates the maximum value of the distribution of the difference as the value relating to the distribution parameter and determines the evaluation score on the basis of that maximum value.

In one embodiment of the pronunciation evaluation program according to the present invention, the evaluation score determination means preferably determines the evaluation score by applying the value relating to the distribution parameter to a trained evaluation score estimation model.
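The form of the trained evaluation score estimation model is not specified in this passage. As a purely illustrative sketch, the following assumes a one-dimensional least-squares line that maps the distribution-parameter value to a five-level score; the function names `fit_line` and `predict_level` and the training pairs are hypothetical and do not come from the patent.

```python
def fit_line(values, human_scores):
    """Fit score = slope * value + intercept by ordinary least squares."""
    n = len(values)
    mean_x = sum(values) / n
    mean_y = sum(human_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(values, human_scores))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_level(model, value):
    """Apply the fitted line and clamp the rounded result to the levels 1..5."""
    slope, intercept = model
    return min(5, max(1, round(slope * value + intercept)))


# Hypothetical training pairs: (distribution-parameter value, human-assigned level).
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
train_levels = [1, 2, 3, 4, 5]
model = fit_line(train_values, train_levels)
level = predict_level(model, 3.2)
```

Any regressor or classifier trained on pairs of distribution-parameter values and human ratings could take the place of the line used here; the point is only that the estimation step is a lookup on a single scalar.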

In another embodiment of the pronunciation evaluation program according to the present invention, it is also preferable that the reference score distribution acquisition means acquires, as the score distribution information of the evaluation reference target, information including the distribution parameters of a normal distribution representing a histogram generated from the plurality of scores for pronunciation of the predetermined language by the evaluation reference target, and
that the target score distribution determination means generates a histogram of the acquired plurality of scores and uses, as the score distribution information of the evaluation target, information including the distribution parameters of a normal distribution representing that histogram.
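A minimal sketch of this embodiment, assuming the normal distribution is fitted by taking the sample mean and the population variance of the scores (the passage does not say how the fit is done). The bin width, the function names, and the score values are illustrative assumptions.

```python
import math
from collections import Counter


def histogram(scores, bin_width=0.1):
    """Count scores per bin; each key is the bin index floor(score / bin_width)."""
    return Counter(math.floor(s / bin_width) for s in scores)


def fit_normal(scores):
    """Return (mean, variance) of the normal distribution summarizing the scores."""
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / n
    return mean, variance


# Hypothetical LID scores for one speaker; real scores would come from the LID model.
scores = [0.90, 0.95, 0.85, 0.90]
hist = histogram(scores)
mu, var = fit_normal(scores)
```

Only `mu` and `var` need to be kept as score distribution information; the histogram itself is an intermediate view of the same scores.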

In yet another embodiment of the pronunciation evaluation program according to the present invention, it is also preferable that the target score distribution determination means newly acquires a score of the evaluation target and updates the score distribution information of the evaluation target, and
that the evaluation score determination means updates the evaluation score for the pronunciation of the predetermined language by the evaluation target on the basis of a value relating to a distribution parameter of the distribution of the difference computed from the updated score distribution information of the evaluation target.
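One way to update the score distribution information as each new score arrives, without storing all past scores, is Welford's online algorithm. The patent does not name an update method; the sketch below swaps in Welford's algorithm as an assumption, and the class name `RunningScoreDistribution` is hypothetical.

```python
class RunningScoreDistribution:
    """Running mean and population variance of LID scores (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def add(self, score):
        """Fold one newly acquired score into the running statistics."""
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (score - self.mean)

    @property
    def variance(self):
        """Population variance of all scores added so far (0.0 when empty)."""
        return self._m2 / self.n if self.n else 0.0


dist = RunningScoreDistribution()
for s in (1.0, 2.0, 3.0):  # illustrative values only
    dist.add(s)
```

After each call to `add`, the evaluation score can be recomputed from the updated variance, which is what makes a near-real-time display of the score feasible.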

It is also preferable that the pronunciation evaluation program according to the present invention further causes the computer to function as reference score distribution determination means for acquiring, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation reference target, determining the score distribution information of the evaluation reference target, and outputting it to the reference score distribution acquisition means.

As a specific example of the pronunciation evaluation program according to the present invention, it is also preferable that the evaluation target is a learner of the predetermined language and the evaluation reference target is a plurality of pronunciation providers who speak the predetermined language as their native language.

According to the present invention, there is also provided a pronunciation evaluation device for evaluating pronunciation of a predetermined language by an evaluation target, the device comprising:
reference score distribution acquisition means for acquiring score distribution information of an evaluation reference target, the score distribution information having been determined by acquiring a plurality of scores for pronunciation of the predetermined language by the evaluation reference target, each score being acquired using a language identification model that outputs a score relating to the likelihood that an input pronunciation is pronunciation in the predetermined language;
target score distribution determination means for acquiring, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation target, and determining score distribution information of the evaluation target; and
evaluation score determination means for determining an evaluation score for the pronunciation of the predetermined language by the evaluation target on the basis of a value relating to a distribution parameter of the distribution of the difference between the score distribution information of the evaluation target and the score distribution information of the evaluation reference target.

According to the present invention, there is further provided a pronunciation evaluation method performed by a computer mounted on a device for evaluating pronunciation of a predetermined language by an evaluation target, the method comprising:
a step of acquiring score distribution information of an evaluation reference target, the score distribution information having been determined by acquiring a plurality of scores for pronunciation of the predetermined language by the evaluation reference target, each score being acquired using a language identification model that outputs a score relating to the likelihood that an input pronunciation is pronunciation in the predetermined language, and, separately, acquiring, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation target and determining score distribution information of the evaluation target; and
a step of determining an evaluation score for the pronunciation of the predetermined language by the evaluation target on the basis of a value relating to a distribution parameter of the distribution of the difference between the score distribution information of the evaluation target and the score distribution information of the evaluation reference target.

According to the pronunciation evaluation program, device, and method of the present invention, the pronunciation of a predetermined language by an evaluation target can be evaluated without relying on speech recognition techniques, which impose a heavy processing load because of their high accuracy.

FIG. 1 is a schematic diagram showing an embodiment of a pronunciation evaluation system including a pronunciation evaluation device according to the present invention.
A schematic diagram outlining the flow of an embodiment of the pronunciation evaluation method of the present invention, as carried out by the reference score distribution determination unit, the target score distribution determination unit, and the evaluation score determination unit.
A graph for explaining an example (student A) of the pronunciation evaluation method according to the present invention.
A graph for explaining an example (student B) of the pronunciation evaluation method according to the present invention.
A graph for explaining an example (student C) of the pronunciation evaluation method according to the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

[Pronunciation evaluation system and device]
FIG. 1 is a schematic diagram showing an embodiment of a pronunciation evaluation system including a pronunciation evaluation device according to the present invention.

The pronunciation evaluation system of this embodiment shown in FIG. 1 includes:
(a) a pronunciation evaluation device 1 according to the present invention; and
(b) a server 2 communicatively connected to the pronunciation evaluation device 1 via the Internet or the like.
The pronunciation evaluation device 1 of (a) is a device that evaluates pronunciation of a predetermined language; it can be, for example, a terminal onto which the pronunciation evaluation program according to the present invention has been downloaded, such as a personal computer (PC), a tablet computer, or a smartphone.

The pronunciation evaluation device 1 receives, for example via a microphone 107, a group of pronunciations in the foreign language (predetermined language) being learned by a user, for example a student at a foreign-language conversation school (for example, a second-language learner), and can present to this student (user), for example via a display 105, an evaluation result for that group of pronunciations, for example a five-level score described later.

Specifically, the pronunciation evaluation device 1 characteristically has:
(A) a reference score distribution acquisition unit 112 that acquires "score distribution information" of a teacher (evaluation reference target), the information having been determined by acquiring a plurality of scores for pronunciation of the predetermined language by the evaluation reference target, for example a native-speaker teacher at a foreign-language conversation school, each score being acquired using a "language identification (LID) model" that outputs a score relating to the likelihood that an input pronunciation is pronunciation in the predetermined language;
(B) a target score distribution determination unit 113 that uses the same "LID model" to acquire a plurality of scores for pronunciation of the predetermined language by one student (evaluation target) and determines "score distribution information" of the student (evaluation target); and
(C) an evaluation score determination unit 114 that determines an evaluation score for the student's (evaluation target's) pronunciation of the predetermined language on the basis of a "value relating to a distribution parameter" of the "distribution of the difference" between the "score distribution information" of this student (evaluation target) and the "score distribution information" of the teacher (evaluation reference target).
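The division of labor among configurations (A) to (C) can be sketched as follows. This is a structural illustration under stated assumptions, not the patented implementation: `toy_lid_score` is a hypothetical stand-in for a trained LID model, the utterances are dictionaries carrying made-up scores, and the score distribution information is assumed to be a (mean, variance) pair of a normal distribution.

```python
from statistics import fmean, pvariance


def score_distribution(lid_score, utterances):
    """Score every utterance with the LID model and summarize as (mean, variance)."""
    scores = [lid_score(u) for u in utterances]
    return fmean(scores), pvariance(scores)


def difference_distribution(reference, target):
    """Parameters (mean, variance) of the difference between two normal distributions."""
    mu_ref, var_ref = reference
    mu_tgt, var_tgt = target
    # Variances add when the two score distributions are treated as independent.
    return mu_tgt - mu_ref, var_ref + var_tgt


def toy_lid_score(utterance):
    """Hypothetical LID model: returns a precomputed likelihood-like score."""
    return utterance["lid_score"]


# Made-up scores for illustration only.
teacher_utts = [{"lid_score": s} for s in (0.9, 0.8, 1.0)]
student_utts = [{"lid_score": s} for s in (0.5, 0.7, 0.3)]

teacher = score_distribution(toy_lid_score, teacher_utts)  # reference side, (A)
student = score_distribution(toy_lid_score, student_utts)  # target side, (B)
mu_d, var_d = difference_distribution(teacher, student)    # input to (C)
```

Because the same `lid_score` callable is passed on both sides, the sketch mirrors the requirement that the teacher and the student are scored by the same LID model.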

Here, the teacher (evaluation reference target) in configuration (A) above may be a single specific person serving as the reference, but it is also preferable to use a plurality of teachers in order to ensure the stability and high standard of the reference. In addition, because they serve as the reference for identification by the "LID model", these evaluation reference targets (teachers) are preferably a plurality of pronunciation providers who speak the taught predetermined language as their native language, that is, native speakers of the predetermined language.

The "distribution of the difference" in configuration (C) above is described in detail later; for example, the "distribution of the difference" between a normal distribution $N_1(\mu_1, \sigma_1^2)$ and a normal distribution $N_2(\mu_2, \sigma_2^2)$ is set to the normal distribution $N_d(\mu_d = \mu_2 - \mu_1,\ \sigma_d^2 = \sigma_1^2 + \sigma_2^2)$. The "value relating to a distribution parameter" of the "distribution of the difference", which is also described in detail later, can be a value relating to the variance $\sigma_d^2$ of the "distribution of the difference". More specifically, it is also preferable to use the maximum value of the probability density of the "distribution of the difference", $(2\pi)^{-0.5}/\sigma_d = (2\pi(\sigma_1^2 + \sigma_2^2))^{-0.5}$.
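The maximum value quoted above follows from the normal density evaluated at its mean. A short derivation, assuming the two score distributions are independent (which is what makes the variances add):

```latex
\[
f_d(x) = \frac{1}{\sqrt{2\pi\sigma_d^{2}}}
         \exp\!\left(-\frac{(x-\mu_d)^{2}}{2\sigma_d^{2}}\right),
\qquad
\max_x f_d(x) = f_d(\mu_d)
             = \frac{(2\pi)^{-0.5}}{\sigma_d}
             = \bigl(2\pi(\sigma_1^{2}+\sigma_2^{2})\bigr)^{-0.5}.
\]
```

The peak therefore depends only on the two variances and not on the means $\mu_1$ and $\mu_2$.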

Incidentally, in this case the maximum value of the probability density, $(2\pi(\sigma_1^2 + \sigma_2^2))^{-0.5}$, which serves as the "value relating to a distribution parameter" of the "distribution of the difference", can be calculated once the variances $\sigma_1^2$ and $\sigma_2^2$ are known. Consequently, once
(a) the variance $\sigma_1^2$ as the "score distribution information" of the teacher (evaluation reference target), and
(b) the variance $\sigma_2^2$ as the "score distribution information" of the student (evaluation target)
have been acquired, the student's evaluation score can ultimately be determined.
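The calculation from the two variances can be sketched in a few lines. The function name `peak_density` and the numeric inputs are illustrative assumptions; the formula is the one given in the text.

```python
import math


def peak_density(var_reference, var_target):
    """Maximum probability density of the difference distribution.

    Implements (2 * pi * (var_reference + var_target)) ** -0.5.
    """
    var_diff = var_reference + var_target
    if var_diff <= 0.0:
        raise ValueError("the summed variance must be positive")
    return 1.0 / math.sqrt(2.0 * math.pi * var_diff)


# Illustrative variances only: same teacher, two students with different score spread.
steady_student = peak_density(0.01, 0.01)
erratic_student = peak_density(0.01, 0.04)
```

For a fixed teacher variance, a student whose LID scores vary less yields a larger peak value, so the value orders students by how tightly their scores cluster.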

In this way, the pronunciation evaluation device 1 according to the present invention automatically derives an evaluation score for the student's (evaluation target's) pronunciation on the basis of scores acquired using the "LID model". Here, the "LID model" is a model that can identify and classify the type of language; specifically, it takes the student's (evaluation target's) pronunciation as input and outputs a likelihood, that is, how probable it is that the pronunciation is pronunciation in the predetermined language. In other words, the model is not required to be extremely accurate, and building it therefore does not impose a very heavy processing load.

As a result, the pronunciation evaluation device 1 can provide an evaluation score of sufficiently high accuracy without using a speech recognition model (an automatic speech recognition (ASR) model), which imposes a heavy processing load because of its high accuracy, and without requiring a non-native corpus. It does so by using an "LID model" that does not need such high accuracy and by focusing on the "value relating to a distribution parameter" of the "distribution of the difference".

Furthermore, with the pronunciation evaluation device 1, pronunciation evaluation requires neither transcribing the utterance data into text nor, for example, a reference utterance provided by the evaluation reference target (teacher).

It therefore becomes possible to further reduce the amount of computation and the memory required in the pronunciation evaluation device 1; in that case, the pronunciation evaluation device 1 can, for example, be implemented on a mobile terminal with limited computing capability. It also becomes feasible, for example, to provide a mode that outputs the final pronunciation evaluation score in near real time.

Incidentally, the reference score distribution determination function (reference score distribution determination unit) that determines the "score distribution information" of the teacher (evaluation reference target) in configuration (A) above may be provided in the pronunciation evaluation device 1 itself, as indicated by reference numeral 111 in FIG. 1, or, as a variation, the server 2 may preferably provide this function (reference score distribution determination unit 212). In the latter case, the "score distribution information" of the teacher (evaluation reference target) is transmitted from the server 2 to the pronunciation evaluation device 1.

Similarly, the LID model construction function (language identification model construction unit) that builds the "LID model", a main component of the pronunciation evaluation device 1, may be provided in the pronunciation evaluation device 1 itself, as indicated by reference numeral 121 in FIG. 1, or the server 2 may preferably provide this function (language identification model construction unit 211). In the latter case, the built (trained) "LID model" is transmitted from the server 2 to the pronunciation evaluation device 1, and the native corpus used for the construction (reference numeral 102 in FIG. 1) is held by the server 2 rather than by the pronunciation evaluation device 1.

Naturally, the evaluation target for the pronunciation evaluation score is not limited to a student (language learner) as in this embodiment; it may be, for example, an automatic dialogue system (equipped with dialogue scenarios) that becomes able to converse in the predetermined language through learning. Likewise, the evaluation reference target is not limited to a teacher (language instructor); it may be, for example, an automatic dialogue system that, through repeated updates, has become capable of native-equivalent speech and is thus fully usable as a reference.

[Configuration of pronunciation evaluation device]
Also according to the functional block diagram of FIG. 1, the pronunciation evaluation device 1 includes a communication interface unit 101, a native corpus 102, a user pronunciation storage unit 103, an evaluation score storage unit 104, a touch panel display (TP/DP) 105, a microphone (MC) 107, a speaker (SP) 108, and a processor and memory.

Here, the processor and memory store an embodiment of the pronunciation evaluation program according to the present invention, have a computer function, and carry out the pronunciation evaluation processing by executing this program. Accordingly, the pronunciation evaluation device 1 may be, for example, a personal computer (PC), a notebook or tablet computer, or a smartphone on which the pronunciation evaluation program according to the present invention is installed.

Further, the processor and memory include a reference score distribution determination unit 111 having a language identification unit 111a, a reference score distribution acquisition unit 112, a target score distribution determination unit 113 having a language identification unit 113a, an evaluation score determination unit 114 having a difference distribution calculation unit 114a and an evaluation score estimation unit 114b, a language identification model construction unit 121, an evaluation score estimation model construction unit 122, a communication control unit 131, and an input/output control unit 132. These functional components can be regarded as functions of the pronunciation evaluation program stored in the processor and memory. The processing flow shown by the arrows connecting the functional components of the pronunciation evaluation device 1 in FIG. 1 can also be understood as an embodiment of the pronunciation evaluation method according to the present invention.

Also in the functional block diagram of FIG. 1, the communication control unit 131 preferably, via the communication interface unit 101,
(a) receives the trained LID (language identification) model from the server 2,
(b) receives the trained evaluation score estimation model from the server 2,
(c) receives from the server 2 the reference score distribution information, which is the score distribution information of the teacher (evaluation reference target), and
(d) transmits the evaluation score output from the evaluation score determination unit 114 to an external information processing device.

Note that in an embodiment in which the pronunciation evaluation device 1 has the language identification model construction unit 121, reception of the LID model in (a) above is unnecessary. In an embodiment in which the pronunciation evaluation device 1 has the evaluation score estimation model construction unit 122, reception of the evaluation score estimation model in (b) above is unnecessary. In an embodiment in which the pronunciation evaluation device 1 has the reference score distribution determination unit 111, reception of the reference score distribution information in (c) above is also unnecessary. Finally, when it is sufficient to present the evaluation score output from the evaluation score determination unit 114 to the student (evaluation target), for example via the display 105, the external transmission of the evaluation score in (d) above is unnecessary as well.

The user pronunciation storage unit 103 stores pronunciation data of the predetermined language spoken by a student (evaluation target), acquired for example via the microphone 107 and converted into digital data of a predetermined format by the input/output control unit 132. The user pronunciation storage unit 103 stores the pronunciation data separately for each student, linked to the student's identifier (ID), so that score distribution information specific to each student can be generated from that student's pronunciation data.

The language identification unit 111a of the reference score distribution determination unit 111 takes a plurality of pronunciation data items (more precisely, feature values generated from them) for each teacher in the group of native teachers serving as the evaluation reference target, for example from the native corpus 102, inputs them to the LID model, and has the LID model output a certainty score corresponding to the degree of certainty that the pronunciation is in the predetermined language. Alternatively, the pronunciation of the native teachers may be acquired via the microphone 107, converted into digital pronunciation data by the input/output control unit 132, and used as input data to the LID model.

The reference score distribution determination unit 111
(a) generates a certainty score histogram from the plurality of certainty scores acquired as described above,
(b) determines a normal distribution that represents the generated certainty score histogram, that is, a normal distribution fitted to the histogram, and
(c) takes the distribution parameter information of this normal distribution, for example the mean μ₁ and the variance σ₁², as the "reference score distribution information", which is the certainty score distribution information of the teacher group (evaluation reference target).

Here, the certainty score histogram can be, for example, a graph with the certainty score on the horizontal axis and, for each certainty score bin, the frequency (count) of certainty scores falling in that bin on the vertical axis.

The fitting to the certainty score histogram in (b) above can be performed using a known technique such as the nonlinear least squares method. For example, the function fitting feature of commercially available data analysis software may be used.
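Steps (a) to (c) can be sketched as follows. This is a minimal illustration, not the fitting procedure of the embodiment: the text names nonlinear least squares, whereas this sketch swaps in a simpler moment estimate computed from the bin centers and counts. The function names, the bin count of 10, and the sample scores are assumptions made only for this example.

```python
def histogram(scores, bins=10, lo=0.0, hi=1.0):
    """Step (a): count certainty scores per equal-width bin on [lo, hi]."""
    counts = [0] * bins
    for s in scores:
        # Clamp the index so that a score equal to `hi` lands in the last bin.
        idx = min(int((s - lo) * bins / (hi - lo)), bins - 1)
        counts[idx] += 1
    width = (hi - lo) / bins
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return centers, counts


def fit_normal(centers, counts):
    """Steps (b)-(c): estimate (mean, variance) from the histogram.

    Assumption: a count-weighted moment estimate stands in for the
    nonlinear least squares fit mentioned in the text.
    """
    total = sum(counts)
    mean = sum(c * n for c, n in zip(centers, counts)) / total
    var = sum(n * (c - mean) ** 2 for c, n in zip(centers, counts)) / total
    return mean, var


# Illustrative certainty scores only (not data from the patent).
scores = [0.95, 0.97, 0.99, 0.91, 0.88, 0.96, 0.62, 0.98]
centers, counts = histogram(scores)
mu, var = fit_normal(centers, counts)
```

A true least squares fit would instead minimize the squared difference between the normalized bin heights and the normal density evaluated at the bin centers.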

As a further modification, when the "reference score distribution information" is determined by the reference score distribution determination unit 212 of the server 2, the reference score distribution acquisition unit 112 acquires this "reference score distribution information" via the communication interface unit 101.

The language identification model construction unit 121 constructs the LID model used by the reference score distribution determination unit 111 and the target score distribution determination unit 113. The LID model is constructed, for example, with a known deep neural network (DNN) algorithm.

Specifically, feature values are generated from the digital pronunciation data of native teachers taken from the native corpus 102, and the LID model is constructed by training the DNN using these feature values together with the language of the pronunciation (the native speaker's mother tongue) as training data.

Accordingly, an LID model is preferably constructed for each predetermined language (for example, for each language that students study) using the native corpus of that language, so that, for example, an LID model for English, an LID model for Greek, and an LID model for Chinese are prepared. In this case, these LID models can also be used to perform language classification, that is, to determine which of several languages an input pronunciation belongs to.
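The language classification mentioned above can be sketched as picking the language whose LID model returns the highest certainty score. The text does not specify the decision rule, so the arg-max rule is an assumption, as are the `LidModel` type, the function name, and the toy models that return fixed scores.

```python
from typing import Callable, Dict, Sequence

# Hypothetical stand-in: a per-language LID model is any callable that maps
# a feature vector to a certainty score in [0, 1].
LidModel = Callable[[Sequence[float]], float]


def classify_language(features: Sequence[float], models: Dict[str, LidModel]) -> str:
    """Return the language whose LID model is most certain about the input."""
    scores = {lang: model(features) for lang, model in models.items()}
    return max(scores, key=scores.get)


# Toy models with fixed outputs, for illustration only.
toy_models = {
    "en": lambda f: 0.9,
    "el": lambda f: 0.2,
    "zh": lambda f: 0.1,
}
best = classify_language([0.0, 1.0], toy_models)
```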

Unlike an ASR (automatic speech recognition) model, which requires high accuracy, the LID model only has to identify the predetermined language, so it can be constructed with far less training data than an ordinary ASR model. As a result, pronunciation evaluation functions for a variety of languages can also be realized easily.

As described above, an embodiment is also possible in which the server 2 constructs the LID model in the language identification model construction unit 211 and provides it to the pronunciation evaluation device 1. In this case, the language identification model construction unit 121 is unnecessary.

The target score distribution determination unit 113, which has the language identification unit 113a, takes the pronunciation data (more precisely, feature values generated from it) of a non-native student who is the evaluation target, for example from the user pronunciation storage unit 103, inputs it to the same LID model as that used by the reference score distribution determination unit 111, and has the LID model output a certainty score corresponding to the degree of certainty that the pronunciation is in the predetermined language.

Here, the pronunciation of the non-native student may be acquired via the microphone 107, converted into digital pronunciation data by the input/output control unit 132, stored for the time being in the user pronunciation storage unit 103, and then used as input data to the LID model.

The student's pronunciation is preferably, for example, a reading of the same reference sentences on which the teacher's pronunciation (processed for language identification by the reference score distribution determination unit 111) was based. However, it need not rely on such reference sentences and may instead be, for example, free speech uttered in conversation.

The target score distribution determination unit 113
(a) generates a certainty score histogram from the plurality of certainty scores acquired as described above,
(b) determines a normal distribution that represents the generated certainty score histogram, that is, a normal distribution fitted to the histogram, and
(c) takes the distribution parameter information of this normal distribution, for example the mean μ₂ and the variance σ₂², as the "target score distribution information", which is the certainty score distribution information of the student (evaluation target).

In this way, the target score distribution determination unit 113 scores the pronunciation of a non-native student using the same LID model as that used by the reference score distribution determination unit 111, that is, a native model constructed from native speech (the native corpus 102). The pronunciation evaluation device 1 therefore has no need to construct or use a model for the evaluation target (non-native speakers), and hence needs no non-native corpus, which reduces the processing load and memory load on the device.

Here, the language identification processing by the LID model in the target score distribution determination unit 113 can, in a sense, be regarded as processing that distinguishes native from non-native speech. Therefore, when the pronunciation of a non-native student is close to that of the reference native teachers, the LID model becomes prone to native/non-native identification "errors", and the identification results tend to fluctuate widely.

In other words, the closer a non-native student's pronunciation comes to native pronunciation, the broader the generated histogram actually becomes around a certain certainty score value, and hence the larger the variance σ₂² of the fitted normal distribution tends to be. Conversely, the more the student's pronunciation differs from native pronunciation, the more the histogram concentrates around relatively small certainty score values, and hence the smaller the variance σ₂² tends to be.

Here, the variance σ₁² of the normal distribution for the native teachers can be regarded, in the student's pronunciation evaluation processing, as a fixed reference value given a priori. It follows that the closeness of pronunciation between the student and the teachers is reflected in the value of the variance σ₂² of the student's normal distribution.

Incidentally, the certainty score histogram (and its normal distribution) of a student (evaluation target) can initially be generated from, for example, about 5 to 10 speech data items. Thereafter, the generated certainty score histogram (and the distribution parameters of the normal distribution) is preferably updated successively with new speech data from the student. This makes it possible, for example, to grasp the student's latest (current) language proficiency.
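One way to picture the successive update is a running estimate of the mean and variance that absorbs each new certainty score as it arrives. The sketch below uses Welford's online algorithm on the raw scores. This is an assumption made for illustration: the embodiment refits a normal distribution to an updated histogram, and the class and attribute names here are hypothetical.

```python
class RunningScoreStats:
    """Running mean and population variance of certainty scores (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, score):
        """Fold one new certainty score into the running statistics."""
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (score - self.mean)

    @property
    def variance(self):
        """Population variance of the scores seen so far (0.0 if none)."""
        return self._m2 / self.n if self.n > 0 else 0.0


# Illustrative scores only: an initial batch of 5, then one new utterance.
stats = RunningScoreStats()
for s in [0.1, 0.2, 0.15, 0.4, 0.05]:
    stats.update(s)
stats.update(0.7)
```

Keeping only `n`, `mean`, and `_m2` per student avoids re-reading stored audio each time a new utterance is scored.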

Furthermore, by generating a certainty score histogram for each predetermined period after the student starts learning, it also becomes possible to track how the student's evaluation score (described later) changes over time, that is, the progress of language learning (the trail of growing proficiency).
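The per-period tracking can be sketched by bucketing time-stamped certainty scores into fixed periods and summarizing each bucket. The 30-day period, the `(days_since_start, score)` input format, and the use of plain sample moments in place of a histogram fit are all assumptions of this example.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def per_period_parameters(timed_scores, period_days=30):
    """Group (days_since_start, score) pairs by period and return
    {period_index: (mean, variance)} in chronological order."""
    buckets = defaultdict(list)
    for day, score in timed_scores:
        buckets[day // period_days].append(score)
    return {p: (fmean(s), pvariance(s)) for p, s in sorted(buckets.items())}


# Illustrative data only: two scores in the first period, two in the second.
history = per_period_parameters([(0, 0.1), (10, 0.2), (35, 0.3), (50, 0.5)])
```

Each per-period variance can then be fed into the evaluation score determination to compare proficiency across periods.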

Also in the functional block diagram of FIG. 1, the evaluation score determination unit 114 determines the evaluation score for the student's (evaluation target's) pronunciation of the predetermined language on the basis of values related to the distribution parameters of the "difference distribution" between the score distribution information of the student (evaluation target) and the score distribution information of the teacher (evaluation reference target), for example its mean μ_d and variance σ_d².

Specifically, the difference distribution calculation unit 114a of the evaluation score determination unit 114 first generates a "difference distribution" from the generated normal distribution of the student and normal distribution of the teacher. This "difference distribution" is itself a normal distribution, generated so that its mean μ_d satisfies
(1) $\mu_d = \mu_2 - \mu_1$
and its variance σ_d² satisfies
(2) $\sigma_d^2 = \sigma_1^2 + \sigma_2^2$.

Here, the peak (maximum) value p_d of the probability density of this "difference distribution" is calculated by
(3) $p_d = (2\pi)^{-0.5} / \sigma_d = (2\pi)^{-0.5} / (\sigma_1^2 + \sigma_2^2)^{0.5} \approx 0.4 / (\sigma_1^2 + \sigma_2^2)^{0.5}$.
As equation (3) shows, the peak value p_d is a monotonically decreasing function of the standard deviation σ₂ of the student's normal distribution, and hence of the variance σ₂², so it is a good indicator of how close the non-native student's pronunciation is to that of the native teachers.
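Equations (1) to (3) translate directly into a few lines of arithmetic, sketched below. Equation (2) matches the variance of the difference of two independent normally distributed variables. Independence is an observation of this note rather than something the text states. The function names and the numeric parameters are illustrative assumptions.

```python
import math


def difference_distribution(mu1, var1, mu2, var2):
    """Equations (1) and (2): mean and variance of the difference distribution.

    (mu1, var1) are the teacher parameters, (mu2, var2) the student parameters.
    """
    return mu2 - mu1, var1 + var2


def peak_density(var_d):
    """Equation (3): peak of a normal density, 1 / (sqrt(2*pi) * sigma_d)."""
    return 1.0 / math.sqrt(2.0 * math.pi * var_d)


# Illustrative parameters only (not values from the patent's figures).
mu_d, var_d = difference_distribution(mu1=0.9, var1=0.01, mu2=0.2, var2=0.03)
p_d = peak_density(var_d)

# With the teacher variance held fixed, a larger student variance lowers the peak.
p_d_broader = peak_density(0.01 + 0.09)
```

The mean μ_d does not enter equation (3), so the preliminary score depends only on the two variances.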

For example, the larger the student's variance σ₂², the smaller the resulting peak value p_d, so the smaller the obtained peak value p_d, the closer the student's pronunciation can be judged to be to native (teacher) pronunciation. Hereinafter, this peak value p_d is referred to as the "preliminary evaluation score", which is determined before the final evaluation score.

Incidentally, the peak value p_d serving as the preliminary evaluation score is also a function of the variance σ₁² of the normal distribution for the native teachers. As described above, however, σ₁² can be regarded as a fixed value, so the peak value p_d can be adopted as an indicator that directly reflects the student's variance σ₂².

It is also possible to adopt, as the preliminary evaluation score, the variance σ₂² itself, a value related to it such as the standard deviation σ₂, or a function of σ₂² or σ₂. Among these, however, the peak value p_d is easier to derive and easier to handle.

Next, the evaluation score estimation unit 114b of the evaluation score determination unit 114 uses the "preliminary evaluation score" determined as described above to determine the final "evaluation score" for the student's pronunciation of the predetermined language. Specifically, in one embodiment, the evaluation score is determined using an evaluation score estimation model constructed from a plurality of training data items, each of which is a pair of
(a) the peak value p_d (of the "difference distribution") as the preliminary evaluation score, and
(b) an evaluation score given by a human evaluator as the ground-truth value.

The evaluation score in (b) above may take various formats and criteria. For example, it can be a value indicating one of five levels: "native level" (Agree level), "quasi-native level" (Mildly Agree level), "average level" (Undecided level), "quasi-non-native level" (Mildly Disagree level), and "non-native level" (Disagree).

The evaluation score estimation model construction unit 122 performs training with training data containing the data of (a) and (b) above and thereby constructs the evaluation score estimation model. The model constructed here may be a regression model or another machine learning model. Incidentally, as described above, an embodiment is also possible in which the server 2 constructs the evaluation score estimation model in the evaluation score estimation model construction unit 213 and provides it to the pronunciation evaluation device 1. In this case, the evaluation score estimation model construction unit 122 is unnecessary.
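As a rough sketch of what a regression-type evaluation score estimation model could look like, the example below fits a straight line from the peak value p_d to a numeric level by ordinary least squares. The text says only that the model may be a regression model, so the linear form, the encoding of the five levels as integers 1 to 5 (5 for "native level"), the function names, and the synthetic training pairs are all assumptions of this example.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_level(p_d, slope, intercept):
    """Map a preliminary score to the nearest level, clamped to 1..5."""
    raw = slope * p_d + intercept
    return min(5, max(1, round(raw)))


# Synthetic training pairs, for illustration only: (a) peak value p_d and
# (b) a human rating. A smaller peak is rated closer to native (higher level).
peaks = [1.0, 1.5, 2.0, 2.5, 3.0]
levels = [5, 4, 3, 2, 1]
slope, intercept = fit_linear(peaks, levels)
level = predict_level(1.4, slope, intercept)
```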

As described above, when the target score distribution determination unit 113 updates the score distribution information (distribution parameter information) of the student (evaluation target), the evaluation score determination unit 114 updates the evaluation score for the student's pronunciation of the predetermined language on the basis of the distribution parameters of the "difference distribution" derived from the updated score distribution information (that is, the updated mean μ_d and variance σ_d²). This makes it possible, for example, to capture the student's latest (current) language proficiency.

Finally, the evaluation score determination unit 114 may present the determined evaluation score (for example, the five-level score described above) to the student being evaluated, for example via the display 105 or the speaker 108. The evaluation score is also preferably stored in the evaluation score storage unit 104, linked to the identifier (ID) of the student concerned and to evaluation period information. In this case, the student's past evaluation results and the trend of the evaluation score can also be presented.

[Pronunciation evaluation method]
FIG. 2 is a schematic diagram outlining the flow of an embodiment of the pronunciation evaluation method of the present invention, carried out by the reference score distribution determination unit 111, the target score distribution determination unit 113, and the evaluation score determination unit 114.

As shown in FIG. 2, the pronunciation evaluation method of the present embodiment has two modes, an offline mode and an online mode. In the offline mode, the reference score distribution determination unit 111
(S1) generates certainty scores for the teacher's (evaluation reference target's) pronunciation using the trained LID model,
(S2) generates the certainty score histogram of the teacher (evaluation reference target), and
(S3) determines a normal distribution fitted to the histogram generated in step S2 and obtains the reference score distribution information (mean μ₁ and variance σ₁²).

In this way, in the pronunciation evaluation method of the present embodiment, the reference score distribution information (mean μ₁ and variance σ₁²), which serves as the reference for calculating the final evaluation score of a student (evaluation target), is prepared in advance in the offline mode.

In the online mode, on the other hand, the reference score distribution information prepared in advance in this way (mean μ₁ and variance σ₁²) can be used to evaluate the student's (evaluation target's) pronunciation of the language being learned, for example in roughly real time. Specifically, the target score distribution determination unit 113
(S4) generates certainty scores for the student's (evaluation target's) pronunciation using the same trained LID model as in step S1,
(S5) generates the certainty score histogram of the student (evaluation target), and
(S6) determines a normal distribution fitted to the histogram generated in step S5 and obtains the target score distribution information (mean μ₂ and variance σ₂²).

Here, if the certainty scores in step S4 are generated from the student's current pronunciation (data), the pronunciation evaluation ultimately obtained reflects the present moment (roughly real time). If the certainty scores are instead generated from the student's pronunciation (data) in a predetermined past period, the pronunciation evaluation ultimately obtained is that for the period in question. In this case, comparing the current final evaluation score with the final evaluation score for that period also makes it possible to grasp how far the student's language proficiency has progressed.

Also as shown in FIG. 2, next in this online mode, the evaluation score determination unit 114
(S7) generates the "difference distribution" of the normal distributions determined in steps S3 and S6,
(S8) calculates the peak value p_d (≈ 0.4 / (σ₁² + σ₂²)^0.5) of the generated "difference distribution", and
(S9) determines the final evaluation score, for example the five-level score described above, from the calculated peak value p_d (preliminary evaluation score) using the evaluation score estimation model.
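The offline/online split of steps S1 to S8 can be sketched compactly with the standard library's `statistics.NormalDist`, whose subtraction operator yields the distribution of the difference of two independent normal variables. Several things here are assumptions made for illustration: `NormalDist.from_samples` estimates the parameters directly from the scores instead of fitting a histogram, the toy LID model simply returns its input, the sample scores are invented, and step S9 (the estimation model) is left out.

```python
from statistics import NormalDist


def score_distribution(lid_model, utterances):
    """S1-S3 (teacher) or S4-S6 (student): score each utterance with the
    LID model and summarize the certainty scores as a normal distribution."""
    scores = [lid_model(u) for u in utterances]
    return NormalDist.from_samples(scores)  # needs at least two scores


def preliminary_score(student, reference):
    """S7-S8: build the difference distribution and take its peak density."""
    diff = student - reference  # mean: mu2 - mu1, variance: var1 + var2
    return diff.pdf(diff.mean)


# Hypothetical LID model: here each "utterance" is already a certainty score.
toy_lid = lambda utterance: utterance

# Offline mode: prepare the reference distribution once.
reference = score_distribution(toy_lid, [0.90, 0.95, 0.99, 0.97])

# Online mode: score the student against the stored reference.
student = score_distribution(toy_lid, [0.10, 0.30, 0.05, 0.60])
p_d = preliminary_score(student, reference)
```

Only the two parameters of `reference` need to be kept on the device for the online mode.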

In this way, the pronunciation evaluation method of the present embodiment uses, in both the offline mode and the online mode, an LID model that does not require particularly high accuracy, and applies the calculation of the "difference distribution", which requires only a relatively small amount of computation, to provide a sufficiently accurate evaluation score for the student (evaluation target). As a result, it becomes feasible, for example, to run the pronunciation evaluation method on a mobile terminal with limited computing power and to present the student's evaluation score to the student in roughly real time.

[Example]
FIGS. 3, 4 and 5 are graphs for explaining an example of the pronunciation evaluation method according to the present invention. FIG. 3 shows the example for the reference teacher group and for student A, FIG. 4 shows the example for student B, and FIG. 5 shows the example for student C.

First, FIG. 3(A) shows the certainty score histogram for English pronunciation by a plurality of native teachers whose mother tongue is English. These certainty scores were generated using the English LID model. In this histogram, the certainty scores are concentrated around the value "1", which indicates that most of the teachers' utterances were correctly identified as "(native) English". However, some certainty scores in the histogram take values between "0" and "1", showing that the language identification processing exhibits some fluctuation.

Next, FIG. 3(B) shows the normal distribution curve generated by fitting to the certainty score histogram of FIG. 3(A). The distribution parameter information of the teacher group (for example, the mean μ₁ and the variance σ₁²) is determined from this normal distribution.

FIG. 3(C), on the other hand, shows the certainty score histogram for English pronunciation by student A, a non-native speaker who is learning English. These certainty scores were generated using the same English LID model as that used to calculate the certainty scores of FIG. 3(A). In this histogram, the certainty scores are concentrated around the value "0", which indicates that most of student A's utterances were correctly identified as "not (native) English".

However, some certainty scores in this histogram take values close to "1". That is, the language identification processing judged some of student A's utterances to be close to native.

Next, FIG. 3(D) shows the normal distribution curve generated by fitting the certainty scores of FIG. 3(C). From this normal distribution, the distribution parameter information of student A (for example, mean μ₂ and variance σ₂²) is determined. The normal distribution curve of student A is broader, and has a larger variance, than those of students B and C described later.

Next, FIG. 3(E) shows the "difference distribution" (normal distribution) curve between the normal distribution of student A shown in FIG. 3(D) and the normal distribution of the teacher group shown in FIG. 3(B). The peak value p_d (≈ 0.4/(σ₁² + σ₂²)^0.5) of this "difference distribution" curve is smaller than those of students B and C described later. This reflects the relatively large variance σ₂² of student A noted above. In other words, student A's English pronunciation tends to be judged as close to native English pronunciation.
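The factor 0.4 in the peak value can be understood as follows, under the assumption (not stated explicitly in this passage) that the "difference distribution" is the distribution of the difference between two independent, normally distributed scores:

```latex
X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2), \quad
X_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)
\;\Longrightarrow\;
X_2 - X_1 \sim \mathcal{N}\!\left(\mu_2 - \mu_1,\ \sigma_1^2 + \sigma_2^2\right)

p_d = \max_x f_{X_2 - X_1}(x)
    = \frac{1}{\sqrt{2\pi\,(\sigma_1^2 + \sigma_2^2)}}
    \approx \frac{0.399}{(\sigma_1^2 + \sigma_2^2)^{0.5}}
```

Because the teacher-group variance σ₁² is fixed, the peak value depends only on the target's variance σ₂²: a broader target distribution gives a lower peak.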

Finally, FIG. 3(F) shows the evaluation score determined by taking the peak value p_d of student A as a preliminary evaluation score and applying the evaluation score estimation model, which is a regression model. According to FIG. 3(F), the evaluation score of student A lies between the "native level" (Agree level) and the "quasi-native level" (Mildly Agree level), closer to the "native level" (Agree level).

Next, the evaluation result for English pronunciation by student B will be described. First, FIG. 4(A) shows a histogram of certainty scores for English pronunciation by student B, a non-native speaker who is learning English. These certainty scores were also generated using the English LID model that was used to calculate the certainty scores of FIG. 3(A). In this histogram, almost all certainty scores are concentrated at the value "0", which indicates that almost all of student B's utterances were correctly identified as "not English (spoken by a native)".

Next, FIG. 4(B) shows the normal distribution curve generated by fitting the certainty scores of FIG. 4(A). From this normal distribution, the distribution parameter information of student B (for example, mean μ₂ and variance σ₂²) is determined. The normal distribution curve of student B is sharper, and has a smaller variance, than those of student A above and student C described later.

Next, FIG. 4(C) shows the "difference distribution" (normal distribution) curve between the normal distribution of student B shown in FIG. 4(B) and the normal distribution of the teacher group shown in FIG. 3(B). The peak value p_d (≈ 0.4/(σ₁² + σ₂²)^0.5) of this "difference distribution" curve is larger than those of student A above and student C described later. This reflects the relatively small variance σ₂² of student B noted above. In other words, student B's English pronunciation is judged to be considerably removed from native English pronunciation.

Finally, FIG. 4(D) shows the evaluation score determined by taking the peak value p_d of student B as a preliminary evaluation score and applying the evaluation score estimation model, which is a regression model. According to FIG. 4(D), the evaluation score of student B lies between the "quasi-non-native level" (Mildly Disagree level) and the "non-native level" (Disagree level).

Next, the evaluation result for English pronunciation by student C will be described. First, FIG. 5(A) shows a histogram of certainty scores for English pronunciation by student C, a non-native speaker who is learning English. These certainty scores were also generated using the English LID model that was used to calculate the certainty scores of FIG. 3(A). The certainty scores in this histogram are concentrated around the value "0", which indicates that most of student C's utterances were correctly identified as "not English (spoken by a native)".

However, some certainty scores in this histogram take values larger than "0". That is, the language identification processing judged some of student C's utterances to be slightly closer to native.

Next, FIG. 5(B) shows the normal distribution curve generated by fitting the certainty scores of FIG. 5(A). From this normal distribution, the distribution parameter information of student C (for example, mean μ₂ and variance σ₂²) is determined. Compared with those of students A and B, the normal distribution curve of student C has a half-width, and therefore a variance, lying between theirs.

Next, FIG. 5(C) shows the "difference distribution" (normal distribution) curve between the normal distribution of student C shown in FIG. 5(B) and the normal distribution of the teacher group shown in FIG. 3(B). The peak value p_d (≈ 0.4/(σ₁² + σ₂²)^0.5) of this "difference distribution" curve lies between those of students A and B.
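The ordering of the three peak values can be illustrated with a short sketch. It evaluates the exact peak height 1/sqrt(2π(σ₁² + σ₂²)) of a normal density, which the text approximates with the factor 0.4. The variance values below are hypothetical and were chosen only to mirror the qualitative description (student A broadest, student B sharpest, student C in between); they are not taken from the figures.

```python
import math

def difference_peak(var_ref, var_target):
    """Peak height of a normal density with variance var_ref + var_target."""
    return 1.0 / math.sqrt(2.0 * math.pi * (var_ref + var_target))

# Hypothetical variances (assumptions, not values from the patent figures).
var_teacher = 0.010
student_vars = {"A": 0.060, "C": 0.030, "B": 0.005}

for name, var in student_vars.items():
    # A larger target variance yields a smaller peak value p_d.
    print(name, round(difference_peak(var_teacher, var), 3))
```

With these inputs the peak values increase in the order A, C, B, matching the relationship described for FIGS. 3(E), 5(C) and 4(C).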

Finally, FIG. 5(D) shows the evaluation score determined by taking the peak value p_d of student C as a preliminary evaluation score and applying the evaluation score estimation model, which is a regression model. According to FIG. 5(D), the evaluation score of student C lies between the "quasi-native level" (Mildly Agree level) and the "average level" (Undecided level), closer to the "quasi-native level" (Mildly Agree level).
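The mapping from the preliminary score p_d to an evaluation score might look like the sketch below. The text says only that the evaluation score estimation model is a regression model; the single-feature linear form, the numeric 1 to 5 encoding of the levels (5 = "Agree", 1 = "Disagree"), the training pairs and the names `fit_line` and `estimate_score` are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical training data: peak values p_d paired with human ratings on an
# assumed 1-5 scale (5 = "Agree", 1 = "Disagree"). Not taken from the patent.
train_pd = [1.2, 1.6, 2.0, 2.6, 3.1]
train_rating = [5.0, 4.0, 3.0, 2.0, 1.0]

a, b = fit_line(train_pd, train_rating)

def estimate_score(p_d):
    """Map a preliminary score p_d to an evaluation score clamped to [1, 5]."""
    return max(1.0, min(5.0, a * p_d + b))

print(round(estimate_score(1.4), 2))
```

The fitted slope is negative here because, in the examples described for students A, B and C, a smaller peak value corresponds to a rating closer to the native level.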

As described above in detail, according to the present invention, an evaluation score for the pronunciation of an evaluation target (for example, a student) can be derived automatically based on scores obtained using an LID model. Here, the LID model is a model capable of identifying and classifying the language type. Specifically, it takes the pronunciation of the evaluation target (student) as input and outputs a certainty, that is, the likelihood that the pronunciation is pronunciation of the predetermined language. Such a model is not required to be highly accurate, and consequently its construction does not impose a very large processing load.

Thus, according to the present invention, an evaluation score of sufficiently high accuracy can be provided by focusing on the distribution parameters of the "difference distribution", using an LID model that does not require such high accuracy, without using an ASR model whose high accuracy imposes a heavy processing load, and without requiring a non-native corpus.

Further, according to the present invention, pronunciation evaluation requires neither converting the utterance data into text nor having the evaluation reference target (for example, a teacher) provide reference utterance sentences. The amount of computation and memory required to implement the present invention can therefore be reduced further, so that, for example, the pronunciation evaluation device according to the present invention can be implemented on a portable terminal with limited computing power. Furthermore, a mode in which the pronunciation evaluation score is output substantially in real time can also be realized.
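One way such a near-real-time mode could be realized is to update the target's distribution parameters incrementally as each new certainty score arrives. The sketch below uses Welford's online algorithm for the running mean and variance. The text does not name this technique, so it is an assumption, as are the class name `RunningNormal`, the teacher variance and the score values.

```python
import math

class RunningNormal:
    """Incrementally tracks the mean and population variance of scores."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def add(self, score):
        # Welford's update: constant memory, one pass over the data.
        self.n += 1
        delta = score - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (score - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

def difference_peak(var_ref, var_target):
    """Peak height of a normal density with variance var_ref + var_target."""
    return 1.0 / math.sqrt(2.0 * math.pi * (var_ref + var_target))

VAR_TEACHER = 0.010  # hypothetical, precomputed from the reference group
student = RunningNormal()
for score in [0.05, 0.10, 0.02, 0.40, 0.08]:  # hypothetical incoming scores
    student.add(score)
    # Recompute the preliminary score p_d after every new utterance.
    print(student.n, round(difference_peak(VAR_TEACHER, student.variance), 3))
```

Only three numbers per learner need to be stored, which fits the reduced memory footprint described above.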

In particular, when language education services are provided at language schools or public educational institutions, the present invention makes it possible to assess each learner's language proficiency appropriately and with a lower burden. In addition, because an LID model with a low processing load is used, the invention can easily be applied to pronunciation evaluation in various languages.

With respect to the various embodiments of the present invention described above, those skilled in the art can readily make various changes, modifications and omissions within the scope of the technical idea and viewpoint of the present invention. The foregoing description is merely an example and is not intended to be limiting. The present invention is limited only as defined by the claims and their equivalents.

1 Pronunciation evaluation device
101 Communication interface unit
102 Native corpus
103 User pronunciation storage unit
104 Evaluation score storage unit
105 Touch panel display (TP/DP)
107 Microphone (MC)
108 Speaker (SP)
111, 212 Reference score distribution determination unit
111a, 113a Language identification unit
112 Reference score distribution acquisition unit
113 Target score distribution determination unit
114 Evaluation score determination unit
114a Difference distribution calculation unit
114b Evaluation score estimation unit
121, 211 Language identification model construction unit
122, 213 Evaluation score estimation model construction unit
131 Communication control unit
132 Input/output control unit
2 Server

Claims

1. A pronunciation evaluation program that causes a computer mounted on a device that evaluates pronunciation of a predetermined language by an evaluation target to function as:
reference score distribution obtaining means for obtaining score distribution information of an evaluation reference target, the score distribution information having been determined by obtaining a plurality of scores for pronunciation of the predetermined language by the evaluation reference target, each score being obtained using a language identification model that outputs a score related to the certainty that input pronunciation is pronunciation of the predetermined language;
target score distribution determining means for obtaining, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation target, and determining score distribution information of the evaluation target; and
evaluation score determination means for determining an evaluation score for the pronunciation of the predetermined language by the evaluation target based on a value related to a distribution parameter in a distribution of the difference between the score distribution information of the evaluation target and the score distribution information of the evaluation reference target.

2. The pronunciation evaluation program according to claim 1, wherein the evaluation score determination means calculates a value related to variance as the value related to the distribution parameter, and determines the evaluation score based on the value related to the variance.

3. The pronunciation evaluation program, wherein the evaluation score determination means calculates the maximum value in the distribution of the difference as the value related to the distribution parameter, and determines the evaluation score based on the maximum value.

4. The pronunciation evaluation program according to claim 1, wherein the evaluation score determination means determines the evaluation score by applying the value related to the distribution parameter to a trained evaluation score estimation model.

5. The pronunciation evaluation program according to any one of claims 1 to 4, wherein
the reference score distribution obtaining means obtains, as the score distribution information of the evaluation reference target, information including a distribution parameter of a normal distribution representing a histogram generated from a plurality of the scores for pronunciation of the predetermined language by the evaluation reference target, and
the target score distribution determining means generates a histogram of the plurality of obtained scores, and sets, as the score distribution information of the evaluation target, information including a distribution parameter of a normal distribution representing the histogram.

6. The pronunciation evaluation program according to any one of claims 1 to 5, wherein
the target score distribution determining means newly obtains a score of the evaluation target and updates the score distribution information of the evaluation target, and
the evaluation score determination means updates the evaluation score for the pronunciation of the predetermined language by the evaluation target based on the value related to the distribution parameter in the distribution of the difference for the updated score distribution information of the evaluation target.

7. The pronunciation evaluation program according to any one of claims 1 to 6, further causing the computer to function as reference score distribution determination means for obtaining, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation reference target, determining the score distribution information of the evaluation reference target, and outputting the score distribution information to the reference score distribution obtaining means.

8. The pronunciation evaluation program according to any one of claims 1 to 7, wherein the evaluation target is a learner of the predetermined language, and the evaluation reference target is a plurality of pronunciation providers who speak the predetermined language as their native language.

9. A pronunciation evaluation device that evaluates pronunciation of a predetermined language by an evaluation target, comprising:
reference score distribution obtaining means for obtaining score distribution information of an evaluation reference target, the score distribution information having been determined by obtaining a plurality of scores for pronunciation of the predetermined language by the evaluation reference target, each score being obtained using a language identification model that outputs a score related to the certainty that input pronunciation is pronunciation of the predetermined language;
target score distribution determining means for obtaining, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation target, and determining score distribution information of the evaluation target; and
evaluation score determination means for determining an evaluation score for the pronunciation of the predetermined language by the evaluation target based on a value related to a distribution parameter in a distribution of the difference between the score distribution information of the evaluation target and the score distribution information of the evaluation reference target.

10. A pronunciation evaluation method performed by a computer mounted on a device that evaluates pronunciation of a predetermined language by an evaluation target, the method comprising:
a step of obtaining score distribution information of an evaluation reference target, the score distribution information having been determined by obtaining a plurality of scores for pronunciation of the predetermined language by the evaluation reference target, each score being obtained using a language identification model that outputs a score related to the certainty that input pronunciation is pronunciation of the predetermined language, and of obtaining, using the language identification model, a plurality of the scores for pronunciation of the predetermined language by the evaluation target and determining score distribution information of the evaluation target; and
a step of determining an evaluation score for the pronunciation of the predetermined language by the evaluation target based on a value related to a distribution parameter in a distribution of the difference between the score distribution information of the evaluation target and the score distribution information of the evaluation reference target.