JP2021032909A - Prediction device, prediction method and prediction program

Info

Publication number: JP2021032909A
Application number: JP2019148529A
Authority: JP
Inventors: Kenichi Arai; Tomohiro Nakatani; Keisuke Kinoshita; Akiko Araki; Atsunori Ogawa; Toshio Irino; Katsuhiko Yamamoto
Original assignee: Nippon Telegraph and Telephone Corp; Wakayama University
Current assignee: Nippon Telegraph and Telephone Corp; Wakayama University
Priority date: 2019-08-13
Filing date: 2019-08-13
Publication date: 2021-03-01
Anticipated expiration: 2039-08-13
Also published as: JP7306626B2

Abstract

To increase the accuracy of predicting word intelligibility, which is a quality evaluation measure for voice signals. SOLUTION: A word intelligibility prediction device 10 has: a voice recognition unit 11 which performs voice recognition on a voice signal to be predicted, using an acoustic model 121 which outputs, for each frame of the input voice signal, the likelihood of each phoneme the frame may correspond to; and a word intelligibility prediction unit 16 which predicts word intelligibility, a quality evaluation measure for the voice signal, on the basis of the voice recognition result from the voice recognition unit 11. SELECTED DRAWING: Figure 1

Description

The present invention relates to a prediction device, a prediction method, and a prediction program.

Quality evaluation measures for voice signals include word intelligibility and syllable intelligibility. Word intelligibility is an index of the proportion of spoken and transmitted meaningful words that are heard correctly, and is defined as the number of words the listener heard correctly divided by the number of words presented to the listener. Syllable intelligibility is an index of the proportion of spoken and transmitted meaningless syllables that are heard correctly, and is defined as the number of syllables the listener heard correctly divided by the number of syllables presented to the listener.

Known evaluations of word intelligibility include the SRT (Speech Reception Threshold), calculated from the recognition rate obtained when subjects recognize the words in a voice signal, and listening effort, obtained from a questionnaire on ease of recognition. However, subject experiments are costly in both money and time. For this reason, methods for measuring word intelligibility objectively from the voice signal have been proposed.

Methods used for objectively measuring word intelligibility include, for example, the Articulation Index (AI), the Speech Intelligibility Index (SII), the Speech Transmission Index (STI), and PESQ (Perceptual Evaluation of Speech Quality). However, because these calculations assume a linear system, they cannot properly evaluate signal transformations that involve non-linear signal processing.

For this reason, measures that can handle some non-linear signal processing, such as the short-time objective intelligibility (STOI) and the hearing-aid speech perception index (HASPI), are widely used as evaluation measures of voice signal quality. In addition, the Gammachirp Envelope Distortion Index (GEDI), which takes human auditory characteristics into account, has been proposed.

Meanwhile, the performance of automatic speech recognizers based on deep learning is approaching that of human hearing, and their recognition rate is expected to approximate the recognition rate obtained in subject experiments. Accordingly, methods have been proposed that predict voice signal quality using recognition by an automatic speech recognizer instead of subject experiments.

One such method runs a matrix test on an automatic speech recognizer: a voice signal of a sentence read aloud is presented, and the word corresponding to a portion of that signal is chosen from a set of candidate answer texts. The SRT, one measure of word intelligibility, is then predicted from the rate of correct answers (see Non-Patent Document 1).

Non-Patent Document 1: Constantin Spille, Stephan D. Ewert, Birger Kollmeier and Bernd T. Meyer, “Predicting speech intelligibility with deep neural networks”, Computer Speech & Language, Vol. 48, pp. 51-66, 2018.

Automatic speech recognizers generally improve their recognition rate by exploiting everything available, including prior knowledge of the language such as a word dictionary.

Voice signal quality, on the other hand, is a property of the voice signal itself, so it is desirable that factors such as linguistic knowledge do not influence the recognition rate. Examples of such influence are that the surrounding context acts as a hint in word recognition, and that the recognition rate changes greatly depending on whether a word is registered in the word dictionary.

Consequently, when the quality of a presented voice signal is predicted with an automatic speech recognizer, the prediction of word intelligibility is affected not only by the voice signal but also by the word knowledge the recognizer uses. For example, the more familiar a word is to listeners, the higher its predicted word intelligibility tends to be. To avoid this influence, the technique described in Non-Patent Document 1 uses a matrix test that does not depend on context and in which every candidate answer is about equally plausible. In other words, the evaluation experiment must be designed so that word familiarity does not affect the quality prediction.

Thus, the voice signal quality prediction technique using the automatic speech recognizer described in Non-Patent Document 1 has the problem that accurate predictions are difficult to obtain for freely spoken speech, or for read-aloud speech of text that was not prepared with the recognizer's prior language information in mind, because word familiarity is not uniform in such material.

The present invention has been made in view of the above, and its object is to provide a prediction device, a prediction method, and a prediction program that can improve the accuracy of predicting word intelligibility, a quality evaluation measure for voice signals, without requiring advance measures such as making word familiarity uniform.

To solve the above problem and achieve the object, a prediction device according to the present invention includes: a voice recognition unit that performs voice recognition on a voice signal to be predicted, using an acoustic model that outputs which phoneme each frame of an input voice signal is likely to correspond to; and a prediction unit that predicts word intelligibility, a quality evaluation measure for the voice signal, based on the voice recognition result from the voice recognition unit.

A prediction method according to the present invention is a prediction method executed by a prediction device, and includes: a step of performing voice recognition on a voice signal to be predicted, using an acoustic model that outputs which phoneme each frame of an input voice signal is likely to correspond to; and a step of predicting word intelligibility, a quality evaluation measure for the voice signal, based on the voice recognition result.

A prediction program according to the present invention causes a computer to execute: a step of performing voice recognition on a voice signal to be predicted, using an acoustic model that outputs which phoneme each frame of an input voice signal is likely to correspond to; and a step of predicting word intelligibility, a quality evaluation measure for the voice signal, based on the voice recognition result.

According to the present invention, the accuracy of predicting word intelligibility, a quality evaluation measure for voice signals, can be improved.

FIG. 1 is a diagram showing an outline of the configuration of a word intelligibility prediction device according to an embodiment.
FIG. 2 is a diagram illustrating the training of the acoustic model and the phoneme language model shown in FIG. 1.
FIG. 3 is a diagram illustrating parameter adjustment of the prediction function of the word intelligibility prediction unit shown in FIG. 1.
FIG. 4 is a diagram illustrating the processing of the word intelligibility prediction device shown in FIG. 1.
FIG. 5 is a flowchart showing the processing procedure of the word intelligibility prediction processing according to the embodiment.
FIG. 6 is a diagram illustrating an evaluation experiment of the word intelligibility prediction device shown in FIG. 1.
FIG. 7 is a diagram showing an example of a computer that implements the word intelligibility prediction device by executing a program.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to this embodiment. In the drawings, identical parts are denoted by identical reference numerals.

[Embodiment]
An embodiment of the present invention will be described. The present embodiment relates to a word intelligibility prediction device that predicts the word intelligibility obtained in subject experiments from the phoneme recognition rate of a speech recognizer.

First, the configuration of the word intelligibility prediction device according to the embodiment will be described. FIG. 1 is a diagram showing an outline of the configuration of the word intelligibility prediction device according to the embodiment. The word intelligibility prediction device 10 according to the embodiment predicts word intelligibility based on the voice recognition rate for an input voice signal.

The word intelligibility prediction device 10 is implemented, for example, by loading a predetermined program into a computer that includes a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the like, and having the CPU execute the program. The word intelligibility prediction device 10 also has a communication interface for exchanging various information with other devices connected via a network or the like. For example, the word intelligibility prediction device 10 has a NIC (Network Interface Card) or the like and communicates with other devices over a telecommunication line such as a LAN (Local Area Network) or the Internet. The word intelligibility prediction device 10 includes a voice recognition unit 11 and a word intelligibility prediction unit 16 (prediction unit).

The voice recognition unit 11 is an automatic speech recognizer that performs voice recognition on a voice signal to be predicted, using an acoustic model that outputs which phoneme each frame of the input voice signal is likely to correspond to. The voice recognition unit 11 includes a phoneme output unit 12, a phoneme arrangement output unit 13, a phoneme recognition unit 14 (recognition unit), and a recognition rate calculation unit 15 (calculation unit).

The phoneme output unit 12 uses the acoustic model 121 to output phoneme candidates corresponding to each frame of the voice signal to be predicted.

The acoustic model 121 is a model that outputs which phoneme each frame of the input voice signal is likely to correspond to. The acoustic model 121 is a deep learning model. A deep learning model consists of an input layer that receives the signal, one or more intermediate layers that transform the signal from the input layer in various ways, and an output layer that converts the signal from the intermediate layers into an output such as probabilities. When a voice signal is fed to the input layer of the acoustic model 121, the output layer outputs the probability of each phoneme, indicating which phoneme each frame of the input voice signal is likely to correspond to.
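
As a rough illustration of the output layer described above, the following sketch turns one frame's output-layer scores into phoneme probabilities with a softmax. The phoneme inventory, the score values, and the function name `frame_posteriors` are hypothetical; the text does not specify the network beyond its input, intermediate, and output layers.

```python
import math

def frame_posteriors(logits, phonemes):
    """Convert one frame's output-layer scores into phoneme probabilities (softmax)."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {p: e / total for p, e in zip(phonemes, exps)}

# Hypothetical three-phoneme inventory and made-up scores for two frames.
PHONEMES = ["a", "k", "s"]
frames = [[2.0, 0.5, 0.1], [0.2, 1.7, 0.3]]
posteriors = [frame_posteriors(f, PHONEMES) for f in frames]
```

Each entry of `posteriors` is a per-frame probability distribution over phonemes, which is the kind of output the phoneme output unit 12 is described as passing on.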

The phoneme arrangement output unit 13 uses the phoneme language model 131 to output candidate phoneme sequences corresponding to the phoneme candidates output by the phoneme output unit 12.

The phoneme language model 131 is a model that outputs the plausibility of a phoneme sequence for the input phoneme candidates. A phoneme language model such as a phoneme n-gram, trained by counting how often phoneme sequences occur in the correct text, is used as the phoneme language model 131.
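
A minimal sketch of such a model, assuming a bigram (n = 2) with add-alpha smoothing, is shown below. The smoothing scheme, the sentence boundary markers, and the toy phoneme transcriptions are illustrative assumptions, not details taken from the text.

```python
from collections import Counter

def train_phoneme_bigram(sequences, alpha=1.0):
    """Estimate P(next | prev) from phoneme sequences with add-alpha smoothing."""
    unigram, bigram, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]  # boundary markers (assumption)
        vocab.update(padded)
        for prev, nxt in zip(padded, padded[1:]):
            unigram[prev] += 1          # count of prev as a history
            bigram[(prev, nxt)] += 1    # count of the pair prev -> nxt
    v = len(vocab)

    def prob(prev, nxt):
        return (bigram[(prev, nxt)] + alpha) / (unigram[prev] + alpha * v)

    return prob

# Hypothetical phoneme transcriptions of the correct text.
corpus = [["k", "a", "s", "a"], ["s", "a", "k", "a"]]
p = train_phoneme_bigram(corpus)
```

Here `p("k", "a")` is larger than `p("k", "s")` because "k" is always followed by "a" in the toy corpus, which is how occurrence frequency in the correct text shapes the plausibility of a phoneme sequence.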

The phoneme recognition unit 14 recognizes the phoneme sequence corresponding to the voice signal to be predicted, based on the phoneme candidates output by the phoneme output unit 12 and the candidate phoneme sequences output by the phoneme arrangement output unit 13. From these candidates, the phoneme recognition unit 14 outputs a phoneme sequence (hereinafter regarded as a word).
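
The text does not say how the two kinds of candidates are combined. One common approach is a Viterbi search over frames, sketched below under simplifying assumptions: one state per phoneme, a fixed self-loop probability `stay`, and a bigram function `bigram_prob(prev, cur)`. All names and numbers are hypothetical, and because runs of identical frame labels are merged, this sketch cannot output the same phoneme twice in a row.

```python
import math

def viterbi_phonemes(posteriors, bigram_prob, phonemes, stay=0.5):
    """Pick the best phoneme path over frames, then merge repeated frame labels."""
    def trans(prev, cur):
        # Staying in a phoneme uses a fixed self-loop probability (assumption);
        # moving to a different phoneme is scored by the phoneme bigram.
        return stay if prev == cur else (1.0 - stay) * bigram_prob(prev, cur)

    score = {p: math.log(posteriors[0][p]) for p in phonemes}
    back = []
    for frame in posteriors[1:]:
        new_score, pointers = {}, {}
        for cur in phonemes:
            best_prev = max(phonemes, key=lambda q: score[q] + math.log(trans(q, cur)))
            new_score[cur] = (score[best_prev] + math.log(trans(best_prev, cur))
                              + math.log(frame[cur]))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    state = max(phonemes, key=lambda p: score[p])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    # Collapse runs of identical frame labels into one phoneme each.
    return [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]

# Made-up posteriors for four frames and a flat (uninformative) bigram.
phonemes = ["a", "k"]
posteriors = [
    {"a": 0.1, "k": 0.9},
    {"a": 0.2, "k": 0.8},
    {"a": 0.9, "k": 0.1},
    {"a": 0.8, "k": 0.2},
]
decoded = viterbi_phonemes(posteriors, lambda prev, cur: 0.5, phonemes)
```

With these inputs the first two frames favor "k" and the last two favor "a", so the decoded sequence is `["k", "a"]`.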

The recognition rate calculation unit 15 calculates the accuracy rate of the phoneme sequence recognized by the phoneme recognition unit 14. The recognition rate calculation unit 15 first converts the correct text into words. The correct text is the original text in the case of read-aloud speech; otherwise it is, for example, a manual transcription, which is possible when the original speech is sufficiently clean. The recognition rate calculation unit 15 then compares the output phoneme sequence with the phoneme sequence of the correct text and outputs the phoneme recognition accuracy rate. The recognition rate calculation unit 15 calculates the phoneme recognition accuracy rate P_ACC using equation (1), in which C is the number of correct phonemes, S is the number of substituted phonemes, I is the number of inserted phonemes, and D is the number of deleted phonemes.
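
Equation (1) itself is not reproduced in this text. The sketch below therefore assumes the conventional phoneme accuracy, (C - I) / (C + S + D) expressed as a percentage, which uses exactly the four counts named above; whether the published equation has this exact form is an assumption. The counts are obtained from a minimum-edit-distance alignment, and the function names are hypothetical.

```python
def align_counts(reference, hypothesis):
    """Return (C, S, I, D) from a minimum-edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = edit distance between reference[:i] and hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Backtrack through the table, counting each kind of operation.
    c = s = ins = d = 0
    i, j = n, m
    while i > 0 or j > 0:
        diag = (i > 0 and j > 0 and
                dist[i][j] == dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]))
        if diag:
            if reference[i - 1] == hypothesis[j - 1]:
                c += 1   # correct phoneme
            else:
                s += 1   # substituted phoneme
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            d += 1       # deleted phoneme (in reference, missing from hypothesis)
            i -= 1
        else:
            ins += 1     # inserted phoneme (extra in hypothesis)
            j -= 1
    return c, s, ins, d

def phoneme_accuracy(reference, hypothesis):
    """Assumed form of the accuracy: 100 * (C - I) / (C + S + D)."""
    c, s, ins, d = align_counts(reference, hypothesis)
    return 100.0 * (c - ins) / (c + s + d)

# Toy example: one substitution (s -> t) and one inserted "a".
counts = align_counts(["k", "a", "s", "a"], ["k", "a", "t", "a", "a"])
acc = phoneme_accuracy(["k", "a", "s", "a"], ["k", "a", "t", "a", "a"])
```

In the toy example the alignment gives C = 3, S = 1, I = 1, D = 0, so the assumed accuracy is 100 × (3 - 1) / 4 = 50.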

The word intelligibility prediction unit 16 predicts word intelligibility, a quality evaluation measure for voice signals, based on the voice recognition result from the voice recognition unit 11, and outputs the predicted value. Using a predetermined prediction function, the word intelligibility prediction unit 16 converts the phoneme recognition accuracy rate calculated by the recognition rate calculation unit 15 into a predicted value of word intelligibility.

FIG. 2 is a diagram illustrating the training of the acoustic model 121 and the phoneme language model 131 shown in FIG. 1. The parameters of the acoustic model 121 and the phoneme language model 131 are adjusted by training on data sets of voice data and correct text.

As shown in FIG. 2, a clean voice signal data set Ds1 and a data set of its correct text are first prepared. The clean voice signals are then processed, for example by adding various kinds of noise or applying speech enhancement, to create new voice signals, which form a processed voice signal data set Ds2.
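
The noise-addition part of this data processing can be sketched as follows. The sketch scales a noise signal so that the mixture has a requested signal-to-noise ratio; the sine-wave "speech", the white Gaussian noise (a stand-in, not the noise used in the experiments), and the function name `mix_at_snr` are all illustrative assumptions.

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power / noise power equals snr_db, then add it."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]

# Stand-in signals: a sine wave as "speech" and white Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
mixed = mix_at_snr(speech, noise, -3.0)
```

Running the same function over several SNR values and over each clean utterance would yield a processed data set in the spirit of Ds2; speech enhancement would be applied as a further step.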

The acoustic model 121 is trained on the clean voice signal data set Ds1 and the processed voice signal data set Ds2 (step S2), and the parameters of the acoustic model 121 are adjusted. A conventional method is used to train the acoustic model 121. For the specific procedure, see, for example, Tatsuya Kawahara, "Speech Recognition System, Revised 2nd Edition", Ohmsha, 2016.

For the phoneme language model 131, the frequency of occurrence of phoneme sequences is calculated from the correct text, a phoneme language model such as a phoneme N-gram is trained (step S1), and the parameters of the phoneme language model 131 are adjusted.

FIG. 3 is a diagram illustrating parameter adjustment of the prediction function of the word intelligibility prediction unit 16 shown in FIG. 1. FIG. 4 is a diagram illustrating the processing of the word intelligibility prediction device 10 shown in FIG. 1.

First, as a preparatory step, the word intelligibility prediction unit 16 is calibrated. For the calibration, reference voice signals and their correct text are prepared. A subject experiment is conducted using the reference voice signals, and intelligibility, such as the word recognition rate, is evaluated. Separately, the reference voice signals are input to the word intelligibility prediction device 10, which outputs the phoneme recognition accuracy rate.

Next, based on the results of the subject experiment and the phoneme recognition accuracy rate from the word intelligibility prediction device 10, the parameters of the prediction function of the word intelligibility prediction unit 16 are adjusted (step S3 in FIG. 3) so that the predicted values match the results of the subject experiment. After the parameters of the prediction function have been adjusted, the actual prediction processing is performed as shown in FIG. 4: the voice signal to be predicted and its text are input to the word intelligibility prediction device 10, and a predicted value of word intelligibility is obtained as output.

[Prediction processing]
Next, the prediction processing executed by the word intelligibility prediction device 10 will be described. FIG. 5 is a flowchart showing the processing procedure of the word intelligibility prediction processing according to the embodiment.

When the voice signal to be predicted is input, as shown in FIG. 5, the voice recognition unit 11 first determines whether there is unprocessed data (step S11). If there is unprocessed data (step S11: Yes), the voice recognition unit 11 reads the voice signal to be predicted (step S12) and performs voice recognition.

Specifically, the phoneme output unit 12 uses the acoustic model 121 to output phoneme candidates corresponding to each frame of the voice signal to be predicted (step S13). The phoneme arrangement output unit 13 then uses the phoneme language model 131 to output candidate phoneme sequences corresponding to the phoneme candidates output by the phoneme output unit 12 (step S14). The phoneme recognition unit 14 recognizes the word corresponding to the voice signal to be predicted based on the phoneme candidates and the candidate phoneme sequences (step S15), and the voice recognition unit 11 returns to step S11.

If there is no unprocessed data (step S11: No), the voice recognition unit 11 reads the correct text (step S16). The recognition rate calculation unit 15 then converts the correct text into words, compares all the words recognized by the phoneme recognition unit 14 with the words of the correct text, and calculates the phoneme recognition accuracy rate (step S17).

The word intelligibility prediction unit 16 calculates the predicted value of word intelligibility by using the prediction function to convert the phoneme recognition accuracy rate of the words, calculated by the recognition rate calculation unit 15, into a predicted value (step S18). The word intelligibility prediction unit 16 outputs the predicted value of word intelligibility (step S19) and ends the processing.
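
The control flow of steps S11 to S19 can be sketched as below. The `recognize` and `accuracy` callables stand in for the voice recognition unit and the recognition rate calculation unit, and the linear prediction function with coefficients `a` and `b` follows the form used in the evaluation experiment; every name, stub, and number here is a hypothetical placeholder rather than the device's actual implementation.

```python
def predict_word_intelligibility(signals, reference_words, recognize, accuracy, a, b):
    """Follow steps S11-S19: recognize all signals, score them, map to a prediction."""
    recognized = []
    pending = list(signals)
    while pending:                              # S11: is there unprocessed data?
        signal = pending.pop(0)                 # S12: read the next voice signal
        recognized.append(recognize(signal))    # S13-S15: phoneme recognition
    # S16-S17: compare all recognized words with the correct-text words.
    p_acc = accuracy(reference_words, recognized)
    # S18-S19: convert the accuracy rate with the prediction function and return it.
    return a * p_acc + b

# Stubs for illustration only: "signals" are already phoneme lists, and the
# accuracy stub is the percentage of words recognized exactly.
signals = [["k", "a"], ["s", "a"]]
references = [["k", "a"], ["s", "o"]]
identity_recognizer = lambda signal: signal
exact_match_rate = lambda refs, hyps: 100.0 * sum(r == h for r, h in zip(refs, hyps)) / len(refs)
predicted = predict_word_intelligibility(signals, references, identity_recognizer,
                                         exact_match_rate, a=0.8, b=5.0)
```

Passing the recognizer and the scoring function as arguments keeps the loop structure of the flowchart visible without committing to any particular acoustic model or scoring formula.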

[Evaluation experiment]
FIG. 6 is a diagram illustrating an evaluation experiment of the word intelligibility prediction device 10 shown in FIG. 1. In the evaluation experiment, the CSJ (Corpus of Spontaneous Japanese) is used as the voice signal data set (training data). For details, see Sadaoki Furui, Kikuo Maekawa, and Hitoshi Isahara, “A Japanese national project on spontaneous speech corpus and processing technology”, In ASR2000 - Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop (ITRW), pp. 244-248, 2000, and Kikuo Maekawa, “Corpus of Spontaneous Japanese: Its design and evaluation”, In ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, 2003. Here, the phoneme language model 131 was trained using phoneme bigrams obtained from the CSJ corpus.

In the evaluation experiment, two kinds of signals are created as training data from these voice signals: signals with pink noise added at several levels, and signals obtained by applying speech enhancement to the pink-noise-added voice signals. Two speech enhancement methods are used. The first is SS (spectral subtraction); for details, see Michael Berouti, Richard Schwartz, and John Makhoul, “Enhancement of speech corrupted by acoustic noise”, In ICASSP '79, IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 4, pp. 208-211, IEEE, 1979. The second is WF (Wiener filter); for details, see Masakiyo Fujimoto, Shinji Watanabe, and Tomohiro Nakatani, “Noise suppression with unsupervised joint speaker adaptation and noise mixture model estimation”, In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4713-4716, IEEE, 2012.

The acoustic model 121 was trained on a mixture of the clean voice signals, the voice signals with pink noise added, and the speech-enhanced voice signals.

As the data set for evaluating word intelligibility (evaluation data), the familiarity-controlled word lists 2007 (FW07) are used. For details, see Shuichi Sakamoto, Naoki Iwaoka, Yoiti Suzuki, Shigeaki Amano, and Tadahisa Kondo, “Complementary relationship between familiarity and SNR in word intelligibility test”, Acoustical Science and Technology, Vol. 25, No. 4, pp. 290-292, 2004, and T. Kondo, S. Amano, S. Sakamoto, and Y. Suzuki, “Familiarity-controlled word lists 2007 (FW07)”, The Speech Resources Consortium, National Institute of Informatics, Japan, 2007.

This data set is divided by word familiarity, and only the lowest-familiarity words are used in order to limit the influence of word knowledge on the recognition rate. As with the CSJ, pink noise is added to the FW07 signals and speech enhancement is applied.

In this evaluation experiment, the word recognition rate from the subject experiment is used to calculate word intelligibility. The word intelligibility of the speech-enhanced voice signals is then predicted by the word intelligibility prediction unit 16. The word intelligibility prediction unit 16 uses the linear function shown in equation (2) to convert the phoneme recognition accuracy rate of the voice recognition unit 11 into word intelligibility.
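
Equation (2) is not reproduced in this text. Given that it is described as a linear function with coefficients a and b that maps the phoneme recognition accuracy rate to word intelligibility, it presumably has the following form; this is a reconstruction from the surrounding description, not a copy of the published equation.

```latex
SI_{\mathrm{sub}} = a \, P_{\mathrm{ASR}} + b
```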

Here, P_ASR is the phoneme recognition accuracy rate of the voice recognition unit 11, and SI_sub is the predicted value of word intelligibility. The coefficients a and b of the linear function are set by the least squares method from the phoneme recognition accuracy rate of the voice recognition unit 11 and the word intelligibility from the subject experiment, both obtained for the voice signals with pink noise added. Given pairs (P_ASR(i), SI_sub(i)), i = 1, 2, ..., n, of phoneme recognition accuracy rate and word intelligibility, the values of the coefficients a and b are estimated as in equations (3) and (4).
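
Equations (3) and (4) are not reproduced in this text. For a straight-line fit by least squares, the standard closed-form estimates are the covariance of the two quantities divided by the variance of P_ASR for the slope, and the mean of SI_sub minus the slope times the mean of P_ASR for the intercept. The sketch below assumes the published equations are equivalent to this standard form; the calibration pairs are made-up numbers, not data from the experiment.

```python
def fit_linear(p_asr, si_sub):
    """Ordinary least-squares fit of si_sub = a * p_asr + b."""
    n = len(p_asr)
    mean_x = sum(p_asr) / n
    mean_y = sum(si_sub) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(p_asr, si_sub))
    sxx = sum((x - mean_x) ** 2 for x in p_asr)
    a = sxy / sxx                 # slope
    b = mean_y - a * mean_x       # intercept
    return a, b

# Made-up calibration pairs, for illustration only.
p_asr = [20.0, 40.0, 60.0, 80.0]
si_sub = [15.0, 35.0, 55.0, 75.0]
a, b = fit_linear(p_asr, si_sub)
```

Because the made-up points lie exactly on a line, the fit recovers a slope of 1 and an intercept of -5; with real calibration data the points would scatter around the fitted line.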

When the coefficients a and b were estimated using data to which pink noise of 3 dB, 0 dB, -3 dB, and -6 dB was added, equations (5) and (6) were obtained.

Table 1 shows the mean squared error between the word intelligibility predicted by the word intelligibility prediction device 10 (the predicted objective word intelligibility) for voice signals processed with SS and WF as speech enhancement and the results of the subject experiment (the subjective word intelligibility). "ASR" denotes the result obtained with the word intelligibility prediction device 10. Details of the calculation of the conventional methods GEDI, STOI, and HASPI are described in Katsuhiko Yamamoto, Toshio Irino, Shoko Araki, Keisuke Kinoshita, and Tomohiro Nakatani, "GEDI: Gammachirp Envelope Distortion Index for Predicting Intelligibility of Enhanced Speech", arXiv preprint arXiv:1904.02096, 2019.

As shown in Table 1, the mean squared prediction error between the predicted objective word intelligibility and the subjective word intelligibility was smallest for ASR. That is, ASR had the highest prediction performance compared with the conventional GEDI, STOI, and HASPI.
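The comparison metric used here, the mean squared error between predicted and subjective intelligibility, can be computed as in the following sketch. The function name `mean_squared_error` and the sample values are hypothetical and serve only to show the calculation; they are not results from Table 1.

```python
def mean_squared_error(predicted, subjective):
    """Mean of squared differences between predicted and subjective scores."""
    if len(predicted) != len(subjective) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - s) ** 2 for p, s in zip(predicted, subjective)) / len(predicted)


# Illustrative values only (intelligibility expressed as a fraction).
predicted = [0.5, 0.7]
subjective = [0.4, 0.9]
mse = mean_squared_error(predicted, subjective)
```

A lower value means the objective prediction tracks the subject experiment more closely, which is the sense in which the methods are ranked in the text.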

[Effect of Embodiment]
In the present embodiment, voice recognition is performed on the voice signal to be predicted by using an acoustic model that outputs which phoneme each frame of the input voice signal is likely to correspond to, and the word intelligibility, which is a quality evaluation scale for voice signals, is predicted based on the voice recognition result. As shown in the evaluation experiment described above, the present embodiment can improve the prediction accuracy of word intelligibility compared with the conventional STOI and HASPI and the recently proposed GEDI.

Conventional automatic speech recognition devices use a word dictionary, a language model, and the like, so their recognition is easily affected by the surrounding context and by prior knowledge of words. To eliminate such effects, measures are needed such as using context-independent words in the evaluation test or unifying the familiarity of the words contained in the test utterances. Without such prior adjustment, the word intelligibility cannot be predicted accurately, and the accuracy of predicting the quality of the voice signal itself also decreases.

In contrast, in the present embodiment, the voice recognition unit 11 uses a phoneme language model 131 at the phoneme level, namely a phoneme N-gram, rather than information on the surrounding context or linguistic information such as a word dictionary. As a result, the voice recognition unit 11 can perform voice recognition without being affected by the surrounding context or prior knowledge of words, and the word intelligibility prediction unit 16 can likewise predict the quality of voice signals for a variety of texts without being influenced by linguistic information.
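To illustrate what a phoneme-level N-gram model does, the following is a minimal sketch of a phoneme bigram (N = 2) model that scores how plausible a phoneme arrangement is. The order N, the add-one smoothing, the class name `PhonemeBigramModel`, and the toy training sequences are all assumptions made for illustration; the embodiment does not specify them.

```python
import math
from collections import Counter


class PhonemeBigramModel:
    """Toy phoneme bigram model with add-one smoothing (an assumed choice)."""

    def __init__(self, sequences, bos="<s>", eos="</s>"):
        self.bos, self.eos = bos, eos
        self.bigrams = Counter()   # counts of (previous phoneme, current phoneme)
        self.contexts = Counter()  # counts of each previous phoneme
        vocab = {eos}
        for seq in sequences:
            padded = [bos, *seq, eos]
            vocab.update(seq)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1
        self.vocab_size = len(vocab)

    def log_prob(self, seq):
        """Log-probability of a phoneme sequence under the smoothed bigram model."""
        padded = [self.bos, *seq, self.eos]
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigrams[(prev, cur)] + 1
            denominator = self.contexts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


# Hypothetical training data: phoneme sequences only, with no word dictionary.
model = PhonemeBigramModel([["k", "a", "s", "a"], ["s", "a", "k", "a"]])
seen_score = model.log_prob(["k", "a"])
unseen_score = model.log_prob(["k", "k"])
```

Because the model is trained only on phoneme sequences, its scores reflect phoneme-to-phoneme transitions rather than word identity, which is the property the text relies on.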

That is, according to the present embodiment, it is possible to predict a word intelligibility that does not depend on the utterance content of the voice signal. In other words, according to the present embodiment, it is possible to predict a word intelligibility that does not depend on word familiarity. Therefore, even without measures such as preparing in advance a test word list with unified word familiarity, the results of subject experiments can be approximated more accurately than with conventional objective speech quality indices.

In the present embodiment, the case of predicting word intelligibility as an objective evaluation index of speech quality has been described as an example, but the present invention is not limited to this. When syllable intelligibility is used as the objective evaluation index of speech quality, or when a word recognition rate or a character recognition rate is used as the recognition rate of the speech recognizer, a predicted value can be calculated from the voice recognition result of the voice recognition unit 11 in the same way as in the present embodiment. Specifically, a configuration in which the word intelligibility in the present embodiment is replaced with syllable intelligibility may be adopted. Alternatively, a configuration in which the phoneme recognition accuracy rate in the present embodiment is replaced with a character recognition accuracy rate or a word recognition accuracy rate may be adopted. The character recognition accuracy rate is obtained from equation (1) above by taking C as the number of correct characters, S as the number of substituted characters, I as the number of inserted characters, and D as the number of deleted characters. The word recognition accuracy rate is obtained from equation (1) above by taking C as the number of correct words, S as the number of substituted words, I as the number of inserted words, and D as the number of deleted words. Furthermore, a configuration may be adopted in which the word intelligibility in the present embodiment is replaced with syllable intelligibility and the phoneme recognition accuracy rate is replaced with the character recognition accuracy rate, or in which the word intelligibility is replaced with syllable intelligibility and the phoneme recognition accuracy rate is replaced with the word recognition accuracy rate.
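The counts C, S, I, and D can be obtained in the same way for phonemes, characters, or words by aligning the recognized sequence with the reference sequence. The following is a minimal sketch that uses a standard edit-distance alignment, which is an assumed choice rather than something the embodiment specifies; the function name `edit_counts` is hypothetical. The accuracy rate itself is then computed from these counts according to equation (1), which is not reproduced here.

```python
def edit_counts(reference, hypothesis):
    """Count correct (C), substituted (S), inserted (I), and deleted (D) units.

    The units may be phonemes, characters, or words; both arguments are sequences.
    """
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = minimum number of edits turning reference[:i] into hypothesis[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace one optimal alignment back from the end and tally each operation.
    counts = {"C": 0, "S": 0, "I": 0, "D": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])):
            counts["C" if reference[i - 1] == hypothesis[j - 1] else "S"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts


# The same function handles phoneme lists, character strings, or word lists.
phoneme_counts = edit_counts(["k", "a", "s", "a"], ["k", "a", "z", "a"])
word_counts = edit_counts(["a", "b", "c"], ["a", "c"])
```

When several alignments have the same minimum cost, the split between substitutions, insertions, and deletions can differ; this sketch prefers the diagonal step, then deletion, then insertion.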

[System configuration, etc.]
Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware using wired logic.

Among the processes described in the present embodiment, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can also be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above document and drawings can be changed arbitrarily unless otherwise specified.

[Program]
FIG. 7 is a diagram showing an example of a computer that realizes the word intelligibility prediction device 10 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the word intelligibility prediction device 10 is implemented as the program module 1093, in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processes equivalent to the functional configuration of the word intelligibility prediction device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The setting data used in the processing of the above-described embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.

The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090. They may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 from that computer via the network interface 1070.

Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment are included in the scope of the present invention.

10 Word intelligibility prediction device
11 Voice recognition unit
12 Phoneme output unit
13 Phoneme arrangement output unit
14 Phoneme recognition unit
15 Recognition rate calculation unit
16 Word intelligibility prediction unit
121 Acoustic model
131 Phoneme arrangement language model

Claims

A prediction device comprising:
a voice recognition unit that performs voice recognition on a voice signal to be predicted, using an acoustic model that outputs which phoneme each frame of an input voice signal is likely to correspond to; and
a prediction unit that predicts word intelligibility, which is a quality evaluation scale for voice signals, based on a voice recognition result obtained by the voice recognition unit.

The prediction device according to claim 1, wherein the voice recognition unit includes:
a phoneme output unit that outputs, using the acoustic model, phoneme candidates corresponding to each frame of the voice signal to be predicted;
a phoneme arrangement output unit that outputs, using a phoneme language model that outputs the plausibility of a phoneme arrangement for input phoneme candidates, phoneme arrangement candidates corresponding to the phoneme candidates output by the phoneme output unit; and
a recognition unit that recognizes a phoneme sequence corresponding to the voice signal to be predicted, based on the phoneme candidates and the phoneme arrangement candidates.

The prediction device according to claim 2, wherein
the voice recognition unit further includes a calculation unit that calculates a phoneme recognition accuracy rate of the phoneme sequence recognized by the recognition unit, and
the prediction unit converts the phoneme recognition accuracy rate of the phoneme sequence calculated by the calculation unit into a predicted value of the word intelligibility by using a predetermined prediction function.

The prediction device according to any one of claims 1 to 3, wherein the acoustic model is a deep learning model.

A prediction method executed by a prediction device, the method comprising:
a step of performing voice recognition on a voice signal to be predicted, using an acoustic model that outputs which phoneme each frame of an input voice signal is likely to correspond to; and
a step of predicting word intelligibility, which is a quality evaluation scale for voice signals, based on a voice recognition result.

A prediction program that causes a computer to execute:
a step of performing voice recognition on a voice signal to be predicted, using an acoustic model that outputs which phoneme each frame of an input voice signal is likely to correspond to; and
a step of predicting word intelligibility, which is a quality evaluation scale for voice signals, based on a voice recognition result.