JP2008262120A - Utterance evaluation device and utterance evaluation program

Info

Publication number: JP2008262120A
Application number: JP2007106245A
Authority: JP
Inventors: Toru Imai; 亨今井; Shinichi Honma; 真一本間; Kazuho Onoe; 和穂尾上
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2007-04-13
Filing date: 2007-04-13
Publication date: 2008-10-30
Anticipated expiration: 2027-04-13
Also published as: JP5105943B2

Abstract

[Problem] To provide an utterance evaluation apparatus that evaluates the quality of an utterance of an arbitrary word string.
[Solution] The utterance evaluation apparatus 1 of the present invention comprises: storage means 3 that stores an evaluation word string list 3c, a language model, a pronunciation dictionary, and an acoustic model; large vocabulary continuous speech recognition means 8 that recognizes voice data, converts it into an utterance word string, and generates sound quality analysis results for the utterance word string; display means 15 that displays the analysis results; most similar evaluation word string detection means 11 that detects, from the evaluation word string list 3c, the evaluation word string having the highest word similarity rate with respect to the utterance word string; utterance evaluation means 12 that generates an evaluation result for the utterance word string; and display control means 17 that displays the evaluation word string having the highest word similarity rate and the evaluation result on the display means 15 only when the word similarity rate exceeds a first threshold.
[Selected drawing] FIG. 1

Description

The present invention relates to evaluating the quality of utterances, and in particular to an utterance evaluation apparatus and an utterance evaluation program that enable utterance training for announcers and others based on such evaluation, as well as entertainment in which utterance quality is enjoyed as a game.

Conventional utterance evaluation apparatuses and utterance training apparatuses that evaluate the quality of human utterances have the user utter specific word strings presented one after another by the apparatus, and evaluate the pronunciation, intonation, utterance speed, and so on of each utterance (see, for example, Patent Documents 1, 2, and 3). These apparatuses, however, store in advance information such as the voice and intonation of an announcer or other model speaker, and compare the model utterance with the user's utterance after the apparatus itself specifies the utterance content to the user. Consequently, they cannot evaluate the quality of an utterance of an arbitrary word string that the user wishes to utter.

For example, for announcement training in reading stock prices clearly and quickly, a user could not take an utterance evaluation apparatus that was not designed for that purpose, change the utterance content personally, and have it evaluated. Moreover, in an environment where an unspecified large number of users is expected, it is difficult for such an apparatus itself to readily specify utterance content suited to the various user levels (children, young people, older adults, the elderly, announcers with professional expertise in speaking, and so on), which makes it difficult to apply these apparatuses to entertainment devices in which utterance quality is enjoyed as a game.

Conventional entertainment devices in which utterance quality is enjoyed as a game accept, through speech recognition, only utterances of a few answer candidates determined in advance for each quiz; they do not accept an arbitrary word string that the user wishes to utter (see, for example, Patent Document 4).

Among conventional apparatuses that evaluate arbitrary utterance content, some evaluate only utterance speed, but they do not evaluate the quality of pronunciation (see, for example, Patent Documents 5, 6, and 7). In addition, they estimate utterance speed from local acoustic features alone and use no linguistic information at all, so their utterance speed measurement error is large, and they are poorly suited to uses such as an entertainment device that evaluates how well a tongue twister is spoken.

Patent Document 1: JP H11-143346 A
Patent Document 2: JP 2003-186379 A
Patent Document 3: JP 2006-337667 A
Patent Document 4: JP 2002-159741 A
Patent Document 5: JP H05-289691 A
Patent Document 6: JP H07-295588 A
Patent Document 7: JP 2005-331589 A

An object of the present invention is to provide an utterance evaluation apparatus and an utterance evaluation program that, for an arbitrary word string the user wishes to utter, evaluate the quality of the utterance and, when the word string obtained by speech recognition of directly or indirectly acquired voice data (that is, the utterance word string described later) is similar to a predetermined evaluation word string, present more accurate and more extensive evaluation content.

An utterance evaluation apparatus according to the present invention analyzes and evaluates voice data, and comprises: storage means that stores an evaluation word string list containing one or more predetermined evaluation word strings for evaluating voice data, together with a language model, a pronunciation dictionary, and an acoustic model; large vocabulary continuous speech recognition means that recognizes the voice data and converts it into an utterance word string on the basis of the language model, the pronunciation dictionary, and the acoustic model, and that generates sound quality analysis results for the utterance word string on the basis of the pronunciation dictionary and the acoustic model; display means that displays the analysis results; most similar evaluation word string detection means that compares the utterance word string word by word against the evaluation word string list and detects the evaluation word string with the largest number of matching words as the evaluation word string having the highest word similarity rate; utterance evaluation means that generates, for the utterance word string, an evaluation result including at least the highest word similarity rate and the utterance speed; and display control means that displays the evaluation word string having the highest word similarity rate and the evaluation result on the display means only when the highest word similarity rate exceeds a first threshold.
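The detection and display-control steps described in this paragraph can be illustrated with a minimal sketch. The function names, the position-by-position comparison, the normalization of the similarity rate by the longer word string, and the threshold value 0.5 are all assumptions made for illustration; the text here states only that words are compared one by one and that the candidate with the most matching words is selected.

```python
from typing import List, Optional, Sequence, Tuple

WordString = Sequence[str]


def count_matching_words(uttered: WordString, candidate: WordString) -> int:
    """Count positions at which both word strings hold the same word (assumed comparison)."""
    return sum(1 for u, c in zip(uttered, candidate) if u == c)


def word_similarity_rate(uttered: WordString, candidate: WordString) -> float:
    """Matching words divided by the longer length (assumed normalization)."""
    longest = max(len(uttered), len(candidate))
    return count_matching_words(uttered, candidate) / longest if longest else 0.0


def detect_most_similar(
    uttered: WordString, evaluation_list: List[WordString]
) -> Optional[Tuple[WordString, float]]:
    """Pick the evaluation word string with the most matching words."""
    if not evaluation_list:
        return None
    best = max(evaluation_list, key=lambda c: count_matching_words(uttered, c))
    return best, word_similarity_rate(uttered, best)


def select_for_display(
    uttered: WordString,
    evaluation_list: List[WordString],
    first_threshold: float = 0.5,  # hypothetical value
) -> Optional[Tuple[WordString, float]]:
    """Return the best candidate only when its rate exceeds the first threshold."""
    detected = detect_most_similar(uttered, evaluation_list)
    if detected is None or detected[1] <= first_threshold:
        return None
    return detected
```

When `select_for_display` returns `None`, a caller would show only the recognition and analysis results and withhold the evaluation results, mirroring the "only when" condition on the display control means.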

This allows the quality of an utterance of any word string the user wishes to utter to be checked from the recognized word string and from analysis results such as the speech waveform, voiceprint (spectrum), and intonation. Because evaluation results such as the word similarity rate and utterance speed are displayed on the display unit only after determining whether the utterance content is similar to a predetermined evaluation word string, more accurate and more extensive evaluation results can be provided.

The utterance evaluation apparatus according to the present invention further comprises language model weighting means that trains the language model using the evaluation word strings contained in the evaluation word string list, and the large vocabulary continuous speech recognition means converts the voice data into the utterance word string on the basis of the trained language model.

This makes it easy to change the evaluation word string list to the utterance content the user wants, so that users at every level, from children to announcers with professional expertise in speaking, can use the utterance evaluation apparatus as an utterance training apparatus.

The utterance evaluation apparatus according to the present invention further comprises data management means that stores in the storage means a correct word string matching the utterance word string for each utterance of the voice data, and the language model weighting means adds the correct word string to the evaluation word string list and trains the language model using the correct word string.

Thus, even when part of an arbitrary word string that the user wishes to utter is not contained in the language model prepared in advance or in the evaluation word string list, the quality of the utterance of that word string can be checked from the recognized word string and from analysis results such as the acoustic score, using the correct word string for each utterance (including the phonetic symbol string described later).

In the utterance evaluation apparatus according to the present invention, the utterance evaluation means further has means for calculating a predetermined acoustic score and deriving a pronunciation intelligibility as a weighted linear sum of the highest word similarity rate and the acoustic score, and the evaluation result further includes the pronunciation intelligibility.
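As a rough illustration of the weighted linear sum, the sketch below combines the two quantities with explicit weights. The weight values, the parameter names, and the assumption that both inputs are already normalized to the range 0 to 1 are hypothetical; this passage does not specify how the acoustic score is scaled or how the weights are chosen.

```python
def pronunciation_intelligibility(
    word_similarity_rate: float,
    acoustic_score: float,
    similarity_weight: float = 0.5,  # hypothetical weight
    acoustic_weight: float = 0.5,    # hypothetical weight
) -> float:
    """Weighted linear sum of the highest word similarity rate and the acoustic score.

    Both inputs are assumed to be normalized to [0, 1] beforehand.
    """
    return similarity_weight * word_similarity_rate + acoustic_weight * acoustic_score
```

Raising `similarity_weight` relative to `acoustic_weight` would make the result depend more on whether the intended words were recognized than on how closely the sound matched the acoustic model.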

This allows users to check the intelligibility of their own pronunciation as a numerical value.

In the utterance evaluation apparatus according to the present invention, the utterance evaluation means further has means for storing the evaluation results in the storage means as a history, and the apparatus further comprises new record achievement determination/notification means consisting of: threshold determination means that determines whether the highest word similarity rate exceeds a second threshold; highest value determination means that, when the word similarity rate is determined to exceed the second threshold, determines whether any of the evaluation results for the utterance word string is the highest value relative to the history; and means that, when a highest value is determined, announces by sound or video that a new record has been achieved.
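The two-stage determination (second threshold, then comparison with the history) can be sketched as follows. The metric names, the threshold value 0.8, the treatment of an empty history as a new record, and the choice to append every result to the history regardless of the threshold are assumptions for illustration only.

```python
from typing import Dict, List


def update_history_and_find_records(
    result: Dict[str, float],
    history: List[Dict[str, float]],
    second_threshold: float = 0.8,  # hypothetical value
) -> List[str]:
    """Return the metrics that set a new record, then store the result in the history."""
    new_records: List[str] = []
    # Stage 1: the utterance must be close enough to an evaluation word string.
    if result["word_similarity"] > second_threshold:
        # Stage 2: compare each metric with every earlier value of that metric.
        for metric, value in result.items():
            previous = [past[metric] for past in history if metric in past]
            if not previous or value > max(previous):
                new_records.append(metric)
    history.append(dict(result))
    return new_records
```

A caller would trigger the sound or video notification whenever the returned list is non-empty. Gating on the similarity rate first keeps a fast but misrecognized utterance from being counted as a speed record.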

This makes it possible to realize an entertainment device in which utterance quality is enjoyed as a game, for example by testing whether a tongue twister can be spoken quickly and accurately.

Furthermore, the utterance evaluation program of the present invention causes a computer having a storage unit and a display unit to execute the steps of: storing in the storage unit an evaluation word string list containing one or more predetermined evaluation word strings for evaluating voice data, together with a language model, a pronunciation dictionary, and an acoustic model; performing large vocabulary continuous speech recognition, in which the voice data is recognized and converted into an utterance word string on the basis of the language model, the pronunciation dictionary, and the acoustic model, and sound quality analysis results for the utterance word string are generated on the basis of the pronunciation dictionary and the acoustic model; displaying the analysis results on the display unit; comparing the utterance word string word by word against the evaluation word string list and detecting the evaluation word string with the largest number of matching words as the evaluation word string having the highest word similarity rate; generating, for the utterance word string, an evaluation result including at least the highest word similarity rate and the utterance speed; and displaying the evaluation word string having the highest word similarity rate and the evaluation result on the display unit only when the highest word similarity rate exceeds a first threshold.

According to the present invention, for an arbitrary word string that the user wishes to utter, the evaluation content is displayed on the display unit after determining whether the utterance word string of the directly or indirectly acquired voice data is similar to a predetermined evaluation word string, so the user can assess the quality of the utterance from more accurate and more extensive evaluation content. The present invention also makes it possible to provide an utterance evaluation apparatus or utterance evaluation program that serves for utterance training in which users themselves can flexibly change the utterance content, and that also functions as an entertainment apparatus or program in which users at various utterance levels, from children to professional announcers, can enjoy utterance quality as a game.

An utterance evaluation apparatus according to an embodiment of the present invention is described in detail below.

To aid understanding of the utterance evaluation apparatus of the embodiment, the following terms are used. A "character" is something expressed as a single character; a "word" is a single term made up of a combination of characters; and a "word string" is a unit that can be expressed as one segment made up of a combination of words. "Word" and "word string" are at a level that speech recognition can process; they are distinguished only for convenience of explanation, and no particular strictness is required. For example, in the word string "生麦生米生卵" (namamugi namagome namatamago; "raw wheat, raw rice, raw egg"), 生麦 (raw wheat), 生米 (raw rice), and 生卵 (raw egg) are "words", and 生 (raw), 麦 (wheat), 米 (rice), and 卵 (egg) are "characters". An "evaluation word string" is a predetermined word string used for comparison with a speech-recognized word string (the utterance word string described below). An "utterance word string" is the word string obtained when the large vocabulary continuous speech recognition means 8, described later, recognizes voice data. A "correct word string" is a word string whose content matches the utterance word string recognized by the large vocabulary continuous speech recognition means 8.

A "user" is a person who uses the utterance evaluation apparatus. "User voice data" is digital voice data of an arbitrary utterance section spoken by the user. "Reference voice data" is digital voice data of an arbitrary utterance section that serves as the reference against which the user's digital voice data is compared and evaluated; more specifically, it is digital voice data of an arbitrary utterance section spoken by an announcer with professional speaking ability.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 shows the functional configuration of the utterance evaluation apparatus according to the embodiment. The utterance evaluation apparatus 1 comprises a data input unit 2, a storage unit 3, a voice input unit 4, a control unit 16, and a display unit 15.

The storage unit 3 stores: voice data 3a, consisting of user voice data or reference voice data; the data of a language model, a pronunciation dictionary, and an acoustic model (collectively also referred to as speech recognition model data 3b); an evaluation word string list 3c containing one or more evaluation word strings; correct word string data 3d, consisting of arbitrary word strings; and history data 3e of utterance evaluation results.
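The contents of the storage unit 3 can be pictured as a simple record type. The sketch below is only one possible layout: the class names, field names, and field types are hypothetical, and the comments tie each field to the reference numerals used above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpeechRecognitionModelData:
    """Speech recognition model data 3b (field types are assumptions)."""
    language_model: Dict[str, Dict[str, float]] = field(default_factory=dict)
    pronunciation_dictionary: Dict[str, str] = field(default_factory=dict)
    acoustic_model: Optional[object] = None  # placeholder for a trained model


@dataclass
class StorageUnit:
    """Storage unit 3 of the utterance evaluation apparatus 1."""
    voice_data: List[bytes] = field(default_factory=list)                    # 3a
    model_data: SpeechRecognitionModelData = field(
        default_factory=SpeechRecognitionModelData)                          # 3b
    evaluation_word_strings: List[List[str]] = field(default_factory=list)   # 3c
    correct_word_strings: List[List[str]] = field(default_factory=list)      # 3d
    evaluation_history: List[Dict[str, float]] = field(default_factory=list) # 3e
```

Using `default_factory` gives each `StorageUnit` instance its own lists, so adding an evaluation word string to one instance does not affect another.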

The control unit 16 has data management means 5, utterance detection means 6, language model weighting processing means 7, large vocabulary continuous speech recognition means 8, analysis result display control means 9, recognition result word string display control means 10, most similar evaluation word string detection means 11, utterance evaluation means 12, new record achievement determination/notification means 13, reference voice comparison means 14, and display control means 17. As described later, the most similar evaluation word string detection means 11, the new record achievement determination/notification means 13, the reference voice comparison means 14, and the display control means 17 each have a function of sending their processing results to the display unit 15.

The display unit 15 displays the processing results of the analysis result display control means 9, the recognition result word string display control means 10, the most similar evaluation word string detection means 11, the new record achievement determination/notification means 13, the reference voice comparison means 14, and the display control means 17, namely, and respectively, the various analysis results 15a, the recognition result word string 15b, the most similar evaluation word string 15c, the new record achievement notification 15e, the reference voice comparison result 15f, and the various utterance evaluation results 15d.

A computer can suitably be used to function as the utterance evaluation apparatus 1. In such a computer, the control unit that operates the data management means 5, utterance detection means 6, language model weighting processing means 7, large vocabulary continuous speech recognition means 8, analysis result display control means 9, recognition result word string display control means 10, most similar evaluation word string detection means 11, utterance evaluation means 12, new record achievement determination/notification means 13, reference voice comparison means 14, and display control means 17 can be realized by a central processing unit (CPU) (not shown), and the storage unit 3 can be made up of at least one memory (not shown). The display unit 15 can be a display device such as a CRT or a liquid crystal display.

Furthermore, by having the CPU of such a computer execute a predetermined program, the functions (described later) of the data management means 5, utterance detection means 6, language model weighting processing means 7, large vocabulary continuous speech recognition means 8, analysis result display control means 9, recognition result word string display control means 10, most similar evaluation word string detection means 11, utterance evaluation means 12, new record achievement determination/notification means 13, reference voice comparison means 14, and display control means 17 can be realized. The program for realizing these functions can be stored in a predetermined area of the storage unit 3 (memory). The storage unit 3 can be made up of RAM or the like inside the computer, or of an external storage device (for example, a hard disk). The program can also form part of the software (stored in ROM or an external storage device) that runs on the OS of the computer serving as the utterance evaluation apparatus 1.

In addition, the program that causes the computer serving as the utterance evaluation apparatus 1 to function as each of the means constituting the present invention can be recorded on a computer-readable recording medium.

The data management means 5 can store in the storage unit 3 each item of data entered from the data input unit 2 at the user's request (voice data 3a, speech recognition model data 3b, evaluation word string list 3c, and correct word string data 3d), and can update, change, or train each item of data at the user's request. In the computer serving as the utterance evaluation apparatus 1, the data input unit 2 may take any form, such as a keyboard, a recording medium, or a microphone.

The utterance detection means 6 detects the start of an utterance in the voice entered from the voice input unit 4 at the user's request, and outputs the voice data 3a (user voice data or reference voice data) of the utterance portion up to the detected end of the utterance (also called the utterance section) to the large vocabulary continuous speech recognition means 8. The utterance detection means 6 may also detect utterance sections in voice data 3a (user voice data or reference voice data) stored in advance in the storage unit 3 and output them to the large vocabulary continuous speech recognition means 8. In the computer serving as the utterance evaluation apparatus 1, the voice input unit 4 may take any form, such as a voice recording medium or a microphone. The user voice data and the reference voice data need not be entered at the same time; the reference voice data in particular can be stored in the storage unit 3 as needed at the user's request. As described later, the reference voice data is used when a comparison of sound quality, such as voiceprint (spectrum) or intonation, is required, and it is not essential to the present invention. The user voice data or reference voice data may be voice data for a word string contained in the predetermined evaluation word string list 3c, or voice data for any other word string.

The large vocabulary continuous speech recognition means 8 recognizes (that is, converts to text) the continuous speech of the user voice data or reference voice data obtained from the utterance detection means 6 in near real time, using a language model 3b_1 trained in advance on a substantial amount of text from various genres, a pronunciation dictionary 3b_2 covering the tens of thousands to hundreds of thousands of words contained in the language model 3b_1, and an acoustic model 3b_3 trained on a substantial amount of speech from an unspecified large number of speakers, including professional announcers. The language model 3b_1, the pronunciation dictionary 3b_2, and the acoustic model 3b_3 are known to those skilled in the art; details are given later. The large vocabulary continuous speech recognition means 8 also generates, for the detected utterance word string, various analysis results relating to sound quality, such as the waveform of the voice data, the voiceprint (spectrum), and the intonation. It further calculates the utterance speed of the detected utterance word string.

The language model weighting processing means 7 has a function of training the language model 3b_1 using the evaluation word strings contained in the evaluation word string list 3c. The weighting process based on language model training is explained in detail later; in outline, the likelihood of connection between words or between characters, that is, a probability value, is determined by training, and connected characters or connected words are identified on the basis of the training result. If, during training, a word in an evaluation word string in the evaluation word string list 3c has no corresponding entry in the pronunciation dictionary 3b_2, a message to that effect can be shown on the display unit 15 to inform the user (not shown). The user can then add pronunciation data for the word to the pronunciation dictionary 3b_2 through the data input unit 2 as appropriate. Thanks to the training function of the language model weighting processing means 7, the large vocabulary continuous speech recognition means 8 can operate at a particularly high recognition rate when recognizing voice data of word strings contained in the evaluation word string list 3c.
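One way to picture this weighting is a word bigram model whose counts are boosted for the evaluation word strings. The sketch below is an illustrative stand-in and not the training procedure of the embodiment, which is described elsewhere in the specification: the `BigramModel` class, the `weight` parameter, the sentence boundary markers, and the helper that lists words without a pronunciation entry are all hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


class BigramModel:
    """Toy word-connection model: P(following | previous) from weighted counts."""

    def __init__(self) -> None:
        self.counts: Dict[str, Counter] = defaultdict(Counter)

    def learn(self, words: Sequence[str], weight: float = 1.0) -> None:
        """Add one word string; a larger weight raises its connection probabilities."""
        padded = ["<s>", *words, "</s>"]
        for previous, following in zip(padded, padded[1:]):
            self.counts[previous][following] += weight

    def probability(self, previous: str, following: str) -> float:
        followers = self.counts.get(previous)
        if not followers:
            return 0.0
        return followers[following] / sum(followers.values())


def words_missing_pronunciation(
    evaluation_list: List[Sequence[str]], pronunciation_dictionary: Dict[str, str]
) -> List[str]:
    """List evaluation words with no dictionary entry, so the user can be informed."""
    missing: List[str] = []
    for word_string in evaluation_list:
        for word in word_string:
            if word not in pronunciation_dictionary and word not in missing:
                missing.append(word)
    return missing
```

Calling `learn` on each evaluation word string with a weight above 1 makes the recognizer's language model favor those word sequences, which is the effect the paragraph attributes to the weighting process.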

At the user's request, the data management means 5 can store a correct word string 3d corresponding to each utterance in the storage unit 3. In this case, the language model weighting processing means 7 adds the correct word string 3d to the evaluation word string list 3c and, as described later, can perform weighting so that the correct word string is assigned a probability value larger than that of any other word string in the evaluation word string list 3c. As a result, even when part of an arbitrary word string that the user wishes to utter is not contained in the language model 3b_1 prepared in advance or in the evaluation word string list 3c, the large vocabulary continuous speech recognition means 8 can operate at a particularly high recognition rate by using the per-utterance correct word string 3d added to the evaluation word string list 3c.

The analysis result display control means 9 displays on the display unit 15, in near real time, the analysis results 15a for the utterance word string detected and analyzed by the large vocabulary continuous speech recognition means 8, such as the waveform, voiceprint (spectrum), and intonation of the speech data. For user voice data, the analysis of the user's utterance is displayed; for reference voice data, the analysis of the model voice is displayed, so the user can visually observe the differences between the two. As described later, the reference voice comparison means 14 can also display the analysis or evaluation results of the user voice data and the reference voice data side by side on the display unit 15.

The recognition result word string display control means 10 displays the utterance word string detected by the large vocabulary continuous speech recognition means 8, that is, the recognition result word string 15b, on the display unit 15 in near real time. For user voice data, the user's utterance word string is displayed, so the user can see how it differs from the evaluation word strings in the evaluation word string list 3c.

Preferably, the data management means 5 trains the acoustic model 3b_3 on the reference voice data in advance. This makes the word string output by the large vocabulary continuous speech recognition means 8 more reliable, so that by viewing the recognition result word string 15b on the display unit 15 the user can check whether the utterance was clear enough to be recognized correctly or was too poorly articulated to be recognized.

Because the vocabulary that the large vocabulary continuous speech recognition means 8 can recognize is finite, uttering an unregistered word causes a recognition error. As described above, however, weighting the language model 3b_1 with the correct word string 3d for each utterance improves the reliability of the recognized word string.

The most similar evaluation word string detection means 11 receives the utterance word string recognized by the large vocabulary continuous speech recognition means 8 (that is, the recognition result word string 15b) and compares it, word by word or character by character, with each evaluation word string in the evaluation word string list 3c. The evaluation word string with the largest number of matching words or characters is identified as the one having the highest word similarity (hereinafter also called the highest word similarity). If this highest word similarity exceeds a predetermined threshold α, the means 11 judges that the corresponding evaluation word string (hereinafter also called the most similar evaluation word string) was uttered and outputs this most similar evaluation word string 15c to the display unit 15. Because the voice data 3a does not necessarily correspond to an evaluation word string in the list 3c, the most similar evaluation word string 15c is not output when the highest word similarity is below the threshold α.

The utterance evaluation means 12 receives the word similarity of the utterance word string (that is, the highest word similarity described above) from the most similar evaluation word string detection means 11, and the speech rate of the corresponding utterance word string from the large vocabulary continuous speech recognition means 8. It also calculates the acoustic score and pronunciation intelligibility described below, generates an evaluation result consisting of at least one of the word similarity, speech rate, acoustic score, and pronunciation intelligibility, and sends the evaluation result to the display control means 17.

The acoustic score is obtained, for example, by the following equation.

Here, x_t is the acoustic feature of frame t of the input speech, μ_t is the mean of the previously trained acoustic model 3b_3 for frame t, and σ_t is the standard deviation expressing the spread. The acoustic score is 100 when x_t equals μ_t and approaches 0 as the difference between x_t and μ_t grows. For example, if the frame length is 10 ms, an acoustic score is calculated for every 10 ms frame, so a one-second utterance yields 100 acoustic scores. The pronunciation intelligibility is obtained, for example, by the following equation.

Pronunciation intelligibility = k × word similarity + (1 − k) × average acoustic score per frame / best acoustic score among all frames

Here, k is a weighting coefficient between 0 and 1 inclusive, determined experimentally in advance. The best acoustic score is the one closest to 100 among the multiple acoustic scores.
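The acoustic score equation itself does not survive in this text, so the following is only a minimal sketch under stated assumptions. It assumes a Gaussian-shaped score, which is one form that satisfies the two stated properties (100 when x_t equals μ_t, approaching 0 as they diverge), and it treats x_t as a scalar although real acoustic features are vectors. It also assumes the average-to-best ratio is rescaled to 0-100 so that it is commensurate with a word similarity given in percent; the formula above does not state this scaling. The function names are hypothetical.

```python
import math


def frame_acoustic_score(x, mu, sigma):
    """Assumed form (not from the source): 100 when x == mu, tending to 0 as |x - mu| grows."""
    return 100.0 * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def pronunciation_intelligibility(k, word_similarity, frame_scores):
    """k * word similarity + (1 - k) * (average frame score / best frame score).

    word_similarity is in percent. The ratio is multiplied by 100 here,
    which is an assumption about units rather than something the text states.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie between 0 and 1")
    mean_score = sum(frame_scores) / len(frame_scores)
    # Scores never exceed 100, so the maximum is the one closest to 100.
    best_score = max(frame_scores)
    return k * word_similarity + (1.0 - k) * 100.0 * mean_score / best_score


scores = [frame_acoustic_score(x, mu=1.0, sigma=0.5) for x in (1.0, 1.2, 0.7)]
print(pronunciation_intelligibility(0.6, 83.3, scores))
```

With k = 1 the result reduces to the word similarity alone, and with k = 0 it depends only on how uniform the per-frame scores are.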

Only when the most similar evaluation word string detection means 11 reports that the word similarity of the utterance word string (that is, the highest word similarity described above) has been judged to exceed the predetermined threshold α does the display control means 17 output utterance evaluation results such as the speech rate, acoustic score, and pronunciation intelligibility to the display unit 15.

The new record achievement determination/notification means 13 comprises: threshold determination means that determines whether the highest word similarity exceeds a second threshold; highest value determination means that, when the word similarity is judged to exceed the second threshold, determines whether any of the evaluation results for the utterance word string is the highest value relative to the history 3e; and means that, when a highest value is found, announces the new record by sound or video. In other words, the means 13 receives the highest word similarity from the most similar evaluation word string detection means 11 (that is, from the evaluation result of the utterance evaluation means 12). When it judges that this value exceeds a predetermined threshold β (β > α), it checks whether any evaluation result of the utterance word string (for example, speech rate, word similarity, or pronunciation intelligibility) is the highest value relative to the history 3e. If so, it announces the new record, for example by playing a fanfare or by showing on the display unit 15 a video of a kusudama (a decorative ball) splitting open, and records or updates the evaluation result for the new record in the storage unit 3 (for example, in the history 3e).

Next, the detailed processing procedure of the utterance evaluation device is described.

FIG. 2 shows the processing procedure of the utterance evaluation device according to an embodiment of the present invention. The processing flow is described below using concrete examples.

In step S1, the utterance detection means 6 waits for voice data input. The data management means 5 stores in advance in the storage unit 3 the voice data 3a consisting of user voice data or reference voice data, the speech recognition model data 3b (language model, pronunciation dictionary, and acoustic model), the evaluation word string list 3c containing one or more evaluation word strings, and the correct word string data 3d consisting of arbitrary word strings. As described above, the storage unit 3 also holds the history 3e of utterance analysis results and utterance evaluation results; the history 3e is updated as a function of the analysis result display control means 9 and the utterance evaluation means 12, and can be modified through the data management means 5.

In step S2, the utterance detection means 6 detects the start and end of the utterance in the user voice data or reference voice data input from the voice input unit 4, identifies the utterance portion (that is, the utterance section), and sends that portion to the large vocabulary continuous speech recognition means 8. Utterance detection relies on the difference between the acoustic features of the human voice and those of other sounds to extract only the human voice portion of the input audio as the utterance portion.
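The text does not specify which acoustic features step S2 uses. As a rough illustration only, the sketch below substitutes a simple frame-energy threshold for whatever voice/non-voice discrimination the device actually performs; the function name, frame length, and threshold are all hypothetical.

```python
def detect_utterance(samples, frame_len=160, energy_threshold=0.01):
    """Return (start, end) sample indices of the utterance, or None if no frame is loud enough.

    A frame counts as speech when its mean squared amplitude exceeds the
    threshold. This is a stand-in for the unspecified detector in the text.
    """
    active_starts = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        if energy > energy_threshold:
            active_starts.append(i)
    if not active_starts:
        return None
    # The utterance runs from the first active frame to the end of the last one.
    return active_starts[0], active_starts[-1] + frame_len


signal = [0.0] * 320 + [0.5] * 480 + [0.0] * 160
print(detect_utterance(signal))
```

An energy threshold cannot tell speech from loud non-speech noise, which is exactly the distinction the text describes, so a practical detector would need richer features.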

In step S3, the large vocabulary continuous speech recognition means 8 performs large vocabulary continuous speech recognition on the speech data of the utterance portion. Specifically, the means 8 recognizes continuous speech in real time using the language model 3b_1, trained in advance on a large body of text from various genres; the pronunciation dictionary 3b_2, covering the tens of thousands to hundreds of thousands of words contained in the language model 3b_1; and the acoustic model 3b_3, trained in advance on a large amount of speech from many unspecified speakers, including professional speakers such as announcers. To determine which words' phonetic symbols in the pronunciation dictionary 3b_2 the speech data resembles, the means 8 computes an acoustic score with the acoustic model 3b_3 corresponding to each phonetic symbol, computes with the language model a language score expressing how readily words connect to one another, and outputs the word string that maximizes the sum of the acoustic score and the language score as the recognition result (recognition result word string 15b).
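The selection rule at the end of step S3 can be sketched as follows. This is a simplification: a real recognizer searches over a very large hypothesis space rather than a short enumerated list, and the candidate word strings and score values below are invented for illustration.

```python
def best_hypothesis(hypotheses):
    """Pick the word string whose acoustic score plus language score is largest.

    hypotheses: list of (word_list, acoustic_score, language_score) tuples.
    """
    words, _, _ = max(hypotheses, key=lambda h: h[1] + h[2])
    return words


# Hypothetical log-domain scores for two competing word strings.
candidates = [
    (["生麦", "生米", "生卵"], -120.0, -8.0),
    (["生麦", "生ゴミ", "生卵"], -118.5, -14.0),
]
print(best_hypothesis(candidates))
```

Here the second candidate fits the audio slightly better, but the first wins because its language score is higher, which is the effect the language model weighting is meant to produce for listed word strings.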

The large vocabulary continuous speech recognition means 8 and the language model weighting processing means 7 are now described in more detail. As shown in the processing samples of FIGS. 3A, 3B, and 3C, when the device is used to evaluate tongue twisters, for example, the evaluation word string list 3c is a list of multiple word strings including tongue twisters such as "生麦生米生卵" (namamugi namagome namatamago, "raw wheat, raw rice, raw egg") and word strings that even announcers find hard to say, such as "バスガス爆発" (basu gasu bakuhatsu, "bus gas explosion").

The language model weighting processing means 7 raises by a fixed factor, for example, the probability that the word "生米" (namagome) follows the word "生麦" (namamugi), thereby weighting the language model 3b_1 used by the large vocabulary continuous speech recognition means 8. If, for example, the word "生麦" is not registered in the pronunciation dictionary 3b_2, it also adds the phonetic symbol string /namamugi/ to the dictionary. As a result, the recognizer operates at a particularly high recognition rate when the content of the voice data 3a matches or resembles a word string in the evaluation word string list 3c (for example, when at least a predetermined proportion of characters match).

If the content of the voice data 3a has not been sufficiently learned by the language model 3b_1, or is not included in the evaluation word string list 3c, the recognition rate would fall if nothing were done. Therefore, when a correct word string 3d for an utterance is supplied through the data input unit 2, the data management means 5 first stores it in a predetermined area of the storage unit 3. The language model weighting processing means 7 then adds the correct word string 3d to the evaluation word string list 3c in the storage unit 3, and weights the language model 3b_1 used by the large vocabulary continuous speech recognition means 8 so that the correct word string is assigned a larger probability value than any other word string in the list 3c. Thus, even when part of an arbitrary word string the user wishes to utter is not included in the previously built language model 3b_1 or the evaluation word string list 3c, the means 8 operates at a particularly high recognition rate by using the per-utterance correct word string 3d and its phonetic symbol strings.

For example, suppose the content of the user voice data is "だるまさんが転んだ" (daruma-san ga koronda, "the Daruma doll fell over"), and the language model weighting processing means 7 adds it to the evaluation word string list 3c as the correct word string 3d. The means 7 weights the language model 3b_1 by, for example, multiplying by a constant, or adding a constant to, the frequency with which the word "さん" (san) follows the word "だるま" (daruma), so that the corresponding connection probability increases. If the word "だるま" is not in the pronunciation dictionary 3b_2, the means 7 adds the phonetic symbol string /daruma/ before the large vocabulary continuous speech recognition means 8 is run.
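The frequency adjustment described above can be illustrated with a toy bigram count table. This is a minimal sketch, not the device's implementation: the count values, the segmentation of the phrase into words, and the names `boost_bigrams`, `factor`, and `bonus` are all assumptions. Setting `factor=1.0` gives the add-a-constant variant and `bonus=0.0` gives the multiply-by-a-constant variant.

```python
from collections import Counter


def boost_bigrams(bigram_counts, words, factor=10.0, bonus=1.0):
    """Return a copy of the counts with every adjacent word pair in `words` boosted."""
    boosted = Counter(bigram_counts)
    for prev, nxt in zip(words, words[1:]):
        # Counter returns 0 for unseen pairs, so the bonus makes them non-zero.
        boosted[(prev, nxt)] = boosted[(prev, nxt)] * factor + bonus
    return boosted


def bigram_probability(bigram_counts, prev, nxt):
    """Relative frequency of `nxt` among all words observed after `prev`."""
    total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    return bigram_counts.get((prev, nxt), 0) / total if total else 0.0


# Hypothetical counts: "さん" follows "だるま" in 2 of 10 observations.
counts = {("だるま", "さん"): 2, ("だるま", "落とし"): 8}
phrase = ["だるま", "さん", "が", "転ん", "だ"]  # assumed word segmentation
boosted = boost_bigrams(counts, phrase)
print(bigram_probability(counts, "だるま", "さん"))
print(bigram_probability(boosted, "だるま", "さん"))
```

Boosting raw counts and renormalizing keeps the probabilities for each preceding word summing to one, whereas multiplying a probability directly would not.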

In step S4, the analysis result display control means 9 displays the analysis results 15a, such as the waveform, voiceprint (spectrum), and intonation of the input speech obtained during speech recognition, on the display unit 15 in near real time. FIG. 4 shows an example of the monitor screen of the display unit 15. For user voice data, the analysis of the user's utterance is displayed; for reference voice data (for example, prerecorded speech by a professional speaker), the analysis of the model voice can be displayed, so the user can visually observe the differences between the two.

In step S5, the recognition result word string display control means 10 displays the utterance word string recognized by the large vocabulary continuous speech recognition means 8 (that is, the recognition result word string 15b) on the display unit 15. The user can thereby see how the user voice data or reference voice data was recognized.

In step S6, the most similar evaluation word string detection means 11 compares the recognition result word string 15b with each word string in the evaluation word string list 3c and detects the evaluation word string with the highest word similarity. Suppose the input utterance "生麦生米生卵" is recognized correctly as "生麦生米生卵" (FIG. 3A), recognized with an error as "生麦生ゴミ生卵" (FIG. 3B), or recognized as something like "生無理生ゴミ七田孫" (FIG. 3C). In each case, the means 11 compares the recognition result word string 15b in turn with the entries of the evaluation word string list 3c, such as "生麦生米生卵", "貴社の記者が汽車で帰社した", and "赤巻紙青巻紙黄巻紙", and calculates a word similarity for each. The similarity between two word strings can be calculated efficiently with commonly used dynamic programming. For example, when all characters of the recognition result are correct, as in FIG. 3A, the word similarity is 100%; when five of six characters are correct, as in FIG. 3B, it is 83.3%; and when two of six characters are correct but one extra character has been inserted, as in FIG. 3C, it is 16.7%.
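The three percentages quoted above are consistent with scoring (matches minus insertions) divided by the reference length, which equals 1 minus the edit distance over the reference length. The sketch below assumes that definition; the text only says dynamic programming is used, so the exact formula, the clamp at zero, and the token segmentation of each string (the figures are not available here) are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # reference token missing from hypothesis
                curr[j - 1] + 1,         # extra token inserted in hypothesis
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def word_similarity(ref, hyp):
    """Similarity in percent; clamped at 0 (an assumption) for very poor matches."""
    if not ref:
        return 0.0
    return max(0.0, 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref)))


ref = ["生", "麦", "生", "米", "生", "卵"]  # assumed segmentation into six tokens
print(round(word_similarity(ref, ["生", "麦", "生", "米", "生", "卵"]), 1))
print(round(word_similarity(ref, ["生", "麦", "生", "ゴミ", "生", "卵"]), 1))
print(round(word_similarity(ref, ["生", "無理", "生", "ゴミ", "七", "田", "孫"]), 1))
```

Under this segmentation the third hypothesis has two matches, four substitutions, and one insertion, giving an edit distance of 5 against a reference of length 6.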

In step S7, if the most similar evaluation word string detection means 11 judges that the detected highest word similarity is greater than the predetermined threshold α (Y at step S7), then in step S8 it regards one of the word strings in the evaluation word string list 3c as having been uttered and displays the most similar evaluation word string 15c on the display unit 15. Otherwise (N at step S7), the voice data 3a (that is, the recognition result word string 15b) is regarded as arbitrary speech that is neither in the evaluation word string list 3c nor a correct word string 3d, and the most similar evaluation word string 15c is not displayed. More concretely, if α is set to 30%, for example, a highest word similarity below 30% is judged to differ too much from the evaluation word string list 3c or the correct word string 3d; the utterance is treated as an arbitrary word string, and no most similar evaluation word string is deemed to exist in the list 3c. With α at 30%, Examples 1 and 2 of the processing samples in FIGS. 3A and 3B display the most similar evaluation word string 15c, whereas Example 3 in FIG. 3C does not.
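The decision in steps S7 and S8 amounts to taking the best-scoring list entry and suppressing it unless it clears α. A minimal sketch, assuming the per-entry similarities have already been computed; the function name and the similarity values given to the non-matching entries are hypothetical.

```python
def most_similar(similarities, alpha=30.0):
    """Return (evaluation word string, similarity) for the best entry, or None.

    similarities maps each evaluation word string to its word similarity in
    percent. None means the utterance is treated as an arbitrary word string.
    """
    if not similarities:
        return None
    best = max(similarities, key=similarities.get)
    if similarities[best] > alpha:
        return best, similarities[best]
    return None


# Example 2 (FIG. 3B): one entry scores 83.3%, so it is displayed.
print(most_similar({"生麦生米生卵": 83.3, "赤巻紙青巻紙黄巻紙": 0.0}))
# Example 3 (FIG. 3C): the best entry scores only 16.7%, so nothing is displayed.
print(most_similar({"生麦生米生卵": 16.7, "赤巻紙青巻紙黄巻紙": 0.0}))
```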

In step S9, the utterance evaluation means 12 displays on the display unit 15 the utterance evaluation results for the recognition result word string 15b, such as the word similarity, speech rate, and acoustic score. The speech rate is obtained by dividing the number of phonemes in the most similar evaluation word string 15c by the utterance duration. For example, as in Example 1 of the processing samples in FIG. 3A, when the utterance "生麦生米生卵" in the voice data 3a is recognized as "生麦生米生卵", the display shows a word similarity of 100%, an utterance duration of 1.5 seconds, an acoustic score of 97 points, and so on.
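The speech rate calculation is a single division, sketched below. The phoneme sequence is an assumption: the text gives only /namamugi/, and the remaining phonetic strings are romanizations supplied here, with each letter counted as one phoneme.

```python
def speech_rate(phonemes, duration_s):
    """Phonemes per second: phoneme count divided by utterance duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len(phonemes) / duration_s


# Assumed phoneme strings for the three words of the tongue twister.
phonemes = list("namamugi") + list("namagome") + list("namatamago")
print(len(phonemes), speech_rate(phonemes, 1.5))
```

Counting phonemes from the matched evaluation word string rather than from the recognition result keeps the rate comparable across attempts at the same phrase, even when recognition errors differ.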

Similarly, as in Example 2 of the processing samples in FIG. 3B, when the recognition result is "生麦生ゴミ生卵", the display shows a word similarity of 83.3%, an utterance duration of 1.7 seconds, an acoustic score of 89 points, and so on. The acoustic score can be calculated from the similarity between the voice data 3a (user voice data or reference voice data) in the large vocabulary continuous speech recognition means 8 and the acoustic model 3b_3. According to Equation (1), with a weighting coefficient k = 0.6, for example, a word similarity of 83.3% and an acoustic score of 89 points give a pronunciation intelligibility of 85.6 points.
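The quoted figure can be checked by direct substitution. The check below assumes that the acoustic term enters as the 89-point score itself, since the text does not give the best per-frame score for this example.

```latex
\[
0.6 \times 83.3 + (1 - 0.6) \times 89 = 49.98 + 35.6 = 85.58 \approx 85.6
\]
```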

In step S10, the reference voice comparison means 14 displays on the display unit 15 a reference voice comparison result 15f that contrasts the user voice data with the reference voice data in terms of the analysis results from the analysis result display control means 9 and/or the evaluation results from the utterance evaluation means 12. The user can thereby visually compare the differences between the two sets of analysis and/or evaluation results.

In step S11, the new record achievement determination/notification means 13 determines whether the highest word similarity from step S7 is equal to or greater than a predetermined threshold β (β > α) and whether any evaluation result of the corresponding evaluation word string (for example, speech rate, word similarity, or pronunciation intelligibility) is the highest value relative to the recorded data stored in the storage unit 3 (for example, the history 3e).

In step S12, if any of the evaluation results (for example, speech rate, word similarity, or pronunciation intelligibility) is the highest value relative to the history 3e (Y at step S11), the new record is announced by sound or video. The new record achievement determination/notification means 13 may define in advance which evaluation results are eligible for records, and may store and update a separate record achievement history list in a predetermined area of the storage unit 3. For example, with β set to 100%, if the utterance "生麦生米生卵" is recognized as "生麦生米生卵" with a word similarity of 100%, and it was spoken in less time than the shortest utterance duration recorded so far for "生麦生米生卵" in the evaluation word string list 3c, the device treats this as a difficult phrase spoken quickly and accurately, announces the new record by video or sound, and updates the record achievement history list (or the history 3e) in the predetermined area of the storage unit 3 with the new record.
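The shortest-duration example above can be sketched as a check against a per-phrase history. This is an illustration under assumptions: the history is modelled as a plain dictionary of best durations, a first attempt at a phrase is counted as a record, and the comparison with β uses "equal to or greater than" so that β = 100% is attainable, as in the example.

```python
def check_new_record(phrase, similarity, duration_s, best_times, beta=100.0):
    """Return True and update best_times if this attempt sets a new shortest duration.

    best_times maps each evaluation word string to its shortest duration so far
    (a hypothetical stand-in for the history 3e).
    """
    if similarity < beta:
        return False  # not accurate enough to qualify for a record
    previous = best_times.get(phrase)
    if previous is not None and duration_s >= previous:
        return False  # accurate, but not faster than the stored record
    best_times[phrase] = duration_s
    return True


history = {"生麦生米生卵": 1.5}
print(check_new_record("生麦生米生卵", 100.0, 1.4, history), history)
```

The same pattern extends to other evaluation results, such as pronunciation intelligibility, by keeping one best value per phrase and per metric and reversing the comparison where larger is better.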

When step S12 finishes, or when none of the evaluation results (for example, speech rate, word similarity, or pronunciation intelligibility) in step S11 is the highest value relative to the history 3e (N at step S11), processing returns to waiting for voice data input in step S1 and the steps above are repeated. The utterance detection means 6 may be provided with a stop function so that the user can end the utterance evaluation when desired.

Thus, according to this embodiment, for any word string the user wishes to utter, when the utterance word string of directly or indirectly acquired speech data resembles a predetermined evaluation word string, the device can present more evaluation items with higher accuracy and evaluate the quality of the utterance. This realizes an utterance evaluation device that functions as an utterance training device whose utterance content users can flexibly change themselves, or as an entertainment device that lets users at any speaking level, from children to professional announcers, enjoy judging the quality of their utterances as a game.

The embodiment above has been described as a representative example, but it will be apparent to those skilled in the art that many modifications and substitutions can be made within the spirit and scope of the invention. For example, some of the steps described above may be omitted depending on the application, or the steps may be carried out in a different order. Furthermore, while using a microphone for the data input unit allows speech to be evaluated in near real time as in the embodiment, speech data recorded in advance on a recording medium can also be evaluated. The utterance evaluation device described above can also function as an utterance training device whose utterance content users can flexibly change themselves, or as an entertainment device that lets users at any speaking level, from children to professional announcers, enjoy judging the quality of their utterances as a game. Accordingly, the invention should not be construed as limited by the embodiment above, but only by the claims.

The utterance evaluation device according to the present invention is useful for utterance testing, utterance training, and entertainment in which users enjoy speaking.

FIG. 1 shows the functional configuration of the utterance evaluation device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the processing procedure of the utterance evaluation device according to the embodiment.
FIG. 3A shows a processing sample of the utterance evaluation device according to the embodiment.
FIG. 3B shows a processing sample of the utterance evaluation device according to the embodiment.
FIG. 3C shows a processing sample of the utterance evaluation device according to the embodiment.
FIG. 4 shows a display example of the monitor screen of the display unit of the utterance evaluation device according to the embodiment.

Explanation of symbols

1 Utterance evaluation apparatus
2 Data input unit
3 Storage unit
4 Voice input unit
5 Data management means
6 Utterance detection means
7 Language model weighting processing means
8 Large-vocabulary continuous speech recognition means
9 Analysis result display control means
10 Recognition result word string display control means
11 Most-similar evaluation word string detection means
12 Utterance evaluation means
13 New record achievement determination/notification means
14 Reference voice comparison means
15 Display unit
16 Control unit
17 Display control means

Claims

An utterance evaluation apparatus that analyzes and evaluates speech data, comprising:
storage means for storing an evaluation word string list including one or more predetermined evaluation word strings for evaluating speech data, a language model, a pronunciation dictionary, and an acoustic model;
large-vocabulary continuous speech recognition means for recognizing speech data and converting it into an utterance word string based on the language model, the pronunciation dictionary, and the acoustic model, and for generating a sound quality analysis result of the utterance word string based on the pronunciation dictionary and the acoustic model;
display means for displaying the analysis result;
most-similar evaluation word string detection means for detecting, from the evaluation word string list, the evaluation word string having the largest number of words that match the utterance word string, word by word, as the evaluation word string having the highest word similarity;
utterance evaluation means for generating, for the utterance word string, an evaluation result including at least the highest word similarity and an utterance speed; and
display control means for displaying the evaluation word string having the highest word similarity and the evaluation result on the display means only when the highest word similarity exceeds a first threshold.
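
As an illustration only, the most-similar evaluation word string detection and the first-threshold display gating recited in the claim above might be sketched as follows. The position-by-position comparison, the normalization by the longer word string, the function names, and the default threshold of 0.5 are all assumptions made for this sketch; the claim itself only requires counting matching words and comparing the highest word similarity against a first threshold. Utterance speed is omitted here.

```python
from typing import Dict, List, Optional, Tuple


def word_similarity(utterance: List[str], candidate: List[str]) -> float:
    """Share of positions at which both word strings carry the same word.

    Assumption: words are compared position by position and the match count
    is divided by the length of the longer word string.
    """
    longest = max(len(utterance), len(candidate))
    if longest == 0:
        return 0.0
    matches = sum(1 for u, c in zip(utterance, candidate) if u == c)
    return matches / longest


def most_similar(
    utterance: List[str], evaluation_list: List[List[str]]
) -> Tuple[Optional[List[str]], float]:
    """Return the evaluation word string with the highest word similarity."""
    best: Optional[List[str]] = None
    best_sim = -1.0
    for candidate in evaluation_list:
        sim = word_similarity(utterance, candidate)
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best, max(best_sim, 0.0)


def result_to_display(
    utterance: List[str],
    evaluation_list: List[List[str]],
    first_threshold: float = 0.5,  # hypothetical value
) -> Optional[Dict[str, object]]:
    """Return what would be displayed, or None when the gate stays closed."""
    best, sim = most_similar(utterance, evaluation_list)
    if best is None or sim <= first_threshold:
        return None  # similarity does not exceed the first threshold
    return {"evaluation_word_string": best, "highest_word_similarity": sim}
```

With this sketch, an utterance that matches four of the six words of an evaluation word string scores about 0.67, so it would be displayed under a first threshold of 0.5 and suppressed under a first threshold of 0.7.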

The utterance evaluation apparatus according to claim 1, further comprising language model weighting means for training the language model using the evaluation word strings included in the evaluation word string list,
wherein the large-vocabulary continuous speech recognition means converts speech data into an utterance word string based on the trained language model.

The utterance evaluation apparatus according to claim 2, further comprising data management means for storing in the storage means, for each utterance of speech data, a correct word string that matches the utterance word string,
wherein the language model weighting means adds the correct word string to the evaluation word string list and trains the language model using the correct word string.

The utterance evaluation apparatus according to claim 1, wherein the utterance evaluation means further comprises means for calculating a predetermined acoustic score and deriving a pronunciation intelligibility as a weighted linear sum of the highest word similarity and the acoustic score, and
wherein the evaluation result further includes the pronunciation intelligibility.
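
The weighted linear sum recited in the claim above can be written out as a short sketch. The weights of 0.5 each, the function name, and the assumption that the acoustic score has already been normalized to the same 0 to 1 range as the word similarity are hypothetical; the claim specifies neither the weights nor how the acoustic score is scaled.

```python
def pronunciation_intelligibility(
    highest_word_similarity: float,
    acoustic_score: float,
    similarity_weight: float = 0.5,  # hypothetical weight
    acoustic_weight: float = 0.5,    # hypothetical weight
) -> float:
    """Weighted linear sum of the highest word similarity and the acoustic score.

    Assumption: both inputs are normalized to the range 0 to 1.
    """
    return (
        similarity_weight * highest_word_similarity
        + acoustic_weight * acoustic_score
    )
```

Raising `similarity_weight` relative to `acoustic_weight` would make the measure reward saying the right words more than saying them clearly, and vice versa.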

The utterance evaluation apparatus according to claim 1, wherein the utterance evaluation means further includes means for storing the evaluation result in the storage means as a history,
the apparatus further comprising new record achievement determination/notification means that includes: threshold determination means for determining whether the highest word similarity exceeds a second threshold; highest-value determination means for determining, when the highest word similarity is determined to exceed the second threshold, whether any of the evaluation results for the utterance word string shows the highest value relative to the history; and means for notifying the user, by audio or video, that a new record has been achieved when the highest value is determined to be shown.
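
A minimal sketch of the new record determination described in the claim above, under stated assumptions: evaluation results are held as dictionaries of metric values, a larger value counts as better for every metric, the first qualifying result counts as a record when the history is empty, and the second threshold of 0.8 is hypothetical. The function name and dictionary keys are illustrative and do not come from the patent.

```python
from typing import Dict, List


def new_record_metrics(
    result: Dict[str, float],
    history: List[Dict[str, float]],
    second_threshold: float = 0.8,  # hypothetical value
) -> List[str]:
    """Return the names of the metrics in which `result` beats the whole history.

    An empty list means no new record should be announced.
    """
    # Threshold determination: weak matches never qualify as records.
    if result["highest_word_similarity"] <= second_threshold:
        return []
    records = []
    for name, value in result.items():
        past = [entry[name] for entry in history if name in entry]
        # Highest-value determination (assumes larger is better).
        if not past or value > max(past):
            records.append(name)
    return records
```

A caller would announce the new record by audio or video whenever the returned list is non-empty, and then append `result` to the stored history.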

An utterance evaluation program for causing a computer having a storage unit and a display unit to execute:
a step of storing, in the storage unit, an evaluation word string list including one or more predetermined evaluation word strings for evaluating speech data, a language model, a pronunciation dictionary, and an acoustic model;
a large-vocabulary continuous speech recognition step of recognizing speech data and converting it into an utterance word string based on the language model, the pronunciation dictionary, and the acoustic model, and generating a sound quality analysis result of the utterance word string based on the pronunciation dictionary and the acoustic model;
a step of displaying the analysis result on the display unit;
a most-similar evaluation word string detection step of detecting, from the evaluation word string list, the evaluation word string having the largest number of words that match the utterance word string, word by word, as the evaluation word string having the highest word similarity;
a step of generating, for the utterance word string, an evaluation result including at least the highest word similarity and an utterance speed; and
a step of displaying the evaluation word string having the highest word similarity and the evaluation result on the display unit only when the highest word similarity exceeds a first threshold.