JP6786065B2

JP6786065B2 - Voice rating device, voice rating method, teacher change information production method, and program

Info

Publication number: JP6786065B2
Application number: JP2016087967A
Authority: JP
Inventors: 博章田川; 玲子山田
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2016-04-26
Filing date: 2016-04-26
Publication date: 2020-11-18
Anticipated expiration: 2036-04-26
Also published as: JP2017198790A

Description

The present invention relates to a voice rating device and the like for rating voice.

Conventionally, there has been a pronunciation learning support device such as the following (see, for example, Patent Document 1). The device comprises: region-specific pronunciation information storage means that stores, for each region, words and phrases in association with model pronunciation information for those words and phrases; word-and-region input means that, based on a user operation, inputs any word or phrase and any region stored in the region-specific pronunciation information storage means as a designated phrase and a designated region; user voice input means that captures a user's voice for the designated phrase; and user voice evaluation means that evaluates the pronunciation of the user's voice captured by the user voice input means, based on the model pronunciation information corresponding to the designated phrase and the designated region.

Japanese Unexamined Patent Publication No. 2008-83446

However, the conventional device could not rate voice in a way that takes into account the flow of the uttered input voice, and therefore could not rate the input voice appropriately.

The voice rating device of the first invention comprises: a teacher change information storage unit that stores teacher change information on changes in the feature amounts of two or more pieces of partial voice information constituting teacher voice information, which is voice information serving as a teacher; a reception unit that receives input voice information, which is voice information having two or more partial voices; an acquisition unit that acquires input change information on changes in the feature amounts of the two or more pieces of partial voice information in the input voice information; a rating unit that rates the input voice information using the input change information and the teacher change information and acquires a score; and an output unit that outputs the score.

With this configuration, voice can be rated with the flow of the uttered input voice taken into account, so the input voice can be rated appropriately.

The voice rating device of the second invention is the device of the first invention in which the teacher change information and the input change information are information on the rank order of the magnitudes of the feature amounts of the partial voice information.

The voice rating device of the third invention is the device of the second invention in which, when the teacher change information contains identical rank information for the feature amounts of at least two rating-target pieces of partial voice information, the acquisition unit determines whether, in the input change information, the ranks of the feature amounts of the at least two rating-target pieces of partial voice information at the positions corresponding to that identical information are adjacent to each other. If it determines that they are adjacent, it regards the feature amounts of at least those two rating-target pieces of partial voice information as equal in magnitude and acquires the input change information accordingly.

With this configuration, voice can be rated with the flow of the uttered input voice taken into account, so the input voice can be rated appropriately. In particular, because the flow of the input voice of a sentence is taken into account, the input voice of a sentence can be rated appropriately.

The voice rating device of the fourth invention is the device of the second invention in which the acquisition unit acquires the feature amount of each of the two or more pieces of partial voice information in the input voice information, acquires the rank order of the magnitudes of the two or more feature amounts of at least two rating-target pieces of partial voice information among them, and acquires input change information containing that rank order.

The voice rating device of the fifth invention is the device of the second invention in which the acquisition unit acquires the feature amount of each of the two or more pieces of partial voice information in the input voice information and, for the two or more feature amounts of at least two rating-target pieces of partial voice information among them, acquires input change information that distinguishes the rating-target piece with the largest feature amount from the other rating-target pieces.

With this configuration, voice can be rated with the flow of the uttered input voice taken into account, so the input voice can be rated appropriately. In particular, because the flow within a word of the input voice of that word is taken into account, the input voice of a word can be rated appropriately.

The voice rating device of the sixth invention is the device of any one of the second to fifth inventions in which the information on rank order is a feature amount pattern, that is, a sequence giving the rank order of the magnitudes of the feature amounts of the two or more pieces of partial voice information in the teacher voice information or the input voice information.

The voice rating device of the seventh invention is the device of any one of the first to sixth inventions in which the input voice information is the voice information of a sentence and the partial voice information is the voice information of the words constituting the sentence.

With this configuration, voice can be rated with the flow of the input voice of the uttered sentence taken into account, so the input voice of a sentence can be rated appropriately.

The voice rating device of the eighth invention is the device of any one of the first to sixth inventions in which the input voice information is the voice information of a word and the partial voice information is the voice information of the phonemes constituting the word.

With this configuration, voice can be rated with the flow of the input voice of the uttered word taken into account, so the input voice of a word can be rated appropriately.

The voice rating device of the ninth invention is the device of any one of the first to eighth inventions in which the feature amount of the partial voice information is an accent intensity, which is information on the strength of an accent.

With this configuration, voice can be rated with changes in the accent intensity of the uttered input voice taken into account, so the input voice of a word can be rated appropriately.

The voice rating device of the tenth invention is the device of any one of the first to eighth inventions in which the feature amount of the partial voice information is a rhythm amount, which is information on the length of the voice information.

With this configuration, voice can be rated with changes in the rhythm amount of the uttered input voice taken into account, so the input voice of a word can be rated appropriately.

The voice rating device of the eleventh invention is the device of any one of the first to tenth inventions in which the rating unit acquires, as the score, the rank correlation coefficient between the input change information and the teacher change information.

With this configuration, an appropriate score can be calculated.
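As an illustration of the score described above, the sketch below computes a rank correlation coefficient between two rank lists. The source does not say which coefficient is used; Spearman's coefficient (Pearson correlation applied to the ranks, which remains valid when ranks are tied) is assumed here. The function name `rank_correlation` and the convention of returning 0.0 when one list is constant are illustrative choices, not taken from the source.

```python
from math import sqrt


def rank_correlation(teacher_ranks, input_ranks):
    """Spearman-style coefficient: Pearson correlation of two rank lists.

    Both arguments are assumed to already be rank values (ties may carry
    fractional ranks such as 2.5).
    """
    if len(teacher_ranks) != len(input_ranks):
        raise ValueError("rank lists must have the same length")
    n = len(teacher_ranks)
    mean_t = sum(teacher_ranks) / n
    mean_i = sum(input_ranks) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(teacher_ranks, input_ranks))
    var_t = sum((t - mean_t) ** 2 for t in teacher_ranks)
    var_i = sum((i - mean_i) ** 2 for i in input_ranks)
    if var_t == 0 or var_i == 0:
        # Assumption: a constant rank list has no defined correlation,
        # so this sketch reports 0.0 instead of raising.
        return 0.0
    return cov / sqrt(var_t * var_i)


# Identical rank lists give the highest score; reversed lists the lowest.
same = rank_correlation([1, 2.5, 2.5], [1, 2.5, 2.5])
reversed_order = rank_correlation([1, 2, 3], [3, 2, 1])
```

A coefficient near 1 would indicate that the learner's pieces are ordered by feature amount in the same way as the teacher's, and a value near -1 the opposite ordering.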

The voice rating device of the twelfth invention is the device of any one of the first to eleventh inventions that further comprises: a second rating unit that rates the pronunciation of the input voice information and acquires a second score; and a calculation unit that calculates a representative score using the score acquired by the rating unit and the second score acquired by the second rating unit. The output unit outputs the representative score.

With this configuration, the uttered input voice can be rated from multiple perspectives, so the input voice can be rated appropriately.

The teacher change information production device of the thirteenth invention comprises: a reception unit that receives teacher voice information; dividing means that divides the teacher voice information into two or more pieces of partial voice information; feature amount acquisition means that acquires two or more feature amounts from the two or more rating-target pieces of partial voice information among the two or more pieces of partial voice information; change information acquisition means that acquires teacher change information using the two or more feature amounts; and a storage unit that stores the teacher change information on a recording medium.

With this configuration, teacher data for rating voice with the flow of the uttered input voice taken into account can be generated automatically.

According to the voice rating device of the present invention, voice can be rated with the flow of the uttered input voice taken into account, so the input voice can be rated appropriately.

Block diagram of the voice rating device 1 according to Embodiment 1
Flowchart explaining the operation of the voice rating device 1
Flowchart explaining the change information acquisition process
Diagram showing the teacher change information management table
Diagram showing an output example
Block diagram of the voice rating device 2 according to Embodiment 2
Flowchart explaining the operation of the voice rating device 2
Block diagram of the production device 3 according to Embodiment 3
Flowchart explaining the operation of the production device 3
Overview of the computer system in the above embodiments
Block diagram of the computer system

Embodiments of the voice rating device and the like are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description of them may be omitted.

(Embodiment 1)
This embodiment describes a voice rating device that receives input voice information having two or more pieces of partial voice information, acquires input change information on changes in the feature amounts of the partial voice information, and rates the input voice using that input change information and the teacher change information of a teacher voice.

The input change information and the teacher change information (hereinafter sometimes referred to collectively as "change information") are, for example, information on rank order among two or more pieces of partial voice information. The information on rank order is, for example, a feature amount pattern, described later. The input voice information is, for example, a sentence or a word. The feature amount is, for example, an accent intensity or a rhythm amount, both described later. When the feature amount is an accent intensity, the voice rating device 1 performs accent rating. When the feature amount is a rhythm amount, the voice rating device 1 performs rhythm rating.

FIG. 1 is a block diagram of the voice rating device 1 according to this embodiment.

The voice rating device 1 includes a storage unit 11, a reception unit 12, a processing unit 13, and an output unit 14.

The storage unit 11 includes a teacher change information storage unit 111. The processing unit 13 includes an acquisition unit 131 and a rating unit 132. The acquisition unit 131 includes dividing means 1311, feature amount acquisition means 1312, and change information acquisition means 1313.

The storage unit 11 can store various kinds of information, for example the teacher change information, input voice information, and teacher voice information described later.

The teacher change information storage unit 111 stores one or more pieces of teacher change information. Teacher change information is information on changes in the feature amounts of the two or more pieces of partial voice information constituting the teacher voice information. The partial voice information is, for example, a phoneme or a word. Two or more pieces of teacher change information are, for example, the change information of the sentence of one piece of teacher voice information together with the change information of the two or more words constituting that sentence. They may also be, for example, the change information of two or more pieces of teacher voice information. They may also be, for example, the change information of the sentences of two or more pieces of teacher voice information together with the change information of the two or more words constituting each sentence.

The teacher voice information is voice information that serves as a teacher. It is usually the voice information of a word or a sentence. The term rendered here as "sentence" (文章) may equally be read as 文. The teacher change information is, for example, information on the rank order of the magnitudes of the feature amounts of two or more pieces of partial voice information. Such information is, for example, a feature amount pattern, which is a sequence giving the rank order of the magnitudes of the feature amounts of the two or more pieces of partial voice information. A feature amount pattern is, for example, an accent intensity pattern or a rhythm amount pattern. An accent intensity pattern may also be called an accent pattern. An accent pattern is a sequence giving the rank order of the accent intensities of the pieces of partial voice information. It expresses the magnitude relationship among the accent intensities of the words or phonemes in the voice information, for example by encoding those intensities as integer values.
Accent intensity is information on the strength of an accent. It includes, for example, per-phoneme accent intensity and per-word accent intensity. A technique for calculating per-phoneme accent intensity is disclosed in, for example, Japanese Patent No. 4716116 and is publicly known. The per-phoneme accent intensity is a representative value, over the phoneme's interval, of the per-frame accent intensity (usually the maximum, although the mean, median, or the like may be used). The phonemes for which it is calculated are usually vowels; phonemes other than vowels are excluded from calculation (rating) and are assigned, for example, an accent intensity of zero. The per-word accent intensity is calculated for each word as a representative value of the per-phoneme accent intensities within that word (usually the maximum, although the mean, median, or the like may be used). Alternatively, it may be calculated for each word as a representative value of the per-frame accent intensities within that word (again usually the maximum, although the mean, median, or the like may be used). Words excluded from rating (essentially silent intervals or words without vowels) are assigned, for example, an accent intensity of zero.
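The aggregation from frames to phonemes to words described above might be sketched as follows. The per-frame accent intensities are taken as given (the source refers to Japanese Patent No. 4716116 for how they are computed), and the data layout, a list of `(frame_intensities, is_vowel)` pairs per word, is a hypothetical one chosen for illustration. The sketch also assumes that the word-level representative value is taken over vowel phonemes only.

```python
from statistics import mean, median

# Representative-value choices named in the text: maximum (the usual one),
# mean, or median.
REPRESENTATIVE = {"max": max, "mean": mean, "median": median}


def phoneme_intensity(frames, is_vowel, how="max"):
    """Accent intensity of one phoneme from its per-frame intensities.

    Non-vowel phonemes are outside the calculation and get zero.
    """
    if not is_vowel or not frames:
        return 0.0
    return float(REPRESENTATIVE[how](frames))


def word_intensity(phonemes, how="max"):
    """Accent intensity of one word from its phonemes.

    `phonemes` is a list of (frame_intensities, is_vowel) pairs
    (hypothetical layout). Words with no vowel get zero.
    """
    vowel_values = [phoneme_intensity(frames, True, how)
                    for frames, is_vowel in phonemes
                    if is_vowel and frames]
    if not vowel_values:
        return 0.0
    return float(REPRESENTATIVE[how](vowel_values))


# Made-up frame values: one consonant followed by two vowels.
word = [([1.0, 2.0], False), ([10.0, 30.0, 20.0], True), ([5.0], True)]
strongest = word_intensity(word)          # maximum over vowel phonemes
averaged = word_intensity(word, "mean")   # mean of per-phoneme means
```

Using the maximum keeps a single strongly stressed vowel from being diluted by weaker ones, which may be why the text names it as the usual choice.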

The rhythm amount is information on the length of voice information, specifically on the length of a piece of partial voice information. A rhythm amount pattern is a sequence giving the rank order of the rhythm amounts of the pieces of partial voice information. When the feature amount pattern is a rhythm amount pattern, rhythm rating is possible. Rhythm rating evaluates whether the length for which a word or phoneme is uttered is correct (that is, whether it resembles a native speaker's utterance). The utterance length of a word or phoneme is obtained using forced alignment, and this length is called the rhythm amount. A rhythm pattern is generated from the rhythm amounts of the teacher voice. The similarity between the rhythm amount pattern obtained from the teacher voice information and the rhythm amounts of the input voice information is calculated using a rank correlation coefficient, which yields the rhythm rating score.
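A minimal sketch of how rhythm amounts might be derived once a forced aligner has produced segment boundaries. The alignment format, a list of `(label, start_seconds, end_seconds)` tuples with `sil` marking silence, and the timing values are hypothetical; the source does not describe the aligner's output.

```python
def rhythm_amounts(alignment, silence_label="sil"):
    """Return (label, duration) for every non-silent aligned segment.

    `alignment` is assumed to be a list of (label, start, end) tuples in
    seconds, as a forced aligner might produce.
    """
    return [(label, end - start)
            for label, start, end in alignment
            if label != silence_label]


# Made-up boundaries for part of an utterance.
alignment = [
    ("sil", 0.0, 0.25),
    ("alice", 0.25, 0.75),
    ("sil", 0.75, 1.0),
    ("looked", 1.0, 1.25),
]
durations = rhythm_amounts(alignment)
```

The resulting durations can then be ranked and compared with the teacher's rhythm amount pattern in the same way as accent intensities.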

The teacher change information may also be, for example, a feature amount tendency. A feature amount tendency is information indicating whether the feature amounts of the two or more pieces of partial voice information are increasing or decreasing. It may also take one of three values: increasing, decreasing, or unchanged. A feature amount tendency is, for example, an accent tendency or a rhythm amount tendency. An accent tendency is information indicating whether the accent intensities of the two or more pieces of partial voice information are increasing or decreasing, and it may likewise take one of the three values increasing, decreasing, or unchanged. A rhythm amount tendency is information indicating, for example, whether the rhythm amounts of the two or more pieces of partial voice information are increasing or decreasing. The two or more pieces of partial voice information from which feature amounts are acquired may be the two or more rating-target pieces that remain after excluding the pieces not subject to rating, or they may be all of the pieces of partial voice information.
Partial voice information not subject to rating is, for example, information on a silent interval or an interval without vowels.
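One possible reading of a feature amount tendency is a label for each transition between consecutive rating-target pieces. The sketch below follows that reading; the label strings, the `tolerance` parameter, and the per-transition interpretation itself are assumptions, since the source does not fix a representation.

```python
def feature_tendency(values, tolerance=0.0):
    """Label each step between consecutive feature amounts.

    Returns one of "up", "down", or "same" per adjacent pair. `tolerance`
    (hypothetical) is the largest difference still treated as unchanged.
    """
    labels = []
    for previous, current in zip(values, values[1:]):
        if current - previous > tolerance:
            labels.append("up")
        elif previous - current > tolerance:
            labels.append("down")
        else:
            labels.append("same")
    return labels


falling = feature_tendency([60.396744, 53.130833, 48.609158])
flat_then_rising = feature_tendency([1, 1, 2])
```

Under this reading, teacher and input tendencies could be compared position by position, although the source does not describe such a comparison.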

The data structure of the teacher change information is usually the same as that of the input change information described later. The teacher change information in the teacher change information storage unit 111 may be associated with, for example, an identifier. It may also be associated with, for example, teacher voice information. Teacher voice information is voice information that serves as a teacher, that is, voice information that serves as a model.

The teacher change information stored in the teacher change information storage unit 111, such as a feature amount pattern, is preferably information generated from teacher voice information. Such generation is performed, for example, by the teacher change information production device described in Embodiment 3. However, the teacher change information in the teacher change information storage unit 111 may instead be created manually on the basis of scientific knowledge from phonetics and linguistics. In addition, although teacher change information such as a feature amount pattern can basically be generated from teacher voice information, it may be adjusted manually to match the points to be rated (what the rating should focus on) in the sentence or word being rated.

The reception unit 12 receives voice information having two or more partial voices. This voice information is, for example, input voice information or teacher voice information, and is usually the voice information of a word or a sentence. Here, "receiving" is a concept that includes accepting information input from an input device such as a microphone, receiving information transmitted over a wired or wireless communication line, and accepting information read from a recording medium such as an optical disk, a magnetic disk, or a semiconductor memory.

The processing unit 13 performs various kinds of processing, for example the processing performed by the acquisition unit 131, the rating unit 132, and so on.

The acquisition unit 131 acquires change information on changes in the feature amounts of the two or more pieces of partial voice information in voice information. Specifically, it acquires input change information on changes in the feature amounts of the two or more pieces of partial voice information in the input voice information. As described in Embodiment 3, the acquisition unit 131 may also acquire teacher change information on changes in the feature amounts of the two or more pieces of partial voice information in teacher voice information. In this embodiment the acquisition unit 131 is described as acquiring input change information; in Embodiment 3 it acquires teacher change information, and its operation is the same.

When the input voice information is a sentence, the partial voice information is, for example, a word, although it may instead be a phoneme. When the input voice information is a word, the partial voice information is, for example, a phoneme. The feature amount is, for example, an accent intensity or a rhythm amount.

The input change information is, for example, an accent pattern of the input voice information or an accent tendency of the input voice information.

The acquisition unit 131, for example, acquires the feature amount of each of the two or more pieces of partial voice information in the voice information, acquires the rank order of the magnitudes of the two or more feature amounts of at least two rating-target pieces of partial voice information among them, and acquires change information containing that rank order. This method is called the first change information acquisition method. The two or more rating-target pieces of partial voice information are the pieces that remain after excluding, from the two or more pieces in the voice information, those not subject to rating. "Acquiring the rank order of the magnitudes of the two or more feature amounts of at least two rating-target pieces of partial voice information" may mean acquiring the rank order for the rating-target pieces only, or acquiring the rank order for all of the pieces of partial voice information in the voice information.
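The first change information acquisition method can be sketched as: drop the pieces not subject to rating, then rank the remaining feature amounts from largest to smallest. The sketch assumes that excluded pieces carry a zero feature amount and that tied values share the mean of the ranks they occupy, as in the worked example later in this description. The function names and the slash-delimited input format handling are illustrative.

```python
def parse_pattern(text):
    """Split a slash-delimited string such as '/0/2/0/1/0/1/0/' into floats."""
    return [float(token) for token in text.strip().split("/") if token.strip()]


def rank_order(values):
    """Descending ranks (largest = 1); tied values share their mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for value in values:
        first = ordered.index(value) + 1   # best rank held by this value
        count = ordered.count(value)       # how many pieces tie on it
        ranks.append(first + (count - 1) / 2)
    return ranks


def change_information(values):
    """Rank only the rating targets, i.e. the non-zero feature amounts."""
    targets = [v for v in values if v != 0.0]
    return rank_order(targets)


teacher = change_information(parse_pattern("/0/2/0/1/0/1/0/"))
learner = change_information(parse_pattern(
    "/0.000000/60.396744/0.000000/53.130833/0.000000/48.609158/0.000000/"))
```

Here `teacher` holds ranks with a tie on the second and third pieces, while `learner` holds three distinct ranks because the raw intensities all differ.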

When the teacher change information contains identical rank information for the feature amounts of two or more rating-target pieces of partial voice information, the acquisition unit 131, for example, determines whether, in the input, the ranks of the feature amounts of the two rating-target pieces of partial voice information at the positions corresponding to that identical information are adjacent to each other. If it determines that they are adjacent, it regards the feature amounts of those two rating-target pieces as equal in magnitude and acquires the input change information accordingly. This method is called the special method.
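A sketch of the special method, under the assumption that both sequences have already had the pieces not subject to rating removed. Where the teacher pattern has tied entries, the input values at the same positions are checked for adjacent ranks and, if adjacent, raised to the larger value so that they tie as well. Raising to the larger value follows the worked example in this description; handling tie groups of more than two pieces, and the function name, are assumptions of this sketch.

```python
def apply_tie_adjustment(teacher_pattern, input_values):
    """Make input values tie wherever the teacher ties and ranks are adjacent."""
    # 0-based descending rank of every input position.
    order = sorted(range(len(input_values)), key=lambda i: -input_values[i])
    rank_of = {index: rank for rank, index in enumerate(order)}

    # Group positions by teacher value to find the teacher's ties.
    groups = {}
    for index, value in enumerate(teacher_pattern):
        groups.setdefault(value, []).append(index)

    adjusted = list(input_values)
    for positions in groups.values():
        if len(positions) < 2:
            continue  # no tie in the teacher pattern for this value
        ranks = sorted(rank_of[i] for i in positions)
        if ranks[-1] - ranks[0] == len(ranks) - 1:  # ranks are consecutive
            top = max(input_values[i] for i in positions)
            for i in positions:
                adjusted[i] = top
    return adjusted


adjusted = apply_tie_adjustment([2, 1, 1],
                                [60.396744, 53.130833, 48.609158])
untouched = apply_tie_adjustment([1, 2, 1], [60.0, 50.0, 40.0])
```

In the first call the second and third inputs hold adjacent ranks, so both become 53.130833; in the second call the tied teacher positions hold the first and third input ranks, so nothing changes.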

For example, consider the case where the word transcription of the input voice information is "/sil/alice/sil/looked/sil/up/sil/" and the teacher word accent pattern (teacher change information) corresponding to that input voice information is "/0/2/0/1/0/1/0/". Assume that the sequence of input word accent intensities is "/0.000000/60.396744/0.000000/53.130833/0.000000/48.609158/0.000000/". In that case, for example, the following operations 1) to 5) are performed.
1) The acquisition unit 131 removes the zero values, which are not subject to rating, from the teacher word accent pattern "/0/2/0/1/0/1/0/". This yields the teacher word accent pattern "2 1 1".
2) The acquisition unit 131 removes the zero values, which are not subject to rating, from the input word accent intensities. This yields the input word accent intensities "60.396744 53.130833 48.609158".
3) The acquisition unit 131 derives rank data from the teacher word accent pattern. Here the teacher change information contains a tie: the second and third entries have the same value and would occupy the adjacent ranks 2 and 3. The acquisition unit 131 therefore assigns the average of those ranks, (2 + 3) / 2 = 2.5, as a mid-rank, and obtains the teacher word accent ranks "1 2.5 2.5".
4) When the teacher word accent ranks contain a tie, and the input word accent intensities at the positions of the tied entries have adjacent ranks, the acquisition unit 131 converts those input word accent intensities into tied data by setting them to the larger intensity value. That is, the acquisition unit 131 obtains the input word accent intensities "60.396744 53.130833 53.130833".
5) The acquisition unit 131 derives rank data from the input word accent intensities. Because they now contain a tie, the acquisition unit 131 assigns a mid-rank and obtains the input word accent ranks "1 2.5 2.5".
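Steps 1) to 5) above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `midranks` and `special_method` are hypothetical, positions are dropped wherever the teacher pattern is zero, and "adjacent" is taken to mean that the input ranks of the tied positions differ by at most one.

```python
def midranks(values):
    """Rank values in descending order (largest = rank 1); ties share the mean rank."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def special_method(teacher_pattern, input_intensity):
    """Return (teacher ranks, input ranks) following steps 1) to 5)."""
    # Steps 1-2: keep only positions subject to rating (assumed: teacher value != 0).
    keep = [i for i, t in enumerate(teacher_pattern) if t != 0]
    teacher = [teacher_pattern[i] for i in keep]
    values = [input_intensity[i] for i in keep]

    # Step 3: teacher ranks, with mid-ranks for ties.
    teacher_ranks = midranks(teacher)

    # Step 4: for each group of tied teacher entries, if the input ranks at
    # those positions are adjacent, raise the input values to the largest one.
    input_ranks = midranks(values)
    groups = {}
    for i, t in enumerate(teacher):
        groups.setdefault(t, []).append(i)
    for idx in groups.values():
        if len(idx) < 2:
            continue
        r = sorted(input_ranks[i] for i in idx)
        if all(b - a <= 1 for a, b in zip(r, r[1:])):
            top = max(values[i] for i in idx)
            for i in idx:
                values[i] = top

    # Step 5: input ranks, with mid-ranks for the ties created in step 4.
    return teacher_ranks, midranks(values)


teacher_ranks, input_ranks = special_method(
    [0, 2, 0, 1, 0, 1, 0],
    [0.0, 60.396744, 0.0, 53.130833, 0.0, 48.609158, 0.0],
)
print(teacher_ranks, input_ranks)
```

With the values from the example, both rank lists come out as 1, 2.5, 2.5, matching steps 3) and 5).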

The acquisition unit 131, for example, acquires the feature amount of each of the two or more pieces of partial voice information contained in the input voice information. Then, for the feature amounts of the two or more pieces of evaluation target partial voice information that remain after those not subject to rating are excluded, it acquires input change information that distinguishes the evaluation target partial voice information with the largest feature amount from the other evaluation target partial voice information. This method is called the second change information acquisition method. The partial voice information is, for example, the voice information of a word constituting a sentence, or the voice information of a phoneme constituting a word.

The dividing means 1311 divides the input voice information into two or more pieces of partial voice information. For example, the dividing means 1311 calculates section information for each phoneme of the input voice information using forced alignment. Section information indicates the section that the phoneme occupies within the input voice information, for example from which millisecond to which millisecond of the input voice information. The dividing means 1311, for example, aligns the input voice information received by the reception unit 12 with the stored teacher voice information. The teacher voice information is stored in the storage unit 11 and has the same phoneme sequence as the input voice information. Here, the alignment is usually forced alignment. Forced alignment is a process that forcibly associates the phonemes of the voice information with the phonemes of the teacher data; since it is a known technique, a detailed description is omitted. The dividing means 1311 may also divide the input voice information into two or more words. Any algorithm may be used by the dividing means 1311 to divide the input voice information into two or more pieces of partial voice information.

The feature amount acquisition means 1312 acquires a feature amount from each of the two or more pieces of partial voice information produced by the dividing means 1311. For example, the feature amount acquisition means 1312 calculates the accent intensity of each frame of the partial voice information. It then calculates, for example, the accent intensity of each phoneme from the per-phoneme section information and the per-frame accent intensities, taking a representative value of the accent intensities of the frames within one phoneme as the accent intensity of that phoneme. The representative value is, for example, the maximum, the mean, or the median. The phonemes for which a feature amount such as accent intensity is calculated are usually vowels, and the feature amount acquisition means 1312 preferably sets the feature amount of phonemes other than vowels to zero (0). The feature amount acquisition means 1312 also calculates, for example, the accent intensity of each word of the input voice information from the accent intensities of the phonemes in that word, taking a representative value of the accent intensities of the phonemes in one word as the accent intensity of the word. The representative value is as described above. Alternatively, the feature amount acquisition means 1312 may take a representative value of the accent intensities of the frames in one word as the accent intensity of the word.
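The aggregation from frames to phonemes to words described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the per-frame accent intensities are taken as given (their computation is outside this example), the frame length `frame_ms`, the vowel set `VOWELS`, and both function names are hypothetical, and sections are assumed to be `(phoneme, start_ms, end_ms)` tuples such as a forced aligner might produce.

```python
from statistics import mean, median

# Candidate representative values named in the text: maximum, mean, median.
REPRESENTATIVE = {"max": max, "mean": mean, "median": median}

# Hypothetical vowel inventory, used only for this example.
VOWELS = {"ah", "er", "ae"}


def phoneme_intensities(frame_intensity, sections, frame_ms=10, how="max"):
    """Return [(phoneme, intensity)], with non-vowels set to 0.0."""
    agg = REPRESENTATIVE[how]
    result = []
    for phoneme, start_ms, end_ms in sections:
        if phoneme not in VOWELS:
            result.append((phoneme, 0.0))  # consonants and silence are not rated
            continue
        frames = frame_intensity[start_ms // frame_ms:end_ms // frame_ms]
        result.append((phoneme, float(agg(frames)) if frames else 0.0))
    return result


def word_intensity(phoneme_values, how="max"):
    """Representative value of the phoneme intensities within one word."""
    values = [v for _, v in phoneme_values]
    return float(REPRESENTATIVE[how](values)) if values else 0.0


frames = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # made-up per-frame intensities
sections = [("ah", 0, 30), ("n", 30, 40), ("er", 40, 60)]
per_phoneme = phoneme_intensities(frames, sections)
print(per_phoneme, word_intensity(per_phoneme))
```

Choosing the maximum as the default keeps zero-valued consonants from diluting the word-level value; with the mean or median, the zeros would pull the result down.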

The process of calculating the accent intensity of each frame is described in Japanese Patent No. 4716116 and elsewhere and is a known technique, so a detailed description is omitted.

The change information acquisition means 1313 acquires input change information on how the feature amounts acquired by the feature amount acquisition means 1312 change across the two or more pieces of partial voice information.

The change information acquisition means 1313, for example, acquires the ranks by magnitude of the feature amounts of the two or more pieces of partial voice information acquired by the feature amount acquisition means 1312, and acquires input change information containing those ranks. This is the first change information acquisition method.

In the first change information acquisition method, suppose for example that the word transcription of the input voice information of a sentence is "/sil/alice/sil/looked/sil/up/sil/" and the sequence of word accent intensities is "/0.000000/60.396744/0.000000/53.130833/0.000000/48.609158/0.000000/". The change information acquisition means 1313 then acquires, for example, the word accent pattern "/0/3/0/2/0/1/0/". That is, because the word accent intensity /60.396744/ of the word /alice/ is the largest, the change information acquisition means 1313 assigns the maximum number "3" as the rank information for /alice/. The maximum number is the number of words to be rated (the number of pieces of evaluation target partial voice information). Because the word accent intensity /53.130833/ of the word /looked/ is the second largest, it assigns "2" to /looked/. Because the word accent intensity /48.609158/ of the word /up/ is the third largest, it assigns "1" to /up/. Finally, it assigns the rank information "0" to the silent sections, whose accent intensity is /0.000000/. In this way the change information acquisition means 1313 acquires the word accent pattern "/0/3/0/2/0/1/0/".
Here, a word transcription is a representation of the sequence of word utterances obtained when the speech of a sentence is segmented into words. The word accent intensity is an example of a feature amount of a word. The word accent pattern is an example of input change information and describes the ordering of the words by accent intensity. The slash "/" is a word delimiter. "sil" is a symbol for silence; because silence is not rated, its accent intensity is usually set to zero. Silence is generally present between words, but not always. The basic rules for patterning here are, for example, the following 1) and 2).
1) A larger accent intensity is given a larger integer pattern value.
2) Words and phonemes that are not subject to rating are given zero.
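Patterning rules 1) and 2) can be sketched as follows. This is a minimal illustration: the function name `first_method_pattern` is hypothetical, an intensity of zero is assumed to mark an item that is not rated, and ties among input intensities are not treated specially (they receive consecutive values in position order).

```python
def first_method_pattern(intensities):
    """Map accent intensities to an integer pattern: larger intensity, larger value."""
    # Rule 2: items with zero intensity are not rated and keep the value 0.
    targets = [i for i, v in enumerate(intensities) if v > 0.0]
    pattern = [0] * len(intensities)
    # Rule 1: the smallest rated intensity gets 1, the largest gets len(targets).
    for value, i in enumerate(sorted(targets, key=lambda i: intensities[i]), start=1):
        pattern[i] = value
    return pattern


words = [0.0, 60.396744, 0.0, 53.130833, 0.0, 48.609158, 0.0]
print(first_method_pattern(words))
```

For the /sil/alice/sil/looked/sil/up/sil/ intensities this yields 0, 3, 0, 2, 0, 1, 0, the same pattern as in the text.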

Patterning the accent intensities in this way yields information that expresses only the relative magnitude (strong/weak relationship) of the accents of words or phonemes. In accent rating, the actual accent intensity value of a given word or phoneme is not important. What matters is the relative order: which words (or phonemes) in a sentence (or word) have a large accent intensity and which have a small one. The purpose of accent rating is to measure how closely this ordering resembles the ordering of accent intensities among the words (or phonemes) of the teacher voice. In other words, if accent pattern data is available to serve as the teacher for the accent, accent rating can be performed with sufficient accuracy even without accent intensity data.

In addition, compared with directly comparing the accent intensities of the teacher voice information and the input voice information, introducing an accent pattern allows adjustments such as those described later. This gives greater freedom in deciding how the teacher's accent is defined, that is, from what viewpoint the accent of the input voice information is rated.

For example, in the special method of the accent pattern generation method described later, the phoneme accent pattern (excluding the zero values not subject to rating) is /2 1 1/. This teacher pattern says that the answer is correct if the phoneme with the maximum intensity comes first, and that the difference between the second and third intensities has no bearing on accent quality. However, the phoneme accent intensities of the teacher voice show little difference between the first and second values, while the third is far from both. This indicates that the difference between the first and second intensities is irrelevant to accent quality, and that the point to check is that the third has a small intensity. If the teacher accent pattern is set to /2 2 1/ instead, that point can be checked and is reflected in the rating score. Thus, by introducing patterns, which are simple integer-valued data, into the teacher's accent information, the points to be rated can be adjusted flexibly.

Also in the first change information acquisition method, suppose for example that the phoneme transcription of the word "understand" is "/sil/ah n d er s t ae n d/sil/" and the sequence of phoneme accent intensities is "/0.000000/62.717609 0.000000 0.000000 62.379860 0.000000 0.000000 51.971569 0.000000 0.000000/0.000000/". The change information acquisition means 1313 then acquires, for example, the phoneme accent pattern "/0/3 0 0 2 0 0 1 0 0/0/". That is, because the phoneme accent intensity "62.717609" of the phoneme "ah" is the largest, the change information acquisition means 1313 assigns the maximum number "3" as the rank information for "ah". The maximum number is the number of phonemes to be rated (the number of pieces of evaluation target partial voice information). Because the phoneme accent intensity "62.379860" of the phoneme "er" is the second largest, it assigns "2" to "er". Because the phoneme accent intensity "51.971569" of the phoneme "ae" is the third largest, it assigns "1" to "ae". Finally, it assigns the rank information "0" to the silent sections and consonants, whose accent intensity is /0.000000/. In this way the change information acquisition means 1313 acquires the phoneme accent pattern "/0/3 0 0 2 0 0 1 0 0/0/". Here, a phoneme transcription is a representation of the sequence of phoneme utterances obtained when the speech of a word or sentence is segmented into phonemes. The phoneme accent pattern is an example of input change information and describes the ordering of the phonemes by accent intensity.

The change information acquisition means 1313, for example, determines whether the teacher change information corresponding to the input voice information assigns identical rank information (a tie) to two or more pieces of evaluation target partial voice information, and whether the two pieces of evaluation target partial voice information at the corresponding positions in the input have adjacent ranks by feature amount magnitude. If both conditions hold, it treats the feature amounts of those two pieces as equal and then acquires the input change information. This is the special method among the change information acquisition methods.

The change information acquisition means 1313, for example, acquires input change information that, for the feature amounts of the two or more pieces of evaluation target partial voice information, distinguishes the piece with the largest feature amount from the others. In this case, it assigns the value "2" to the evaluation target partial voice information with the largest feature amount and "1" to the other evaluation target partial voice information, and acquires the resulting sequence of numbers, in the order of the partial voice information, as the input change information. This is the second change information acquisition method.

In the second change information acquisition method, suppose for example that the word transcription of the input voice information of a sentence is "/sil/alice/sil/looked/sil/up/sil/" and the sequence of word accent intensities is "/0.000000/60.396744/0.000000/53.130833/0.000000/48.609158/0.000000/". The change information acquisition means 1313 then acquires, for example, the word accent pattern "/0/2/0/1/0/1/0/". That is, because the word accent intensity /60.396744/ of the word /alice/ is the largest, the change information acquisition means 1313 assigns the maximum number "2" as the rank information for /alice/. Because the word accent intensity /53.130833/ of the word /looked/ is the second largest or lower, it assigns "1" to /looked/. Likewise, because the word accent intensity /48.609158/ of the word /up/ is the second largest or lower, it assigns "1" to /up/. Finally, it assigns the rank information "0" to the silent sections, whose accent intensity is /0.000000/. In this way the change information acquisition means 1313 acquires the word accent pattern "/0/2/0/1/0/1/0/".

Also in the second change information acquisition method, suppose for example that the phoneme transcription of the word "understand" is "/sil/ah n d er s t ae n d/sil/" and the sequence of phoneme accent intensities is "/0.000000/62.717609 0.000000 0.000000 62.379860 0.000000 0.000000 51.971569 0.000000 0.000000/0.000000/". The change information acquisition means 1313 then acquires, for example, the phoneme accent pattern "/0/2 0 0 1 0 0 1 0 0/0/". That is, because the phoneme accent intensity "62.717609" of the phoneme "ah" is the largest, the change information acquisition means 1313 assigns the maximum number "2" as the rank information for "ah". Because the phoneme accent intensity "62.379860" of the phoneme "er" is the second largest or lower, it assigns "1" to "er". Likewise, because the phoneme accent intensity "51.971569" of the phoneme "ae" is the second largest or lower, it assigns "1" to "ae". Finally, it assigns the rank information "0" to the silent sections and consonants, whose accent intensity is /0.000000/. In this way the change information acquisition means 1313 acquires the phoneme accent pattern "/0/2 0 0 1 0 0 1 0 0/0/".
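The second change information acquisition method can be sketched in the same style. This is an illustrative sketch: the function name `second_method_pattern` is hypothetical, zero intensity is assumed to mark an item that is not rated, and if several items share the maximum intensity only the first of them is given the value 2.

```python
def second_method_pattern(intensities):
    """Give 2 to the rated item with the largest intensity, 1 to other rated items, 0 otherwise."""
    targets = [i for i, v in enumerate(intensities) if v > 0.0]
    pattern = [0] * len(intensities)
    if not targets:
        return pattern
    top = max(targets, key=lambda i: intensities[i])  # first index holding the maximum
    for i in targets:
        pattern[i] = 2 if i == top else 1
    return pattern


phonemes = [0.0, 62.717609, 0.0, 0.0, 62.379860, 0.0, 0.0, 51.971569, 0.0, 0.0, 0.0]
print(second_method_pattern(phonemes))
```

For the "understand" intensities this yields 0, 2, 0, 0, 1, 0, 0, 1, 0, 0, 0, the same pattern as in the text.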

The first change information acquisition method is suited to rating sentences. In a sentence task, an utterance is rated as good pronunciation when every word, from strongly accented to weakly accented, is uttered with the correct accent. Therefore, the magnitude (strong/weak) relationships among all word accent intensities need to be examined. Rating with an accent pattern that follows the accent intensities of the teacher voice, as in the first change information acquisition method, makes this possible. The second change information acquisition method is suited to rating words. In a word task, what is evaluated is whether the phoneme that should carry the first (or up to roughly the first few) accent intensity (maximum intensity) actually has the correct accent (maximum intensity). Rating with an accent pattern in which only the phoneme with the first accent intensity (maximum intensity) has a large pattern value and all others have a flat pattern value, as in the second change information acquisition method, makes this evaluation possible.

The rating unit 132 rates the input voice information using the input change information and the teacher change information, and acquires a score. The rating unit 132 usually acquires, as the score, information on the difference between the input change information and the teacher change information (which may also be described as information on their degree of similarity). For example, the rating unit 132 acquires the rank correlation coefficient between the input change information and the teacher change information as the score. The rank correlation coefficient is, for example, Spearman's rank correlation coefficient. Since Spearman's rank correlation coefficient is a known technique, a detailed description is omitted.

An example of how the rating unit 132 calculates the score using Spearman's rank correlation coefficient is as follows. Let the teacher change information (teacher word accent ranks) be x = {x₁, x₂, ..., x_N}, let n_x be the number of tie groups in x, and let t_i (i = 1, 2, ..., n_x) be the number of tied entries in the i-th group. Likewise, let the input change information (input word accent ranks) be y = {y₁, y₂, ..., y_N}, let n_y be the number of tie groups in y, and let t_j (j = 1, 2, ..., n_y) be the number of tied entries in the j-th group. The rating unit 132 then calculates Spearman's rank correlation coefficient by Equation 1, and calculates the accent rating score from the rank correlation coefficient by Equation 4. For the example above, in which both the teacher word accent ranks and the input word accent ranks are "1 2.5 2.5", the rating score is 1.0, that is, a perfect score.

The rank correlation coefficient may be one other than Spearman's (for example, Kendall's rank correlation coefficient). A rank correlation coefficient takes values from -1 to 1. The score acquired by the rating unit 132 is, for example, the rank correlation coefficient itself when it is positive (0.0 to 1.0), and 0.0 when the coefficient is 0 or less.
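The scoring described above can be sketched as follows. This is a minimal illustration that uses the standard definition of Spearman's coefficient as the Pearson correlation of mid-ranks, which handles ties but may differ in form from the source's Equation 1. The function names are hypothetical, and returning 0.0 when one side has no variation is an assumption, since the coefficient is undefined in that case.

```python
import math


def midranks(values):
    """Rank values in descending order (largest = rank 1); ties share the mean rank."""
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def spearman(x, y):
    """Spearman's rank correlation, computed as Pearson correlation of mid-ranks."""
    rx, ry = midranks(x), midranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: treat the undefined case as no correlation
    return cov / math.sqrt(vx * vy)


def accent_score(teacher, inp):
    """Clip the coefficient to [0.0, 1.0]: values of 0 or less become 0.0."""
    return max(0.0, spearman(teacher, inp))


# Teacher pattern "2 1 1" against the tie-adjusted input intensities.
print(accent_score([2, 1, 1], [60.396744, 53.130833, 53.130833]))
```

Because only the ordering matters, the function accepts patterns, ranks, or raw intensities on either side, provided that a larger number means a stronger accent on both sides (or a smaller one on both).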

The rating unit 132 calculates, for example, a sentence score from the word accent pattern of the teacher voice information and the word accent intensities of the input voice information. This score can be regarded as the accent score of the sentence. Similarly, the rating unit 132, for example, splits the phoneme accent pattern and the phoneme accent intensities by word and calculates a score for each word. This score can be regarded as the accent score of the word.

The output unit 14 outputs the score acquired by the rating unit 132. Here, output is a concept that includes display on a display, projection using a projector, printing by a printer, sound output, transmission to an external device, storage on a recording medium, and delivery of the processing result to another processing device or another program.

The storage unit 11 and the teacher change information storage unit 111 are preferably non-volatile recording media, but can also be realized with volatile recording media.

The process by which information comes to be stored in the storage unit 11 and the like does not matter. For example, information may be stored in the storage unit 11 and the like via a recording medium, information transmitted via a communication line or the like may be stored there, or information entered via an input device may be stored there.

The processing unit 13, the acquisition unit 131, the rating unit 132, the dividing means 1311, the feature amount acquisition means 1312, and the change information acquisition means 1313 can usually be realized by an MPU, memory, and the like. The processing procedure of the processing unit 13 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

The output unit 14 may or may not be considered to include an output device such as a display or a speaker. The output unit 14 can be realized by driver software for an output device, or by driver software for an output device together with the output device.

次に、音声評定装置１の動作について、図２のフローチャートを用いて説明する。 Next, the operation of the voice rating device 1 will be described with reference to the flowchart of FIG. 2.

（ステップＳ２０１）受付部１２は、入力音声情報を受け付けたか否かを判断する。入力音声情報を受け付けた場合はステップＳ２０２に行き、入力音声情報を受け付けない場合はステップＳ２０１に戻る。 (Step S201) The reception unit 12 determines whether or not the input voice information has been received. If the input voice information is accepted, the process proceeds to step S202, and if the input voice information is not received, the process returns to step S201.

（ステップＳ２０２）取得部１３１は、ステップＳ２０１で受け付けられた音声情報の変化情報を取得する。変化情報取得処理について、図３フローチャートを用いて説明する。なお、ここでは、ステップＳ２０１で受け付けられた入力音声情報の入力変化情報を取得する。また、例えば、取得部１３１は、文章の入力音声情報について、文章の入力変化情報と、入力音声情報を構成する２以上の各単語の入力変化情報とを取得する、とする。 (Step S202) The acquisition unit 131 acquires change information of the voice information received in step S201. The change information acquisition process is described with reference to the flowchart of FIG. 3. Here, the input change information of the input voice information received in step S201 is acquired. For example, for the input voice information of a sentence, the acquisition unit 131 acquires the input change information of the sentence and the input change information of each of the two or more words constituting the input voice information.

（ステップＳ２０３）評定部１３２は、ステップＳ２０１で受け付けられた入力音声情報に対応する変化情報であり、文章の教師変化情報を教師変化情報格納部１１１から取得する。 (Step S203) The rating unit 132 acquires, from the teacher change information storage unit 111, the teacher change information of the sentence, which is the change information corresponding to the input voice information received in step S201.

（ステップＳ２０４）評定部１３２は、ステップＳ２０２で取得された文章の入力変化情報と、ステップＳ２０３で取得した文章の教師変化情報とを用いて、スコアを取得する。 (Step S204) The rating unit 132 acquires a score by using the input change information of the sentence acquired in step S202 and the teacher change information of the sentence acquired in step S203.

（ステップＳ２０５）出力部１４は、ステップＳ２０４で取得されたスコアを出力する。このスコアは、文章の入力音声情報の全体のスコアである。 (Step S205) The output unit 14 outputs the score acquired in step S204. This score is the total score of the input voice information of the sentence.

（ステップＳ２０６）評定部１３２は、カウンタｉに１を代入する。 (Step S206) The rating unit 132 substitutes 1 for the counter i.

（ステップＳ２０７）評定部１３２は、ステップＳ２０１で受け付けられた入力音声情報の中に、ｉ番目の単語の音声情報が存在するか否かを判断する。ｉ番目の単語の音声情報が存在すればステップＳ２０８に行き、ｉ番目の単語の音声情報が存在しなければステップＳ２０１に戻る。なお、ｉ番目の単語の音声情報が存在するか否かは、入力音声情報に対応する文章の中に、ｉ番目の単語が存在するか否かと同意義である。つまり、ステップＳ２０６からステップＳ２１２のループにおいて、実質的に単語ごとにスコアを出力する処理が行えれば良く、ｉ番目の単語が存在するか否かの判断に使用する情報は問わない。 (Step S207) The rating unit 132 determines whether or not the voice information of the i-th word exists in the input voice information received in step S201. If the voice information of the i-th word exists, the process proceeds to step S208, and if the voice information of the i-th word does not exist, the process returns to step S201. It should be noted that whether or not the voice information of the i-th word exists has the same meaning as whether or not the i-th word exists in the sentence corresponding to the input voice information. That is, in the loop from step S206 to step S212, it suffices if the process of outputting the score for each word can be performed, and the information used for determining whether or not the i-th word exists does not matter.

（ステップＳ２０８）評定部１３２は、ステップＳ２０２で取得されていた入力変化情報のうちの、ｉ番目の単語の入力変化情報を取得する。 (Step S208) The rating unit 132 acquires the input change information of the i-th word among the input change information acquired in step S202.

（ステップＳ２０９）評定部１３２は、ｉ番目の単語の教師変化情報を教師変化情報格納部１１１から取得する。 (Step S209) The rating unit 132 acquires the teacher change information of the i-th word from the teacher change information storage unit 111.

（ステップＳ２１０）評定部１３２は、ステップＳ２０８で取得したｉ番目の単語の入力変化情報と、ステップＳ２０９で取得したｉ番目の単語の教師変化情報とを用いて、スコアを取得する。このスコアは、入力音声情報のうちのｉ番目の単語の音声のスコアである。 (Step S210) The rating unit 132 acquires a score by using the input change information of the i-th word acquired in step S208 and the teacher change information of the i-th word acquired in step S209. This score is the voice score of the i-th word in the input voice information.

（ステップＳ２１１）出力部１４は、ステップＳ２１０で取得されたスコアを出力する。このスコアは、入力音声情報のｉ番目の単語のスコアである。 (Step S211) The output unit 14 outputs the score acquired in step S210. This score is the score of the i-th word of the input voice information.

（ステップＳ２１２）評定部１３２は、カウンタｉを１、インクリメントする。ステップＳ２０７に戻る。 (Step S212) The rating unit 132 increments the counter i by 1. Return to step S207.

なお、図２のフローチャートにおいて、評定部１３２は、入力音声情報の文章のスコアと２以上の単語のスコアとを用いて、代表スコアを算出しても良い。そして、出力部１４は、この代表スコアを出力しても良い。なお、代表スコアは、通常、文章のスコアと２以上の単語のスコアとをパラメータとする増加関数である。代表スコアは、例えば、文章のスコアと２以上の単語のスコアの平均値、中央値、最大値等である。 In the flowchart of FIG. 2, the rating unit 132 may calculate the representative score by using the score of the sentence of the input voice information and the score of two or more words. Then, the output unit 14 may output this representative score. The representative score is usually an increasing function with the score of a sentence and the score of two or more words as parameters. The representative score is, for example, the average value, the median value, the maximum value, etc. of the score of the sentence and the score of two or more words.

また、図２のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。 Further, in the flowchart of FIG. 2, the processing is terminated by an interrupt of power off or processing termination.
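The flow of FIG. 2 (steps S203 to S212) and the optional representative score can be sketched as follows. This is only an illustration under stated assumptions: the function names, the list-based pattern representation, and the placeholder `exact_match` scoring function are hypothetical and do not come from the specification, and the sketch returns all scores together, whereas the flowchart outputs each score as soon as it is obtained (S205, S211).

```python
from statistics import mean, median
from typing import Callable, Dict, List, Sequence

Pattern = Sequence[int]
ScoreFn = Callable[[Pattern, Pattern], float]


def exact_match(observed: Pattern, teacher: Pattern) -> float:
    """Hypothetical stand-in for the rating unit: 1.0 if the patterns agree."""
    return 1.0 if list(observed) == list(teacher) else 0.0


def rate_utterance(
    input_sentence: Pattern,
    input_words: List[Pattern],
    teacher_sentence: Pattern,
    teacher_words: List[Pattern],
    score_fn: ScoreFn = exact_match,
) -> Dict[str, object]:
    """One sentence score (S203-S204), then one score per word (S206-S212)."""
    sentence_score = score_fn(input_sentence, teacher_sentence)
    word_scores = []
    i = 0  # S206 (0-based here; the flowchart counts from 1)
    while i < len(input_words) and i < len(teacher_words):  # S207
        word_scores.append(score_fn(input_words[i], teacher_words[i]))  # S208-S210
        i += 1  # S212
    return {"sentence": sentence_score, "words": word_scores}


def representative_score(sentence_score: float, word_scores: List[float],
                         how: str = "mean") -> float:
    """Combine the sentence score and the word scores (mean, median, or max)."""
    reducers = {"mean": mean, "median": median, "max": max}
    return reducers[how]([sentence_score, *word_scores])
```

Mean, median, and maximum are the examples the description names; any other function that increases with the individual scores could be substituted in `reducers`.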

次に、ステップＳ２０２の入力変化情報取得処理の例について、図３フローチャートを用いて説明する。 Next, an example of the input change information acquisition process in step S202 will be described with reference to the flowchart of FIG. 3.

（ステップＳ３０１）取得部１３１の分割手段１３１１は、入力音声情報を２以上の音素に分割する。通常、分割手段１３１１は、入力音声情報から、音素ごとの区間情報を取得する。 (Step S301) The dividing means 1311 of the acquisition unit 131 divides the input voice information into two or more phonemes. Usually, the dividing means 1311 acquires the section information for each phoneme from the input voice information.

（ステップＳ３０２）取得部１３１の特徴量取得手段１３１２は、カウンタｉに１を代入する。 (Step S302) The feature amount acquisition means 1312 of the acquisition unit 131 substitutes 1 for the counter i.

（ステップＳ３０３）特徴量取得手段１３１２は、ステップＳ３０１で分割した２以上の音素の中で、ｉ番目の音素が存在するか否かを判断する。ｉ番目の音素が存在する場合はステップＳ３０４に行き、ｉ番目の音素が存在しない場合はステップＳ３１０に行く。 (Step S303) The feature amount acquisition means 1312 determines whether or not the i-th phoneme exists among the two or more phonemes divided in step S301. If the i-th phoneme is present, the process goes to step S304, and if the i-th phoneme is not present, the process goes to step S310.

（ステップＳ３０４）特徴量取得手段１３１２は、カウンタｊに１を代入する。 (Step S304) The feature amount acquisition means 1312 substitutes 1 for the counter j.

（ステップＳ３０５）特徴量取得手段１３１２は、ｉ番目の音素の中で、ｊ番目のフレームが存在するか否かを判断する。ｊ番目のフレームが存在する場合はステップＳ３０６に行き、ｊ番目のフレームが存在しない場合はステップＳ３０８に行く。 (Step S305) The feature amount acquisition means 1312 determines whether or not the j-th frame exists in the i-th phoneme. If the j-th frame exists, the process goes to step S306, and if the j-th frame does not exist, the process goes to step S308.

（ステップＳ３０６）特徴量取得手段１３１２は、ｊ番目のフレームの特徴量を取得する。特徴量は、例えば、アクセント強度である。 (Step S306) The feature amount acquisition means 1312 acquires the feature amount of the j-th frame. The feature amount is, for example, the accent intensity.

（ステップＳ３０７）特徴量取得手段１３１２は、カウンタｊを１、インクリメントする。ステップＳ３０５に戻る。 (Step S307) The feature amount acquisition means 1312 increments the counter j by 1. Return to step S305.

（ステップＳ３０８）特徴量取得手段１３１２は、ステップＳ３０６で取得した２以上のフレームの特徴量から、ｉ番目の音素の代表特徴量を取得する。 (Step S308) The feature amount acquisition means 1312 acquires the representative feature amount of the i-th phoneme from the feature amounts of the two or more frames acquired in step S306.

（ステップＳ３０９）特徴量取得手段１３１２は、カウンタｉを１、インクリメントする。ステップＳ３０３に戻る。 (Step S309) The feature amount acquisition means 1312 increments the counter i by 1. Return to step S303.

（ステップＳ３１０）取得部１３１の変化情報取得手段１３１３は、カウンタｋに１を代入する。 (Step S310) The change information acquisition means 1313 of the acquisition unit 131 substitutes 1 for the counter k.

（ステップＳ３１１）変化情報取得手段１３１３は、ｋ番目の単語が存在するか否かを判断する。ｋ番目の単語が存在する場合はステップＳ３１２に行き、ｋ番目の単語が存在しない場合はステップＳ３１６に行く。 (Step S311) The change information acquisition means 1313 determines whether or not the kth word exists. If the k-th word exists, the process goes to step S312, and if the k-th word does not exist, the process goes to step S316.

（ステップＳ３１２）変化情報取得手段１３１３は、ｋ番目の単語内の２以上の音素の代表特徴量を音素の並び順に取得する。 (Step S312) The change information acquisition means 1313 acquires representative feature quantities of two or more phonemes in the k-th word in the order of phoneme arrangement.

（ステップＳ３１３）変化情報取得手段１３１３は、ステップＳ３１２で取得した２以上の音素の代表特徴量を用いて、ｋ番目の単語の変化情報を取得する。 (Step S313) The change information acquisition means 1313 acquires the change information of the k-th word by using the representative features of two or more phonemes acquired in step S312.

（ステップＳ３１４）特徴量取得手段１３１２は、ステップＳ３１２で取得された２以上の音素の代表特徴量を用いて、ｋ番目の単語の代表特徴量を取得する。ｋ番目の単語の代表特徴量は、通常、２以上の音素の代表特徴量を代表する特徴量である。 (Step S314) The feature amount acquisition means 1312 acquires the representative feature amount of the k-th word by using the representative feature amounts of two or more phonemes acquired in step S312. The representative feature of the k-th word is usually a feature that represents a representative feature of two or more phonemes.

（ステップＳ３１５）変化情報取得手段１３１３は、カウンタｋを１、インクリメントする。ステップＳ３１１に戻る。 (Step S315) The change information acquisition means 1313 increments the counter k by 1. Return to step S311.

（ステップＳ３１６）変化情報取得手段１３１３は、ステップＳ３１４で取得された２以上の単語の代表特徴量を用いて、文章の変化情報を取得する。上位処理にリターンする。なお、文章の変化情報とは、文章である音声情報の入力変化情報である。 (Step S316) The change information acquisition means 1313 acquires the change information of the sentence by using the representative feature amounts of the two or more words acquired in step S314. Control then returns to the calling process. The change information of the sentence is the input change information of the voice information of the sentence.
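The feature aggregation in FIG. 3 (frames to phonemes in steps S302 to S309, then phonemes to words in steps S312 and S314) can be sketched as follows. The sketch makes several assumptions that the flowchart leaves open: the phoneme section information is taken to be a hypothetical `(label, first_frame, end_frame)` tuple with an exclusive end, the representative value is the maximum at both levels (the maximum is the example given for phonemes in specific example 1, and it agrees with the word-level values shown there), and an empty section yields 0.0.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout of the section information: (label, first_frame, end_frame),
# with end_frame exclusive.
Segment = Tuple[str, int, int]


def phoneme_features(frames: Sequence[float],
                     segments: Sequence[Segment]) -> List[float]:
    """Steps S302-S309: one representative value (here the max) per phoneme."""
    reps = []
    for _label, start, end in segments:
        window = frames[start:end]
        reps.append(max(window) if window else 0.0)
    return reps


def word_features(phoneme_reps: Sequence[float],
                  word_sizes: Sequence[int]) -> List[List[float]]:
    """Step S312: group phoneme representatives by word, in phoneme order."""
    grouped, pos = [], 0
    for size in word_sizes:
        grouped.append(list(phoneme_reps[pos:pos + size]))
        pos += size
    return grouped


def word_representatives(grouped: Sequence[Sequence[float]]) -> List[float]:
    """Step S314: one representative value (here the max) per word."""
    return [max(group) if group else 0.0 for group in grouped]
```

The per-word lists returned by `word_features` are the input to the word-level change information (step S313), and the list returned by `word_representatives` is the input to the sentence-level change information (step S316).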

以下、本実施の形態における音声評定装置１の具体的な動作について説明する。 Hereinafter, the specific operation of the voice rating device 1 in the present embodiment will be described.

（具体例１） (Specific example 1)
今、教師変化情報格納部１１１には、図４に示す教師変化情報管理表が格納されている、とする。教師変化情報管理表は、文章「Alice looked up.」の教師音声情報の全体（文章）の教師変化情報と、文章「Alice looked up.」を構成する各単語「Alice」、「looked」、および「up」に対応する教師変化情報とが格納されている。 Now, assume that the teacher change information storage unit 111 stores the teacher change information management table shown in FIG. 4. The teacher change information management table stores the teacher change information for the entire teacher voice information (the sentence) of the sentence "Alice looked up." and the teacher change information corresponding to each of the words "Alice", "looked", and "up" that constitute the sentence.

かかる状況において、ユーザが、音声評定装置１に対して、英語の文章「Alice looked up.」を読み上げた、とする。すると、音声評定装置１の受付部１２は、文章「Alice looked up.」の音声情報である、入力音声情報を受け付ける。 In this situation, assume that the user reads the English sentence "Alice looked up." aloud to the voice rating device 1. The reception unit 12 of the voice rating device 1 then receives the input voice information, which is the voice information of the sentence "Alice looked up.".

次に、分割手段１３１１は、入力音声情報を２以上の単語に分割する。つまり、分割手段１３１１は、入力音声情報を構成する音素ごとの区間情報を、フォースドアライメント等を用いて、取得する。 Next, the dividing means 1311 divides the input voice information into two or more words. That is, the dividing means 1311 acquires the section information for each phoneme constituting the input voice information by using forced alignment or the like.

次に、特徴量取得手段１３１２は、音素ごとに、音素の並び順に、各音素が有する２以上の各フレームの特徴量を取得する。ここでは、特徴量は、例えば、アクセント強度である、とする。そして、特徴量取得手段１３１２は、音素ごとに、２以上のフレームの特徴量から、代表特徴量（例えば、最大値）を取得する。そして、この代表特徴量が、各音素の特徴量である。 Next, the feature amount acquisition means 1312 acquires the feature amount of each of two or more frames possessed by each phoneme in the order in which the phonemes are arranged for each phoneme. Here, it is assumed that the feature amount is, for example, the accent intensity. Then, the feature amount acquisition means 1312 acquires a representative feature amount (for example, the maximum value) from the feature amounts of two or more frames for each phoneme. And this representative feature is the feature of each phoneme.

次に、変化情報取得手段１３１３は、単語ごとに、当該単語内の２以上の各音素の特徴量（代表特徴量）を音素の並び順に取得する。つまり、まず、変化情報取得手段１３１３は、単語「Alice」に対応する音素トランスクリプション「/ae l ax s/」に対して、特徴量（音素アクセント強度）の並び「/55.148270 0.000000 60.396744 0.000000/」を得た、とする。そして、変化情報取得手段１３１３は、音素アクセント強度の並びから、単語「Alice」の入力変化情報「/1 0 2 0/」を得る。ここで、変化情報取得手段１３１３は、第二の変化情報取得方法により、入力変化情報を取得した。 Next, for each word, the change information acquisition means 1313 acquires the feature amounts (representative feature amounts) of the two or more phonemes in the word, in phoneme order. First, suppose that, for the phoneme transcription "/ae l ax s/" corresponding to the word "Alice", the change information acquisition means 1313 obtains the sequence of feature amounts (phoneme accent intensities) "/55.148270 0.000000 60.396744 0.000000/". The change information acquisition means 1313 then obtains the input change information "/1 0 2 0/" of the word "Alice" from this sequence of phoneme accent intensities. Here, the change information acquisition means 1313 acquired the input change information by the second change information acquisition method.

同様に、変化情報取得手段１３１３は、単語「looked」に対応する音素トランスクリプション「l uh k t」に対して、特徴量（音素アクセント強度）の並び「0.000000 53.130833 0.000000 0.000000」を得る。そして、変化情報取得手段１３１３は、音素アクセント強度の並びから単語「looked」の入力変化情報「0 1 0 0」を取得する。 Similarly, the change information acquisition means 1313 obtains a sequence of features (phoneme accent intensities) "0.000000 53.130833 0.000000 0.000000" for the phoneme transcription "l uh k t" corresponding to the word "looked". Then, the change information acquisition means 1313 acquires the input change information "0 1 0 0" of the word "looked" from the arrangement of the phoneme accent intensities.

また、同様に、変化情報取得手段１３１３は、単語「up」に対応する音素トランスクリプション「ah p」に対して、特徴量（音素アクセント強度）の並び「48.609158 0.000000」を得る。そして、変化情報取得手段１３１３は、音素アクセント強度の並びから単語「up」の入力変化情報「1 0」を取得する。 Similarly, the change information acquisition means 1313 obtains the sequence of feature amounts (phoneme accent intensities) "48.609158 0.000000" for the phoneme transcription "ah p" corresponding to the word "up". The change information acquisition means 1313 then acquires the input change information "1 0" of the word "up" from this sequence of phoneme accent intensities.

次に、変化情報取得手段１３１３は、取得された２以上の単語の代表特徴量を用いて、文章の入力変化情報を取得する。つまり、変化情報取得手段１３１３は、単語トランスクリプション「/sil/alice/sil/looked/sil/up/sil/」を構成する各単語の特徴量の並びである単語アクセント強度「/0.000000/60.396744/0.000000/53.130833/0.000000/48.609158/0.000000/」から、第一の変化情報取得方法により、単語アクセントパタン「/0/3/0/2/0/1/0/」を取得する。この単語アクセントパタンは、入力変化情報の一例である。 Next, the change information acquisition means 1313 acquires the input change information of the sentence by using the representative feature amounts of the two or more acquired words. That is, from the word accent intensities "/0.000000/60.396744/0.000000/53.130833/0.000000/48.609158/0.000000/", which are the sequence of feature amounts of the words constituting the word transcription "/sil/alice/sil/looked/sil/up/sil/", the change information acquisition means 1313 acquires the word accent pattern "/0/3/0/2/0/1/0/" by the first change information acquisition method. This word accent pattern is an example of input change information.
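The conversion from accent intensities to an accent pattern in this example can be reproduced with a short sketch. It is one reading that is consistent with the numbers shown above, not the definition of the first or second change information acquisition method, which this passage does not spell out. The assumptions are that non-zero intensities are ranked in ascending order (the smallest gets 1), that zero-intensity positions such as "sil" stay 0, and that equal intensities share a rank. The function name `to_rank_pattern` is hypothetical.

```python
from typing import List, Sequence


def to_rank_pattern(intensities: Sequence[float]) -> List[int]:
    """Rank positive intensities in ascending order (smallest -> 1).

    Assumption: positions with zero intensity are not ranked and stay 0,
    and equal intensities receive the same rank.
    """
    distinct = sorted({value for value in intensities if value > 0.0})
    rank_of = {value: rank for rank, value in enumerate(distinct, start=1)}
    return [rank_of.get(value, 0) for value in intensities]


# Phoneme accent intensities of "Alice" from the example.
alice = to_rank_pattern([55.148270, 0.000000, 60.396744, 0.000000])

# Word accent intensities of "/sil/alice/sil/looked/sil/up/sil/".
sentence = to_rank_pattern(
    [0.000000, 60.396744, 0.000000, 53.130833, 0.000000, 48.609158, 0.000000]
)
```

With these inputs, `alice` is `[1, 0, 2, 0]` and `sentence` is `[0, 3, 0, 2, 0, 1, 0]`, matching the patterns "/1 0 2 0/" and "/0/3/0/2/0/1/0/" in the example.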

次に、評定部１３２は、受け付けられた入力音声情報（「Alice looked up.」に対応する音声情報）に対応する変化情報であり、文章の教師変化情報「/0/3/0/2/0/1/0/」を教師変化情報管理表（図４）から取得する。 Next, the rating unit 132 acquires, from the teacher change information management table (FIG. 4), the teacher change information "/0/3/0/2/0/1/0/" of the sentence, which is the change information corresponding to the received input voice information (the voice information corresponding to "Alice looked up.").

次に、評定部１３２は、取得された文章「Alice looked up.」の入力変化情報「/0/3/0/2/0/1/0/」と、取得した文章の教師変化情報「/0/3/0/2/0/1/0/」との類似度に関する情報であるスコアを、スピアマンの順位相関係数を用いて取得する。ここで、入力変化情報「/0/3/0/2/0/1/0/」と教師変化情報「/0/3/0/2/0/1/0/」とは同じであるので、評定部１３２は、スコア「１」を取得する。次に、評定部１３２は、取得したスコア「１」を１００倍し、出力する点数「１００」を算出する。 Next, the rating unit 132 uses Spearman's rank correlation coefficient to acquire a score, which is information on the degree of similarity between the acquired input change information "/0/3/0/2/0/1/0/" of the sentence "Alice looked up." and the acquired teacher change information "/0/3/0/2/0/1/0/" of the sentence. Since the input change information "/0/3/0/2/0/1/0/" and the teacher change information "/0/3/0/2/0/1/0/" are identical, the rating unit 132 acquires the score "1". The rating unit 132 then multiplies the acquired score "1" by 100 to calculate the points "100" to be output.
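The scoring step can be sketched in plain Python as follows. Because accent patterns such as "/0/3/0/2/0/1/0/" contain repeated values, the sketch computes Spearman's coefficient as the Pearson correlation of average ranks, which is the usual way to handle ties. The function names are hypothetical, and two choices are assumptions the text does not settle: an error is raised when a pattern is constant (the coefficient is undefined there), and negative coefficients are passed through to negative points unchanged.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("patterns must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant pattern")
    return sxy / math.sqrt(sxx * syy)


def to_points(rho: float) -> int:
    """Scale the coefficient by 100, as in the example (1 -> 100)."""
    return round(rho * 100)


teacher = [0, 3, 0, 2, 0, 1, 0]
observed = [0, 3, 0, 2, 0, 1, 0]
points = to_points(spearman(observed, teacher))
```

For the identical patterns of the example, the coefficient is 1 and `points` is 100.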

次に、出力部１４は、評定部１３２が取得した点数「１００」を出力する。かかる出力例は、図５である。図５において、点数は評定スコア５０１として表示されている。 Next, the output unit 14 outputs the score "100" acquired by the rating unit 132. An example of such an output is shown in FIG. In FIG. 5, the score is displayed as a rating score 501.

次に、評定部１３２は、各単語の評定を行う。つまり、評定部１３２は、１番目の単語「Alice」の入力変化情報である音素アクセント強度の並び「/1 0 2 0/」を取得する。次に、評定部１３２は、「Alice」と対になる教師変化情報「/1 0 2 0/」を教師変化情報管理表（図４）から取得する。そして、評定部１３２は、１番目の単語の入力変化情報「/1 0 2 0/」と、取得した１番目の単語の教師変化情報「/1 0 2 0/」とを用いて、スコア「１」を取得する。そして、評定部１３２は、スコア「１」を１００倍し、単語「Alice」の評定スコア「１００」を得る。そして、出力部１４は、単語「Alice」の評定スコア「１００」を出力する。 Next, the rating unit 132 rates each word. That is, the rating unit 132 acquires the phoneme accent sequence "/1 0 2 0/", which is the input change information of the first word "Alice". Next, the rating unit 132 acquires the teacher change information "/1 0 2 0/" paired with "Alice" from the teacher change information management table (FIG. 4). The rating unit 132 then acquires the score "1" by using the input change information "/1 0 2 0/" of the first word and the acquired teacher change information "/1 0 2 0/" of the first word. The rating unit 132 multiplies the score "1" by 100 to obtain the rating score "100" for the word "Alice". The output unit 14 then outputs the rating score "100" of the word "Alice".

以上の処理を、単語「looked」「up」に対しても行い、単語「looked」「up」の評定スコア「１００」も出力される。なお、評定スコアの出力態様は問わない。 The above processing is also performed for the words "looked" and "up", and the rating score "100" for the words "looked" and "up" is also output. The output mode of the rating score does not matter.

以上、本実施の形態によれば、発音された入力音声の流れを考慮した音声の評定ができるため、入力音声の適切な評定ができる。 As described above, according to the present embodiment, the voice can be evaluated in consideration of the flow of the sounded input voice, so that the input voice can be evaluated appropriately.

なお、本実施の形態によれば、教師変化情報は予め用意されていた。しかし、教師変化情報も、教師音声情報から動的に生成されても良い。かかる生成には、例えば、実施の形態３で説明する生産装置３が用いられる。また、かかる場合の処理の具体例は、以下の１）から１６）である。 In the present embodiment, the teacher change information was prepared in advance. However, the teacher change information may also be generated dynamically from the teacher voice information. For such generation, for example, the production apparatus 3 described in the third embodiment is used. Specific examples of the processing in such a case are the following 1) to 16).
１）教師音声情報の音素ごとの区間情報をフォースドアライメントを用いて算出する。 1) The section information for each phoneme of the teacher voice information is calculated using forced alignment.
２）教師音声情報のフレームごとのアクセント強度を教師音声情報から算出する。 2) The accent intensity for each frame of the teacher voice information is calculated from the teacher voice information.
３）教師音声情報の音素ごとのアクセント強度を音素ごとの区間情報とフレームごとのアクセント強度から算出する。 3) The accent intensity for each phoneme of the teacher voice information is calculated from the section information for each phoneme and the accent intensity for each frame.
４）教師音声情報の単語ごとのアクセント強度を単語内の音素ごとのアクセント強度から算出する。 4) The accent intensity for each word of the teacher voice information is calculated from the accent intensity for each phoneme in the word.
５）教師音声情報の単語アクセント順位を教師音声情報の単語アクセント強度から算出する。 5) The word accent rank of the teacher voice information is calculated from the word accent intensity of the teacher voice information.
６）単語ごとに教師音声情報の音素アクセント順位を、教師音声情報の音素アクセント強度から算出する。 6) For each word, the phoneme accent rank of the teacher voice information is calculated from the phoneme accent intensity of the teacher voice information.
７）入力音声情報の音素ごとの区間情報を、フォースドアライメントを用いて算出する。 7) The section information for each phoneme of the input voice information is calculated using forced alignment.
８）入力音声情報のフレームごとのアクセント強度を、入力音声情報から算出する。 8) The accent intensity for each frame of the input voice information is calculated from the input voice information.
９）入力音声情報の音素ごとのアクセント強度を、音素ごとの区間情報とフレームごとのアクセント強度から算出する。 9) The accent intensity for each phoneme of the input voice information is calculated from the section information for each phoneme and the accent intensity for each frame.
１０）入力音声情報の単語ごとのアクセント強度を、単語内の音素ごとのアクセント強度から算出する。 10) The accent intensity for each word of the input voice information is calculated from the accent intensity for each phoneme in the word.
１１）入力音声情報の単語アクセント順位を入力音声情報の単語アクセント強度から算出する。 11) The word accent rank of the input voice information is calculated from the word accent intensity of the input voice information.
１２）単語ごとに入力音声情報の音素アクセント順位を、入力音声情報の音素アクセント強度から算出する。 12) For each word, the phoneme accent rank of the input voice information is calculated from the phoneme accent intensity of the input voice information.
１３）文章アクセントの順位相関係数を、教師音声情報の単語アクセント順位と入力音声の単語アクセント順位から算出する。 13) The rank correlation coefficient of the sentence accent is calculated from the word accent rank of the teacher voice information and the word accent rank of the input voice.
１４）単語ごとに単語アクセントの順位相関係数を、教師音声情報の音素アクセント順位と入力音声情報の音素アクセント順位から算出する。 14) For each word, the rank correlation coefficient of the word accent is calculated from the phoneme accent rank of the teacher voice information and the phoneme accent rank of the input voice information.
１５）文章のアクセント評定スコアを、文章アクセントの順位相関係数から求める。 15) The accent rating score of the sentence is obtained from the rank correlation coefficient of the sentence accent.
１６）単語ごとに単語のアクセント評定スコアを、単語アクセントの順位相関係数から求める。 16) For each word, the accent rating score of the word is obtained from the rank correlation coefficient of the word accent.

また、本実施の形態において、特徴量をアクセント強度とリズム量の両方を用いて、アクセント評定とリズム評定とを行っても良い。そして、アクセント評定のスコアとリズム評定のスコアとの両方を用いて算出した代表スコアを算出し、出力しても良い。 Further, in the present embodiment, the accent rating and the rhythm rating may be performed by using both the accent intensity and the rhythm amount as the feature amount. Then, the representative score calculated by using both the accent rating score and the rhythm rating score may be calculated and output.

さらに、本実施の形態における処理は、ソフトウェアで実現しても良い。そして、このソフトウェアをソフトウェアダウンロード等により配布しても良い。また、このソフトウェアをＣＤ−ＲＯＭなどの記録媒体に記録して流布しても良い。なお、このことは、本明細書における他の実施の形態においても該当する。なお、本実施の形態における情報処理装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータがアクセス可能な記録媒体は、教師となる音声情報である教師音声情報を構成する２以上の各部分音声情報の特徴量の変化に関する教師変化情報が格納される教師変化情報格納部を具備し、コンピュータを、２以上の部分音声を有する音声情報である入力音声情報を受け付ける受付部と、前記入力音声情報が有する２以上の各部分音声情報の特徴量の変化に関する入力変化情報を取得する取得部と、前記入力変化情報と前記教師変化情報とを用いて、前記入力音声情報の評定を行い、スコアを取得する評定部と、前記スコアを出力する出力部として機能させるためのプログラム、である。 Furthermore, the processing in the present embodiment may be realized by software. This software may be distributed by software download or the like, or it may be recorded on a recording medium such as a CD-ROM and distributed. This also applies to the other embodiments in this specification. The software that realizes the information processing device in the present embodiment is the following program. That is, a recording medium accessible to the computer comprises a teacher change information storage unit that stores teacher change information regarding changes in the feature amounts of the two or more pieces of partial voice information constituting teacher voice information, which is voice information serving as the teacher, and the program causes the computer to function as: a reception unit that receives input voice information, which is voice information having two or more partial voices; an acquisition unit that acquires input change information regarding changes in the feature amounts of the two or more pieces of partial voice information in the input voice information; a rating unit that rates the input voice information by using the input change information and the teacher change information and acquires a score; and an output unit that outputs the score.

また、上記プログラムにおいて、前記教師変化情報および前記入力変化情報は、前記部分音声情報の特徴量の大きさの順位に関する情報であることは好適である。 Further, in the above program, it is preferable that the teacher change information and the input change information are information relating to the order of the magnitude of the feature amount of the partial voice information.

また、上記プログラムにおいて、前記取得部は、前記入力音声情報が有する２以上の各部分音声情報の特徴量を取得し、前記入力音声情報が有する２以上の部分情報のうち、評定対象外の部分音声情報を除いた、２以上の評定対象部分音声情報の２以上の特徴量の大きさの順位を取得し、当該２以上の特徴量の大きさの順位を有する入力変化情報を取得するものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, it is preferable that the program causes the computer to function such that the acquisition unit acquires the feature amount of each of the two or more pieces of partial voice information in the input voice information, acquires the ranks by magnitude of the two or more feature amounts of the two or more pieces of evaluation target partial voice information, that is, the partial voice information that remains after the partial voice information not subject to evaluation is excluded, and acquires input change information having those ranks.

また、上記プログラムにおいて、前記教師変化情報が有する情報であり、２以上の評定対象部分音声情報の特徴量の大きさの順位に関する情報が同一の情報である場合、前記取得部は、前記入力変化情報が有する情報であり、前記同一の情報に対応する位置の２つの評定対象部分音声情報の特徴量の大きさの順位が隣り合っているか否かを判断し、隣り合っていると判断した場合は、前記２つの評定対象部分音声情報の特徴量の大きさを同一の大きさと見なして、入力変化情報を取得するものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, it is preferable that the program causes the computer to function such that, when the teacher change information assigns the same rank by feature amount magnitude to two or more pieces of evaluation target partial voice information, the acquisition unit determines whether the ranks, in the input change information, of the feature amounts of the two pieces of evaluation target partial voice information at the positions corresponding to that same rank are adjacent and, if it determines that they are adjacent, regards the feature amounts of those two pieces of evaluation target partial voice information as having the same magnitude when acquiring the input change information.
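The tie handling described here can be illustrated with a small sketch. It is a guess at one workable realization, not the method of the specification: the text says only that the two feature amounts are regarded as equal, so assigning both positions the lower of the two ranks is an assumption, as are the convention that rank 0 marks a position not subject to evaluation and the hypothetical name `merge_adjacent_ranks`. The sketch treats positions pairwise and does not attempt to resolve ties among three or more positions consistently.

```python
from typing import List, Sequence


def merge_adjacent_ranks(teacher: Sequence[int], observed: Sequence[int]) -> List[int]:
    """Treat two input ranks as equal when the teacher ranks them equally
    and the input ranks differ by exactly one.

    Assumptions: rank 0 marks a position that is not evaluated, and a merged
    pair receives the lower of its two input ranks.
    """
    if len(teacher) != len(observed):
        raise ValueError("patterns must have the same length")
    merged = list(observed)
    for i in range(len(teacher)):
        if teacher[i] == 0:
            continue  # not an evaluation target
        for j in range(i + 1, len(teacher)):
            same_in_teacher = teacher[i] == teacher[j]
            adjacent_in_input = abs(observed[i] - observed[j]) == 1
            if same_in_teacher and adjacent_in_input:
                merged[i] = merged[j] = min(observed[i], observed[j])
    return merged
```

For example, with teacher ranks `[1, 2, 2]` and input ranks `[1, 3, 2]`, the last two positions are tied in the teacher and adjacent in the input, so the sketch returns `[1, 2, 2]`. The learner is then not penalized for ordering two segments that the teacher pronounces with equal strength.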

また、上記プログラムにおいて、前記取得部は、前記入力音声情報が有する２以上の各部分音声情報の特徴量を取得し、前記入力音声情報が有する２以上の部分情報のうち、評定対象外の部分音声情報を除いた、２以上の評定対象部分音声情報の２以上の特徴量に対して、最も大きい特徴量に対応する評定対象部分音声情報と他の評定対象部分音声情報とを区別する情報である入力変化情報を取得するものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, it is preferable that the program causes the computer to function such that the acquisition unit acquires the feature amount of each of the two or more pieces of partial voice information in the input voice information and, for the two or more feature amounts of the two or more pieces of evaluation target partial voice information that remain after the partial voice information not subject to evaluation is excluded, acquires input change information that distinguishes the evaluation target partial voice information corresponding to the largest feature amount from the other evaluation target partial voice information.
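A minimal sketch of this variant marks only the position of the largest feature amount. The encoding (1 for the peak, 0 elsewhere), the `excluded` index set for positions not subject to evaluation, and the name `peak_pattern` are assumptions made for illustration. If several positions share the maximum, the sketch marks the first one.

```python
from typing import AbstractSet, List, Sequence


def peak_pattern(intensities: Sequence[float],
                 excluded: AbstractSet[int] = frozenset()) -> List[int]:
    """Mark the evaluated position with the largest feature amount as 1.

    Positions listed in `excluded` (hypothetical: indices not subject to
    evaluation, such as silence) are never marked.
    """
    candidates = [i for i in range(len(intensities)) if i not in excluded]
    if not candidates:
        return [0] * len(intensities)
    peak = max(candidates, key=lambda i: intensities[i])
    return [1 if i == peak else 0 for i in range(len(intensities))]
```

Applied to the phoneme accent intensities of "Alice" from specific example 1, `peak_pattern([55.148270, 0.0, 60.396744, 0.0])` yields `[0, 0, 1, 0]`.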

また、上記プログラムにおいて、前記順位に関する情報は、前記教師音声情報または前記入力音声情報の２以上の各部分音声情報の特徴量の大きさの順位に関する並びの情報であるアクセントパタンであるものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, the information regarding the ranking is assumed to be an accent pattern which is information regarding the ranking of the magnitude of the feature amount of each of two or more partial voice information of the teacher voice information or the input voice information. It is preferable that the program operates the computer.

また、上記プログラムにおいて、前記入力音声情報は、文章の音声情報であり、前記部分音声情報は、文章を構成する単語の音声情報であるものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, it is preferable that the program causes the computer to function with the input voice information being the voice information of a sentence and the partial voice information being the voice information of the words constituting the sentence.

また、上記プログラムにおいて、前記入力音声情報は、単語の音声情報であり、前記部分音声情報は、単語を構成する音素の音声情報であるものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, it is preferable that the program causes the computer to function with the input voice information being the voice information of a word and the partial voice information being the voice information of the phonemes constituting the word.

また、上記プログラムにおいて、前記部分音声情報の特徴量は、アクセントの強度に関する情報であるアクセント強度であるものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, it is preferable that the feature amount of the partial voice information is a program that causes the computer to function, assuming that the feature amount is the accent intensity, which is information on the intensity of the accent.

また、上記プログラムにおいて、前記部分音声情報の特徴量は、音声情報の長さに関する情報であるリズム量であるものとして、コンピュータを機能させるプログラムであることは好適である。 Further, in the above program, it is preferable that the feature amount of the partial voice information is a rhythm amount which is information on the length of the voice information, and the program functions the computer.

（実施の形態２） (Embodiment 2)
本実施の形態において、実施の形態１で行った評定に加えて、入力音声情報に対して発音評定を行い、実施の形態１で行った評定結果と発音評定結果とを用いて、最終的なスコアを算出する音声評定装置について説明する。なお、実施の形態１で行った評定は、例えば、アクセント評定、リズム評定である。つまり、本実施の形態において、アクセント評定、リズム評定、発音評定のうちの２以上の評定を行う音声評定装置について説明する。 In the present embodiment, a voice rating device is described that, in addition to the rating performed in the first embodiment, performs a pronunciation rating on the input voice information and calculates a final score by using the rating result of the first embodiment and the pronunciation rating result. The rating performed in the first embodiment is, for example, an accent rating or a rhythm rating. That is, the present embodiment describes a voice rating device that performs two or more of the accent rating, the rhythm rating, and the pronunciation rating.

図６は、本実施の形態における音声評定装置２のブロック図である。 FIG. 6 is a block diagram of the voice rating device 2 according to the present embodiment.

音声評定装置２は、格納部１１、受付部１２、処理部２３、出力部２４を備える。 The voice evaluation device 2 includes a storage unit 11, a reception unit 12, a processing unit 23, and an output unit 24.

処理部２３は、取得部１３１、評定部１３２、第二評定部２３１、算出部２３２を備える。 The processing unit 23 includes an acquisition unit 131, a rating unit 132, a second rating unit 231 and a calculation unit 232.

処理部２３は、各種の処理を行う。各種の処理とは、例えば、取得部１３１、評定部１３２、第二評定部２３１、算出部２３２等が行う処理である。 The processing unit 23 performs various processes. The various processes are, for example, processes performed by the acquisition unit 131, the rating unit 132, the second rating unit 231 and the calculation unit 232.

第二評定部２３１は、入力音声情報に対する発音の評定を行い、第二スコアを取得する。第二スコアは、発音評定のスコアである。第二評定部２３１は、例えば、特許第４８５９１２５号、特許第４９６２９３０号、特許第５００７４０１号等に記載されている発音評定装置等が行う発音評定と同様の処理を行い、発音の良し悪しの評価を示す第二スコアを得る。なお、格納部１１には、教師音声情報が格納されている、とする。また、格納部１１には、通常、１以上の音素毎の音響モデルである教師データを１以上格納されている。さらに、第二評定部２３１が入力音声情報の発音の良し悪しを評価し、第二スコアを取得するアルゴリズムは問わない。なお、第二スコアを得る発音評定のアルゴリズムは公知技術であるので、詳細な説明を省略する。 The second rating unit 231 rates the pronunciation of the input voice information and acquires a second score. The second score is the pronunciation rating score. The second rating unit 231 performs, for example, the same processing as the pronunciation rating performed by the pronunciation rating devices described in Japanese Patent No. 4859125, Japanese Patent No. 4962930, Japanese Patent No. 5007401, and the like, and obtains a second score indicating how good the pronunciation is. It is assumed that teacher voice information is stored in the storage unit 11. The storage unit 11 also usually stores one or more pieces of teacher data, each of which is an acoustic model for one or more phonemes. The algorithm by which the second rating unit 231 evaluates the quality of the pronunciation of the input voice information and acquires the second score is not limited. Since pronunciation rating algorithms for obtaining the second score are known techniques, a detailed description is omitted.

The calculation unit 232 calculates a representative score using the score acquired by the rating unit 132 and the second score acquired by the second rating unit 231. The representative score is, for example, the average of the score and the second score, or a weighted average of the two. Here, the score acquired by the rating unit 132 is, for example, the score of a sentence. However, it may instead be the score of a sentence together with the scores of one or more words, or only the scores of one or more words.
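The averaging performed by the calculation unit 232 can be sketched as follows. This is a minimal illustration only: the function name `representative_score` and the `weight` parameter are hypothetical, and the embodiment does not specify how the weights are chosen.

```python
def representative_score(score: float, second_score: float, weight: float = 0.5) -> float:
    """Combine the rating-unit score and the pronunciation (second) score.

    `weight` is the share given to `score`; the rest goes to `second_score`.
    The parameter and its default are assumptions made for this sketch.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    # weight = 0.5 gives the plain average; other values give a weighted average.
    return weight * score + (1.0 - weight) * second_score
```

With `weight=0.5` the result is the plain average of the two scores; any other value in the range gives a weighted average that favors one of them.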

出力部２４は、算出部２３２が算出した代表スコアを出力する。出力部２４は、スコアまたは第二スコアをも出力しても良い。 The output unit 24 outputs the representative score calculated by the calculation unit 232. The output unit 24 may also output a score or a second score.

処理部２３、第二評定部２３１、算出部２３２は、通常、ＭＰＵやメモリ等から実現され得る。処理部２３等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The processing unit 23, the second rating unit 231 and the calculation unit 232 can usually be realized from an MPU, a memory, or the like. The processing procedure of the processing unit 23 and the like is usually realized by software, and the software is recorded on a recording medium such as ROM. However, it may be realized by hardware (dedicated circuit).

The output unit 24 may or may not be considered to include an output device such as a display or a speaker. The output unit 24 can be realized by driver software for an output device, or by driver software for an output device together with the output device.

Next, the operation of the voice rating device 2 will be described with reference to the flowchart of FIG. 7. In the flowchart of FIG. 7, description of the steps that are the same as in the flowchart of FIG. 2 is omitted.

（ステップＳ７０１）第二評定部２３１は、ステップＳ２０１で受け付けられた入力音声情報に対する発音の評定を行い、第二スコアを取得する。 (Step S701) The second rating unit 231 evaluates the pronunciation of the input voice information received in step S201 and acquires the second score.

(Step S702) The calculation unit 232 calculates a representative score using the score acquired by the rating unit 132 and the second score acquired in step S701.

(Step S703) The output unit 24 outputs the representative score calculated by the calculation unit 232 in step S702. The process returns to step S201.

なお、図７のフローチャートにおいて、代表スコアのみが出力されても良い。つまり、実施の形態１で算出されたスコアは出力されなくても良い。 In the flowchart of FIG. 7, only the representative score may be output. That is, the score calculated in the first embodiment does not have to be output.

また、図７のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。 Further, in the flowchart of FIG. 7, the processing is terminated by an interrupt of power off or processing termination.

以上、本実施の形態によれば、発音された入力音声の多角的な評定ができるため、入力音声の適切な評定ができる。具体的には、本実施の形態によれば、発音された入力音声に対して、例えば、アクセントの評価および発音の評価ができる。 As described above, according to the present embodiment, since the pronounced input voice can be evaluated from various angles, the input voice can be appropriately evaluated. Specifically, according to the present embodiment, for example, accent evaluation and pronunciation evaluation can be performed on the pronounced input voice.

The software that realizes the information processing device in the present embodiment is the following program. That is, a recording medium accessible to a computer includes a teacher change information storage unit that stores teacher change information regarding changes in the feature amounts of two or more pieces of partial voice information constituting teacher voice information, which is voice information serving as a teacher, and the program causes the computer to function as: a reception unit that receives input voice information, which is voice information having two or more partial voices; an acquisition unit that acquires input change information regarding changes in the feature amounts of the two or more pieces of partial voice information in the input voice information; a rating unit that rates the input voice information using the input change information and the teacher change information and acquires a score; and an output unit that outputs the score.

In the above program, it is preferable that the program further causes the computer to function as a second rating unit that rates the pronunciation of the input voice information and acquires a second score, and as a calculation unit that calculates a representative score using the score acquired by the rating unit and the second score acquired by the second rating unit, and that the output unit outputs the representative score.

(Embodiment 3)
In the present embodiment, a production apparatus that automatically generates the teacher change information stored in the teacher change information storage unit 111 will be described.

図８は、本実施の形態における生産装置３のブロック図である。 FIG. 8 is a block diagram of the production apparatus 3 according to the present embodiment.

The production apparatus 3 includes a teacher change information storage unit 111, a reception unit 12, an acquisition unit 131, and a storage unit 31. The acquisition unit 131 includes a dividing means 1311, a feature amount acquisition means 1312, and a change information acquisition means 1313.

なお、ここで受付部１２が受け付ける音声情報は、教師音声情報である。また、ここでの取得部１３１の処理対象は、受付部１２が受け付けた教師音声情報である。 The voice information received by the reception unit 12 here is teacher voice information. Further, the processing target of the acquisition unit 131 here is the teacher voice information received by the reception unit 12.

分割手段１３１１は、受付部１２が受け付けた教師音声情報を２以上の部分音声情報に分割する。 The dividing means 1311 divides the teacher voice information received by the reception unit 12 into two or more partial voice information.

特徴量取得手段１３１２は、２以上の部分音声情報が有する２以上の各評定対象部分音声情報から２以上の特徴量を取得する。 The feature amount acquisition means 1312 acquires two or more feature amounts from each of the two or more evaluation target partial voice information possessed by the two or more partial voice information.

変化情報取得手段１３１３は、２以上の特徴量を用いて、教師変化情報を取得する。 The change information acquisition means 1313 acquires teacher change information using two or more features.

蓄積部３１は、教師変化情報を記録媒体に蓄積する。ここでの記録媒体は、通常、教師変化情報格納部１１１である。蓄積部３１は、例えば、教師音声情報に対応付けて、教師変化情報を記録媒体に蓄積しても良い。また、蓄積部３１は、例えば、教師音声情報から取得した単語トランスクリプションに対応付けて、教師変化情報を記録媒体に蓄積しても良い。また、蓄積部３１は、教師音声情報を構成する単語の音素トランスクリプションに対応付けて、単語の教師変化情報を記録媒体に蓄積しても良い。 The storage unit 31 stores the teacher change information in the recording medium. The recording medium here is usually the teacher change information storage unit 111. The storage unit 31 may store the teacher change information in the recording medium in association with the teacher voice information, for example. Further, the storage unit 31 may store the teacher change information in the recording medium in association with the word transcription acquired from the teacher voice information, for example. Further, the storage unit 31 may store the teacher change information of the word in the recording medium in association with the phoneme transcription of the word constituting the teacher voice information.
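The pairing of a transcription with its teacher change information can be pictured with the sketch below. It is only an illustration under stated assumptions: the class name `TeacherChangeStore`, its methods, and the use of an in-memory dictionary in place of the recording medium are all hypothetical and are not taken from the embodiment.

```python
from typing import Dict, List, Optional


class TeacherChangeStore:
    """In-memory stand-in for the recording medium (an assumption of this sketch)."""

    def __init__(self) -> None:
        self._table: Dict[str, List[int]] = {}

    def accumulate(self, transcription: str, change_info: List[int]) -> None:
        # Store the teacher change information keyed by its transcription.
        self._table[transcription] = list(change_info)

    def lookup(self, transcription: str) -> Optional[List[int]]:
        # Return the stored teacher change information, or None if absent.
        return self._table.get(transcription)


store = TeacherChangeStore()
store.accumulate("Alice", [1, 0, 2, 0])
```

Keying by transcription lets a rating step look up the teacher change information for the word or sentence being rated; a real implementation would persist the table rather than keep it in memory.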

蓄積部３１は、通常、ＭＰＵやメモリ等から実現され得る。蓄積部３１の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The storage unit 31 can usually be realized from an MPU, a memory, or the like. The processing procedure of the storage unit 31 is usually realized by software, and the software is recorded on a recording medium such as ROM. However, it may be realized by hardware (dedicated circuit).

Next, the operation of the production apparatus 3 will be described with reference to the flowchart of FIG. 9. In the flowchart of FIG. 9, description of the steps that are the same as in the flowchart of FIG. 3 is omitted.

（ステップＳ９０１）受付部１２は、教師音声情報を受け付けたか否かを判断する。教師音声情報を受け付けた場合はステップＳ３０１に行き、教師音声情報を受け付けない場合はステップＳ９０１に戻る。 (Step S901) The reception unit 12 determines whether or not the teacher voice information has been received. If the teacher voice information is received, the process proceeds to step S301, and if the teacher voice information is not received, the process returns to step S901.

（ステップＳ９０２）蓄積部３１は、ステップＳ３１６で取得された文章の変化情報を記録媒体に蓄積する。ここでの変化情報は、文章の教師変化情報である。 (Step S902) The storage unit 31 stores the change information of the text acquired in step S316 in the recording medium. The change information here is the teacher change information of the text.

（ステップＳ９０３）蓄積部３１は、ステップＳ３１３で取得された１以上の各単語の変化情報を記録媒体に蓄積する。処理を終了する。なお、ここでの変化情報は、単語の教師変化情報である。 (Step S903) The storage unit 31 stores the change information of one or more words acquired in step S313 in the recording medium. End the process. The change information here is word teacher change information.

Hereinafter, a specific operation of the production apparatus 3 in the present embodiment will be described. A specific operation example of the production apparatus 3 consists of the following operations 1) to 6).
1) The section information for each phoneme of the teacher voice information is calculated using forced alignment.
2) The accent intensity for each frame of the teacher voice information is calculated from the teacher voice data.
3) The accent intensity for each phoneme of the teacher voice information is calculated from the section information for each phoneme and the accent intensity for each frame.
4) The accent intensity for each word of the teacher voice information is calculated from the accent intensities of the phonemes in the word.
5) A word accent pattern of the teacher voice information is generated from the accent intensity of each word.
6) A phoneme accent pattern of the teacher voice information is generated from the accent intensity of each phoneme.
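Operations 3) and 4) can be sketched as follows. This is a hedged illustration, not the embodiment's implementation: the function names are hypothetical, phoneme sections are assumed to be non-empty half-open frame ranges `(start, end)` produced by forced alignment, and the maximum is used as the representative value at both levels (the description names the maximum only as one example).

```python
from typing import List, Tuple


def phoneme_intensities(frame_intensity: List[float],
                        sections: List[Tuple[int, int]]) -> List[float]:
    """Operation 3): one representative accent intensity per phoneme.

    Each section is a (start_frame, end_frame) pair, end exclusive,
    and is assumed to contain at least one frame.
    """
    return [max(frame_intensity[start:end]) for start, end in sections]


def word_intensity(phoneme_values: List[float]) -> float:
    """Operation 4): accent intensity of a word from its phonemes (maximum assumed)."""
    return max(phoneme_values)


frames = [10.0, 50.0, 20.0, 0.0, 0.0, 30.0, 65.0]
sections = [(0, 3), (3, 5), (5, 7)]
per_phoneme = phoneme_intensities(frames, sections)
per_word = word_intensity(per_phoneme)
```

The frame values and sections above are made-up inputs for illustration. Other representative values, such as the mean, could be substituted without changing the overall structure.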

上記の動作のさらなる具体例を、以下に説明する。今、模範的な発音をする教師が、文章「Alice looked up.」を読み上げた、とする。そして、生産装置３の受付部１２は、文章「Alice looked up.」の音声データである教師音声情報を受け付ける。 Further specific examples of the above operation will be described below. Now suppose that a teacher with a model pronunciation reads out the sentence "Alice looked up." Then, the reception unit 12 of the production device 3 receives the teacher voice information which is the voice data of the sentence “Alice looked up.”.

次に、分割手段１３１１は、教師音声情報を２以上の単語に分割する。つまり、分割手段１３１１は、教師音声情報を構成する音素ごとの区間情報を、フォースドアライメント等を用いて、取得する。 Next, the dividing means 1311 divides the teacher voice information into two or more words. That is, the dividing means 1311 acquires the section information for each phoneme constituting the teacher voice information by using forced alignment or the like.

次に、特徴量取得手段１３１２は、音素ごとに、音素の並び順に、各音素が有する２以上の各フレームの特徴量を取得する。ここでは、特徴量は、アクセント強度である、とする。そして、特徴量取得手段１３１２は、音素ごとに、２以上のフレームの特徴量から、代表特徴量（例えば、最大値）を取得する。 Next, the feature amount acquisition means 1312 acquires the feature amount of each of two or more frames possessed by each phoneme in the order in which the phonemes are arranged for each phoneme. Here, it is assumed that the feature amount is the accent intensity. Then, the feature amount acquisition means 1312 acquires a representative feature amount (for example, the maximum value) from the feature amounts of two or more frames for each phoneme.

Next, the change information acquisition means 1313 acquires, for each word, the feature amounts (representative feature amounts) of the two or more phonemes in the word in phoneme order. That is, suppose that the change information acquisition means 1313 first obtains the sequence of feature amounts (phoneme accent intensities) "/50.041230 0.000000 65.123454 0.000000/" for the phoneme transcription "/ae l ax s/" corresponding to the word "Alice". The change information acquisition means 1313 then obtains the teacher change information "/1 0 2 0/" for the word "Alice" from this sequence of phoneme accent intensities. Here, the change information acquisition means 1313 has acquired the teacher change information by the second change information acquisition method.

同様に、変化情報取得手段１３１３は、単語「looked」および単語「up」に対応する音素トランスクリプションに対して、特徴量（音素アクセント強度）の並びを得る。そして、変化情報取得手段１３１３は、第二の変化情報取得方法により、単語「looked」および単語「up」の教師変化情報を取得する。 Similarly, the change information acquisition means 1313 obtains a sequence of features (phoneme accent intensity) for the phoneme transcription corresponding to the word “looked” and the word “up”. Then, the change information acquisition means 1313 acquires the teacher change information of the word "looked" and the word "up" by the second change information acquisition method.

Next, the change information acquisition means 1313 acquires the input change information of the sentence using the representative feature amounts of the two or more acquired words. That is, from the word accent intensities "/0.000000/65.123454/0.000000/54.012354/0.000000/45.987661/0.000000/", which are the sequence of feature amounts of the words constituting the word transcription "/sil/alice/sil/looked/sil/up/sil/", the change information acquisition means 1313 acquires the word accent pattern "/0/3/0/2/0/1/0/" by the first change information acquisition method. This word accent pattern is an example of teacher change information.
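The numbers in this sentence example are consistent with an encoding in which silence (zero intensity) maps to 0 and the remaining words are numbered in ascending order of accent intensity. The sketch below assumes that reading. The function name `rank_pattern` is hypothetical, and because the first change information acquisition method is defined elsewhere in this description, this should be taken as an illustration rather than a definitive implementation.

```python
from typing import List


def rank_pattern(intensities: List[float]) -> List[int]:
    """Map accent intensities to an accent pattern.

    Zero intensity (silence) maps to 0. Non-zero values are ranked in
    ascending order starting from 1, so the largest value gets the largest
    number. Equal values share a rank (an assumption of this sketch).
    """
    distinct = sorted({v for v in intensities if v > 0})
    rank = {v: i + 1 for i, v in enumerate(distinct)}
    return [rank.get(v, 0) for v in intensities]


word_intensities = [0.000000, 65.123454, 0.000000, 54.012354,
                    0.000000, 45.987661, 0.000000]
pattern = rank_pattern(word_intensities)
```

Applied to the word accent intensities of "Alice looked up.", this yields the pattern `[0, 3, 0, 2, 0, 1, 0]`, which matches "/0/3/0/2/0/1/0/".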

Then, the storage unit 31 stores the teacher change information "/0/3/0/2/0/1/0/" of the acquired sentence in the recording medium. Here, it is assumed that the storage unit 31 stores, for example, the word (phoneme) transcription "Alice looked up." paired with the teacher change information of the sentence.

また、蓄積部３１は、取得された単語の教師変化情報を記録媒体に蓄積する。ここで、蓄積部３１は、教師音声情報の単語(音素)トランスクリプション「Alice」等と単語の教師変化情報とを対にして蓄積する、とする。 In addition, the storage unit 31 stores the teacher change information of the acquired word in the recording medium. Here, it is assumed that the storage unit 31 stores the word (phoneme) transcription "Alice" or the like of the teacher voice information and the teacher change information of the word as a pair.

以上の処理により、生産装置３は、図４に示す教師変化情報管理表を得る。 Through the above processing, the production apparatus 3 obtains the teacher change information management table shown in FIG.

以上、本実施の形態によれば、発音された入力音声の流れを考慮した音声の評定をするための教師データを自動生成できる。 As described above, according to the present embodiment, it is possible to automatically generate teacher data for evaluating the voice in consideration of the flow of the sounded input voice.

The software that realizes the information processing device in the present embodiment is the following program. That is, this program causes a computer to function as: a reception unit that receives teacher voice information; a dividing means that divides the teacher voice information into two or more pieces of partial voice information; a feature amount acquisition means that acquires two or more feature amounts from two or more pieces of evaluation target partial voice information included in the two or more pieces of partial voice information; a change information acquisition means that acquires teacher change information using the two or more feature amounts; and a storage unit that stores the teacher change information in a recording medium.

また、図１０は、本明細書で述べたプログラムを実行して、上述した種々の実施の形態の音声評定装置等を実現するコンピュータの外観を示す。上述の実施の形態は、コンピュータハードウェア及びその上で実行されるコンピュータプログラムで実現され得る。図１０は、このコンピュータシステム３００の概観図であり、図１１は、システム３００のブロック図である。 In addition, FIG. 10 shows the appearance of a computer that executes the program described in the present specification to realize the voice rating device and the like of various embodiments described above. The above embodiments can be realized in computer hardware and computer programs running on it. FIG. 10 is an overview view of the computer system 300, and FIG. 11 is a block diagram of the system 300.

図１０において、コンピュータシステム３００は、ＣＤ−ＲＯＭドライブ３０１２を含むコンピュータ３０１と、キーボード３０２と、マウス３０３と、モニタ３０４と、マイク３０５とを含む。 In FIG. 10, the computer system 300 includes a computer 301 including a CD-ROM drive 3012, a keyboard 302, a mouse 303, a monitor 304, and a microphone 305.

In FIG. 11, the computer 301 includes a CD-ROM drive 3012, an MPU 3013, a bus 3014, a ROM 3015, a RAM 3016, and a hard disk 3017. The ROM 3015 stores programs such as a boot-up program. The RAM 3016 is connected to the MPU 3013, temporarily stores the instructions of application programs, and provides a temporary storage space. The hard disk 3017 usually stores application programs, system programs, and data. Although not shown here, the computer 301 may further include a network card that provides a connection to a LAN.

The program that causes the computer system 300 to execute the functions of the voice rating device 1 and the like of the above-described embodiments may be stored on a CD-ROM 3101, which is inserted into the CD-ROM drive 3012, and then transferred to the hard disk 3017. Alternatively, the program may be transmitted to the computer 301 via a network (not shown) and stored on the hard disk 3017. The program is loaded into the RAM 3016 at execution time. The program may also be loaded directly from the CD-ROM 3101 or the network.

プログラムは、コンピュータ３０１に、上述した実施の形態の音声評定装置１等の機能を実行させるオペレーティングシステム（ＯＳ）、またはサードパーティープログラム等は、必ずしも含まなくても良い。プログラムは、制御された態様で適切なモジュールを呼び出し、所望の結果が得られるようにする命令の部分のみを含んでいれば良い。コンピュータシステム３００がどのように動作するかは周知であり、詳細な説明は省略する。 The program does not necessarily include an operating system (OS) that causes the computer 301 to execute the functions of the voice rating device 1 and the like according to the above-described embodiment, or a third-party program. The program only needs to include a portion of the instructions that call the appropriate module in a controlled manner to achieve the desired result. It is well known how the computer system 300 works, and detailed description thereof will be omitted.

また、上記プログラムを実行するコンピュータは、単数であってもよく、複数であってもよい。すなわち、集中処理を行ってもよく、あるいは分散処理を行ってもよい。 Further, the number of computers that execute the above program may be singular or plural. That is, centralized processing may be performed, or distributed processing may be performed.

また、上記各実施の形態において、一の装置に存在する２以上の通信手段は、物理的に一の媒体で実現されても良いことは言うまでもない。 Further, in each of the above embodiments, it goes without saying that the two or more communication means existing in one device may be physically realized by one medium.

また、上記各実施の形態において、各処理（各機能）は、単一の装置（システム）によって集中処理されることによって実現されてもよく、あるいは、複数の装置によって分散処理されることによって実現されてもよい。 Further, in each of the above-described embodiments, each process (each function) may be realized by centralized processing by a single device (system), or by distributed processing by a plurality of devices. May be done.

本発明は、以上の実施の形態に限定されることなく、種々の変更が可能であり、それらも本発明の範囲内に包含されるものであることは言うまでもない。 It goes without saying that the present invention is not limited to the above embodiments, and various modifications can be made, and these are also included in the scope of the present invention.

As described above, the voice rating device according to the present invention can rate voice in consideration of the flow of the pronounced input voice, and therefore has the effect of enabling appropriate rating of the input voice. It is useful as a learning device for foreign languages such as English.

1, 2 Voice rating device
3 Production apparatus
11 Storage unit
12 Reception unit
13, 23 Processing unit
14, 24 Output unit
31 Storage unit
111 Teacher change information storage unit
131 Acquisition unit
132 Rating unit
231 Second rating unit
232 Calculation unit
1311 Dividing means
1312 Feature amount acquisition means
1313 Change information acquisition means

Claims

A teacher change information storage unit that stores teacher change information related to changes in the features of two or more partial voice information that constitutes teacher voice information, which is voice information that serves as a teacher.
A reception unit that accepts input voice information, which is voice information having two or more partial voice information,
An acquisition unit that acquires input change information relating to a change in the feature amount of each of two or more partial voice information possessed by the input voice information,
Using the input change information and the teacher change information, the evaluation unit that evaluates the input voice information and acquires a score,
It is provided with an output unit that outputs the score.
The teacher change information and the input change information are
This is information regarding the order of the magnitude of the feature amount of the partial voice information.
When the information is the information possessed by the teacher change information and the information regarding the rank of the feature amount of at least two or more evaluation target partial voice information is the same information.
The acquisition unit
It is the information possessed by the input change information, and it is determined whether or not the ranks of the features of at least two evaluation target partial voice information at the positions corresponding to the same information are adjacent to each other, and they are adjacent to each other. A voice rating device that acquires input change information by regarding at least the size of the feature amount of the two evaluation target partial voice information as the same size when it is determined that the information is present.

The voice rating device according to claim 1, wherein the acquisition unit acquires the feature amount of each of the two or more partial voice information possessed by the input voice information, acquires the rank order of the magnitudes of the two or more feature amounts of at least two evaluation target partial voice information among the two or more partial voice information, and acquires the input change information having that rank order.

The voice rating device according to claim 1, wherein the acquisition unit acquires the feature amount of each of the two or more partial voice information possessed by the input voice information, and, for the two or more feature amounts of at least two evaluation target partial voice information among the two or more partial voice information, acquires the input change information, which is information that distinguishes the evaluation target partial voice information corresponding to the largest feature amount from the other evaluation target partial voice information.

The voice rating device according to any one of claims 1 to 3, wherein the input voice information is voice information of a sentence, and the partial voice information is voice information of words constituting the sentence.

The voice rating device according to any one of claims 1 to 3, wherein the input voice information is voice information of a word, and the partial voice information is voice information of phonemes constituting the word.

The voice rating device according to claim 1, wherein the feature amount of the partial voice information is an accent strength, which is information on the strength of an accent, or a rhythm amount, which is information on the length of the voice information.

The voice rating device according to any one of claims 1 to 6, wherein the rating unit acquires, as the score, the rank correlation coefficient between the input change information and the teacher change information.

The second rating section, which evaluates the pronunciation of the input voice information and obtains the second score,
A calculation unit for calculating a representative score, which is a representative score, is further provided by using the score acquired by the rating unit and the second score acquired by the second rating unit.
The output unit
The voice rating device according to any one of claims 1 to 7, which outputs the representative score.

The recording medium is
It is provided with a teacher change information storage unit for storing teacher change information regarding changes in the feature amount of each of two or more partial voice information constituting the teacher voice information which is the voice information to be a teacher.
A voice rating method realized by the reception section, acquisition section, rating section, and output section.
A reception step in which the reception unit receives input voice information which is voice information having two or more partial voice information,
An acquisition step in which the acquisition unit acquires input change information relating to a change in the feature amount of each of two or more partial voice information possessed by the input voice information.
A rating step in which the rating unit evaluates the input voice information using the input change information and the teacher change information and acquires a score.
The output unit includes an output step for outputting the score.
The teacher change information and the input change information are
This is information regarding the order of the magnitude of the feature amount of the partial voice information.
When the information is the information possessed by the teacher change information and the information regarding the rank of the feature amount of at least two or more evaluation target partial voice information is the same information.
In the acquisition step
It is the information possessed by the input change information, and it is determined whether or not the ranks of the features of at least two evaluation target partial voice information at the positions corresponding to the same information are adjacent to each other, and they are adjacent to each other. A voice rating method for acquiring input change information by regarding at least the size of the feature amount of the two evaluation target partial voice information as the same size.

A computer that can access the teacher change information storage unit that stores teacher change information related to changes in the feature amount of each of the two or more partial voice information that constitutes the teacher voice information, which is the voice information that serves as a teacher.
A reception unit that accepts input voice information, which is voice information having two or more partial voice information,
An acquisition unit that acquires input change information relating to a change in the feature amount of each of two or more partial voice information possessed by the input voice information,
Using the input change information and the teacher change information, the evaluation unit that evaluates the input voice information and acquires a score,
It is a program for functioning as an output unit that outputs the score.
The teacher change information and the input change information are
This is information regarding the order of the magnitude of the feature amount of the partial voice information.
When the information is the information possessed by the teacher change information and the information regarding the rank of the feature amount of at least two or more evaluation target partial voice information is the same information.
The acquisition unit
It is the information possessed by the input change information, and it is determined whether or not the ranks of the features of at least two evaluation target partial voice information at the positions corresponding to the same information are adjacent to each other, and they are adjacent to each other. If it is determined that the information is present, at least the magnitudes of the feature amounts of the two evaluation target partial voice information are regarded as the same magnitude, and the input change information is acquired.