JP6622681B2

JP6622681B2 - Phoneme Breakdown Detection Model Learning Device, Phoneme Breakdown Interval Detection Device, Phoneme Breakdown Detection Model Learning Method, Phoneme Breakdown Interval Detection Method, Program

Info

Publication number: JP6622681B2
Application number: JP2016214874A
Authority: JP
Inventors: 松井 清彰; 岡本 学; 福冨 隆朗
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-02
Filing date: 2016-11-02
Publication date: 2019-12-18
Anticipated expiration: 2036-11-02
Also published as: JP2018072697A

Description

The present invention relates to speech recognition technology, and more particularly to a technique for detecting phoneme breakdown intervals caused by unclear pronunciation.

Speech recognition technology for spontaneous speech is widely used in applications such as dialogue analysis at call centers, preparation of meeting minutes, and casual conversation between humans and robots.

Several methods of speech recognition exist. For example, one method generates a template for each phoneme from a large amount of speech data prepared in advance as training data, and then generates a maximum likelihood phoneme sequence by fitting the templates in order to the speech data to be recognized (Non-Patent Document 1).

There is also a method using DNNs (Deep Neural Networks) (Non-Patent Document 2). This method trains a DNN that takes speech features as input and outputs phonemes, so that the speech features of the speech data to be recognized are converted directly into phonemes to generate a phoneme sequence. With a large amount of training data, it achieves a very good speech recognition rate.

In addition, to recognize the speech of persons with articulation disorders caused by athetoid cerebral palsy, there is a method that extracts features using CNNs (Convolutional Neural Networks) to reduce fluctuations in the spectrogram (Non-Patent Document 3).

Non-Patent Document 1: F. Jelinek, "Continuous speech recognition by statistical methods", Proceedings of the IEEE, Vol. 64, No. 4, pp. 532-556, 1976.
Non-Patent Document 2: G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, B. Kingsbury, "Deep Neural Networks for Acoustic Modeling in Speech Recognition", IEEE Signal Processing Magazine, Vol. 29, Issue 6, pp. 82-97, 2012.
Non-Patent Document 3: Yuki Takashima, Toru Nakashika, Tetsuya Takiguchi, Yasuo Ariki, "A study of phoneme labeling based on Gaussian mixture models for speech recognition of persons with articulation disorders", IEICE Technical Report, vol. 115, no. 100, pp. 71-76, 2015.

Misrecognition can occur with any of these methods, but it is a particular problem when the speech recognition rate drops sharply. Several causes are possible.

Current speech recognition technology uses, as its knowledge, speech characteristics learned from training speech data prepared in advance. The speech recognition rate therefore drops sharply when the noise environment or the speaker's way of speaking deviates greatly from the average. On the noise side, examples of such deviation include exposure to a new noise environment that is absent from the training speech data and the occurrence of sudden, strongly non-stationary noise. On the speaker side, examples include speech uttered with strong emotion and speech that is extremely loud or extremely quiet. Such cases are causes of degradation, and when they occur in part or all of an utterance, the speech recognition rate drops sharply.

In addition, some current speech recognition technologies use information about which words precede the word currently in focus. As a result, once a misrecognition is caused by phoneme breakdown, that is, by unclear pronunciation, the subsequent words may also be misrecognized in a chain. This phenomenon is called a pitfall error (Reference Non-Patent Document 1). Pitfall errors also reduce the speech recognition rate sharply.
(Reference Non-Patent Document 1) Taichi Asami, Yoshiaki Noda, Satoshi Takahashi, "Analysis of speech recognition errors focusing on pitfall errors", Proceedings of the Acoustical Society of Japan meeting, March 2008, 1-10-18, pp. 53-54, 2008.

Accordingly, an object of the present invention is to provide a phoneme breakdown interval detection technique capable of detecting a phoneme breakdown interval, that is, an interval in which a single phoneme breakdown causes a chain of misrecognitions.

One aspect of the present invention includes a learning phoneme information extraction unit and a phoneme breakdown decision tree learning unit. A learning phoneme interval information sequence is a sequence of learning phoneme interval information assigned to learning speech data. Each item of learning phoneme interval information contains a phoneme label indicating a phoneme, the utterance start time and utterance end time of that phoneme, and a phoneme breakdown flag, which is either a phoneme breakdown label indicating that the phoneme is unclear or a label indicating otherwise. From the learning speech data and the learning phoneme interval information sequence, the learning phoneme information extraction unit extracts three items: a learning phoneme label, which is a phoneme label contained in the learning phoneme interval information that is either a vowel phoneme label (indicating a vowel phoneme) or a phoneme label associated with a phoneme breakdown label; the phoneme breakdown flag of that learning phoneme label; and a learning phoneme interval speech feature, which is the speech feature corresponding to the interval from the utterance start time to the utterance end time of the phoneme of that learning phoneme label. From the learning phoneme labels, their phoneme breakdown flags, and the learning phoneme interval speech features, the phoneme breakdown decision tree learning unit learns a phoneme breakdown decision tree, which is a model for detecting the breakdown of a phoneme.

According to the present invention, it is possible to learn a phoneme breakdown decision tree, which is a model for detecting the breakdown of vowel phonemes during speech recognition.

FIG. 1 is a diagram showing an example of a phoneme interval information sequence.
FIG. 2 is a diagram showing an example of a learning phoneme interval information sequence.
FIG. 3 is a diagram showing an example of the configuration of the phoneme breakdown detection model learning device 100.
FIG. 4 is a diagram showing an example of the operation of the phoneme breakdown detection model learning device 100.
FIG. 5 is a diagram showing an example of the configuration of the learning phoneme information extraction unit 110.
FIG. 6 is a diagram showing an example of the operation of the learning phoneme information extraction unit 110.
FIG. 7 is a diagram showing an example of a phoneme breakdown decision tree, which is a phoneme breakdown detection model.
FIG. 8 is a diagram showing an example of the configuration of the phoneme breakdown interval detection device 200.
FIG. 9 is a diagram showing an example of the operation of the phoneme breakdown interval detection device 200.
FIG. 10 is a diagram showing an example of a recognition result produced by the speech recognition unit 230.
FIG. 11 is a diagram showing an example of the configuration of the speech recognition unit 230.
FIG. 12 is a diagram showing an example of the configuration of the phoneme matching unit 250.
FIG. 13 is a diagram showing an example of the operation of the phoneme matching unit 250.
FIG. 14 is a diagram showing an example of the operation of the estimated phoneme sequence generation unit 241.
FIG. 15 is a diagram showing an example of the operation of the phoneme sequence comparison unit 243.
FIG. 16 is a diagram showing an example of a matching result produced by the phoneme matching unit 250.
FIG. 17 is a diagram showing an example of a detection result produced by the phoneme breakdown interval detection unit 270.

Hereinafter, embodiments of the present invention will be described in detail. Components having the same function are given the same reference number, and duplicate description is omitted.

<Definitions>
The terms used in each embodiment are described below.

[Speech data]
Speech data is speech recorded in advance for use in learning (specifically, learning of the phoneme breakdown decision tree) or in speech recognition. Speech data is the speech of sentences uttered by a speaker, for example digital data discretized at a sampling frequency of 16 kHz.

[Phoneme interval information sequence]
A phoneme interval information sequence is a sequence of information about phonemes (hereinafter, phoneme interval information) assigned to speech data. One phoneme interval information sequence is assigned to each item of speech data.

Phoneme interval information contains at least a phoneme label representing a phoneme and the utterance start time and utterance end time of that phoneme. The utterance start time and utterance end time are elapsed times measured from the start point of each utterance, which is taken as 0 seconds. FIG. 1 shows an example of a phoneme interval information sequence.

The learning data used to learn the phoneme breakdown decision tree is a pair consisting of learning speech data and a learning phoneme interval information sequence. Learning phoneme interval information is phoneme interval information in which a dedicated label (hereinafter, a phoneme breakdown label) has been manually associated with each phoneme that has broken down, that is, each phoneme that is pronounced unclearly. FIG. 2 shows an example of a learning phoneme interval information sequence. The phoneme breakdown label may be attached as a label separate from the phoneme label, as in FIG. 2, or it may overwrite the phoneme label of the phoneme that has broken down. In this example, the symbol "*" indicates that the phonemes "a", "r", "u", and "u" in the second, third, sixth, and eighth rows from the top of the table have broken down. Instead of leaving a phoneme that has not broken down unmarked, a special symbol such as nil may be attached to indicate that the phoneme is intact.
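The learning phoneme interval information described above can be pictured as a small record type. The following is a minimal sketch, assuming a Python representation; the class name `PhonemeInterval`, the field names, and the times are hypothetical and are not the values shown in FIG. 2.

```python
from dataclasses import dataclass
from typing import List

BREAKDOWN = "*"   # phoneme breakdown label, written "*" as in the example above
INTACT = "nil"    # marker for "no breakdown" (the text offers nil as one option)


@dataclass
class PhonemeInterval:
    """One item of learning phoneme interval information (hypothetical layout)."""
    label: str          # phoneme label, e.g. "a"
    start: float        # utterance start time in seconds (utterance start = 0)
    end: float          # utterance end time in seconds
    flag: str = INTACT  # phoneme breakdown flag

    @property
    def broken(self) -> bool:
        """True when the phoneme carries the phoneme breakdown label."""
        return self.flag == BREAKDOWN


# Illustrative sequence only; labels and times are invented for this sketch.
sequence: List[PhonemeInterval] = [
    PhonemeInterval("k", 0.00, 0.05),
    PhonemeInterval("a", 0.05, 0.12, BREAKDOWN),
    PhonemeInterval("r", 0.12, 0.16, BREAKDOWN),
    PhonemeInterval("a", 0.16, 0.25),
]

broken_labels = [p.label for p in sequence if p.broken]
```

Storing the flag as a separate field corresponds to the variant in which the breakdown label is attached alongside the phoneme label rather than overwriting it.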

In the following, each phoneme is assumed to be associated with a phoneme breakdown flag indicating the presence or absence of breakdown. The flag is either a phoneme breakdown label or a label indicating that no breakdown has occurred.

Generating a learning phoneme interval information sequence from a phoneme interval information sequence involves some subjectivity. However, variation between annotators in assigning phoneme breakdown labels can be suppressed to some extent, for example by limiting the work to locations where the speech recognition process has produced large recognition errors.

<First embodiment>
The phoneme breakdown detection model learning device 100 is described below with reference to FIGS. 3 to 7.

[Phoneme breakdown detection model learning device 100]
As shown in FIG. 3, the phoneme breakdown detection model learning device 100 includes a learning phoneme information extraction unit 110, a phoneme breakdown decision tree learning unit 130, and a recording unit 190. The recording unit 190 records, as appropriate, the information needed for the processing of the phoneme breakdown detection model learning device 100. The phoneme breakdown detection model learning device 100 takes learning speech data and a learning phoneme interval information sequence as input, and learns and outputs a phoneme breakdown decision tree, which is the phoneme breakdown detection model.

The operation of the phoneme breakdown detection model learning device 100 is described with reference to FIG. 4. The learning phoneme information extraction unit 110 takes the learning speech data and the learning phoneme interval information sequence as input. It extracts the phoneme labels contained in the learning phoneme interval information that are vowel phoneme labels (indicating vowel phonemes), semivowel phoneme labels (indicating semivowel phonemes), geminate phoneme labels (indicating the geminate consonant, or sokuon), or phoneme labels associated with a phoneme breakdown label (hereinafter, learning phoneme labels). It then extracts the speech features (hereinafter, learning phoneme interval speech features) from the frames corresponding to the utterance interval of the phoneme of each learning phoneme label, that is, the interval from the utterance start time to the utterance end time of that phoneme. Finally, it outputs each learning phoneme label together with its phoneme breakdown flag and its learning phoneme interval speech features (S110).

Most phoneme breakdown arises because phonemes that appear at the end of a word, such as vowels, semivowels, and geminate consonants, are not pronounced properly. For this reason, in addition to the phonemes associated with a phoneme breakdown label, the phonemes with vowel, semivowel, and geminate phoneme labels are also selected as phonemes of learning phoneme labels. For Japanese, the extracted labels are therefore the seven phonemes "a" (あ), "i" (い), "u" (う), "e" (え), "o" (お), "ng" (ん), and "q" (っ), together with the labels of the phonemes associated with a phoneme breakdown label.

In the example of FIG. 2, the phonemes "a", "r", "u", and "u" in the second, third, sixth, and eighth rows from the top of the table carry a phoneme breakdown label, so their labels become learning phoneme labels. The phoneme "a" in the fourth row has a vowel phoneme label, so its label also becomes a learning phoneme label.

The phoneme breakdown decision tree is therefore learned from the phonemes with vowel phoneme labels, the phonemes with semivowel phoneme labels, the phonemes with geminate phoneme labels, and the phonemes that have broken down.

The phoneme breakdown decision tree may also be learned using only the phonemes with vowel phoneme labels and the phonemes that have broken down, because phoneme breakdown is especially often caused by vowels not being pronounced properly.

The phoneme breakdown decision tree learning unit 130 takes the learning phoneme labels, their phoneme breakdown flags, and the learning phoneme interval speech features as input, and learns and outputs a phoneme breakdown decision tree (S130).

The configurations and operations of the learning phoneme information extraction unit 110 and the phoneme breakdown decision tree learning unit 130 are described in detail below.

First, the learning phoneme information extraction unit 110 is described with reference to FIGS. 5 and 6. As shown in FIG. 5, the learning phoneme information extraction unit 110 includes a speech feature generation unit 101 and a learning phoneme selection unit 103. The operation of the learning phoneme information extraction unit 110 is described with reference to FIG. 6.

The speech feature generation unit 101 divides the learning speech data into frames and generates speech features. For each item of learning phoneme interval information, it forms a set consisting of the speech features of the frames corresponding to the utterance interval of the phoneme (that is, the interval from the utterance start time to the utterance end time of that phoneme; hereinafter, phoneme interval speech features), the phoneme label of that phoneme, and its phoneme breakdown flag. It outputs the phoneme label, the phoneme breakdown flag, and the phoneme interval speech features (S101). As speech features, MFCCs (Mel-Frequency Cepstrum Coefficients) or FBANK (log mel filter bank) features may be used, for example. In general, the utterance interval of each phoneme corresponds to N frames, where N is an integer of 1 or more, so N speech feature vectors are associated with one phoneme label.
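The mapping from an utterance interval to its N frames can be sketched as follows. This is an illustration only: the 10 ms frame shift and the function name `interval_frames` are assumptions, since the text does not specify the frame length or shift, and the features are stand-in lists rather than real MFCC or FBANK vectors.

```python
from typing import List, Sequence


def interval_frames(
    features: Sequence[Sequence[float]],
    start: float,
    end: float,
    frame_shift: float = 0.010,  # assumed frame shift in seconds (not from the source)
) -> List[Sequence[float]]:
    """Return the per-frame features that fall inside [start, end).

    At least one frame is always returned, matching the statement that
    each phoneme interval corresponds to N >= 1 frames.
    """
    first = int(round(start / frame_shift))
    last = max(first + 1, int(round(end / frame_shift)))
    return list(features[first:last])


# Stand-in feature sequence: 30 frames of 1-dimensional "features".
features = [[float(i)] for i in range(30)]
segment = interval_frames(features, 0.05, 0.12)
```

With these assumptions, a phoneme uttered from 0.05 s to 0.12 s is associated with seven frames, so N = 7 for that phoneme label.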

The learning phoneme selection unit 103 takes a phoneme label, a phoneme breakdown flag, and phoneme interval speech features as input. If the phoneme label indicates a vowel, a semivowel, or a geminate consonant, or if a phoneme breakdown label is attached to the phoneme label (that is, the phoneme breakdown flag is the symbol "*"), the unit outputs the input phoneme label, phoneme breakdown flag, and phoneme interval speech features unchanged as the learning phoneme label, its phoneme breakdown flag, and the learning phoneme interval speech features. Otherwise (that is, if the phoneme label indicates none of a vowel, a semivowel, or a geminate consonant, and no phoneme breakdown label is attached), the unit discards the input and outputs nothing (S103).
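The selection rule of step S103 amounts to a simple filter. The sketch below assumes each input is a tuple of (phoneme label, phoneme breakdown flag, phoneme interval speech features); the tuple layout and the function name `select_learning_phonemes` are hypothetical, while the seven target labels are the ones listed for Japanese in the description.

```python
from typing import List, Sequence, Tuple

# The seven Japanese phonemes named in the description
# (vowels, semivowel, and geminate consonant).
TARGET_LABELS = {"a", "i", "u", "e", "o", "ng", "q"}
BREAKDOWN = "*"

Item = Tuple[str, str, Sequence[Sequence[float]]]  # (label, flag, features)


def select_learning_phonemes(items: Sequence[Item]) -> List[Item]:
    """Keep items whose label is a target phoneme or whose flag marks breakdown."""
    selected = []
    for label, flag, feats in items:
        if label in TARGET_LABELS or flag == BREAKDOWN:
            selected.append((label, flag, feats))  # passed through unchanged
        # otherwise the item is discarded and nothing is output
    return selected


items = [
    ("k", "nil", [[0.1]]),
    ("a", "*", [[0.2]]),
    ("r", "*", [[0.3]]),
    ("a", "nil", [[0.4]]),
    ("s", "nil", [[0.5]]),
]
learning_items = select_learning_phonemes(items)
```

The consonant "r" is kept only because it carries the breakdown label, whereas "k" and "s" are dropped.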

Next, the phoneme breakdown decision tree learning unit 130 is described. The unit takes the learning phoneme labels, their phoneme breakdown flags, and the learning phoneme interval speech features as input and learns a phoneme breakdown decision tree (S130). As shown in FIG. 7, the phoneme breakdown decision tree takes the learning phoneme interval speech features that enter the root node at the top layer and repeatedly applies yes/no questions (here, questions about attributes of the learning phoneme interval speech features, together with their answers) until it reaches a leaf node at the bottom layer. It then outputs the learning phoneme label and phoneme breakdown flag assigned to that leaf node. Hereinafter, a leaf node whose phoneme breakdown flag is the symbol "*", that is, a leaf node where phoneme breakdown has occurred, is called a phoneme breakdown node.
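The root-to-leaf traversal described above can be sketched with a small node type. Everything specific here is an assumption made for illustration: the `Node` class, the threshold-style questions, the attribute names `duration_s` and `mean_f0`, and the thresholds are invented and are not taken from FIG. 7.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    """Internal node (yes/no question) or leaf (label and breakdown flag)."""
    attribute: Optional[str] = None   # None marks a leaf node
    threshold: float = 0.0
    yes: Optional["Node"] = None      # followed when attribute value < threshold
    no: Optional["Node"] = None
    label: Optional[str] = None       # learning phoneme label at a leaf
    flag: Optional[str] = None        # "*" at a phoneme breakdown node


def classify(root: Node, attrs: Dict[str, float]) -> Tuple[Optional[str], Optional[str]]:
    """Answer yes/no questions from the root until a leaf is reached."""
    node = root
    while node.attribute is not None:
        node = node.yes if attrs[node.attribute] < node.threshold else node.no
    return node.label, node.flag


# Toy tree with invented questions and thresholds.
tree = Node(
    attribute="duration_s", threshold=0.04,
    yes=Node(label="u", flag="*"),                      # very short: breakdown node
    no=Node(
        attribute="mean_f0", threshold=120.0,
        yes=Node(label="a", flag="*"),
        no=Node(label="a", flag="nil"),
    ),
)

result_short = classify(tree, {"duration_s": 0.03, "mean_f0": 150.0})
result_clear = classify(tree, {"duration_s": 0.09, "mean_f0": 150.0})
```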

In general, learning a decision tree requires several attribute-value pairs in order to cluster the learning data. The number and kinds of attributes can be chosen freely, but because learning data is usually large, it is desirable that the attributes and their values be determined automatically by a fixed procedure. For example, the length of the phoneme interval can be used as an attribute. This attribute can be calculated from the number of learning phoneme interval speech features, which are the learning data. The average value of F0, a feature representing the pitch of the sound, can also be used as an attribute, because it can be computed from the phoneme interval speech features. The main cause of phoneme breakdown is slurring, in which physical constraints on mouth movement cause the preceding and following phonemes to be dragged into the current one. The tendency toward phoneme breakdown is also stronger in people who speak fast. Consequently, learning of the phoneme breakdown decision tree proceeds efficiently when attributes and values related to the amount of temporal change and to phoneme duration are used.
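The attributes named above (interval length, mean F0, and an amount of temporal change) could be computed automatically along the following lines. This is a sketch under assumptions: the 10 ms frame shift, the convention that F0 = 0 marks an unvoiced frame, and the use of the mean absolute frame-to-frame difference as the temporal change measure are all choices made for illustration, not details given in the text.

```python
from typing import Dict, Sequence


def interval_attributes(
    frames: Sequence[Sequence[float]],  # N speech feature vectors of one phoneme
    f0: Sequence[float],                # per-frame F0 in Hz; 0 assumed to mean unvoiced
    frame_shift: float = 0.010,         # assumed frame shift in seconds
) -> Dict[str, float]:
    """Compute duration, mean F0, and mean temporal change for one phoneme interval."""
    duration = len(frames) * frame_shift

    voiced = [v for v in f0 if v > 0]
    mean_f0 = sum(voiced) / len(voiced) if voiced else 0.0

    # Mean absolute difference between consecutive frames, averaged over dimensions.
    deltas = [
        sum(abs(b - a) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]
    mean_delta = sum(deltas) / len(deltas) if deltas else 0.0

    return {"duration_s": duration, "mean_f0": mean_f0, "mean_delta": mean_delta}


attrs = interval_attributes(
    frames=[[0.0, 0.0], [1.0, 3.0], [1.0, 3.0]],
    f0=[100.0, 0.0, 120.0],
)
```

Because every value is derived from the frames themselves, no manual annotation beyond the breakdown labels is needed to build the attribute set.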

An entropy-based learning method can be applied to learning the phoneme breakdown decision tree. Such a method makes it possible to evaluate objectively the importance of the attributes used to build the tree, and placing the more important attributes closer to the root node yields a more compact decision tree. The entropy-based learning method is described briefly below. Let T be the decision tree, R_m the m-th node, and n_m the number of examples in node R_m (the number of learning data assigned to node R_m when the data is clustered according to decision tree T). The probability P^_{m,g} that the label at node R_m is g is then given by equation (1).

Here, I[] denotes the individual learning data, y_i is the label of learning data I, and the sum Σ runs over all learning data assigned to node R_m, which are taken to be numbered from 1 to n_m.

The predicted label y^(m) at node R_m is the label with the highest probability, as given by equation (2).

In entropy-based learning, the cost Q_m(T) of node R_m is defined by equation (3).

That is, the cost Q_m(T) at node R_m is the entropy at node R_m (the sum of the entropy terms of the individual labels) with its sign inverted, so that Q_m(T) ≤ 0.
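For reference, equations (1) to (3) can be written out as below. This is a reconstruction from the surrounding definitions rather than a copy of the original formulas: it assumes the usual classification-tree form and reads I[·] as an indicator that equals 1 when the i-th learning datum in node R_m has label g and 0 otherwise.

```latex
\begin{align}
\hat{P}_{m,g} &= \frac{1}{n_m} \sum_{i=1}^{n_m} I[\, y_i = g \,] \tag{1} \\
\hat{y}(m) &= \operatorname*{arg\,max}_{g} \, \hat{P}_{m,g} \tag{2} \\
Q_m(T) &= \sum_{g} \hat{P}_{m,g} \log \hat{P}_{m,g} \tag{3}
\end{align}
```

Under this reading, each term of equation (3) is non-positive because 0 ≤ P^_{m,g} ≤ 1, which agrees with the stated bound Q_m(T) ≤ 0.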

If the attribute under consideration is useful for discriminating phoneme breakdown, there should be a strong relationship between the value the attribute takes and the presence or absence of the phoneme breakdown label. That is, for a useful attribute the entropy is small (in other words, the cost is large). Indeed, if P^_{m,g} = 1 in equation (3), that is, if an attribute yields label g with a probability of 100%, the entropy is 0 (and the cost is also 0) and purity is at its maximum. The magnitude of the cost thus indicates the importance of the attribute. Placing nodes with a large cost higher in the tree (closer to the root node) therefore makes it possible to build a decision tree that is more compact and has higher discrimination performance.
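The cost described above, and its use for ranking candidate questions, can be sketched as follows. The function `node_cost` follows the description of the cost as sign-inverted entropy; the function `best_threshold`, which scores a threshold question by the size-weighted cost of the two child nodes, is one common way to apply such a cost and is an assumption here, not a procedure stated in the text.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def node_cost(labels: Sequence[str]) -> float:
    """Sign-inverted entropy of the labels in one node (always <= 0)."""
    n = len(labels)
    return sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def best_threshold(
    values: Sequence[float], labels: Sequence[str]
) -> Tuple[Optional[float], float]:
    """Pick the threshold whose two child nodes have the largest weighted cost."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_cost = None, -math.inf
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left: List[str] = [lab for v, lab in pairs if v < t]
        right: List[str] = [lab for v, lab in pairs if v >= t]
        cost = (len(left) * node_cost(left) + len(right) * node_cost(right)) / n
        if cost > best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost


# Invented durations (seconds) and breakdown flags for illustration.
durations = [0.02, 0.03, 0.08, 0.09]
flags = ["*", "*", "nil", "nil"]
threshold, cost = best_threshold(durations, flags)
```

In this toy data the duration attribute separates the two flags perfectly, so the best question reaches the maximum cost of 0 and would be a natural candidate for a node near the root.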

Entropy-based decision tree learning has many other advantages. For example, a node may have two or more branches below it, and pruning, which removes unnecessary branches from the constructed tree, is easy.

The phoneme breakdown interval detection device 200 is described below with reference to FIGS. 8 to 17.

[Phoneme breakdown interval detection device 200]
As shown in FIG. 8, the phoneme breakdown interval detection device 200 includes a speech feature generation unit 210, a speech recognition unit 230, a phoneme matching unit 250, a phoneme breakdown interval detection unit 270, and a recording unit 290. The recording unit 290 records, as appropriate, the information needed for the processing of the phoneme breakdown interval detection device 200. The phoneme breakdown interval detection device 200 takes speech data for recognition as input and, using the phoneme breakdown decision tree learned by the phoneme breakdown detection model learning device 100, generates and outputs a maximum likelihood phoneme sequence with phoneme breakdown intervals. This is the maximum likelihood phoneme sequence produced as the recognition result of the speech recognition unit 230, annotated with information on the intervals in which phonemes have broken down (phoneme breakdown intervals).

The operation of the phoneme breakdown interval detection device 200 is described with reference to FIG. 9. The speech feature generation unit 210 divides the speech data for recognition into frames, and generates and outputs speech features (S210). The speech feature generation unit 210 generates the speech features under the same conditions as the speech feature generation unit 101.

The speech recognition unit 230 takes the speech features generated in S210 as input and generates two sequences: the maximum likelihood phoneme sequence, which is the most likely sequence of phonemes for the speech data for recognition, and the speech feature sequence, which is the sequence of speech features of the frames corresponding to the utterance interval of each phoneme in the maximum likelihood phoneme sequence. It outputs the maximum likelihood phoneme sequence and the speech feature sequence as the recognition result (S230). FIG. 10 shows an example of a recognition result, and FIG. 11 shows an example of the configuration of the speech recognition unit 230. In this configuration, the decoder 221 uses the models (acoustic model 225, language model 227, and dictionary 229) to generate, from the input speech features, a recognition result containing the maximum likelihood phoneme sequence. A DNN may be used to implement the speech recognition unit 230.

The phoneme matching unit 250 takes the maximum likelihood phoneme sequence and the speech feature sequence generated in S230 as input and, using the phoneme breakdown decision tree learned by the phoneme breakdown detection model learning device 100, generates and outputs a maximum likelihood phoneme sequence with phoneme breakdown labels (S250). This is a sequence of per-phoneme matching results in which a phoneme breakdown label is attached to each vowel phoneme label in the maximum likelihood phoneme sequence whose vowel phoneme has broken down. The phoneme matching unit 250 is described in detail with reference to FIGS. 12 to 16. As shown in FIG. 12, the phoneme matching unit 250 includes an estimated phoneme sequence generation unit 241 and a phoneme sequence comparison unit 243. The operation of the phoneme matching unit 250 is described with reference to FIG. 13.

推定音素系列生成部２４１は、最尤音素系列、音声特徴量系列を入力として、音素崩れ決定木を用いて、推定音素系列を生成する（Ｓ２４１）。推定音素系列生成部２４１の動作について詳細に説明する（図１４参照）。図１４は、最尤音素系列をa₁…a_K、音声特徴量系列をb₁…b_Kを入力として推定音素系列をc₁…c_Kを出力する推定音素系列生成部２４１の動作を説明するフローチャートである（ただし、Kは系列の長さ（つまり、最尤音素系列に含まれる音素の数））。 The estimated phoneme sequence generation unit 241 takes the maximum likelihood phoneme sequence and the speech feature sequence as input and generates an estimated phoneme sequence using the phoneme collapse decision tree (S241). The operation of the estimated phoneme sequence generation unit 241 is described in detail below (see FIG. 14). FIG. 14 is a flowchart describing the operation of the estimated phoneme sequence generation unit 241, which takes the maximum likelihood phoneme sequence a₁…a_K and the speech feature sequence b₁…b_K as input and outputs the estimated phoneme sequence c₁…c_K (where K is the length of the sequences, that is, the number of phonemes in the maximum likelihood phoneme sequence).

推定音素系列生成部２４１は、最尤音素系列中の音素ラベルが母音を示すものである場合は、音素崩れ決定木を用いて、音声特徴量系列中の当該母音に対応する音声特徴量から決定される音素を推定音素として生成する（Ｓ２４１−４ａ）。一方、最尤音素系列中の音素ラベルが子音等母音以外の音素を示すものである場合は、当該音素を推定音素として生成する（Ｓ２４１−４ｂ）。これらの推定音素を順に結合することで推定音素系列を生成する（Ｓ２４１−７）。 When a phoneme label in the maximum likelihood phoneme sequence indicates a vowel, the estimated phoneme sequence generation unit 241 uses the phoneme collapse decision tree to generate, as the estimated phoneme, the phoneme determined from the speech feature corresponding to that vowel in the speech feature sequence (S241-4a). When the phoneme label indicates a phoneme other than a vowel, such as a consonant, that phoneme itself is generated as the estimated phoneme (S241-4b). The estimated phoneme sequence is generated by concatenating these estimated phonemes in order (S241-7).
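The per-phoneme branching of S241 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the vowel set `VOWELS`, the function name `generate_estimated_sequence`, and the `tree_predict` callable (a stand-in for the learned phoneme collapse decision tree, assumed here to map one speech feature to one phoneme label) are all hypothetical.

```python
from typing import Callable, List, Sequence

# Assumption: phoneme labels are strings and the vowels are the five
# Japanese vowels; the source does not list the actual label set.
VOWELS = frozenset({"a", "i", "u", "e", "o"})

Feature = Sequence[float]


def generate_estimated_sequence(
    ml_phonemes: Sequence[str],
    features: Sequence[Feature],
    tree_predict: Callable[[Feature], str],
) -> List[str]:
    """Build c_1..c_K from a_1..a_K and b_1..b_K (sketch of S241)."""
    if len(ml_phonemes) != len(features):
        raise ValueError("both input sequences must have the same length K")
    estimated: List[str] = []
    for label, feature in zip(ml_phonemes, features):
        if label in VOWELS:
            # Vowel: the decision tree determines the phoneme (S241-4a).
            estimated.append(tree_predict(feature))
        else:
            # Non-vowel: the phoneme is passed through unchanged (S241-4b).
            estimated.append(label)
    return estimated  # concatenation in order (S241-7)


def toy_tree(feature: Feature) -> str:
    # Hypothetical stand-in for the learned phoneme collapse decision tree.
    return "a" if feature[0] >= 0.5 else "u"


example = generate_estimated_sequence(
    ["k", "a", "r", "a"], [[0.1], [0.9], [0.2], [0.3]], toy_tree
)
```

In this toy run the second vowel is estimated as a different phoneme from its maximum likelihood label, which is the kind of mismatch the subsequent comparison step looks for.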

音素系列比較部２４３は、最尤音素系列、Ｓ２４１で生成した推定音素系列を入力として、音素崩れラベル付き最尤音素系列を生成する（Ｓ２４３）。音素系列比較部２４３の動作について詳細に説明する（図１５参照）。図１５は、最尤音素系列をa₁…a_K、推定音素系列をc₁…c_Kを入力として音素崩れラベル付き最尤音素系列をd₁…d_Kを出力する音素系列比較部２４３の動作を説明するフローチャートである。 The phoneme sequence comparison unit 243 takes the maximum likelihood phoneme sequence and the estimated phoneme sequence generated in S241 as input and generates the maximum likelihood phoneme sequence with phoneme collapse labels (S243). The operation of the phoneme sequence comparison unit 243 is described in detail below (see FIG. 15). FIG. 15 is a flowchart describing the operation of the phoneme sequence comparison unit 243, which takes the maximum likelihood phoneme sequence a₁…a_K and the estimated phoneme sequence c₁…c_K as input and outputs the maximum likelihood phoneme sequence with phoneme collapse labels d₁…d_K.

音素系列比較部２４３は、Ｓ２４１で生成した推定音素系列の各音素ラベルと最尤音素系列の各音素ラベルを順に比較していき（Ｓ２４３−３）、一致する場合は最尤音素系列の音素ラベルのみを音素単位照合結果として生成する（Ｓ２４３−４ａ）。一方、一致しない場合は最尤音素系列の音素ラベルと音素崩れラベルの組を音素単位照合結果として生成する（Ｓ２４１−４ｂ）。これらの音素単位照合結果を順に結合することで音素崩れラベル付き最尤音素系列を照合結果として生成し、出力する（Ｓ２４３−７）。照合結果の一例を図１６に示す。 The phoneme sequence comparison unit 243 compares each phoneme label of the estimated phoneme sequence generated in S241 with the corresponding phoneme label of the maximum likelihood phoneme sequence, in order (S243-3). If they match, only the phoneme label of the maximum likelihood phoneme sequence is generated as the per-phoneme matching result (S243-4a). If they do not match, the pair consisting of the phoneme label of the maximum likelihood phoneme sequence and a phoneme collapse label is generated as the per-phoneme matching result (S241-4b). Concatenating these per-phoneme matching results in order yields the maximum likelihood phoneme sequence with phoneme collapse labels, which is output as the matching result (S243-7). An example of the matching result is shown in FIG. 16.
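The comparison in S243 amounts to an element-wise equality check between the two sequences. The following Python sketch shows one way to express it. The function name `compare_sequences` and the representation of a per-phoneme matching result as a `(label, collapsed)` tuple are assumptions made for illustration; the source only says that a mismatch yields the label paired with a phoneme collapse label.

```python
from typing import List, Sequence, Tuple

# Assumption: a per-phoneme matching result is (phoneme label, collapsed?),
# where collapsed=True stands for "label paired with a phoneme collapse label".
MatchResult = Tuple[str, bool]


def compare_sequences(
    ml_phonemes: Sequence[str], estimated: Sequence[str]
) -> List[MatchResult]:
    """Build d_1..d_K from a_1..a_K and c_1..c_K (sketch of S243)."""
    if len(ml_phonemes) != len(estimated):
        raise ValueError("both input sequences must have the same length K")
    results: List[MatchResult] = []
    for ml_label, est_label in zip(ml_phonemes, estimated):
        # A mismatch marks the maximum likelihood label as collapsed;
        # a match keeps the label alone.
        results.append((ml_label, ml_label != est_label))
    return results


labelled = compare_sequences(["k", "a", "r", "a"], ["k", "a", "r", "u"])
```

Because non-vowel phonemes are copied unchanged when the estimated sequence is built, only vowel positions can end up marked as collapsed at this stage.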

音素崩れ区間検出部２７０は、Ｓ２５０で生成した音素崩れラベル付き最尤音素系列を入力として、音素崩れラベルが付与された２つ以上の連接する音素群からなる音素崩れ区間を付与した音素崩れ区間付き最尤音素系列を生成し、出力する（Ｓ２７０）。具体的には以下のようにして音素崩れ区間付き最尤音素系列を生成する。音素崩れラベル付き最尤音素系列を先頭から順に見ていき、音素崩れラベルが付与されている母音音素ラベル（母音音素ラベル１）を見つけ出す。見つけたら、その次に出現する母音音素ラベル（母音音素ラベル２）を見つけ出し、音素崩れラベルが付与されているか否かを確認する。音素崩れラベルが付与されている場合は、その間にあるすべての子音等の音素ラベルに対して音素崩れラベルを付与する（つまり、母音音素ラベル１から母音音素ラベル２までのすべての音素ラベルに音素崩れラベルを付与する）。一方、音素崩れラベルが付与されていない場合は、見つけ出した音素崩れラベルが付与されている母音音素ラベル（母音音素ラベル１）から音素崩れラベルを削除する。この手続きを繰り返すことにより、音素崩れラベルが付与された２つ以上の連接する音素群からなる音素崩れ区間が生成され、音素崩れ区間付き最尤音素系列が生成される。したがって、最尤音素系列の中で母音音素ラベルのみをみたとき隣り合う３つの母音音素ラベルすべてに音素崩れラベルが付与されている場合は、前から１番目の母音音素ラベルから３番目の母音音素ラベルまでのすべての音素ラベルに音素崩れラベルを付与することになる。検出結果の一例を図１７に示す。 The phoneme collapse section detection unit 270 takes the maximum likelihood phoneme sequence with phoneme collapse labels generated in S250 as input, and generates and outputs a maximum likelihood phoneme sequence with phoneme collapse sections, where each phoneme collapse section consists of two or more contiguous phonemes carrying a phoneme collapse label (S270). Specifically, the sequence is generated as follows. The maximum likelihood phoneme sequence with phoneme collapse labels is scanned from the beginning to find a vowel phoneme label that carries a phoneme collapse label (vowel phoneme label 1). Once it is found, the next vowel phoneme label to appear (vowel phoneme label 2) is located and checked for a phoneme collapse label. If it has one, a phoneme collapse label is attached to all phoneme labels between the two, such as consonants (that is, every phoneme label from vowel phoneme label 1 through vowel phoneme label 2 receives a phoneme collapse label). If it does not, the phoneme collapse label is removed from vowel phoneme label 1. Repeating this procedure produces phoneme collapse sections, each consisting of two or more contiguous phonemes carrying a phoneme collapse label, and thus the maximum likelihood phoneme sequence with phoneme collapse sections. Consequently, if three adjacent vowel phoneme labels all carry a phoneme collapse label when only the vowel phoneme labels in the maximum likelihood phoneme sequence are considered, every phoneme label from the first through the third of those vowel phoneme labels receives a phoneme collapse label. An example of the detection result is shown in FIG. 17.
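The section detection of S270 can be sketched in Python as a pass over consecutive vowel pairs. This is an illustrative reading, not the patent's implementation. It assumes that a collapsed vowel keeps its label whenever the preceding or following vowel is also collapsed, and loses it only when it is isolated; the source's step-by-step wording leaves the handling of a section's last vowel to interpretation. The names `detect_collapse_sections` and `VOWELS` and the boolean-flag representation are hypothetical.

```python
from typing import List, Sequence

# Assumption: the five Japanese vowels; the source does not list the label set.
VOWELS = frozenset({"a", "i", "u", "e", "o"})


def detect_collapse_sections(
    phonemes: Sequence[str], collapsed: Sequence[bool]
) -> List[bool]:
    """Return per-phoneme flags marking membership in a collapse section.

    `collapsed[i]` is True when phoneme i carries a phoneme collapse label
    after the matching step (only vowels are expected to be flagged there).
    """
    if len(phonemes) != len(collapsed):
        raise ValueError("phonemes and collapsed must have the same length")
    vowel_positions = [i for i, p in enumerate(phonemes) if p in VOWELS]
    in_section = [False] * len(phonemes)
    # Walk over each pair of neighbouring vowels (vowel label 1, vowel label 2).
    for first, second in zip(vowel_positions, vowel_positions[1:]):
        if collapsed[first] and collapsed[second]:
            # Both collapsed: label everything from vowel 1 through vowel 2.
            for i in range(first, second + 1):
                in_section[i] = True
    # Isolated collapsed vowels never enter a pair, so their label is dropped.
    return in_section


pair = detect_collapse_sections(
    ["k", "a", "r", "a", "t", "e"], [False, True, False, True, False, False]
)
isolated = detect_collapse_sections(
    ["k", "a", "r", "a"], [False, True, False, False]
)
triple = detect_collapse_sections(
    ["a", "k", "i", "t", "a"], [True, False, True, False, True]
)
```

Starting from an all-False result and only setting flags for confirmed pairs avoids a separate deletion step, and it covers the three-vowel case described above because overlapping pairs extend the same section.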

本実施形態の発明によれば、音声認識時に母音の音素崩れを検出するためのモデルである音素崩れ決定木を学習することができる。また、音素崩れ決定木を用いて、母音の音素崩れのみを判定することにより音素崩れを迅速に検出することができる。さらに、音声認識率を著しく低下させる、音素崩れが２つ以上の音素で連続的に生じている音素崩れ区間を検出することができる。 According to the invention of this embodiment, it is possible to learn a phoneme collapse decision tree that is a model for detecting phoneme collapse of a vowel during speech recognition. Moreover, phoneme collapse can be detected quickly by determining only phoneme collapse of a vowel using the phoneme collapse decision tree. Furthermore, it is possible to detect a phoneme breakage section in which phoneme breakage occurs continuously with two or more phonemes, which significantly reduces the speech recognition rate.

＜変形例＞ <Modification>
この発明は上述の実施形態に限定されるものではなく、この発明の趣旨を逸脱しない範囲で適宜変更が可能であることはいうまでもない。上記実施形態において説明した各種の処理は、記載の順に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。 It goes without saying that the present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the invention. The various processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

＜補記＞ <Supplementary note>
本発明の装置は、例えば単一のハードウェアエンティティとして、キーボードなどが接続可能な入力部、液晶ディスプレイなどが接続可能な出力部、ハードウェアエンティティの外部に通信可能な通信装置（例えば通信ケーブル）が接続可能な通信部、ＣＰＵ（Central Processing Unit、キャッシュメモリやレジスタなどを備えていてもよい）、メモリであるＲＡＭやＲＯＭ、ハードディスクである外部記憶装置並びにこれらの入力部、出力部、通信部、ＣＰＵ、ＲＡＭ、ＲＯＭ、外部記憶装置の間のデータのやり取りが可能なように接続するバスを有している。また必要に応じて、ハードウェアエンティティに、ＣＤ−ＲＯＭなどの記録媒体を読み書きできる装置（ドライブ）などを設けることとしてもよい。このようなハードウェア資源を備えた物理的実体としては、汎用コンピュータなどがある。 The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include cache memory, registers, and the like), RAM and ROM serving as memory, an external storage device that is a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read and write a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity equipped with such hardware resources.

ハードウェアエンティティの外部記憶装置には、上述の機能を実現するために必要となるプログラムおよびこのプログラムの処理において必要となるデータなどが記憶されている（外部記憶装置に限らず、例えばプログラムを読み出し専用記憶装置であるＲＯＭに記憶させておくこととしてもよい）。また、これらのプログラムの処理によって得られるデータなどは、ＲＡＭや外部記憶装置などに適宜に記憶される。 The external storage device of the hardware entity stores the program needed to implement the functions described above and the data needed for the processing of this program (storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device). Data obtained through the processing of these programs is stored as appropriate in RAM, the external storage device, or the like.

ハードウェアエンティティでは、外部記憶装置（あるいはＲＯＭなど）に記憶された各プログラムとこの各プログラムの処理に必要なデータが必要に応じてメモリに読み込まれて、適宜にＣＰＵで解釈実行・処理される。その結果、ＣＰＵが所定の機能（上記、…部、…手段などと表した各構成要件）を実現する。 In the hardware entity, each program stored in the external storage device (or ROM, etc.) and the data needed to process it are read into memory as needed and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU implements the predetermined functions (the components referred to above as "... unit", "... means", and so on).

本発明は上述の実施形態に限定されるものではなく、本発明の趣旨を逸脱しない範囲で適宜変更が可能である。また、上記実施形態において説明した処理は、記載の順に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されるとしてもよい。 The present invention is not limited to the embodiment described above and can be modified as appropriate without departing from the spirit of the invention. In addition, the processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

既述のように、上記実施形態において説明したハードウェアエンティティ（本発明の装置）における処理機能をコンピュータによって実現する場合、ハードウェアエンティティが有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、上記ハードウェアエンティティにおける処理機能がコンピュータ上で実現される。 As described above, when the processing functions in the hardware entity (the apparatus of the present invention) described in the above embodiments are realized by a computer, the processing contents of the functions that the hardware entity should have are described by a program. Then, by executing this program on a computer, the processing functions in the hardware entity are realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。具体的には、例えば、磁気記録装置として、ハードディスク装置、フレキシブルディスク、磁気テープ等を、光ディスクとして、ＤＶＤ（Digital Versatile Disc）、ＤＶＤ−ＲＡＭ（Random Access Memory）、ＣＤ−ＲＯＭ（Compact Disc Read Only Memory）、ＣＤ−Ｒ（Recordable）／ＲＷ（ReWritable）等を、光磁気記録媒体として、ＭＯ（Magneto-Optical disc）等を、半導体メモリとしてＥＥＰ−ＲＯＭ（Electronically Erasable and Programmable-Read Only Memory）等を用いることができる。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used. Specifically, for example, as a magnetic recording device, a hard disk device, a flexible disk, a magnetic tape or the like, and as an optical disk, a DVD (Digital Versatile Disc), a DVD-RAM (Random Access Memory), a CD-ROM (Compact Disc Read Only). Memory), CD-R (Recordable) / RW (ReWritable), etc., magneto-optical recording medium, MO (Magneto-Optical disc), etc., semiconductor memory, EEP-ROM (Electronically Erasable and Programmable-Read Only Memory), etc. Can be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の記憶装置に格納する。そして、処理の実行時、このコンピュータは、自己の記録媒体に格納されたプログラムを読み取り、読み取ったプログラムに従った処理を実行する。また、このプログラムの別の実行形態として、コンピュータが可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することとしてもよく、さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。また、サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるＡＳＰ（Application Service Provider）型のサービスによって、上述の処理を実行する構成としてもよい。なお、本形態におけるプログラムには、電子計算機による処理の用に供する情報であってプログラムに準ずるもの（コンピュータに対する直接の指令ではないがコンピュータの処理を規定する性質を有するデータ等）を含むものとする。 A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program sequentially each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

また、この形態では、コンピュータ上で所定のプログラムを実行させることにより、ハードウェアエンティティを構成することとしたが、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 In this embodiment, a hardware entity is configured by executing a predetermined program on a computer. However, at least a part of these processing contents may be realized by hardware.

Claims

A phoneme collapse detection model learning device in which learning phoneme section information is information including a phoneme label indicating a phoneme, an utterance start time and an utterance end time of the phoneme, and a phoneme collapse flag that is either a phoneme collapse label indicating that the phoneme is unclear or a label indicating that it is not, the device comprising:
a learning phoneme information extraction unit that extracts, from learning speech data and a sequence of the learning phoneme section information, a learning phoneme label, which is a phoneme label included in the learning phoneme section information that is either a vowel phoneme label indicating a vowel phoneme or a phoneme label associated with a phoneme collapse label, the phoneme collapse flag of the learning phoneme label, and a learning phoneme section speech feature, which is the speech feature corresponding to the section from the utterance start time to the utterance end time of the phoneme of the learning phoneme label; and
a phoneme collapse decision tree learning unit that learns a phoneme collapse decision tree, which is a model for detecting phoneme collapse, from the learning phoneme label, the phoneme collapse flag of the learning phoneme label, and the learning phoneme section speech feature.

A phoneme collapse section detection device comprising: a speech feature generation unit that generates speech features from recognition speech data;
a speech recognition unit that uses the speech features to generate a maximum likelihood phoneme sequence, which is the most likely phoneme sequence for the recognition speech data, and a speech feature sequence, which is the sequence of speech features corresponding to the utterance section of each phoneme included in the maximum likelihood phoneme sequence;
a phoneme matching unit that, using the phoneme collapse decision tree learned by the phoneme collapse detection model learning device according to claim 1, generates, from the maximum likelihood phoneme sequence and the speech feature sequence, a maximum likelihood phoneme sequence with phoneme collapse labels, which is a sequence of per-phoneme matching results in which a phoneme collapse label is attached to each vowel phoneme label in the maximum likelihood phoneme sequence that indicates a collapsed vowel phoneme; and
a phoneme collapse section detection unit that generates, from the maximum likelihood phoneme sequence with phoneme collapse labels, a maximum likelihood phoneme sequence with phoneme collapse sections, in which each phoneme collapse section consists of two or more contiguous phonemes to which the phoneme collapse label is attached.

A phoneme collapse detection model learning method in which learning phoneme section information is information including a phoneme label indicating a phoneme, an utterance start time and an utterance end time of the phoneme, and a phoneme collapse flag that is either a phoneme collapse label indicating that the phoneme is unclear or a label indicating that it is not, the method comprising:
a learning phoneme information extraction step in which a phoneme collapse section detection device extracts, from learning speech data and a sequence of the learning phoneme section information, a learning phoneme label, which is a phoneme label included in the learning phoneme section information that is either a vowel phoneme label indicating a vowel phoneme or a phoneme label associated with a phoneme collapse label, the phoneme collapse flag of the learning phoneme label, and a learning phoneme section speech feature, which is the speech feature corresponding to the section from the utterance start time to the utterance end time of the phoneme of the learning phoneme label; and
a phoneme collapse decision tree learning step of learning a phoneme collapse decision tree, which is a model for detecting phoneme collapse, from the learning phoneme label, the phoneme collapse flag of the learning phoneme label, and the learning phoneme section speech feature.

A phoneme collapse section detection method comprising: a speech feature generation step in which a phoneme collapse section detection device generates speech features from recognition speech data;
a speech recognition step in which the phoneme collapse section detection device uses the speech features to generate a maximum likelihood phoneme sequence, which is the most likely phoneme sequence for the recognition speech data, and a speech feature sequence, which is the sequence of speech features corresponding to the utterance section of each phoneme included in the maximum likelihood phoneme sequence;
a phoneme matching step in which the phoneme collapse section detection device, using the phoneme collapse decision tree learned by the phoneme collapse detection model learning method according to claim 3, generates, from the maximum likelihood phoneme sequence and the speech feature sequence, a maximum likelihood phoneme sequence with phoneme collapse labels, which is a sequence of per-phoneme matching results in which a phoneme collapse label is attached to each vowel phoneme label in the maximum likelihood phoneme sequence that indicates a collapsed vowel phoneme; and
a phoneme collapse section detection step in which the phoneme collapse section detection device generates, from the maximum likelihood phoneme sequence with phoneme collapse labels, a maximum likelihood phoneme sequence with phoneme collapse sections, in which each phoneme collapse section consists of two or more contiguous phonemes to which the phoneme collapse label is attached.

A program for causing a computer to function as the phoneme collapse detection model learning device according to claim 1 or the phoneme collapse section detection device according to claim 2.