JP2002116788A

JP2002116788A - Speech recognizing device and method

Info

Publication number: JP2002116788A
Application number: JP2000305386A
Authority: JP
Inventors: Masanori Ihara; 正典伊原; Ryuichi Oka; 隆一岡
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2000-10-04
Filing date: 2000-10-04
Publication date: 2002-04-19

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognizing device and method capable of improving recognition accuracy at phoneme boundaries. SOLUTION: For each phoneme segment used as a standard pattern, a label of the phoneme that contains the segment and a label indicating the position of the segment are attached to the segment's features and stored in advance in an HDD 40. A CPU 10 integrates runs of consecutive identical phoneme labels among the labels obtained by recognizing the phoneme segments, thereby producing the phoneme recognition result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a phoneme recognition apparatus and method that perform phoneme recognition using the recognition results of phoneme segments.

[0002]

[Prior Art] Conventionally, the acoustic features of speech are acquired at phoneme-segment length (the length of the multiple phoneme segments into which one phoneme is decomposed), and these phoneme-segment-length features are prepared in advance together with identification information indicating the content of the phoneme (generally called a label). When speech recognition is performed, the phoneme-segment-length features obtained from the speech to be recognized are compared with the features of labeled phoneme segments prepared in advance (so-called standard patterns). The distance between the two features is used for this comparison. The features of the phoneme to be recognized are compared with the features of a plurality of standard patterns, and the label of the standard pattern with the smallest distance is determined as the speech recognition result. Speech recognition methods that correct phoneme recognition results based on statistical occurrence probabilities have also been proposed.

[0003] For example, as shown in Table 1 below, when the utterance "kaisatsu o" (かいさつを) is learned, the acoustic features of the speech are decomposed into phoneme-segment units, and labels such as "k", "k", "a", "a", and so on are attached to the features in phoneme-segment units.

[0004] If, for the features of a phoneme segment extracted from the speech to be recognized, there is no standard pattern whose distance calculation result falls within the allowable range, the recognition result for that phoneme segment is treated as unrecognizable (denoted by "?" in Table 1).
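The conventional matching described above can be sketched as a nearest-pattern search with a rejection threshold. This is only an illustrative sketch: the function name `nearest_label`, the use of Euclidean distance, the two-dimensional features, and the threshold value are all assumptions, since the source does not specify the feature representation or the distance measure.

```python
import math

def nearest_label(feature, standard_patterns, max_distance):
    """Return the label of the closest standard pattern, or "?" if none is in range."""
    best_label, best_dist = "?", math.inf
    for label, pattern in standard_patterns:
        dist = math.dist(feature, pattern)  # Euclidean distance (an assumption)
        if dist < best_dist:
            best_label, best_dist = label, dist
    # Reject the match when even the closest pattern is outside the allowable range.
    return best_label if best_dist <= max_distance else "?"

# Made-up 2-D features standing in for real acoustic features.
patterns = [("k", (0.0, 1.0)), ("a", (1.0, 0.0))]
print(nearest_label((0.1, 0.9), patterns, max_distance=0.5))  # k
print(nearest_label((5.0, 5.0), patterns, max_distance=0.5))  # ?
```

The second call shows the unrecognizable case: no standard pattern lies within the allowable range, so "?" is returned.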

[0005]

[Table 1]

[0006]

[Problem to Be Solved by the Invention] Even a phoneme that the conventional method labels as a single type shows, when its features are examined, a large variance despite some bias, and therefore tends to be misrecognized at phoneme boundaries.

[0007] Accordingly, an object of the present invention is to provide a phoneme recognition apparatus and method capable of improving recognition accuracy at phoneme boundaries.

[0008]

[Means for Solving the Problem] To achieve this object, the invention of claim 1 is a phoneme recognition apparatus that performs speech recognition on speech to be recognized in units of phoneme segments and performs phoneme recognition using the speech recognition results, the apparatus comprising: storage means that stores a plurality of sets, each associating a first feature in phoneme-segment units, extracted from speech whose utterance content is known in advance, with a label indicating the position of that feature and a label of the phoneme containing that phoneme segment; feature extraction means that extracts second features in phoneme-segment units from the speech to be recognized; phoneme segment recognition means that detects, from among the plurality of sets of first features stored in the storage means, the first feature most similar to the second feature extracted by the feature extraction means; label extraction means that extracts the phoneme label associated with the detected first feature; label integration means that integrates consecutive identical labels among the extracted phoneme labels into one label; and output means that outputs the integrated labels as the phoneme recognition result.

[0009] The invention of claim 2 is the phoneme recognition apparatus according to claim 1, wherein the feature extraction means extracts the first features from speech to be learned, and the apparatus further comprises input means for inputting the label indicating the position and the phoneme label to be associated with each first feature.

[0010] The invention of claim 3 is the phoneme recognition apparatus according to claim 1, wherein a phoneme is divided into three parts, a front part, a central part, and a rear part, and for a phoneme segment contained in one of the divided parts, a label indicating which of the three parts it is located in is associated with the first feature.

[0011] The invention of claim 4 is a phoneme recognition method that performs speech recognition on speech to be recognized in units of phoneme segments and performs phoneme recognition using the speech recognition results, the method comprising: storing in storage means a plurality of sets, each associating a first feature in phoneme-segment units, extracted from speech whose utterance content is known in advance, with a label indicating the position of that feature and a label of the phoneme containing that phoneme segment; extracting second features in phoneme-segment units from the speech to be recognized; detecting, from among the plurality of sets of first features stored in the storage means, the first feature most similar to the extracted second feature; extracting the phoneme label associated with the detected first feature; integrating consecutive identical labels among the extracted phoneme labels into one label; and outputting the integrated labels as the phoneme recognition result.

[0012] The invention of claim 5 is the phoneme recognition method according to claim 4, wherein first features are extracted from speech to be learned, a label indicating the position and a phoneme label to be associated with each first feature are input, and the input labels and the features extracted from the speech to be learned are stored in the storage means in association with each other.

[0013] The invention of claim 6 is the phoneme recognition method according to claim 4, wherein a phoneme is divided into three parts, a front part, a central part, and a rear part, and for a phoneme segment contained in one of the divided parts, a label indicating which of the three parts it is located in is associated with the first feature.

[0014]

[Embodiments of the Invention] Embodiments of the present invention are described in detail below with reference to the drawings.

[0015] First, a speech recognition method to which the present invention is applied is described.

[0016] In the phoneme recognition method of this embodiment, phonemes labeled as a single type are classified according to their features and features are generated again, which yields finer-grained features, while phonemes carrying the same type of label are recognized as the same phoneme, thereby improving the recognition rate.

[0017] For example, the phoneme "a" (あ) is labeled "ha", "ma", "ta" and the phoneme "i" (い) is labeled "hi", "mi", "ti"; that is, each phoneme receives multiple labels corresponding to the front, middle, and rear of the position at which the sound appears, and learning is performed on multiple sets of phoneme words. Here, h is a label denoting the front part of a phoneme, m is a label denoting the central part of a phoneme, and t is a label denoting the rear part of a phoneme. Each of these labels is followed by a label indicating the phoneme content, in this case a.
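The label notation above can be illustrated with a small sketch that builds position-plus-phoneme labels for the segments of one phoneme. Note the assumptions: in the embodiment the labels are typed in by a user, and the source does not say how the front, central, and rear parts are delimited; the equal-thirds split and the function name `position_labels` are hypothetical choices made only for illustration.

```python
def position_labels(phoneme, n_segments):
    """Label the segments of one phoneme as front (h), middle (m), or rear (t).

    Assumption: the phoneme is split into three roughly equal parts.
    """
    labels = []
    for i in range(n_segments):
        if i * 3 < n_segments:
            pos = "h"          # front part of the phoneme
        elif i * 3 < 2 * n_segments:
            pos = "m"          # central part of the phoneme
        else:
            pos = "t"          # rear part of the phoneme
        labels.append(pos + phoneme)  # position label followed by phoneme content
    return labels

print(position_labels("a", 6))  # ['ha', 'ha', 'ma', 'ma', 'ta', 'ta']
print(position_labels("i", 3))  # ['hi', 'mi', 'ti']
```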

[0018] Using this label notation, learning that attaches the above labels to the acoustic features of phoneme segments is performed in advance.

[0019] When speech recognition is performed using these learning results (features of labeled phoneme segments), results such as those in Table 2 below are obtained.

[0020]

[Table 2]

[0021] In this embodiment, the first phoneme segment to be recognized, for example the segment corresponding to "k" in the above example, is compared with a plurality of sets of standard patterns, and the most similar (smallest-distance) label is obtained; in this case, the label string "hk", which indicates the position of the phoneme segment and the phoneme content. Likewise, the label string "ma" of the standard pattern most similar to the next phoneme segment to be recognized is obtained.

[0022] In this way, speech recognition processing is applied to the speech to be recognized to obtain phoneme labels that include the positions of the phoneme segments. In the resulting label sequence arranged in time series, labels whose phoneme content is identical and which occur consecutively are merged into one. In the example of Table 2 above, "hk" and "mk" share "k" and are consecutive, so "k" is finally determined as the recognition result for the phoneme containing those phoneme segments. When phoneme labels are obtained in this way, even if "ma" is misrecognized as "ta", as in the 4th and 12th frames, both belong to the same phoneme category a, so the recognition result can be processed as "a".
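The merging idea can be sketched compactly as follows. The sketch assumes each label string is one position character (h, m, or t) followed by the phoneme content; the sample label sequence is invented for illustration and is not the content of Table 2, and `merge_labels` is a hypothetical name.

```python
from itertools import groupby

def merge_labels(label_sequence):
    """Collapse runs of segment labels that share the same phoneme content."""
    # Drop the leading position character and keep only the phoneme content.
    phonemes = [label[1:] for label in label_sequence]
    # groupby yields one group per run of consecutive equal items.
    return [phoneme for phoneme, _run in groupby(phonemes)]

# "ta" appears where "ma" would be expected, i.e. a position error.
sequence = ["hk", "mk", "ha", "ta", "ma", "ta", "hi", "mi", "ti"]
print(merge_labels(sequence))  # ['k', 'a', 'i']
```

Because only the phoneme content is compared, the position error inside the run of "a" labels has no effect on the merged result.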

[0023] A phoneme recognition apparatus that uses the phoneme recognition method described above is described next. FIG. 1 shows the circuit configuration of the phoneme recognition apparatus of this embodiment.

[0024] A general-purpose computer such as a personal computer can be used as the phoneme recognition apparatus, so the hardware configuration is described only briefly.

[0025] In FIG. 1, a CPU 10, a system memory 20, an input/output interface (I/O) 30, a hard disk (HDD) 40, an input device 50, a display device 60, and so on are connected by a bus. The CPU 10 performs phoneme recognition according to the present invention using a phoneme recognition program, described later, stored on the hard disk 40. The system memory 20 temporarily stores various data related to phoneme recognition. The I/O 30 receives, from a microphone, speech to be learned and speech to be recognized.

[0026] The hard disk 40 stores the phoneme recognition program, an operating system (OS) that controls the entire system, and the like. The input device 50 has a mouse and a keyboard and is used to input information. The display device 60 displays information input from the input device 50, phoneme recognition results, and the like.

[0027] The operation of the phoneme recognition apparatus is described with reference to FIGS. 2 and 3. FIG. 2 shows the content of the program for phoneme learning. FIG. 3 shows the content of the program for phoneme recognition.

[0028] (Learning process) The speech (signal) input from the microphone is temporarily stored in the system memory 20; its acoustic features are then analyzed in the conventional manner, and the analysis results in phoneme-segment units are temporarily stored in the system memory 20 (steps S100 → S110 in FIG. 2).

[0029] The analysis results are displayed in time series on the display device 60. The user inputs a label for each acoustic analysis result from the keyboard of the input device 50. This label is the one shown in the learning column of Table 2 (more precisely, a label consisting of a label indicating the position of the phoneme segment and a label indicating the content of the phoneme containing that segment). The CPU 10 stores the input label and the analysis result of the phoneme segment on the hard disk 40 as one record. This processing is performed for the multiple phoneme segments extracted from the input speech (steps S120 → S130 in FIG. 2).
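The records produced by the learning process might be represented as in the following sketch. The class name `LabeledFeature`, its field names, the helper `build_records`, and the assumption that a typed label is one position character followed by the phoneme content are all hypothetical; the source only states that a label and an analysis result are stored together as one record.

```python
from dataclasses import dataclass

@dataclass
class LabeledFeature:
    feature: tuple   # acoustic analysis result for one phoneme segment
    position: str    # "h" (front), "m" (central), or "t" (rear)
    phoneme: str     # content of the phoneme containing the segment, e.g. "k"

def build_records(features, typed_labels):
    """Pair each analysis result with the label typed for it (one record per segment)."""
    if len(features) != len(typed_labels):
        raise ValueError("one label is required per phoneme segment")
    return [
        LabeledFeature(tuple(feature), label[0], label[1:])
        for feature, label in zip(features, typed_labels)
    ]

# Made-up analysis results and the labels a user might type for them.
records = build_records([(0.1, 0.2), (0.3, 0.4)], ["hk", "mk"])
print(records[1])
```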

[0030] (Phoneme recognition) For phoneme recognition, the speech to be recognized is input from the microphone. The input speech is temporarily stored in the system memory (step S200 in FIG. 3). The input speech is acoustically analyzed in phoneme-segment units in the conventional manner (step S210 in FIG. 3).

[0031] The features in the acoustic analysis results are compared with the labeled features (standard patterns) stored on the hard disk 40, and the most similar labeled feature is detected. The label is extracted from the detected labeled feature and stored in time series in the system memory 20 (see the labels in the "Recognition" column of Table 2). The labels arranged in time series are referred to as a label sequence. Using the program of FIG. 4, described later, the CPU 10 integrates labels in the label sequence that are consecutive and have the same phoneme label into one label (step S220 in FIG. 3). The sequence of integrated labels is displayed on the display device 60 as the phoneme recognition result, or printed by a printer (not shown) (see the labels in the "Final recognition" column of Table 2; step S230 in FIG. 3).

[0032] (Label integration) One processing method for detecting, in the label sequence, portions with consecutive identical phoneme labels is described with reference to FIG. 4.

[0033] When control passes to the program of FIG. 4, initialization is performed first. Here, a provisional previous label, to be compared with the first label, is set (step S300 in FIG. 4). A label symbol that cannot occur in practice is used for this label.

[0034] Next, the first label in the label sequence obtained by the recognition processing, that is, the label of the first frame ("hk" in the example of Table 2), is read from the system memory 20, and the label "k" indicating the phoneme content is extracted from it (step S310 in FIG. 4). The extracted "k" is compared for a match with the provisional previous label set at initialization. Because this comparison yields a mismatch, the compared label "k" is temporarily stored in the system memory 20 as a label to be integrated (steps S320 → S330 in FIG. 4).

[0035] Because not all labels have been processed, the procedure moves through steps S330 → S340 → S310 in FIG. 4. The recognition result of the second frame, "tk" in Table 2, is then read from the system memory 20, and the phoneme label "k" is extracted. This label "k" is compared with the label "k" extracted from the first-frame label in the previous iteration (step S320 in FIG. 4). This determination yields YES (match), so the procedure returns through steps S320 → S340 → S310. In this way, each time the phoneme label at a given point in time is extracted from the label sequence of the recognition results, it is compared with the phoneme label extracted at the previous point in time (one frame earlier), and consecutive identical phoneme labels are thereby detected. When the label does not match the previous one, as with the phoneme labels of the 5th and 6th frames in Table 2, the current label is judged to be an utterance of a phoneme segment with a different label, and that label name is temporarily stored in the system memory 20 (step S330 in FIG. 4).

[0036] When the previous label and the current label match, the current label is not stored in the system memory 20. Therefore, when the label sequence of the recognition results contains a portion in which the same label name continues consecutively, this procedure detects only the first label of that portion and stores it in the system memory 20, so that multiple consecutive identical phoneme labels are integrated into one phoneme label.

[0037] When the above processing has been applied to the entire label sequence, the system memory 20 holds label integration results such as "ka", "i", "sa", and so on, as shown in the final recognition column of Table 2.
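The procedure of FIG. 4 as described in the preceding paragraphs can be sketched step by step as follows. This is an illustrative reading of the text, not the program of the embodiment: the function name `integrate`, the use of a Python object as the impossible initial label, the assumed label layout (one position character followed by the phoneme content), and the sample input are assumptions.

```python
# S300: a provisional "previous label" that can never equal a real label.
SENTINEL = object()

def integrate(label_sequence):
    """Keep only the first phoneme label of each run of identical phoneme labels."""
    previous = SENTINEL
    merged = []
    for label in label_sequence:      # S340: loop until every label is processed
        phoneme = label[1:]           # S310: extract the phoneme-content part
        if phoneme != previous:       # S320: compare with the previous phoneme label
            merged.append(phoneme)    # S330: mismatch, so store the current label
        previous = phoneme            # the current label becomes the previous one
    return merged

print(integrate(["hk", "tk", "ma", "ta", "hi"]))  # ['k', 'a', 'i']
```

Using a sentinel for the first comparison means the first frame needs no special case: it always mismatches and is therefore always stored.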

[0038] In addition to the embodiment described above, the following forms can be implemented.

[0039] 1) The embodiment described above includes a learning function, that is, a function that extracts features of speech whose utterance content is known in advance, labels the extracted features with their position and content, and stores the labeled features in the hard disk storage device. However, the learning function and the phoneme recognition function may be separated. In that case, labeled feature data is supplied to the phoneme recognition apparatus from outside. Various input methods can be used, such as communication or a portable recording medium.

[0040] 2) The program that extracts acoustic features during learning and the program that extracts acoustic features during phoneme recognition can be shared.

[0041]

[Effects of the Invention] As described above, according to the present invention, even if phoneme recognition by phoneme segments misrecognizes a segment as a phoneme segment at another position, the label of the phoneme containing that segment is associated with the first feature. Therefore, by integrating identical consecutive portions of the phoneme segment recognition results (the phoneme labels of the first features most similar to the second features), correct phoneme recognition can be performed. As a result, a closer phoneme can be output as the phoneme recognition result even in a speech portion that spans phonemes, called a transition section, or in a portion called a flat section, where features differ within the same phoneme.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the circuit configuration of an embodiment of the present invention.

FIG. 2 is a flowchart showing the content of the learning program of the embodiment of the present invention.

FIG. 3 is a flowchart showing the content of the phoneme recognition processing program of the embodiment of the present invention.

FIG. 4 is a flowchart showing details of the phoneme recognition processing of the embodiment of the present invention.

[Explanation of symbols]

10 CPU
20 System memory
30 I/O
40 HDD
50 Input device
60 Display device

Claims

1. A phoneme recognition apparatus that performs speech recognition on speech to be recognized in units of phoneme segments and performs phoneme recognition using the speech recognition results, comprising: storage means that stores a plurality of sets, each associating a first feature in phoneme-segment units, extracted from speech whose utterance content is known in advance, with a label indicating the position of that feature and a label of the phoneme containing that phoneme segment; feature extraction means that extracts second features in phoneme-segment units from the speech to be recognized; phoneme segment recognition means that detects, from among the plurality of sets of first features stored in the storage means, the first feature most similar to the second feature extracted by the feature extraction means; label extraction means that extracts the phoneme label associated with the detected first feature; label integration means that integrates consecutive identical labels among the extracted phoneme labels into one label; and output means that outputs the integrated labels as the phoneme recognition result.

2. The phoneme recognition apparatus according to claim 1, wherein the feature extraction means extracts the first features from speech to be learned, and the apparatus further comprises input means for inputting the label indicating the position and the phoneme label to be associated with each first feature.

3. The phoneme recognition apparatus according to claim 1, wherein a phoneme is divided into three parts, a front part, a central part, and a rear part, and for a phoneme segment contained in one of the divided parts, a label indicating which of the three parts it is located in is associated with the first feature.

4. A phoneme recognition method that performs speech recognition on speech to be recognized in units of phoneme segments and performs phoneme recognition using the speech recognition results, comprising: storing in storage means a plurality of sets, each associating a first feature in phoneme-segment units, extracted from speech whose utterance content is known in advance, with a label indicating the position of that feature and a label of the phoneme containing that phoneme segment; extracting second features in phoneme-segment units from the speech to be recognized; detecting, from among the plurality of sets of first features stored in the storage means, the first feature most similar to the extracted second feature; extracting the phoneme label associated with the detected first feature; integrating consecutive identical labels among the extracted phoneme labels into one label; and outputting the integrated labels as the phoneme recognition result.

5. The phoneme recognition method according to claim 4, wherein first features are extracted from speech to be learned, a label indicating the position and a phoneme label to be associated with each first feature are input, and the input labels and the features extracted from the speech to be learned are stored in the storage means in association with each other.

6. The phoneme recognition method according to claim 4, wherein a phoneme is divided into three parts, a front part, a central part, and a rear part, and for a phoneme segment contained in one of the divided parts, a label indicating which of the three parts it is located in is associated with the first feature.