JPS6315296A - Voice recognition equipment

Info

Publication number: JPS6315296A
Application number: JP61160201A
Authority: JP
Inventors: 英生瀬川
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1986-07-08
Filing date: 1986-07-08
Publication date: 1988-01-22
Anticipated expiration: 2010-06-14
Also published as: JPH0756597B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION [Object of the Invention] (Industrial Application Field) The present invention relates to a speech recognition device designed to improve the accuracy of word recognition in continuous speech.

(Prior Art) In continuous speech recognition, the items to be recognized are input continuously, word after word, so a technique for extracting the phoneme structure that makes up the input is important.

In this regard, conventional continuous speech recognition divides the input speech signal into fixed analysis intervals called frames, extracts feature information such as a spectrum for each frame, attaches a phoneme label to each frame, and automatically segments the signal into phoneme intervals by merging consecutive frames that carry the same label. The frame length has been fixed at a relatively short value, for example 16 msec or 20 msec, so that phonemes of short duration such as plosive consonants can be identified.

However, if the frame length is set too short, labels change because of power fluctuations caused by external noise and the like, which makes the automatic segmentation described above impossible. In addition, because no template is prepared for the so-called "transition" (watari) portion, the point at which one phoneme changes into another, a frame that contains such a transition cannot be recognized.

This problem can be solved by making the frame length somewhat longer, but a longer frame length lowers the recognition rate for phonemes of short duration such as plosive consonants.

(Problems to be Solved by the Invention) As described above, in conventional continuous speech recognition, a frame length that is too short makes automatic segmentation difficult because of fluctuations in speech power and leaves the "transition" portions at phoneme change points unrecognizable, whereas a long frame length makes it impossible to recognize phonemes of short duration.

Accordingly, an object of the present invention is to provide a speech recognition device that suppresses the influence of speech power fluctuations and "transition" portions on phoneme recognition without lowering the recognition rate for phonemes of short duration, and that therefore offers excellent recognition accuracy and permits automatic segmentation.

[Structure of the Invention] (Means for Solving the Problems) A speech signal is input to a feature extraction unit. The feature extraction unit analyzes the input speech signal frame by frame and extracts the features of each frame, and the extracted per-frame feature information is input to a phoneme recognition unit. The phoneme recognition unit compares the input feature information with a predetermined phoneme dictionary and assigns a phoneme label to each frame based on the comparison result. In the present invention, the phoneme recognition unit performs, in parallel, a plurality of phoneme recognition processes with different frame lengths on the same portion of the speech signal. The phoneme labels obtained in parallel for the respective frame lengths are passed to a word recognition unit.

The word recognition unit compares the sequence of phoneme labels assigned by the phoneme recognition unit with a predetermined word dictionary to obtain a word recognition result; here, it selects the phoneme label of the optimal frame length based on the result of phoneme recognition.

(Operation) According to the present invention, a plurality of phoneme recognition processes with different frame lengths are performed in parallel at each point in time.

Word recognition is then performed using, among the plurality of phoneme labels obtained, for example the phoneme label of the frame length with the best score (similarity, distance, etc.). This is equivalent to changing the frame length adaptively according to the phoneme data, so the word recognition rate can be raised substantially compared with the conventional method in which the frame length is fixed.

(Embodiments) Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a diagram showing the configuration of a continuous speech recognition device according to one embodiment of the present invention.

The A/D conversion unit 1 converts a speech signal input from a speech input unit (not shown), a public telephone line, or the like into a digital signal of about 12 bits at a sampling rate of, for example, about 12 kHz, and outputs it to the feature extraction unit 2.

The feature extraction unit 2 analyzes the input digital speech signal in unit frames of a fixed length, for example about 16 msec, and extracts its features. When spectral analysis is used for feature extraction, for example, the feature extraction unit 2 can be built from a group of band-pass filters (a filter bank). The feature vectors extracted here are input to the parallel conversion unit 3.
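To make the numbers above concrete, the following is a minimal sketch of unit framing, assuming the example values of 12 kHz and 16 msec and assuming non-overlapping frames (the description does not say whether frames overlap). The function name `split_into_frames` is hypothetical.

```python
def split_into_frames(samples, sample_rate_hz=12_000, frame_ms=16):
    """Split a sample sequence into unit frames (assumed non-overlapping)."""
    # 12,000 samples/s * 16 ms = 192 samples per unit frame.
    frame_len = sample_rate_hz * frame_ms // 1000
    # A trailing partial frame is dropped in this sketch.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


frames = split_into_frames(list(range(400)))
print(len(frames), len(frames[0]))
```

Each frame would then be passed to the filter bank to produce one feature vector per unit frame.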

The parallel conversion unit 3 splits the feature vector input for each unit frame into n channels. It consists of n-1 stages of delay circuits 3a, n-1 adders 3b that add the output of each delay circuit 3a to the incoming feature vector, and n-1 coefficient circuits 3c that divide the output of each adder 3b by the number of frames added. The n kinds of feature vectors input to the phoneme recognition unit 4 through these n channels are weighted averages taken over different numbers of frames. The fewer the frames added, the better the vector retains the features of phonemes of short duration such as plosive consonants; the more frames added, the more the influence of power fluctuations and "transition" portions is suppressed by the feature vectors of the preceding stationary phoneme. These feature vectors are input to the phoneme recognition unit 4 in parallel.
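The delay, adder, and coefficient chain described above can be sketched in software as a running sum over the most recent unit-frame vectors. This is only an illustration under the assumption that every frame is weighted equally (sum divided by the number of frames added); the function name `parallel_averages` is hypothetical.

```python
def parallel_averages(history, n):
    """Return n averaged feature vectors from the latest unit-frame vectors.

    history: list of feature vectors (lists of floats), newest last.
    The k-th result (k = 1..n) is the mean of the newest k vectors.
    """
    if len(history) < n:
        raise ValueError("need at least n unit frames of history")
    running = [0.0] * len(history[-1])
    channels = []
    for k in range(1, n + 1):
        # Adder: accumulate the vector delayed by k-1 unit frames.
        running = [r + x for r, x in zip(running, history[-k])]
        # Coefficient circuit: divide by the number of frames added.
        channels.append([r / k for r in running])
    return channels


history = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
print(parallel_averages(history, 3))
```

The first channel is the raw unit-frame vector, and later channels are progressively smoother, which matches the trade-off the paragraph describes.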

The phoneme recognition unit 4 compares each of the n kinds of input feature vectors with the phoneme dictionary and assigns a phoneme label to each feature vector. The similarity to the phoneme dictionary can be calculated with, for example, the composite (multiple) similarity method. Alternatively, the distance from the dictionary templates may be computed instead of the similarity.

These n kinds of phoneme labels are temporarily stored in the phoneme label buffer 6 together with their scores (similarities or distances).

The phoneme label buffer 6 also holds feature vectors generated in the past. From these feature vectors, as shown in FIG. 2, the phoneme label buffer 6 selects a total of 2n-1 kinds of weighted-average feature vectors extending into the past and the future around time t-n, n-1 frames earlier, and outputs them to the label sorting unit 7.

The label sorting unit 7 sorts the scores of these 2n-1 kinds of feature vectors in descending order. The top-ranked phoneme label, for example the one with the highest similarity, is then output to the word recognition unit 8, together with its score, as the phoneme label for time t-n.
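The sorting step can be illustrated with a short sketch. It assumes each candidate is a `(frame_length, label, similarity)` tuple, which is a hypothetical representation chosen for this example; if distances were used instead of similarities, the sort order would be reversed.

```python
def best_label(candidates):
    """Pick the top-ranked candidate by similarity.

    candidates: list of (frame_length, label, similarity) tuples, one per
    averaged feature vector covering the frame of interest.
    """
    # Sort in descending order of similarity, as the label sorting unit does.
    ranked = sorted(candidates, key=lambda c: c[2], reverse=True)
    return ranked[0]


candidates = [(1, "p", 0.9), (2, "a", 0.6), (3, "a", 0.7)]
print(best_label(candidates))
```

Here the one-frame candidate wins, which corresponds to the case of a short plosive.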

The word recognition unit 8 compares the input phoneme label sequence with the word dictionary 9. The comparison is performed by, for example, well-known DP (dynamic programming) matching. As shown in FIG. 3, DP matching places the input phoneme sequence on the horizontal axis and the reference patterns, represented as a network, on the vertical axis, and advances from the starting point one frame at a time, selecting the path that maximizes the sum of similarities to the phoneme labels making up each reference pattern. At time t, the optimal path to every node of the reference pattern is computed; in the next frame, scores are computed for every path permitted by the reference pattern network, and the optimal path at time t+1 is obtained. The reference pattern with the largest similarity sum is then output as the recognition result.
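As a rough illustration of the frame-by-frame DP matching described above, the sketch below makes two simplifying assumptions that are not in the source: each reference pattern is a plain left-to-right phoneme sequence rather than a general network, and the local similarity is 1.0 for matching labels and 0.0 otherwise. The names `dp_match` and `recognize` are hypothetical.

```python
NEG_INF = float("-inf")


def dp_match(labels, template):
    """Best similarity sum aligning per-frame labels to a phoneme sequence.

    Each template phoneme must cover at least one frame; at every frame a
    path either stays on the same phoneme or advances to the next one.
    """
    prev = [NEG_INF] * len(template)
    for t, label in enumerate(labels):
        cur = [NEG_INF] * len(template)
        for j, phoneme in enumerate(template):
            local = 1.0 if label == phoneme else 0.0
            if t == 0:
                # Paths must start at the first phoneme of the template.
                best_prev = 0.0 if j == 0 else NEG_INF
            else:
                best_prev = max(prev[j], prev[j - 1] if j > 0 else NEG_INF)
            if best_prev > NEG_INF:
                cur[j] = best_prev + local
        prev = cur
    # Paths must end at the last phoneme of the template.
    return prev[-1]


def recognize(labels, dictionary):
    """Return the dictionary word whose template scores highest."""
    return max(dictionary, key=lambda word: dp_match(labels, dictionary[word]))


words = {"kai": ["k", "a", "i"], "aka": ["a", "k", "a"]}
print(recognize(["k", "a", "a", "i", "i"], words))
```

A real implementation would use the per-label scores from the phoneme recognition unit instead of the 0/1 similarity assumed here.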

With the speech recognition device of this embodiment configured as described above, the phoneme label of whichever of the n analysis frame lengths gives the highest similarity is adopted at each point in time, so the frame length is in effect changed adaptively according to the phoneme data.

For example, in FIG. 2, if the speech data obtained in the unit frame at time t-n is of short duration, such as a plosive, the phoneme label obtained with the shortest frame length has the highest score, and that label is adopted as the label for the frame.

On the other hand, if the speech data obtained at time t-n is a "transition" portion, the change point between different phonemes, stationary phoneme data is concentrated before and after time t-n. Taking a weighted average over several past or future frames around time t-n therefore suppresses the influence of the "transition" portion in proportion to the number of frames averaged. In this case the phoneme label obtained with a long frame has the higher score and is adopted as the recognition result.

Likewise, when the speech power fluctuates, taking a weighted average with other frames suppresses the effect of the fluctuation, so the phoneme label of a relatively long frame has the higher score.

Automating segmentation is a major technical challenge in continuous speech recognition. With this device, adaptively changing the analysis frame length according to the phoneme data absorbs variations in speech power and the like and prevents label changes caused by phoneme misrecognition, so there is a high probability that the same label is assigned throughout a single phoneme interval.

Merging identical labels therefore makes it easy to automate segmentation, which is an advantage.
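The merging of identical labels into phoneme intervals can be sketched as a run-length grouping of the per-frame labels. This is an illustrative sketch only; the function name `merge_labels` and the `(label, start_frame, end_frame)` output format are assumptions made for the example.

```python
from itertools import groupby


def merge_labels(labels):
    """Merge runs of identical per-frame labels into segments.

    Returns (label, start_frame, end_frame) tuples, with end_frame exclusive.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((label, start, start + length))
        start += length
    return segments


print(merge_labels(["k", "a", "a", "a", "i", "i"]))
```

The more stable the per-frame labels are within a phoneme, the fewer spurious segments this step produces, which is the benefit the paragraph above claims.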

FIG. 4 is a diagram showing the configuration of a speech recognition device according to another embodiment of the present invention. In this figure, parts identical to those in FIG. 1 are given the same reference numerals, and duplicate description is omitted. This device differs from the preceding one in its word recognition unit 11. The word recognition unit 11 computes, in parallel, the similarity between the reference patterns in the word dictionary 9 and the phoneme labels output from the phoneme recognition unit 4 up to each of the times t-1, t-2, ..., t-n. The resulting DP matching results (scores) up to each of the times t-1, t-2, ..., t-n are held in the word DP matching score buffer 12.

DP matching is a method of computing the optimal path to every node at a given time. In this embodiment, as shown in FIG. 5, n kinds of matching are performed at time t: the DP matching result at time t-1 combined with the phoneme label for one frame; the DP matching result at time t-2 combined with the phoneme label of the pattern averaged over two frames (with the score doubled); and so on up to the DP matching result at time t-n combined with the phoneme label of the pattern averaged over n frames (with the score multiplied by n). In this way the optimal path can be found dynamically while the optimal frame length up to time t is also determined dynamically. After this matching has been repeated, the pattern with the largest similarity sum at the end point becomes the recognition result and is sent to the display unit 7.
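The recurrence described above, in which the result at time t-k is combined with a k-frame label whose score is multiplied by k, can be sketched as follows. To keep the example small, the reference pattern network is left out entirely and `seg_score(end, k)` is a hypothetical callback returning the similarity of the k-frame averaged pattern ending at frame `end`; both are simplifying assumptions rather than the patent's full procedure.

```python
def best_cumulative_score(seg_score, num_frames, n):
    """Best score sum over all ways of covering the frames with 1..n-frame units.

    Returns (total_score, list_of_chosen_frame_lengths).
    """
    neg_inf = float("-inf")
    best = [0.0] + [neg_inf] * num_frames   # best[t]: frames [0, t) covered
    choice = [0] * (num_frames + 1)         # frame length chosen to reach t
    for t in range(1, num_frames + 1):
        for k in range(1, min(n, t) + 1):
            # Score of a k-frame unit is weighted by k, as in the description.
            cand = best[t - k] + k * seg_score(t, k)
            if cand > best[t]:
                best[t], choice[t] = cand, k
    lengths = []
    t = num_frames
    while t > 0:                            # backtrack the chosen lengths
        lengths.append(choice[t])
        t -= choice[t]
    return best[num_frames], lengths[::-1]


table = {(1, 1): 0.9, (2, 1): 0.5, (2, 2): 0.4,
         (3, 1): 0.5, (3, 2): 0.8, (3, 3): 0.5}
total, lengths = best_cumulative_score(lambda end, k: table[(end, k)], 3, 3)
print(total, lengths)
```

Multiplying by k keeps the totals comparable across paths, since a k-frame unit stands in for k unit-frame scores.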

In this method DP matching is performed n ways for every frame, but because these are carried out in parallel, processing speed does not drop.

FIG. 6 shows yet another embodiment. In this device, a phoneme duration control unit 21 is added to the device of FIG. 4, in parallel with the word recognition unit 11. The phoneme duration control unit 21 determines a phoneme duration from the label obtained by the unit-frame-length analysis in the phoneme recognition unit 4 and sends it to the word recognition unit 11, which performs DP matching with a frame length matching the duration appropriate to each label.

With this device, when the unit-frame-length analysis yields a phoneme label whose duration is expected to be short, such as "p" or "t", the optimal path is found using the DP matching result up to time t-1 and the phoneme label for one frame. Conversely, for a phoneme whose duration is expected to be long, such as a vowel, the optimal path is found using the DP matching result up to time t-n and the phoneme label for n frames. In other words, non-stationary phonemes of short duration are recognized with a short frame length, and stationary phonemes of long duration are recognized with a long frame length. The durations set in the phoneme duration control unit 21 may be determined from empirically obtained probability distributions of the duration of each phoneme.
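One way to picture the duration control is a lookup from phoneme label to the number of unit frames closest to that phoneme's expected duration. The sketch below is an assumption-laden illustration: the durations in `EXPECTED_DURATION_MS` are made-up placeholder values, not figures from the patent, and the function name, the 16 msec unit frame, and the limit n = 5 are likewise hypothetical.

```python
# Placeholder durations in msec; real values would come from empirically
# measured duration distributions for each phoneme.
EXPECTED_DURATION_MS = {"p": 15, "t": 20, "a": 90, "i": 80}


def frames_for_label(label, unit_frame_ms=16, n=5, default_ms=48):
    """Choose the frame count (1..n) whose length is closest to the expected duration."""
    expected = EXPECTED_DURATION_MS.get(label, default_ms)
    return min(range(1, n + 1), key=lambda k: abs(k * unit_frame_ms - expected))


for label in ("p", "a"):
    print(label, frames_for_label(label))
```

With these placeholder values a plosive maps to a single unit frame and a vowel maps to the longest available frame length, mirroring the behavior described above.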

[Effects of the Invention] As described above, according to the present invention, analysis frames of various lengths are prepared, phoneme recognition is performed in parallel for each frame length, and the frame length that gives the best recognition result is adopted, so the analysis frame length can be changed adaptively according to the input phoneme. Consequently, the influence of speech power fluctuations and "transition" portions on phoneme recognition can be suppressed without lowering the recognition rate for phonemes of short duration, and a speech recognition device with excellent recognition accuracy that permits automatic segmentation can be provided.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing the configuration of a speech recognition device according to one embodiment of the present invention; FIG. 2 is a diagram for explaining the contents of the phoneme label buffer in that device; FIG. 3 is a diagram for explaining the operation of the word recognition unit in that device; FIG. 4 is a block diagram showing the configuration of a speech recognition device according to another embodiment of the present invention; FIG. 5 is a diagram for explaining the operation of the word recognition unit in that device; and FIG. 6 is a block diagram showing the configuration of a speech recognition device according to yet another embodiment of the present invention.

1: A/D conversion unit, 2: feature extraction unit, 3: parallel conversion unit, 4: phoneme recognition unit, 5: phoneme dictionary, 6: phoneme label buffer, 7: label sorting unit, 8, 11: word recognition unit, 9: word dictionary, 12: word DP matching score buffer, 21: phoneme duration control unit.

Agent for the applicant: Patent Attorney Takehiko Suzue

Claims

[Claims]

(1) A speech recognition device comprising: a feature extraction unit that analyzes an input speech signal frame by frame and extracts the features of each frame; a phoneme recognition unit that compares the per-frame feature information extracted by the feature extraction unit with a predetermined phoneme dictionary and assigns a phoneme label to each frame based on the comparison result; and a word recognition unit that compares the sequence of phoneme labels assigned by the phoneme recognition unit with a predetermined word dictionary to obtain a word recognition result, characterized in that the phoneme recognition unit performs, in parallel, a plurality of phoneme recognition processes with different frame lengths on the same speech signal, and the word recognition unit selects, based on the result of phoneme recognition, the phoneme label to be used for word recognition from among the plurality of phoneme labels of different frame lengths.

(2) The speech recognition device according to claim 1, wherein the word recognition unit performs word recognition using, among the plurality of phoneme labels obtained in parallel by the phoneme recognition unit, the phoneme label of the frame length with the highest similarity to the phoneme dictionary.

(3) The speech recognition device according to claim 1, wherein the word recognition unit outputs the word recognition result obtained with the phoneme label of the frame length that gives the highest similarity to the word dictionary when word recognition is performed using the plurality of phoneme labels obtained in parallel by the phoneme recognition unit.

(4) The speech recognition device according to claim 1, wherein the word recognition unit performs word recognition using, among the plurality of phoneme labels obtained in parallel by the phoneme recognition unit, the phoneme label of the frame length closest to the phoneme duration predicted from the recognition result of the phoneme recognition unit.