JP2905686B2

JP2905686B2 - Voice recognition device

Info

Publication number: JP2905686B2
Application number: JP6050294A
Authority: JP
Inventors: 寿幸竹澤; 逞森元
Original assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Current assignee: Ei Tei Aaru Onsei Honyaku Tsushin Kenkyusho Kk
Priority date: 1994-03-22
Filing date: 1994-03-22
Publication date: 1999-06-14
Anticipated expiration: 2014-06-14
Also published as: JPH07261782A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] [Field of Industrial Application] The present invention relates to a speech recognition apparatus, and more particularly to a speech recognition apparatus that detects silent sections and the like, such as pauses (silent sections) or redundant words (fillers), in uttered speech and performs speech recognition continuously. In this specification, the term "silent sections and the like" covers pauses, redundant words, and breaks cued by prosodic information or similar information.

[0002] [Prior Art] In recent years, continuous speech recognition has been actively studied, and several research institutions have built sentence speech recognition systems. Most of these systems take carefully uttered speech as their input. In human-to-human communication, however, redundant words typified by 「あのー」 (ano-) and 「えーと」 (eeto) appear frequently, as do hesitations, misstatements, and restatements, which involve pauses such as silent sections in which speech is temporarily absent.

[0003] FIG. 2 shows the speech recognition operation of a conventional continuous speech recognition apparatus in stack form. It depicts the course of continuous speech recognition when the speaker says 「会議に申し込みます」 (kaigi ni moushikomimasu, "I will apply for the conference") and four candidates are output: 「会議に申し込みます」 (kaigi ni moushikomimasu, "I will apply for the conference"), 「会議に申し上げます」 (kaigi ni moushiagemasu, "I will tell the conference"), 「会員に申し込みます」 (kaiin ni moushikomimasu, "I will apply to the member"), and 「会員に申し上げます」 (kaiin ni moushiagemasu, "I will tell the member").

[0004] First, the sound "ka" is recognized and pushed onto the stack as a character. Next, the sound "i" is recognized and pushed as a character. After that, both the sound "gi" and the sounds "i" and "n" are recognized, so the character stack is split in two and each sequence is pushed onto its own stack. Because "kaigi" (conference) and "kaiin" (member) are both listed in the speech recognition dictionary, each is converted to the symbol "n", which denotes a noun. Next, "ni" is recognized and, after dictionary lookup, converted to the symbol "p", which denotes a particle. The noun followed by the particle is then converted to the symbol "NP", which denotes a noun phrase. Because "kaigi ni" and "kaiin ni" both become the noun phrase "NP", the same candidates "moushikomimasu" and "moushiagemasu" can follow each of them.
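The stack behaviour described in paragraph [0004] can be sketched as follows. This is a minimal illustration, not the apparatus itself: the toy lexicon, the single rule `n p -> NP`, and the function names are assumptions introduced here, and whole words stand in for the per-sound stacking of FIG. 2.

```python
# Toy lexicon and grammar (hypothetical, for illustration only).
LEXICON = {"kaigi": "n", "kaiin": "n", "ni": "p"}
RULES = {("n", "p"): "NP"}  # noun + particle -> noun phrase


def reduce_stack(words):
    """Push each word's category, reducing the top two symbols when a rule applies."""
    stack = []
    for word in words:
        stack.append(LEXICON[word])
        while len(stack) >= 2 and tuple(stack[-2:]) in RULES:
            category = RULES[tuple(stack[-2:])]
            del stack[-2:]
            stack.append(category)
    return stack


def conventional_expansions(prefixes, continuations):
    """Without merging, every prefix is extended by every continuation separately."""
    return [prefix + [c] for prefix in prefixes for c in continuations]


prefixes = [["kaigi", "ni"], ["kaiin", "ni"]]
continuations = ["moushikomimasu", "moushiagemasu"]
candidates = conventional_expansions(prefixes, continuations)
```

Both prefixes reduce to the same symbol `NP`, yet the conventional procedure still extends each of them with both continuations, which yields the four candidates of FIG. 2.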

[0005] [Problem to Be Solved by the Invention] However, when the conventional continuous speech recognition apparatus handles a long utterance that contains breaks such as silent sections, it must, as recognition proceeds, process exactly the same following candidates separately for each of several candidates that function identically in syntactic terms. In other words, the conventional apparatus has to process these identical candidates individually, so the amount of processing increases needlessly.

[0006] An object of the present invention is to solve the above problem and to provide a speech recognition apparatus that greatly reduces the amount of processing compared with the conventional example and thereby increases the speed of speech recognition.

[0007] [Means for Solving the Problem] The speech recognition apparatus according to claim 1 of the present invention is a speech recognition apparatus having speech recognition means for recognizing an input uttered speech sentence consisting of a character string. The apparatus comprises detection means for detecting, based on the input uttered speech sentence, at least one of a pause, a redundant word, and a phrase or clause boundary, and for outputting a detection signal. The speech recognition means performs speech recognition processing using the LR method with hidden Markov models. When the detection signal is input, it links and merges, among the cells used in the LR method with hidden Markov models, those cells whose topmost state-stack content is identical (the state stack indicating the speech recognition result candidates), thereby compressing a plurality of speech recognition candidates that function identically in syntactic terms into a single speech recognition candidate before continuing the speech recognition processing.

[0008]

[0009] Further, the speech recognition apparatus according to claim 2 is the apparatus according to claim 1, wherein the detection means detects the pause by detecting that at least one of two conditions is satisfied: a first condition that the power of the uttered speech sentence remains at or below a predetermined threshold for a predetermined range of time, and a second condition that the number of zero crossings of the uttered speech sentence is at or above a predetermined threshold during a predetermined time. The speech recognition apparatus according to claim 3 is the apparatus according to claim 1, wherein the detection means detects the redundant word by determining whether the input matches prestored language models of a plurality of redundant words. The speech recognition apparatus according to claim 4 is the apparatus according to claim 1, wherein the detection means detects the phrase or clause boundary by detecting that the fundamental frequency of the uttered speech sentence has changed by rising or falling sharply, at or above a predetermined slope.

[0010] [Operation] In the speech recognition apparatus according to claim 1, the detection means detects, based on the input uttered speech sentence, at least one of a pause, a redundant word, and a phrase or clause boundary, and outputs a detection signal. The speech recognition means performs speech recognition processing using the LR method with hidden Markov models. When the detection signal is input, it links and merges, among the cells used in the LR method with hidden Markov models, those cells whose topmost state-stack content is identical (the state stack indicating the speech recognition result candidates), thereby compressing a plurality of speech recognition candidates that function identically in syntactic terms into a single speech recognition candidate before continuing the speech recognition processing.

[0011] Further, in the speech recognition apparatus according to claim 2, the detection means preferably detects the pause by detecting that at least one of two conditions is satisfied: a first condition that the power of the uttered speech sentence remains at or below a predetermined threshold for a predetermined range of time, and a second condition that the number of zero crossings of the uttered speech sentence is at or above a predetermined threshold during a predetermined time. In the speech recognition apparatus according to claim 3, the detection means preferably detects the redundant word by determining whether the input matches prestored language models of a plurality of redundant words. In the speech recognition apparatus according to claim 4, the detection means preferably detects the phrase or clause boundary by detecting that the fundamental frequency of the uttered speech sentence has changed by rising or falling sharply, at or above a predetermined slope.

[0012] [Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a continuous speech recognition apparatus according to one embodiment of the present invention. The apparatus of this embodiment is an SSS (Successive State Splitting)-LR (left-to-right rightmost derivation) speaker-independent continuous speech recognition apparatus. The phoneme matching unit 4 performs phoneme matching using the hidden Markov models (hereinafter, HMMs) stored in the hidden Markov network (hereinafter, HM network) memory 11 and sends the resulting speech recognition score to the phoneme-context-dependent LR parser (hereinafter, LR parser) 5. In response, the LR parser 5 performs continuous speech recognition and sends phoneme prediction data to the phoneme matching unit 4, and speech recognition proceeds in this way.

In particular, this embodiment includes a silent-section detection unit 30 that, based on the time series of feature parameters output from the buffer memory 3, detects silent sections and the like, including pauses, redundant words, and breaks cued by prosodic information, and outputs a detection signal to the LR parser 5. In response, each time the detection signal is input, the LR parser 5 performs speech recognition while compressing a plurality of speech recognition result candidates that function identically in syntactic terms into a single candidate.

In SSS, speech is modeled by a probabilistic model that represents the temporal evolution of the speech parameters as probabilistic transitions among probabilistic stationary signal sources (states) allocated on the phoneme feature space. The model is refined successively by repeating an operation that splits individual states in the context direction or the time direction according to a likelihood-maximization criterion.

[0013] In FIG. 1, the speaker's uttered speech is input to the microphone 1, converted into a speech signal, and then input to the feature extraction unit 2. After A/D converting the input speech signal, the feature extraction unit 2 performs, for example, LPC analysis and extracts 34-dimensional feature parameters comprising the log power, 16th-order cepstral coefficients, the delta log power, and 16th-order delta cepstral coefficients. The time series of extracted feature parameters is input to the phoneme matching unit 4 via the buffer memory 3.
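The dimension count in paragraph [0013] (1 + 16 + 1 + 16 = 34) can be illustrated with a short sketch. The patent does not say how the delta terms are computed; the frame-to-frame difference used below, and the function name `make_feature`, are assumptions made for illustration.

```python
def make_feature(log_power, cepstrum, prev_log_power, prev_cepstrum):
    """Assemble one 34-dimensional feature vector from two consecutive frames.

    Assumption: each delta term is the simple difference from the previous frame.
    """
    if len(cepstrum) != 16 or len(prev_cepstrum) != 16:
        raise ValueError("expected 16 cepstral coefficients per frame")
    delta_power = log_power - prev_log_power
    delta_cepstrum = [c - p for c, p in zip(cepstrum, prev_cepstrum)]
    # Order follows the text: log power, cepstrum, delta log power, delta cepstrum.
    return [log_power] + list(cepstrum) + [delta_power] + delta_cepstrum


frame = make_feature(-3.0, [0.1] * 16, -3.5, [0.0] * 16)
```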

[0014] The HM network in the HM network memory 11, which is connected to the phoneme matching unit 4, is represented as a plurality of networks whose nodes are states. Each state holds the following information:
(a) state number
(b) acceptable context class
(c) lists of preceding states and succeeding states
(d) parameters of the output probability density distribution
(e) self-transition probability and transition probabilities to succeeding states
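The per-state record listed in (a) to (e) can be pictured as a small data structure. The sketch below is only one possible layout; the class name, field names, and the normalization check are assumptions, not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HMNetState:
    number: int                                                   # (a) state number
    context_class: str                                            # (b) acceptable context class
    predecessors: List[int] = field(default_factory=list)         # (c) preceding states
    successors: List[int] = field(default_factory=list)           # (c) succeeding states
    output_params: Dict[str, list] = field(default_factory=dict)  # (d) output density parameters
    self_transition: float = 0.0                                  # (e) self-transition probability
    next_transitions: Dict[int, float] = field(default_factory=dict)  # (e) to succeeding states

    def transitions_normalized(self, tol: float = 1e-9) -> bool:
        """Check that outgoing probabilities sum to 1 (a standard HMM assumption)."""
        total = self.self_transition + sum(self.next_transitions.values())
        return abs(total - 1.0) <= tol


state = HMNetState(number=3, context_class="a-k+i", predecessors=[2], successors=[4],
                   self_transition=0.6, next_transitions={4: 0.4})
```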

[0015] In this embodiment, because it is necessary to identify which speaker each distribution derives from, the HM network is created by converting a predetermined speaker-mixture HM network. The output probability density function is a Gaussian mixture distribution with 34-dimensional diagonal covariance matrices, and each distribution is trained on samples from one specific speaker.
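For reference, the log density of a Gaussian mixture with diagonal covariances, as named in paragraph [0015], can be computed as in the sketch below. The function names are assumptions, and the log-sum-exp step is a common numerical safeguard rather than something the patent specifies.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (var holds the diagonal)."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_gmm(x, weights, means, variances):
    """Log density of a mixture: log sum_k w_k N(x; mean_k, diag(var_k))."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)  # log-sum-exp keeps very small likelihoods from underflowing
    return peak + math.log(sum(math.exp(l - peak) for l in logs))
```

In the embodiment each vector would have 34 components; the functions above work for any dimension.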

[0016] The phoneme matching unit 4 performs phoneme matching in response to a phoneme matching request from the LR parser 5. With each request, the LR parser 5 passes the phoneme matching interval and phoneme context information consisting of the phoneme to be matched and the phonemes before and after it. Based on the received phoneme context information, the phoneme matching unit 4 selects one model by concatenating the states on the HM network that can accept that context, within the constraints of the preceding-state and succeeding-state lists. The likelihood of the data in the phoneme matching interval is then computed with this model and returned to the LR parser 5 as the phoneme matching score. Because the model used here is equivalent to an HMM, the forward-pass algorithm used for ordinary HMMs is applied as is to compute the likelihood.
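The forward pass mentioned in paragraph [0016] is the standard HMM likelihood recursion. A minimal sketch follows; the argument layout (`init`, `trans`, `emit`) is an assumption chosen for brevity, and a practical implementation would work in the log domain or with scaling.

```python
def forward_likelihood(init, trans, emit):
    """Return P(O | model) by the forward algorithm.

    init[i]     : probability of starting in state i
    trans[i][j] : probability of moving from state i to state j
    emit[t][i]  : output probability of observation t in state i
    """
    n = len(init)
    alpha = [init[i] * emit[0][i] for i in range(n)]
    for t in range(1, len(emit)):
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[t][j]
                 for j in range(n)]
    return sum(alpha)


# Two-state left-to-right model with three observations.
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
likelihood = forward_likelihood(init, trans, [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]])
```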

[0017] Meanwhile, the silent-section detection unit 30 detects, based on the time series of feature parameters output from the buffer memory 3, silent sections and the like, including pauses, redundant words, and breaks cued by prosodic information, and outputs a detection signal to the LR parser 5. The detection unit 30 recognizes a redundant word by comparing the input against phoneme models of redundant words stored in advance in its internal memory (for example, the redundant words shown in Tables 1 to 3 below). A pause, that is, a silent section, is detected when one of the following two conditions is satisfied.
(First detection condition) The time t0 during which the power is at or below a predetermined threshold level falls, for example, within the following range: preferably 50 ms ≤ t0 ≤ 3 s, and more preferably 50 ms ≤ t0 ≤ 500 ms.
(Second detection condition) The time t1 during which the number of zero crossings, at which the input speech signal crosses zero potential, is at or above a predetermined threshold falls, for example, within the following range: preferably 50 ms ≤ t1 ≤ 3 s, and more preferably 50 ms ≤ t1 ≤ 500 ms.
As for breaks cued by prosodic information, a sharp rise or fall in intonation is taken to indicate a phrase or clause boundary. Such a break or boundary is identified by detecting that the fundamental frequency, one of the input feature parameters, has changed by rising or falling sharply, at or above a predetermined slope.
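The detection conditions of paragraph [0017] can be sketched with frame-level measurements. Everything concrete below is an assumption made for illustration: the 10 ms frame length, the function names, the per-millisecond slope measure, and the premise that a fundamental-frequency value is available for every frame.

```python
def zero_crossings(samples):
    """Count sign changes between consecutive samples (zero counts as non-negative)."""
    return sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))


def low_power_runs(frame_powers, threshold, frame_ms):
    """Return (start_frame, duration_ms) for each maximal run with power <= threshold."""
    runs, start = [], None
    # The trailing infinity closes a run that lasts until the final frame.
    for i, power in enumerate(list(frame_powers) + [float("inf")]):
        if power <= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, (i - start) * frame_ms))
            start = None
    return runs


def detect_pauses(frame_powers, threshold, frame_ms=10, min_ms=50, max_ms=3000):
    """First detection condition: low power lasting between min_ms and max_ms."""
    return [(s, d) for s, d in low_power_runs(frame_powers, threshold, frame_ms)
            if min_ms <= d <= max_ms]


def f0_boundaries(f0, frame_ms, slope_threshold):
    """Frames where |change in F0| per millisecond is at or above the threshold."""
    return [i for i in range(1, len(f0))
            if abs(f0[i] - f0[i - 1]) / frame_ms >= slope_threshold]
```

The second detection condition would apply `zero_crossings` per frame and then look for runs in the same way as `low_power_runs`, with the comparison reversed.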

[0018] Each time a detection signal is input from the detection unit 30, the LR parser 5 performs speech recognition while compressing a plurality of speech recognition result candidates that function identically in syntactic terms into a single candidate. Examples of redundant words are shown in Tables 1 to 3 below.

[0019]

[Table 1]
────────────────
Redundant words
────────────────
「あ」「あー」「あーっと」「あーん」「ああ」「あっ」「あの」「あのー」「あのう」「あのうー」「い」「いー」「いやー」「う」「うー」「うーん」「うーんと」「うん」「え」「えー」「えーっと」「えーっとー」
────────────────

[0020]

[Table 2]
────────────────
Redundant words
────────────────
「えーっとですね」「えーと」「えーとー」「えーとですね」「えーまあ」「えーん」「ええ」「えっ」「えっーと」「えっと」「えっとー」「えと」「えとー」「お」「おー」「おっ」「こう」「この」「このー」「じゃ」「す」「すー」
────────────────

[0021]

[Table 3]
────────────────
Redundant words
────────────────
「すっ」「そ」「その」「そのー」「ちょっと」「つ」「で」「でー」「と」「とー」「は」「はあー」「ふーん」「ま」「まー」「まぁ」「まあ」「まっ」「も」「ん」「んー」「んと」
────────────────

[0022] A predetermined context-free grammar (CFG) in the context-free grammar database memory 20 is converted automatically, in the known manner, into an LR table, which is stored in the LR table memory 13. The LR parser 5 refers to the speaker model memory 12, which contains for example a phoneme duration model, and to the LR table, and processes the input phoneme prediction data from left to right without backtracking. Where there is syntactic ambiguity, the stack is split and all candidates are analyzed in parallel. The LR parser 5 predicts the next phoneme from the LR table in the LR table memory 13 and outputs phoneme prediction data to the phoneme matching unit 4. In response, the phoneme matching unit 4 performs matching with reference to the information in the HM network memory 11 for that phoneme and returns the likelihood to the LR parser 5 as the speech recognition score. Continuous speech is recognized by concatenating phonemes one after another in this way.

The silent-section detection unit 30 detects silent sections and the like, including pauses and redundant words, based on the time series of feature parameters output from the buffer memory 3, and outputs a detection signal to the LR parser 5. In response, each time the detection signal is input, the LR parser 5 performs speech recognition while compressing a plurality of speech recognition result candidates that function identically in syntactic terms into a single candidate. That is, as shown for example in FIG. 3, the multiple recognition subtrees processed before the detection signal was input are linked and merged, and after the detection signal is input, recognition proceeds from the single linked node. After the input speech has been processed to its end, the candidate with the highest overall likelihood, or a predetermined number of top candidates, is output as recognition result data or result candidate data.

[0023] FIG. 3 shows the speech recognition operation of the continuous speech recognition apparatus of the embodiment of FIG. 1 in stack form, for an example in which a pause, that is, a silent section, occurs as a break in the input speech. Processing up to the point at which "kaigi ni" and "kaiin ni" are recognized is the same as in the conventional method. If the silent-section detection unit 30 detects a silent section immediately after "kaigi ni" has been processed, a detection signal is input from the detection unit 30 to the LR parser 5, and in the processing from that time onward the candidates that function identically in syntactic terms are compressed into one. In this example, both "kaigi ni" and "kaiin ni" have been converted to the noun phrase "NP", so the subtrees of the two speech recognition result candidates are compressed into a single subtree. The method of this embodiment thus avoids the processing that is duplicated in the conventional apparatus.

[0024] In this embodiment, the compression of speech recognition result candidates that function identically in syntactic terms is triggered only when a break such as a silent section is detected in the input speech. The reason is as follows. Because this is a speech recognition apparatus, the compression operation must be triggered in synchrony with time. In practice, however, the number of characters corresponding to the same speech interval often differs between candidates. In this example too, 「かいぎに」 (kaigi ni) has four characters, whereas 「かいいんに」 (kaiin ni) has five. Aligning the candidates by character count therefore does not identify the moment at which compression should be triggered. For this reason, the compression operation is triggered only when a break such as a silent section can be detected in the speech.

[0025] The processing by which the LR parser 5 handles the detection signal from the silent-section detection unit 30 is now described in detail. FIG. 5 shows the data structure of a cell used in the continuous speech recognition apparatus of FIG. 1. As shown in FIG. 5, the cell keeps the data structure that holds the information needed for analysis in the conventional HMM-LR method: a representative-cell link pointer in the top layer, an LR work area in the layer below it consisting of a phoneme string and its state stack, and an HMM work area in the layer below that consisting of two speech recognition scores and a probability table. To this cell a merge pointer is added for merging multiple cells whose stack top, that is, the topmost (last) content of the state stack indicating the speech recognition result candidates, is identical. In the example of FIG. 3, these multiple cells are the two cells that exist before the silent section is detected. In addition, a cell list for pause-section processing (hereinafter, the pause cell list) is newly provided.
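The layered cell of FIG. 5 might be modeled as below. This is a loose sketch from the prose description only: all class and field names are assumptions, and the merge pointer is represented as a plain list of merged cells for simplicity.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LRWorkArea:
    phonemes: List[str] = field(default_factory=list)     # phoneme string
    state_stack: List[int] = field(default_factory=list)  # LR state stack


@dataclass
class HMMWorkArea:
    scores: List[float] = field(default_factory=lambda: [0.0, 0.0])  # two recognition scores
    prob_table: List[float] = field(default_factory=list)            # probability table


@dataclass
class Cell:
    lr: LRWorkArea
    hmm: HMMWorkArea
    # Representative-cell link pointer (top layer of the conventional structure).
    representative: Optional["Cell"] = field(default=None, repr=False, compare=False)
    # Merge pointer added by the embodiment: cells merged under this one.
    merged: List["Cell"] = field(default_factory=list, repr=False, compare=False)

    def stack_top(self) -> int:
        """Topmost (last) entry of the state stack, the key used for merging."""
        return self.lr.state_stack[-1]


pause_cell_list: List[Cell] = []  # cell list for pause-section processing

a = Cell(LRWorkArea(["k", "a", "i", "g", "i", "n", "i"], [0, 7]), HMMWorkArea())
b = Cell(LRWorkArea(["k", "a", "i", "i", "N", "n", "i"], [0, 7]), HMMWorkArea())
a.merged.append(b)
b.representative = a
```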

[0026] FIG. 6 is a flowchart of the speech recognition processing executed in the continuous speech recognition apparatus of FIG. 1. The key points of the cell merge processing and the pause synchronization processing are described below. In the following description, a pause includes a redundant word.
(1) When a pause is detected on a branch of a subtree representing a speech recognition result candidate, that is, when the top of the symbol stack is found to be a pause, and the input speech frame has reached the end of the speech interval of the pause unit, the cell is registered in the pause cell list.
(2) When the branches can no longer be extended, either because of pruning by beam search or because they are rejected syntactically, the compression operation (reduce operation) is applied to the branches registered in the pause cell list. All branches that are not reduced to an element of "a certain syntactic category set" are then pruned.
(3) Among the remaining branches, cells whose topmost state-stack content is identical are merged. The speech recognition score of the merged subtrees is represented by the best one.
Any syntactic categories may be defined as members of "a certain syntactic category set". To accept utterances in which pauses fall at word boundaries, the set is simply changed to include all word breaks.
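Steps (2) and (3) above can be sketched as a single function over simplified cells. The dictionary layout, the key names, and the convention that a higher score is better are assumptions made for this illustration; the reduce operation itself is taken as already applied, so each cell carries the category it was reduced to.

```python
def merge_at_pause(cells, category_set):
    """Prune cells outside category_set, then merge those sharing a state-stack top.

    Each cell is a dict with keys 'category', 'stack_top', and 'score'
    (hypothetical layout; higher score is assumed to be better).
    """
    # Step (2): keep only branches reduced to a member of the category set.
    kept = [c for c in cells if c["category"] in category_set]

    # Step (3): group the survivors by the topmost state-stack content.
    groups = {}
    for cell in kept:
        groups.setdefault(cell["stack_top"], []).append(cell)

    merged = []
    for top, members in groups.items():
        best = max(members, key=lambda c: c["score"])
        merged.append({"stack_top": top, "category": best["category"],
                       "score": best["score"],  # best score represents the group
                       "members": members})     # originals kept for a later split
    return merged


cells = [
    {"text": "kaigi ni", "category": "NP", "stack_top": 7, "score": -10.0},
    {"text": "kaiin ni", "category": "NP", "stack_top": 7, "score": -12.0},
    {"text": "kai", "category": "n", "stack_top": 3, "score": -5.0},
]
result = merge_at_pause(cells, {"NP"})
```

With the two `NP` candidates merged, each continuation is matched once from the single representative instead of once per candidate.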

[0027] Next, the key point of the cell split processing is described.
(4) When processing must go back to a point before the merged position, the pointers are relinked and the cell is split into multiple subtrees. The speech recognition scores after the split are restored to their original values.
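The split in (4) is the inverse of the merge. One simple way to make the original scores recoverable, shown in the sketch below, is to keep the member cells untouched inside the merged cell. The dictionary layout and function names are assumptions for illustration, not the pointer structure of FIG. 5.

```python
def merge(cells):
    """Merge cells that share a stack top; the best score represents the group."""
    best = max(cells, key=lambda c: c["score"])
    return {"stack_top": best["stack_top"], "score": best["score"],
            "members": list(cells)}  # members are left unmodified


def split(merged):
    """Undo a merge: return copies of the members with their original scores."""
    return [dict(c) for c in merged["members"]]


a = {"text": "kaigi ni", "stack_top": 7, "score": -10.0}
b = {"text": "kaiin ni", "stack_top": 7, "score": -12.0}
m = merge([a, b])
parts = split(m)
```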

[0028] The speech recognition processing is described below with reference to FIG. 6. First, in step S1, the HMM work area and the LR parser 5 are initialized. Specifically, one cell with state stack 0 is created. In step S2, it is determined whether the end of the last pause unit in the pause section, which consists of a plurality of pause units, has been reached. If it has, the speech recognition processing ends. If it has not (NO in step S2), the speech-interval data of the analyzed pause unit is read in step S3. In step S4, it is determined whether this is the first pause unit of the pause section. If it is (YES in step S4), the processing proceeds to step S7. If it is not (NO in step S4), the previous likelihood and the point of the previous best speech recognition score (pre-bestpoint) are saved in step S5. At this point there are as many cells as the maximum beam width. The HMM work area of the pause cell list is then initialized. Next, in step S6, multiple cells whose stack top (the topmost entry of the state stack) and content such as part of speech are identical are merged. The cell with the best speech recognition score is chosen as the representative cell of the resulting merge cell list. Then, in step S7, the pause-unit speech recognition module processing using the HMM-LR method (see FIGS. 7 and 8) is executed to process the pause unit, which is a speech interval.

[0029] FIGS. 7 and 8 are flowcharts of the pause-unit speech recognition module processing of FIG. 6. In this module, the processing of steps S12 through S21 is repeated until the length of the analyzed phoneme string reaches the termination condition (the end). As shown in FIG. 7, it is determined in step S11 whether the phoneme string length has reached the end. If it has (YES), control returns to the main routine. If it has not (NO in step S11), the processing proceeds to step S12.

[0030] The following processing is executed in steps S12 to S14. For all representative cells, a plurality of cells subject to a compression operation, that is, cells whose stack tops are identical as described above, are used to create one new cell; the compression operation is then performed and the result is connected to the next cell list.

[0031] Next, the following processing is executed in steps S15 to S17. For all merged cells, split processing is executed: a merged cell whose state-stack tops are no longer identical is divided into a plurality of cells. At this time, the HMM probability table of each cell is restored to its original value.
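The split processing of steps S15 to S17 can be sketched as follows. This is an illustration under stated assumptions, not the patent's code: each member cell is a plain dictionary, `saved_table` is a hypothetical snapshot of the HMM probability table taken before merging, and "restored to its original value" is interpreted as copying that snapshot back.

```python
def split_if_diverged(members):
    """Split a merged cell when its members' stack tops differ.

    members: list of dicts with keys 'stack', 'hmm_table', 'saved_table'
    (all hypothetical names). Returns a list of member groups; a single
    group means the merged cell stays intact.
    """
    by_top = {}
    for member in members:
        by_top.setdefault(member["stack"][-1], []).append(member)
    if len(by_top) == 1:
        return [members]  # stack tops still identical: nothing to split
    for member in members:
        # Assumed meaning of "restore": copy back the pre-merge snapshot.
        member["hmm_table"] = list(member["saved_table"])
    return list(by_top.values())
```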

[0032] Further, in step S18 of FIG. 8, if the current pause unit is not the last one, then every representative cell whose symbol-stack top is silence or a pause and whose best-speech-recognition-score point exceeds the length of the speech section is copied to the pause cell list as an initial cell candidate for the next speech section. Next, in steps S19 to S22, the following processing is performed on all representative cells.
(a) When the next operation is a shift operation, phoneme matching processing is executed to match the representative cell against the predicted phoneme.
(b) When the next operation is an accept operation, the merged cell is registered in the accepted cell list if the input speech check is OK.
(c) Otherwise, the above processing is not executed for the cell.
Then, in step S23, the representative cells are pruned to the beam width by a known method, using the scores of the representative cells. The process then returns to step S11. If it is determined in step S11 that the last pause unit in the speech section has been reached (YES), the processing ends.
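The dispatch of steps S19 to S22 and the pruning of step S23 can be sketched as below. This is a hedged illustration: the dictionary layout, the `match_phoneme` and `input_check` callbacks, and the choice to carry unprocessed cells forward unchanged are assumptions, and the "known method" of pruning is represented here by a simple top-k selection on score.

```python
import heapq


def process_representatives(cells, beam_width, match_phoneme, input_check):
    """Apply cases (a) to (c) to each representative cell, then prune.

    cells: list of dicts with keys 'action' and 'score' (hypothetical).
    match_phoneme(cell) returns a score increment; input_check(cell)
    returns True when the input speech check passes.
    """
    accepted, survivors = [], []
    for cell in cells:
        if cell["action"] == "shift":
            # (a) match the predicted phoneme and update the score
            survivors.append(dict(cell, score=cell["score"] + match_phoneme(cell)))
        elif cell["action"] == "accept":
            # (b) register in the accepted list if the check passes
            if input_check(cell):
                accepted.append(cell)
        else:
            # (c) no processing; kept unchanged here (an assumption)
            survivors.append(cell)
    # Step S23: keep at most beam_width cells, best score first
    survivors = heapq.nlargest(beam_width, survivors, key=lambda c: c["score"])
    return survivors, accepted
```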

[0033] FIG. 4(a) shows the speech sections and phoneme sections in the conventional continuous speech recognition apparatus, and FIG. 4(b) shows those in the continuous speech recognition apparatus of the present embodiment of FIG. 1. As shown in FIG. 4(a), the conventional HMM-LR method is based on a phoneme-synchronous breadth-first search, so the range in which a phoneme to be matched can exist gradually widens as time advances. In the apparatus of the present embodiment, as shown in FIG. 4(b), compression by merging is performed using the detected pause information, so this range can be narrowed. This greatly reduces the amount of computation required, which in turn shortens the processing time of continuous speech recognition and allows recognition to run at high speed.

[0034] As described above, according to the embodiment of the present invention, each time a delimiter such as a silent section, including a pause or a redundant word, is detected in the speech, the multiple candidates generated during speech recognition that function identically in syntactic terms can be compressed. Continuous speech recognition that avoids duplicated processing can therefore be realized. Consequently, the amount of processing is greatly reduced compared with the conventional example, and the speed of speech recognition is increased.

[0035] The above embodiment describes a speech recognition apparatus using the HMM-LR method, but the present invention is not limited to this and can be applied to other types of speech recognition apparatus, such as one using a neural network. In the above embodiment, the silent section etc. detection unit 30 detects redundant words, pauses, and phrase or clause boundaries, but the present invention is not limited to this; the unit may be configured to detect at least one of redundant words, pauses, and phrase or clause boundaries.

[0036]

[Effects of the Invention] As described in detail above, according to the present invention, a speech recognition apparatus provided with speech recognition means for recognizing an uttered speech sentence consisting of an input character string comprises detecting means for detecting at least one of a pause, a redundant word, and a phrase or clause boundary based on the input uttered speech sentence and outputting a detection signal. The speech recognition means executes speech recognition processing using an LR method that uses a hidden Markov model and, when the detection signal is input, concatenates and merges, into the cells used for the LR method, those cells in which the topmost content of the state stack indicating the speech recognition result candidates is identical. A plurality of speech recognition candidates that function identically in syntactic terms are thereby compressed into one speech recognition candidate, and the speech recognition processing is executed on it. Therefore, each time a delimiter such as a silent section, including a pause or a redundant word, is detected in the speech, the multiple candidates generated during speech recognition that function identically in syntactic terms can be compressed, so that continuous speech recognition that avoids duplicated processing can be realized. Consequently, the amount of processing is greatly reduced compared with the conventional example, and the speed of speech recognition is increased.

[Brief description of the drawings]

FIG. 1 is a block diagram of a continuous speech recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a diagram showing, in stack form, the speech recognition operation of a conventional continuous speech recognition apparatus.

FIG. 3 is a diagram showing, in stack form, the speech recognition operation of the continuous speech recognition apparatus of the present embodiment of FIG. 1.

FIG. 4(a) is a diagram showing speech sections and phoneme sections in the conventional continuous speech recognition apparatus, and FIG. 4(b) is a diagram showing speech sections and phoneme sections in the continuous speech recognition apparatus of the present embodiment of FIG. 1.

FIG. 5 is a diagram showing the data structure of a cell used in the continuous speech recognition apparatus of FIG. 1.

FIG. 6 is a flowchart showing the speech recognition processing executed in the continuous speech recognition apparatus of FIG. 1.

FIG. 7 is a flowchart showing a first part of the pause-unit speech recognition module processing of FIG. 6.

FIG. 8 is a flowchart showing a second part of the pause-unit speech recognition module processing of FIG. 6.

[Explanation of symbols]

1 ... microphone, 2 ... feature extraction unit, 3 ... buffer memory, 4 ... phoneme matching unit, 5 ... LR parser, 11 ... hidden Markov network memory, 12 ... speaker model memory, 13 ... LR table memory, 20 ... context-free grammar database memory, 30 ... silent section etc. detection unit.

Continuation of the front page
(56) References: JP-A-5-19785 (JP, A); JP-A-4-84197 (JP, A); JP-A-4-86946 (JP, A); JP-A-1-321498 (JP, A)
(58) Fields searched (Int. Cl.⁶, DB name): G10L 3/00 - 9/18; JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition apparatus provided with speech recognition means for recognizing an uttered speech sentence consisting of an input character string, the apparatus comprising detecting means for detecting at least one of a pause, a redundant word, and a boundary between phrases or clauses based on the input uttered speech sentence and outputting a detection signal, wherein the speech recognition means executes speech recognition processing using an LR method that uses a hidden Markov model and, when the detection signal is input, concatenates and merges, into the cells used for the LR method using the hidden Markov model, cells in which the topmost content of the state stack indicating speech recognition result candidates is identical, thereby compressing a plurality of speech recognition candidates that function identically in syntactic terms into one speech recognition candidate and executing the speech recognition processing.

2. The speech recognition apparatus according to claim 1, wherein the detecting means detects the pause by detecting that at least one of the following is satisfied: a first condition that the power of the uttered speech sentence remains at or below a predetermined threshold for a predetermined time, and a second condition that the number of zero crossings of the uttered speech sentence remains at or above a predetermined threshold for a predetermined time.

3. The speech recognition apparatus according to claim 1, wherein the detecting means detects the redundant word by determining whether there is a match with language models of a plurality of redundant words stored in advance.

4. The speech recognition apparatus according to claim 1, wherein the detecting means detects the boundary of the phrase or clause by detecting that the fundamental frequency of the uttered speech sentence abruptly rises or falls with at least a predetermined gradient.
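
The pause detection conditions of claim 2 can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the frame-based analysis, the function and parameter names, and the measurement of the "predetermined time" as a number of consecutive frames. Each condition is checked independently, and the detected ranges are returned as half-open frame index pairs without merging overlaps.

```python
def frame_features(samples, frame_len):
    """Return (mean power, zero-crossing count) for each full frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:])
                        if (a >= 0) != (b >= 0))
        feats.append((power, crossings))
    return feats


def runs(flags, min_frames):
    """Return (start, end) ranges where flags stay True for >= min_frames."""
    out, start = [], None
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_frames:
                out.append((start, i))
            start = None
    return out


def detect_pauses(samples, frame_len, power_max, zc_min, min_frames):
    """Frame ranges satisfying the first or the second condition."""
    feats = frame_features(samples, frame_len)
    low_power = runs([p <= power_max for p, _ in feats], min_frames)
    high_zc = runs([z >= zc_min for _, z in feats], min_frames)
    return sorted(set(low_power + high_zc))
```

For example, with two loud frames, three silent frames, and one more loud frame, `detect_pauses` reports the silent stretch as frames 2 to 4 when `min_frames` is 3.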