JPH08123480A - Speech recognition device

Info

Publication number: JPH08123480A
Application number: JP6264994A
Authority: JP
Inventors: Shingaa Hararudo; ハラルド・シンガー; Tomohiko Beppu; 智彦別府; Yoshinori Kosaka; 芳典匂坂
Original assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Current assignee: ATR ONSEI HONYAKU TSUSHIN KENKYUSHO KK; ATR Interpreting Telecommunications Research Laboratories
Priority date: 1994-10-28
Filing date: 1994-10-28
Publication date: 1996-05-17

Abstract

PURPOSE: To provide a speech recognition device that can operate substantially in real time, recognizes speech faster than conventional devices, allows each processing unit to be replaced easily, and is highly extensible. CONSTITUTION: The speech recognition device comprises an AD conversion means 2, a feature extraction means 3, speech recognition means 4 and 5, and a control means 7 that controls these means so that AD conversion, feature extraction, and speech recognition are executed in frame synchronization. The control means 7 is connected to the conversion means 2, to the feature extraction means 3, and to the speech recognition means 4 and 5 by control buses 71 to 74, which carry control signals and status signals. The conversion means 2 and the feature extraction means 3 are connected by a first data bus 11 that carries speech data, and the feature extraction means 3 and the speech recognition means 4 and 5 are connected by second data buses 12 and 13 that carry feature parameter data.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device.

[0002]

[Prior Art] FIG. 3 is a block diagram of a conventional speech recognition device. The device of FIG. 3 comprises an AD converter 2a, a feature extraction unit 3a, a phoneme matching unit 4a, an LR parser 5a, a switch 8, and a speech recognition controller 7a. The AD converter 2a, the feature extraction unit 3a, the phoneme matching unit 4a, and the LR parser 5a are each connected only to the speech recognition controller 7a, by control data buses 71a to 74a respectively. Hereinafter, the AD converter 2a, the feature extraction unit 3a, the phoneme matching unit 4a, and the LR parser 5a are collectively referred to as processing units.

[0003] In the conventional speech recognition device of FIG. 3, the speaker starts the device by pressing the switch 8, for example a foot switch, at the same time as beginning to speak. The speaker's utterance is input to the microphone 1, converted into a speech signal, and then input to the AD converter 2a. When the device is started, the speech recognition controller 7a outputs a control signal instructing the start of AD conversion to the AD converter 2a via the control data bus 71a. The AD converter 2a AD-converts the speech signal into speech data, which is a digital signal, and then outputs the speech data, together with a status signal indicating the end of the AD conversion process, to the speech recognition controller 7a via the control data bus 71a. The speech recognition controller 7a outputs the speech data and a control signal instructing the start of feature extraction for that speech data to the feature extraction unit 3a via the control data bus 72a. The feature extraction unit 3a performs, for example, LPC analysis on the input speech data and extracts 34-dimensional feature parameters comprising log power, 16th-order cepstrum coefficients, Δ log power, and 16th-order Δ cepstrum coefficients. It then outputs the time-series data of the feature parameters, together with a status signal indicating the end of the feature extraction process, to the speech recognition controller 7a via the control data bus 72a.

[0004] The speech recognition controller 7a outputs the time-series data of the feature parameters and a control signal instructing the start of phoneme matching to the phoneme matching unit 4a via the control data bus 73a. The phoneme matching unit 4a matches the extracted time-series feature parameter data against the information in a hidden Markov network memory (hereinafter, HM network memory) that corresponds to the phoneme prediction data described later, computes the likelihood of the data in the phoneme matching section using a speaker-independent model, and outputs this likelihood as a phoneme matching score, together with a status signal indicating the end of phoneme matching, to the speech recognition controller 7a via the control data bus 73a. The speech recognition controller 7a outputs the phoneme matching score and a control signal instructing the start of LR parsing to the LR parser 5a via the control data bus 74a. The LR parser 5a refers to an LR table and processes the input phoneme matching scores from left to right without backtracking. The LR table is created in advance by converting a given context-free grammar in a known manner, and is stored in an LR table memory in the LR parser 5a. Where there is syntactic ambiguity, the stack is split and all candidate parses are processed in parallel. The LR parser 5a predicts the next phoneme from the LR table and outputs the resulting phoneme prediction data, together with a status signal indicating the end of phoneme prediction, to the speech recognition controller 7a via the control data bus 74a. Continuous speech is recognized by repeating these operations and concatenating phonemes one after another. The LR parser 5a then outputs the speech recognition result data to an external device. In this conventional device, AD conversion in the AD converter 2a and feature parameter extraction in the feature extraction unit 3a are performed frame by frame, whereas phoneme matching in the phoneme matching unit 4a and LR parsing in the LR parser 5a are performed phoneme by phoneme.

[0005]

[Problems to Be Solved by the Invention] In the conventional speech recognition device, however, the AD converter 2a and the feature extraction unit 3a, the feature extraction unit 3a and the phoneme matching unit 4a, and the phoneme matching unit 4a and the LR parser 5a are connected only through the speech recognition controller 7a, by the control data buses 71a to 74a, and each processing unit runs its process only after receiving both data and a control signal from the speech recognition controller 7a. Real-time processing was therefore not possible, and speech recognition took a long time. In addition, research on speech recognition often requires swapping an individual processing unit for a different one. In the conventional device, each processing unit is connected only to the speech recognition controller 7a, so the data and control-signal interfaces must be made common, which makes replacing a processing unit difficult. For example, it was not possible to detect the pitch frequency from a speech signal and compare it with the speech recognition data for that signal.

[0006] An object of the present invention is to solve the above problems and to provide a speech recognition device that can operate substantially in real time, recognizes speech faster than the conventional example, allows each processing unit to be replaced easily, and is highly extensible.

[0007]

[Means for Solving the Problems] A speech recognition device according to the present invention comprises: conversion means for AD-converting an input speech signal into speech data, which is a digital signal, and outputting the speech data; feature extraction means for extracting feature parameters for speech recognition from the speech data and outputting feature parameter data; speech recognition means for recognizing the speech of the input speech signal on the basis of the feature parameter data and outputting speech recognition data; and control means for controlling the conversion means, the feature extraction means, and the speech recognition means so that the AD conversion, the feature parameter extraction, and the speech recognition are executed in frame synchronization for each frame of speech data, one frame corresponding to a speech signal of a predetermined fixed duration. The device is characterized in that: the control means and the conversion means are connected by a first control bus, which carries a control signal from the control means to the conversion means instructing the start of AD conversion and a status signal from the conversion means to the control means indicating the end of AD conversion of one frame; the control means and the feature extraction means are connected by a second control bus, which carries a control signal from the control means to the feature extraction means instructing the start of feature parameter extraction and a status signal from the feature extraction means to the control means indicating the end of feature parameter extraction for one frame; the control means and the speech recognition means are connected by a third control bus, which carries a control signal from the control means to the speech recognition means instructing the start of speech recognition and a status signal from the speech recognition means to the control means indicating the end of speech recognition for one frame; the conversion means and the feature extraction means are connected by a first data bus for carrying the speech data; and the feature extraction means and the speech recognition means are connected by a second data bus for carrying the feature parameter data.

[0008]

[Operation] In the speech recognition device according to claim 1 of the present invention, the control means sends the conversion means, via the first control bus, a control signal instructing the start of AD conversion. In response to this control signal, the conversion means AD-converts the input speech signal. Each time it has converted the speech signal for the predetermined fixed duration corresponding to one frame into speech data, it outputs that speech data to the feature extraction means via the first data bus and outputs a status signal indicating that AD conversion of one frame has finished to the control means via the first control bus. On the basis of this status signal, the control means outputs to the feature extraction means, frame by frame via the second control bus, a control signal instructing the start of feature extraction for the one frame of speech data that has been input, so that feature extraction is frame-synchronized with the AD conversion of the conversion means. In response to this control signal, the feature extraction means extracts feature parameters from each frame of speech data and outputs them to the speech recognition means via the second data bus, and outputs a status signal indicating that feature extraction for one frame has finished to the control means via the second control bus. On the basis of this status signal, the control means outputs to the speech recognition means, frame by frame via the third control bus, a control signal instructing the start of speech recognition based on the one frame of feature parameters that has been input, so that recognition is frame-synchronized with the feature extraction of the feature extraction means. In response to this control signal, the speech recognition means recognizes speech from each frame of feature parameters and outputs speech recognition data, and outputs a status signal indicating the end of recognition for one frame of speech data to the control means via the third control bus.
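The split between control buses and data buses described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `run_frame_synchronous`, the three stand-in computations, and the use of `deque` objects as data buses are hypothetical and do not come from the patent. The stages also run one after another here, so the sketch shows the signalling order but not any concurrency between stages.

```python
from collections import deque


def run_frame_synchronous(frames):
    """Push frames through three stages; the controller only sees start/done signals."""
    data_bus_1 = deque()   # conversion means -> feature extraction means
    data_bus_2 = deque()   # feature extraction means -> speech recognition means
    results = []           # speech recognition data leaving the device
    control_log = []       # (control bus number, signal) pairs seen by the controller

    def convert(raw):
        # Hypothetical stand-in for AD conversion of one frame.
        data_bus_1.append([round(x) for x in raw])
        return "done"

    def extract():
        # Hypothetical stand-in for feature extraction: mean of the frame.
        samples = data_bus_1.popleft()
        data_bus_2.append(sum(samples) / len(samples))
        return "done"

    def recognize():
        # Hypothetical stand-in for recognition: sign of the feature.
        results.append(data_bus_2.popleft() > 0)
        return "done"

    for raw in frames:
        # For each frame the controller sends "start" on control buses 1, 2, 3 in turn
        # and waits for the "done" status before starting the next stage.
        for bus, action in ((1, lambda: convert(raw)), (2, extract), (3, recognize)):
            control_log.append((bus, "start"))
            control_log.append((bus, action()))
    return results, control_log
```

Note that `control_log` never contains frame data: bulk data moves only through `data_bus_1` and `data_bus_2`, which is the point of the arrangement.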

[0009] As described above, in the speech recognition device according to the present invention, the control means controls the conversion means, the feature extraction means, and the speech recognition means via the first to third control buses so that each executes its process in frame synchronization for each frame of speech data. The processes in the conversion means, the feature extraction means, and the speech recognition means are therefore executed substantially in real time. Furthermore, the conversion means transmits only the speech data directly to the feature extraction means via the first data bus, and the feature extraction means transmits only the feature parameter data directly to the speech recognition means via the second data bus.
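One way to see why direct data buses help is to count bulk-data transfers per frame. The sketch below is a simplification that assumes a linear chain of stages and counts only the forward path; the function name `data_transfers` is hypothetical.

```python
def data_transfers(num_stages, via_controller):
    """Count bulk-data transfers needed to move one frame through a chain of stages."""
    links = num_stages - 1                      # adjacent stage pairs in the chain
    # Routed through the controller: stage -> controller -> next stage (2 transfers).
    # Direct data bus: stage -> next stage (1 transfer).
    return links * (2 if via_controller else 1)
```

For a four-stage chain (AD converter, feature extraction, phoneme matching, LR parser), `data_transfers(4, True)` gives 6 and `data_transfers(4, False)` gives 3.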

[0010]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a speech recognition device according to the embodiment. The device of FIG. 1 comprises a microphone 1, an AD converter 2, a feature extraction unit 3, a phoneme matching unit 4, an LR parser 5, a pitch detection unit 6, a speech recognition controller 7, and a switch 8. The device is characterized as follows. The speech recognition controller 7 controls the AD converter 2, the feature extraction unit 3, the phoneme matching unit 4, and the LR parser 5 so that the AD conversion of the AD converter 2, the feature parameter extraction of the feature extraction unit 3, the phoneme matching of the phoneme matching unit 4, and the LR parsing of the LR parser 5 are executed in frame synchronization for each frame of speech data, one frame corresponding to a speech signal of a predetermined fixed duration. The AD converter 2 and the speech recognition controller 7 are connected by a control bus 71, for example RPC (Remote Procedure Calls), which carries a control signal from the speech recognition controller 7 to the AD converter 2 instructing the start of AD conversion and a status signal from the AD converter 2 to the speech recognition controller 7 indicating the end of AD conversion of one frame. The feature extraction unit 3 and the speech recognition controller 7 are connected by a control bus 72, for example RPC, which carries a control signal from the speech recognition controller 7 to the feature extraction unit 3 instructing the start of feature parameter extraction and a status signal from the feature extraction unit 3 to the speech recognition controller 7 indicating the end of feature parameter extraction for one frame. The phoneme matching unit 4 and the speech recognition controller 7 are connected by a control bus 73, for example RPC, which carries a control signal from the speech recognition controller 7 to the phoneme matching unit 4 instructing the start of phoneme matching and a status signal from the phoneme matching unit 4 to the speech recognition controller 7 indicating the end of phoneme matching for one frame. The LR parser 5 and the speech recognition controller 7 are connected by a control bus 74, for example RPC, which carries a control signal from the speech recognition controller 7 to the LR parser 5 instructing the start of LR parsing and a status signal from the LR parser 5 to the speech recognition controller 7 indicating the end of LR parsing for one frame. The pitch detection unit 6 and the speech recognition controller 7 are connected by a control bus 75, for example RPC, which carries a control signal from the speech recognition controller 7 to the pitch detection unit 6 instructing the start of pitch detection and a status signal from the pitch detection unit 6 to the speech recognition controller 7 indicating the end of pitch detection for one frame. Each of these control signals and status signals is small in volume compared with the speech data, the time-series feature parameter data, the phoneme matching scores, and the phoneme prediction data, all described later. The AD converter 2 and the feature extraction unit 3 are connected by a data bus 11 capable of high-speed transmission of large volumes of data, with a transmission rate of, for example, 240 kbps, for carrying the speech data, which is large in volume. The AD converter 2 and the pitch detection unit 6 are connected by a data bus 14 of the same kind, with a transmission rate of, for example, 240 kbps, which also carries the speech data. The feature extraction unit 3 and the phoneme matching unit 4 are connected by a data bus 12 with a transmission rate of, for example, 240 kbps, for carrying the time-series feature parameter data, which is large in volume. The phoneme matching unit 4 and the LR parser 5 are connected by a data bus 13 with a transmission rate of, for example, 2.4 Mbps, for carrying the phoneme matching scores and the phoneme prediction data, which are large in volume. Hereinafter, the AD converter 2, the feature extraction unit 3, the phoneme matching unit 4, the LR parser 5, and the pitch detection unit 6 are collectively referred to as processing units.
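The bus layout of FIG. 1 can be written down as plain data, which makes its two rules easy to check: every control bus joins the controller to exactly one processing unit, and every data bus joins two processing units directly. The bus numbers and example rates below are taken from the text; the identifier names and the `check_topology` helper are hypothetical.

```python
CONTROLLER = "speech_recognition_controller_7"

# Control buses: controller <-> one processing unit (start signal out, status back).
CONTROL_BUSES = {
    71: "ad_converter_2",
    72: "feature_extraction_unit_3",
    73: "phoneme_matching_unit_4",
    74: "lr_parser_5",
    75: "pitch_detection_unit_6",
}

# Data buses: (unit at one end, unit at the other end, example rate in bits per second).
# Bus 13 carries phoneme matching scores and phoneme prediction data.
DATA_BUSES = {
    11: ("ad_converter_2", "feature_extraction_unit_3", 240_000),
    14: ("ad_converter_2", "pitch_detection_unit_6", 240_000),
    12: ("feature_extraction_unit_3", "phoneme_matching_unit_4", 240_000),
    13: ("phoneme_matching_unit_4", "lr_parser_5", 2_400_000),
}


def check_topology():
    """Return True if no data bus touches the controller and each unit has one control bus."""
    units = set(CONTROL_BUSES.values())
    for end_a, end_b, _rate in DATA_BUSES.values():
        if end_a not in units or end_b not in units:
            return False
        if CONTROLLER in (end_a, end_b):
            return False
    return len(units) == len(CONTROL_BUSES)
```

Under this layout, adding or swapping a processing unit means adding one control bus entry and the data bus entries it needs, without touching the other units.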

[0011] In the speech recognition device of FIG. 1, the switch 8 is, for example, a foot pedal switch. Each time it is turned on, it grounds a terminal of the speech recognition controller 7 to signal the start or the end of an utterance. The speech recognition controller 7 comprises a ROM storing a speech recognition control program, a CPU that executes speech recognition control processing in accordance with that program, and a RAM used as a working area for the processing. When notified by the switch 8 that an utterance has started, the speech recognition controller 7 outputs a control signal instructing the start of AD conversion to the AD converter 2 via the control bus 71. Each time AD conversion of the speech signal for the predetermined fixed duration corresponding to one frame finishes, the speech recognition controller 7 receives from the AD converter 2, via the control bus 71, a status signal indicating that processing of one frame has finished.
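The switch behaviour can be read as a simple toggle, sketched below. That successive presses strictly alternate between start and end is an interpretation of the text, and the class name `UtteranceSwitch` and the event strings are hypothetical.

```python
class UtteranceSwitch:
    """Toggle model of switch 8: each press reports utterance start or utterance end."""

    def __init__(self):
        self.speaking = False   # no utterance in progress initially

    def press(self):
        # Assumption: presses alternate between starting and ending an utterance.
        self.speaking = not self.speaking
        return "utterance_start" if self.speaking else "utterance_end"
```

In this reading, the controller would issue the AD conversion start signal on `"utterance_start"` and stop requesting frames on `"utterance_end"`.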

[0012] Next, on the basis of the status signal from the AD converter 2, the speech recognition controller 7 outputs to the feature extraction unit 3, via the control bus 72, a control signal instructing the start of feature extraction for each frame, and outputs to the pitch detection unit 6, via the control bus 75, a control signal instructing the start of pitch detection for each frame. Each time extraction of the feature parameters for one frame finishes, the speech recognition controller 7 receives from the feature extraction unit 3, via the control bus 72, a status signal indicating that processing of one frame has finished.

[0013] On the basis of the status signal from the feature extraction unit 3, the speech recognition controller 7 outputs to the phoneme matching unit 4, via the control bus 73, a control signal instructing the start of phoneme matching for each frame. Each time phoneme matching for one frame finishes, it receives from the phoneme matching unit 4, via the control bus 73, a status signal indicating that phoneme matching for that frame has finished. On the basis of the status signal from the phoneme matching unit 4, the speech recognition controller 7 outputs to the LR parser 5, via the control bus 74, a control signal instructing the start of LR parsing for each frame. Each time processing of one frame finishes, it receives from the LR parser 5, via the control bus 74, a status signal indicating that processing of one frame has finished.
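The controller's per-frame behaviour amounts to a dispatch table: a status signal from one unit triggers start signals to the units that consume its output. The sketch below encodes that table as described in the text; the unit names, `NEXT_UNITS`, and `start_signals_for_frame` are hypothetical, and the sketch assumes status signals arrive in the order the start signals were issued.

```python
# Which units the controller starts when a given unit reports "frame done".
NEXT_UNITS = {
    "ad_converter": ("feature_extraction_unit", "pitch_detection_unit"),
    "feature_extraction_unit": ("phoneme_matching_unit",),
    "phoneme_matching_unit": ("lr_parser",),
    "lr_parser": (),
    "pitch_detection_unit": (),
}


def start_signals_for_frame():
    """Return the order of start signals the controller issues for one frame."""
    issued = ["ad_converter"]        # the first start signal goes to the AD converter
    pending = ["ad_converter"]       # units whose status signal is still awaited
    while pending:
        finished = pending.pop(0)    # a status signal arrives from this unit
        for unit in NEXT_UNITS[finished]:
            issued.append(unit)
            pending.append(unit)
    return issued
```

Adding a new consumer of the speech data, as the pitch detection unit 6 does here, only requires a new entry in the table.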

[0014] The microphone 1 converts the speaker's utterance into a speech signal and outputs it to the AD converter 2. The AD converter 2 comprises a ROM storing an AD conversion program, a CPU that executes AD conversion in accordance with that program, a RAM used as a working area for the processing, an input socket 21, and output sockets 22, 23, and 24. In response to the control signal instructing the start of AD conversion, which is input from the speech recognition controller 7 via the control bus 71, the AD converter 2 AD-converts the speech signal input from the microphone 1 into speech data, which is a digital signal, in less than 20 milliseconds, and outputs the speech data frame by frame from the output sockets 22, 23, and 24. The speech data output from the output socket 23 is input to the input socket 31 of the feature extraction unit 3 via the data bus 11, and the speech data output from the output socket 24 is input to the input socket 61 of the pitch detection unit 6 via the data bus 14. For each frame, the AD converter 2 outputs a status signal indicating the end of AD conversion for that frame to the speech recognition controller 7 via the control bus 71.
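The framing and fan-out performed by the AD converter 2 can be sketched as follows. The 12 kHz sampling rate and 20 ms frame length are assumptions chosen for illustration, since the text states only that each step finishes in less than 20 milliseconds. The function names and the use of lists as sockets are also hypothetical.

```python
def split_into_frames(samples, sample_rate_hz=12_000, frame_ms=20):
    """Cut a sample sequence into fixed-length frames, dropping any incomplete tail."""
    frame_len = sample_rate_hz * frame_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def fan_out(frame, sockets):
    """Deliver an independent copy of one frame to every output socket."""
    for queue in sockets.values():
        queue.append(list(frame))
```

With output sockets modelled as `{22: [], 23: [], 24: []}`, each call to `fan_out` leaves the same frame on all three, so the feature extraction unit 3 and the pitch detection unit 6 can read it independently.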

[0015] The pitch detection unit 6 comprises a ROM storing a pitch detection program, a CPU that executes pitch detection in accordance with that program, a RAM used as a working area for the processing, an input socket 61, and an output socket 62. In response to the control signal instructing the start of pitch detection for each frame, which is input from the speech recognition controller 7 via the control bus 75, the pitch detection unit 6 detects the pitch frequency from the speech data input through the input socket 61 in less than 20 milliseconds and outputs that pitch frequency from the output socket 62. For each frame, the pitch detection unit 6 outputs a status signal indicating the end of pitch detection for that frame to the speech recognition controller 7 via the control bus 75.
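The text does not say how the pitch frequency is detected. As one common possibility, the sketch below picks the lag with the largest autocorrelation inside a plausible pitch range. The method, the function name `estimate_pitch_hz`, and the 60 to 400 Hz search range are all assumptions for illustration, not the patent's algorithm.

```python
import math


def estimate_pitch_hz(frame, sample_rate_hz, min_hz=60.0, max_hz=400.0):
    """Estimate pitch as the lag with the largest autocorrelation in [min_hz, max_hz]."""
    min_lag = int(sample_rate_hz / max_hz)
    max_lag = min(int(sample_rate_hz / min_hz), len(frame) - 1)
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Unnormalized autocorrelation of the frame at this lag.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    # No positive correlation (for example, silence) means no pitch estimate.
    return None if best_lag is None else sample_rate_hz / best_lag
```

A frame of silence returns `None`, which a caller could treat as an unvoiced frame.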

[0016] The feature extraction unit 3 comprises a ROM storing a feature extraction processing program, a CPU that executes the feature extraction processing according to that program, a RAM used as a working area for the processing, an input socket 31, and output sockets 32, 33, and 34. In response to a control signal that is input from the voice recognition controller 7 via the control bus 72 and that instructs the start of feature extraction processing for each frame, the feature extraction unit 3 performs, for example, LPC analysis on the voice data input from the input socket 31, extracts 34-dimensional feature parameters comprising the logarithmic power, 16th-order cepstrum coefficients, Δ logarithmic power, and 16th-order Δ cepstrum coefficients, and outputs time-series data of the feature parameters from the output sockets 32, 33, and 34. The time-series data of the feature parameters output from the output socket 33 is input to the input socket 41 of the phoneme matching unit 4 via the data bus 12. The feature extraction unit 3 completes this feature extraction processing in less than 20 milliseconds. For each frame, the feature extraction unit 3 outputs a status signal indicating the end of the feature extraction processing for that frame to the voice recognition controller 7 via the control bus 72.
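The layout of the 34-dimensional feature vector can be illustrated with a short Python sketch. It assumes that the 17 static values (logarithmic power followed by 16 cepstrum coefficients) have already been produced by the LPC analysis, and that the Δ terms are plain first differences against the previous frame. The description does not say how the Δ features are computed, and the function name `build_feature_vector` is hypothetical.

```python
from typing import List, Optional

N_CEPSTRUM = 16  # cepstrum order stated in the description


def build_feature_vector(static: List[float],
                         prev_static: Optional[List[float]]) -> List[float]:
    """Return [log power, 16 cepstra, delta log power, 16 delta cepstra].

    `static` holds the log power followed by 16 cepstrum coefficients.
    Deltas are taken as first differences (an assumption of this sketch);
    for the first frame, which has no predecessor, they are set to zero.
    """
    if len(static) != 1 + N_CEPSTRUM:
        raise ValueError("expected log power plus 16 cepstrum coefficients")
    if prev_static is None:
        deltas = [0.0] * len(static)
    else:
        deltas = [cur - prev for cur, prev in zip(static, prev_static)]
    return list(static) + deltas


frame0 = [float(i) for i in range(17)]
frame1 = [float(i) + 0.5 for i in range(17)]
vec = build_feature_vector(frame1, frame0)
print(len(vec))  # 34
```

At one vector per 20-millisecond frame, this amounts to 50 vectors, or 1,700 values, per second on the data bus 12.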

[0017] The phoneme matching unit 4 comprises a ROM storing a phoneme matching processing program, a CPU that executes the phoneme matching processing according to that program, a RAM used as a working area for the processing, an HM network memory storing hidden Markov network data (hereinafter referred to as the HM network) represented as a plurality of networks whose nodes are the individual states, an input socket 41, output sockets 42 and 44, and an input/output socket 43. With this configuration, in response to a control signal that is input from the voice recognition controller 7 via the control bus 73 and that instructs the start of phoneme matching processing for each frame, the phoneme matching unit 4 performs matching with reference to the HM network corresponding to the phoneme prediction data described later, calculates the likelihood of the data in the phoneme matching section using a speaker-independent model, and outputs this likelihood value as a phoneme matching score from the input/output socket 43 and the output sockets 42 and 44. The phoneme matching score output from the input/output socket 43 is input to the input/output socket 51 of the LR parser 5 via the data bus 13. The phoneme matching unit 4 completes this phoneme matching processing in less than 20 milliseconds. For each frame, the phoneme matching unit 4 outputs a status signal indicating the end of the phoneme matching processing for that frame to the voice recognition controller 7 via the control bus 73.

[0018] The LR parser 5 is a phoneme-context-dependent LR parser comprising a ROM storing an LR parsing processing program, a CPU that executes the LR parsing processing according to that program, a RAM used as a working area for the processing, an LR table memory storing an LR table created in advance by converting a predetermined context-free grammar in the known manner, an input/output socket 51, and an output socket 52. In response to a control signal that is input from the voice recognition controller 7 via the control bus 74 and that instructs the start of LR parsing processing for each frame, the LR parser 5 processes the phoneme matching scores input from the input/output socket 51 from left to right, without backtracking, with reference to the LR table. Where there is syntactic ambiguity, it splits the stack and processes the analyses of all candidates in parallel. The LR parser 5 also predicts the next phoneme from the LR table and outputs phoneme prediction data from the input/output socket 51. The phoneme prediction data output from the input/output socket 51 is input to the input/output socket 43 of the phoneme matching unit 4 via the data bus 13. For each frame, the LR parser 5 outputs a status signal indicating the end of the LR parsing processing for that frame to the voice recognition controller 7 via the control bus 74. The LR parser 5 completes the LR parsing processing and the phoneme prediction together in less than 20 milliseconds. In this way, the phoneme matching unit 4 and the LR parser 5 recognize continuous speech by successively concatenating phonemes, and the final voice recognition result data is output from the output socket 52.
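The exchange over the data bus 13, in which the LR parser predicts the next phonemes and the phoneme matching unit scores only those candidates, can be sketched in Python as follows. This is a toy illustration under stated assumptions: the dictionary `NEXT_PHONEMES` is a hypothetical stand-in for the LR table (a real LR table holds shift and reduce actions), and the scores passed to `match` stand in for likelihoods computed against the HM network.

```python
from typing import Dict, Set

# Hypothetical stand-in for the LR table: for each parser state, the
# phonemes that the grammar allows next.
NEXT_PHONEMES: Dict[int, Set[str]] = {0: {"k", "s"}, 1: {"a"}, 2: {"i"}}


def predict(state: int) -> Set[str]:
    """LR parser side: the phoneme prediction data for the current state."""
    return NEXT_PHONEMES.get(state, set())


def match(frame_scores: Dict[str, float], predicted: Set[str]) -> Dict[str, float]:
    """Phoneme matching side: keep scores only for the predicted phonemes."""
    return {p: frame_scores[p] for p in predicted if p in frame_scores}


scores = match({"k": -1.2, "s": -3.4, "a": -0.7}, predict(0))
best = max(scores, key=scores.get)
print(best)  # k
```

Although "a" has the highest raw score in this example, it is never considered in state 0 because the parser did not predict it; restricting the match to predicted phonemes is what the prediction data is for.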

[0019] In the voice recognition device configured as described above, the operator turns on the switch 8, which grounds a terminal of the voice recognition controller 7, activates the voice recognition controller 7, and starts voice recognition processing of the speaker's utterance. The voice recognition controller 7 sends the AD converter 2 a control signal instructing it to start AD conversion processing. Meanwhile, the speaker's utterance is input to the microphone 1, converted into a voice signal, and input continuously to the AD converter 2. The AD converter 2 AD-converts the input voice signal and, each time it has converted 20 milliseconds of voice signal, corresponding to one frame, into voice data, outputs that voice data to the feature extraction unit 3 via the output socket 23, the data bus 11, and the input socket 31, and also outputs to the voice recognition controller 7 a status signal indicating that AD conversion of one frame has ended. Based on this status signal, the voice recognition controller 7 outputs to the feature extraction unit 3, frame by frame and in frame synchronization with the AD conversion processing of the AD converter 2, a control signal instructing it to start processing the input frame of voice data. In response to this control signal, the feature extraction unit 3 extracts the feature parameters described above from each frame of voice data and outputs them to the phoneme matching unit 4 via the output socket 33, the data bus 12, and the input socket 41, and also outputs to the voice recognition controller 7 a status signal indicating that feature extraction processing of one frame has ended.

[0020] Based on the status signal output from the feature extraction unit 3, the voice recognition controller 7 outputs to the phoneme matching unit 4, frame by frame and in frame synchronization with the feature extraction processing of the feature extraction unit 3, a control signal instructing it to start phoneme matching on the input frame of feature parameters. In response to this control signal, the phoneme matching unit 4 performs phoneme matching on each frame of feature parameters with reference to the HM network corresponding to the phoneme prediction data from the LR parser 5, outputs the phoneme matching score of the frame to the LR parser 5, and outputs to the voice recognition controller 7 a status signal indicating that phoneme matching processing of the frame has ended. Based on the status signal from the phoneme matching unit 4, the voice recognition controller 7 outputs to the LR parser 5, frame by frame and in frame synchronization with the phoneme matching processing of the phoneme matching unit 4, a control signal instructing it to start LR parsing of the input frame of phoneme matching scores. In response to the control signal from the voice recognition controller 7, the LR parser 5 processes the input phoneme matching scores from left to right, without backtracking, with reference to the LR table. Where there is syntactic ambiguity, it splits the stack and processes the analyses of all candidates in parallel. The LR parser 5 predicts the next phoneme from the LR table and outputs phoneme prediction data to the phoneme matching unit 4 via the input/output socket 51, the data bus 13, and the input/output socket 43. After this processing, the LR parser 5 outputs to the voice recognition controller 7 a status signal indicating the end of the LR parsing processing. By successively concatenating phonemes in this way, continuous speech is recognized and the voice recognition result data is output. After the speaker has finished speaking, the operator turns on the switch 8 again, grounding the terminal of the voice recognition controller 7. Thereafter, once the voice recognition result data for the speech input before the end of the utterance has been output from the LR parser 5, the voice recognition controller 7 receives the status signal from the LR parser 5 and ends the processing of the voice recognition device. Thus, in the voice recognition device of this embodiment, each processing unit executes its processing frame by frame, one frame corresponding to a voice signal of a predetermined fixed duration, and the voice recognition controller 7 controls the processing units so that they execute their processing in frame synchronization. Each processing unit therefore executes its processing substantially in real time.
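The chain of status and control signals that carries a single frame downstream can be listed with a short Python sketch. It only records the order of the signals for one frame and ignores both timing and the overlap between consecutive frames. The function name `handshake_trace` and the message wording are hypothetical.

```python
from typing import List

STAGES = ["AD conversion", "feature extraction", "phoneme matching", "LR parsing"]


def handshake_trace(frame: int) -> List[str]:
    """Signals the controller receives and sends while one frame moves downstream."""
    trace: List[str] = []
    for i, stage in enumerate(STAGES):
        # Each unit reports the end of its work on this frame ...
        trace.append(f"status from {stage}: frame {frame} done")
        # ... and the controller answers by starting the next unit, if any.
        if i + 1 < len(STAGES):
            trace.append(f"control to {STAGES[i + 1]}: start frame {frame}")
    return trace


for line in handshake_trace(0):
    print(line)
```

Each control signal is issued only after the status signal of the upstream unit for the same frame has arrived, which is how the controller keeps the units in frame synchronization.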

[0021] FIG. 2 is a graph showing the start and end times of processing when an uttered voice is processed by the feature extraction unit 3, the phoneme matching unit 4, and the LR parser 5. As FIG. 2 shows, after the voice signal of the utterance is AD-converted, feature extraction processing starts in the feature extraction unit 3. Once feature extraction for one frame has ended, phoneme matching processing starts in the phoneme matching unit 4. Then, once phoneme matching for one frame has ended, LR parsing processing starts in the LR parser 5. In this way, the feature extraction unit 3, the phoneme matching unit 4, and the LR parser 5 are controlled so that each starts processing 20 milliseconds, that is, one frame, after the preceding unit, and so that they execute their processing in frame synchronization.
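The staggered start times can be worked out with a few lines of Python. The sketch assumes that frame n is fully digitized at (n + 1) × 20 milliseconds after the start of speech and that each later unit starts exactly one frame period after the unit before it; the description gives only the one-frame offset, so the absolute times and the function name `start_time_ms` are illustrative.

```python
FRAME_MS = 20  # frame length used in this embodiment
STAGES = ["feature extraction", "phoneme matching", "LR parsing"]


def start_time_ms(frame: int, stage: int) -> int:
    """Start time of `stage` for `frame`, measured from the start of speech.

    Frame n is assumed to be fully digitized at (n + 1) * FRAME_MS, and each
    later stage to start one frame period after the stage before it.
    """
    return (frame + 1 + stage) * FRAME_MS


for stage, name in enumerate(STAGES):
    print(name, [start_time_ms(frame, stage) for frame in range(3)])
# feature extraction [20, 40, 60]
# phoneme matching [40, 60, 80]
# LR parsing [60, 80, 100]
```

Reading the output column by column shows the overlap: at 60 milliseconds the feature extraction unit starts frame 2 while the phoneme matching unit starts frame 1 and the LR parser starts frame 0.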

[0022] As described above, in the voice recognition device of this embodiment, the AD converter 2, the feature extraction unit 3, the phoneme matching unit 4, and the LR parser 5 execute their processing in frame synchronization, each offset by one frame from the preceding unit, so the processing in each unit can be executed substantially in real time. As a result, the overall voice recognition processing time can be made shorter than in the conventional example, which processes phoneme by phoneme.

[0023] Control signals and status signals (hereinafter referred to as control signals and the like), which are small in volume compared with the voice data, the time-series data of the feature parameters, the phoneme matching scores, and the phoneme prediction data, are transmitted via the control buses 71 to 75. In contrast, the large-volume voice data is transmitted via the data buses 11 and 14, the large-volume time-series data of the feature parameters is transmitted via the data bus 12, and the large-volume phoneme matching scores and phoneme prediction data are transmitted via the data bus 13, each of these data buses being capable of high-speed transmission of large volumes of data. Accordingly, the data, comprising the voice data, the time-series data of the feature parameters, the phoneme matching scores, and the phoneme prediction data, and the control signals and the like can be transmitted simultaneously over buses provided according to data volume, so signal transmission is fast. The voice recognition processing time can therefore be shortened compared with the conventional voice recognition device, which transmits the data and the control signals and the like over the same control data buses 71a to 74a.

[0024] Furthermore, each of the processing units, namely the AD converter 2, the feature extraction unit 3, the phoneme matching unit 4, the LR parser 5, and the pitch detection unit 6, has input and output sockets for each kind of data, so the processing units can easily be connected to one another using those sockets. In addition, because the data buses 11 to 14, which carry the data, are provided separately from the control buses 71 to 75, which carry the control signals and status signals, a processing unit can be connected to the voice recognition controller 7 simply by matching the interface between that processing unit and the voice recognition controller 7, and processing units can be connected to each other simply by matching the interface between the units being connected. In other words, within a processing unit there is no need to match the interface for control signals and status signals with the interface for data. Compared with the conventional voice recognition device, each processing unit can therefore easily be replaced individually with a new processing unit that performs, for example, different processing, so the device can easily be extended to keep pace with research on new voice recognition devices. Here, a new processing unit is, for example, a processing unit that performs voice recognition using the prosodic information needed to recognize natural speech.
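A software analogue of this replaceability is a shared per-frame interface that any processing unit can implement. The Python sketch below is only an analogy under stated assumptions: the `ProcessingUnit` protocol, both extractor classes, and the use of peak amplitude as a prosodic value are hypothetical and do not appear in the description.

```python
from typing import List, Protocol


class ProcessingUnit(Protocol):
    """Hypothetical per-frame interface shared by all processing units."""

    def process(self, frame: List[float]) -> List[float]:
        ...


class FeatureExtractor:
    def process(self, frame: List[float]) -> List[float]:
        # Placeholder for LPC analysis: pass the samples through unchanged.
        return list(frame)


class ProsodyAwareExtractor:
    def process(self, frame: List[float]) -> List[float]:
        # Placeholder for a newer unit that appends one prosodic value,
        # here the peak amplitude of the frame, to its output.
        return list(frame) + [max(abs(x) for x in frame)]


def run(units: List[ProcessingUnit], frame: List[float]) -> List[float]:
    """Pass one frame through the units in order, as over the data buses."""
    for unit in units:
        frame = unit.process(frame)
    return frame


print(run([FeatureExtractor()], [0.1, -0.4]))       # [0.1, -0.4]
print(run([ProsodyAwareExtractor()], [0.1, -0.4]))  # [0.1, -0.4, 0.4]
```

Swapping one extractor for the other requires no change to `run`, just as replacing a processing unit requires matching only the interfaces it is actually connected to.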

[0025] In this embodiment, speech is recognized using the phoneme matching unit 4 and the LR parser 5, but the present invention is not limited to this. For example, the device may be configured to recognize speech using a voice recognition circuit that employs the One Pass DP speech recognition method or the like.

[0026] In this embodiment, each process is performed with one frame set to 20 milliseconds, but the present invention is not limited to this.

[0027] The voice recognition device of this embodiment includes the pitch detection unit 6 and is configured so that pitch detection processing is executed at the same time as voice recognition, based on the same voice signal. The voice recognition data and the pitch frequency detected from the same voice signal can therefore be compared simultaneously.

[0028] In the voice recognition device of this embodiment, the voice recognition controller 7, the AD converter 2, the feature extraction unit 3, the phoneme matching unit 4, the LR parser 5, and the pitch detection unit 6 each include their own CPU, but the present invention is not limited to this. For example, a UNIX system with a single CPU may be used to perform time-division multiplexed pipeline processing that controls all of the processing units.
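One way to picture time-division multiplexed pipeline processing on a single CPU is a loop that, in every frame period, runs each stage once on the output its upstream neighbour produced in the previous period. The Python sketch below is a minimal illustration under that assumption; the description does not specify the scheduling, and the function name `run_pipeline` and the string payloads are hypothetical.

```python
from typing import Callable, List, Optional


def run_pipeline(frames: List[str],
                 stages: List[Callable[[str], str]]) -> List[str]:
    """Advance every stage once per frame period, in lock step.

    slots[i] holds the output that stage i produced in the previous period,
    that is, the data waiting to be read by stage i + 1.
    """
    slots: List[Optional[str]] = [None] * len(stages)
    results: List[str] = []
    # Extra periods at the end drain frames still inside the pipeline.
    for tick in range(len(frames) + len(stages) - 1):
        # Walk the stages back to front so each reads last period's upstream output.
        for i in reversed(range(len(stages))):
            if i > 0:
                source = slots[i - 1]
            else:
                source = frames[tick] if tick < len(frames) else None
            slots[i] = stages[i](source) if source is not None else None
        if slots[-1] is not None:
            results.append(slots[-1])
    return results


stages = [lambda x: x + ">feat", lambda x: x + ">match", lambda x: x + ">parse"]
print(run_pipeline(["f0", "f1"], stages))
# ['f0>feat>match>parse', 'f1>feat>match>parse']
```

With two frames and three stages the loop runs for four periods: frame 0 leaves the last stage in the third period and frame 1 in the fourth, so each stage lags its upstream neighbour by exactly one frame.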

[0029]

[Effects of the Invention] As described above, the voice recognition device according to the present invention includes control means that controls the conversion means, the feature extraction means, and the voice recognition means so that the AD conversion by the conversion means, the extraction of feature parameters by the feature extraction means, and the voice recognition by the voice recognition means are executed in frame synchronization for each frame of voice data, one frame corresponding to a voice signal of a predetermined fixed duration. The processing in the conversion means, the feature extraction means, and the voice recognition means is therefore executed substantially in real time. In addition, the control signals and status signals, which are small-volume signals, are transmitted via the first, second, and third control buses, while the large-volume voice data and the large-volume time-series data of the feature parameters are transmitted via the first and second data buses, which are capable of high-speed transmission of large volumes of data. That is, the data and the control signals and the like are transmitted simultaneously over buses provided separately according to data volume, so both are transmitted at high speed. Accordingly, the present invention can provide a voice recognition device capable of faster voice recognition than the conventional example.

[0030] Furthermore, because the data buses for transmitting data are configured separately from the control buses for transmitting control signals and status signals, the interface for data and the interface for control signals and status signals need not be shared, and each of the processing means, namely the conversion means, the feature extraction means, and the voice recognition means, can easily be replaced individually with a new processing means that performs different processing. This makes it possible to provide a highly extensible voice recognition device that can be used for research on voice recognition devices.

[Brief description of drawings]

FIG. 1 is a block diagram of a voice recognition device according to an embodiment of the present invention.

FIG. 2 is a graph showing the start and end times of processing by the feature extraction unit 3, the phoneme matching unit 4, and the LR parser 5 in the voice recognition device of FIG. 1.

FIG. 3 is a block diagram of a conventional voice recognition device.

[Explanation of symbols]

1 ... Microphone
2 ... AD converter
3 ... Feature extraction unit
4 ... Phoneme matching unit
5 ... LR parser
6 ... Pitch detection unit
7 ... Voice recognition controller
11, 12, 13, 14 ... Data buses
21, 31, 41, 61 ... Input sockets
22, 23, 24, 32, 33, 34, 42, 44, 52, 62 ... Output sockets
43, 51 ... Input/output sockets
71, 72, 73, 74, 75 ... Control buses

Continuation of the front page
(72) Inventor: Tomohiko Beppu, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto
(72) Inventor: Yoshinori Kosaka, c/o ATR Interpreting Telecommunications Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto

Claims

[Claims]

1. A voice recognition device comprising:
conversion means for AD-converting an input voice signal into voice data, which is a digital signal, and outputting the voice data;
feature extraction means for extracting feature parameters for voice recognition from the voice data and outputting feature parameter data;
voice recognition means for recognizing the speech of the input voice signal based on the feature parameter data and outputting voice recognition data; and
control means for controlling the conversion means, the feature extraction means, and the voice recognition means so that the AD conversion by the conversion means, the extraction of feature parameters by the feature extraction means, and the voice recognition by the voice recognition means are executed in frame synchronization for each frame of voice data corresponding to a voice signal of a predetermined fixed duration,
characterized in that:
the control means and the conversion means are connected by a first control bus for transmitting a control signal from the control means to the conversion means instructing the start of AD conversion, and a status signal from the conversion means to the control means indicating the end of AD conversion of one frame;
the control means and the feature extraction means are connected by a second control bus for transmitting a control signal from the control means to the feature extraction means instructing the start of feature parameter extraction processing, and a status signal from the feature extraction means to the control means indicating the end of extraction of one frame of feature parameters;
the control means and the voice recognition means are connected by a third control bus for transmitting a control signal from the control means to the voice recognition means instructing the start of voice recognition, and a status signal from the voice recognition means to the control means indicating the end of voice recognition of one frame;
the conversion means and the feature extraction means are connected by a first data bus for transmitting the voice data; and
the feature extraction means and the voice recognition means are connected by a second data bus for transmitting the feature parameter data.