JPH10198397A

JPH10198397A - Voice recognition device and voice recognition method

Info

Publication number: JPH10198397A
Application number: JP9001007A
Authority: JP
Inventors: Shintaro Murakami; 伸太郎村上; Yoshihiro Matsuura; 嘉宏松浦; Shigeru Kashiwagi; 繁柏木
Original assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Current assignee: Meidensha Corp; Meidensha Electric Manufacturing Co Ltd
Priority date: 1997-01-08
Filing date: 1997-01-08
Publication date: 1998-07-31

Abstract

PROBLEM TO BE SOLVED: To obtain a voice recognition result substantially simultaneously with the end of the utterance. SOLUTION: Voice data from a voice input device 21 is taken into a voice processing part 22 and fed to a sequential recording processing part 23, where it is recorded successively into a recording buffer memory that is divided into a plurality of sections. When recording into one section of the memory is completed, recording into the next section starts, while the voice data whose recording is complete is input to a feature extracting part 24a of a recognition processing part 24. There the voice data is subjected to frequency analysis to obtain a spectrum sequence, which is input to a phoneme recognizing part 24b constituted by a neural network to obtain a phoneme candidate sequence. This candidate sequence is input to a word spotting part 24c and matched against dictionary templates 24d by DTW, and the most similar word is delivered as a result. The output results are processed by a cumulative distance calculating part 25, and the calculated values are delivered to a backtrace part 26 in order to extract the sequence of recognized words.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition apparatus and a speech recognition processing method that use sequential speech processing.

[0002]

[Prior Art] One known speech recognition apparatus is the discrete word speech recognition system shown in FIG. 12. In this system, speech data is input from a speech input device 11, such as a telephone or a microphone, to a speech input unit 12. The speech data input to the speech input unit 12 is supplied to a feature extraction unit 13, where it is frequency-analyzed. A spectrum sequence obtained from the result of this frequency analysis is input to a phoneme recognition unit 14. The phoneme recognition unit 14 is a neural network with duplicated outputs. The network consists of an input layer, a hidden layer, and an output layer. At each time step, for example, the spectra of five frames are input to the input layer, and the values of the output-layer units indicate which phoneme the central spectrum corresponds to. Because the output units are duplicated, two units are associated with each phoneme category. From the result, the two units with the largest output values are selected, and the phonemes they correspond to are taken as the first-ranked and second-ranked phoneme candidates.
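The candidate selection described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `top2_phonemes`, the unit-to-phoneme table, and the output values are all hypothetical, and the neural network itself is left out.

```python
def top2_phonemes(outputs, unit_to_phoneme):
    """Return the phonemes of the two output units with the largest values."""
    # Rank unit indices by output value, largest first.
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    first, second = ranked[0], ranked[1]
    return unit_to_phoneme[first], unit_to_phoneme[second]


# Duplicated outputs: two units per phoneme category (hypothetical layout).
unit_to_phoneme = ["a", "a", "i", "i", "u", "u"]
outputs = [0.10, 0.70, 0.65, 0.05, 0.20, 0.15]  # made-up unit activations

candidates = top2_phonemes(outputs, unit_to_phoneme)  # (first-ranked, second-ranked)
```

Applying this to every frame yields the phoneme candidate sequence that the matching stage consumes.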

[0003] The similarity between the recognized phoneme candidate sequence and a template 15 in a dictionary, which holds the phoneme patterns of the vocabulary to be recognized, is then computed. The similarity between a phoneme in the template and the first-ranked and second-ranked candidates in the recognized phoneme candidate sequence is taken as a local score. A matching unit 16 accumulates these local scores by the DTW method to obtain an overall similarity score, and the word whose similarity score is smallest among all the vocabulary to be recognized is output from the matching unit 16 as the recognition result.
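The matching step can be illustrated with a small sketch. The text does not give the numerical form of the local score, so the values used here (0 for a match with the first-ranked candidate, 0.5 for the second-ranked, 1 otherwise) are an assumption, as are the function names and the toy dictionary. The DTW recurrence is the common symmetric form with three predecessors, which may differ from the path constraints the authors used.

```python
def local_score(template_phoneme, candidates):
    """Hypothetical local score: smaller means more similar."""
    first, second = candidates
    if template_phoneme == first:
        return 0.0
    if template_phoneme == second:
        return 0.5
    return 1.0


def dtw_score(template, candidate_seq):
    """Accumulate local scores along the best DTW alignment path."""
    inf = float("inf")
    n, m = len(template), len(candidate_seq)
    # acc[i][j]: best accumulated score aligning template[:i] with candidate_seq[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_score(template[i - 1], candidate_seq[j - 1])
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


def recognize(dictionary, candidate_seq):
    """Return the dictionary word whose accumulated score is smallest."""
    return min(dictionary, key=lambda w: dtw_score(dictionary[w], candidate_seq))


dictionary = {"aki": ["a", "k", "i"], "kai": ["k", "a", "i"]}  # toy templates
frames = [("a", "o"), ("a", "k"), ("k", "t"), ("i", "e"), ("i", "u")]
best = recognize(dictionary, frames)
```

Because the smallest score wins, the "similarity score" behaves like a distance, which matches the text's choice of the minimum.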

[0004]

[Problems to Be Solved by the Invention] In the discrete word speech recognition system described above, recognition processing begins only after the input speech is judged to have ended. For computationally expensive recognition such as continuous speech recognition, a considerable time therefore elapses between the end of the speech and the output of the recognition result, which makes the system unsuitable for practical use. Even for short inputs such as single words, processing takes much longer on a personal computer, whose computation is relatively slow compared with a high-speed workstation, so it has been difficult to build such an apparatus on a personal computer.

[0005] The present invention has been made in view of these circumstances, and its object is to provide a speech recognition apparatus and a speech recognition processing method that can obtain the speech recognition result almost simultaneously with the end of the speech.

[0006]

[Means for Solving the Problems] To achieve the above object, a first invention is a speech recognition processing apparatus having a speech input device, such as a telephone, and a recognition processing unit that performs phoneme recognition on speech data input from the speech input device. The recognition processing unit frequency-analyzes the speech data and inputs the result to an output-multiplexed neural network to perform phoneme recognition, obtaining first-ranked and second-ranked phoneme candidates. It takes as a local score the similarity between a phoneme in a template, held in a dictionary containing the phoneme patterns of the vocabulary to be recognized, and the first-ranked and second-ranked candidates in the recognized phoneme candidate sequence, accumulates the local scores by the DTW method to obtain an overall similarity score, and outputs as the recognition result the word whose similarity score is smallest among all the vocabulary to be recognized. The apparatus is characterized by comprising: a speech processing unit that receives the speech data output from the speech input device and processes it; a buffer memory unit, divided into a plurality of memories, into which the speech data processed by the speech processing unit is recorded sequentially; a recording end detection unit that detects whether recording into one memory of the buffer memory unit has finished; and a memory switching unit that, when the detection unit detects the end of recording into a memory, inputs the output of the buffer memory unit to the recognition processing unit and switches recording to the next memory of the buffer memory unit.

[0007] A second invention is characterized in that, when recording of the speech data into the buffer memory unit has finished, a backtrace is performed to extract the recognized word sequence from the recognition processing unit, thereby obtaining the recognition result.

[0008] A third invention is characterized in that speech data output from a speech input device is processed by a speech processing unit, the processed speech is recorded sequentially into a recording buffer memory divided into a plurality of sections, and, when recording into one section of the recording buffer memory finishes, recording into the next memory is started while phoneme recognition processing is performed in parallel.

[0009]

[Embodiments of the Invention] An embodiment of the invention is described below with reference to the drawings. FIG. 1 is a schematic configuration diagram of the embodiment. In FIG. 1, reference numeral 21 denotes a speech input device such as a telephone or a microphone, and speech data from the speech input device 21 is input to a speech processing unit 22. The speech data captured by the speech processing unit (sound board) 22 is supplied to a sequential recording processing unit 23 and recorded sequentially into a recording buffer memory (described later). The recording buffer memory is divided into a plurality of memories. When recording into one memory finishes (the memory becomes full), recording into the next memory starts, and the speech data in the buffer memory whose recording has finished is input to a feature extraction unit 24a of a recognition processing unit 24, so that recognition processing proceeds in parallel with recording.

[0010] The speech data from the buffer memory input to the feature extraction unit 24a is frequency-analyzed to obtain a spectrum sequence. This spectrum sequence is input to a phoneme recognition unit 24b composed of a neural network, which outputs a phoneme candidate sequence. The candidate sequence is input to a word spotting unit 24c (word spotting is a calculation that, assuming a designated frame in continuous speech to be the end of a dictionary word, finds the optimal start point and the matching distance), matched against dictionary templates 24d by DTW, and the most similar word is output as the result. The output is processed by a cumulative distance calculation unit 25. While speech input continues, processing is repeated from the sequential recording processing unit 23. When speech input has ended, the result is supplied to a backtrace unit 26 (backtrace is a calculation that extracts the recognized words afterward), which extracts the recognized word sequence.
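The relationship between word spotting, cumulative distance, and backtrace can be sketched as below. The text does not state the recurrence used by the cumulative distance calculation unit 25, so this assumes a common formulation: the best total distance up to a frame is the minimum, over words ending at that frame, of the best total before the word's start plus the word's matching distance. The word-spotting results are supplied as a precomputed table, and all names and numbers are hypothetical.

```python
def cumulative_distances(spots, n_frames):
    """spots[end] is a list of (word, start, distance) for words ending at frame `end`."""
    inf = float("inf")
    # total[t]: best accumulated distance of a word sequence covering frames 0..t-1
    total = [0.0] + [inf] * n_frames
    back = [None] * (n_frames + 1)  # back[t]: (word, start) of the last word ending at t-1
    for end in range(n_frames):
        for word, start, dist in spots.get(end, []):
            cand = total[start] + dist
            if cand < total[end + 1]:
                total[end + 1] = cand
                back[end + 1] = (word, start)
    return total, back


def backtrace(back, n_frames):
    """Walk the back-pointers from the last frame to recover the word sequence."""
    words = []
    t = n_frames
    while t > 0:
        if back[t] is None:
            raise ValueError("no word sequence covers the input")
        word, start = back[t]
        words.append(word)
        t = start  # continue from the frame just before this word began
    return words[::-1]


# Hypothetical word-spotting output for a 6-frame utterance.
spots = {
    2: [("ichi", 0, 0.5), ("ni", 0, 1.5)],
    3: [("ni", 0, 1.0)],
    5: [("san", 3, 0.25), ("ni", 4, 0.25), ("ichi", 0, 3.0)],
}
total, back = cumulative_distances(spots, 6)
words = backtrace(back, 6)
```

The forward pass can run as each buffer arrives, while the backtrace needs only the stored back-pointers, which is why it can be deferred until speech input has ended.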

[0011] FIG. 2 is a block diagram showing the sequential recording processing unit 23 in detail. The sequential recording processing unit 23 has a recording buffer memory unit 23a divided into a plurality of memories. The memory unit 23a is provided with a recording end detection unit 23b, which detects that recording of speech into one memory has finished, and with a memory switching unit 23c. When the recording end detection unit 23b issues its detection output, the memory switching unit 23c switches the start of recording to the next memory.

[0012] FIG. 3 is a structural diagram of the cyclic recording buffer memory unit 23a. The buffer memory unit 23a consists of a plurality of buffer memories 1, 2, ..., (n-1), n. Speech data is supplied to the buffer memories 1, 2, ..., n in parallel, but recording is controlled so that the next memory is not recorded to until the current memory is full. This control is performed as shown in FIG. 2.
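A cyclic buffer group of this kind can be sketched as follows. The class and method names are hypothetical, and samples are modeled as plain list items. The sketch fills one buffer at a time, reports a buffer as finished once it is full (the role of the recording end detection unit 23b), and then moves on to the next buffer, wrapping around at the end (the role of the memory switching unit 23c).

```python
class CyclicRecordingBuffers:
    """n fixed-size buffers that are filled one after another, cyclically."""

    def __init__(self, n_buffers, buffer_size):
        self.buffers = [[] for _ in range(n_buffers)]
        self.buffer_size = buffer_size
        self.current = 0  # index of the buffer being recorded to

    def write(self, samples):
        """Record samples; return copies of every buffer that became full."""
        finished = []
        for sample in samples:
            buf = self.buffers[self.current]
            buf.append(sample)
            if len(buf) == self.buffer_size:  # recording end detected
                finished.append(list(buf))    # hand a copy to the recognizer
                # Switch to the next buffer, wrapping around, and reuse it.
                self.current = (self.current + 1) % len(self.buffers)
                self.buffers[self.current].clear()
        return finished


ring = CyclicRecordingBuffers(n_buffers=3, buffer_size=4)
done = ring.write(range(10))  # two buffers fill up; the third holds two samples
```

Returning copies is a simplification: a real implementation would have to make sure a buffer has been consumed by recognition before it is reused.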

[0013] FIG. 4 outlines the recording operation into the recording buffer memory unit 23a. FIG. 4(A) shows the operation from the initial state to the silent idle state. A1 shows the initial state, in which the number of recording buffers is "0". A2 shows the state in which buffer memory 1 has finished recording, A3 the state in which buffer memories 1 and 2 have finished recording, and A4 the state in which buffer memories 1, 2, and 3 have finished recording, that is, the number of recording buffers has reached "4". In the idle state, the topmost buffer memory is deregistered once and then registered again.

[0014] FIG. 4(B) shows the operation after a speech section is detected: while voiced input continues, speech buffers are added.

[0015] FIG. 4(C) shows the operation from detection of a silent section to the end of recording: when silence continues, recording ends.

[0016] Next, the processing flow of the embodiment of FIG. 1 configured as above is described with reference to FIG. 5. In this embodiment, sequential recognition processing and sequential recording processing are carried out in parallel as the processing for recognition. In FIG. 5, when a processing start command is issued in step S1, sequential recording (S2) into the buffer memory (hereinafter, buffer) starts. Step S3 determines whether a recording end event has been issued for the buffer being recorded. If so (Y), the speech check of that buffer is performed in step S4. If the buffer is judged to be a speech section (Y), sequential recognition processing is performed (S5) and processing returns to the sequential recording of step S2. If the buffer is judged in step S4 to be a silent section (N), step S6 determines whether the end condition is satisfied. If it is (Y), the recognition result is output, the recording buffers are reset (S7), and processing ends (S8).

[0017] The sequential recording processing is now described. This processing records until a buffer is full, then records into the next buffer, and continues this control until speech input is judged to have ended. It consists of (1) recording start processing, (2) processing in the sound device driver, (3) recording processing, and (4) recording stop processing. Each is described below.

[0018] (1) Recording start processing
(a) Open the sound device for recording.
(b) Register the group of recording buffers that implements the recording buffer memory.
(c) Issue the recording start command.

[0019] (2) Processing in the sound device driver
(a) Record speech data continuously into the recording buffers in the order in which they were registered.
(b) Each time recording into a recording buffer finishes, store the number of recorded data items and issue a recording end event.

[0020] (3) Recording processing
The recording processing follows the algorithm shown in FIG. 6. First, it is determined whether the number of data items in the buffer that caused the recording end event is zero (S11). If not (N), a speech check is performed to determine whether the data is a speech section or a silent section (S12). The check result is used to determine whether the data is voiced (S13). If it is (Y), step S14 determines whether speech is already in progress. If so (Y), a speech buffer is added and processing ends. If not (N), the number of voiced buffers is counted (S15). If voiced input has continued for at least the predetermined value NR BUF START (S16) (Y), the in-speech flag is set ON (S17), and the idle state of the buffers is updated (S18).

[0021] If step S13 gives (N), step S19 determines whether speech is in progress. If so (Y), the number of silent buffers is counted (S20). If silence has continued for at least the predetermined value NR BUF END (S21) (Y), the input is regarded as a silent section, the recording stop processing is performed, and recording ends (S22). If step S19 finds that speech is not in progress (N), the idle state of the buffers is updated (S23).
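The per-buffer decision logic of FIG. 6 can be sketched as a small state machine. Several details are assumptions rather than statements from the text: the threshold values, the underscore spelling of `NR_BUF_START` and `NR_BUF_END`, the action names, the behavior on the (N) branches of S16 and S21, and the resetting of the counters. The speech check itself (S12) is outside the sketch and is passed in as a boolean.

```python
NR_BUF_START = 2  # hypothetical: voiced buffers needed before speech is "in progress"
NR_BUF_END = 3    # hypothetical: silent buffers needed before recording stops


class RecordingState:
    def __init__(self):
        self.in_speech = False  # the in-speech flag set in S17
        self.voiced_count = 0
        self.silent_count = 0


def on_buffer_recorded(state, n_samples, is_voiced):
    """Handle one recording end event and return the action to take."""
    if n_samples == 0:                      # S11: empty buffer, nothing to check
        return "skip"
    if is_voiced:                           # S13 (Y)
        if state.in_speech:                 # S14 (Y)
            state.silent_count = 0          # assumption: a voiced buffer resets the silence run
            return "add_buffer"
        state.voiced_count += 1             # S15
        if state.voiced_count >= NR_BUF_START:  # S16
            state.in_speech = True          # S17
        return "update_idle"                # S18
    if state.in_speech:                     # S19 (Y)
        state.silent_count += 1             # S20
        if state.silent_count >= NR_BUF_END:    # S21
            return "stop"                   # S22: recording stop processing
        return "wait"                       # assumption: keep recording until S21 is met
    state.voiced_count = 0                  # assumption: silence before speech resets the count
    return "update_idle"                    # S23


state = RecordingState()
voiced_flags = [False, True, True, True, False, True, False, False, False]
actions = [on_buffer_recorded(state, 160, v) for v in voiced_flags]
```

Requiring several consecutive voiced or silent buffers before changing state keeps a single noisy buffer from starting or ending a recording.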

[0022] (4) Recording stop processing
(a) Issue the recording stop command and the reset command to interrupt recording.
(b) Deregister the recording buffers.
(c) Close the sound device.

[0023] Next, the sequential recognition processing is described. Each time a recorded buffer is delivered, the sequential recognition processing performs the recognition calculation while taking into account the results computed so far. Two methods were implemented for this processing: discrete word recognition and continuous word recognition. Both are designed to handle recorded buffers that arrive one after another. In addition, other processing can proceed without waiting for the recognition processing unit to finish, which makes the processing more efficient and faster. Examples of the two methods follow.
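The idea of handing over recorded buffers without waiting for recognition to finish can be sketched with a worker thread and a queue. This is an illustration of the concurrency pattern only. The "recognition" here is a stand-in that keeps a running sum to mimic reuse of earlier results, and the queue replaces the patent's scheme of deferring a transfer while the recognizer is busy. All names are hypothetical.

```python
import queue
import threading


def recognizer_worker(buffers, results):
    """Consume recorded buffers until the None sentinel arrives."""
    running_total = 0  # stand-in for the calculation results carried between buffers
    while True:
        buf = buffers.get()
        if buf is None:
            break
        running_total += sum(buf)  # placeholder for analysis, phoneme recognition, DTW
        results.append(running_total)


def record_and_recognize(recorded_buffers):
    buffers = queue.Queue()
    results = []
    worker = threading.Thread(target=recognizer_worker, args=(buffers, results))
    worker.start()
    for buf in recorded_buffers:
        buffers.put(buf)  # returns immediately; recording is not blocked by recognition
    buffers.put(None)     # signal the end of speech input
    worker.join()         # only the final result has to be waited for
    return results


partial_results = record_and_recognize([[1, 2], [3, 4], [5]])
```

With this arrangement most of the recognition work is already done when the last buffer arrives, which is the effect the sequential method aims for.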

[0024] Example 1: Discrete word recognition method
In FIG. 7, after recognition starts, step S31 confirms that invoice flg (the speech detection flag) = OFF and buff count (the buffer counter) = 0. Thereafter, each time the speech check of a recorded buffer detects speech (S32), the speech detection flag invoice flg is set ON (S33); otherwise it is set OFF (S34). If the speech detection flag invoice flg (S35) is ON (Y), it is determined whether the recognition processing unit is currently performing recognition (S36). If it is (Y), transfer of the recorded buffer is deferred and processing returns to step S32. If it is not (N), the recorded buffer is transferred to the recognition processing unit and recognition is performed (S37), buff count is incremented, and processing returns to step S32 without waiting for the recognition processing unit to finish. The recognition processing differs according to the value of buff count.

[0025] When buff count = 1, the recognition processing unit performs frequency analysis, phoneme recognition, and DTW. It then stores, in a designated location, the DTW result for each dictionary template, which is needed for the calculation on the next buffer.

[0026] When buff count = 2, the recognition processing unit retrieves the stored calculation results and uses them as the initial values for DTW. It then performs frequency analysis, phoneme recognition, and DTW on the input buffer as above, and stores the DTW result for each dictionary template in the designated location.
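Carrying the DTW result from one buffer into the next can be sketched as follows. The sketch assumes that the stored result is the last DTW column for a template (one accumulated distance per template phoneme) and that it is used as the starting column for the next buffer. It uses a simple 0/1 local distance on symbols and hypothetical names, so it illustrates the idea rather than the authors' exact calculation.

```python
INF = float("inf")


def local_distance(template_symbol, frame_symbol):
    """Hypothetical local distance: 0 for a match, 1 otherwise."""
    return 0.0 if template_symbol == frame_symbol else 1.0


def dtw_update(template, frames, prev_column=None):
    """Advance DTW over `frames`, starting from `prev_column`; return the new last column."""
    column = prev_column
    for frame in frames:
        new = [INF] * len(template)
        for i, sym in enumerate(template):
            d = local_distance(sym, frame)
            if column is None:
                # Very first frame: the path must start at the first template symbol.
                best = 0.0 if i == 0 else new[i - 1]
            else:
                best = min(
                    column[i],                      # stay on the same template symbol
                    column[i - 1] if i > 0 else INF,  # advance one symbol with the frame
                    new[i - 1] if i > 0 else INF,     # advance a symbol within this frame
                )
            new[i] = d + best
        column = new
    return column


template = list("aki")
frames = list("aakkii")

full = dtw_update(template, frames)              # whole utterance at once
state = dtw_update(template, frames[:3])         # first buffer (buff count = 1)
state = dtw_update(template, frames[3:], state)  # second buffer resumes from stored state
```

Under these assumptions the buffer-by-buffer result equals the result of processing the whole utterance in one pass, so nothing is lost by starting recognition before the speech has ended.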

[0027] If the speech detection flag invoice flg (S35) is not ON, that is, OFF (N), it is determined whether any recorded buffers have not yet been recognized (S38). If there are (Y), transfer to the recognition processing unit and recognition are repeated until all unprocessed buffers have been recognized (S39). The recognition result is then obtained and processing ends. If step S38 gives (N), the recognition processing ends. FIG. 8 is a timing chart of the recognition processing and recording described above.

[0028] Example 2: Continuous word recognition method
In FIG. 9, after recognition starts, step S41 confirms that invoice flg (the speech detection flag) = OFF and buff count (the buffer counter) = 0. Thereafter, each time the speech check of a recorded buffer detects speech (S42), the speech detection flag invoice flg is set ON (S43); otherwise it is set OFF (S44). If the speech detection flag invoice flg (S45) is ON (Y), it is determined whether the recognition processing unit is currently performing recognition (S46). If it is (Y), transfer of the recorded buffer is deferred and processing returns to step S42. If it is not (N), the recorded buffer is transferred to the recognition processing unit and recognition is performed (S47), and buff count is incremented. The recognition processing differs according to the value of buff count; this determination is made in step S48.

[0029] If step S48 gives (N), buff count = 1. In this case the recognition processing unit performs frequency analysis, phoneme recognition, and word-spotting DTW. It then stores, in a designated location, the word-spotting results needed for the calculation on the next buffer. Processing returns to step S42 without waiting for the recognition processing unit to finish.

[0030] If step S48 gives (Y), buff count = 2. In this case the recognition processing unit retrieves the stored calculation results and uses the word-spotting values as initial values. It then performs frequency analysis, phoneme recognition, and word-spotting DTW on the input buffer as above, and stores the results needed for the next calculation. Thereafter, without waiting for the recognition processing unit to finish, the cumulative distance up to the previous buffer is calculated in step S49, and processing returns to step S42.

[0031] If the speech detection flag invoice flg (S45) is not ON, that is, OFF (N), it is determined whether any recorded buffers have not yet been recognized (S50). If there are (Y), transfer to the recognition processing unit and recognition are performed until all unprocessed buffers have been recognized (S51). After recognition, the cumulative distance is calculated in step S52, a backtrace (S53) is performed using the calculated cumulative distances to obtain the recognition result, and processing ends. If step S50 gives (N), the backtrace (S53) is performed and the recognition processing ends. FIG. 10 is a timing chart of the recognition processing and recording described above.

[0032] FIG. 11 is a system configuration diagram for running the above embodiment on a personal computer. In FIG. 11, speech from a speech input device 21 such as a telephone is input via a public line network 100 to a network controller 101 and passed from the network controller 101 to the speech processing unit 22 of a personal computer 102 for speech processing, and continuous word recognition is performed on the personal computer 102. Reference numeral 103 denotes a speech synthesizer.

[0033] If the recognition processing unit 24 is configured for parallel processing, the processing can be made faster.

[0034]

[Effects of the Invention] As described above, according to the invention, recognition and recording are carried out sequentially each time recording into one buffer finishes, and processing in the recognition processing unit and processing on the personal computer proceed in parallel, so the recognition result is obtained quickly, almost at the same moment the speech ends. In addition, because a personal computer is used, the system can be built easily.

[Brief description of the drawings]

FIG. 1 is a schematic configuration diagram of an embodiment of the invention.

FIG. 2 is a configuration diagram of the sequential recording processing, the main part of the embodiment.

FIG. 3 is a structural diagram of the cyclic recording buffer memory.

FIG. 4 is a diagram outlining the recording operation into the cyclic recording buffer memory.

FIG. 5 is a flowchart showing the processing flow of the embodiment.

FIG. 6 is a flowchart of the recording processing algorithm.

FIG. 7 is a flowchart of the recognition processing in the discrete word recognition method.

FIG. 8 is a timing chart of recognition and recording in the discrete word recognition method.

FIG. 9 is a flowchart of the recognition processing in the continuous word recognition method.

FIG. 10 is a timing chart of recognition and recording in the continuous word recognition method.

FIG. 11 is a system configuration diagram of the embodiment.

FIG. 12 is a schematic diagram of a speech recognition system.

[Explanation of symbols]

21: speech input device
22: speech processing unit
23: sequential recording processing unit
24: recognition processing unit
25: cumulative distance calculation unit
26: backtrace unit

Claims

[Claims]

1. A speech recognition processing apparatus having a speech input device, such as a telephone, and a recognition processing unit that performs phoneme recognition on speech data input from the speech input device, wherein the recognition processing unit frequency-analyzes the speech data, inputs the result to an output-multiplexed neural network to perform phoneme recognition and obtain first-ranked and second-ranked phoneme candidates, takes as a local score the similarity between a phoneme in a template, held in a dictionary containing the phoneme patterns of the vocabulary to be recognized, and the first-ranked and second-ranked candidates in the recognized phoneme candidate sequence, accumulates the local scores by the DTW method to obtain an overall similarity score, and outputs as a recognition result the word whose similarity score is smallest among all the vocabulary to be recognized, the apparatus comprising: a speech processing unit that receives the speech data output from the speech input device and processes the data; a buffer memory unit, divided into a plurality of memories, into which the speech data processed by the speech processing unit is recorded sequentially; a recording end detection unit that detects whether recording into one memory of the buffer memory unit has finished; and a memory switching unit that, when the detection unit detects the end of recording into the memory, inputs the output of the buffer memory unit to the recognition processing unit and switches recording to the next memory of the buffer memory unit.

2. The speech recognition processing apparatus according to claim 1, wherein, when recording of the speech data into the buffer memory unit has finished, a backtrace is performed to extract the recognized word sequence from the recognition processing unit, thereby obtaining the recognition result.

3. A speech recognition processing method characterized in that speech data output from a speech input device is processed by a speech processing unit, the processed speech is recorded sequentially into a recording buffer memory divided into a plurality of sections, and, when recording into one section of the recording buffer memory finishes, recording into the next memory is started while phoneme recognition processing is performed in parallel.