JP2012155301A - State recognition type speech recognition method

Info

Publication number: JP2012155301A
Application number: JP2011024369A
Authority: JP
Inventors: Shigeki Watanabe; 繋樹渡邊; Jeong Beom Park; 正範朴; Gye-Cheor Park; 癸▲テツ▼ 朴
Original assignee: PARK GYE CHEOR; WRK SOLUTION CO Ltd
Current assignee: PARK GYE CHEOR; WRK SOLUTION CO Ltd
Priority date: 2011-01-21
Filing date: 2011-01-21
Publication date: 2012-08-16

Abstract

PROBLEM TO BE SOLVED: To provide an essential noise-robust technique that can recognize speech by estimating a dynamic state recognition parameter DIP of speech input data during speech recognition preprocessing on an embedded system used in a compact information device. SOLUTION: In a speech recognition method, a state recognition preprocessing section S1 (Fig. 1) comprises a section that calculates the dynamic state recognition parameter DIP for the speech input data output by an A/D conversion section, an interpreter that determines the next processing part from the calculated DIP, and a state recognition variable estimation section that calculates a variable noise processing reference parameter RTH. A state recognition generation section generates a state recognition variable IP based on the DIP, and a distribution section compares the RTH with the IP. Speech recognition preprocessing is then performed using speech extraction processing, which extracts the speech section and reduces or removes noise outside it, and noise processing, which removes ambient noise and device noise.

Description

Sound processing technology

Compact information systems that run on small information devices such as PDAs and smartphones, beginning with car navigation systems that use speech recognition technology, are now in widespread use. On small terminals, software techniques alone have had difficulty coping with background noise whose amplitude varies unpredictably over time, so supplementary means were required, such as a hardware DSP chipset, operation buttons, screen touches, or a designated device.

The invention relates to a speech recognition method. In particular, when speech recognition is performed on a desktop PC or a portable small information device, conventional speech recognition technology often fails under environmental noise; using a situation recognition type speech recognition system makes stable speech recognition possible.

The present invention relates to a speech recognition system and method, and more particularly to a speech recognition system and method for use on small information devices such as mobile phones, smartphones, PDAs, car navigation systems, tablet PCs, and notebook PCs, as well as on desktop PCs. OS-based systems and OS-based embedded systems intended for use in environments with heavy background noise normally depend for their performance on a separate external device such as a noise-reducing sound card or a headset. The invention is particularly well suited to speech recognition systems subject to such constraints.

As a means of accurately recognizing speech uttered in a noisy environment, the speech recognition preprocessing first estimates the dynamic situation recognition parameter DIP of the speech input data, removes the noise portion from the uttered speech, and supplies clean speech data so that the speech recognition system can recognize it accurately, thereby achieving a high recognition rate. A desktop PC, which can process relatively large volumes of data, can use a high-performance external sound card to acquire data that meets the required conditions. Embedded systems used in small information devices, however, are less capable than desktop PCs, so putting a speech recognition system into practical use on them requires a noise-robust technique that can recognize speech uttered in a noisy environment.

To solve the above problem, the speech recognition method according to the present invention performs speech recognition preprocessing using the following: a speech input unit through which the utterance to be recognized is input via a microphone or a wired or wireless headset; the A/D conversion unit F1 of FIG. 2, which samples and quantizes the input speech obtained from each speech input unit and converts it into a digital signal; a situation recognition calculation unit F2, which calculates the dynamic situation recognition parameter DIP for the speech input data output by the A/D conversion unit F1; an interpreter F3, which determines the next processing part from the calculated DIP; a situation recognition variable estimation unit F4, which calculates the variable noise processing reference parameter RTH; a situation recognition generation unit F6, which generates the situation recognition variable IP on the basis of the DIP; a distribution unit F7, which compares the RTH with the IP; a speech extraction unit F8, which extracts the speech section and reduces or removes noise outside it; and a noise processing unit F9, which removes ambient noise and device noise.

An object of the present invention is to provide a speech recognition method that overcomes the difficulty the prior art had in dealing with background noise whose amplitude varies unpredictably over time, and to provide a speech recognition system that further improves recognition accuracy even in noisy environments. In an embedded system such as a PDA, the microphone built into the main board of the device is affected by various kinds of noise in the surroundings. When the device is used outdoors, for example, wind noise, the speaker's breath, the operating sounds of buttons or a stylus pen, and other vibrations degrade the input acoustic signal.

Structure of the invention

In FIG. 1, the situation recognition type preprocessing unit S1 processes speech input data acquired from an unspecified speaker. The feature extraction unit S2 converts the speech input data into the information best suited to speech recognition processing. The recognition processing unit S4 performs speech recognition using the acoustic model S3 and the grammar information parameters of the grammar processing unit S5, which include the vocabulary information parameters of the vocabulary processing unit S6. As speech recognition post-processing, the natural language processing unit S7-A and the phonetic character processing unit S7-B post-process the recognized result and output the recognition result as text.

According to the present invention, with the speech recognition method configured as shown in FIG. 1, the A/D conversion unit F1 of FIG. 2 takes the analog speech signal input from the microphone under various noise conditions, configures the acoustic model parameters registered when the acoustic model was created together with the synchronization parameters, performs encoding and quantization on the basis of those parameters, and converts the signal into a digital speech signal. The synchronization parameters are, broadly, the sampling bit depth and the sampling frequency.
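The role of the two synchronization parameters can be illustrated with a minimal sketch. The function names `quantize` and `bytes_per_second`, the normalized input range of -1.0 to 1.0, and the mono stream are assumptions made for illustration; the source does not describe how F1 is implemented.

```python
def quantize(sample: float, bits: int) -> int:
    """Map an analog sample in [-1.0, 1.0] to a signed integer level.

    Hypothetical helper: `bits` stands for the sampling bit depth
    (8 or 16 in the source).
    """
    max_level = 2 ** (bits - 1) - 1        # 127 for 8-bit, 32767 for 16-bit
    clipped = max(-1.0, min(1.0, sample))  # clip out-of-range input
    return round(clipped * max_level)


def bytes_per_second(sample_rate_hz: int, bits: int) -> int:
    """Raw data rate of a mono stream for the given sampling frequency and depth."""
    return sample_rate_hz * bits // 8


if __name__ == "__main__":
    print(quantize(1.0, 8))             # full-scale sample at 8-bit depth
    print(bytes_per_second(16000, 16))  # 16 kHz, 16-bit mono stream
```

The sketch only shows why the recorded audio and the acoustic model need to agree on both values: changing either one changes the integer range or the number of samples that represent the same utterance.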

The situation recognition variable calculation unit F2 estimates the dynamic situation recognition parameter DIP of the digital speech data using the total bit value obtained when the A/D conversion unit F1 performs quantization. The DIP calculation yields the noise accuracy of the speech input data, and the result is sent to the interpreter F3 so that a corresponding variable noise processing reference parameter RTH can be created. The situation recognition variable estimation unit F4 determines the RTH and compares the DIP with the estimated RTH. Based on this comparison, the situation recognition variable generation unit F6 generates the situation recognition variable IP, and the distribution unit F7 decides whether processing branches to the speech extraction unit F8 or to the noise processing unit F9.
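The branching decision made by the distribution unit F7 can be sketched as follows. This is only an illustration: the names `Path` and `distribute` are hypothetical, IP and RTH are treated as plain numbers, and the direction of the comparison (IP greater than RTH leading to noise processing) is an assumption, since the source states only that the two values are compared.

```python
from enum import Enum


class Path(Enum):
    """Processing branches named after the units in FIG. 2."""
    SPEECH_EXTRACTION = "F8"
    NOISE_PROCESSING = "F9"


def distribute(ip: float, rth: float) -> Path:
    """Choose the next processing unit for the current frame.

    Assumption: a situation recognition variable IP above the reference
    parameter RTH routes the frame to noise processing; otherwise the
    frame goes to speech extraction.
    """
    return Path.NOISE_PROCESSING if ip > rth else Path.SPEECH_EXTRACTION


if __name__ == "__main__":
    for ip in (0.2, 0.5, 0.8):
        print(ip, distribute(ip, rth=0.5).name)
```

Because RTH is recalculated as the input changes, the same IP value can be routed differently from one frame to the next.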

FIG. 2 shows the detailed processing of the situation recognition type preprocessing unit S1 of FIG. 1. Because the speech input device may be a wired or a wireless microphone, two processing modules are provided: A/D conversion and Bluetooth reception. The situation recognition variable calculation unit F2 of FIG. 2 analyzes the stream of speech input data and calculates the dynamic situation recognition parameter DIP. For the analyzed stream, the interpreter F3 determines whether speech codec compression applies to the speech input data and distributes it accordingly. Compressed speech data undergoes decoding by the speech codec processing unit F5-A and processing by the error concealment processing unit F5-B. The situation recognition variable estimation unit F4 calculates the variable noise processing reference parameter RTH from the data stream information of the DIP. The situation recognition variable generation unit F6 then calculates the situation recognition variable IP from the RTH and the DIP.

The problem with speech recognition is that, when a speech recognition system is used in a noisy environment, ambient noise greatly degrades recognition accuracy. Misrecognition is especially likely in large-vocabulary recognition and when there are many similar words. To solve this problem, the present invention addresses noise, one of the main causes of degraded recognition performance, by calculating the accuracy of the noise itself in the situation recognition type preprocessing unit S1 of FIG. 1 and by providing the speech extraction unit F8 of FIG. 2, which transforms the data in order to extract good speech data, and the noise processing unit F9, which reduces or removes noise. The result is a noise situation recognition type speech recognition system that differs from conventional speech recognition methods.

To achieve the above object, a current frame is generated from the digitized input data. The speech extraction unit F8 generates speech frames from the digitized speech input data output by the A/D conversion unit F1. For the generated speech frames, a noise estimation module is configured using a method that estimates the noise state from the speech input data. The speech data spectrum and the noise data spectrum are then separated in the frequency domain. According to the present invention, the separated speech signal and noise signal are recombined after passing through the speech extraction process. At this point the speech extraction unit F8 extracts the speech section and also achieves partial noise reduction.
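Frame generation and speech section extraction can be illustrated with a much simpler stand-in than the frequency-domain separation described above. The sketch below splits samples into frames and marks a frame as speech when its mean energy exceeds a fixed threshold. The function names, the frame length, the hop size, and the energy criterion are all assumptions for illustration and are not taken from the source.

```python
def split_frames(samples, frame_len, hop):
    """Split a sample sequence into frames of `frame_len`, advancing by `hop`."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def speech_frames(samples, frame_len, hop, threshold):
    """Return the indices of frames whose energy exceeds `threshold`.

    Stand-in for speech section extraction: a fixed energy threshold,
    not the spectral separation the source describes.
    """
    frames = split_frames(samples, frame_len, hop)
    return [i for i, f in enumerate(frames) if frame_energy(f) > threshold]


if __name__ == "__main__":
    # Silence, a burst of signal, then silence again.
    samples = [0] * 4 + [10, -10, 10, -10] + [0] * 4
    print(speech_frames(samples, frame_len=4, hop=4, threshold=1.0))
```

A fixed threshold is the weak point of this stand-in: it fails exactly when background noise varies over time, which is the case the variable reference parameter RTH is meant to handle.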

As described above, the noise processing unit F9 performs noise processing using the noise information of the situation recognition parameter IP from the situation recognition variable generation unit F6, the frame section parameter information, and the variable noise processing reference parameter RTH. In particular, the noise processing unit F9 of the present invention consists internally of two processing modules. Which module is used depends on whether a speech section could be extracted, as decided by the situation recognition judgment that compares the IP from the situation recognition variable generation unit F6 with the RTH in the distribution unit F7. First, a result frame handed over from the speech extraction unit F8 is analyzed: non-speech section information in the current frame is deleted while the section containing both speech data and noise data is detected, and within that detected section the noise mixed into the speech data is reduced. Second, for a current frame handed over from the distribution unit F7, information on non-speech sections is continually detected so that they can be distinguished, the noise distribution of the section in which speech data and noise data are mixed is measured, and noise reduction is applied to that section so that the speech data of the current frame can be recognized.

The feature extraction unit S2 of FIG. 1 is the processing unit that obtains speech feature components from the speech signal. Using the speech feature components, it acquires and sets the utterance start information, utterance end information, and similar values carried over from the current speech frame parameters. It then analyzes the speech within the speech data section and extracts speech parameter features. The speech feature components are a distribution of the probabilities with which phoneme features appear. The speech pattern is converted per utterance unit, parameterized so that it maps onto the parameters of the acoustic model S3, and input to the recognition processing unit S4.

The acoustic model S3 is the model used for speech recognition. The speech recognition system of the present invention uses speech audio with the 8-bit or 16-bit sample depth and the 8 kHz or 16 kHz sampling frequency used in other speech recognition systems. Specifically, the acoustic model S3 is built from speech samples of at least 100 speakers in total, with roughly equal numbers of men and women: the distribution of probabilities with which sound table features appear is calculated, and the probability distribution of transitions to the state in which each next feature appears is recorded. In practice, the speech patterns represented by these probability distributions are converted per utterance unit, the sound table parameters are encoded so that the speech recognition system can read them, and the result is built into an audio database.

The vocabulary processing unit S6 extracts phoneme parameters from the grammar information, a dictionary of the expressions, words, and sentences to be recognized, so that they can be matched against the acoustic model. The phoneme parameters are then set together with the grammar processing unit S5, which converts the grammar into a form that the system can recognize. In detail, using the sound table parameter information of the acoustic model S3, the closest approximate word is selected from the vocabulary data registered in the vocabulary processing unit S6, which consists of combinations of expressions, numbers, basic words (nouns, pronouns, verbs, and so on), and basic sentences. The selected result is supplied as reference material for the recognition processing unit S4 to output its recognition result.

The speech recognition system of the present invention uses a hidden Markov model (HMM), which supports unspecified speakers: it can recognize anyone's voice without that speaker's voice having been registered. The recognition processing unit S4 in FIG. 1 links the acoustic model S3 with the grammar information, matches the speech feature components produced by the feature extraction unit S2 phoneme by phoneme, and selects the closest expression, word, or sentence from the results of the grammar processing unit S5 and the vocabulary processing unit S6.
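How an HMM picks the most likely state sequence for a series of observed features can be sketched with Viterbi decoding. The source says only that an HMM is used; the choice of Viterbi decoding, the function name `viterbi`, the two toy states `a` and `i`, the observation labels, and all probabilities below are made up for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t,
    #               previous state on that path)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            p, back = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][obs], back)
        best.append(column)

    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    # Toy model: two phoneme-like states and three quantized feature labels.
    states = ("a", "i")
    start_p = {"a": 0.6, "i": 0.4}
    trans_p = {"a": {"a": 0.7, "i": 0.3}, "i": {"a": 0.4, "i": 0.6}}
    emit_p = {
        "a": {"x": 0.5, "y": 0.4, "z": 0.1},
        "i": {"x": 0.1, "y": 0.3, "z": 0.6},
    }
    print(viterbi(["x", "y", "z"], states, start_p, trans_p, emit_p))
```

In a real recognizer the states, transition probabilities, and emission distributions would come from the acoustic model S3 and be constrained by the grammar, rather than being written out by hand.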

For the post-recognition processing in FIG. 1, the natural language processing unit S7-A and the phonetic character processing unit S7-B are provided internally as a remedy for misrecognition caused by similarity of pronunciation, the number and kind of characters, and character notation. To convert the result of the recognition processing unit S4 into the desired form and obtain the best recognition result, the natural language processing unit S7-A and the phonetic character processing unit S7-B of FIG. 1 use the phonemes of the recognition result handed over from the recognition processing unit S4 to select and encode characters and to perform output result processing.

FIG. 1 is a block diagram showing the functions of the situation recognition type speech recognition system according to the present invention and the flow of its processing. FIG. 2 shows an embodiment of the situation recognition type speech recognition method according to the present invention. The description below refers to the functional elements of FIG. 2. In an actual usage environment, the A/D conversion unit F1 shown in FIG. 2 corresponds to the microphone speech signal P10 and the A/D conversion P11 of FIG. 3. The microphone speech signal P10 captures a speech signal comprising the actual speech data together with environmental noise and noise mixed into the speech data, and the A/D conversion P11 converts this analog speech signal into a digital signal at a sampling frequency of 8 kHz or 16 kHz. During this conversion, the signal-to-noise ratio is examined, the degree of situation recognition is estimated, and the dynamic situation recognition parameter DIP is calculated.

The A/D conversion P11 converts the speech data from an analog signal to digital data. Specifically, when the signal is quantized after sampling at 8 kHz or 16 kHz, the dynamic situation recognition parameter DIP of the digital speech data is estimated from the acquired total bit value and the input stream information. The DIP of FIG. 3 calculated in this way represents the computed noise accuracy of the speech input data, from which the noise processing reference parameter RTH can be generated. So that a corresponding state recognition threshold can be created from this result, the current frame generation P12 of FIG. 3 generates the current frame, passes the speech data to each corresponding frame, and records it in the frame register.

The variable noise processing reference parameter RTH is created from the data of the frame generated by the current frame generation P12 of FIG. 3, the value estimated in the A/D conversion P11, and the dynamic situation recognition parameter DIP. The state recognition threshold of the resulting RTH is compared with the DIP and recorded. The situation recognition estimation and calculation P13 determines the DIP information of the current frame and the situation recognition threshold of the RTH, and the situation recognition judgment P14 compares them. The situation recognition judgment P14 corresponds to the distribution unit F7 of FIG. 2.

In the situation recognition judgment P14 of FIG. 3, the comparison determines whether processing branches to the speech section extraction P15 or to the noise removal and reduction P16. The speech section extraction P15 corresponds to the speech extraction unit F8 of FIG. 2, and the noise removal and reduction P16 corresponds to the noise processing unit F9 of FIG. 2. The situation recognition judgment P14 sets the situation recognition threshold of the variable noise processing reference parameter RTH for the current frame as the reference value, and decides the processing path by comparing it with the situation recognition parameter IP value, which indicates the degree of situation recognition of the current frame together with the input speech data. For example, for the current frame, [formula missing in source] speech section extraction P15 and noise removal and reduction P16 are performed. [formula missing in source] processing is performed.

The input speech data digitized by the current frame generation P12 of FIG. 3 passes through the distribution unit F7 of FIG. 2, that is, the situation recognition judgment P14 of FIG. 3. It is analyzed in the time domain and in the frequency domain, and the information needed for speech signal separation is taken from the analyzed frame data. The speech section extraction P15 of FIG. 3 re-detects the speech section of the signal using the dynamic situation recognition parameter DIP and calculates the coefficients of the separation filter. The signal separation processing is configured by updating the separation filter coefficients with the calculated speech section values, and extraction is applied to the speech signal within the speech section.

The noise processing unit F9 of FIG. 2 takes over two processed data values from the situation recognition judgment P14 of FIG. 3 [formula missing in source]. Put simply, this stage uses a traditional audio filtering algorithm, and the following describes the noise processing of the data handed over from the speech signal [formula missing in source]. A filter with a window of size 2N+1 is applied to the separated input data stream, which removes low-density impulse noise from the speech signal. When IP > RTH, the filter window size is determined from the situation recognition parameter that was handed over. The value of the filter parameter N is also calculated and adjusted so that low-density impulse noise is removed from the speech signal while estimation errors in the speech data are detected.
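A filter with a window of 2N+1 samples that removes sparse impulse noise is commonly realized as a median filter, so the sketch below uses one. This is an assumption: the source names only a traditional audio filtering algorithm and does not say it is a median filter. The function name `median_filter` and the edge handling (repeating the first and last sample) are likewise illustrative choices.

```python
def median_filter(signal, n):
    """Replace each sample with the median of the 2n+1 samples around it.

    Assumed realization of the 2N+1 window filter; edges are padded by
    repeating the first and last sample.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if not signal:
        return []
    padded = [signal[0]] * n + list(signal) + [signal[-1]] * n
    filtered = []
    for i in range(len(signal)):
        window = sorted(padded[i:i + 2 * n + 1])
        filtered.append(window[n])  # middle element of the sorted window
    return filtered


if __name__ == "__main__":
    noisy = [1, 1, 9, 1, 1]         # a single impulse in a flat signal
    print(median_filter(noisy, 1))  # window size 2*1 + 1 = 3
```

A larger N removes wider bursts but also flattens short speech transients, which is one reason to choose the window size per frame from the situation recognition parameter rather than fixing it in advance.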

The core processing of the speech recognition system and method according to the present invention is described below with reference to FIGS. 2 and 3. This embodiment describes the case where the speech recognition system of the present invention is applied to a patient observation recording application with speech input, running on a business-use PDA.

FIG. 1 shows a speech recognition method that evolves from the conventional method, which extracts phoneme features from input speech data, performs recognition with reference to an acoustic model and a language dictionary, and obtains a recognition result. The role of the language dictionary is split between the vocabulary processing unit S6 and the grammar processing unit S5. The speech recognition post-processing unit S7 of FIG. 1 is added before the recognition result is obtained, and it is divided into the natural language processing unit S7-A, which performs natural language processing, and the phonetic character processing unit S7-B, which performs phonetic character processing. The speech signal feature extraction P17 of FIG. 3 corresponds to the feature extraction unit S2 of FIG. 1. The speech recognition P18 of FIG. 3 corresponds to the recognition processing unit S4 of FIG. 1, which obtains the recognition result. The recognition result output processing P19 corresponds to the result output of the natural language processing unit S7-A and the phonetic character processing unit S7-B of FIG. 1.

The speech signal feature extraction P17 of FIG. 3, which corresponds to the feature extraction unit S2 of FIG. 1, obtains speech feature components from the input speech data, uses them to analyze the speech within the input speech data section, and extracts phoneme parameter feature information. This phoneme parameter feature information, the result of P17, is passed to the speech recognition P18.

In the speech recognition P18, the expressions, words, and sentences that a paramedic utters when recording patient observations in an ambulance are registered in the vocabulary processing unit S6 of FIG. 1, and phoneme parameters are extracted in advance so that they can be matched against the acoustic model S3 of FIG. 1. The grammar processing unit S5 then converts the registered data so that the application system can read it. When the phoneme parameter feature information resulting from P17 is received, it is matched against the acoustic model S3 and the vocabulary processing unit S6 of FIG. 1, and the list of approximate words closest to the probability distribution is selected. The natural language processing unit S7-A and the phonetic character processing unit S7-B of FIG. 1 are applied to the selected list, and the result is output as the recognition result value.

As described in detail above, the speech recognition method of the present invention recognizes speech while taking into account the speech signal-to-noise level of the surrounding environment. Using the single-instruction, multiple-data (SIMD) operations of the MMX or Wireless MMX instruction set, audio processing is accelerated by processing speech data elements in parallel on desktop PCs and PDA-based devices. Specifically, on a PC, the MMX instruction set is used as inline code in the C++ programming language. In the PDA implementation of the algorithm, the processor registers wR0 through wR15 are addressed directly and the speech data elements are processed in parallel. Multiple speech data elements are stored together and the processor operates on them simultaneously with a single instruction, which optimizes the performance of the speech recognition system on the PDA.

Effects of the invention

As described in detail above, OS-based small information devices typified by embedded systems have depended on the signal-to-noise performance of the microphone to obtain a high signal-to-noise ratio when using a speech recognition function. According to the present invention, the situation recognition parameter of the input speech data is calculated first and distribution processing is then performed. Noise removal processing removes the noise components that correspond to the type of background noise, a speech section is selected according to that type of background noise, and the recognition processing recognizes the input speech with reference to the selected speech section. Therefore, when a user speaks against background noise or in various other noisy environments, high-accuracy speech recognition can be expected even when the noise lowers the signal-to-noise ratio, regardless of the type and performance of the microphone.
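The distribution processing described above, in which a per-frame situation value is compared with a variable noise reference before frames are routed to speech extraction or noise processing, can be sketched as follows. This is an illustrative Python sketch under stated assumptions only: the text does not define how the situation recognition parameter or the reference parameter is computed, so mean squared frame amplitude stands in for the former, and a scaled average of noise-only frame energies stands in for the latter. The names `frame_energy`, `reference_threshold`, `distribute`, and the `margin` factor are hypothetical.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (assumed situation value)."""
    return sum(s * s for s in frame) / len(frame)


def reference_threshold(noise_frames: Sequence[Sequence[float]],
                        margin: float = 3.0) -> float:
    """Assumed variable reference: average noise-frame energy times a margin.

    noise_frames must be non-empty; it represents frames believed to
    contain background noise only.
    """
    energies = [frame_energy(f) for f in noise_frames]
    return margin * sum(energies) / len(energies)


def distribute(frames: Sequence[Sequence[float]], rth: float) -> List[str]:
    """Label each frame for speech extraction or for noise processing."""
    return ["speech" if frame_energy(f) > rth else "noise" for f in frames]


if __name__ == "__main__":
    noise = [[0.1, -0.1, 0.1, -0.1], [0.1, 0.1, -0.1, -0.1]]
    frames = [[0.1, -0.1, 0.1, -0.1], [0.5, -0.6, 0.7, -0.5]]
    print(distribute(frames, reference_threshold(noise)))
```

Because the reference is recomputed from recent noise-only frames, it follows the background noise level, which is the sense in which the reference parameter is described as variable.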

At present, OS-based small information devices typified by embedded systems rely on microphones with a high signal-to-noise ratio or on noise-cancelling chipsets to obtain a high signal-to-noise ratio when using a speech recognition function. The present invention resolves existing problems, such as power limits during processing and data-parallel processing, that could not previously be addressed across embedded systems ranging from personal and business PDAs to tablet PCs, and it is characterized in that the optimization can be implemented in software alone.

A diagram showing the speech recognition procedure in the speech recognition method of an embodiment of the present invention.
A diagram showing the situation recognition estimation processing flow of the present invention.
A functional block diagram of a speech recognition apparatus for carrying out the speech recognition procedure of FIG. 1.

[FIG. 1]
S1 ... Situation recognition preprocessing unit, S2 ... Feature extraction unit,
S3 ... Acoustic model, S4 ... Recognition processing unit,
S5 ... Grammar processing unit, S6 ... Vocabulary processing unit,
S7-A ... Natural language processing unit, S7-B ... Phonetic character processing unit
[FIG. 2]
F1 ... A/D conversion unit, F2 ... Situation recognition variable calculation unit,
F3 ... Interpreter, F4 ... Situation recognition variable estimation unit,
F5-A ... Speech CODEC processing unit, F5-B ... Error concealment processing unit,
F6 ... Situation recognition variable generation unit, F7 ... Distribution unit,
F8 ... Speech extraction unit, F9 ... Noise processing unit
[FIG. 3]
P10 ... Microphone speech signal, P11 ... A/D conversion,
P12 ... Current frame generation, P13 ... Situation recognition measurement and calculation,
P14 ... Situation recognition judgment, P15 ... Speech section extraction,
P16 ... Noise removal and reduction, P17 ... Speech signal feature extraction,
P18 ... Speech recognition, P19 ... Recognition result output processing

Claims

1. A speech recognition method comprising: estimating situation recognition in every situation that is subject to speech recognition on an embedded system; generating a dynamic situation recognition parameter; configuring a distribution unit using the dynamic situation recognition parameter; reducing and removing noise during speech extraction (hereinafter referred to as the noise processing algorithm); and displaying the result obtained from the noise processing algorithm as a speech recognition result or inputting the result as speech data.

2. The speech recognition method according to claim 1, wherein a dynamic situation recognition parameter and a variable noise processing reference parameter are generated when analog data is converted to digital data (hereinafter referred to as A/D conversion), and the dynamic situation recognition parameter obtained from the speech input stream is compared with the variable noise processing reference parameter to calculate the noise processing reference parameter of the situation recognition parameter under changing conditions.

3. The speech recognition method according to claim 2, wherein, when the dynamic situation recognition parameter is generated in the preprocessing part of the situation recognition processing, the data type is analyzed and classified from the speech input stream, the noise processing reference parameter is variable, and the noise processing reference parameter can additionally be recorded and used as a variable of a situation recognition parameter for which the noise processing reference has been standardized in advance.

4. The speech recognition method according to claim 3, wherein a dynamic situation recognition parameter is generated according to the analysis result of the speech input stream, and a speech section and a non-speech section are determined by comparing the dynamic situation recognition parameter with the variable noise processing reference parameter according to the analysis result.

4. The speech recognition method according to claim 3, wherein hardware characteristics using dynamic situation recognition parameter information are detected according to the analysis result of the speech input stream, and the interpreter determines that speech CODEC processing is performed.

5. The voice recognition method according to claim 4, wherein the characteristics of the voice input stream taken over by the interpreter are detected, and voice CODEC processing and error concealment processing are performed.