JP2000029486A

JP2000029486A - Speech recognition system and method therefor

Info

Publication number: JP2000029486A
Application number: JP10193850A
Authority: JP
Inventors: Shinji Wakizaka; 新路脇坂; Kazuo Kondo; 和夫近藤; Hiroaki Kokubo; 浩明小窪
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1998-07-09
Filing date: 1998-07-09
Publication date: 2000-01-28

Abstract

PROBLEM TO BE SOLVED: To correctly distinguish speech from noise in an actual use environment. SOLUTION: To distinguish a speech block from a non-speech block over the whole block from the start to the end of a speech with respect to the inputted speech, this speech recognition system is provided with a speech block detecting part 103 for detecting the speech block by detecting information on the non-speech block. This detecting part 103 detects power levels of steady or unsteady noise and all other sound, and specifies the power level indicating the start and end of the speech in accordance with the power level fluctuating every moment. If the power level of an inputted speech exceeds the above specified power level, it is judged as the start of the speech, and if it is lower than the specified power level for a predetermined period of time or longer, it is judged as the end of the speech.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a speech recognition system and method, and in particular to a speech recognition system and method for use in car navigation systems, in-vehicle PCs, small information devices typified by PDAs and handheld PCs, portable speech translators, games, and home appliances. The invention is especially well suited to speech recognition under time-varying noise, as found in car navigation systems and in-vehicle PCs.

[0002]

2. Description of the Related Art

In recent years, small information systems that use speech recognition technology have been becoming widespread. Examples include car navigation systems, small information devices typified by PDAs, and portable translators. As one example of such a speech recognition system, Japanese Patent Application Laid-Open No. 5-35776, "Translation device with automatic language selection function", discloses a portable translation device that recognizes the operator's speech input through a microphone, translates it, and outputs speech in the translated language.

[0003] An outline of a speech translation device according to this prior art is described below with reference to FIG. 7. FIG. 7 is a block diagram showing the configuration of a speech translation device according to the prior art. The control unit 701 comprises a microprocessor or the like and controls each part of the device. The speech section extraction unit 702 converts the speech input from the microphone 709 into a digital signal, cuts out the speech section, and sends it to the speech recognition unit 703. Under instruction from the control unit 701, which acts on operation signals input over line 711 from a keyboard, switches, or the like, the speech recognition unit 703 analyzes the speech that has arrived through the microphone 709 and been cut out by the speech section extraction unit 702. It then performs speech recognition by comparing the analysis result with the standard speech patterns stored in the speech recognition dictionary unit 707. The speech synthesis unit 705 reads the translated word corresponding to the recognized speech from the translated-word data memory card 706, converts it into an audio signal, and outputs it through the speaker amplifier 710 and the speaker 708.

[0004] The display unit 704 displays instructions to the user of the translation device and shows translated words as text. The translated-word data memory card 706 is a ROM card or the like and, when translated words are to be output as synthesized speech, stores the speech data. Character codes corresponding to the translated words are also read from the translated-word data memory card 706 and displayed as text on the display unit 704. Exchanging the translated-word data memory card 706 for one in another language makes translation into multiple languages possible. The speech recognition dictionary unit 707 comprises RAM or the like and stores standard speech patterns matched to the operator's utterances. The operator stores these standard speech patterns in advance.

[0005]

SUMMARY OF THE INVENTION

Problems to Be Solved by the Invention

Against the background of advances in semiconductor technology, the field of speech recognition and speech synthesis is expected to develop because of the demand that systems provide a more human-friendly user interface. Small information systems using the conventional speech recognition technology described above are also expected to spread further, in the form of car navigation systems, portable information devices typified by PDAs, portable translators, and information appliances with speech interfaces. However, because speech recognition must process an enormous amount of information, conventional techniques have to limit the number of recognizable words so that the recognition rate and recognition response time do not degrade. For this purpose, for each word or sentence registered in advance, the statistical features of speakers' speech associated with its character string are compared with the features of the speech the speaker actually uttered, and the probabilistically closest entry is taken as the recognition result. Recognition rate and response time are expected to improve with future innovation in speech recognition and with better software and hardware to implement it. From a practical standpoint, however, what matters is recognition performance in the environment where the system is actually used. In the speech recognition system of a car navigation system, for example, the environment differs depending on whether the engine is stopped, the engine is idling, or the vehicle is traveling on an ordinary road, in an urban area, on a highway, or in a tunnel, and the environment changes within a relatively short time. In such an environment, a speech recognition system that relies on speech section detection suffers reduced detection accuracy.

[0006] An object of the present invention is to solve the above drawbacks and to provide a speech recognition system whose recognition performance does not deteriorate in the actual use environment.

[0007]

Means for Solving the Problems

To achieve the object of the present invention, the speech recognition system according to the invention, which has a dictionary collecting the words and sentences to be recognized and which displays, or outputs as speech, the content obtained from that dictionary as the recognition result, is provided with a speech detection unit that detects a speech section based on information about a detected non-speech section, and performs speech recognition on the detected speech section.

[0008] In this speech recognition system, a noise threshold is obtained from the power of the non-speech section. The noise threshold is compared with the power of the speech section, and speech recognition processing is started from around the point at which the power of the speech section reaches the noise threshold. Alternatively, the noise threshold is compared with the power of the speech section, and speech recognition processing is performed going back a predetermined time from the point at which the power reaches the noise threshold. The noise threshold is obtained from the average power of a set of a predetermined number of frames, a frame being the unit in which the power of speech or noise is analyzed. A power threshold is obtained based on the noise threshold. When the power of the speech section exceeds the noise threshold and reaches the power threshold, this is judged to be the start of speech, and speech recognition processing is performed from a predetermined time before that point. A speech input button is provided; after the button has been pressed, when the power of the speech section reaches the noise threshold, the frame (the unit of speech analysis) at that point is stored. When the power of the speech section then reaches the power threshold, speech recognition processing is performed starting at least from the stored frame.
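The stored-frame behavior described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_speech_start`, the use of a plain list of per-frame power values, and the rule that a candidate is forgotten when the power drops back below the noise threshold are all assumptions made for the example.

```python
def find_speech_start(frame_powers, nth, pth):
    """Return the index of the frame from which recognition should start, or None.

    The first frame whose power reaches the noise threshold (nth) is stored as
    a candidate. If the power later reaches the power threshold (pth), speech is
    judged to have started and the stored frame is returned, so the quiet onset
    of the utterance is not lost.
    """
    candidate = None
    for i, power in enumerate(frame_powers):
        if power < nth:
            # Assumption: dropping back below nth discards the stored frame.
            candidate = None
        elif candidate is None:
            candidate = i  # power reached nth: store this frame
        if candidate is not None and power >= pth:
            return candidate  # power reached pth: start from the stored frame
    return None
```

With `nth=0.5` and `pth=1.2`, the power sequence `[0.1, 0.2, 0.6, 0.9, 1.5, 2.0]` crosses the noise threshold at frame 2 and the power threshold at frame 4, so the function returns 2 and the two onset frames are included in recognition.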

[0009] In the speech recognition system described above, if the period during which the power of the speech section stays below the power threshold is no longer than a predetermined time, that period is judged to be an unvoiced portion between speech sounds. If, after the power of the speech section has fallen to or below the power threshold, the power remains below the power threshold for a predetermined time, the speech section is judged to have ended. The speech recognition method according to the present invention, which displays, or outputs as speech, the content picked up as the recognition result from a dictionary collecting the words and sentences to be recognized, has a step of detecting a speech section based on information about a detected non-speech section and a step of performing speech recognition on the detected speech section. This speech recognition method is provided with a step of obtaining a noise threshold from the power of the non-speech section.
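The end-of-speech rule can be sketched as a run-length check on frames below the power threshold. This is a minimal illustration under stated assumptions: the name `find_speech_end`, the parameter `hangover_frames` (the "predetermined time" expressed in frames), and the choice to end the section only when the low run exceeds that count are hypothetical; the text does not say how the boundary case is handled.

```python
def find_speech_end(frame_powers, start, pth, hangover_frames):
    """Return the index of the last speech frame, or None if speech has not ended.

    `start` is assumed to be a frame whose power has already reached pth.
    A run of low-power frames no longer than `hangover_frames` is treated as an
    unvoiced gap inside the utterance; a longer run ends the speech section.
    """
    low_run = 0
    for i in range(start, len(frame_powers)):
        if frame_powers[i] < pth:
            low_run += 1
            if low_run > hangover_frames:
                return i - low_run  # last frame at or above pth
        else:
            low_run = 0  # power recovered: the dip was only a gap
    return None
```

Choosing `hangover_frames` trades responsiveness against robustness: a small value ends the section quickly but may split a word at a short pause, while a large value bridges pauses at the cost of a later end decision.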

[0010] This speech recognition method has a step of comparing the noise threshold with the power of the speech section and starting speech recognition processing from around the point at which the power reaches the noise threshold. The method has a step of obtaining a noise threshold from the power of the non-speech section and a step of obtaining a power threshold based on the noise threshold. The method has a step of judging a period to be an unvoiced portion between speech sounds if the period during which the power of the speech section stays below the power threshold is no longer than a predetermined time. The method has a step of judging that the speech section has ended when, after the power of the speech section has fallen to or below the power threshold, a predetermined time elapses with the power remaining in that state.

[0011] In the speech recognition system according to the present invention, the words and sentences to be recognized are collected and defined as a dictionary, and the recognized words and sentences are picked up from the dictionary and output as a character string display, as an image represented by the word, or as speech produced by speech synthesis. The system comprises: a speech section detection unit that, in order to distinguish the speech section running from the start to the end of the input speech from non-speech sections containing no speech, detects the speech section while detecting information about the non-speech sections; a speech analysis unit that performs speech analysis processing on the captured speech; an acoustic model that holds speech patterns in phoneme units; and a speech recognition unit that links the acoustic model and the dictionary and performs speech recognition processing on the speech analysis result. Speech recognition is performed on the speech detected in the speech section.

[0012] In this speech recognition system, the speech section detection unit detects the power level of the sound in the non-speech section and, in accordance with that constantly changing power level, determines the speech power level that marks the start and end of speech. It judges that speech has started when the input speech exceeds that power level, and that speech has ended when the input stays below that power level for a fixed time or longer. The speech section detection unit may also detect the power level of the sound in the non-speech section, calculate from it two thresholds, a noise threshold and a power threshold, and judge that speech has started when the input speech exceeds the noise threshold and then also exceeds the power threshold. When the power level of the speech stays below the threshold for a fixed time or longer, speech is judged to have ended. The power level is the average power obtained over a plurality of frames, each frame being delimited by a predetermined time unit, and the noise threshold is set to N times that average power. The power threshold PTH is set relative to the noise threshold NTH so that PTH > NTH. The relationship between NTH and PTH is set so that ΔPTH = PTH - NTH is small in a relatively quiet environment and, conversely, large in a noisy environment.
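The threshold relations above can be sketched as follows. The text gives only the relations NTH = N × (average power), PTH > NTH, and a ΔPTH that grows with the noise level; every numeric value below, the two-level switch between a quiet and a noisy margin, and the name `thresholds` are assumptions chosen for illustration.

```python
def thresholds(avg_noise_power, n=2.0, quiet_delta=0.5, noisy_delta=4.0,
               noisy_above=1.0):
    """Return (NTH, PTH) for the current average non-speech power.

    NTH is N times the average power. PTH = NTH + dPTH, where dPTH is small in
    a quiet environment and large in a noisy one, so PTH > NTH always holds.
    """
    nth = n * avg_noise_power
    # Assumption: a simple two-level margin selected by the noise level.
    delta_pth = noisy_delta if avg_noise_power > noisy_above else quiet_delta
    return nth, nth + delta_pth
```

A larger margin in noisy conditions makes it harder for a noise burst to be mistaken for speech, while the smaller margin in quiet conditions keeps soft utterances detectable.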

[0013] In the speech recognition method according to the present invention, the words and sentences to be recognized are collected and defined as a dictionary, and the recognized words and sentences are picked up from the dictionary and output as a character string display, as an image represented by the word, or as speech produced by speech synthesis. The method has: a step of detecting the speech section while detecting information about the non-speech sections, in order to distinguish the speech section running from the start to the end of the input speech from non-speech sections containing no speech; a step of performing speech analysis processing on the speech detected in the speech section; and a step of linking an acoustic model that holds speech patterns in phoneme units with the dictionary and performing speech recognition processing on the speech analysis result.

[0014] This speech recognition method has a step of detecting the power level of the sound in the non-speech section, and a step of determining, in accordance with that constantly changing power level, the speech power level that marks the start and end of speech, judging that speech has started when the power level of the input speech exceeds that level, and judging that speech has ended when it stays below that level for a fixed time or longer. The method also has a step of detecting the power level of the sound in the non-speech section and calculating a noise threshold and a power threshold from it, and a step of judging that speech has started when the power level of the input speech exceeds the noise threshold and then also exceeds the power threshold. The method has a step of judging that speech has ended when the power level of the input speech stays below the power threshold for a fixed time or longer. The method has a step of obtaining, as the power level, the average power over a plurality of frames, each frame being delimited by a predetermined time unit, and a step of setting the noise threshold to N times that average power. The method has a step of setting the power threshold PTH relative to the noise threshold NTH so that PTH > NTH. The method also has a step of setting the relationship between NTH and PTH so that ΔPTH = PTH - NTH is small in a relatively quiet environment and, conversely, large in a noisy environment.

[0015] To achieve the above object, the speech recognition system of the present invention is a system in which the words and sentences to be recognized are collected and defined as a dictionary and in which, as the recognition result, those words and sentences are picked up and output as a character string display, as an image represented by the word, or as speech produced by speech synthesis. The system comprises: a speech section detection unit that, in order to distinguish the speech section running from the start to the end of the input speech from sections that are not speech, detects the speech section while constantly detecting information about the non-speech sections; a speech analysis unit that performs speech analysis processing on the captured speech; an acoustic model that holds speech patterns in phoneme units; and a speech recognition unit that links the acoustic model and the dictionary and performs speech recognition processing on the speech analysis result. Speech recognition is performed on the speech detected in the speech section.

[0016] More specifically, the speech section detection unit, which detects the speech section while constantly detecting information about non-speech sections in order to distinguish the speech section from the start to the end of the input speech from sections that are not speech, detects the power level of all non-speech sound, whether stationary noise, non-stationary noise, or the sound of a quiet environment. In accordance with that constantly changing power level, it determines the speech power level that marks the start and end of speech, judges that speech has started when the input speech exceeds that power level, and judges that speech has ended when the input stays below that power level for a fixed time or longer.

[0017] In more detail, the speech section detection unit constantly detects the power level of all non-speech sound, whether stationary noise, non-stationary noise, or the sound of a quiet environment, and calculates two thresholds, a noise threshold NTH and a power threshold PTH, as the speech power levels that mark the start and end of speech. It judges that speech has started when the input speech exceeds the noise threshold NTH and then also exceeds the power threshold PTH, and judges that speech has ended when the input stays below the power threshold PTH for a fixed time or longer.

[0018] In more detail, the noise threshold NTH is obtained by calculating the power of the constantly incoming non-speech sound, whether stationary noise, non-stationary noise, or the sound of a quiet environment, and is set to N times the average of the frame power PW over frames delimited by short time units. The power threshold PTH is set relative to the noise threshold NTH so that PTH > NTH. In still more detail, the relationship between the noise threshold NTH and the power threshold PTH is set so that ΔPTH = PTH - NTH is small in a relatively quiet environment and, conversely, large in a noisy environment.

[0019]

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments of the speech recognition system and method according to the present invention are described below using several examples shown in FIGS. 1 to 6. FIG. 1 is a block diagram showing the functions of the speech recognition system according to the invention and the flow of its processing. In the actual use environment, noise and speech that depend on the environment are captured through the microphone 101 shown in FIG. 1. The captured noise and speech, which form an analog signal, are converted from analog data to digital data by the A/D converter 102 at an arbitrarily chosen sampling period. In the course of this conversion, a high-pass filter (HPF, not shown) or the like is applied before or after the conversion, for example to remove stationary noise.
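As an illustration of the high-pass filtering step, the sketch below applies a first-order high-pass filter to digitized samples. The text does not specify the filter, so the first-order recurrence, the coefficient `alpha`, and the name `high_pass` are assumptions; a real system would choose the filter order and cutoff to match the noise it needs to suppress.

```python
def high_pass(samples, alpha=0.95):
    """First-order high-pass filter: y[n] = alpha * (y[n-1] + x[n] - x[n-1]).

    Slowly varying components (such as a constant offset or low-frequency
    rumble) are attenuated, while rapidly changing components pass through.
    """
    out = []
    prev_x = 0.0
    prev_y = 0.0
    for x in samples:
        y = alpha * (prev_y + x - prev_x)
        out.append(y)
        prev_x, prev_y = x, y
    return out
```

Feeding the filter a constant input makes the output decay toward zero, whereas a rapidly alternating input keeps most of its amplitude.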

[0020] For the converted digital data, which comprises non-stationary noise, noise that the high-pass filter could not remove, and speech, the speech section detection unit 103 calculates the power frame by frame. A frame is a unit obtained by dividing the incoming noise and speech into short intervals (5 to 20 ms) in order of arrival. Power is the zeroth-order autocorrelation of the data sampled within the frame. The sampled data are, electrically, voltage values over time. These values are normalized and the value at each sampling point is squared; summing the squared values of the sampling points gives the total power value of the frame. From the calculated power values, letting PW be the total power of frames 1 through i (for example, i = 32), the average power over frames 1 through i is PW/i. Each time one frame elapses, the oldest data is discarded and the data of the new frame is added, which updates the average power. From the average power PW/i obtained in this way, the parameters needed for speech section detection, namely the noise threshold NTH and the power threshold PTH, are determined. The noise threshold NTH and the power threshold PTH are described in detail later.
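The frame power and its running average can be sketched as follows. This is a minimal illustration: the names `frame_power` and `RunningAveragePower` are hypothetical, samples are assumed to be already normalized, and frame power is taken to be the sum of squared samples (the zeroth-order autocorrelation).

```python
from collections import deque


def frame_power(samples):
    """Zeroth-order autocorrelation of one frame: the sum of squared samples."""
    return sum(s * s for s in samples)


class RunningAveragePower:
    """Average power over the most recent `window` frames (the text uses i = 32)."""

    def __init__(self, window=32):
        self.powers = deque(maxlen=window)  # the oldest frame drops out when full
        self.total = 0.0                    # PW: sum of frame powers in the window

    def update(self, samples):
        """Add one frame and return the updated average power PW / i."""
        if len(self.powers) == self.powers.maxlen:
            self.total -= self.powers[0]    # discard the oldest frame's power
        power = frame_power(samples)
        self.powers.append(power)
        self.total += power
        return self.total / len(self.powers)
```

Keeping the running total means each update costs a constant amount of work regardless of the window length, which suits the per-frame update described in the text.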

[0021] The converted digital speech data is further subjected to preprocessing such as noise processing, speech analysis, and speaker adaptation by the speech analysis unit 104, and the speech recognition unit 105 matches the speech using the analysis result. Speech recognition here consists of two processes. The first is speech section detection, in which the speech section detection unit 103 correctly detects speech in a noisy environment. The second, performed by the speech analysis unit 104 and the speech recognition unit 105, analyzes the speech signal, breaks it down into phonemes over short intervals (5 to 20 ms), analyzes the resulting pattern, and selects the corresponding word or sentence from the dictionary. Through these two processes, the speech recognition system outputs the speech recognition result 109.

[0022] The speech recognition unit 105 matches the analysis result of the input speech produced by the speech analysis unit 104, phoneme by phoneme, against the acoustic model 106 and the word dictionary 107 as linked by the model connection unit 108, and picks up the closest word in the registered word dictionary 107. The acoustic model 106 is the model used for speech recognition. Specifically, it holds the correspondence between the characters used in the word dictionary 107 and phonemes, and stores the distribution of the probability that a phoneme feature appears and the distribution of the probability that, after a feature has appeared, the state moves to one in which a given next feature appears. In practice, parameters describing the distributions are stored, and the distributions are calculated from those parameters each time speech recognition is performed. This reduces the memory capacity of the speech recognition system as a whole. For the acoustic model 106, so-called speaker-independent operation, in which anyone's voice can be recognized without registering it in advance, is becoming common. A hidden Markov model (HMM), for example, can be used as such an acoustic model.
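The idea of storing only distribution parameters and linking phoneme models into word models can be sketched with a toy HMM. Everything specific here is an assumption made for illustration: the one-dimensional feature, the Gaussian output distribution, the `PHONEMES` table and its values, the left-to-right topology with one state per phoneme, and the function names. A real recognizer would use multi-dimensional features and richer models.

```python
import math

# Hypothetical stored parameters per phoneme:
# (mean, variance, probability of staying in the same state).
PHONEMES = {
    "a": (1.0, 0.25, 0.6),
    "i": (3.0, 0.25, 0.6),
    "u": (5.0, 0.25, 0.6),
}


def log_gauss(x, mean, var):
    """Log density of N(mean, var) at x, computed on demand from the parameters."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_score(features, phoneme_seq):
    """Best-path log likelihood for a left-to-right chain of phoneme states."""
    states = [PHONEMES[p] for p in phoneme_seq]
    neg_inf = float("-inf")
    score = [neg_inf] * len(states)
    # Every path must begin in the first state.
    score[0] = log_gauss(features[0], states[0][0], states[0][1])
    for x in features[1:]:
        new = [neg_inf] * len(states)
        for j, (mean, var, stay) in enumerate(states):
            best = score[j] + math.log(stay)  # remain in state j
            if j > 0:
                advance = score[j - 1] + math.log(1 - states[j - 1][2])
                best = max(best, advance)     # or arrive from state j - 1
            new[j] = best + log_gauss(x, mean, var)
        score = new
    return score[-1]  # every path must end in the last state


def recognize(features, dictionary):
    """Pick the dictionary word whose linked phoneme model scores highest."""
    return max(dictionary, key=lambda w: viterbi_score(features, dictionary[w]))
```

Here `dictionary` maps each word to its phoneme sequence, which plays the role of the link between the word dictionary and the acoustic model; only three numbers per phoneme are stored, and the densities are evaluated when a recognition is run.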

【００２３】単語辞書１０７は、言葉、単語（名詞、動
詞等）、文章を集めたものである。例えば、カーナビゲ
ーションシステムにおいては、通り名、地名、建造物
名、町名、番地、交差点名、個人住宅（個人名）、電話
番号等や、必要最小限の会話に必要な言葉の集合体であ
る。この単語辞書１０７は、システムの能力に応じて一
つの辞書あたり、例えば１０〜５０００語の単語で構成
する。以上から、実際の環境で使用する音声認識システ
ムとは、雑音と音声とを正しく見極める音声区間検出
と、音声信号を解析して、それを短い時間ごとの音素と
して分析して、そのパターンを解析し、該当する単語や
文章を辞書から選択することである。なお、図１に示す
各処理ブロックは、複数のＬＳＩやメモリで構成された
システムであっても、半導体素子上に構成された一つな
いし複数のシステムオンチップであってもよい。また、
各処理は、専用ＬＳＩや専用ＩＣで処理するハードウエ
アであっても、ＤＳＰやＲＩＳＣマイコン等のソフトウ
エアで実現したミドルウエアであってもよい。The word dictionary 107 is a collection of terms, words (nouns, verbs, etc.), and sentences. In a car navigation system, for example, it is a collection of street names, place names, building names, town names, street addresses, intersection names, private residences (personal names), telephone numbers, and the like, plus the words needed for a bare minimum of conversation. Depending on the capability of the system, each dictionary contains, for example, 10 to 5000 words. In summary, a speech recognition system for use in a real environment performs speech section detection, which correctly separates noise from speech, and then analyzes the speech signal as phonemes over short intervals, analyzes the resulting pattern, and selects the corresponding word or sentence from the dictionary. The processing blocks shown in FIG. 1 may be a system composed of multiple LSIs and memories, or one or more systems-on-chip formed on a semiconductor device. Each process may be implemented as hardware in a dedicated LSI or dedicated IC, or as middleware realized in software on a DSP, RISC microcomputer, or the like.

【００２４】図２（ａ）は音声入力波形図、図２（ｂ）
は音声パワーを示す特性図である。図２において、横軸
は時間ｔを示し、図２（ａ）の縦軸は電圧Ｖを、図２
（ｂ）の縦軸はパワーＰを示す。図２は図１で説明した
音声認識システムをカーナビゲーションシステムで使用
した場合、話者が車内で発声した音声、「渋谷（しぶ
や）ｓｈｉｂｕｙａ」の音声入力波形と音声のパワーを
示す。図２（ａ）の音声入力波形２０１は話者が平常の
音声で「しぶや」と発声したときの音声波形を示す。音
声信号は、時々刻々と変化する非定常な信号である。FIG. 2(a) is a speech input waveform diagram, and FIG. 2(b) is a characteristic diagram showing speech power. In FIG. 2, the horizontal axis represents time t; the vertical axis represents voltage V in FIG. 2(a) and power P in FIG. 2(b). FIG. 2 shows the speech input waveform and the speech power for the utterance "Shibuya" (shibuya) spoken inside a car when the speech recognition system described with FIG. 1 is used in a car navigation system. The speech input waveform 201 in FIG. 2(a) is the waveform obtained when the speaker says "Shibuya" in a normal voice. A speech signal is a non-stationary signal that changes from moment to moment.

【００２５】このときの周囲の環境は、比較的静かな一
般道路を４０ｋｍ／ｈで走行している乗用車の車内であ
る。車の窓はすべて閉められており、ラジオやカーステ
レオもオフされ、エアコンの出力は低い値に設定されて
いる。この音声信号を２０ｍｓの短時間で切り出して見
ると、定常信号と同様なスペクトル音声分析ができる。
切り出された音声信号のサンプル値から、例えば、音声
分析で広く用いられているＬＰＣ分析において、自己相
関関数を計算すると、音声の特徴パラメータの一つとし
て、音声のパワーが求められる。曲線２０２は音声波形
２０１の音声信号から計算されたパワーであり、時間ｔ
に対するパワーの変化を表わしている。The surrounding environment here is the interior of a passenger car traveling at 40 km/h on a relatively quiet ordinary road. All windows are closed, the radio and car stereo are off, and the air conditioner is set to a low output. When this speech signal is cut into short 20 ms segments, each segment can be spectrally analyzed in the same way as a stationary signal. Computing the autocorrelation function from the sample values of a segment, as is done for example in LPC analysis, which is widely used in speech analysis, yields the speech power as one of the feature parameters of the speech. Curve 202 is the power calculated from the speech signal of waveform 201 and represents the change in power over time t.
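
The per-frame power computation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 8 kHz sampling rate, the non-overlapping framing, and the division by the frame length are all assumptions. The only point taken from the text is that the power of a 20 ms frame falls out of the zero-lag autocorrelation.

```python
def autocorrelation(frame, lag):
    """Autocorrelation of one frame at the given lag: sum of x[n] * x[n + lag]."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def frame_powers(samples, sample_rate_hz=8000, frame_ms=20):
    """Split samples into non-overlapping frames and return the power of each.

    The power of a frame is taken as its zero-lag autocorrelation R(0),
    divided by the frame length (the normalization is an assumption).
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(autocorrelation(frame, 0) / frame_len)
    return powers
```

Plotting the returned list against the frame index would give a curve of the same kind as curve 202.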

【００２６】ここで、この「ｓｈｉｂｕｙａ」の音声に
対して音声認識を正しく行うためには、音声区間検出を
する必要がある。そのためには、このパワー情報に対し
て、しきい値を任意に、すなわち、予め定められた計算
式で計算した値や実験から得られる値に設定し、入力さ
れた雑音および音声毎にこのしきい値を超えたかどうか
を観測する。この観測は、音声区間検出部１０３で行
う。To recognize this "shibuya" utterance correctly, the speech section must be detected. To do so, a threshold is set as desired on this power information, that is, to a value computed with a predetermined formula or a value obtained by experiment, and each input of noise and speech is checked against the threshold. This check is performed by the speech section detection unit 103.

【００２７】雑音および音声を常に入力して以上の観測
を行ってもよいが、特に、カーナビゲーションシステム
に代表されるような複数の処理を実行しているシステム
ではできるだけＣＰＵの負荷を軽減して低消費電力化し
たい。よって、音声認識を行うときだけ雑音や音声を取
り込むようにするために、音声入力ボタンを押した時点
から雑音や音声が入力されるようにする。図２（ａ）に
おいて、２０４は音声入力ボタンが押された時点をしめ
す。音声入力ボタンとしては音声を取り入れる間ボタン
を押し続ける様にしてもよいし、最初に音声入力用ボタ
ンを押して音声を取り入れ、音声及び雑音が予め定めれ
れた値以下になるとこれを検知して自動的に音声の入力
がオフされるようにしてもい。Noise and speech could be input continuously for this check, but a system that runs several processes at once, such as a car navigation system, should keep the CPU load as low as possible to save power. Noise and speech are therefore captured only while speech recognition is being performed, starting from the moment the speech input button is pressed. In FIG. 2(a), 204 indicates the moment the speech input button is pressed. The button may be held down for as long as speech is being captured, or it may be pressed once to start capture, with input switched off automatically when the speech and noise are detected to have fallen to or below a predetermined value.

【００２８】図２（ｂ）において、レベルＮＴＨは雑音
と音声を区別する第１段階のしきい値を示し、ノイズし
きい値と称する。この音声の始まり近傍は、雑音と音声
とが同じパワーレベルであり、音声なのか雑音なのか区
別が困難である。そこで、ノイズしきい値ＮＴＨを超え
たら音声の始まりの可能性が高いと判断し、ノイズしき
い値ＮＴＨとパワー値Ｐを示す曲線２０２とが交わる時
点（又は交点）２０５のフレーム位置を記憶しておく。
つぎに、ＰＴＨは音声であることを見極めるための第２
段階のしきい値であり、パワーしきい値と称する。ここ
では、かなりのパワーレベルを検出することから、音声
であることが分かる。その時点、すなわち、パワーしき
い値ＰＴＨとパワー値Ｐを示す曲線２０２とが交わる時
点（又は交点）２０６は音声であることが分かる。した
がって、音声であることを検出した時点２０６におい
て、時点２０５で記憶したフレーム位置から認識処理を
開始する。あるいは、記憶したフレームよりもkフレー
ム前のフレームから認識処理を開始する。これにより、
雑音に埋もれた音声の始まりを検出して正しい認識が可
能となる。In FIG. 2(b), the level NTH is the first-stage threshold for distinguishing noise from speech and is called the noise threshold. Near the beginning of an utterance, noise and speech have about the same power level, so it is hard to tell which is which. Therefore, when the power exceeds the noise threshold NTH, the beginning of speech is judged likely, and the frame position at the time point (or intersection) 205 where NTH crosses curve 202 of the power value P is stored. Next, PTH is the second-stage threshold for confirming that the input is speech and is called the power threshold. Because a substantial power level is detected here, the input is known to be speech. That is, the input is known to be speech at the time point (or intersection) 206 where the power threshold PTH crosses curve 202. Accordingly, at time point 206, when speech has been confirmed, recognition processing is started from the frame position stored at time point 205. Alternatively, recognition processing is started from a frame k frames before the stored frame. This makes it possible to detect the beginning of speech buried in noise and recognize it correctly.
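
The two-stage start detection described above can be sketched as follows. The function name and signature are hypothetical, and one behavior is an assumption not stated in the text: the stored frame is discarded if the power drops back to NTH or below before PTH is reached.

```python
def detect_speech_start(powers, nth, pth, k=0):
    """Return the frame index from which recognition should start, or None.

    powers: per-frame power values (the curve P).
    nth:    noise threshold NTH (first stage, "possibly speech").
    pth:    power threshold PTH (second stage, "definitely speech").
    k:      optional number of frames to back off before the stored frame.
    """
    candidate = None  # frame where the power first rose above NTH (point 205)
    for idx, p in enumerate(powers):
        if p <= nth:
            candidate = None      # assumption: fell back to noise, forget it
        elif candidate is None:
            candidate = idx       # remember the NTH crossing
        if p > pth and candidate is not None:
            # Speech confirmed (point 206): start from the stored frame,
            # or k frames earlier, without going before the first frame.
            return max(0, candidate - k)
    return None
```

For example, with `powers = [1, 1, 3, 4, 12, 15]`, `nth=2`, and `pth=10`, the power first exceeds NTH at frame 2 and PTH at frame 4, so recognition starts from frame 2, or from frame 1 when `k=1`.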

【００２９】ノイズしきい値ＮＴＨの計算は音声入力用
ボタンを押さなくても、雑音及び音声はマイク１０１を
通して入力され、後述するＲＡＭに書き込まれるように
して常にノイズしきい値ＮＴＨを計算するようにし、音
声入力用ボタンが押される直前のＮＴＨの値を採用す
る。このように入力ボタンを押す前のｉフレーム分のデ
ータからノイズしきい値ＮＴＨを求め、ノイズしきい値
ＮＴＨとパワー値の比較を１フレーム毎に行うことによ
って、時点２０５を検出することができるし、この時点
２０５より前から認識処理を開始することができる。ま
た、音声には必ず文字と文字の間に無音声部分が存在す
る。例えば「渋谷」では、「ｓｈｉ」と「ｂｕ」、「ｂ
ｕ」と「ｙａ」の間に無音声部分が存在する。このと
き、音声のパワーレベルは、パワーしきい値ＰＴＨより
も低くなる。ところが音声はまだ終了していないことか
ら、音声が終了したと判断しては誤りである。よって、
音声のパワーレベルがパワーしきい値ＰＴＨより低くな
っても、その期間がある設定フレーム数未満ならば、ま
だ音声が終了していないとして認識処理を継続する。逆
に、その期間がある設定フレーム数以上ならば、音声が
終了したと判断して認識処理を終了する。図２（ｂ）に
おいて、時点２０７は音声が終了したと判断した時点で
あり、認識処理はその時点からｊフレーム後に終了する
ものとする。音声が終了したと判断した時点からｊフレ
ーム後まで音声パワーをパワーしきい値と比較するの
は、実際に音声が終了したのか、又は音声と音声の間の
無音部分なのかを判断するためである。なお、ｊフレー
ムの値は実験によって予め定められる。これにより、音
声の始まりと同様に、雑音に埋もれた音声の終わりを検
出して正しい認識が可能となる。この場合、ノイズしき
い値ＮＴＨはパワーしきい値ＰＴＨを超えないように設
定する。以上のようにして、正しい音声認識に必要な音
声区間検出である音声区間２０３が検出される。As for calculating the noise threshold NTH, noise and speech are input through the microphone 101 and written to the RAM described later even when the speech input button has not been pressed, so that NTH is calculated continuously, and the value of NTH immediately before the button is pressed is adopted. Because NTH is obtained in this way from the i frames of data preceding the button press and compared with the power value frame by frame, time point 205 can be detected, and recognition processing can even be started from before time point 205. In addition, speech always contains silent portions between characters. In "Shibuya", for example, there are silent portions between "shi" and "bu" and between "bu" and "ya". In these portions the speech power level falls below the power threshold PTH. Since the utterance has not yet ended, it would be a mistake to judge that it has. Therefore, even when the speech power level falls below PTH, recognition processing continues, on the assumption that the utterance has not ended, as long as the low-power period is shorter than a set number of frames. Conversely, if the period lasts for the set number of frames or more, the utterance is judged to have ended and recognition processing ends. In FIG. 2(b), time point 207 is the point at which the utterance is judged to have ended, and recognition processing ends j frames after that point. The speech power is compared with the power threshold from that point until j frames later in order to determine whether the utterance has really ended or whether this is merely a silent portion within it. The value of j is determined in advance by experiment. As with the beginning of speech, this makes it possible to detect the end of speech buried in noise and recognize it correctly. The noise threshold NTH is set so that it does not exceed the power threshold PTH. In this way, the speech section 203 needed for correct speech recognition is detected.
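
The end-of-speech rule described above, in which a drop below PTH counts as the end only if it persists for a set number of frames, can be sketched as follows. This is one possible reading under stated assumptions: the function name is hypothetical, the "set number of frames" is taken to be the same j used for the trailing frames, and the return value pairs the first frame of the final low-power run with the frame at which the run has lasted j frames.

```python
def detect_speech_end(powers, pth, start, j):
    """Return (end_frame, stop_frame), or None if no end is found.

    powers: per-frame power values.
    pth:    power threshold PTH.
    start:  frame index at which recognition started.
    j:      number of consecutive low-power frames required (j >= 1).

    end_frame  is the first frame of the low-power run that ends the speech.
    stop_frame is the frame at which that run has lasted j frames, where
               recognition processing would stop.
    """
    low_run = 0  # consecutive frames below PTH seen so far
    for idx in range(start, len(powers)):
        if powers[idx] < pth:
            low_run += 1
            if low_run >= j:
                return idx - j + 1, idx
        else:
            low_run = 0  # short pause inside the utterance: keep recognizing
    return None
```

With `powers = [12, 12, 3, 12, 12, 3, 3, 3, 3]`, `pth=10`, and `j=3`, the single low frame at index 2 is treated as a pause inside the utterance, and the run that begins at frame 5 ends the speech section.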

【００３０】図３（ａ）は音声入力波形図、図３（ｂ）
は音声パワーを示す特性図である。図３（ａ）、（ｂ）
において、横軸は時間ｔを示し、図３（ａ）の縦軸は電
圧Ｖを、図３（ｂ）の縦軸はパワーＰを示す。図３
（ａ）は、図１で説明したカーナビゲーションシステム
に適用した音声認識システムにおいて、話者が車内で発
声した音声「渋谷（しぶや）ｓｈｉｂｕｙａ」の音声入
力波形と音声のパワーを示しており、このときの周囲の
環境は、かなり静かなパーキングに車を止めてアイドリ
ング状態にしている乗用車の車内であり、窓はすべて閉
められており、ラジオやカーステレオもオフされ、エア
コンの出力も低い値に設定されている。図３（ａ）の音
声入力波形３０１はこのような環境下における波形を示
している。図２で説明した環境下、すなわち比較的静か
な一般道路を、すべての窓は閉められ、ラジオやカース
テレオはオフされ、エアコンも低い値に押さえられてい
る状態で４０ｋｍ／ｈで走行している乗用車の車内とい
う環境下の音声のパワーレベルに比べて、図３（a）に
示す音声のパワーレベルは低くなる。この現象は、話者
の周囲の雑音のパワーレベルが低くなり、話者自身の音
声が小さくてもよく聞こえることから、音声のパワーレ
ベルが低くなるためである。よって、雑音のパワーレベ
ルが低くなり、音声のパワーレベルも低くなる環境下
で、正しい音声区間検出を行うためには、ノイズしきい
値ＮＴＨ、パワーしきい値ＰＴＨを下げる必要がある。
さらに、ＰＴＨとＮＴＨの差ΔＰＴＨも小さくなる。FIG. 3(a) is a speech input waveform diagram, and FIG. 3(b) is a characteristic diagram showing speech power. In FIGS. 3(a) and 3(b), the horizontal axis represents time t; the vertical axis represents voltage V in FIG. 3(a) and power P in FIG. 3(b). FIG. 3(a) relates to the utterance "Shibuya" (shibuya) spoken inside a car, with the speech recognition system applied to the car navigation system described with FIG. 1. The surrounding environment is the interior of a passenger car parked and idling in a very quiet parking area, with all windows closed, the radio and car stereo off, and the air conditioner set to a low output. The speech input waveform 301 in FIG. 3(a) is the waveform in this environment. Compared with the environment described for FIG. 2, that is, the interior of a passenger car traveling at 40 km/h on a relatively quiet ordinary road with all windows closed, the radio and car stereo off, and the air conditioner kept low, the speech power level shown in FIG. 3(a) is lower. This is because the noise around the speaker is lower, so the speaker can hear his or her own voice well even when speaking softly, and therefore speaks at a lower power level. Hence, to detect the speech section correctly in an environment where both the noise power level and the speech power level are lower, the noise threshold NTH and the power threshold PTH must be lowered. The difference ΔPTH between PTH and NTH also becomes smaller.

【００３１】この音声信号を２０ｍｓの短時間で切り出
して見ると、定常信号と同様なスペクトル音声分析がで
きる。切り出された音声信号のサンプル値から、例え
ば、音声分析で広く用いられているＬＰＣ分析におい
て、自己相関関数を計算すると、音声の特徴パラメータ
の一つとして、音声のパワーが求められる。When this speech signal is cut into short 20 ms segments, each segment can be spectrally analyzed in the same way as a stationary signal. Computing the autocorrelation function from the sample values of a segment, as is done for example in LPC analysis, which is widely used in speech analysis, yields the speech power as one of the feature parameters of the speech.

【００３２】図３（ｂ）において、３０２は音声波形３
０１の音声信号から計算されたパワーＰを示す曲線であ
り、時間ｔに対するパワーの変化を表わしている。ここ
で、この「ｓｈｉｂｕｙａ」の音声に対して音声認識を
正しく行うためには、音声区間を検出をする必要があ
る。そのためには、このパワー情報に対して、しきい値
を任意に設定し、入力された雑音および音声毎にこのし
きい値を超えたかどうかを観測する。この観測は、音声
区間検出部１０３で行う。そこで、雑音および音声を常
に入力して前述の観測を行ってもよいが、特に、カーナ
ビゲーションシステムに代表されるような複数の処理を
実行しているシステムではできるだけＣＰＵの負荷を軽
減させて低消費電力化したい。よって、音声認識すると
きだけ雑音や音声を取り込むために、音声入力用ボタン
を押した時点から雑音や音声が入力されるものとする。
３０４が音声入力用ボタンが押された時点である。In FIG. 3(b), 302 is a curve showing the power P calculated from the speech signal of waveform 301 and represents the change in power over time t. To recognize this "shibuya" utterance correctly, the speech section must be detected. To do so, a threshold is set as desired on this power information, and each input of noise and speech is checked against the threshold. This check is performed by the speech section detection unit 103. Noise and speech could be input continuously for this check, but a system that runs several processes at once, such as a car navigation system, should keep the CPU load as low as possible to save power. Noise and speech are therefore captured only while speech recognition is being performed, starting from the moment the speech input button is pressed. 304 is the moment the speech input button is pressed.

【００３３】ＮＴＨは雑音と音声を区別する第１段階の
しきい値であり、ノイズしきい値と称する。この音声の
始まり近傍は、雑音と音声とが同じパワーレベルであ
り、音声なのか雑音なのか区別が困難である。そこで、
ノイズしきい値ＮＴＨを超えたら音声の始まりの可能性
が高いと判断し、時点３０５で示されるフレーム位置を
記憶しておく。つぎに、ＰＴＨは音声であることを見極
めるための第２段階のしきい値であり、パワーしきい値
と称する。ここでは、かなりのパワーレベルを検出する
ことから、音声であることがわかる。その時点が３０６
である。したがって、音声であることを検出した時点３
０６において、時点３０５で記憶したフレーム位置から
認識処理を開始する。あるいは、記憶したフレームより
もkフレーム前のフレームから認識処理を開始する。こ
れにより、雑音に埋もれた音声の始まりを検出して正し
い認識が可能となる。また、音声には必ず文字と文字の
間に無音声部分が存在する。例えば「渋谷」では、「ｓ
ｈｉ」と「ｂｕ」、「ｂｕ」と「ｙａ」の間に無音声部
分が存在する。このとき、音声のパワーレベルは、パワ
ーしきい値ＰＴＨよりも低くなる。ところが音声はまだ
終了していないことから、音声が終了したと判断しては
誤りである。よって、音声のパワーレベルがパワーしき
い値ＰＴＨより低くなっても、その期間がある設定フレ
ーム数未満ならば、まだ音声が終了していないとして認
識処理を継続する。逆に、その期間がある設定フレーム
数以上ならば、音声が終了したと判断して認識処理を終
了する。図３（ｂ）において、時点３０７が音声が終了
したと判断された時点であり、認識処理はその時点から
ｊフレーム後に終了するものとする。NTH is the first-stage threshold for distinguishing noise from speech and is called the noise threshold. Near the beginning of an utterance, noise and speech have about the same power level, so it is hard to tell which is which. Therefore, when the power exceeds the noise threshold NTH, the beginning of speech is judged likely, and the frame position indicated by time point 305 is stored. Next, PTH is the second-stage threshold for confirming that the input is speech and is called the power threshold. Because a substantial power level is detected here, the input is known to be speech. That point is 306. Accordingly, at time point 306, when speech has been confirmed, recognition processing is started from the frame position stored at time point 305. Alternatively, recognition processing is started from a frame k frames before the stored frame. This makes it possible to detect the beginning of speech buried in noise and recognize it correctly. In addition, speech always contains silent portions between characters. In "Shibuya", for example, there are silent portions between "shi" and "bu" and between "bu" and "ya". In these portions the speech power level falls below the power threshold PTH. Since the utterance has not yet ended, it would be a mistake to judge that it has. Therefore, even when the speech power level falls below PTH, recognition processing continues, on the assumption that the utterance has not ended, as long as the low-power period is shorter than a set number of frames. Conversely, if the period lasts for the set number of frames or more, the utterance is judged to have ended and recognition processing ends. In FIG. 3(b), time point 307 is the point at which the utterance is judged to have ended, and recognition processing ends j frames after that point.

【００３４】前述のようにすることによって、音声の始
まりと同様に、雑音に埋もれた音声の終わりを検出して
正しい認識が可能となる。このとき、ノイズしきい値Ｎ
ＴＨは、パワーしきい値ＰＴＨを超えないものとする。
以上のようにして、正しい音声認識に必要な音声区間検
出である音声区間３０３が検出される。In this way, as with the beginning of speech, the end of speech buried in noise can be detected and recognized correctly. Here, the noise threshold NTH must not exceed the power threshold PTH. The speech section 303 needed for correct speech recognition is thus detected.

【００３５】図３（ｃ）は音声入力波形図、図３（ｄ）
は音声パワーを示す特性図である。図３（ｃ）、（ｄ）
において、横軸は時間ｔを示し、図３（ｃ）の縦軸は電
圧Ｖを、図２（ｄ）の縦軸はパワーＰを示す。図３
（ｃ）、（ｄ）は、図１で説明したカーナビゲーション
システムに適用した音声認識システムにおいて、話者が
車内で発声した音声「渋谷（しぶや）ｓｈｉｂｕｙａ」
の音声入力波形と音声のパワーを示したものである。図
３（ｃ）に示す波形３１１は、すべての窓は閉められ、
ラジオやカーステレオもオフされ、エアコンの出力も低
い値に設定されているにも関わらず、高速道路を１００
ｋｍ／ｈで走行しているために車内にはかなりの雑音が
あり、かなりうるさい車内環境状況にある時の音声入力
波形を示している。FIG. 3(c) is a speech input waveform diagram, and FIG. 3(d) is a characteristic diagram showing speech power. In FIGS. 3(c) and 3(d), the horizontal axis represents time t; the vertical axis represents voltage V in FIG. 3(c) and power P in FIG. 3(d). FIGS. 3(c) and 3(d) show the speech input waveform and the speech power for the utterance "Shibuya" (shibuya) spoken inside a car, with the speech recognition system applied to the car navigation system described with FIG. 1. Waveform 311 in FIG. 3(c) is the speech input waveform in a very noisy cabin: although all windows are closed, the radio and car stereo are off, and the air conditioner is set to a low output, the car is traveling on an expressway at 100 km/h and there is considerable noise inside.

【００３６】図３（ｃ）に示す音声入力波形３１１の音
声パワーレベルは図３（ａ）に示す音声入力波形３０１
の音声パワーレベルに比べて、かなり高くなっている。
この現象は、図３（ｃ）の場合の話者の周囲の雑音のパ
ワーレベルが高く、話者自身の音声がよく聞こえず、大
きな声で発生することから、音声のパワーレベルが高く
なるためである。雑音のパワーレベルが高くなり、音声
のパワーレベルも高くなる環境下で、正しい音声区間検
出を行うためには、ノイズしきい値ＮＴＨ、パワーしき
い値ＰＴＨを上げる必要がある。さらに、ＰＴＨとＮＴ
Ｈの差ΔＰＴＨも大きくなる。The speech power level of the speech input waveform 311 in FIG. 3(c) is considerably higher than that of the speech input waveform 301 in FIG. 3(a). This is because, in the case of FIG. 3(c), the noise around the speaker is high, so the speaker cannot hear his or her own voice well and speaks loudly, which raises the speech power level. To detect the speech section correctly in an environment where both the noise power level and the speech power level are higher, the noise threshold NTH and the power threshold PTH must be raised. The difference ΔPTH between PTH and NTH also becomes larger.

【００３７】この音声信号を２０ｍｓの短時間で切り出
して見ると、定常信号と同様なスペクトル音声分析がで
きる。切り出された音声信号のサンプル値から、例え
ば、音声分析で広く用いられているＬＰＣ分析におい
て、自己相関関数を計算すると、音声の特徴パラメータ
の一つとして、音声のパワーが求められる。When this speech signal is cut into short 20 ms segments, each segment can be spectrally analyzed in the same way as a stationary signal. Computing the autocorrelation function from the sample values of a segment, as is done for example in LPC analysis, which is widely used in speech analysis, yields the speech power as one of the feature parameters of the speech.

【００３８】図３（ｄ）に示す曲線３１２は、音声波形
３１１の音声信号から計算されたパワーＰをしめし、時
間ｔにおけるパワーの変化を表わしている。ここで、こ
の「ｓｈｉｂｕｙａ」の音声に対して音声認識を正しく
行うためには、音声区間検出をする必要がある。そのた
めには、このパワー情報に対して、しきい値を任意に設
定し、入力された雑音および音声毎にこのしきい値を超
えたかどうかを観測する。この観測は、音声区間検出部
１０３で行う。そこで、雑音および音声を常に入力して
以上の観測を行ってもよいが、特に、カーナビゲーショ
ンシステムに代表されるような複数の処理を実行してい
るシステムではできるだけＣＰＵの負荷を軽減させて低
消費電力化したい。よって、音声認識するときだけ雑音
や音声を取り込むために、音声入力用ボタンを押した時
点から雑音や音声が入力されるものとする。３１４は音
声入力用ボタンが押された時点をしめす。Curve 312 in FIG. 3(d) shows the power P calculated from the speech signal of waveform 311 and represents the change in power over time t. To recognize this "shibuya" utterance correctly, the speech section must be detected. To do so, a threshold is set as desired on this power information, and each input of noise and speech is checked against the threshold. This check is performed by the speech section detection unit 103. Noise and speech could be input continuously for this check, but a system that runs several processes at once, such as a car navigation system, should keep the CPU load as low as possible to save power. Noise and speech are therefore captured only while speech recognition is being performed, starting from the moment the speech input button is pressed. 314 indicates the moment the speech input button is pressed.

【００３９】図（ｄ）において、ＮＴＨは雑音と音声を
区別する第１段階のしきい値であり、ノイズしきい値と
称する。この音声の始まりの近傍は、雑音と音声とが同
じパワーレベルにあるため、音声なのか雑音なのか区別
が困難である。そこで、ノイズしきい値ＮＴＨを超えた
ら音声の始まりの可能性が高いと判断し、時点３１５で
示されるフレーム位置を記憶しておく。つぎに、ＰＴＨ
は音声であることを見極めるための第２段階のしきい値
であり、パワーしきい値と称する。ここでは、かなりの
パワーレベルを検出することから、音声であることがわ
かる。その時が時点３１６である。したがって、音声で
あることを検出した時点３１６において、時点３１５で
記憶したフレーム位置から認識処理を開始する。あるい
は、記憶したフレームよりもkフレーム前のフレームか
ら認識処理を開始する。これにより、雑音に埋もれた音
声の始まりを検出して正しい認識が可能となる。In FIG. 3(d), NTH is the first-stage threshold for distinguishing noise from speech and is called the noise threshold. Near the beginning of an utterance, noise and speech are at about the same power level, so it is hard to tell which is which. Therefore, when the power exceeds the noise threshold NTH, the beginning of speech is judged likely, and the frame position indicated by time point 315 is stored. Next, PTH is the second-stage threshold for confirming that the input is speech and is called the power threshold. Because a substantial power level is detected here, the input is known to be speech. That point is time point 316. Accordingly, at time point 316, when speech has been confirmed, recognition processing is started from the frame position stored at time point 315. Alternatively, recognition processing is started from a frame k frames before the stored frame. This makes it possible to detect the beginning of speech buried in noise and recognize it correctly.

【００４０】また、音声には必ず文字と文字の間に無音
声部分が存在する。例えば「渋谷」では、「ｓｈｉ」と
「ｂｕ」、「ｂｕ」と「ｙａ」の間に無音声部分が存在
する。この場合、音声のパワーレベルは、パワーしきい
値ＰＴＨよりも低くなる。ところが音声はまだ終了して
いないことから、音声が終了したと判断しては誤りであ
る。よって、音声のパワーレベルがパワーしきい値ＰＴ
Ｈより低くなっても、その期間がある設定フレーム数未
満ならば、まだ音声が終了していないとして認識処理を
継続する。逆に、その期間がある設定フレーム数以上な
らば、音声が終了したと判断して認識処理を終了する。
図３（ｄ）において、３１７は音声が終了したと判断し
た時点を示しており、認識処理はその時点からｊフレー
ム後に終了するものとする。これにより、音声の始まり
と同様に、雑音に埋もれた音声の終わりを検出して正し
い認識が可能となる。このとき、ノイズしきい値ＮＴＨ
は、パワーしきい値ＰＴＨを超えないものとする。以上
から、正しい音声認識に必要な音声区間検出である音声
区間３１３が検出される。In addition, speech always contains silent portions between characters. In "Shibuya", for example, there are silent portions between "shi" and "bu" and between "bu" and "ya". In these portions the speech power level falls below the power threshold PTH. Since the utterance has not yet ended, it would be a mistake to judge that it has. Therefore, even when the speech power level falls below PTH, recognition processing continues, on the assumption that the utterance has not ended, as long as the low-power period is shorter than a set number of frames. Conversely, if the period lasts for the set number of frames or more, the utterance is judged to have ended and recognition processing ends. In FIG. 3(d), 317 indicates the point at which the utterance is judged to have ended, and recognition processing ends j frames after that point. As with the beginning of speech, this makes it possible to detect the end of speech buried in noise and recognize it correctly. Here, the noise threshold NTH must not exceed the power threshold PTH. The speech section 313 needed for correct speech recognition is thus detected.

【００４１】しかしながら、今仮に、ノイズしきい値Ｎ
ＴＨが図３（ｄ）において、ＮＴＨ1の位置に設定され
たとする。本来なら、正しい音声区間は３１３で示され
る区間であり、ノイズしきい値ＮＴＨとパワーレベルの
交点３１５を検出しなければならない。ところが、ノイ
ズしきい値ＮＴＨ1とパワーレベルの交点は存在せず、
音声入力用ボタンが押された３１４の時点ですでに、雑
音のパワーレベルがＮＴＨ1を超えていることから、ボ
タンの押された直後の時点３１９から音声の始まりと判
断して、音声区間は３１８となり、誤った音声区間を検
出するため、認識結果も誤認識となる。以上のことから
明らかなように、ノイズしきい値ＮＴＨ、パワーしきい
値ＰＴＨを、実際に使用する環境に合わせて、それも、
時間的に短いサイクル（例えば、３秒間隔）で最適な値
に設定、更新していく必要がある。特に、カーナビゲー
ションシステム、カーエレクトロニクス製品、ＰＤＡ、
ハンドヘルドＰＣ等の使用する環境では、雑音レベルが
短い時間の間隔で相当変動する。Suppose, however, that the noise threshold NTH were set at the position NTH1 in FIG. 3(d). The correct speech section is the section 313, and the intersection 315 of the noise threshold NTH and the power level ought to be detected. But there is no intersection between the noise threshold NTH1 and the power level: the noise power level already exceeds NTH1 at time 314, when the speech input button is pressed. The beginning of speech is therefore judged to be time point 319, immediately after the button press, the speech section becomes 318, and because the wrong speech section is detected, the recognition result is also wrong. As this shows, the noise threshold NTH and the power threshold PTH must be set to optimum values matched to the actual usage environment and updated in short cycles (for example, every 3 seconds). In the environments where car navigation systems, car electronics products, PDAs, handheld PCs, and the like are used, the noise level in particular varies considerably over short intervals.
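
One way to keep the thresholds tracking the current noise level is to recompute them from a sliding window of the most recent frame powers. The sketch below is only an illustration under stated assumptions: the class name, the window mechanism, and the use of the (Equation 1) and (Equation 2) form with N1 = 5 and P1 = 100,000 as defaults are choices made here, and the text does not specify how the periodic update is implemented.

```python
from collections import deque


class NoiseThresholdTracker:
    """Keeps the last `window_frames` noise powers and derives NTH and PTH."""

    def __init__(self, window_frames=32, n1=5, p1=100_000):
        self._recent = deque(maxlen=window_frames)  # oldest frames drop out
        self.n1 = n1  # safety factor N1
        self.p1 = p1  # offset P1 between NTH and PTH

    def add_frame(self, power):
        """Record the power of one newly captured noise frame."""
        self._recent.append(power)

    def thresholds(self):
        """Return (NTH, PTH) for the current window, or None if it is empty."""
        if not self._recent:
            return None
        mean_noise = sum(self._recent) / len(self._recent)
        nth = mean_noise * self.n1
        pth = nth + self.p1
        return nth, pth
```

Calling `add_frame` for every captured frame and reading `thresholds()` just before the speech input button is handled means that a rise in cabin noise, such as moving from an ordinary road to an expressway, raises NTH and PTH within one window length.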

【００４２】以下にノイズしきい値ＮＴＨ、パワーしき
い値ＰＴＨの計算式の１例を（数１）〜（数５）に示
す。なお、（数１）〜（数５）において、ＰＷは音声認
識モードになってから、音声入力用ボタンが押される直
前の１からｉフレーム間の入力雑音パワーの総和を示
し、ＰＷ／ｉは１フレームからｉフレーム間の入力雑音
パワーの平均値を示す。また、Ｎ１、Ｎ２、Ｎ３は安全
率であり、実験によって定める正の整数である。本発明
の実施例においては、Ｎ１、Ｎ２を５に定め、Ｎ６は１
０に定めている。このＮ１〜Ｎ３の値は音声入力用ボタ
ンを押す前のｉフレーム（例えば、３２フレーム）の平
均のパワー値によって変えてもよい。Ｐ１はノイズの状
況によって変わる値であり、ノイズが一定の場合には予
め定められた一定値を取る。例えば、音声入力用ボタン
を押す前のｉフレーム（例えば３２フレーム）の平均的
なパワー値をみて、ノイズ値が大きい場合にはＰ１は大
きく設定され、ノイズ値が小さい時にはＰ１は小さく設
定される。ＮＴＨ、ＰＴＨは正規化された値を取ること
から、本実施例においてはＰ１＝１００，０００であ
る。Ｐ２はＰ１と同様ノイズ値によって左右されるが、
実験によって定める。An example of formulas for calculating the noise threshold NTH and the power threshold PTH is given in (Equation 1) to (Equation 5). In these equations, PW is the sum of the input noise power over frames 1 to i immediately before the speech input button is pressed, after the system has entered speech recognition mode, and PW/i is the average input noise power over those i frames. N1, N2, and N3 are safety factors, positive integers determined by experiment. In this embodiment, N1 and N2 are set to 5 and N3 is set to 10. The values of N1 to N3 may be varied according to the average power over the i frames (for example, 32 frames) before the speech input button is pressed. P1 is a value that depends on the noise conditions and takes a predetermined constant value when the noise is constant. For example, based on the average power over the i frames (for example, 32 frames) before the button is pressed, P1 is set large when the noise is large and small when the noise is small. Since NTH and PTH take normalized values, P1 = 100,000 in this embodiment. P2, like P1, depends on the noise level, but is determined by experiment.

【００４３】ＮＴＨ＝（ＰＷ／ｉ）×Ｎ１ …（数１）ＰＴＨ＝ＮＴＨ＋Ｐ１ …（数２）あるいは、ＮＴＨ＝（ＰＷ／ｉ）×Ｎ２ …（数３）ＰＴＨ＝（ＰＷ／ｉ）×Ｎ３ …（数４）ただし、もし、ＰＴＨ＜Ｐ２、ならばＰＴＨ＝Ｐ２ …（数５）次に、図４を用いて本発明による音声認識システムおよ
び方法に係るハードウエア構成について説明する。図４
は本発明による音声認識システムの一実施例を示すブロ
ック図である。音声を取り込むためのマイク４０１とし
ては、カーナビゲーションシステム、携帯型情報端末、
ＰＤＡ、ハンドヘルドＰＣ、ゲーム、携帯型翻訳機、並
びに、エアコン等の家庭電化製品等では、周囲の雑音を
取り込まないために指向性をもたせた指向性マイクが用
いられる。４０４は、マイク４０１により取り込まれた
アナログ音声データを、デジタル音声データに変換する
Ａ／Ｄ変換器である。音声入力用ボタン４０２は、音声
を入力している区間を指定するためのボタンである。ボ
タンが押されている間、あるいは、ボタンが押された時
点から音声が入力されたことをシステムに知らせる。４
０５は、音声入力用ボタン４０２と、システムを接続す
るためのインタフェースである。
NTH = (PW / i) × N1   (Equation 1)
PTH = NTH + P1   (Equation 2)
or alternatively,
NTH = (PW / i) × N2   (Equation 3)
PTH = (PW / i) × N3   (Equation 4)
where, if PTH < P2, then PTH = P2   (Equation 5)
Next, the hardware configuration of the speech recognition system and method according to the present invention is described with reference to FIG. 4. FIG. 4 is a block diagram showing an embodiment of the speech recognition system according to the present invention. For the microphone 401 that captures speech, car navigation systems, portable information terminals, PDAs, handheld PCs, games, portable translators, and home appliances such as air conditioners use a directional microphone so as not to pick up ambient noise. 404 is an A/D converter that converts the analog speech data captured by the microphone 401 into digital speech data. The speech input button 402 is a button for specifying the interval during which speech is being input. It notifies the system that speech is being input while the button is held down, or from the moment the button is pressed. 405 is an interface that connects the speech input button 402 to the system.
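
The threshold formulas (Equation 1) to (Equation 5) given at the start of this paragraph can be written out as below. The function names are hypothetical. The defaults N1 = N2 = 5, N3 = 10, and P1 = 100,000 are the embodiment's values from the description; P2 is said only to be determined by experiment, so it is left as a required argument here.

```python
def thresholds_additive(noise_powers, n1=5, p1=100_000):
    """NTH and PTH from (Equation 1) and (Equation 2).

    noise_powers: powers of the i frames before the button press (non-empty).
    """
    pw = sum(noise_powers)   # PW: total noise power over the i frames
    i = len(noise_powers)
    nth = (pw / i) * n1      # (Equation 1)
    pth = nth + p1           # (Equation 2)
    return nth, pth


def thresholds_scaled(noise_powers, p2, n2=5, n3=10):
    """NTH and PTH from (Equation 3) to (Equation 5); p2 is the floor on PTH."""
    pw = sum(noise_powers)
    i = len(noise_powers)
    nth = (pw / i) * n2      # (Equation 3)
    pth = (pw / i) * n3      # (Equation 4)
    if pth < p2:             # (Equation 5): PTH is never allowed below P2
        pth = p2
    return nth, pth
```

In the first variant the gap between PTH and NTH is fixed at P1. In the second, both thresholds scale with the measured noise, which matches the earlier observation that ΔPTH grows in noisy conditions, while P2 keeps PTH from collapsing in very quiet ones.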

【００４４】キー入力用デバイス４０９は、例えば、携
帯型情報端末であれば、ペン入力用のデジタイザであ
り、ハンドヘルドＰＣであれば、キーボードである。ま
た、ファミコンなどのゲーム機であれば、キャラクタ等
を操作するキーパッドや、ジョイスティックである。４
１０は、キー入力用デバイス４０９と、システムを接続
するためのインタフェースである。ＣＰＵ４０３は、カ
ーナビゲーションシステム、携帯型情報端末、ＰＤＡ、
ハンドヘルドＰＣ、ゲーム、携帯型翻訳機、並びに、家
庭電化製品等のメインシステムの制御と、音声認識シス
テムにおける音声認識処理を行う。このＣＰＵ４０３に
は、ＲＩＳＣマイコンやＤＳＰが用いられるのが、最近
の潮流である。ＲＯＭ４０６は、音声認識用単語辞書、
音響モデル、プログラムを格納しておく記憶装置であ
る。また、複数の辞書や、音響モデルを格納しておくた
めに、メモリカードを用いてもよい。The key input device 409 is, for example, a digitizer for pen input in a portable information terminal, or a keyboard in a handheld PC. In a game console such as the Famicom, it is a keypad or joystick for controlling characters and the like. 410 is an interface that connects the key input device 409 to the system. The CPU 403 controls the main system of the car navigation system, portable information terminal, PDA, handheld PC, game, portable translator, home appliance, or the like, and performs the speech recognition processing of the speech recognition system. The recent trend is to use a RISC microcomputer or a DSP as the CPU 403. The ROM 406 is a storage device that holds the word dictionaries for speech recognition, the acoustic models, and the programs. A memory card may be used to hold multiple dictionaries and acoustic models.

[0045] The RAM 407 stores the portions of the dictionaries, acoustic models, and programs transferred from the ROM 406, and also serves as the minimum work memory needed for speech recognition processing; a semiconductor device with a shorter access time than the ROM 406 is normally used. The bus 408 serves as the data bus, address bus, and control signal bus of the system. The display 412 for outputting the speech recognition result is an LCD such as a TFT liquid crystal display, and displays the speech recognition result. Reference numeral 411 denotes an interface for connecting the display 412 to the system. The speaker 414 outputs the speech recognition result as synthesized speech. Reference numeral 413 denotes an A/D converter that, after the speech recognition result has been converted from text into speech synthesis data, converts the digital speech synthesis data into an analog speech signal.

[0046] An embodiment of the speech recognition system and method according to the present invention is described below with reference to FIGS. 5 and 6. This embodiment describes the case where the speech recognition system of the present invention is applied to car navigation systems, car multimedia, and car electronics products.

[0047] FIG. 5 is a flowchart illustrating one embodiment of the operation flow of speech section detection used in the speech recognition system and method according to the present invention. In the figure, step 501 is the start, indicating that the car navigation system has been started. Once the car navigation system has started in step 501, the process moves to step 502, where the speech recognition system of the car navigation system is activated; this corresponds, for example, to the state after the remote control has been operated to switch to the speech recognition mode. Once in the speech recognition mode, the process moves to step 503, where calculation of the power of the noise and speech input from the microphone begins, frame by frame. For example, once 32 frames have been processed, NTH and PTH are calculated according to the expressions for the noise threshold NTH and the power threshold PTH given in (Equation 1) to (Equation 5). Thereafter, for example at every frame, the power value of the oldest frame is replaced by that of the newest frame, and NTH and PTH are recalculated and updated. How often this update is performed depends on the system: for example once per second, once every 2 seconds, or once every 3 seconds.
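The sliding-window update of step 503 can be sketched as follows. This is only an illustration under stated assumptions: (Equation 1) to (Equation 5) are not reproduced in this passage, so the sketch assumes the forms suggested by the claims, namely NTH as N times the average frame power and PTH as NTH plus a positive margin. The class name `RollingThresholds`, the helper `frame_power`, and the default values of `n_factor` and `delta_pth` are hypothetical.

```python
from collections import deque


class RollingThresholds:
    """Keep the power of the most recent frames and derive NTH and PTH.

    Assumed forms (the patent's Equations 1 to 5 are not reproduced here):
        NTH = n_factor * mean(window power)
        PTH = NTH + delta_pth
    """

    def __init__(self, window=32, n_factor=2.0, delta_pth=5.0):
        self.powers = deque(maxlen=window)  # oldest frame drops out automatically
        self.n_factor = n_factor            # hypothetical value of N
        self.delta_pth = delta_pth          # hypothetical gap PTH - NTH
        self.nth = None
        self.pth = None

    def push(self, power):
        """Add one frame's power; recompute thresholds once the window is full."""
        self.powers.append(power)
        if len(self.powers) == self.powers.maxlen:
            mean_power = sum(self.powers) / len(self.powers)
            self.nth = self.n_factor * mean_power
            self.pth = self.nth + self.delta_pth
        return self.nth, self.pth


def frame_power(samples):
    """Mean squared amplitude of one frame (one common definition of power)."""
    return sum(s * s for s in samples) / len(samples)
```

Until the window is full, `push` returns `(None, None)`; after that, each new frame displaces the oldest one and both thresholds are refreshed. A system that updates less often than every frame could simply call the recomputation on a timer instead.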

[0048] Next, in step 504, the speaker presses the voice input button 402 in order to perform speech recognition. At this point, in step 505, the latest values of NTH and PTH calculated in step 503 are fixed as the NTH and PTH actually used for speech section detection. In step 506, speech section detection is performed using the NTH and PTH fixed in step 505. Step 507 indicates that the speech section detection and speech recognition processing have finished.
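A minimal sketch of how steps 503 to 505 might interact: thresholds are refreshed in the background, and the button press freezes the latest pair for the utterance that follows. The class `ThresholdLatch` and its method names are hypothetical, and the sketch assumes that updates arriving after the press do not alter the frozen values, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ThresholdLatch:
    """Tracks continuously updated thresholds and freezes them on a button press."""

    live: Optional[Tuple[float, float]] = None     # latest (NTH, PTH) from step 503
    latched: Optional[Tuple[float, float]] = None  # values fixed in step 505

    def update(self, nth: float, pth: float) -> None:
        """Step 503: record the most recently computed thresholds."""
        self.live = (nth, pth)

    def press_button(self) -> Tuple[float, float]:
        """Steps 504 and 505: fix the latest thresholds for this utterance."""
        if self.live is None:
            raise RuntimeError("no thresholds have been computed yet")
        self.latched = self.live
        return self.latched
```

Step 506 would then read `latched` rather than `live`, so the detector sees a stable pair of thresholds for the whole utterance.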

[0049] The speech section detection of step 506 is described in detail with reference to FIG. 6. FIG. 6 is a flowchart illustrating one embodiment of the operation flow of the speech section detection described in FIG. 5. Speech section detection starts in step 601. In step 602, the power PW of each frame of input noise or speech is compared with the noise threshold NTH. If the result of the test NTH < PW is NO, the frame is judged not to be the start of speech, and the same processing is applied to the next incoming frame. If the result is YES, the frame is judged to be the start of speech and the process proceeds to step 604.

[0050] In step 604, the power PW is further compared with the power threshold PTH. If the result of the test PTH < PW is NO, the process moves to step 608, where the counter (CNT) is incremented by 1, and then proceeds to step 606. If the result is YES, the counter is reset to CNT = 0 in step 609 and the process proceeds to the speech recognition processing of step 605, where the recognition flag (RF) is set to RF = 1 and speech recognition processing such as speech analysis and speech matching is performed on that frame. If the result of step 604 is NO and the recognition processing of step 605 has not yet been executed even once, the test CNT < n & RF = 1 in step 606 is NO, so RF = 0 and CNT = 0 are set in step 603 and the per-frame speech section detection ends in step 607. In step 606, the n in CNT < n is the number of frames over which speech may break off between sounds, for example between "shi" and "bu" or between "bu" and "ya". Accordingly, a NO result for CNT < n means that the counter value is n or more, that is, that the speech has ended: no speech has arrived for a predetermined number of frames or more, for example 30 frames. The recognition flag RF takes the value 0 or 1: 1 while speech recognition processing is in progress, and 0 otherwise.

[0051] If the recognition processing of step 605 is in progress and the result of step 604 is NO, then RF = 1; if the counter value is smaller than n, the test CNT < n & RF = 1 is YES, the speech is judged not to have ended, and the process proceeds to the speech recognition processing of step 605, where recognition is performed. If, on the other hand, the recognition processing of step 605 is in progress, the result of step 604 is NO, and the counter value is n or more, the test CNT < n & RF = 1 is NO, the speech is judged to have ended, RF = 0 and CNT = 0 are set in step 603, and the per-frame speech section detection ends in step 607. Speech section detection is performed by the above operations.
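The per-frame decision of steps 602 to 609 can be sketched as a small state machine. This is an illustrative reading of the description, not the flowchart itself: the names `DetectorState` and `process_frame` are hypothetical, and a frame whose power does not exceed NTH is simply skipped with the state left unchanged, since the text only says that such a frame is not a speech start and that the next frame is then examined.

```python
from dataclasses import dataclass


@dataclass
class DetectorState:
    rf: int = 0   # recognition flag RF: 1 while recognition is in progress
    cnt: int = 0  # counter CNT: consecutive frames with NTH < PW but not PTH < PW


def process_frame(state, pw, nth, pth, n):
    """Apply steps 602 to 609 to one frame; return True if the frame is recognized."""
    if not nth < pw:                      # step 602 NO: not a speech start
        return False                      # wait for the next frame
    if pth < pw:                          # step 604 YES
        state.cnt = 0                     # step 609
        state.rf = 1                      # step 605: recognize this frame
        return True
    state.cnt += 1                        # step 608
    if state.cnt < n and state.rf == 1:   # step 606 YES: short gap inside speech
        return True                       # step 605: keep recognizing
    state.rf = 0                          # step 603: speech ended or never started
    state.cnt = 0
    return False                          # step 607
```

For example, with NTH = 10, PTH = 20 and n = 3, the power sequence 5, 15, 25, 15, 15, 25, 15, 15, 15, 15 yields recognition on the third through eighth frames: the two-frame dips below PTH are bridged as gaps within speech, and the third consecutive low frame ends the speech section.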

【００５２】[0052]

[Effect of the invention] According to the present invention, in a speech recognition system used in car navigation systems, small information systems, and games, the thresholds for speech section detection are set automatically according to the noise level in the actual use environment. This makes it possible to realize a good speech recognition system in which neither the speech section detection based on automatic threshold setting nor the recognition performance degrades in a real environment.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the functions of the speech recognition system according to the present invention and the flow of its processing.

FIG. 2 is a characteristic diagram showing a speech input waveform and speech power.

FIG. 3 is a characteristic diagram showing a speech input waveform and speech power.

FIG. 4 is a block diagram showing the hardware configuration of the speech recognition system according to the present invention.

FIG. 5 is a flowchart illustrating one embodiment of the speech section detection operation used in the speech recognition system and method according to the present invention.

FIG. 6 is a flowchart illustrating one embodiment of the operation flow of the speech section detection shown in FIG. 5.

FIG. 7 is a block diagram of a portable translation device using a conventional speech recognition system.

[Explanation of symbols]

101: speech input microphone; 102: A/D converter; 103: speech section detection unit; 104: speech analysis unit; 105: speech recognition unit; 106: acoustic model; 107: word dictionary; 108: connection unit between the acoustic model and the word dictionary; 201: speech input waveform; 202: speech power.

Continued from the front page. (72) Inventor: Hiroaki Kokubo, c/o Central Research Laboratory, Hitachi, Ltd., 1-280 Higashi-Koigakubo, Kokubunji-shi, Tokyo. F-terms (reference): 5B091 CB12 CD01 5D015 CC14 DD05 KK01

Claims

1. A speech recognition system that displays, or outputs as speech, content picked up as a speech recognition result from a dictionary in which words and sentences subject to speech recognition are collected, the system comprising a speech detection unit that detects a speech section based on information on a detected non-speech section, wherein speech recognition is performed on the detected speech section.

2. The speech recognition system according to claim 1, wherein a noise threshold is obtained from the power of the non-speech section.

3. The speech recognition system according to claim 2, wherein the noise threshold is compared with the power of the speech section, and speech recognition processing is started near the time at which the power reaches the noise threshold.

4. The speech recognition system according to claim 2, wherein the noise threshold is compared with the power of the speech section, and speech recognition processing is performed going back a predetermined time from the point at which the power of the speech section reaches the noise threshold.

5. The speech recognition system according to claim 2, wherein the noise threshold is obtained based on the average power of a set of a predetermined number of frames, a frame being the unit in which the power of speech or noise is analyzed.

6. A speech recognition system according to claim 2, wherein a power threshold is obtained based on said noise threshold.

7. The speech recognition system according to claim 6, wherein, when the power of the speech section exceeds the noise threshold and reaches the power threshold, it is determined that speech has begun, and speech recognition processing is performed from a time predetermined relative to that point.

8. The speech recognition system according to claim 6, further comprising a voice input button, wherein, when the power of the speech section reaches the noise threshold after the button is pressed, frames, each being a unit of speech analysis, are stored.

9. The speech recognition system according to claim 8, wherein, when the power of the speech section reaches the power threshold, speech recognition processing is performed at least from the stored frame.

10. The speech recognition system according to claim 6, wherein, when a period during which the power of the speech section falls below the power threshold is equal to or shorter than a predetermined time, the period is determined to be an unvoiced portion between sounds.

11. The speech recognition system according to claim 6, wherein, after the power of the speech section falls below the power threshold, the speech section is determined to have ended when the power remains below the power threshold for a predetermined time.

12. A speech recognition method that displays, or outputs as speech, content picked up as a speech recognition result from a dictionary in which words and sentences subject to speech recognition are collected, the method comprising the steps of: detecting a speech section based on information on a detected non-speech section; and performing speech recognition on the detected speech section.

13. The speech recognition method according to claim 12, further comprising the step of obtaining a noise threshold from the power of the non-speech section.

14. The speech recognition method according to claim 13, wherein the noise threshold is compared with the power of the speech section, and speech recognition processing is started near the time at which the power reaches the noise threshold.

15. The speech recognition method according to claim 12, further comprising the steps of: obtaining a noise threshold from the power of the non-speech section; and obtaining a power threshold based on the noise threshold.

16. The speech recognition method according to claim 15, wherein, when a period during which the power of the speech section falls below the power threshold is equal to or shorter than a predetermined time, the period is determined to be an unvoiced portion between sounds.

17. The speech recognition method according to claim 15, further comprising the step of determining that the speech section has ended when, after the power of the speech section falls below the power threshold, a predetermined time elapses with that state maintained.

18. A speech recognition system in which words and sentences subject to speech recognition are collected and defined as a dictionary, and a recognized word or sentence is picked up from the dictionary and output as a displayed character string, as an image represented by the word, or as synthesized speech, the system comprising: a speech section detection unit that detects a speech section by detecting information on non-speech sections, in order to distinguish, in the input speech, the speech section from the beginning to the end of the speech from non-speech sections that contain no speech; a speech analysis unit that performs speech analysis processing on the captured speech; an acoustic model holding speech patterns in phoneme units; and a speech recognition unit that performs speech recognition processing by connecting the acoustic model and the dictionary, wherein speech recognition is performed on the speech in the detected speech section.

19. The speech recognition system according to claim 18, wherein the speech section detection unit detects the power level of sound in the non-speech section, determines, in accordance with this constantly changing power level, a power level indicating the start and end of speech, judges the start of speech when the power level of the input speech exceeds the determined power level, and judges the end of speech when the power level of the speech remains below the determined power level for a predetermined time or more.

20. The speech recognition system according to claim 18, wherein the speech section detection unit detects the power level of sound in the non-speech section, calculates a noise threshold and a power threshold from that power level, and judges that speech has begun when the power level of the input speech exceeds the noise threshold and further exceeds the power threshold.

21. The speech recognition system according to claim 20, wherein the end of the speech is judged when the power level of the speech remains below the threshold for a predetermined time or more.

22. The speech recognition system according to claim 20, wherein the power level is the average power obtained, over a plurality of frames, from the power of frames delimited by a predetermined time unit, and the noise threshold is set to N times the average power.

23. The speech recognition system according to claim 22, wherein the power threshold PTH is set so as to satisfy the relationship PTH > NTH with respect to the noise threshold NTH.

24. The speech recognition system according to claim 23, wherein the relationship between the noise threshold NTH and the power threshold PTH is set such that ΔPTH = PTH − NTH is small in a relatively quiet environment and, conversely, large in an environment with much noise.

25. A speech recognition method in which words and sentences subject to speech recognition are collected and defined as a dictionary, and a recognized word or sentence is picked up from the dictionary and output as a displayed character string, as an image represented by the word, or as synthesized speech, the method comprising the steps of: detecting a speech section by detecting information on non-speech sections, in order to distinguish, in the input speech, the speech section from the beginning to the end of the speech from non-speech sections that contain no speech; performing speech analysis processing; and performing speech recognition processing on the speech analysis result by connecting an acoustic model holding speech patterns in phoneme units with the dictionary.

26. The speech recognition method according to claim 25, comprising the steps of: detecting the power level of sound in the non-speech section and determining, in accordance with this constantly changing power level, a power level indicating the start and end of speech; and judging the start of speech when the power level of the input speech exceeds the determined power level, and judging the end of speech when the power level of the speech remains below the determined power level for a predetermined time or more.

27. The speech recognition method according to claim 25, comprising the steps of: detecting the power level of sound in the non-speech section and calculating a noise threshold and a power threshold from that power level; and judging the start of speech when the power level of the input speech exceeds the noise threshold and further exceeds the power threshold.

28. The speech recognition method according to claim 27, further comprising the step of judging the end of the speech when the power level of the input speech remains below the power threshold for a predetermined time or more.

29. The speech recognition method according to claim 27, further comprising the steps of: obtaining the power level as the average power, over a plurality of frames, of the power of frames delimited by a predetermined time unit; and setting the noise threshold to N times the average power.

30. The speech recognition method according to claim 29, comprising the step of setting the power threshold PTH so as to satisfy the relationship PTH > NTH with respect to the noise threshold NTH.

31. The speech recognition method according to claim 30, comprising the step of setting the relationship between the noise threshold NTH and the power threshold PTH such that ΔPTH = PTH − NTH is small in a relatively quiet environment and, conversely, large in an environment with much noise.

32. The speech recognition system according to claim 22, wherein, for the noise threshold, the value of N is set small in a relatively quiet environment where the noise power fluctuates little over short periods and, conversely, the value of N is set large in a noisy environment where the noise power fluctuates greatly over short periods.

33. The speech recognition method according to claim 29, comprising the step of setting, for the noise threshold, the value of N small in a relatively quiet environment where the noise power fluctuates little over short periods and, conversely, setting the value of N large in a noisy environment where the noise power fluctuates greatly over short periods.