JPH09134193A

JPH09134193A - Speech recognition device

Info

Publication number: JPH09134193A
Application number: JP7289865A
Authority: JP
Inventors: Tetsutada Sakurai; 哲真桜井; Yoshio Nakadai; 芳夫中台; Yutaka Nishino; 豊西野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1995-11-08
Filing date: 1995-11-08
Publication date: 1997-05-20

Abstract

PROBLEM TO BE SOLVED: To make it possible both to recognize place names throughout Japan and to respond to commands under high noise, by switching speech recognition programs according to the environment of use. SOLUTION: A speech recognition unit 12 extracts speech feature patterns from speech signals input through a speech input unit 10 and recognizes the speech on the basis of these features. A first storage unit 14 stores the vocabulary to be recognized and/or the standard patterns for recognition, and/or vocabulary groups to be recognized and/or standard pattern groups for recognition that differ by application. A second storage unit 20 stores a plurality of speech recognition programs 121 to be loaded into the speech recognition unit 12. The device includes a sensor and interface unit 41 that detects an external input signal 40. The plurality of speech recognition programs 121 are used dynamically and selectively according to the state of the input signal 40 from the sensor.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device, and more particularly to a speech recognition device that takes speech as input and outputs characters or other recognition results.

[0002]

[Prior Art] Much research and development has been devoted to speech recognition devices that operate electrical and other equipment by speech input instead of manual operation. Ideally, speech recognition technology would recognize, with 100% probability, speech of arbitrary length uttered by a person from an arbitrary place at an arbitrary time. Under actual conditions of use, however, noise is present. To capture speech uttered at an arbitrary time, the speech input processing must repeatedly detect the start and end of speech within an observed signal section that also contains noise, and must constantly run a complicated algorithm to exclude only the noise, so the total amount of computation inevitably becomes enormous. Speech recognition technology therefore needs methods and algorithms that execute complicated algorithms efficiently, and many proposals have been made, including Japanese Patent Application No. 7-151698 by the inventors of the present application. A prior example of such a speech recognition device is described with reference to FIG. 3.

[0003] In FIG. 3, the speech recognition device comprises: a speech input unit 1P consisting of a microphone or other acousto-electric transducer that converts speech into a speech signal; a waveform conversion unit 2P that converts speech waveform data into digital numerical values; a speech feature extraction unit 3P that extracts features for speech recognition from the speech waveform; a start switch unit 4P that gives the trigger to begin start-point detection when detecting a speech section for recognition; a speech section detection unit 5P that determines exactly one start point and one end point of the speech from the speech features obtained from the speech feature extraction unit; an input pattern storage unit 6P that takes in the speech features from the start point to the end point determined by the speech section detection unit and holds them as an unknown input pattern; a standard pattern storage unit 8P that stores a plurality of labeled speech patterns for recognition; a pattern matching unit 10P that computes the similarity between the unknown input speech pattern held in the input pattern storage unit and each standard pattern held in the standard pattern storage unit and outputs the result as a distance value in feature space, for example one defined by the Mahalanobis distance formula; a distance comparison unit 11P that determines the standard pattern with the smallest of the distance values output for the standard patterns; and a result output unit 12P that sends the label name of the standard pattern determined by the distance comparison unit to have the smallest distance value to a host or a system bus.
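
The matching, comparison, and output stages of FIG. 3 (units 10P, 11P, 12P) can be illustrated with a minimal sketch. It assumes fixed-length feature vectors and a diagonal covariance per standard pattern, neither of which the text specifies; the function names and the sample data are hypothetical.

```python
import math

def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance under an assumed diagonal covariance (per-dimension variances)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))

def recognize(unknown, templates):
    """Return (label, distance) of the standard pattern closest to the unknown pattern."""
    # Pattern matching step: one distance value per labeled standard pattern.
    distances = {label: mahalanobis_diag(unknown, mean, var)
                 for label, (mean, var) in templates.items()}
    # Distance comparison step: keep the label with the smallest distance.
    best = min(distances, key=distances.get)
    return best, distances[best]

# Hypothetical standard patterns: label -> (mean vector, variance vector).
templates = {
    "sapporo": ([1.0, 2.0], [1.0, 4.0]),
    "tokyo": ([4.0, 0.0], [1.0, 1.0]),
}
label, dist = recognize([1.0, 4.0], templates)
```

A real recognizer compares variable-length feature sequences rather than single vectors, so this only shows the "smallest distance wins" decision rule.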

[0004] Because such a speech recognition device must be given necessary and sufficient computing power, a CPU such as a high-performance microprocessor or a digital signal processor is placed at the center of the device (see, for example, JP-A-7-140998). Using the high-performance CPU for other purposes as well is advantageous because it lowers the cost of using application programs suited to each purpose, so the form of speech recognition device shown in FIG. 2 is adopted. Here, the message processing unit 11 consists of a CPU that forms part of the speech recognition unit 12 or of a separately arranged CPU. The speech recognition unit 12 of the speech recognition device 1 recognizes the input speech signal, the message processing unit 11 interprets the message that the result represents, and a program is transferred to the speech recognition device, in accordance with the message, from the application program group 2 provided close to the speech recognition device. It is extremely useful here to provide, inside the speech recognition device, an application program management table 13 that manages which application programs the device is using and which it can use. The high-performance CPU forming the message processing unit 11 is also used to run several application programs 2 simultaneously as multiple tasks. While the speech recognition device can thus be operated at high performance, the performance of the speech recognition unit 12 itself still leaves room for the improvements described below.

[0005]

[Problems to Be Solved by the Invention] Satisfying the requirement placed on such a speech recognition device, that it "recognize anything, spoken by anyone, at any time", requires computation on a very large general-purpose computer. If the vocabulary is limited to tens of thousands of words or fewer, a computer with workstation-level computing power suffices, but as the size and weight of a workstation make clear, such a machine is far from portable. It also currently costs several hundred thousand to several million yen. This is what results from trying to cover all speech recognition with a single high-capability device. Conversely, if a simple low-cost speech recognition device is built for use in noisy environments without adaptive noise removal, a practical constraint arises: the target vocabulary shrinks to a few words.

[0006] These problems arise because the speech recognition program cannot fully adapt to the environment in which the device is placed. To be specific: a speech recognition program based on the well-known hidden Markov model (HMM) can handle a recognition vocabulary of a thousand to tens of thousands of words. It has the advantage of a high recognition rate even when filler words, such as the "er" in "er, Sapporo", precede or follow the target word. On the other hand, it leaves room for improvement in that building the dictionary models for recognition takes time and effort. The SPLIT method, which extends the well-known dynamic programming technique to speaker-independent recognition (Sugamura and Furui, "Large-vocabulary word speech recognition using pseudo-phoneme standard patterns", Shingakuron, J65-D, 8, pp. 1041-1048 (1982)), can recognize speech with less computation and memory than an HMM. However, because of the particular way its standard patterns are created, it is harder to scale to a large vocabulary than an HMM. As a simple speech recognition technique, programs using the SADP (Staggered Array DP) method are known; this method reduces the analysis order, generally 10 or higher, to about 8, computes autocorrelation coefficients, and compares them with standard patterns. Its computation and memory requirements are about 1/10 of those of general speech recognition and it responds quickly, but its recognition vocabulary is limited to 20 words at most. Finally, speech recognition that incorporates adaptive noise removal, which is effective in noisy environments, has excellent noise robustness but increases computation and memory by several tens of percent.
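
The trade-offs in this paragraph can be written down as a small table and a selection rule. The sketch below is only illustrative: the text gives orders of magnitude (SADP at most 20 words and about 1/10 of the cost, HMM up to tens of thousands of words, noise removal adding several tens of percent), so the exact numbers, the SPLIT vocabulary limit, the profile names, and the "cheapest adequate program" rule are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RecognizerProfile:
    name: str
    max_vocabulary: int    # largest vocabulary the program is assumed to handle well
    relative_cost: float   # computation/memory relative to a baseline recognizer (assumed)
    noise_robust: bool     # True if adaptive noise removal is built in

# Illustrative values only; see the assumptions stated above.
PROFILES = [
    RecognizerProfile("SADP", 20, 0.10, False),
    RecognizerProfile("SADP+noise", 20, 0.13, True),
    RecognizerProfile("SPLIT", 1_000, 0.50, False),
    RecognizerProfile("HMM", 50_000, 1.00, False),
]

def choose_profile(vocabulary_size, needs_noise_robustness):
    """Pick the cheapest profile that satisfies both requirements, or None if none does."""
    candidates = [p for p in PROFILES
                  if p.max_vocabulary >= vocabulary_size
                  and (p.noise_robust or not needs_noise_robustness)]
    return min(candidates, key=lambda p: p.relative_cost, default=None)
```

With these assumed profiles, a request for a very large vocabulary together with noise robustness returns `None`, which mirrors the point that no single program covers every situation.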

[0007] As described above, it is fair to say that no speech recognition program copes with every situation. The current practice is to select the most important requirement from among vocabulary size, noise robustness, compactness and economy of the computing hardware, and others, and to install an algorithm or program that suits it in the speech recognition device. The present invention provides a speech recognition device that solves the above problems.

[0008]

[Means for Solving the Problems] A speech recognition device is constructed that comprises: a speech input unit 10 that inputs a speech signal; a speech recognition unit 12 that extracts a speech feature pattern from the input speech signal and recognizes the speech on the basis of the speech feature pattern information; a first storage unit 14 that stores the vocabulary to be recognized and/or the standard patterns for recognition, and/or vocabulary groups to be recognized and/or standard pattern groups for recognition that differ by application; and a second storage unit 20 that stores a plurality of speech recognition programs 121 to be loaded into the speech recognition unit 12.

[0009] In this speech recognition device, a recognition target table 1211 attached to a speech recognition program 121 is further provided as standard patterns. In addition, the speech recognition device is provided with a sensor and interface unit 41 that detects an external input signal 40, and is configured to switch among the plurality of speech recognition programs according to the detected external input signal.

[0010]

[Embodiments of the Invention] In the present invention, a plurality of speech recognition programs are installed in the speech recognition device, as in a stored-program computer, and recognition is performed with a configuration not found in stored-program computers: the programs are used dynamically and selectively according to the state of an input signal from a sensor. As a means of meeting the requirement placed on a speech recognition device, that it "recognize anything, spoken by anyone, at any time", the present invention loads, that is, reads, a recognition program appropriate to the situation from the program storage unit into the speech recognition unit.

[0011]

[Examples] First, an embodiment of the present invention is outlined with reference to FIG. 1. In FIG. 1, the portion on the left, delimited by the dotted arrow 1, is the speech recognition device. The speech input unit 10 of this speech recognition device 1 receives speech and converts it into a speech signal; it consists of, for example, an audio microphone or a digital signal input terminal that receives speech waveform data. The sound input to the speech input unit 10 may be non-speech, such as machine sounds, computer-synthesized sounds, or animal cries, but for convenience all of these are referred to collectively as speech.

[0012] The speech recognition unit 12 recognizes the input speech signal captured through the speech input unit 10, and the central processing unit (CPU) 111 interprets the message that the recognition result represents. The speech recognition unit 12 is generally built from a DSP, which can execute speech recognition algorithms at high speed, but a microprocessor like the CPU can also be used. A configuration in which the CPU 111 itself serves as the CPU of the speech recognition unit 12 can also be adopted. In that case, because control of the speech recognition device 1 as a whole and control of the recognition computations must proceed simultaneously, a CPU 111 with multitasking capability is used.

[0013] The speech recognition device 1 and the CPU 111 are provided with a first storage unit 14 so that speech recognition and instructions can be executed efficiently. The first storage unit 14 stores the recognition dictionary / standard patterns 15 needed for speech recognition, an application program management table 13 that manages which application programs the speech recognition device 1 is using and which it can use, and a speech recognition program management table 16 that manages which speech recognition programs the speech recognition device 1 is using and which it can use.
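
One possible shape for the program management table 16 is sketched below. The text only says that the table tracks which recognition programs are in use and which can be used; the class names, the two boolean fields, and the rule that exactly one program is loaded at a time are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ProgramEntry:
    available: bool = True   # assumed meaning: the program can be used by the device
    loaded: bool = False     # assumed meaning: the program is currently in use

@dataclass
class RecognitionProgramTable:
    entries: dict = field(default_factory=dict)

    def register(self, name):
        """Record a program as available but not yet in use."""
        self.entries[name] = ProgramEntry()

    def mark_loaded(self, name):
        """Mark one program as in use; any previously loaded program is unmarked."""
        if name not in self.entries or not self.entries[name].available:
            raise KeyError(f"program not available: {name}")
        for entry in self.entries.values():
            entry.loaded = False
        self.entries[name].loaded = True

    def current(self):
        """Return the name of the program in use, or None."""
        return next((n for n, e in self.entries.items() if e.loaded), None)
```

The application program management table 13 could follow the same pattern, with multitasking handled by allowing several entries to be marked as in use.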

[0014] The speech recognition device 1 further includes a second storage unit 20. The reason for having two storage units, the first storage unit 14 and the second storage unit 20, is to improve the efficiency with which storage is used. In general, semiconductor memory such as DRAM or SRAM has the advantages of small size and fast response as storage, but the drawback of being relatively expensive. Memory such as CD-ROM or digital video disc (DVD) contrasts with semiconductor memory: its trade-offs are the reverse, the advantages of semiconductor memory being its drawbacks and vice versa. For this reason, a device containing computer elements generally uses semiconductor memory in combination with such memory. The present invention likewise adopts a configuration with, for example, a cache memory made of semiconductor memory as the first storage unit 14 and an external memory made of a large-capacity CD-ROM as the second storage unit 20. The large-capacity second storage unit 20 stores the speech recognition programs 121 to 121x and the recognition target tables 1211 attached to the speech recognition programs 121. The second storage unit 20 also stores a plurality of application programs 2, so that when a spoken command is understood, the requested program is immediately available to the speech recognition device.
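
The two-tier storage arrangement can be sketched as a small cache placed in front of a large, slow store. The least-recently-used eviction policy, the class name, and the read counter are assumptions; the text says only that fast semiconductor memory serves as a cache for programs kept on the CD-ROM.

```python
from collections import OrderedDict

class ProgramStore:
    """Small fast cache (first storage unit) in front of a large slow store (second storage unit)."""

    def __init__(self, slow_store, cache_slots):
        self.slow_store = slow_store     # e.g. CD-ROM contents: program name -> program image
        self.cache = OrderedDict()       # ordered from least to most recently used
        self.cache_slots = cache_slots
        self.slow_reads = 0              # number of reads that had to go to the slow store

    def load(self, name):
        """Return the program image, reading the slow store only on a cache miss."""
        if name in self.cache:
            self.cache.move_to_end(name)     # mark as most recently used
            return self.cache[name]
        image = self.slow_store[name]        # raises KeyError for unknown programs
        self.slow_reads += 1
        self.cache[name] = image
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)   # evict the least recently used program
        return image
```

Switching back and forth between two programs, as in the car navigation example, then costs a slow read only the first time each program is requested, provided the cache has at least two slots.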

[0015] In the speech recognition device 1, tables and programs distributed between the first storage unit 14 and the second storage unit 20 are constantly being swapped between the two or loaded into the speech recognition unit 12 and the CPU 111. These operations are carried out through a memory management unit (MMU) 30. The MMU 30 may be a module separate from the CPU 111, as shown in FIG. 1, or may be implemented using part of the functions of the CPU 111 or of the DSP at the core of the speech recognition unit 12.

[0016] The speech recognition device 1 is provided with buses, high-speed transmission paths for data, so that stored contents, recognition results, and application program calls can be exchanged at high speed among the storage units, the speech recognition unit, and the CPU described above. Reference numeral 31 denotes a first bus provided for the first storage unit 14, and 32 denotes a second bus provided for the second storage unit 20. Further buses can be added as needed, and the first bus 31 and the second bus 32 can also be connected to each other. These buses provide paths over which the speech recognition unit 12, the CPU 111, the MMU 30, the first storage unit 14, and the second storage unit 20 pass data and instructions, and the MMU 30 ensures that each can use them efficiently and without disrupting the operation of the speech recognition device.

[0017] As described above, swapping the speech recognition program according to the environment in which the device is used improves the overall performance of the speech recognition device. The configuration of the speech recognition device of the present invention is somewhat more complicated than that of conventional devices, and the need to develop several speech recognition programs adds cost. However, when many devices according to the invention are supplied, the development cost is spread over that number, so the cost borne by each device is slight. As for the larger storage capacity needed to hold several programs, the effective cost increase can be limited by keeping the programs permanently on a low-cost CD-ROM and using fast-access semiconductor memory only temporarily, as what is called a cache memory, when recognition is executed.

[0018] The present invention can improve the efficiency of speech recognition by taking external input signals into the speech recognition device, such as general sound pressure including speech, signals arising from vibration or acceleration, and GPS signals. These external input signals 40 are detected by the sensor and interface unit 41 provided in the speech recognition device and supplied to the bus 32 through the interface. The signals obtained through the sensor and interface unit 41 are sent to the CPU 111 and used to judge preset environmental conditions. An application to a car navigation device is described below as an example.

[0019] As is well known, the first step in using a car navigation device is setting the destination. If place names throughout Japan are covered down to the town and village level, one place name must be recognized out of roughly 200,000. Whether a destination is being set before departure can be determined by judging whether the command is the first one after the ignition key was inserted and, in addition, by detecting the change over time of the position information signal from the GPS sensor of the car navigation device. If the position given by the GPS signal does not change, the car is stationary. The movement of the car can also be judged from the signal of an acceleration sensor. When the car navigation device is setting the first destination, the only environmental factor that matters is, at most, the engine noise; road noise and wind noise, which make up most of a car's noise, are not a problem. In such an environment it is advisable to load into the speech recognition device a program that favors recognition rate or a large vocabulary over noise robustness.

[0020] Next, consider the car in motion. Whether the car is moving can be determined from the acceleration signal of the acceleration sensor mentioned above or from a vehicle speed signal, a pulse signal generated by the vehicle that represents the rotation speed of the tires. In this situation the device does not need to handle difficult operations or commands that pick one word out of a huge set, as place-name recognition does; it is enough to recognize a limited set of commands such as "Where am I now?", "What's next?", and "Enlarge map", and the vocabulary to be recognized at any one time is limited to a few dozen words at most. In an environment where road noise or wind noise is significant while driving, it is advisable to load into the speech recognition device a small-vocabulary recognition program that emphasizes noise robustness.
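
The car navigation example amounts to a simple rule that maps sensor readings to a recognition program. The sketch below assumes GPS fixes arrive as (latitude, longitude) pairs and that the vehicle speed signal is summarized as pulses per second; the tolerance value, the function names, and the program names are hypothetical.

```python
def is_stationary(gps_fixes, tolerance=1e-5):
    """True if every (lat, lon) fix stays within `tolerance` degrees of the first fix.

    An empty list gives no evidence of movement and is treated as stationary.
    """
    if not gps_fixes:
        return True
    lat0, lon0 = gps_fixes[0]
    return all(abs(lat - lat0) <= tolerance and abs(lon - lon0) <= tolerance
               for lat, lon in gps_fixes)

def select_program(gps_fixes, speed_pulses_per_second):
    """Map sensor readings to the (hypothetical) name of the recognition program to load."""
    moving = speed_pulses_per_second > 0 or not is_stationary(gps_fixes)
    if moving:
        # Road and wind noise dominate while driving: favor noise robustness, few commands.
        return "small_vocab_noise_robust"
    # Parked with at most engine noise: favor a large place-name vocabulary.
    return "large_vocab_place_names"
```

An acceleration sensor could feed the same decision; it is left out here only to keep the sketch short.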

[0021] The speech recognition unit 12 in the embodiment of FIG. 1 can be built, for example, from the speech recognition device described in the specification of Japanese Patent Application No. 7-151698. This is described with reference to FIG. 3. In FIG. 3, the device includes (1) a waveform conversion unit 2P that converts speech data obtained from the speech input unit 1P into digital numerical values. The processing of this waveform conversion unit includes, for example, converting an analog speech waveform into digital data, and receiving speech as compressed data such as ADPCM and converting it into linear data.

[0022] It also includes (2) a speech feature extraction unit 3P that detects the speech section in the speech waveform data obtained by the waveform conversion unit 2P and extracts the features used for recognition. The analysis methods used in the speech feature extraction unit are ones well known in speech recognition, such as short-time log power analysis and cepstrum analysis. The details of the speech recognition programs, their algorithms, and the protocols they use with other modules are stored in the first storage unit 14 or the second storage unit 20 and are loaded into the speech recognition unit 12 as needed to perform the intended function.
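
Short-time log power analysis, one of the feature types named here, can be sketched as follows. The frame length, the hop size, the decibel scaling, and the small floor added before taking the logarithm are assumptions; the text does not give analysis parameters.

```python
import math

def short_time_log_power(samples, frame_len, hop):
    """Return the log power (in dB) of each frame of `samples`.

    A small floor is added to the mean square so that silent frames do not hit log(0).
    """
    floor = 1e-10
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        mean_square = sum(s * s for s in frame) / frame_len
        powers.append(10.0 * math.log10(mean_square + floor))
    return powers
```

Cepstrum analysis would add a spectral transform per frame and is not shown.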

[0023] It also includes (3) a speech section detection unit 5P that identifies the start and end of speech from the speech features obtained from the speech feature extraction unit 3P. One usable method measures the noise level before the utterance and takes as the speech section the interval over which signal components whose log power exceeds that noise level by at least a fixed threshold persist within a fixed time. The noise level can be detected with the microphone of the speech recognition device itself, but given the constraints of the microphone's frequency response and directivity, it is more practical to input it separately. In general, a unidirectional microphone suits the speech input and an omnidirectional microphone suits noise detection.
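
The threshold-based detection of the speech section can be sketched on a sequence of per-frame log power values. Estimating the noise level as the mean of the first few frames, the 10 dB margin, and the minimum run length are assumptions chosen for illustration.

```python
def detect_speech_section(log_powers, noise_frames=3, margin_db=10.0, min_frames=2):
    """Return (start, end) frame indices (end exclusive) of the first speech run, or None.

    The noise level is estimated from the first `noise_frames` frames, which are
    assumed to precede the utterance.
    """
    if len(log_powers) <= noise_frames:
        return None
    noise_level = sum(log_powers[:noise_frames]) / noise_frames
    threshold = noise_level + margin_db
    start = None
    for i, power in enumerate(log_powers):
        if power >= threshold:
            if start is None:
                start = i                      # candidate start of speech
        else:
            if start is not None and i - start >= min_frames:
                return start, i                # run was long enough: speech section found
            start = None                       # too short: treat as a noise burst
    if start is not None and len(log_powers) - start >= min_frames:
        return start, len(log_powers)          # speech runs to the end of the data
    return None
```

Returning exactly one start and one end matches the description of unit 5P; handling pauses inside an utterance would need an extra hangover rule that is not shown.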

[0024] It further includes (4) a start switch unit 4P that gives the trigger to begin start-point detection when detecting a speech section during recognition. The start switch may be a voice switch that automatically captures the moment the user speaks, or a press-to-talk button that the user presses when speaking.

[0025] It includes (5) an input pattern storage unit 6P that takes in the speech features from the start to the end of speech determined by the speech section detection unit and stores them as an unknown input pattern, and (6) a standard pattern storage unit 8P that stores a plurality of labeled standard speech patterns, analyzed and stored by the same procedure as that by which the unknown input pattern reaches the input pattern storage unit. The standard pattern information also includes speech section information corresponding to that detected by the speech section detection unit.

[0026] Further, (7) a pattern matching unit 10P is provided, which calculates the similarity between the unknown input voice pattern stored in the input pattern storage unit and each standard pattern stored in the standard pattern storage unit. The similarity is calculated using, for example, DP matching. Furthermore, (8) a distance comparison unit 11P is provided, which accumulates the results of the similarity calculations and determines the standard pattern whose difference from the unknown input voice pattern is smallest.
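
The roles of the pattern matching unit 10P and the distance comparison unit 11P can be illustrated with the sketch below. It is an assumption-laden illustration: it uses a plain symmetric dynamic time warping recurrence with Euclidean frame distances, not necessarily the DP matching variant used in the device, and the names `dp_distance` and `recognize` are hypothetical.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two feature vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dp_distance(unknown, reference):
    """Length-normalized DP (dynamic time warping) distance between patterns."""
    n, m = len(unknown), len(reference)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(unknown[i - 1], reference[j - 1])
            # Allow a step along either pattern or along both (diagonal).
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m] / (n + m)

def recognize(unknown, standard_patterns):
    """Return the label of the closest standard pattern and all distances."""
    distances = {label: dp_distance(unknown, ref)
                 for label, ref in standard_patterns.items()}
    best_label = min(distances, key=distances.get)
    return best_label, distances
```

In this sketch `standard_patterns` plays the role of the standard pattern storage unit 8P (label name mapped to a feature sequence), and the returned label corresponds to what would be passed on as the recognition result.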

[0027] Further, (9) a result output unit 12P is provided, which outputs the label name of the standard pattern judged most similar by the distance comparison unit to the CPU 111, the host of the voice recognition unit 12. In the embodiment described above, the standard patterns are normally analyzed and prepared in advance and already registered. That is, these registered standard patterns are the recognition dictionary / standard pattern 15 in FIG. 1, and small-scale standard patterns can also be used in a form such as the recognition target table 1211 attached to the voice recognition program 121. In the present invention, in order to adapt the voice recognition device to its usage environment, or to have the voice recognition device respond efficiently to the user's requests, the recognition target vocabulary, the standard patterns for recognition, or both, as well as recognition target vocabulary groups having different application targets, recognition standard pattern groups having different application targets, or both, are stored in the storage unit as necessary. Each program, whether it uses the SPLIT method, an HMM-based voice recognition method, or another method, has situations to which it is suited, so a mechanism is required that transfers program information to the voice recognition unit 12 at high speed via the bus as needed.

[0028] Next, the results of recognition experiments are described in which a voice recognition program based on DP matching and a simple voice recognition program that performs voice recognition using only 8th-order autocorrelation coefficients were installed in the voice recognition device and used according to the situation. The recognition target vocabulary consisted of the top 60 words of the 100 Japanese city names listed in the reference "Acoustical Society Proceedings, Common Speech Data for Speech Recognition" (author Itabashi, published 1985) and commands ("large": enlarge the map; "small": reduce the map; "end": end of command; and others, 10 words), uttered by four male speakers in environments with noise levels of 65 dB and 75 dB. The speech was passed through a 300 Hz to 3.4 kHz filter and converted at 8 kHz, and for the DP matching voice recognition program a short-time LPC cepstrum analysis was performed every 128 msec. Voice section detection used short-time logarithmic power. The voice start point is detected as follows: when the signal power value rises from the no-voice state to a value at or above a certain threshold and stays there for a certain time, the rising position of the signal power value is taken as the start point. After that, the voice section detection unit detects the point at which the signal power value of the voice decays and takes it as the voice end point. For the simple voice recognition program, the preprocessing of the voice signal used the same conditions as for the DP matching program, and the subsequent similarity comparison followed the procedure of the program itself. The similarity calculation in the DP matching voice recognition program is Staggered Array DP with a fixed start point and a free end point.
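
As a rough illustration of the features used by the simple voice recognition program, the sketch below computes normalized autocorrelation coefficients up to the 8th order for one frame. The source does not specify how the program compares these coefficients, so the nearest-template comparison by squared Euclidean distance, and the names `autocorrelation_features` and `nearest_command`, are assumptions made purely for the example.

```python
def autocorrelation_features(frame, order=8):
    """Normalized autocorrelation coefficients r(1)..r(order) of one frame."""
    n = len(frame)
    r0 = sum(s * s for s in frame)
    if r0 == 0.0:
        return [0.0] * order  # silent frame: no meaningful correlation
    return [
        sum(frame[t] * frame[t + k] for t in range(n - k)) / r0
        for k in range(1, order + 1)
    ]

def nearest_command(features, templates):
    """Pick the template label closest to the features (assumed comparison)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates,
               key=lambda label: squared_distance(features, templates[label]))
```

Because only eight numbers per frame are compared, a scheme of this kind is cheap to run, which fits the small command vocabulary it is used with in the experiment.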

[0029] With the DP matching program, the misrecognition rate was 5% at a noise level of 65 dB, whereas in a 75 dB environment, where the noise level is 10 dB higher, the misrecognition rate rose to about 40%, a level expected to hinder practical use. Voice recognition by the SPLIT method or by HMM is expected to give similar results.

[0030] With the simple voice recognition program, on the other hand, the misrecognition rate for the 10 defined commands stayed within 2% under 65 dB noise and within 10% even under 75 dB noise. The relatively high recognition rate of the simple voice recognition program under noise is due in part to the choice, when creating the recognition dictionary / standard pattern 15 and the recognition target table 1211 shown in FIG. 1, of a vocabulary whose words are easy to distinguish from one another. This is also one effect of the present invention, which switches among these pieces of information dynamically.

[0031] Based on these results, a prototype car navigation device with a voice input function was built in which the voice recognition device performs the following functions once the engine key is inserted and the vehicle's accessories are energized, and good results were obtained.
1. Initialization of the car navigation device with voice input function and loading of the SPLIT voice recognition program, after which the device waits for place-name input.

[0032] 2. Setting of the destination by voice input before departure.
3. Start of navigation when the acceleration sensor detects the movement speed of the vehicle. At the same time, the main program of the voice recognition unit is switched from the previous program to the simple voice recognition program.
4. Waiting for input of response commands (enlarge, reduce, and so on).
5. Loading of application programs and handling of screen operations according to the user's commands.
6. Automatic shutdown of the car navigation device with voice input function when the engine key is removed.
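
The switching behavior in steps 1 to 6 can be pictured as a small state machine. The sketch below is hypothetical: the event names (`accessory_on`, `vehicle_moving`, `key_removed`), the program names (`split`, `simple`), and the class `RecognizerSwitcher` are invented for illustration, and the recognizers themselves are passed in as stand-in functions.

```python
class RecognizerSwitcher:
    """Selects the active voice recognition program from sensor events."""

    def __init__(self, programs):
        # programs: mapping of program name -> callable(utterance) -> result
        self.programs = programs
        self.active = None  # no program is loaded until the key is inserted

    def on_event(self, event):
        """Update the active program for an (assumed) sensor event name."""
        if event == "accessory_on":
            self.active = "split"    # step 1: large-vocabulary place names
        elif event == "vehicle_moving":
            self.active = "simple"   # step 3: noise-robust command program
        elif event == "key_removed":
            self.active = None       # step 6: automatic shutdown
        return self.active

    def recognize(self, utterance):
        """Run the currently loaded program on an utterance."""
        if self.active is None:
            raise RuntimeError("no voice recognition program is loaded")
        return self.programs[self.active](utterance)
```

A design of this kind keeps the decision of which program to run separate from the programs themselves, so further programs or sensor conditions could be added without changing the recognizers.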

[0033]

[Effects of the Invention] As described above, the present invention adopts a configuration in which the voice recognition program is switched according to the environment in which the voice recognition device is used. It thereby realizes a dual-purpose voice recognition device that was difficult to achieve with a device operated by a single program alone: one that can, for example, recognize place names throughout Japan and can also handle command responses under high noise. Furthermore, by using the acoustic sensor of the voice recognition device to switch programs, a natural voice recognition service can be provided without placing any burden on the user.

[Brief description of the drawings]

FIG. 1 is a block diagram illustrating an embodiment.

FIG. 2 is a block diagram illustrating a conventional example of a voice recognition device.

FIG. 3 is a block diagram illustrating a prior example of a voice recognition unit.

[Explanation of symbols]

1 voice recognition device
10 voice input unit
12 voice recognition unit
111 CPU
13 application program management table
14 first storage unit
15 recognition dictionary / standard pattern
16 voice recognition program management table
20 second storage unit
30 MMU
31 first bus
32 second bus
40 external input signal
41 sensor and interface unit
121 voice recognition program
1211 recognition target table

Claims

[Claims]

1. A voice recognition device comprising: a voice input unit for inputting a voice signal; a voice recognition unit that extracts a voice characteristic pattern from the input voice signal and recognizes the voice based on the voice characteristic pattern information; a first storage unit that stores a recognition target vocabulary and/or standard patterns for recognition, and/or recognition target vocabulary groups having different application targets and/or recognition standard pattern groups having different application targets; and a second storage unit that stores a plurality of voice recognition programs to be loaded into the voice recognition unit.

2. The voice recognition device according to claim 1, further comprising a recognition target table attached to a voice recognition program as a standard pattern.

3. The voice recognition device according to claim 1, further comprising a sensor for detecting an external input signal and an interface unit, wherein the plurality of voice recognition programs are switched according to the detected external input signal.