JPWO2005010868A1

JPWO2005010868A1 - Speech recognition system and its terminal and server

Info

Publication number: JPWO2005010868A1
Application number: JP2005504586A
Authority: JP
Inventors: 知宏成田; 貴志須藤; 利行花沢
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2003-07-29
Filing date: 2003-07-29
Publication date: 2006-09-14
Also published as: WO2005010868A1

Abstract

Provided is a speech recognition system that performs highly accurate speech recognition even when used in a variety of environments. The speech recognition processing consists of calculating speech feature values from a speech signal collected by an external microphone 1, storing a plurality of acoustic models, selecting from them an acoustic model suited to the environment in which the external microphone 1 collects sound, and performing pattern matching between the standard patterns of the selected acoustic model and the speech feature values to output a recognition result. In a client-server speech recognition system in which this processing is shared between a speech recognition terminal 2 and a speech recognition server 6 connected to a network, the speech recognition terminal 2 is provided with a sensor 12 for detecting the sound collection environment of the external microphone 1 and with a transmission unit 13 that transmits the output of the sensor 12 to the speech recognition server 6.

Description

The present invention relates to a speech recognition system and to its terminal and server, and in particular to a technique for selecting, from a plurality of acoustic models prepared for various usage conditions, the acoustic model appropriate to the current usage conditions and performing speech recognition with it.

Speech recognition is performed by extracting a time series of speech feature values from the input speech and calculating candidate words by matching this time series against acoustic models prepared in advance.
Speech uttered in a real usage environment, however, has background noise superimposed on it, which degrades recognition accuracy. The type of background noise and the way it is superimposed differ depending on the usage environment. Highly accurate speech recognition therefore requires preparing a plurality of acoustic models and selecting, from among them, one suited to the current usage environment. One such method of selecting an acoustic model is described in, for example, JP-A-2000-29500 (Patent Document 1).
In the method of Patent Document 1, for example in an in-vehicle speech recognition device, a noise spectrum is calculated from the noise corresponding to the values output by various in-vehicle sensors such as a speed sensor (these values are the data obtained by A/D conversion of the analog signals from the sensors, and are hereinafter called sensor information). The noise spectrum is stored in association with the sensor information from the in-vehicle sensors. At the next speech recognition, if the similarity between the newly obtained sensor information and the sensor information of a stored noise spectrum is within a predetermined value, the noise spectrum corresponding to that sensor information is subtracted from the time series of speech feature values.
This method, however, cannot improve recognition accuracy in an environment that has never been encountered before. One conceivable alternative is, for example at the time of factory shipment, to select several predetermined values from among the output values of the various sensors and to create in advance acoustic models trained under the environmental conditions in which the sensors output those values. The sensor information obtained in the actual usage environment is then compared with the environmental conditions of the acoustic models to select an appropriate one.
The data size of a single acoustic model, although it depends on how the speech recognition system is designed and implemented, can reach several hundred kilobytes. In mobile devices such as car navigation systems and mobile phones, the capacity of the storage that can be installed is severely limited by constraints on housing size and weight. Storing multiple acoustic models of this size on a mobile device is therefore impractical.
In particular, when there are multiple sensors, selecting several sensor information values for each sensor and preparing an acoustic model for every combination of those values requires an enormous storage capacity.
The present invention has been made to solve the above problem. Its object is to realize highly accurate speech recognition by transmitting sensor information over a network from a speech recognition terminal to a speech recognition server that stores a plurality of acoustic models, so that an acoustic model suited to the actual usage environment is selected.

A speech recognition system according to the present invention is
a speech recognition system in which a speech recognition server and a plurality of speech recognition terminals are connected via a network, wherein
each speech recognition terminal comprises:
an input terminal to which an external microphone is connected and through which a speech signal collected by the external microphone is input;
client-side acoustic analysis means for calculating speech feature values from the speech signal input through the input terminal;
a sensor for detecting sensor information representing the type of noise superimposed on the speech signal;
client-side transmission means for transmitting the sensor information to the speech recognition server via the network;
client-side receiving means for receiving an acoustic model from the speech recognition server; and
client-side matching means for matching the acoustic model against the speech feature values, and
the speech recognition server comprises:
server-side receiving means for receiving the sensor information transmitted by the client-side transmission means;
server-side acoustic model storage means for storing a plurality of acoustic models;
server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model that matches the sensor information; and
server-side transmission means for transmitting the acoustic model selected by the server-side acoustic model selection means to the speech recognition terminal.
In this speech recognition system, a plurality of acoustic models corresponding to various sound collection environments are stored in a speech recognition server whose storage capacity is not limited. Based on information from the sensor provided in each speech recognition terminal, an acoustic model suited to the sound collection environment of that terminal is selected and transmitted to the terminal. Even when the storage capacity of the terminal itself is limited by constraints such as housing size and weight, the terminal acquires an acoustic model suited to its sound collection environment and performs speech recognition with that model, so recognition accuracy can be improved.

FIG. 1 is a block diagram showing the configuration of a speech recognition terminal and server according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the configuration of a speech recognition terminal and server according to Embodiment 2 of the present invention.
FIG. 4 is a flowchart showing the acoustic model clustering process according to Embodiment 2 of the present invention.
FIG. 5 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 2 of the present invention.
FIG. 6 is a block diagram showing the configuration of a speech recognition terminal and server according to Embodiment 3 of the present invention.
FIG. 7 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 3 of the present invention.
FIG. 8 is a block diagram showing the configuration of a speech recognition terminal and server according to Embodiment 4 of the present invention.
FIG. 9 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 4 of the present invention.
FIG. 10 is a diagram showing the data format of the sensor information and speech data transmitted from the speech recognition terminal to the speech recognition server according to Embodiment 4 of the present invention.
FIG. 11 is a block diagram showing the configuration of a speech recognition terminal and server according to Embodiment 5 of the present invention.
FIG. 12 is a flowchart showing the operation of the speech recognition terminal and server according to Embodiment 5 of the present invention.

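FIG. 1 is a block diagram showing the configuration of a speech recognition terminal and server according to one embodiment of the present invention. In the figure, a microphone 1 is a device or component that collects speech, and a speech recognition terminal 2 is a device that recognizes the speech collected by the microphone 1, received via an input terminal 3, and outputs a recognition result 4. The input terminal 3 is an audio terminal or a microphone connection terminal.
The speech recognition terminal 2 is connected to a speech recognition server 6 via a network 5. The network 5 is a network for communicating digital information, such as the Internet, a LAN (Local Area Network), a public switched network, a mobile phone network, or a communication network using artificial satellites. It is sufficient, however, that the network 5 allows the devices connected to it to exchange digital data; the format of the information carried on the network 5 does not matter. The network 5 may therefore be a bus designed to connect multiple devices, such as USB (Universal Serial Bus) or SCSI (Small Computer System Interface). When the speech recognition terminal 2 is an in-vehicle speech recognition device, the network 5 uses a mobile data communication service. Such a service uses a communication scheme in which the data to be exchanged is divided into units called packets that are transmitted and received one at a time. In addition to the data that the sending device intends to deliver to the receiving device, each packet carries control information such as information identifying the receiving device (destination address), position information indicating which part of the whole data the packet constitutes, and an error correction code.
The speech recognition server 6 is a server computer configured to be connected to the speech recognition terminal 2 via the network 5. The speech recognition server 6 has a storage device, such as a hard disk device or memory, with a larger storage capacity than that of the speech recognition terminal 2, and stores the standard patterns required for speech recognition. A plurality of speech recognition terminals 2 can be connected to the speech recognition server 6 via the network 5.
Next, the detailed configuration of the speech recognition terminal 2 is described. The speech recognition terminal 2 comprises a terminal-side acoustic analysis unit 11, a sensor 12, a terminal-side transmission unit 13, a terminal-side receiving unit 14, a terminal-side acoustic model storage unit 15, a terminal-side acoustic model selection unit 16, and a terminal-side matching unit 17.
The terminal-side acoustic analysis unit 11 performs acoustic analysis on the speech signal input through the input terminal 3 and calculates speech feature values.
The sensor 12 detects environmental conditions in order to obtain information about the type of noise superimposed on the speech signal acquired by the microphone 1. It is an element or device that detects or acquires a physical quantity, or a change in a physical quantity, in the environment in which the microphone 1 is installed. It may also include an element or device that converts the detected quantity into a suitable signal and outputs it. The physical quantities referred to here include temperature, pressure, flow rate, light, and magnetism, as well as time and electromagnetic waves. A GPS antenna, for example, is thus a sensor for GPS signals. The sensor need not detect a physical quantity by acquiring a signal from the outside world; a circuit that obtains the local time at the location of the microphone from a built-in clock, for example, is also a sensor in this sense.
In the following description, these physical quantities are collectively called sensor information. A sensor generally outputs an analog signal, and the usual configuration samples that analog signal into a digital signal with an A/D converter or element. The sensor 12 may therefore include such an A/D converter or element. Multiple types of sensors may also be combined. When the speech recognition terminal 2 is a terminal of an in-vehicle navigation system, for example, the combination may include a speed sensor, a sensor that monitors engine speed, a sensor that monitors wiper operation, a sensor that monitors whether the door windows are open or closed, and a sensor that monitors the car audio volume.
The terminal-side transmission unit 13 transmits the sensor information obtained by the sensor 12 in the vicinity of the microphone 1 to the speech recognition server 6.
The terminal-side receiving unit 14 receives information from the speech recognition server 6 and outputs the received information to the terminal-side acoustic model selection unit 16. The terminal-side transmission unit 13 and the terminal-side receiving unit 14 consist of circuits or elements that send signals to, and receive signals from, a network cable; a computer program that controls these circuits or elements may be included as part of the two units. When the network 5 is a wireless network, the terminal-side transmission unit 13 and the terminal-side receiving unit 14 are provided with an antenna that transmits and receives the communication waves. The terminal-side transmission unit 13 and the terminal-side receiving unit 14 may be configured as separate units or implemented by a single network input/output device.
The terminal-side acoustic model storage unit 15 is a storage element or circuit for storing acoustic models. Multiple acoustic models can exist, one per training environment, and only some of them are stored in the terminal-side acoustic model storage unit 15. Each acoustic model is associated with the sensor information representing the environmental conditions under which it was trained, so that the acoustic model suited to given environmental conditions can be identified from the sensor information values. When the speech recognition terminal 2 is an in-vehicle speech recognition device, for example, the prepared models include an acoustic model created from samples uttered in the noise environment of a car traveling at 40 km/h, an acoustic model created from samples uttered in the noise environment of a car traveling at 50 km/h, and so on. As described later, the speech recognition server 6 also stores acoustic models for a variety of environmental conditions, so the terminal-side acoustic model storage unit 15 need not store acoustic models trained under every environmental condition. With this configuration, the storage capacity that the speech recognition terminal 2 must carry can be very small.
The terminal-side acoustic model selection unit 16 calculates the likelihood between the acoustic model acquired by the terminal-side receiving unit 14 (or an acoustic model stored in the terminal-side acoustic model storage unit 15) and the speech feature values output by the terminal-side acoustic analysis unit 11. The terminal-side matching unit 17 selects a vocabulary item based on the likelihood calculated by the terminal-side acoustic model selection unit 16 and outputs it as the recognition result 4.
Of the components of the speech recognition terminal 2, the terminal-side acoustic analysis unit 11, the terminal-side transmission unit 13, the terminal-side receiving unit 14, the terminal-side acoustic model storage unit 15, the terminal-side acoustic model selection unit 16, and the terminal-side matching unit 17 may each be implemented as a dedicated circuit, or as a computer program that causes a central processing unit (CPU), a network I/O device (such as a network adapter), and a storage device to execute the processing corresponding to each function.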
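Next, the detailed configuration of the speech recognition server 6 is described. The speech recognition server 6 comprises a server-side receiving unit 21, a server-side acoustic model storage unit 22, a server-side acoustic model selection unit 23, and a server-side transmission unit 24. The server-side receiving unit 21 receives, via the network 5, the sensor information transmitted from the terminal-side transmission unit 13 of the speech recognition terminal 2.
The server-side acoustic model storage unit 22 is a storage device for storing a plurality of acoustic models. It is configured as a mass storage device such as a hard disk device or a combination of a CD-ROM medium and a CD-ROM drive.
Unlike the terminal-side acoustic model storage unit 15, the server-side acoustic model storage unit 22 stores every acoustic model that may be used in this speech recognition system, and has sufficient storage capacity to do so.
The server-side acoustic model selection unit 23 selects, from the acoustic models stored in the server-side acoustic model storage unit 22, the acoustic model suited to the sensor information received by the server-side receiving unit 21.
The server-side transmission unit 24 transmits the acoustic model selected by the server-side acoustic model selection unit 23 to the speech recognition terminal 2 via the network 5.
Of the components of the speech recognition server 6, the server-side receiving unit 21, the server-side acoustic model storage unit 22, the server-side acoustic model selection unit 23, and the server-side transmission unit 24 may each be implemented as a dedicated circuit, or as a computer program that causes a central processing unit (CPU), a network I/O device (such as a network adapter), and a storage device to execute the processing corresponding to each function.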
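Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 is described with reference to the drawings. FIG. 2 is a flowchart showing the processing of the speech recognition terminal 2 and the speech recognition server 6 according to Embodiment 1. When the user speaks into the microphone 1 (step S101), the speech signal is input to the terminal-side acoustic analysis unit 11 via the input terminal 3. The terminal-side acoustic analysis unit 11 then converts the signal into a digital signal with an A/D converter and calculates a time series of speech feature values such as the LPC cepstrum (Linear Predictive Coding cepstrum) (step S102).
Next, the sensor 12 acquires a physical quantity around the microphone 1 (step S103). For example, when the speech recognition terminal 2 is a car navigation system and the sensor 12 is a speed sensor that detects the speed of the vehicle in which the system is installed, the speed is such a physical quantity. In FIG. 2, the collection of sensor information in step S103 follows the acoustic analysis in step S102. Step S103 may, however, be performed before steps S101 and S102, or simultaneously or in parallel with them.
The terminal-side acoustic model selection unit 16 then selects the acoustic model trained under the conditions closest to the sensor information obtained by the sensor 12, that is, closest to the environment in which the microphone 1 collects speech. Many environmental conditions are possible for an acoustic model, and the terminal-side acoustic model storage unit 15 does not store a model for all of them. If none of the acoustic models currently stored in the terminal-side acoustic model storage unit 15 was trained under environmental conditions close to those of the microphone 1, an acoustic model is acquired from the speech recognition server 6.
Before the processing is described, terms and notation are defined. The sensor information for sensor k under the conditions in which acoustic model m was trained is called simply "the sensor information of acoustic model m". The terminal-side acoustic model storage unit 15 stores M acoustic models, each denoted acoustic model m (m = 1, 2, ..., M). The sensor 12 consists of K sensors, each denoted sensor k (k = 1, 2, ..., K). The sensor information for sensor k under the environmental conditions in which acoustic model m was trained is written $S_{m,k}$, and the current sensor information of sensor k (the sensor information output in step S103) is written $x_k$.
The processing is now described more specifically. First, the terminal-side acoustic model selection unit 16 calculates the distance value $D(m)$ between the sensor information $S_{m,k}$ of acoustic model m and the sensor information $x_k$ acquired by the sensor 12 (step S104). Let $D_k(x_k, S_{m,k})$ be the distance value between the sensor information $x_k$ of a sensor k and the sensor information $S_{m,k}$ of acoustic model m. As a concrete distance value $D_k(x_k, S_{m,k})$, the absolute value of the difference between the sensor information values, for example, may be used. If the sensor information is a speed, the difference (10 km/h) between the speed at training time (for example, $S_{m,k}$ = 40 km/h) and the current speed (for example, $x_k$ = 50 km/h) is taken as the distance value $D_k(x_k, S_{m,k})$.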
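The distance value $D(m)$ is calculated from the per-sensor distance values $D_k(x_k, S_{m,k})$ as follows.

$$D(m) = \sum_{k=1}^{K} w_k \, D_k(x_k, S_{m,k}) \qquad (1)$$

Here, $w_k$ is the weight coefficient for each sensor.
The relationship between sensor information as a physical quantity and the distance value $D(m)$ is as follows. Sensor information that is a position (which may be determined from longitude and latitude, or from the distance from a particular place taken as the origin) and sensor information that is a speed have different physical dimensions. Because the contribution of $w_k D_k(x_k, S_{m,k})$ to the distance value can be set appropriately by adjusting the weight coefficient $w_k$, ignoring the difference in dimension causes no problem. The same applies when the unit systems differ. For example, a speed expressed in km/h and the same speed expressed in mph take different values as sensor information even though they are physically identical. In such a case, for example, giving a weight coefficient of 1.6 to speed values calculated in km/h and a weight coefficient of 1.0 to speed values calculated in mph makes the effect of speed on the distance calculation equal.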
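The distance calculation of step S104 can be illustrated with a minimal Python sketch. It assumes the absolute difference as the per-sensor distance, which the text names only as one example, and a weighted sum over the K sensors as the combined distance. The function names `sensor_distance` and `model_distance` and the sample values are hypothetical and do not come from the patent.

```python
from typing import Sequence


def sensor_distance(x_k: float, s_mk: float) -> float:
    """Per-sensor distance D_k: absolute difference (one possible choice)."""
    return abs(x_k - s_mk)


def model_distance(
    x: Sequence[float],    # current sensor information x_k, k = 1..K
    s_m: Sequence[float],  # sensor information S_{m,k} of acoustic model m
    w: Sequence[float],    # weight coefficients w_k
) -> float:
    """Combined distance D(m): weighted sum of the per-sensor distances."""
    if not (len(x) == len(s_m) == len(w)):
        raise ValueError("x, s_m and w must all have length K")
    return sum(
        w_k * sensor_distance(x_k, s_mk)
        for x_k, s_mk, w_k in zip(x, s_m, w)
    )


# Hypothetical example with K = 2 sensors: speed in km/h and wiper state (0 or 1).
current = [50.0, 1.0]
trained = [40.0, 0.0]
weights = [1.0, 2.0]
distance = model_distance(current, trained, weights)
```

With these sample values the speed term contributes 10 and the wiper term contributes 2, so the weights decide how strongly each sensor influences which model counts as closest.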
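Next, the terminal-side acoustic model selection unit 16 finds the minimum value min{D(m)} of the distance values $D(m)$ calculated for each m by equation (1), and evaluates whether min{D(m)} is smaller than a predetermined value T (step S105). In other words, it tests whether the environmental conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15 include one sufficiently close to the current environmental conditions in which the microphone 1 collects sound. The predetermined value T is a value set in advance for this test.
If min{D(m)} is smaller than the predetermined value T (step S105: Yes), processing proceeds to step S106. The terminal-side acoustic model selection unit 16 selects the terminal-side acoustic model m that gives the minimum as the acoustic model suited to the current environment in which the microphone 1 collects sound (step S106). Processing then proceeds to the matching process (step S112), which is described later.
If min{D(m)} is greater than or equal to the predetermined value T (step S105: No), processing proceeds to step S107. In this case, none of the environmental conditions of the acoustic models stored in the terminal-side acoustic model storage unit 15 is sufficiently close to the current environmental conditions in which the microphone 1 collects sound. The terminal-side transmission unit 13 therefore transmits the sensor information to the speech recognition server 6 (step S107).
Increasing the predetermined value T makes min{D(m)} fall below T more often, so step S107 is executed less often. A larger T thus reduces the number of transmissions and receptions over the network 5 and suppresses the amount of traffic on the network 5.
Conversely, decreasing T increases the number of transmissions and receptions over the network 5. In that case, however, speech recognition uses an acoustic model whose training conditions are at a smaller distance from the sensor information acquired by the sensor 12, so recognition accuracy improves. The value of T should therefore be chosen in light of both the traffic on the network 5 and the target recognition accuracy.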
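The branch at step S105 can be sketched as follows, again under stated assumptions: the per-sensor distance is the absolute difference, models are keyed by hypothetical string names, and a return value of `None` stands for "no local model is close enough, so send the sensor information to the server (step S107)". None of these names appear in the patent.

```python
from typing import Mapping, Optional, Sequence


def weighted_distance(
    x: Sequence[float], s: Sequence[float], w: Sequence[float]
) -> float:
    """D(m) as a weighted sum of absolute per-sensor differences (assumed form)."""
    return sum(w_k * abs(x_k - s_k) for x_k, s_k, w_k in zip(x, s, w))


def choose_local_model(
    x: Sequence[float],
    local_models: Mapping[str, Sequence[float]],  # model name -> S_{m,k} values
    w: Sequence[float],
    threshold: float,                             # the predetermined value T
) -> Optional[str]:
    """Return the closest local model if min{D(m)} < T, otherwise None."""
    if not local_models:
        return None
    best = min(
        local_models,
        key=lambda name: weighted_distance(x, local_models[name], w),
    )
    if weighted_distance(x, local_models[best], w) < threshold:
        return best   # step S105: Yes, use the terminal-side model (S106)
    return None       # step S105: No, fall back to the server (S107)


# Hypothetical terminal holding two models trained at 40 km/h and 50 km/h.
local = {"40kmh": [40.0], "50kmh": [50.0]}
near = choose_local_model([48.0], local, [1.0], threshold=5.0)
far = choose_local_model([80.0], local, [1.0], threshold=5.0)
```

Here `near` names a local model and `far` is `None`; raising `threshold` would turn more requests into local hits at the cost of a looser match, which is the trade-off the text describes for T.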
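In the speech recognition server 6, the server-side receiving unit 21 receives the sensor information via the network 5 (step S108). The server-side acoustic model selection unit 23 then calculates, in the same way as in step S104, the distance values between the environmental conditions under which the acoustic models stored in the server-side acoustic model storage unit 22 were trained and the sensor information received by the server-side receiving unit 21, and selects the acoustic model with the smallest distance value (step S109). The server-side transmission unit 24 then transmits the acoustic model selected by the server-side acoustic model selection unit 23 to the speech recognition terminal 2 (step S110).
The terminal-side receiving unit 14 of the speech recognition terminal 2 receives, via the network 5, the acoustic model transmitted by the server-side transmission unit 24 (step S111).
Next, the terminal-side matching unit 17 matches the speech feature values output by the terminal-side acoustic analysis unit 11 against the acoustic model (step S112). Here, the candidate with the highest likelihood between the standard patterns stored as the acoustic model and the time series of speech feature values is taken as the recognition result 4. For example, pattern matching by DP (Dynamic Programming) matching is performed, and the candidate with the smallest distance value is taken as the recognition result 4.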
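As an illustration of the DP matching mentioned for step S112, the following sketch uses a basic dynamic time warping recurrence over feature-vector sequences. It is a simplified stand-in, not the patent's implementation: it assumes one whole-word template per vocabulary item, Euclidean distance between frames, and no path constraints or normalization. The names `dtw_distance` and `recognize` and the toy templates are hypothetical.

```python
import math
from typing import Mapping, Sequence

Frame = Sequence[float]


def dtw_distance(a: Sequence[Frame], b: Sequence[Frame]) -> float:
    """Accumulated frame distance along the best monotonic alignment of a and b."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])   # Euclidean frame distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # stretch the template
                cost[i][j - 1],      # stretch the input
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize(
    features: Sequence[Frame], templates: Mapping[str, Sequence[Frame]]
) -> str:
    """Return the vocabulary item whose template has the smallest DP distance."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


# Toy one-dimensional features; real features would be e.g. LPC cepstrum vectors.
templates = {
    "yes": [[0.0], [1.0], [2.0]],
    "no": [[5.0], [5.0], [5.0]],
}
result = recognize([[0.0], [0.0], [1.0], [2.0]], templates)
```

The input has one more frame than the "yes" template, and the alignment absorbs the repeated first frame, which is the property that makes DP matching tolerant of differences in speaking rate.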
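As described above, with the speech recognition terminal 2 and server 6 according to Embodiment 1, even when the speech recognition terminal 2 can store only a small number of acoustic models, the sound collection environment of the microphone 1 is obtained by the sensor 12, and an acoustic model trained under environmental conditions close to that environment is selected from the many acoustic models stored in the speech recognition server 6 and used for speech recognition.
The speech recognition terminal 2 therefore does not need large-capacity storage elements, circuits, or storage media, which simplifies the device configuration and makes it possible to provide, at low cost, a speech recognition terminal that performs highly accurate speech recognition. As noted earlier, the data size of a single acoustic model can be on the order of several hundred kilobytes, depending on the implementation. Reducing the number of acoustic models that the speech recognition terminal must store therefore has a large effect.
Sensor information can take continuous values, but normally several values are selected from the continuous range and acoustic models are trained with those values as their sensor information. Suppose the sensor 12 consists of multiple types of sensors (a first sensor and a second sensor), and let M1 be the number of values selected as sensor information for the first sensor and M2 the number of values selected as sensor information for the second sensor across the acoustic models stored in the speech recognition terminal 2 and the speech recognition server 6. The total number of acoustic models stored in the speech recognition terminal 2 and the speech recognition server 6 is then M1 × M2.
In this case, if M1 < M2, that is, if fewer values were selected as sensor information for the first sensor than for the second sensor, making the weight coefficient for the sensor information of the first sensor smaller than that for the second sensor allows an acoustic model suited to the sound collection environment of the microphone 1 to be selected.
In the above description, the speech recognition terminal 2 is provided with the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16, and speech recognition is performed by choosing appropriately between the acoustic models stored in the speech recognition terminal 2 and those stored in the speech recognition server 6. Providing the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16 in the speech recognition terminal 2 is not essential, however. A configuration is also possible in which an acoustic model stored in the speech recognition server 6 is always transferred based on the sensor information acquired by the sensor 12. Such a configuration still reduces the storage capacity of the speech recognition terminal 2 while selecting an acoustic model that fits the sound collection environment of the microphone 1 as detected by the sensor 12 and performing highly accurate speech recognition, so the features of the present invention are not lost.
In addition to the configuration described above, the acoustic model received from the speech recognition server 6 may be newly stored in the terminal-side acoustic model storage unit 15, or stored in place of one of the acoustic models on the speech recognition terminal 2. The next time the same acoustic model is used for speech recognition, it then does not have to be transferred again from the speech recognition server 6, which reduces the transmission load on the network 5 and shortens the time required for transmission and reception.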
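Storing a received acoustic model on the terminal, possibly in place of an existing one, can be sketched as a small fixed-capacity store. The text does not say which stored model should be replaced; the sketch below assumes a least-recently-used policy purely for illustration. The class name `TerminalModelStore`, the key format, and the byte-string models are all hypothetical.

```python
from collections import OrderedDict
from typing import Hashable, Optional


class TerminalModelStore:
    """Fixed-capacity store for acoustic models, evicting the least recently used."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._models: "OrderedDict[Hashable, bytes]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[bytes]:
        """Return the stored model for key and mark it as recently used."""
        if key not in self._models:
            return None
        self._models.move_to_end(key)
        return self._models[key]

    def put(self, key: Hashable, model: bytes) -> None:
        """Store a model received from the server, replacing an old one if full."""
        if key in self._models:
            self._models.move_to_end(key)
        elif len(self._models) >= self.capacity:
            self._models.popitem(last=False)  # drop the least recently used model
        self._models[key] = model


# Keys are the training-condition sensor values, e.g. (speed in km/h,).
store = TerminalModelStore(capacity=2)
store.put((40.0,), b"model-40")
store.put((50.0,), b"model-50")
store.get((40.0,))               # the 40 km/h model is now the most recently used
store.put((60.0,), b"model-60")  # replaces the 50 km/h model
```

After the last call the store holds the 40 km/h and 60 km/h models, so a later request under either condition would not need another transfer over the network.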
The voice recognition terminal 2 is connected to the voice recognition server 6 via the network 5. The network 5 is a network network for communicating digital information, such as the Internet, a LAN (Local Area Network), a public line network, a mobile phone network, and a communication network using an artificial satellite. However, the network 5 only needs to be able to transmit and receive digital data between devices connected to the network as a result, and does not ask the format of information transmitted on the network 5. . Therefore, for example, a bus designed to connect a plurality of devices such as USB (Universal Serial Bus) or SCSI (Small Computer Systems Interface) may be used. When the voice recognition terminal 2 is an in-vehicle voice recognition device, the network 5 uses a data communication service of mobile communication. The data communication service uses a communication method in which data to be transmitted / received is divided into units called packets and transmitted / received one by one. In the packet, in addition to the data that the transmitting device intends to transmit to the receiving device, information (transmission destination address) that identifies the receiving device for identifying the receiving device, and which part of the entire data the packet is Control information such as position information and error correction code indicating whether or not to constitute the information is added.
The voice recognition server 6 is a server computer configured to be connected to the voice recognition terminal 2 via the network 5. The speech recognition server 6 has a storage device such as a hard disk device or a memory having a larger storage capacity than the speech recognition terminal 2, and stores a standard pattern necessary for speech recognition. A plurality of voice recognition terminals 2 are connected to the voice recognition server 6 via the network 5.
Next, a detailed configuration of the voice recognition terminal 2 will be described. The voice recognition terminal 2 includes a terminal side acoustic analysis unit 11 and a sensor 12, a terminal side transmission unit 13, a terminal side reception unit 14, a terminal side acoustic model storage unit 15, a terminal side acoustic model selection unit 16, and a terminal side verification unit 17. I have.
The terminal-side acoustic analysis unit 11 is a part that performs acoustic analysis based on a voice signal input from the input terminal 3 and calculates a voice feature amount.
The sensor 12 is a sensor that detects environmental conditions for the purpose of obtaining information on the type of noise to be superimposed on the audio signal acquired by the microphone 1, and is a physical quantity in the environment in which the microphone 1 is installed and its change. An element or device that detects or acquires a quantity. However, not only that, but also an element or device that converts the detected amount into an appropriate signal and outputs it. In addition, the physical quantity here includes time, electromagnetic waves, etc. in addition to temperature, pressure, flow rate, light, and magnetism. Thus, for example, a GPS antenna is a sensor for GPS signals. Further, it is not always necessary to detect a physical quantity by acquiring some signal from the outside world. For example, a circuit that acquires the time of a point where a microphone is placed based on a built-in clock is also referred to here. Included in the sensor.
In the following description, these physical quantities are collectively referred to as sensor information. In general, the sensor outputs an analog signal, and the output signal is typically sampled into a digital signal by an A / D converter or element. Therefore, the sensor 12 may include such an A / D converter or element. Furthermore, when a plurality of types of sensors, for example, the voice recognition terminal 2 is a terminal for an in-vehicle navigation system, a speed sensor, a sensor for monitoring the engine speed, a sensor for monitoring the operating status of the wiper, a door glass A plurality of sensors such as a sensor for monitoring the opening / closing state and a sensor for monitoring the volume of the car audio may be combined.
The terminal-side transmitter 13 is a part that transmits sensor information in the vicinity of the microphone 1 obtained by the sensor 12 to the voice recognition server 6.
The terminal-side receiving unit 14 is a part that receives information from the voice recognition server 6, and outputs the received information to the terminal-side acoustic model selecting unit 16. The terminal-side transmitter 13 and the terminal-side receiver 14 are composed of circuits or elements that send signals to the network cable and receive signals from the network cable. A computer program for controlling these circuits or elements May be included in part of the terminal-side transmitter 13 and the terminal-side receiver 14. Of course, when the network 5 is a wireless communication network, the terminal-side transmitter 13 and the terminal-side receiver 14 are provided with antennas that transmit and receive communication waves. In addition, although the terminal side transmission part 13 and the terminal side receiving part 14 may be comprised as a separate site | part, you may make it comprise with the same network input / output device.
The terminal-side acoustic model storage unit 15 is a storage element or circuit for storing an acoustic model. Here, it is assumed that a plurality of acoustic models can exist according to the learning environment, and only a part of them is stored in the terminal-side acoustic model storage unit 15. Each acoustic model is associated with sensor information representing the environmental condition in which the acoustic model is learned, and an acoustic model suitable for the environmental condition can be identified from the numerical value of the sensor information. For example, when the speech recognition terminal 2 is a vehicle-mounted speech recognition device, an acoustic model created based on a sample uttered in a noisy environment when the vehicle is traveling at 40 km / h, and the vehicle is 50 km / h. An acoustic model created based on a sample uttered in a noisy environment when traveling in a vehicle is prepared. However, as will be described later, since the acoustic model corresponding to various environmental conditions is also stored in the speech recognition server 6, the acoustic model learned under all environmental conditions is stored in the terminal-side acoustic model storage unit 15. There is no need to remember. By adopting such a configuration, the storage capacity of the storage device that must be mounted on the speech recognition terminal 2 can be extremely small.
The terminal-side acoustic model selection unit 16 includes the acoustic model acquired by the terminal-side reception unit 14 (or the acoustic model stored in the terminal-side acoustic model storage unit 15) and the voice feature amount output by the terminal-side acoustic analysis unit 11. This is a part for calculating the likelihood. The terminal side collation unit 17 is a part that selects a vocabulary based on the likelihood calculated by the terminal side acoustic model selection unit 16 and outputs the vocabulary as a recognition result 4.
Among the components of the speech recognition terminal 2, the terminal-side acoustic analysis unit 11, the terminal-side transmission unit 13, the terminal-side reception unit 14, the terminal-side acoustic model storage unit 15, the terminal-side acoustic model selection unit 16, and the terminal-side verification Although each of the units 17 may be configured by a dedicated circuit, a computer program that causes a central processing unit (CPU), a network I / O device (such as a network adapter device), and a storage device to execute processing corresponding to each function You may make it comprise as.
Next, the detailed configuration of the voice recognition server 6 will be described. The voice recognition server 6 includes a server-side receiving unit 21, a server-side acoustic model storage unit 22, a server-side acoustic model selection unit 23, and a server-side transmission unit 24. The server-side receiving unit 21 is a part that receives sensor information transmitted from the terminal-side transmitting unit 13 of the voice recognition terminal 2 via the network 5.
The server-side acoustic model storage unit 22 is a storage device for storing a plurality of acoustic models. The server-side acoustic model storage unit 22 is configured as a hard disk device or a mass storage device such as a combination of a CD-ROM medium and a CD-ROM drive.
Unlike the terminal-side acoustic model storage unit 15, the server-side acoustic model storage unit 22 stores all acoustic models that may be used in this speech recognition system, and has sufficient storage capacity to do so. It shall have.
The server-side acoustic model selection unit 23 is a part that selects an acoustic model suitable for the sensor information received by the server-side receiving unit 21 from the acoustic models stored in the server-side acoustic model storage unit 22.
The server-side transmitter 24 is a part that transmits the acoustic model selected by the server-side acoustic model selector 23 to the voice recognition terminal 2 via the network 5.
Of the components of the speech recognition server 6, the server-side receiving unit 21, the server-side acoustic model storage unit 22, the server-side acoustic model selecting unit 23, and the server-side transmitting unit 24 may be configured by dedicated circuits. However, you may make it comprise as a computer program which makes a central processing unit (CPU), a network I / O apparatus (network adapter apparatus etc.), and a memory | storage device perform the process corresponded to each function.
Next, the operation of the speech recognition terminal 2 and the speech recognition server 6 will be described with reference to the drawings. FIG. 2 is a flowchart showing the processing performed by the speech recognition terminal 2 and the speech recognition server 6 according to the first embodiment. In the figure, when a user inputs speech through the microphone 1 (step S101), a speech signal is input to the terminal-side acoustic analysis unit 11 via the input terminal 3. The terminal-side acoustic analysis unit 11 then converts the speech signal into a digital signal with an A/D converter and calculates a time series of speech feature values such as the LPC cepstrum (Linear Predictive Coding cepstrum) (step S102).
Next, the sensor 12 acquires a physical quantity around the microphone 1 (step S103). For example, when the speech recognition terminal 2 is a car navigation system and the sensor 12 is a speed sensor that detects the speed of the vehicle in which the car navigation system is mounted, the speed is such a physical quantity. In FIG. 2, sensor information is collected in step S103 after the acoustic analysis in step S102. However, step S103 may equally be performed before steps S101 to S102, or simultaneously or in parallel with them.
Subsequently, the terminal-side acoustic model selection unit 16 selects the acoustic model trained under the conditions closest to the sensor information obtained by the sensor 12, that is, closest to the environment in which the microphone 1 collects sound. Many environmental conditions are conceivable for acoustic models, and the terminal-side acoustic model storage unit 15 does not store models for all of them. Therefore, if none of the acoustic models currently stored in the terminal-side acoustic model storage unit 15 was trained under environmental conditions close to the current environmental conditions of the microphone 1, an acoustic model is acquired from the speech recognition server 6.
Before the processing is explained, some terms and notation are defined. The terminal-side acoustic model storage unit 15 stores M acoustic models, each referred to as acoustic model m (where m = 1, 2, ..., M). The sensor 12 consists of K sensors, each referred to as sensor k (where k = 1, 2, ..., K). The sensor information for sensor k under the conditions in which acoustic model m was trained is referred to simply as the “sensor information of acoustic model m” and is written $S_{m,k}$. The current sensor information of sensor k (the sensor information output in step S103) is written $x_k$.
The processing is now described more specifically. First, the terminal-side acoustic model selection unit 16 calculates a distance value $D(m)$ between the sensor information $S_{m,k}$ of acoustic model m and the sensor information $x_k$ acquired by the sensor 12 (step S104). Let $D_k(x_k, S_{m,k})$ denote the distance value, for sensor k, between the current sensor information $x_k$ and the sensor information $S_{m,k}$ of acoustic model m. As a concrete choice for $D_k(x_k, S_{m,k})$, the absolute value of the difference in sensor information may be used, for example. That is, if the sensor information is speed, the difference (10 km/h) between the speed at training time (e.g. $S_{m,k}$ = 40 km/h) and the current speed (e.g. $x_k$ = 50 km/h) is taken as the distance value $D_k(x_k, S_{m,k})$.
The distance value $D(m)$ is calculated from the per-sensor distance values $D_k(x_k, S_{m,k})$ as follows.

$$D(m) = \sum_{k=1}^{K} w_k \, D_k(x_k, S_{m,k}) \qquad (1)$$

Here, $w_k$ is a weighting factor for sensor k.
Here, the relationship between the sensor information as a physical quantity and the distance value $D(m)$ is described. Sensor information has different physical dimensions depending on whether it is, for example, a position (which may be given by longitude and latitude, or by the distance from a specific place taken as the origin) or a speed. However, the contribution of each term $w_k D_k(x_k, S_{m,k})$ to the distance value can be set appropriately by adjusting the weighting factor $w_k$, so ignoring the difference in dimensions causes no problem. The same applies when unit systems differ. For example, the same physical speed yields different sensor values depending on whether km/h or mph is used as the unit. In such a case, the effect of speed on the distance calculation can be made equal by, for example, giving speed values expressed in km/h a weighting factor of 1.0 and speed values expressed in mph a weighting factor of 1.6, since 1 mph is approximately 1.6 km/h.
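The weighted distance described above can be sketched as follows. This is a minimal illustration, assuming that each per-sensor distance is the absolute difference used in the speed example; the function name `sensor_distance`, the second sensor (engine speed in rpm), and all numeric values are hypothetical and do not come from the specification.

```python
def sensor_distance(x, s_m, weights):
    """Weighted distance D(m) between current readings x and the
    training-time sensor information s_m of one acoustic model."""
    if not (len(x) == len(s_m) == len(weights)):
        raise ValueError("x, s_m and weights must all have length K")
    # D_k is taken as the absolute difference (an assumption that follows
    # the speed example in the text); w_k scales each sensor's contribution.
    return sum(w * abs(xk - sk) for w, xk, sk in zip(weights, x, s_m))


# Hypothetical readings: sensor 1 = speed in km/h, sensor 2 = engine speed in rpm.
current = [50.0, 2000.0]
trained = [40.0, 2500.0]
weights = [1.0, 0.01]  # illustrative values only

d = sensor_distance(current, trained, weights)
```

With these illustrative weights, a 10 km/h speed difference counts twice as much as a 500 rpm difference, which shows how the weights absorb differences in dimension and scale.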
Next, the terminal-side acoustic model selection unit 16 obtains the minimum value $\min\{D(m)\}$ of the distance values $D(m)$ calculated for each m by Equation (1), and evaluates whether this minimum is smaller than a predetermined value T (step S105). That is, it checks whether the environmental conditions of any of the acoustic models stored in the terminal-side acoustic model storage unit 15 are sufficiently close to the current environmental conditions in which the microphone 1 collects sound. The predetermined value T is set in advance for this test.
When $\min\{D(m)\}$ is smaller than T (step S105: Yes), the process proceeds to step S106. The terminal-side acoustic model selection unit 16 selects the acoustic model m that gives this minimum as the acoustic model suited to the current environment in which the microphone 1 collects sound (step S106). The process then proceeds to the collation process (step S112), which is described later.
If $\min\{D(m)\}$ is equal to or greater than T (step S105: No), the process proceeds to step S107. In this case, none of the acoustic models stored in the terminal-side acoustic model storage unit 15 was trained under environmental conditions sufficiently close to the current environmental conditions in which the microphone 1 collects sound. Therefore, the terminal-side transmission unit 13 transmits the sensor information to the speech recognition server 6 (step S107).
If the predetermined value T is increased, $\min\{D(m)\}$ is judged to be smaller than T more often, and step S107 is executed less often. In other words, a larger T reduces the number of transmissions and receptions over the network 5 and therefore the amount of network traffic.
Conversely, if T is decreased, the number of transmissions and receptions over the network 5 increases. In this case, however, speech recognition is performed with an acoustic model whose training conditions are closer to the sensor information acquired by the sensor 12, so recognition accuracy can be improved. The value of T should therefore be chosen by weighing the traffic on the network 5 against the target recognition accuracy.
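The decision in steps S104 to S107 can be sketched as follows. This is an illustration only: the function name `select_local_model`, the dictionary layout of the local models, and the use of the absolute difference as the per-sensor distance are assumptions made for the example.

```python
def select_local_model(x, local_models, weights, threshold):
    """Return (model_id, distance) for the nearest local model, or
    (None, distance) when even the nearest one is not below the threshold.

    local_models maps a model id to the sensor information under which
    that model was trained (a list of K values); it must not be empty.
    """
    def distance(s_m):
        # Weighted sum of absolute differences (assumed form of D(m)).
        return sum(w * abs(xk - sk) for w, xk, sk in zip(weights, x, s_m))

    best_id = min(local_models, key=lambda m: distance(local_models[m]))
    best_d = distance(local_models[best_id])
    if best_d < threshold:
        # Step S105: Yes. Use the local model (step S106).
        return best_id, best_d
    # Step S105: No. The caller sends the sensor information
    # to the server (step S107).
    return None, best_d
```

Raising `threshold` makes the first branch more likely, which corresponds to fewer requests to the server at the cost of a looser match between the model and the environment.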
In the speech recognition server 6, the server-side receiving unit 21 receives the sensor information via the network 5 (step S108). The server-side acoustic model selection unit 23 then calculates, in the same manner as in step S104, the distance value between the sensor information received by the server-side receiving unit 21 and the environmental conditions under which each acoustic model stored in the server-side acoustic model storage unit 22 was trained, and selects the acoustic model that minimizes this distance value (step S109). Subsequently, the server-side transmission unit 24 transmits the acoustic model selected by the server-side acoustic model selection unit 23 to the speech recognition terminal 2 (step S110).
The terminal-side receiving unit 14 of the speech recognition terminal 2 receives, via the network 5, the acoustic model transmitted by the server-side transmission unit 24 (step S111).
Next, the terminal-side collation unit 17 collates the speech feature values output by the terminal-side acoustic analysis unit 11 against the acoustic model (step S112). Here, the candidate with the highest likelihood between the standard patterns stored as the acoustic model and the time series of speech feature values is taken as the recognition result 4. For example, pattern matching based on DP (Dynamic Programming) matching is performed, and the candidate with the smallest distance value is taken as the recognition result 4.
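As an illustration of DP matching, the sketch below computes an accumulated dynamic-time-warping distance between a feature sequence and each stored template and returns the closest vocabulary entry. The symmetric step pattern, the Euclidean local distance, and the names `dp_matching_distance` and `recognize` are assumptions for the example; the specification does not state which variant is used.

```python
import math


def dp_matching_distance(features, template):
    """Accumulated DP-matching distance between two sequences of
    equal-dimension feature vectors."""
    n, m = len(features), len(template)
    inf = float("inf")
    # cost[i][j] = best accumulated distance aligning the first i frames
    # of features with the first j frames of template.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(features[i - 1], template[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # stretch the template frame
                cost[i][j - 1],      # stretch the input frame
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def recognize(features, templates):
    """Return the vocabulary entry whose template is closest to features."""
    return min(templates, key=lambda w: dp_matching_distance(features, templates[w]))
```

Here `templates` stands in for the standard patterns of the selected acoustic model, keyed by vocabulary entry.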
As described above, with the speech recognition terminal 2 and the speech recognition server 6 according to the first embodiment, even when the speech recognition terminal 2 can store only a small number of acoustic models, the sensor 12 captures the sound collection environment of the microphone 1, and speech recognition can be performed with an acoustic model that is selected from the large number of acoustic models stored in the speech recognition server 6 and was trained under environmental conditions close to that sound collection environment.
Therefore, the speech recognition terminal 2 does not need a large-capacity storage element, circuit, or storage medium, which simplifies the device configuration and makes it possible to provide, at low cost, a speech recognition terminal that performs highly accurate speech recognition. As noted earlier, a single acoustic model can be several hundred kilobytes in size, depending on the implementation. Reducing the number of acoustic models that the speech recognition terminal must store therefore has a significant effect.
Sensor information can take continuous values, but in practice a number of values are selected from the continuous range and acoustic models are trained with these values as sensor information. Suppose the sensor 12 consists of several types of sensors (here, a first sensor and a second sensor). If, for the acoustic models stored in the speech recognition terminal 2 and the speech recognition server 6, M1 values are selected as sensor information for the first sensor and M2 values are selected as sensor information for the second sensor, the total number of acoustic models stored in the speech recognition terminal 2 and the speech recognition server 6 is M1 × M2.
In this case, when M1 < M2 holds, that is, when fewer values are selected as sensor information for the first sensor than for the second sensor, an acoustic model that matches the sound collection environment of the microphone 1 can be selected by making the weighting factor for the sensor information of the first sensor smaller than that for the second sensor.
In the description above, the speech recognition terminal 2 includes the terminal-side acoustic model storage unit 15 and the terminal-side acoustic model selection unit 16, and performs speech recognition by selecting, as appropriate, between the acoustic models stored in the speech recognition terminal 2 and those stored in the speech recognition server 6. However, these two units are not essential to the speech recognition terminal 2. That is, the configuration may be such that an acoustic model stored in the speech recognition server 6 is always transferred on the basis of the sensor information acquired by the sensor 12. Even with such a configuration, the feature of the present invention is not impaired: an acoustic model suited to the sound collection environment of the microphone 1, as detected by the sensor 12, can be selected and highly accurate speech recognition can be performed while the storage capacity of the speech recognition terminal 2 is kept small.
Further, in addition to the configuration described above, the acoustic model received from the speech recognition server 6 may be newly stored in the terminal-side acoustic model storage unit 15, or may be stored in place of some of the acoustic models held on the speech recognition terminal 2 side. Then, when the same acoustic model is next used for speech recognition, it does not need to be transferred again from the speech recognition server 6, which reduces the transmission load on the network 5 and shortens the time required for transmission and reception.
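One way to keep received models on the terminal within a fixed capacity is sketched below. The specification only says that a received model may be added or may replace some of the stored models; the least-recently-used replacement policy and the class name `LocalModelStore` are assumptions chosen for illustration.

```python
from collections import OrderedDict


class LocalModelStore:
    """Fixed-capacity store for acoustic models kept on the terminal."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._models = OrderedDict()  # oldest use first, newest use last

    def get(self, model_id):
        """Return the stored model, or None if it must be fetched."""
        if model_id not in self._models:
            return None
        self._models.move_to_end(model_id)  # mark as recently used
        return self._models[model_id]

    def put(self, model_id, model):
        """Store a model received from the server, evicting the least
        recently used model when the store is full (assumed policy)."""
        self._models[model_id] = model
        self._models.move_to_end(model_id)
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)
```

A terminal would call `get` before contacting the server and `put` after receiving a model, so that a repeated request for the same model causes no network transfer.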

According to the speech recognition terminal of the first embodiment, when the speech recognition terminal does not store an acoustic model corresponding to the sensor information, an acoustic model suited to the sensor information is transferred from the speech recognition server.
However, given the data size of a single acoustic model, transferring an entire acoustic model from the speech recognition server to the speech recognition terminal over the network places a heavy load on the network, and the effect of the transfer time on overall processing performance cannot be ignored.
One way to avoid this problem is to design the speech recognition process so that the data size of each acoustic model is as small as possible. If the acoustic model is small, transferring it from the speech recognition server to the speech recognition terminal does not load the network heavily.
Another conceivable method is to cluster mutually similar acoustic models and obtain in advance the differences between acoustic models within the same cluster. Then, when an acoustic model stored in the speech recognition server needs to be transferred, only its difference from an acoustic model stored in the speech recognition terminal is transferred, and the acoustic model of the speech recognition server is synthesized from the acoustic model stored in the speech recognition terminal and the difference. The speech recognition terminal and server according to the second embodiment operate on this principle.
FIG. 3 is a block diagram showing the configuration of the speech recognition terminal and the server according to the second embodiment. In the figure, the acoustic model synthesis unit 18 synthesizes an acoustic model equivalent to an acoustic model stored in the speech recognition server 6 from the content received by the terminal-side receiving unit 14 and an acoustic model stored in the terminal-side acoustic model storage unit 15. The acoustic model difference calculation unit 25 calculates the difference between an acoustic model stored in the terminal-side acoustic model storage unit 15 and an acoustic model stored in the server-side acoustic model storage unit 22. The other parts, which bear the same reference numerals as in FIG. 1, are the same as in the first embodiment, and their description is omitted.
As described above, the speech recognition terminal 2 and the server 6 according to the second embodiment are characterized in that the acoustic models are clustered in advance. The clustering method is therefore described first. Note that the clustering of the acoustic models is completed before the speech recognition terminal 2 and the server 6 perform any speech recognition processing.
An acoustic model represents statistics of the speech feature values of each phone (or phoneme or syllable), obtained from a large amount of speech uttered by many speakers. The statistics consist of a mean vector $\mu = \{\mu(1), \mu(2), \ldots, \mu(K)\}$ and a diagonal covariance vector $\Sigma = \{\sigma(1)^2, \sigma(2)^2, \ldots, \sigma(K)^2\}$. The acoustic model of phone p is therefore written $N_p\{\mu_p, \Sigma_p\}$.
As described below, the acoustic model is clustered by an LBG algorithm improved so as to sequentially divide the maximum VQ distortion cluster. FIG. 4 is a flowchart showing the clustering process of the acoustic model.
First, an initial cluster is created (step S201). Here, one initial cluster is created from all acoustic models that may be used in the speech recognition system. Formula (2) and formula (3) are used to calculate the statistics of the initial cluster r. Here, N represents the number of distributions belonging to the cluster, and K represents the number of dimensions of the speech feature quantity.
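Equations (2) and (3) did not survive extraction. One form consistent with the definitions of N and K in the text, and offered here only as an assumption, is the moment-matched mean and variance of the N diagonal Gaussian distributions in cluster r, where $\mu_n(k)$ and $\sigma_n(k)^2$ denote the statistics of the n-th distribution (the index n is introduced for this illustration):

```latex
\mu_r(k) = \frac{1}{N} \sum_{n=1}^{N} \mu_n(k), \qquad k = 1, \ldots, K

\sigma_r(k)^2 = \frac{1}{N} \sum_{n=1}^{N} \left( \sigma_n(k)^2 + \mu_n(k)^2 \right) - \mu_r(k)^2, \qquad k = 1, \ldots, K
```

Under this form the cluster variance accounts for both the spread within each distribution and the spread of the individual means around the cluster mean. The original equations may differ.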

Next, it is determined whether the required number of clusters has already been obtained by the clustering performed so far (step S202). The required number of clusters is decided when the speech recognition system is designed. Generally speaking, the more clusters there are, the smaller the distance between acoustic models within the same cluster. As a result, the difference data carries less information, and the amount of difference data transmitted and received via the network 5 is reduced. In particular, when the total number of acoustic models stored in the speech recognition terminal 2 and the server 6 is large, it is advisable to use a large number of clusters.
However, simply increasing the number of clusters is not appropriate in every case, for the following reason. In the second embodiment, an acoustic model stored in the speech recognition terminal 2 (hereinafter referred to as a local acoustic model) is combined with a difference to synthesize an acoustic model stored in the speech recognition server 6, or to obtain an acoustic model equivalent to one stored in the speech recognition server 6.
The difference used here is combined with a local acoustic model, so it must have been obtained between that local acoustic model and an acoustic model belonging to the same cluster. Since the acoustic model synthesized from the difference is the one corresponding to the sensor information, the most efficient state is the one in which the acoustic model corresponding to the sensor information and a local acoustic model are classified into the same cluster.
When the number of clusters increases, the number of acoustic models belonging to each cluster decreases, and the acoustic models become fragmented across many clusters. In such a case, the number of acoustic models belonging to the same cluster as a local acoustic model stored in the speech recognition terminal 2 tends to decrease. Furthermore, the probability that the acoustic model corresponding to the sensor information and a local acoustic model stored in the speech recognition terminal 2 belong to the same cluster also decreases.
As a result, situations arise in which no difference can be prepared because the acoustic models belong to different clusters, or in which a difference can be prepared but its data size is not sufficiently small.
For this reason, when the number of local acoustic models cannot be made large, that is, when sufficient storage capacity cannot be secured in the memory, hard disk, or other storage device of the speech recognition terminal 2, it is better not to use a large number of clusters.
If the required number of clusters is two or more, the number of clusters immediately after the initial cluster is created is one, so the process proceeds to step S203 (step S202: No). If multiple clusters have already been obtained by the processing described below and their number is equal to or greater than the required number, the process ends (step S202: Yes).
Next, the cluster with the largest VQ distortion is split (step S203). Here, the cluster $r_{max}$ with the largest VQ distortion (the initial cluster on the first pass through the loop) is split into two clusters, $r_1$ and $r_2$, which increases the number of clusters by one. The statistics of the clusters after the split are calculated by the following equations, where $\Delta(k)$ is a small value predetermined for each dimension of the speech feature values.
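The split equations are missing from the extracted text. Since the text refers to Equations (2), (3), and (8), the lost equations are presumably (4) to (7). The usual LBG-style split, which perturbs the mean of the parent cluster by $\pm\Delta(k)$ and copies its variance, would read as follows; this is an assumed reconstruction, not the confirmed original:

```latex
\mu_{r_1}(k) = \mu_{r_{max}}(k) + \Delta(k)

\mu_{r_2}(k) = \mu_{r_{max}}(k) - \Delta(k)

\sigma_{r_1}(k)^2 = \sigma_{r_{max}}(k)^2

\sigma_{r_2}(k)^2 = \sigma_{r_{max}}(k)^2
```

The two new clusters start almost identical and are pulled apart by the reassignment and update steps that follow.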

Subsequently, the distance value between the statistics of each acoustic model and the statistics of each cluster (all clusters resulting from the split in step S203) is calculated (step S204). Here, a distance is calculated for every pairing of one acoustic model with one of the clusters obtained so far. However, the distance is not recalculated for an acoustic model and cluster pair for which it has already been calculated. To implement this control, a flag marking the acoustic models whose distance has been calculated may be provided for each cluster. As the distance value between the statistics of an acoustic model and those of a cluster, the Bhattacharyya distance defined by Equation (8) is used, for example.

In equation (8), a parameter with 1 as a suffix is an acoustic model statistic, and a parameter with 2 as a suffix is a cluster statistic.
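Equation (8) itself is not legible in the extracted text. The sketch below implements the standard Bhattacharyya distance between two Gaussian distributions with diagonal covariance, on the assumption that Equation (8) has this form; the function name and argument layout are chosen for the example.

```python
import math


def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    mu1/var1 are the per-dimension means and variances of the acoustic
    model (suffix 1); mu2/var2 are those of the cluster (suffix 2).
    All variances must be positive.
    """
    total = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        avg = 0.5 * (v1 + v2)
        total += 0.125 * (m1 - m2) ** 2 / avg              # mean term
        total += 0.5 * math.log(avg / math.sqrt(v1 * v2))  # variance term
    return total
```

The distance is zero for identical distributions and grows both with the separation of the means and with the mismatch of the variances.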
Based on the distance values thus obtained, each acoustic model is assigned to the cluster with the smallest distance value. The distance value between the statistics of an acoustic model and those of a cluster may also be calculated by a method other than Equation (8). Even then, it is desirable, though not essential, to adopt a formula under which acoustic models whose distance values calculated by Equation (1) are close end up in the same cluster.
Next, the codebook of each cluster is updated (step S205). For this purpose, the representative value of the statistics of the acoustic models belonging to the cluster is calculated using Equations (2) and (3). In addition, the distances between the statistics of the acoustic models belonging to the cluster and the representative value are accumulated using Equation (8), and the sum is defined as the VQ distortion of the current cluster.
Subsequently, the evaluation value of the clustering is calculated (step S206). Here, the sum of the VQ distortions of all clusters is used as the evaluation value. Steps S204 to S207 form a loop that is executed multiple times, and the evaluation value calculated in step S206 is retained until the next pass through the loop. The difference between this evaluation value and the one calculated on the previous pass is obtained, and it is determined whether its absolute value is less than a predetermined threshold (step S207). If so, every acoustic model has been assigned to an appropriate cluster among those obtained so far, and the process returns to step S202 (step S207: Yes). If the difference is equal to or greater than the threshold, some acoustic models do not yet belong to an appropriate cluster, and the process returns to step S204 (step S207: No).
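The loop of steps S201 to S207 can be sketched as follows. This is a simplified illustration under several assumptions: each acoustic model is reduced to a single diagonal Gaussian `(means, variances)`, the cluster statistics use moment matching, the split perturbs the mean by a fixed `delta`, and the distance is the diagonal-covariance Bhattacharyya distance. The names, default parameters, and the `max_iter` safeguard are not from the specification.

```python
import math


def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal Gaussians."""
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        avg = 0.5 * (v1 + v2)
        d += 0.125 * (m1 - m2) ** 2 / avg + 0.5 * math.log(avg / math.sqrt(v1 * v2))
    return d


def centroid(models):
    """Moment-matched mean and variance of a group of diagonal Gaussians."""
    n, dims = len(models), len(models[0][0])
    mu = [sum(m[0][k] for m in models) / n for k in range(dims)]
    var = [sum(m[1][k] + m[0][k] ** 2 for m in models) / n - mu[k] ** 2
           for k in range(dims)]
    return mu, var


def cluster_models(models, n_clusters, delta=1e-3, tol=1e-6, max_iter=100):
    """Return (assignment, centroids) after repeated max-distortion splits."""
    centroids = [centroid(models)]                       # step S201
    assign = [0] * len(models)
    distortion = [sum(bhattacharyya(*m, *centroids[0]) for m in models)]
    while len(centroids) < n_clusters:                   # step S202
        worst = max(range(len(centroids)), key=distortion.__getitem__)
        mu, var = centroids[worst]                       # step S203: split
        centroids[worst] = ([v + delta for v in mu], list(var))
        centroids.append(([v - delta for v in mu], list(var)))
        previous = None
        for _ in range(max_iter):
            # Step S204: assign each model to its nearest cluster.
            assign = [
                min(range(len(centroids)),
                    key=lambda c, m=m: bhattacharyya(*m, *centroids[c]))
                for m in models
            ]
            # Step S205: update representatives and per-cluster distortion.
            distortion = []
            for c in range(len(centroids)):
                members = [m for m, a in zip(models, assign) if a == c]
                if members:
                    centroids[c] = centroid(members)
                distortion.append(
                    sum(bhattacharyya(*m, *centroids[c]) for m in members))
            total = sum(distortion)                      # step S206
            if previous is not None and abs(previous - total) < tol:
                break                                    # step S207: Yes
            previous = total
    return assign, centroids
```

Unlike the flowchart, this sketch recomputes every distance on each pass rather than skipping pairs already calculated, which keeps the example short at the cost of some redundant work.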
The above is the clustering process. Next, the speech recognition processing performed by the speech recognition terminal 2 and the server 6 of the second embodiment on the basis of the acoustic models clustered in this way is described with reference to the drawings. FIG. 5 is a flowchart of the operation of the speech recognition terminal 2 and the server 6. In the figure, in steps S101 to S105, as in the first embodiment, speech is input from the microphone 1, acoustic analysis and acquisition of sensor information are performed, and it is then determined whether a local acoustic model suited to the sensor information exists.
If even the local acoustic model with the smallest distance from the sensor information (the number or name identifying this local acoustic model is denoted m) does not have a distance below the predetermined threshold T, the process proceeds to step S208 (step S105: No).
Next, the terminal-side transmission unit 13 transmits the sensor information and the information m identifying the local acoustic model to the speech recognition server 6 (step S208).
The server-side receiving unit 21 receives the sensor information and m (step S209), and the server-side acoustic model selection unit 23 selects the acoustic model best suited to the received sensor information (step S109). It is then determined whether this acoustic model and the local acoustic model m belong to the same cluster (step S210). If they do, the process proceeds to step S211 (step S210: Yes): the acoustic model difference calculation unit 25 calculates the difference between this acoustic model and the local acoustic model m (step S211), and the server-side transmission unit 24 transmits the difference to the speech recognition terminal 2 (step S212).
The difference may be calculated, for example, from the differences in the component values of each dimension of the speech feature values and from offset shifts (differences in the storage position of each element). Techniques for obtaining the difference between two pieces of binary data (such as binary files) are well known and may be used. In addition, since the technique of the second embodiment places no special requirements on the data structure of the acoustic model, the data structure may be designed so that differences are easy to obtain.
On the other hand, if the two models do not belong to the same cluster, the process proceeds directly to step S212 (step S210: No). In this case, the selected acoustic model itself is transmitted instead of a difference (step S212).
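The server-side branch in steps S210 to S212 can be sketched as follows. The example assumes that every acoustic model is a flat list of parameters of the same length and that a difference is a mapping from parameter index to new value; the message layout and the name `build_response` are hypothetical.

```python
def build_response(selected_id, local_id, cluster_of, models):
    """Return a difference when the selected model and the terminal's
    local model share a cluster, otherwise the full selected model.

    cluster_of maps model id -> cluster id; models maps model id -> list
    of parameters. Both layouts are assumptions for this sketch.
    """
    selected = models[selected_id]
    same_cluster = (
        local_id in models
        and local_id in cluster_of
        and cluster_of[selected_id] == cluster_of[local_id]
    )
    if same_cluster:                                     # step S210: Yes
        base = models[local_id]
        # Step S211: keep only the parameters that differ from the base.
        changes = {i: v for i, (v, b) in enumerate(zip(selected, base)) if v != b}
        return {"kind": "diff", "base": local_id, "changes": changes}
    # Step S210: No. Send the whole model (step S212).
    return {"kind": "model", "id": selected_id, "params": list(selected)}
```

When the two models are close, `changes` holds only a few entries, which is what keeps the transferred data small.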
The processing above assumes that the difference is generated relative to the local acoustic model that the speech recognition terminal 2 judged to be best suited to the sensor information (the acoustic model judged in step S105 to have the smallest distance from the sensor information). That is why information on this local acoustic model m is transmitted beforehand in step S208. Alternatively, the speech recognition server 6 may keep track of (or manage) which local acoustic models the speech recognition terminal 2 stores. After selecting an acoustic model close to the sensor information, the server then selects, from the local acoustic models it tracks, one that belongs to the same cluster as the selected acoustic model, and calculates the difference between the two. In this case, the speech recognition terminal 2 must be told which local acoustic model the difference is based on, so in step S212 the speech recognition server 6 also transmits information identifying the local acoustic model used as the basis of the difference.
Next, the terminal-side receiving unit 14 of the speech recognition terminal 2 receives the difference data or the acoustic model (step S213). If the received data is a difference, the acoustic model synthesis unit 18 synthesizes an acoustic model from the difference and the local acoustic model m on which it is based (step S214). The terminal-side collation unit 17 then performs pattern matching between the standard patterns of the acoustic model and the speech feature values, and outputs the recognition candidate with the highest likelihood as the recognition result 4.
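The terminal-side handling in steps S213 and S214 can be sketched as follows. The example assumes a hypothetical message format: a dictionary whose `kind` is either `"diff"` (with the id of the base local model and a mapping from parameter index to new value) or `"model"` (with the full parameter list). Neither the format nor the name `apply_response` comes from the specification.

```python
def apply_response(response, local_models):
    """Return the parameters of the acoustic model to use for collation.

    local_models maps model id -> list of parameters stored on the
    terminal (an assumed layout).
    """
    if response["kind"] == "diff":
        # Step S214: start from the local model the difference is based on
        # and overwrite only the parameters listed in the difference.
        params = list(local_models[response["base"]])
        for index, value in response["changes"].items():
            params[index] = value
        return params
    # A full acoustic model was received; use it as is.
    return list(response["params"])
```

Copying the base list before patching leaves the stored local model unchanged, so it can serve as the base for later differences.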
As apparent from the above, only the difference between the local acoustic model stored in the speech recognition terminal 2 of Example 2 and the acoustic model stored in the speech recognition server 6 is transmitted / received via the network. Therefore, in addition to the effect of the first embodiment that highly accurate speech recognition can be performed based on various acoustic models in accordance with the sound collection environment of the microphone 1 even when the storage capacity of the speech recognition terminal 2 is small, This reduces the load on the network and shortens the time required for data transfer, thereby improving the processing performance.
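The difference transfer and synthesis of steps S212 to S214 can be illustrated with a minimal sketch. The text does not define how an acoustic model or its difference is represented, so the following assumes, purely for illustration, that a model is a mapping from parameter names to lists of values, that both models share the same parameter names, and that the "difference" is the set of parameters whose values differ. The names `make_difference` and `synthesize` are hypothetical.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> list of parameter values.
Model = Dict[str, List[float]]


def make_difference(selected: Model, local: Model) -> Model:
    """Server side: keep only the parameters that differ from the local model."""
    return {
        name: list(values)
        for name, values in selected.items()
        if local.get(name) != values
    }


def synthesize(local: Model, diff: Model) -> Model:
    """Terminal side (cf. step S214): overwrite local parameters with the difference."""
    merged = {name: list(values) for name, values in local.items()}
    for name, values in diff.items():
        merged[name] = list(values)
    return merged


local_m = {"state1.mean": [0.0, 1.0], "state2.mean": [2.0, 3.0]}
selected = {"state1.mean": [0.0, 1.0], "state2.mean": [2.5, 3.0]}
diff = make_difference(selected, local_m)   # only state2.mean is sent
restored = synthesize(local_m, diff)        # equals the selected model
```

Under these assumptions the amount of transmitted data shrinks in proportion to the number of parameters the two models share.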

実施例１及び２による音声認識端末２では、音声認識処理に必要となる音響モデルを記憶していない場合であっても、音声認識サーバ６が記憶する音響モデルを、ネットワーク５を介して受信することにより、マイクロホン１の集音環境に即した音声認識を行うものであった。しかし、音響モデルの送受信に代えて、音声特徴量を送受信するようにしてもよい。実施例３による音声認識端末及びサーバはこのような原理に基づいて動作するものである。
図６は、実施例３による音声認識端末及びサーバの構成を示すブロック図である。図において、図１と同一の符号を付した部位については実施例１と同様であるので、説明を省略する。実施例３においても、音声認識端末２と音声認識サーバ６はネットワーク５を介して接続されている。しかし、音声認識端末２から音声認識サーバ６に対して音声特徴量とセンサ情報が送信されるようになっており、また認識結果７が音声認識サーバ６より出力されるようになっている点で、実施例１と異なる。なお、音声認識サーバ６において、サーバ側照合部２７は、実施例１の端末側照合部１７と同様に音声特徴量と音響モデルとの照合を行う部位である。
次に実施例３における音声認識端末２及び音声認識サーバ６の動作について、図を参照しながら説明する。図７は、実施例２による音声認識端末２と音声認識サーバ６との処理を示したフローチャートである。なおこのフローチャートにおいて、図２と同一の符号を付した処理については実施例１と同様である。そこで以下においては、このフローチャート独自の符号を付した処理を中心に説明を行う。
まず、利用者がマイクロホン１から音声入力を行うと、入力端３を介して音声認識端末２に音声信号が入力され（ステップＳ１０１）、入力された音声信号から音響分析部１１によって音声特徴量の時系列が算出されるとともに（ステップＳ１０２）、センサ１２によってセンサ情報が収集される（ステップＳ１０３）。
次に端末側送信部１３によってセンサ情報と音声特徴量がネットワーク５を介して音声認識サーバ６に転送され（ステップＳ３０１）、サーバ側受信部２１によってセンサ情報と音声特徴量が音声認識サーバ６に取り込まれる（ステップＳ３０２）。音声認識サーバ６のサーバ側音響モデル記憶部２２は、音響モデルを複数のセンサ情報に合わせて予め準備しており、サーバ側音響モデル選択部２３は、サーバ側受信部２１によって取得されたセンサ情報と、各音響モデルのセンサ情報との距離値を式（１）によって算出して、最も距離値の小さい音響モデルを選択する（ステップＳ１０９）。
続いてサーバ側照合部２７は、選択された音響モデルにおける標準パターンとサーバ側受信部２１によって取得された音声特徴量とのパターンマッチングを行って、最も尤度の高い語彙を認識結果７として出力する（ステップＳ３０３）。この処理は、実施例１の照合処理（ステップＳ１１２）と同様であるので、詳細な説明については省略する。
以上のように、実施例３による音声認識端末２およびサーバ６によれば、音声認識端末２において音声特徴量の算出とセンサ情報の取得のみを行い、このセンサ情報に基づいて、音声認識サーバ６に音声特徴が記憶する音響モデルから適切な音響モデルを選択して、音声認識することとした。こうすることで、音声認識端末２に音響モデルを記憶するための記憶装置、あるいは素子又は回路が不要となり、音声認識端末２の構成を簡素化することができる。
また、音声特徴量とセンサ情報のみをネットワーク５を介して音声認識サーバ６に転送するようにしたので、ネットワーク５に伝送負荷をかけずに音声認識を行うことができる。
なお、前述の通り、音響モデルのデータサイズは数百キロバイトに及ぶ場合がある。したがってネットワークの帯域幅が制限されている場合には、音響モデルそのものを送信しようとすると伝送能力の限界に達してしまう場合もある。しかし音声特徴量であれば、せいぜい２０ｋｂｐｓの帯域幅が確保できれば、実時間内に十分転送が可能である。したがって極めてネットワーク負荷が軽いクライアントサーバ側音声認識システムを構築できるとともに、マイクロホン１の集音環境に合わせた高精度な音声認識処理を行うことができる。
なお実施例１とは異なり、実施例３では認識結果７を音声認識端末２から出力するのではなく、音声認識サーバ６から出力する構成とした。例えば音声認識端末２がインターネットを閲覧しており、発話によってＵＲＬ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｉｏｎ）を音声入力し、このＵＲＬから決定されるＷｅｂページを音声認識サーバ６が取得して、音声認識端末２に送信して表示させるような場合は、このような構成で十分である。
しかしながら、実施例１と同じように、音声認識端末２が認識結果を出力するような構成とすることもできる。この場合は、音声認識端末２に端末側受信部、音声認識サーバ６にサーバ側送信部を備えるようにし、照合部２７の出力結果を音声認識サーバ６の送信部からネットワーク５を介して音声認識端末２の受信部に送信し、この受信部から所望の出力先に出力するように構成すればよい。
In the voice recognition terminal 2 according to the first and second embodiments, even when the terminal does not store the acoustic model needed for voice recognition processing, it receives the acoustic model stored in the voice recognition server 6 via the network 5 and thereby performs voice recognition matched to the sound collection environment of the microphone 1. However, voice feature amounts may be transmitted and received instead of acoustic models. The voice recognition terminal and server according to the third embodiment operate on this principle.
FIG. 6 is a block diagram illustrating the configuration of the voice recognition terminal and the server according to the third embodiment. In the figure, the portions denoted by the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and thus the description thereof is omitted. Also in the third embodiment, the voice recognition terminal 2 and the voice recognition server 6 are connected via the network 5. However, the voice feature amount and the sensor information are transmitted from the voice recognition terminal 2 to the voice recognition server 6, and the recognition result 7 is output from the voice recognition server 6. This is different from the first embodiment. In the voice recognition server 6, the server-side matching unit 27 is a part that performs matching between the voice feature quantity and the acoustic model, like the terminal-side matching unit 17 of the first embodiment.
Next, the operations of the voice recognition terminal 2 and the voice recognition server 6 in the third embodiment will be described with reference to the drawings. FIG. 7 is a flowchart showing the processing of the voice recognition terminal 2 and the voice recognition server 6 according to the third embodiment. In this flowchart, processes with the same reference numerals as in FIG. 2 are the same as in the first embodiment, so the following description focuses on the processes with reference numerals unique to this flowchart.
First, when a user speaks into the microphone 1, a voice signal is input to the voice recognition terminal 2 via the input terminal 3 (step S101). The acoustic analysis unit 11 calculates a time series of voice feature amounts from the input voice signal (step S102), and the sensor 12 collects sensor information (step S103).
Next, the terminal-side transmission unit 13 transfers the sensor information and the voice feature amounts to the voice recognition server 6 via the network 5 (step S301), and the server-side receiving unit 21 takes them into the voice recognition server 6 (step S302). The server-side acoustic model storage unit 22 of the voice recognition server 6 holds acoustic models prepared in advance for a plurality of sensor information values. The server-side acoustic model selection unit 23 calculates, using equation (1), the distance value between the sensor information acquired by the server-side receiving unit 21 and the sensor information of each acoustic model, and selects the acoustic model with the smallest distance value (step S109).
Subsequently, the server-side matching unit 27 performs pattern matching between the standard pattern of the selected acoustic model and the voice feature amounts acquired by the server-side receiving unit 21, and outputs the vocabulary with the highest likelihood as the recognition result 7 (step S303). Since this process is the same as the matching process (step S112) of the first embodiment, a detailed description is omitted.
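The server-side selection of step S109 amounts to a nearest-neighbor search over the sensor information attached to each stored acoustic model. The sketch below is illustrative only: equation (1) is not reproduced in this passage, so a plain Euclidean distance is used as a stand-in, and the `AcousticModel` class, its fields, and the example sensor values are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AcousticModel:
    name: str
    sensor_info: List[float]  # sensor values of the environment the model was prepared for


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for equation (1): Euclidean distance between sensor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_model(models: Sequence[AcousticModel],
                 sensor_info: Sequence[float]) -> AcousticModel:
    """Return the stored model whose sensor information is closest (cf. step S109)."""
    return min(models, key=lambda m: distance(m.sensor_info, sensor_info))


models = [
    AcousticModel("quiet_room", [30.0, 0.0]),   # e.g. noise level, vehicle speed
    AcousticModel("highway", [75.0, 100.0]),
]
chosen = select_model(models, [70.0, 90.0])
```

The selected model would then be handed to the matching step together with the received voice feature amounts.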
As described above, with the voice recognition terminal 2 and the server 6 according to the third embodiment, the voice recognition terminal 2 only calculates the voice feature amounts and acquires the sensor information; based on this sensor information, an appropriate acoustic model is selected from the acoustic models stored in the voice recognition server 6 and voice recognition is performed. As a result, the voice recognition terminal 2 no longer needs a storage device, element, or circuit for storing acoustic models, and its configuration can be simplified.
Further, since only the voice feature amount and the sensor information are transferred to the voice recognition server 6 via the network 5, voice recognition can be performed without applying a transmission load to the network 5.
As described above, the data size of an acoustic model may reach several hundred kilobytes. If the network bandwidth is limited, transmitting the acoustic model itself may therefore reach the limit of the transmission capacity. Voice feature amounts, by contrast, can be transferred in real time as long as a bandwidth of about 20 kbps at most can be secured. It is therefore possible to build a client-server voice recognition system with a very light network load while performing highly accurate voice recognition matched to the sound collection environment of the microphone 1.
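A rough calculation illustrates the bandwidth argument. The figures here are assumptions chosen for illustration: a 300-kilobyte acoustic model (the text says only "several hundred kilobytes") sent over the same 20 kbps link that suffices for the feature stream, with kilo taken as 1000 and protocol overhead ignored.

```python
def transfer_seconds(size_bytes: int, bandwidth_bps: int) -> float:
    """Time to send size_bytes over a link of bandwidth_bps, ignoring overhead."""
    return size_bytes * 8 / bandwidth_bps


# Assumed example: 300 kB model over a 20 kbps link.
model_transfer = transfer_seconds(300_000, 20_000)  # 120.0 seconds
```

Under these assumptions the model transfer alone would take about two minutes, whereas the feature stream fits the same link in real time.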
Unlike the first embodiment, the third embodiment outputs the recognition result 7 from the voice recognition server 6 rather than from the voice recognition terminal 2. This configuration is sufficient when, for example, the voice recognition terminal 2 is browsing the Internet, a URL (Uniform Resource Locator) is input by voice, and the voice recognition server 6 acquires the Web page determined from this URL and transmits it to the voice recognition terminal 2 for display.
However, as in the first embodiment, the voice recognition terminal 2 may instead output the recognition result. In this case, the voice recognition terminal 2 is provided with a terminal-side receiving unit and the voice recognition server 6 with a server-side transmission unit; the output of the matching unit 27 is transmitted from the transmission unit of the voice recognition server 6 via the network 5 to the receiving unit of the voice recognition terminal 2, which outputs it to the desired output destination.

実施例１及び２における音響モデルの送受信、実施例３における音声特徴量の送受信に代えて、音声データを送受信する方法も考えられる。実施例４による音声認識端末及びサーバはこのような原理に基づいて動作するものである。
図８は、実施例４による音声認識端末及びサーバの構成を示すブロック図である。図において、図１と同一の符号を付した部位については実施例１と同様であるので、説明を省略する。実施例４においても、音声認識端末２と音声認識サーバ６はネットワーク５を介して接続されている。しかし、音声認識端末２から音声認識サーバ６に対して音声データとセンサ情報が送信されるようになっており、また認識結果７が音声認識サーバ６より出力されるようになっている点で、実施例１と異なる。
音声ディジタル処理部１９は入力端３から入力された音声をディジタルデータに変換する部位であって、Ａ／Ｄ変換器あるいは素子又は回路を備えるものである。さらにＡ／Ｄ変換されたサンプリングデータをネットワーク５を介して伝送するのに適する形式に変換する専用回路、またはこのような専用回路と同等の処理を行うコンピュータプログラムとこのプログラムを実行する中央演算装置をさらに備えるようにしてもよい。また、サーバ側音響分析部２８は音声認識サーバ６上で入力音声から音声特徴量を算出する部位であって、実施例１及び２における端末側音響分析部１１と同様の機能を有する。
次に実施例４における音声認識端末２及び音声認識サーバ６の動作について、図を参照しながら説明する。図９は、実施例１による音声認識端末２と音声認識サーバ６との処理を示したフローチャートである。なおこのフローチャートにおいて、図２と同一の符号を付した処理については実施例１と同様である。そこで以下においては、このフローチャート独自の符号を付した処理を中心に説明を行う。
まず、利用者がマイクロホン１から音声入力を行うと、入力端３を介して音声認識端末２に音声信号が入力され（ステップＳ１０１）、音声ディジタル処理部１９は、ステップＳ１０１で入力された音声信号をＡ／Ｄ変換によってサンプリングする（ステップＳ４０１）。なお、音声ディジタル処理部１９では、音声信号のＡ／Ｄ変換だけでなく、音声データの符号化、あるいは圧縮処理を行うことが望ましいが、このことは必須ではない。具体的な音声の圧縮方法としては、ディジタル方式の公衆有線電話網（ＩＳＤＮなど）で使用されているｕ−ｌａｗ６４ｋｂｐｓＰＣＭ方式（ＰｕｌｓｅＣｏｄｅｄＭｏｄｕｌａｔｉｏｎ、ＩＴＵ−ＴＧ．７１１）や、ＰＨＳで使用されている適応差分符号化ＰＣＭ方式（ＡｄａｐｔｉｖｅＤｉｆｆｅｒｅｎｔｉａｌｅｎｃｏｄｉｎｇＰＣＭ、ＡＤＰＣＭ．ＩＴＵ−ＴＧ．７２６）、携帯電話で使用されているＶＳＥＬＰ方式（ＶｅｃｔｏｒＳｕｍＥｘｃｉｔｅｄＬｉｎｅａｒＰｒｅｄｉｃｔｉｏｎ）、ＣＥＬＰ方式（ＣｏｄｅＥｘｃｉｔｅｄＬｉｎｅａｒＰｒｅｄｉｃｔｉｏｎ）等を適用する。通信網の使用可能帯域幅やトラフィックに応じて、これらの方式のうちのいずれかを選択するとよい。例えば、帯域幅が６４ｋｂｐｓである場合にはｕ−ｌａｗＰＣＭ方式、１６〜４０ｋｂｐｓである場合にはＡＤＰＣＭ方式、１１．２ｋｂｐｓである場合にはＶＳＥＬＰ方式、５．６ｋｂｐｓである場合にはＣＥＬＰ方式が適していると考えられる。ただし他の符号化方式を適用しても、この発明の特徴が失われるわけではない。
次に、センサ１２によってセンサ情報が収集され（ステップＳ１０３）、さらに収集されたセンサ情報と符号化された音声データは、例えば図１０で示すようなデータフォーマットに並べ替えられて、端末側送信部１３によってネットワーク５を介して音声認識サーバ６に転送される（ステップＳ４０２）。
なお、図１０において領域７０１には、音声データの処理時刻を表すフレーム番号が格納される。このフレーム番号は、例えば音声データのサンプリング時刻に基づいて、一意に決定される。ここで、「一意に決定される」という語の意義は、音声認識端末２と音声認識サーバ６との間で調整された相対的な時刻に基づいて決定される場合を含み、この相対的な時刻が異なる場合には、異なるフレーム番号が与えられるようにする、という意味である。あるいは、音声認識端末２と音声認識サーバ６との外部に存在する時計より絶対的な時刻の供給を受け、この時刻に基づいてフレーム番号を一意に決定するようにしてもよい。時刻からフレーム番号を算出するには、例えば年（西暦４桁が望ましい）、月（値域１〜１２で２桁を割り当てる）、日（値域１〜３１で２桁を割り当てる）、時（値域０〜２３で２桁を割り当てる）、分（値域０〜５９で２桁を割り当てる）、秒（値域０〜５９で２桁を割り当てる）、千分の一秒（値域０〜９９９で３桁を割り当てる）の各数値をそれぞれの桁数でパディングし、これらの順に数字列として連結してもよいし、ビット単位で年・月・日・時・分・秒・ミリ秒の各値をパックして一定の値を得るようにしてもよい。
また、図１０のデータフォーマットの領域７０２には、センサ情報の占有するデータサイズが格納される。例えばセンサ情報が３２ビット値であるならば、センサ情報を格納するのに必要な領域の大きさ（４バイト）をバイトで表現して４が格納される。センサ１２が複数個のセンサから構成される場合には、それぞれのセンサ情報を格納するのに必要となる配列領域のデータサイズが格納されることになる。さらに領域７０３には、ステップＳ１０３においてセンサ１２によって取得されたセンサ情報が格納される領域である。センサ１２が複数個のセンサから構成される場合は、領域７０３にセンサ情報の配列が格納される。また領域７０３のデータサイズは、領域７０２に保持されたデータサイズと一致する。
領域７０４には音声データサイズが格納される。なお、送信部１３は音声データを複数のパケット（その構造は図７で示されるデータフォーマットと等しいものとする）に分割して送信する場合がある。その場合、領域７０４に格納されるのは、それぞれのパケットに含まれる音声データのデータサイズである。複数のパケットに分割する場合については、後に再び述べることにする。続いて領域７０５には音声データが格納される。
ネットワーク５の特性から、パケットサイズの上限が定められている場合には、端末側送信部１３は入力端３を介して入力された音声データを複数のパケットに分割する。図７のデータフォーマットにおいて、領域７０１に格納されるフレーム番号は、その音声データの処理時刻を表す情報であり、このフレーム番号は、それぞれのパケットに含まれる音声データのサンプリング時刻に基づいて決定される。さらにすでに述べたように、領域７０４にそれぞれのパケットに含まれる音声データのデータサイズを格納する。またセンサ１２を構成するセンサの出力結果が短時間の間に刻々と変化する性質を有する場合には、領域７０３に格納されるセンサ情報もパケット間で異なることになる。例えば音声認識端末２が車載用音声認識装置であり、センサ１２が背景重畳雑音の大きさを取得するセンサ（マイクロホン１とは別のマイクロホンなど）の場合、話者の発話の最中に自動車がトンネルを出入りすると、背景重畳雑音の大きさは著しく異なることになる。このような場合に、図１０のデータフォーマットによるパケットを送信することで、発話の途中であってもセンサ情報を適切に反映させることが可能となる。そのために端末側送信部１３は、発話の最中にセンサ情報が大きく変化した場合に、ネットワーク５の特性とは関係なく、センサ情報が変化した時点で音声データを分割し、異なるセンサ情報を格納したパケットを送信するのが望ましい。
引き続き、音声認識端末２及び音声認識サーバ６の動作を説明する。サーバ側受信部２１によってセンサ情報と音声データ音声認識サーバ６に取り込まれる（ステップＳ４０３）。サーバ側音響分析部２８は、取り込まれた音声データを音響分析して、音声特徴量の時系列を算出する（ステップＳ４０４）。さらにサーバ側音響モデル選択部２３は、取得したセンサ情報に基づいて、最も適切な音響モデルを選択し（ステップＳ１０９）、サーバ側照合部２６はこの音響モデルの標準パターンと音声特徴量とを照合する（ステップＳ４０５）。
以上より明らかなように、この実施例４では、音声認識端末２がセンサ情報と音声データを音声認識サーバ６に転送することとしたので、音声認識端末２側で音響分析を行うことなく、集音環境に適した音響モデルに基づいて高精度な音声認識処理を行うことができる。
したがって、音声認識端末２に音声認識のための特別な部品や回路、コンピュータプログラムなどを設けなくても音声認識機能を実現することができる。
また実施例４によれば、フレーム毎にセンサ情報を送信するようにしたので、発話中にマイクロホン１が集音する環境条件が急激に変化した場合であっても、フレーム毎に適切な音響モデルを選択して、音声認識を行うことができる。
なお、音声認識端末２からの送信を複数のフレームに分割するという方法は、実施例３の音声特徴量の送信にも適用できる。すなわち、音声特徴量は時系列成分を有するから、フレームに分割する場合には、その時系列順にフレーム分割するとよい。またそれぞれのフレームに、その時系列の時刻におけるセンサ情報を実施例４と同様に格納し、音声認識サーバ６側で、各フレームに含まれる最新のセンサ情報に基づいて最適な音響モデルを選択するようにすれば、さらに音声認識の精度を向上させることができる。
Instead of transmitting and receiving the acoustic model as in the first and second embodiments, or the voice feature amounts as in the third embodiment, a method of transmitting and receiving voice data is also conceivable. The voice recognition terminal and server according to the fourth embodiment operate on this principle.
FIG. 8 is a block diagram illustrating the configuration of the voice recognition terminal and the server according to the fourth embodiment. In the figure, the portions denoted by the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and their description is omitted. Also in the fourth embodiment, the voice recognition terminal 2 and the voice recognition server 6 are connected via the network 5. However, this embodiment differs from the first embodiment in that voice data and sensor information are transmitted from the voice recognition terminal 2 to the voice recognition server 6, and the recognition result 7 is output from the voice recognition server 6.
The voice digital processing unit 19 converts the voice input from the input terminal 3 into digital data, and includes an A/D converter or an equivalent element or circuit. It may further include a dedicated circuit that converts the A/D-converted sampling data into a format suitable for transmission via the network 5, or a computer program that performs processing equivalent to such a dedicated circuit together with a central processing unit that executes the program. The server-side acoustic analysis unit 28 calculates voice feature amounts from the input voice on the voice recognition server 6, and has the same function as the terminal-side acoustic analysis unit 11 in the first and second embodiments.
Next, the operations of the voice recognition terminal 2 and the voice recognition server 6 in the fourth embodiment will be described with reference to the drawings. FIG. 9 is a flowchart showing the processing of the voice recognition terminal 2 and the voice recognition server 6 according to the fourth embodiment. In this flowchart, processes with the same reference numerals as in FIG. 2 are the same as in the first embodiment, so the following description focuses on the processes with reference numerals unique to this flowchart.
First, when the user speaks into the microphone 1, a voice signal is input to the voice recognition terminal 2 via the input terminal 3 (step S101), and the voice digital processing unit 19 samples the voice signal input in step S101 by A/D conversion (step S401). It is desirable, though not essential, that the voice digital processing unit 19 also encode or compress the voice data in addition to performing A/D conversion. Specific voice compression methods include the u-law 64 kbps PCM method (Pulse Code Modulation, ITU-T G.711) used in digital public wired telephone networks (ISDN and the like), the adaptive differential encoding PCM method (Adaptive Differential PCM, ADPCM, ITU-T G.726) used in PHS, and the VSELP method (Vector Sum Excited Linear Prediction) and CELP method (Code Excited Linear Prediction) used in mobile phones. One of these methods may be selected according to the available bandwidth and traffic of the communication network. For example, the u-law PCM method is considered suitable when the bandwidth is 64 kbps, the ADPCM method when it is 16 to 40 kbps, the VSELP method when it is 11.2 kbps, and the CELP method when it is 5.6 kbps. The features of the present invention are not lost, however, if another encoding method is applied.
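The bandwidth-dependent choice of encoding method can be expressed as a simple lookup. The sketch below takes the bandwidth figures from the example in the text; treating each figure as the minimum bandwidth a method needs, and returning no method below 5.6 kbps, are assumptions made for illustration. The function name `choose_codec` is hypothetical.

```python
from typing import Optional

# (method, minimum bandwidth in kbps), ordered from highest to lowest bit rate.
CODECS = [
    ("u-law PCM (G.711)", 64.0),
    ("ADPCM (G.726)", 16.0),
    ("VSELP", 11.2),
    ("CELP", 5.6),
]


def choose_codec(available_kbps: float) -> Optional[str]:
    """Pick the highest-rate method that fits the available bandwidth."""
    for name, min_kbps in CODECS:
        if available_kbps >= min_kbps:
            return name
    return None  # assumed behavior: no listed method fits


codec_for_isdn = choose_codec(64.0)
codec_for_narrow_link = choose_codec(6.0)
```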
Next, the sensor 12 collects sensor information (step S103). The collected sensor information and the encoded voice data are arranged in a data format such as that shown in FIG. 10, and the terminal-side transmission unit 13 transfers them to the voice recognition server 6 via the network 5 (step S402).
In FIG. 10, the area 701 stores a frame number representing the processing time of the voice data. This frame number is uniquely determined based on, for example, the sampling time of the voice data. Here, "uniquely determined" includes the case where the frame number is determined from a relative time coordinated between the voice recognition terminal 2 and the voice recognition server 6, and means that different relative times yield different frame numbers. Alternatively, an absolute time may be supplied from a clock external to the voice recognition terminal 2 and the voice recognition server 6, and the frame number uniquely determined from this time. To calculate the frame number from the time, the year (preferably the 4-digit Western calendar year), month (range 1 to 12, 2 digits), day (range 1 to 31, 2 digits), hour (range 0 to 23, 2 digits), minute (range 0 to 59, 2 digits), second (range 0 to 59, 2 digits), and millisecond (range 0 to 999, 3 digits) may each be padded to its number of digits and concatenated in this order as a digit string; alternatively, the year, month, day, hour, minute, second, and millisecond values may be packed bit by bit into a single value.
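Both ways of deriving a frame number from a time can be sketched briefly. The digit widths follow the text; the bit widths used for packing (4, 5, 5, 6, 6, and 10 bits below the year) are assumptions, chosen as the smallest widths that hold each range. The function names are hypothetical.

```python
from datetime import datetime


def frame_number_digits(t: datetime) -> str:
    """Concatenate zero-padded year, month, day, hour, minute, second, millisecond."""
    ms = t.microsecond // 1000
    return (f"{t.year:04d}{t.month:02d}{t.day:02d}"
            f"{t.hour:02d}{t.minute:02d}{t.second:02d}{ms:03d}")


def frame_number_packed(t: datetime) -> int:
    """Pack the same fields bit by bit into one integer (assumed bit widths)."""
    fields = (
        (t.month, 4), (t.day, 5), (t.hour, 5),
        (t.minute, 6), (t.second, 6), (t.microsecond // 1000, 10),
    )
    value = t.year
    for field_value, bits in fields:
        value = (value << bits) | field_value
    return value


t = datetime(2003, 7, 29, 9, 5, 3, 42000)
digits = frame_number_digits(t)   # "20030729090503042"
packed = frame_number_packed(t)
```

Because the fields are ordered from most to least significant in both forms, a later time always yields a larger frame number, which satisfies the uniqueness requirement at millisecond resolution.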
Further, the data size occupied by the sensor information is stored in the area 702 of the data format in FIG. For example, if the sensor information is a 32-bit value, the size of the area (4 bytes) required to store the sensor information is expressed in bytes and 4 is stored. When the sensor 12 is composed of a plurality of sensors, the data size of the array area necessary for storing each sensor information is stored. Further, the area 703 is an area for storing sensor information acquired by the sensor 12 in step S103. When the sensor 12 includes a plurality of sensors, an array of sensor information is stored in the area 703. In addition, the data size of the area 703 matches the data size held in the area 702.
The area 704 stores the voice data size. The transmission unit 13 may divide the voice data into a plurality of packets (each with the same structure as the data format shown in FIG. 10) for transmission. In that case, the area 704 stores the size of the voice data contained in each packet. Division into multiple packets is discussed again below. The area 705 then stores the voice data.
When the characteristics of the network 5 impose an upper limit on the packet size, the terminal-side transmission unit 13 divides the voice data input via the input terminal 3 into a plurality of packets. In the data format of FIG. 10, the frame number stored in the area 701 represents the processing time of the voice data, and is determined from the sampling time of the voice data contained in each packet. As already described, the area 704 stores the size of the voice data contained in each packet. If the outputs of the sensors constituting the sensor 12 change from moment to moment over short periods, the sensor information stored in the area 703 also differs between packets. For example, suppose the voice recognition terminal 2 is an in-vehicle voice recognition device and the sensor 12 acquires the level of superimposed background noise (for instance, a microphone separate from the microphone 1). If the vehicle enters or leaves a tunnel while the speaker is talking, the background noise level changes markedly. Transmitting packets in the data format of FIG. 10 makes it possible to reflect the sensor information appropriately even in the middle of an utterance. For this purpose, when the sensor information changes greatly during an utterance, it is desirable that the terminal-side transmission unit 13 divide the voice data at the point where the sensor information changes, regardless of the characteristics of the network 5, and transmit packets storing the different sensor information.
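The packet layout of areas 701 to 705 can be sketched with fixed-width binary fields. The text specifies only the order and meaning of the areas, so the field widths here (an 8-byte frame number, 4-byte size fields, 32-bit signed sensor values, big-endian byte order) are assumptions, as are the function names.

```python
import struct
from typing import List, Tuple


def pack_packet(frame_number: int, sensor_values: List[int], audio: bytes) -> bytes:
    """Lay out areas 701-705: frame number, sensor size, sensor data, audio size, audio."""
    sensor_bytes = struct.pack(f">{len(sensor_values)}i", *sensor_values)
    return (struct.pack(">Q", frame_number)            # area 701
            + struct.pack(">I", len(sensor_bytes))     # area 702
            + sensor_bytes                             # area 703
            + struct.pack(">I", len(audio))            # area 704
            + audio)                                   # area 705


def unpack_packet(data: bytes) -> Tuple[int, List[int], bytes]:
    """Inverse of pack_packet."""
    frame_number, sensor_size = struct.unpack_from(">QI", data, 0)
    offset = 12
    sensor_values = list(struct.unpack_from(f">{sensor_size // 4}i", data, offset))
    offset += sensor_size
    (audio_size,) = struct.unpack_from(">I", data, offset)
    offset += 4
    return frame_number, sensor_values, data[offset:offset + audio_size]


def split_audio(audio: bytes, max_audio_bytes: int) -> List[bytes]:
    """Divide voice data into chunks that respect a packet size limit."""
    return [audio[i:i + max_audio_bytes] for i in range(0, len(audio), max_audio_bytes)]


packet = pack_packet(7, [4, -1], b"abc")
decoded = unpack_packet(packet)
chunks = split_audio(b"abcdefg", 3)
```

Each chunk would be sent in its own packet with the frame number and sensor values current at its sampling time, so a change in sensor information mid-utterance is carried by the following packet.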
Next, operations of the voice recognition terminal 2 and the voice recognition server 6 will be described. The server side receiving unit 21 captures the sensor information and the voice data into the voice recognition server 6 (step S403). The server-side acoustic analysis unit 28 performs acoustic analysis on the captured speech data and calculates a time series of speech feature values (step S404). Furthermore, the server-side acoustic model selection unit 23 selects the most appropriate acoustic model based on the acquired sensor information (step S109), and the server-side matching unit 26 matches the standard pattern of the acoustic model with the voice feature amount. (Step S405).
As is clear from the above, in the fourth embodiment the voice recognition terminal 2 transfers the sensor information and the voice data to the voice recognition server 6, so highly accurate voice recognition based on an acoustic model suited to the sound collection environment can be performed without acoustic analysis on the voice recognition terminal 2 side.
Therefore, the voice recognition function can be realized without providing the voice recognition terminal 2 with special parts, circuits, computer programs, etc. for voice recognition.
In addition, according to the fourth embodiment, sensor information is transmitted for each frame, so even if the environmental conditions in which the microphone 1 collects sound change abruptly during an utterance, an appropriate acoustic model can be selected for each frame and voice recognition performed.
Note that dividing the transmission from the voice recognition terminal 2 into a plurality of frames can also be applied to the transmission of voice feature amounts in the third embodiment. Since the voice feature amounts form a time series, they should be divided into frames in time-series order. If each frame also stores the sensor information at the corresponding time, as in the fourth embodiment, and the voice recognition server 6 selects the optimum acoustic model based on the latest sensor information contained in each frame, the accuracy of voice recognition can be improved further.

実施例１〜４の音声認識システムでは、音声認識端末２の備えるセンサ１２が取得した環境条件に基づいて、音声認識端末２及びサーバ６の記憶する音響モデルを選択することにより、実環境に対応した音声認識処理を行うというものであった。しかし、センサ１２が取得した環境条件だけでなく、インターネットなどから得られる付加情報を組み合わせて、音響モデルを選択する方法も考えられる。実施例５の音声認識システムはこのような特徴を有するものである。
なお、実施例５の特徴は上記のとおり、インターネットから得られる付加情報とセンサ情報とを組み合わせて、音響モデルを選択する、というものなので、実施例１〜４のいずれの音声認識システムと組み合わせることも可能であり、得られる効果についても同じであるが、ここでは例として実施例１の音声認識システムにインターネットから得られる付加情報を組み合わせた場合について説明することにする。
図１１は、実施例５による音声認識システムの構成を示すブロック図である。この図から明らかなとおり、実施例５の音声認識システムは、実施例１の音声認識システムに、インターネット情報取得部２９を付加したものであって、図１と同一の符号を付した構成要素は実施例１と同様であるので、説明を省略する。また、インターネット情報取得部２９は、インターネットを介して付加情報を取得する部位であり、具体的にはｈｔｔｐ（ＨｙｐｅｒＴｅｘｔＴｒａｎｓｆｅｒＰｒｏｔｏｃｏｌ）によってＷｅｂページを取得するインターネットブラウザ相当の機能を有するものである。さらに、実施例５における音声認識サーバ６が記憶している音響モデルでは、その音響モデルを学習した環境条件をセンサ情報と付加情報とで表現するようにしているものとする。
ここで、付加情報とは、例えば気象情報や交通情報である。インターネットには気象情報や交通情報を提供するＷｅｂサイトが存在しており、これらのＷｅｂサイトによれば、各地の気象条件や渋滞情報、工事状況などを入手することができる。
そこで、このような付加情報を利用して、より精度の高い音声認識を行うために、入手できる付加情報にあわせた音響モデルを準備する。例えば、気象情報が付加情報である場合は、豪雨や強風などによって生じる背景雑音の影響を加味して音響モデルが学習される。また例えば交通情報の場合は、道路工事などによって生じる背景雑音の影響を加味して音響モデルが学習される。
次に実施例５による音声認識端末２及びサーバ６の動作について説明する。図１２は、実施例５による音声認識端末２及びサーバ６の動作を示すフローチャートである。図１２のフローチャートと図２のフローチャートとが異なるのは、ステップＳ５０１の有無のみである。そこで、以降では、ステップＳ５０１の処理を中心に説明することとする。
音声認識サーバ６において、センサ情報を受信した後に（ステップＳ１０８）、インターネット情報取得部２９は、音声認識端末２に接続されたマイクロホン１が集音する環境に影響を与える情報をインターネットから収集する（ステップＳ５０１）。例えば、センサ１２にＧＰＳアンテナが備えられている場合、センサ情報には音声認識端末２及びマイクロホン１の存在する位置情報が含まれることになる。そこで、インターネット情報取得部２９は、この位置情報に基づいて音声認識端末２及びマイクロホン１の存在する場所の気象情報や交通情報などの付加情報をインターネットから収集する。
続いて、サーバ側音響モデル選択部２３は、センサ情報と付加情報とに基づいて音響モデルを選択する。具体的には、まず現在の音声認識端末２及びマイクロホン１の存在する場所の付加情報と音響モデルの付加情報が一致しているかどうかが判定される。そして付加情報が一致している音響モデルの中から、次にセンサ情報について、実施例１で示した式（１）に基づいて算出された距離値が最小となる音響モデルを選択する。
以後の処理については実施例１と同様であるので、説明を省略する。
以上から明らかなように、実施例５の音声認識システムによれば、音響モデルを学習した環境条件が、センサ情報だけでは完全に表現できないものであっても、付加情報を用いて表現することができるので、マイクロホン１の集音環境についてより適切な音響モデルを選択することができる。またこの結果として、音声認識精度を向上させることができる、という効果を奏する。
なお上記において、付加情報を入手する方法としてインターネットを経由する方法について説明したが、付加情報を用いる技術的意義は、音声認識の精度を劣化させる環境的諸要因のうち、あくまでもセンサ情報では表現できない要素に基づいて音響モデルを準備することにある。したがって、このような付加情報を入手する方法は、インターネットに限定されるものではなく、例えば、付加情報を提供するための専用システムや専用コンピュータを準備してもよい。
In the voice recognition systems of the first to fourth embodiments, voice recognition suited to the actual environment is performed by selecting an acoustic model stored in the voice recognition terminal 2 or the server 6 based on the environmental conditions acquired by the sensor 12 of the voice recognition terminal 2. However, a method of selecting an acoustic model by combining the environmental conditions acquired by the sensor 12 with additional information obtained from the Internet or the like is also conceivable. The voice recognition system according to the fifth embodiment has this feature.
As described above, the feature of the fifth embodiment is that the acoustic model is selected by combining additional information obtained from the Internet with the sensor information. It can therefore be combined with any of the voice recognition systems of the first to fourth embodiments, with the same effect. Here, as an example, the case where additional information obtained from the Internet is combined with the voice recognition system of the first embodiment is described.
FIG. 11 is a block diagram showing the configuration of the voice recognition system according to the fifth embodiment. As is clear from this figure, the voice recognition system of the fifth embodiment adds an Internet information acquisition unit 29 to the voice recognition system of the first embodiment. Components with the same reference numerals as in FIG. 1 are the same as in the first embodiment, and their description is omitted. The Internet information acquisition unit 29 acquires additional information via the Internet; specifically, it has a function equivalent to an Internet browser that acquires Web pages by HTTP (HyperText Transfer Protocol). Furthermore, for each acoustic model stored in the voice recognition server 6 in the fifth embodiment, the environmental conditions under which the model was trained are expressed by sensor information and additional information.
Here, the additional information is, for example, weather information or traffic information. There are Web sites that provide weather information and traffic information on the Internet. According to these Web sites, it is possible to obtain weather conditions, traffic jam information, construction status, and the like in each region.
Therefore, in order to perform more accurate speech recognition using such additional information, an acoustic model is prepared according to the additional information that can be obtained. For example, when the weather information is additional information, the acoustic model is learned by taking into account the influence of background noise caused by heavy rain or strong winds. For example, in the case of traffic information, an acoustic model is learned in consideration of the influence of background noise caused by road construction.
Next, operations of the voice recognition terminal 2 and the server 6 according to the fifth embodiment will be described. FIG. 12 is a flowchart illustrating operations of the voice recognition terminal 2 and the server 6 according to the fifth embodiment. The flowchart in FIG. 12 differs from the flowchart in FIG. 2 only in the presence or absence of step S501. Therefore, hereinafter, the processing in step S501 will be mainly described.
After receiving the sensor information at the voice recognition server 6 (step S108), the internet information acquisition unit 29 collects information from the internet that affects the environment in which the microphone 1 connected to the voice recognition terminal 2 collects sound ( Step S501). For example, when the sensor 12 is provided with a GPS antenna, the sensor information includes position information where the voice recognition terminal 2 and the microphone 1 exist. Therefore, the Internet information acquisition unit 29 collects additional information such as weather information and traffic information of the place where the voice recognition terminal 2 and the microphone 1 exist based on the position information from the Internet.
Subsequently, the server-side acoustic model selection unit 23 selects an acoustic model based on the sensor information and the additional information. Specifically, first, it is determined whether or not the additional information of the location where the current voice recognition terminal 2 and the microphone 1 exist matches the additional information of the acoustic model. Then, from among acoustic models having the same additional information, an acoustic model having the smallest distance value calculated based on the formula (1) shown in the first embodiment is selected for the sensor information.
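The two-stage selection described above, first filtering by additional information and then minimizing the sensor distance, can be sketched as follows. Equation (1) is not reproduced in this passage, so a Euclidean distance is used as a stand-in; the `AcousticModel` fields, the dictionary form of the additional information, and the choice to return `None` when no model matches are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class AcousticModel:
    name: str
    sensor_info: List[float]
    additional_info: Dict[str, str]  # e.g. {"weather": "rain"}


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in for equation (1): Euclidean distance between sensor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_model(models: Sequence[AcousticModel],
                 sensor_info: Sequence[float],
                 additional_info: Dict[str, str]) -> Optional[AcousticModel]:
    """Keep models whose additional information matches, then take the nearest one."""
    candidates = [m for m in models if m.additional_info == additional_info]
    if not candidates:
        return None  # assumed fallback; the text does not cover this case
    return min(candidates, key=lambda m: distance(m.sensor_info, sensor_info))


models = [
    AcousticModel("rain_city", [60.0], {"weather": "rain"}),
    AcousticModel("clear_city", [61.0], {"weather": "clear"}),
    AcousticModel("rain_highway", [80.0], {"weather": "rain"}),
]
chosen = select_model(models, [62.0], {"weather": "rain"})
```

In this example the model trained in clear weather is nearest by sensor distance alone, but it is excluded because its additional information does not match.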
Since the subsequent processing is the same as that of the first embodiment, description thereof is omitted.
As is clear from the above, according to the voice recognition system of the fifth embodiment, even when the environmental conditions under which an acoustic model was trained cannot be fully expressed by sensor information alone, they can be expressed using additional information, so a more appropriate acoustic model can be selected for the sound collection environment of the microphone 1. As a result, voice recognition accuracy can be improved.
The above describes obtaining additional information via the Internet. However, the technical significance of using additional information lies in preparing acoustic models based on those environmental factors degrading voice recognition accuracy that cannot be expressed by sensor information. The method of obtaining such additional information is therefore not limited to the Internet; for example, a dedicated system or dedicated computer for providing the additional information may be prepared.

Industrial applicability

以上のように、この発明に係る音声認識システム並びに端末及びサーバは、使用する場所が変化しても高精度の音声認識処理を実現するために有用であり、特にカーナビゲーションシステムや携帯電話など、筐体の大きさや重量、価格帯等の制限から、搭載可能な記憶装置の容量が限られた機器に音声認識機能を提供するのに適している。 As described above, the voice recognition system, terminal, and server according to the present invention are useful for realizing highly accurate voice recognition processing even when the place of use changes. They are particularly suitable for providing a voice recognition function to devices such as car navigation systems and mobile phones, whose mountable storage capacity is limited by constraints on housing size, weight, price range, and the like.

Claims

A voice recognition system in which a voice recognition server and a plurality of voice recognition terminals are connected via a network, wherein
The voice recognition terminal comprises:
An input terminal to which an external microphone is connected and which receives the audio signal collected by the external microphone;
Client-side acoustic analysis means for calculating a voice feature amount from the audio signal input from the input terminal;
A sensor for detecting sensor information representing the type of noise superimposed on the audio signal;
Client-side transmission means for transmitting the sensor information to the voice recognition server via the network;
Client-side receiving means for receiving an acoustic model from the voice recognition server; and
Client-side collating means for collating the acoustic model with the voice feature amount, and
The voice recognition server comprises:
Server-side receiving means for receiving the sensor information transmitted by the client-side transmission means;
Server-side acoustic model storage means for storing a plurality of acoustic models;
Server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model that matches the sensor information; and
Server-side transmission means for transmitting the acoustic model selected by the server-side acoustic model selection means to the voice recognition terminal.

A voice recognition system in which a voice recognition server and a plurality of voice recognition terminals are connected via a network, wherein
The voice recognition terminal comprises:
An input terminal to which an external microphone is connected and which receives the audio signal collected by the external microphone;
Client-side acoustic analysis means for calculating a voice feature amount from the audio signal input from the input terminal;
A sensor for detecting sensor information representing the type of noise superimposed on the audio signal; and
Client-side transmission means for transmitting the sensor information and the voice feature amount to the voice recognition server via the network, and
The voice recognition server comprises:
Server-side receiving means for receiving the sensor information and the voice feature amount;
Server-side acoustic model storage means for storing a plurality of acoustic models;
Server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model that matches the sensor information; and
Server-side collating means for collating the acoustic model selected by the server-side acoustic model selection means with the voice feature amount.

A voice recognition system in which a voice recognition server and a plurality of voice recognition terminals are connected via a network, wherein
The voice recognition terminal comprises:
An input terminal to which an external microphone is connected and which receives the audio signal collected by the external microphone;
A sensor for detecting sensor information representing the type of noise superimposed on the audio signal; and
Client-side transmission means for transmitting the sensor information and the audio signal to the voice recognition server via the network, and
The voice recognition server comprises:
Server-side receiving means for receiving the sensor information and the audio signal;
Server-side acoustic analysis means for calculating a voice feature amount from the audio signal;
Server-side acoustic model storage means for storing a plurality of acoustic models;
Server-side acoustic model selection means for selecting, from the plurality of acoustic models, an acoustic model that matches the sensor information; and
Server-side collating means for collating the acoustic model selected by the server-side acoustic model selection means with the voice feature amount.

The voice recognition system according to any one of claims 1 to 3, wherein
The voice recognition server further comprises traffic information acquisition means for acquiring traffic information from the Internet, and
The server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model that matches both the sensor information and the traffic information acquired by the traffic information acquisition means.

The voice recognition system according to any one of claims 1 to 3, wherein
The voice recognition server further comprises weather information acquisition means for acquiring weather information from the Internet, and
The server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model that matches both the sensor information and the weather information acquired by the weather information acquisition means.

A voice recognition terminal comprising:
An input terminal to which an external microphone is connected and which receives the audio signal collected by the external microphone;
Client-side acoustic analysis means for calculating a voice feature amount from the audio signal input from the input terminal;
A sensor for detecting sensor information representing the type of noise superimposed on the audio signal;
Client-side transmission means for transmitting the sensor information via a network to a voice recognition server that selects, from a plurality of acoustic models, an acoustic model that matches the sensor information and transmits the selected acoustic model;
Client-side receiving means for receiving the acoustic model transmitted by the voice recognition server; and
Client-side collating means for collating the acoustic model with the voice feature amount.

A voice recognition server that stores a plurality of acoustic models, selects from the plurality of acoustic models an acoustic model suited to the sound collection environment of each of a plurality of voice recognition terminals, and transmits the selected acoustic model to each voice recognition terminal via a network, the voice recognition server comprising:
Server-side receiving means for receiving sensor information representing the sound collection environment from each voice recognition terminal;
Server-side acoustic model storage means for storing the plurality of acoustic models;
Server-side acoustic model selection means for selecting an acoustic model suited to the sensor information; and
Server-side transmission means for transmitting the acoustic model selected by the server-side acoustic model selection means to each of the voice recognition terminals.

The voice recognition server according to claim 7, further comprising acoustic model difference calculating means for calculating a difference between the acoustic model stored in the voice recognition terminal and the acoustic model selected by the server-side acoustic model selection means, wherein
The server-side transmission means transmits the difference instead of the acoustic model.

The voice recognition server according to claim 8, wherein
The server-side acoustic model storage means stores a plurality of acoustic models clustered in advance based on statistics of the acoustic models, and
The acoustic model difference calculating means calculates a difference between the plurality of clustered acoustic models.

The voice recognition terminal according to claim 6, further comprising:
Local acoustic model storage means for storing some of the plurality of acoustic models stored in the voice recognition server; and
Acoustic model synthesis means for generating an acoustic model that matches the sensor information by adding, to an acoustic model stored in the local acoustic model storage means, the difference between that acoustic model and the acoustic model selected by the voice recognition server as matching the sensor information, wherein
The client-side receiving means receives the difference transmitted from the voice recognition server instead of the acoustic model.
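For illustration only, the difference calculation on the server and the synthesis on the terminal can be sketched as below. This is a minimal sketch under stated assumptions. The claims do not define how an acoustic model is represented or how the difference is computed, so the sketch assumes a model is a dictionary of named parameter vectors and that the difference is an element-wise subtraction. The names `model_difference` and `synthesize` are hypothetical.

```python
from typing import Dict, List

# Assumed representation: parameter name -> parameter vector.
Model = Dict[str, List[float]]


def model_difference(local: Model, selected: Model) -> Model:
    """Server side: keep only the parameters that differ from the terminal's model."""
    diff: Model = {}
    for key, target in selected.items():
        base = local.get(key, [0.0] * len(target))
        delta = [t - b for t, b in zip(target, base)]
        if any(d != 0.0 for d in delta):
            diff[key] = delta
    return diff


def synthesize(local: Model, diff: Model) -> Model:
    """Terminal side: add the received difference to the locally stored model."""
    result = {key: list(values) for key, values in local.items()}
    for key, delta in diff.items():
        base = result.get(key, [0.0] * len(delta))
        result[key] = [b + d for b, d in zip(base, delta)]
    return result
```

Under these assumptions, the amount of data sent over the network shrinks in proportion to how many parameters the local and selected models share.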

A voice recognition server that stores a plurality of acoustic models, receives via a network voice feature amounts of input speech extracted by a plurality of voice recognition terminals, selects from the plurality of acoustic models an acoustic model suited to the sound collection environment of each voice recognition terminal, and recognizes the voice feature amounts using the selected acoustic model, the voice recognition server comprising:
Server-side receiving means for receiving sensor information representing the sound collection environment and the voice feature amount from each voice recognition terminal;
Server-side acoustic model storage means for storing the plurality of acoustic models;
Server-side acoustic model selection means for selecting an acoustic model suited to the sensor information; and
Server-side collating means for collating the voice feature amount with the acoustic model selected by the server-side acoustic model selection means.

A voice recognition terminal comprising:
An input terminal to which an external microphone is connected and which receives the audio signal collected by the external microphone;
Client-side acoustic analysis means for calculating a voice feature amount from the audio signal input from the input terminal;
A sensor for detecting sensor information representing the type of noise superimposed on the audio signal; and
Client-side transmission means for transmitting the sensor information and the voice feature amount via a network to a voice recognition server that selects, from a plurality of acoustic models, an acoustic model that matches the sensor information and performs voice recognition on the received voice feature amount based on the selected acoustic model.

The voice recognition terminal according to claim 12, wherein the client-side transmission means divides the voice feature amount into a plurality of frames in time-series order and transmits each frame together with the sensor information detected by the sensor at the corresponding time in the time series.

The voice recognition server according to claim 11, wherein
The server-side receiving means receives the sensor information and the voice feature amount for each frame,
The server-side acoustic model selection means selects an acoustic model that matches the sensor information for each frame, and
The server-side collating means collates the acoustic model selected for each frame by the server-side acoustic model selection means with the voice feature amount of that frame.
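For illustration only, the frame-by-frame handling can be sketched as below: the terminal attaches the sensor reading taken at each time step to the corresponding feature frame, and the server picks a model for each frame separately. This is a minimal sketch under stated assumptions. The frame layout, the squared-distance matching rule, and the names `Frame`, `package_frames`, and `select_per_frame` are hypothetical, and the collation step itself is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Frame:
    features: Tuple[float, ...]   # voice feature amount for this frame
    sensor: Tuple[float, ...]     # sensor information detected at the same time


def package_frames(features: Sequence[Sequence[float]],
                   sensor_readings: Sequence[Sequence[float]]) -> List[Frame]:
    """Terminal side: pair each feature frame with its sensor reading."""
    if len(features) != len(sensor_readings):
        raise ValueError("one sensor reading is required per frame")
    return [Frame(tuple(f), tuple(s)) for f, s in zip(features, sensor_readings)]


def select_per_frame(frames: Sequence[Frame],
                     model_profiles: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Server side: choose the closest model profile for every frame."""
    def nearest(sensor: Tuple[float, ...]) -> str:
        # Assumed matching rule: smallest squared distance to the profile.
        return min(model_profiles,
                   key=lambda name: sum((x - m) ** 2
                                        for x, m in zip(sensor, model_profiles[name])))
    return [nearest(frame.sensor) for frame in frames]
```

Selecting per frame lets the chosen model follow the environment as it changes within a single utterance, for example when a vehicle speeds up while the user is speaking.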

A voice recognition server that receives voice digital signals from a plurality of voice recognition terminals via a network, selects from a plurality of acoustic models an acoustic model suited to the sound collection environment of each voice recognition terminal, and performs voice recognition on the voice digital signals using the selected acoustic model, the voice recognition server comprising:
Server-side receiving means for receiving sensor information representing the sound collection environment and the voice digital signal from each voice recognition terminal;
Server-side acoustic analysis means for calculating a voice feature amount from the voice digital signal;
Server-side acoustic model storage means for storing the plurality of acoustic models;
Server-side acoustic model selection means for selecting an acoustic model suited to the sensor information; and
Server-side collating means for collating the voice feature amount with the acoustic model selected by the server-side acoustic model selection means.

A voice recognition terminal comprising:
An input terminal to which an external microphone is connected and which receives the audio signal collected by the external microphone;
Voice digital processing means for generating a voice digital signal from the audio signal input from the input terminal;
A sensor for detecting sensor information representing the type of noise superimposed on the audio signal; and
Client-side transmission means for transmitting the sensor information and the voice digital signal via a network to a voice recognition server that selects, from a plurality of acoustic models, an acoustic model that matches the sensor information and performs voice recognition on the received voice digital signal based on the selected acoustic model.

The voice recognition terminal according to claim 16, wherein the client-side transmission means divides the voice digital signal into a plurality of frames in time-series order and transmits each frame together with the sensor information detected by the sensor at the corresponding time in the time series.

The voice recognition server according to claim 15, wherein
The server-side receiving means receives the voice digital signal and the sensor information for each frame,
The server-side acoustic analysis means calculates a voice feature amount for each frame from the voice digital signal,
The server-side acoustic model selection means selects an acoustic model that matches the sensor information for each frame, and
The server-side collating means collates the acoustic model selected for each frame by the server-side acoustic model selection means with the voice feature amount of that frame.

The voice recognition server according to any one of claims 7 to 9, 11, 14, 15, and 18, further comprising traffic information acquisition means for acquiring traffic information from the Internet, wherein
The server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model that matches both the sensor information and the traffic information acquired by the traffic information acquisition means.

The voice recognition server according to any one of claims 7 to 9, 11, 14, 15, and 18, further comprising weather information acquisition means for acquiring weather information from the Internet, wherein
The server-side acoustic model selection means selects, from the plurality of acoustic models, an acoustic model that matches both the sensor information and the weather information acquired by the weather information acquisition means.