JP2005140860A

JP2005140860A - Speech recognizing device and its control method

Info

Publication number: JP2005140860A
Application number: JP2003374817A
Authority: JP
Inventors: Kohei Yamada; 耕平山田; Yasuo Okuya; 泰夫奥谷; Yasuhiro Komori; 康弘小森; Hiroki Yamamoto; 寛樹山本
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2003-11-04
Filing date: 2003-11-04
Publication date: 2005-06-02

Abstract

PROBLEM TO BE SOLVED: To guide a user in fitting a microphone for silent speech recognition at the optimum position. SOLUTION: Speech is input from a microphone worn by the user (S501), and acoustic feature values of the input speech are extracted (S502). Next, an acoustic likelihood is calculated using a previously created acoustic model for identifying the wearing position (S503). Information on the suitability of the microphone's wearing position is then presented using the calculated acoustic likelihood (S504). Based on this presentation, the user can judge whether the microphone is worn at an appropriate position. COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a speech recognition device and a control method thereof, and in particular to a speech recognition device, and a control method thereof, that performs speech recognition using a microphone configured to collect a user's non-audible speech when worn on a predetermined part of the user's body.

A speech recognition technique called silent speech recognition has been proposed (see, for example, Non-Patent Document 1). It targets not the normal speech handled by conventional speech recognition but non-audible speech: speech murmured as if talking to oneself, without vocal cord vibration, that a third party can hardly hear. In silent speech recognition, a special stethoscope-like microphone that adheres to the body surface is worn near the back of the ear. The non-audible speech uttered by the speaker travels through the body to the body surface and is picked up by the special microphone. Because a speech signal collected in this way is not easily affected by ambient environmental sound, good recognition performance can be expected even in real environments. In addition, because silent speech is almost inaudible to people nearby, it also offers excellent confidentiality.

Non-Patent Document 1: Yoshitaka Nakajima et al., "Silent speech recognition by extraction of weak body-conducted sound," Proceedings of the Acoustical Society of Japan 2003 Spring Meeting, Vol. I, pp. 175-176, March 2003.

As described above, the special microphone used for silent speech recognition is attached near the back of the speaker's ear, but the positions at which non-audible speech can be collected well are limited. In other words, the wearing position of the microphone is a very important factor in recognizing non-audible speech, and recognition accuracy depends heavily on it. However, the user will not always wear the microphone at the proper position and may wear it at a displaced position. A mechanism that guides the user to wear the microphone at the optimum position is therefore needed. Alternatively, a technique is needed that prevents the recognition rate from dropping sharply even when the microphone is worn at a position displaced from the proper one.

According to one aspect of the present invention, there is provided, for example, a speech recognition device that performs speech recognition using a microphone configured to collect a user's faint speech, unaccompanied by vocal cord vibration, when worn on a predetermined part of the user's body, the device comprising: an acoustic model created from speech collected from speakers wearing the microphone at the optimum sound collection position; calculation means for calculating an acoustic likelihood, based on the acoustic model, for speech input by a user wearing the microphone; and presentation means for presenting to the user, using the acoustic likelihood calculated by the calculation means, information on the suitability of the current wearing position of the microphone.

According to the present invention, the user can be guided to wear a microphone for silent speech recognition at an appropriate position.

Preferred embodiments of the present invention are described in detail below with reference to the drawings.

(Embodiment 1)
FIG. 1 is a block diagram showing the configuration of the speech recognition device according to this embodiment.

Reference numeral 101 denotes a central processing unit (CPU) that controls the entire device, 102 a control memory (ROM), and 103 a random access memory (RAM). Reference numeral 104 denotes an external storage device such as a hard disk drive or nonvolatile memory, 105 an information input unit such as buttons or a keyboard, 106 an information output unit that displays the state of the device or gives spoken guidance, 107 a speech input unit, and 108 a bus connecting these units. Specifically, the speech input unit 107 is a microphone configured to collect the user's non-audible speech when worn on a predetermined part of the user's body (for example, near the back of the ear). For a speech input unit 107 that targets such non-audible speech, the devices proposed in the paper by Nakajima et al., "Silent speech recognition by extraction of weak body-conducted sound" (Proceedings of the Acoustical Society of Japan 2003 Spring Meeting, 3-Q-12, pp. 175-176), and in Japanese Patent Laid-Open No. 2000-57325 can be used. FIG. 14 shows a configuration example of the sound collecting part of the speech input unit 107. This sound collecting part is configured, for example, so that the vibration of a diaphragm 1 is recorded by a condenser microphone 2. The diaphragm 1 is used by adhering it to the speaker's body surface (for example, behind the ear, near the nape of the neck). Even silently uttered speech produces vibration that travels from inside the body to the body surface, so this configuration can pick it up.

The ROM 102 stores a speech recognition program, as illustrated, and the external storage device 104 stores the various acoustic models described later. The speech recognition program may, however, be installed in the external storage device 104 instead of the ROM 102.

In this device, the CPU 101 executes the speech recognition program stored in the ROM 102, thereby performing speech recognition on the speech data input from the speech input unit 107 and displaying the recognition result on the information output unit 106.

The configuration of the speech recognition device in this embodiment is generally as described above. The individual processes performed by the speech recognition program are described below.

FIG. 2 is a diagram showing the configuration of the modules of the speech recognition program that are involved in the acoustic model creation process in this embodiment.

This acoustic model creation process creates two kinds of acoustic models: one used for speech recognition (the recognition acoustic model) and one used to identify the wearing position of the microphone (the wearing position identification acoustic model). The feature calculation module 201 extracts acoustic features from input speech data. The data storage module 202 records data in the external storage device 104. The wearing position identification acoustic model creation module 203 creates the wearing position identification acoustic model from the features calculated by the feature calculation module 201. The recognition acoustic model creation module 204 creates the recognition acoustic model from the same features. Modules 203 and 204 may create their models by the same method from the same speech data. However, a compact acoustic model that computes the acoustic likelihood for a more restricted set of utterances is well suited to wearing position identification. The recognition acoustic model, on the other hand, depends on the required recognition performance but is preferably acoustically more detailed than the wearing position identification acoustic model.

FIG. 3 is a flowchart showing the flow of the acoustic model creation process in this embodiment.

First, in step S301, speech from unspecified speakers is input. It is assumed here that the microphone is worn at the position where non-audible speech can be collected most effectively (hereinafter, the "optimum sound collection position"). Next, in step S302, the feature calculation module 201 extracts acoustic features from the speech data, and the data storage module 202 stores them in the external storage device 104. Then, in step S303, using the features calculated in step S302, the wearing position identification acoustic model creation module 203 and the recognition acoustic model creation module 204 create the wearing position identification acoustic model and the recognition acoustic model, respectively, and the data storage module 202 stores both models in the external storage device 104.
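Steps S302 and S303 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each acoustic model is a diagonal-covariance Gaussian and that every feature vector carries a speech-unit label; the names `fit_diag_gaussian` and `build_models` are hypothetical and do not come from this document. The compact wearing position identification model is a single Gaussian over all frames, while the more detailed recognition model keeps one Gaussian per speech unit.

```python
from statistics import fmean, pvariance


def fit_diag_gaussian(frames, var_floor=1e-3):
    """Fit a diagonal-covariance Gaussian to a list of feature vectors."""
    dims = list(zip(*frames))  # transpose: one tuple per feature dimension
    means = [fmean(d) for d in dims]
    # Floor the variance so that a unit seen only once stays usable.
    variances = [max(pvariance(d), var_floor) for d in dims]
    return {"mean": means, "var": variances}


def build_models(labeled_frames):
    """labeled_frames: (unit_label, feature_vector) pairs, i.e. the S302 output.

    Returns (position_model, recognition_model) as in step S303.
    """
    all_frames = [frame for _, frame in labeled_frames]
    position_model = fit_diag_gaussian(all_frames)  # compact: one Gaussian

    by_unit = {}
    for unit, frame in labeled_frames:
        by_unit.setdefault(unit, []).append(frame)
    # More detailed: one Gaussian per speech unit.
    recognition_model = {u: fit_diag_gaussian(fs) for u, fs in by_unit.items()}
    return position_model, recognition_model
```

A practical system would more likely use hidden Markov models or Gaussian mixtures. The single-Gaussian choice here only keeps the contrast between a compact model and a detailed one visible.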

Through the above processing, speech is collected from unspecified speakers wearing the microphone at the optimum sound collection position, and two acoustic models are created from it: the wearing position identification acoustic model and the recognition acoustic model.

Next, the microphone wearing position confirmation process in this embodiment is described. This process is executed when the microphone is put on. Its aim is to have the user adjust the microphone's position when the wearing position is inappropriate.

FIG. 4 is a diagram showing the configuration of the modules of the speech recognition program that are involved in the microphone wearing position confirmation process in this embodiment.

The speech input module 401 receives speech data from the worn microphone. The feature calculation module 402 extracts acoustic features from the input speech data. The likelihood calculation module 403 calculates the likelihood using the extracted features and the wearing position identification acoustic model. The acoustic model storage module 404 stores acoustic models in the external storage device 104. The state display module 405 causes the information output unit 106 to present a drawing, a sound, or the like that corresponds to the calculated acoustic likelihood, so that the user can judge whether the wearing position of the microphone is appropriate. The information input processing module 406 accepts the feedback that the user gives to the system through buttons, a keyboard, or the like.

FIG. 5 is a flowchart showing the flow of the microphone wearing position confirmation process in this embodiment.

First, in step S501, the speech input module 401 receives speech input from the microphone worn by the user. The speech to be uttered at this point depends on the wearing position identification acoustic model in use. If that model was created under the same conditions as the recognition acoustic model, no particular utterance need be specified. If it was created under restricted utterance conditions, only the restricted utterances are accepted as input speech. In addition, the same speech must always be input while the wearing position is being identified. In step S501, if there is no speech input, the process remains in the input-waiting state; when speech is input, it proceeds to step S502.

In step S502, the feature calculation module 402 extracts acoustic features of the input speech. Next, in step S503, the likelihood calculation module 403 reads the previously created wearing position identification acoustic model from the external storage device 104 via the acoustic model storage module 404 and calculates the acoustic likelihood using the features extracted in step S502 and the model just read.
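The likelihood calculation of step S503 might look like the following sketch. It assumes the wearing position identification acoustic model is a single diagonal-covariance Gaussian held as a dictionary with `mean` and `var` lists, and it scores an utterance by its average per-frame log-likelihood. Both the model form and the name `avg_log_likelihood` are assumptions made for illustration.

```python
import math


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of feature vectors under a
    diagonal-covariance Gaussian model {"mean": [...], "var": [...]}."""
    total = 0.0
    for frame in frames:
        for x, mu, var in zip(frame, model["mean"], model["var"]):
            # Log density of a one-dimensional Gaussian, summed over dimensions.
            total += -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
    return total / len(frames)
```

Averaging over frames keeps scores comparable between utterances of different lengths, which matters if the user repeats the utterance while moving the microphone.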

In step S504, the state display module 405 presents information on the suitability of the microphone's wearing position, using the acoustic likelihood calculated in step S503. Possible forms include emitting a beep that grows louder as the likelihood increases, or displaying the change in acoustic likelihood over time as a graph on the information output unit 106. For users with no knowledge of acoustic likelihood, the device may instead indicate that the wearing position is appropriate when the acoustic likelihood exceeds a predetermined threshold. Based on such a presentation, the user can judge whether the microphone is worn at an appropriate position.
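The presentation options of step S504 can be sketched as a mapping from the acoustic likelihood to a beep volume and a threshold message. The threshold of -20.0, the floor of -60.0, the linear volume scale, and the name `present_fit` are all illustrative assumptions. The document only says that the beep grows with the likelihood and that a predetermined threshold may be used.

```python
def present_fit(log_likelihood, threshold=-20.0, floor=-60.0):
    """Return (beep_volume, message) for a given acoustic log-likelihood.

    beep_volume rises linearly from 0.0 at `floor` to 1.0 at `threshold`
    and is clamped to the range [0.0, 1.0].
    """
    level = (log_likelihood - floor) / (threshold - floor)
    volume = min(1.0, max(0.0, level))
    if log_likelihood > threshold:
        message = "Microphone position is appropriate"
    else:
        message = "Adjust the microphone position"
    return volume, message
```

In practice the threshold would have to be calibrated on likelihoods measured at the optimum sound collection position.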

Then, in step S505, the information input processing module 406 receives feedback from the user through buttons, a keyboard, or the like. If the feedback asks to readjust the wearing position of the microphone, the process returns to step S501. If the feedback confirms the sound collection position, the wearing position of the microphone is fixed and the process ends.
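The loop formed by steps S501 to S505 can be sketched as follows. The four callables are hypothetical stand-ins for the speech input and feature calculation modules, the likelihood calculation module, the state display module, and the information input processing module. None of these names appear in this document.

```python
def confirm_wearing_position(next_utterance, score, present, user_accepts):
    """Repeat S501-S505 until the user confirms the wearing position.

    next_utterance() -> features, or None while no speech has been input
    score(features)  -> acoustic likelihood (S503)
    present(value)   -> shows or sounds the likelihood to the user (S504)
    user_accepts()   -> True when the user confirms the position (S505)
    Returns (number_of_attempts, last_likelihood).
    """
    attempts = 0
    while True:
        features = next_utterance()   # S501 + S502
        if features is None:          # no speech yet: keep waiting
            continue
        attempts += 1
        likelihood = score(features)  # S503
        present(likelihood)           # S504
        if user_accepts():            # S505
            return attempts, likelihood
```

Passing the steps in as callables keeps the control flow separate from any particular microphone, model, or display.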

FIG. 6 is a flowchart showing the flow of the speech recognition process in this embodiment.

This flow consists of three main steps. The first, step S601, represents the microphone wearing position identification process described above as a single step. After step S601 the microphone is taken to be worn at the optimum sound collection position, and the process proceeds to step S602. In step S602, the recognition acoustic model created by the recognition acoustic model creation module 204 is read from the external storage device 104 via the data storage module 202. Then, in step S603, speech input from the microphone positioned in step S601 is recognized using the acoustic model read in step S602, the result is output, and the process ends.

Thus, according to this embodiment, information on the acoustic likelihood is displayed, based on the input speech, before speech recognition is performed, which gives the user an opportunity to adjust the wearing position of the microphone accordingly.

(Embodiment 2)
In Embodiment 1 above, a single wearing position identification acoustic model is created for the optimum sound collection position, and the user adjusts the wearing position of the microphone on that basis. This embodiment describes a method in which an acoustic model is designed for the microphone worn at the optimum sound collection position and at each of a plurality of surrounding positions, and the direction in which the wearing position should be adjusted is presented to the user on that basis. This embodiment describes the case where a plurality of microphones are worn at the same time to collect sound, but the invention is not limited to this, and the sound may be collected separately at each position.

An outline of the microphone wearing position confirmation process in this embodiment is given below.

FIG. 9A shows the area around a person's neck, where A to E are examples of microphone wearing positions. Position E is the optimum sound collection position. If the user puts on the microphone without any instruction, however, it is likely to end up at a position such as A, B, C, or D rather than E.

In this embodiment, in addition to the acoustic model created from speech data collected by a microphone worn at the optimum sound collection position E, an acoustic model is created in advance for each of the positions A, B, C, and D from speech data collected by a microphone worn at that position. In the following, the acoustic models corresponding to positions A, B, C, D, and E are called acoustic models a, b, c, d, and e, respectively.

When the user actually puts on the microphone and inputs speech, the acoustic likelihood is calculated with each of these acoustic models, and the wearing position of the microphone is estimated by finding the model that gives the maximum acoustic likelihood.

If this estimate shows that the wearing position is not the optimum position E but one of A, B, C, or D, the direction from that position toward E is displayed, prompting the user to move the microphone in that direction.

The direction from the estimated wearing position toward the optimum position E can be determined, for example, by preparing in advance an adjustment direction determination table that associates each acoustic model that may yield the maximum acoustic likelihood with the direction from the corresponding wearing position toward E (that is, the direction of adjustment), and then referring to this table. FIG. 9B shows an example of the structure of such a table. The table lists the acoustic models a, b, c, and d together with the direction from the wearing position corresponding to each model toward the optimum sound collection position E.

FIG. 7 is a diagram showing the configuration of the modules of the speech recognition program that are involved in creating the adjustment direction determination table in this embodiment.

Modules 701 to 704 are the same as modules 201 to 204 in FIG. 2, respectively. The difference from FIG. 2 is the addition of a table creation module 705, which creates an adjustment direction determination table structured, for example, as shown in FIG. 9B and stores it in the external storage device 104 via the data storage module 702.

FIG. 8 is a flowchart showing the flow of the process of creating the adjustment direction determination table in this embodiment.

First, in steps S801 to S803, acoustic models are created for a plurality of wearing positions that deviate from the optimum sound collection position. Steps S801 to S803 correspond to steps S301 to S303 shown in FIG. 3, except that they are carried out independently for each wearing position, producing one acoustic model per position. All of the acoustic models created here are wearing position identification acoustic models. Once the acoustic model for every wearing position has been created, the process proceeds to step S804.

Then, in step S804, the table creation module 705 creates the adjustment direction determination table, which records, for each acoustic model, the direction from the corresponding microphone wearing position toward the optimum sound collection position. Referring to FIG. 9A, for example, A lies diagonally below and to the left of the optimum position E, so the direction of adjustment is diagonally up and to the right. Likewise, B lies diagonally above and to the right of E, so the direction of adjustment is diagonally down and to the left. The table created in this way is stored in the external storage device 104 via the data storage module 702.
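The table creation of step S804 can be sketched by deriving each direction from position coordinates. The coordinates below are hypothetical. Only the placement of A (lower left of E) and B (upper right of E) follows the text, while the placement of C and D is an assumption, since FIG. 9A is not reproduced here. The function names are likewise illustrative.

```python
# Hypothetical 2-D coordinates (x to the right, y upward), keyed by model name.
POSITIONS = {
    "a": (-1.0, -1.0),  # lower left of E (from the text)
    "b": (1.0, 1.0),    # upper right of E (from the text)
    "c": (-1.0, 1.0),   # assumed
    "d": (1.0, -1.0),   # assumed
    "e": (0.0, 0.0),    # optimum sound collection position
}


def direction_label(dx, dy):
    """Describe a displacement (dx, dy) in words, e.g. 'up and right'."""
    horizontal = "right" if dx > 0 else "left" if dx < 0 else ""
    vertical = "up" if dy > 0 else "down" if dy < 0 else ""
    return " and ".join(part for part in (vertical, horizontal) if part)


def build_adjustment_table(positions, optimum="e"):
    """Map each non-optimum model to the direction toward the optimum."""
    ox, oy = positions[optimum]
    return {
        name: direction_label(ox - x, oy - y)
        for name, (x, y) in positions.items()
        if name != optimum
    }
```

With these coordinates, `build_adjustment_table(POSITIONS)` maps model a to "up and right" and model b to "down and left", matching the two examples given for FIG. 9A.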

Next, the microphone wearing position confirmation process in this embodiment is described with reference to the flowchart of FIG. 10. The modules of the speech recognition program involved in this process are the same as those shown in FIG. 4, so their description is omitted here, but the same reference numerals are used.

First, in step S1001, the speech input module 401 receives speech input from the microphone worn by the user. If there is no speech input, the process remains in the input-waiting state; when speech is input, it proceeds to step S1002. In step S1002, the feature calculation module 402 extracts the acoustic features of the input speech.

Next, in step S1003, the likelihood calculation module 403 reads the previously created acoustic models a, b, c, d, and e from the external storage device 104 via the acoustic model storage module 404 and calculates, for the features extracted in step S1002, the acoustic likelihood under each acoustic model.

Next, in step S1004, the acoustic likelihoods obtained with the individual acoustic models in step S1003 are compared to estimate the current wearing position of the microphone. For example, if the acoustic likelihood under acoustic model e is larger than every other acoustic likelihood, the microphone is estimated to be at the optimum position E. In that case the wearing position needs no adjustment, and the process ends. If the acoustic likelihood under acoustic model e is not the maximum, the microphone is estimated to be at one of positions A, B, C, or D, and the process proceeds to step S1005.

In step S1005, the adjustment direction determination table stored in the external storage device 104 is consulted, and the direction in which the microphone should be moved is displayed on the information output unit 106. For example, if acoustic model a has the maximum likelihood, the information output unit 106 displays "Move diagonally up and to the right." Because the direction of adjustment is presented in this way, the user can move the microphone without hesitating over which way to go. When the wearing position of the microphone has been moved as a result of the display in step S1005, the process returns to step S1001 and is repeated.
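Steps S1004 and S1005 amount to an argmax over the per-model likelihoods followed by a table lookup, as in this sketch. The function names, the message wording, and the dictionary-based table are illustrative assumptions.

```python
def estimate_position(likelihoods):
    """S1004: return the model name whose acoustic likelihood is largest."""
    return max(likelihoods, key=likelihoods.get)


def guidance(likelihoods, table, optimum="e"):
    """S1005: return an instruction string, or None if no adjustment is needed.

    likelihoods: {model_name: acoustic likelihood}
    table:       {model_name: direction toward the optimum position}
    """
    position = estimate_position(likelihoods)
    if position == optimum:
        return None
    return "Move the microphone " + table[position]
```

For example, with a table that maps model a to "up and right", `guidance({"a": -10.0, "b": -30.0, "e": -20.0}, table)` yields "Move the microphone up and right", while any input in which model e scores highest yields `None`.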

(Embodiment 3)
In Embodiment 2 above, the adjustment directions that can be presented to the user are limited to those defined in the adjustment direction determination table (for example, four directions). It is also possible, however, to present an arbitrary direction without being bound by the table's definitions.

This embodiment determines a finer-grained display direction from the acoustic likelihoods of the four acoustic models a, b, c, and d calculated in Embodiment 2 (acoustic model e is excluded) and the directions defined in the adjustment direction determination table. For example, the acoustic likelihood of each acoustic model is treated as the magnitude of a vector, and the direction defined for that model in the adjustment direction determination table is treated as the vector's direction. The direction to present as the direction of adjustment can then be expressed as the sum of these vectors. This is formulated as follows.

That is, the display direction vector X to be obtained can be expressed as the sum, over the acoustic models i (i = a, b, c, d), of the direction vector E_i of each model multiplied by its acoustic likelihood P_i and a weighting coefficient ω_i.

The result obtained in this way may be displayed on a display device such as the information output unit 106, or the direction may be presented by voice, for example as an angle.

Furthermore, it is also effective to present the distance the microphone should be moved, according to the magnitude of the display direction vector X.
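The weighted vector sum of this embodiment can be sketched as below. The direction vectors, the coordinate convention (x to the right, y upward), the default weights of 1.0, and all function names are assumptions made for the example. The sketch also assumes that the likelihood scores are non-negative, which raw log-likelihoods generally are not. How the magnitude would be converted into a physical distance is not specified in the text, so only the magnitude is returned.

```python
import math

# Hypothetical direction vectors E_i (unit vectors; x to the right, y upward).
# In the apparatus these would come from the adjustment direction determination table.
_S = 1 / math.sqrt(2)
DIRECTION_VECTORS = {
    "a": (_S, _S),    # up and to the right
    "b": (-_S, _S),   # up and to the left
    "c": (-_S, -_S),  # down and to the left
    "d": (_S, -_S),   # down and to the right
}


def display_direction(likelihoods, weights=None):
    """Return X = sum_i w_i * P_i * E_i over the models a, b, c, d."""
    if weights is None:
        weights = {model: 1.0 for model in DIRECTION_VECTORS}
    x = sum(weights[m] * likelihoods[m] * DIRECTION_VECTORS[m][0]
            for m in DIRECTION_VECTORS)
    y = sum(weights[m] * likelihoods[m] * DIRECTION_VECTORS[m][1]
            for m in DIRECTION_VECTORS)
    return x, y


def angle_and_magnitude(vector):
    """Convert X into an angle in degrees (counterclockwise from +x) and a length."""
    x, y = vector
    return math.degrees(math.atan2(y, x)), math.hypot(x, y)
```

When models a and d score equally and the others score zero, the vertical components cancel and the presented direction points straight along the positive x axis, a direction that a four-entry table could not express.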

(Embodiment 4)
In Embodiments 1 to 3 described above, the user is guided to adjust the microphone mounting position on the basis of the acoustic likelihood, but another evaluation function may be used in place of the acoustic likelihood.

For example, consider a case in which, as in Embodiment 2, there are five acoustic models and the acoustic likelihoods determined from the respective models are used in the evaluation function. The processing flow is assumed to be the same as in Embodiment 1.

As the position x at which the microphone is actually mounted approaches the optimum sound collection position E, the acoustic likelihood obtained with acoustic model e for the features of the speech input from the microphone at position x can be expected to increase, whereas the acoustic likelihoods obtained with acoustic models a, b, c, and d can each be expected to decrease. Expressed as an evaluation function, this is as follows.

Here, P_i denotes the acoustic likelihood, obtained with acoustic model i, of the speech input from the microphone mounted at position x, and w_i denotes the weighting coefficient for P_i.
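The evaluation function itself is not reproduced in this text, so its exact form is unknown. The sketch below shows one form that is merely consistent with the behavior described above (it grows as P_e grows and as P_a to P_d shrink). The subtraction form, the function name, and the example weights are all assumptions, not the patent's formula.

```python
def evaluation(likelihoods, weights):
    """Assumed form: w_e * P_e minus the weighted sum of P_a, P_b, P_c, P_d.

    This is a hypothetical stand-in for the evaluation function, chosen only
    because it increases as the microphone approaches the optimum position.
    """
    surrounding = sum(weights[m] * likelihoods[m] for m in ("a", "b", "c", "d"))
    return weights["e"] * likelihoods["e"] - surrounding


# Example weights (hypothetical): the four surrounding models share one unit of weight.
EXAMPLE_WEIGHTS = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25, "e": 1.0}
```

With such a function, a larger value indicates a mounting position closer to the optimum sound collection position, so the value could be presented to the user in place of the raw acoustic likelihood.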

(Embodiment 5)
By implementing Embodiments 1 and 2 together, the user can be guided to the microphone mounting position more efficiently. For example, for a mounted microphone, the direction in which the mounting position should be adjusted is first displayed according to Embodiment 2. When the microphone comes close to the optimum sound collection position while being moved in the displayed direction, the guidance switches to the beep sound or the on-screen acoustic likelihood display of Embodiment 1. This makes it possible to guide the mounting position of the microphone efficiently.
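The switch between the two kinds of guidance can be sketched as follows. The text does not state how "close to the optimum sound collection position" is detected. This sketch assumes, purely for illustration, that the microphone counts as close when acoustic model e has the largest likelihood. The mode names and the function name are also hypothetical.

```python
def guidance_mode(likelihoods):
    """Choose between direction guidance (Embodiment 2) and likelihood display (Embodiment 1).

    Assumption: the microphone is treated as near the optimum position when
    model "e" scores highest among the five acoustic models.
    """
    best_model = max(likelihoods, key=likelihoods.get)
    if best_model == "e":
        return "likelihood_display"
    return "direction_display"
```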

(Embodiment 6)
Embodiments 1 to 5 described above show methods in which the microphone mounting position is identified precisely and speech recognition is performed using an acoustic model that depends on the mounting position. This embodiment shows a method in which an acoustic model that does not depend strictly on the microphone mounting position is created and used for speech recognition.

FIG. 11 is a flowchart showing the flow of processing for creating an acoustic model that is robust against displacement of the microphone mounting position in this embodiment. A case in which a plurality of microphones are worn at the same time to collect speech is described here, but the present invention is not limited to this, and the speech may be collected separately.

First, in step S1101, a plurality of microphones are mounted at the optimum sound collection position described in Embodiment 1 and around it. Next, in step S1102, acoustic features are extracted from all of the input speech data, and in the subsequent step S1103 all of the features are output to the external storage device 104.

Next, in step S1104, training is performed using all of the features recorded in the external storage device 104. This creates an acoustic model that is robust against displacement of the microphone mounting position. The created acoustic model is recorded in the external storage device 104 in step S1105.
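The idea of steps S1102 to S1104, pooling the features from every mounting position and training one model on all of them, can be sketched as below. The text does not specify the type of acoustic model. A single diagonal-covariance Gaussian is used here only as a simple stand-in, and the function names, the data layout, and the variance floor are assumptions.

```python
import math
from statistics import fmean, pvariance


def train_pooled_model(features_by_position):
    """Fit one diagonal Gaussian to the frames pooled from every microphone position.

    features_by_position maps a position label to a list of feature vectors.
    """
    pooled = [frame for frames in features_by_position.values() for frame in frames]
    dims = list(zip(*pooled))  # one tuple of values per feature dimension
    means = [fmean(d) for d in dims]
    # Floor the variance so that a constant dimension does not divide by zero.
    variances = [max(pvariance(d, mu), 1e-6) for d, mu in zip(dims, means)]
    return means, variances


def log_likelihood(model, frame):
    """Log-likelihood of one feature vector under the diagonal Gaussian."""
    means, variances = model
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        for x, mu, var in zip(frame, means, variances)
    )
```

Because the training data covers the optimum position and its surroundings, the fitted model spreads its probability mass over the feature variation caused by small shifts in mounting position, which is the robustness this embodiment aims for.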

FIG. 12 is a flowchart showing the flow of speech recognition that is robust against displacement of the microphone mounting position in this embodiment.

First, in step S1201, speech input is received from the microphone worn by the user. Next, features are extracted in step S1202, and in step S1203 the acoustic model created according to the flowchart of FIG. 11 is loaded. Then, in step S1204, speech recognition is performed based on the acoustic model loaded in step S1203 and the features extracted in step S1202, and the recognition result is output in step S1205.

(Embodiment 7)
Embodiments 1 to 5 were described on the assumption that an acoustic model created using only speech input at the optimum sound collection position is used for the actual speech recognition. However, the invention is not limited to this, and speech recognition may instead be performed using the acoustic model described in Embodiment 6, which is robust against displacement of the microphone mounting position.

(Embodiment 8)
This embodiment shows specific examples of the voice input unit 107 that can be applied to each of the embodiments described above.

In FIG. 13, reference numeral 1301 denotes an example of a voice input device in which a plurality of microphones are arranged in a single housing. Embodiments 1 to 7 described above assume that a single microphone is used for speech input when the microphone is worn. If a voice input device such as 1301 is worn instead, the likelihood between the acoustic model and the speech input to each of the microphones can be calculated independently, and the microphone with the highest likelihood can be used as the input microphone for speech recognition.

Reference numeral 1302, on the other hand, denotes a voice input device that holds only one sound-collecting microphone, which moves automatically within a predetermined range inside the housing. The voice input device 1302 is worn near the optimum sound collection position. In the step of calculating the acoustic likelihood, the internal microphone moves automatically through the predetermined range inside the housing and the likelihood is calculated at each position. Based on the likelihoods calculated over the entire movable range, the sound-collecting microphone is fixed at the position in the housing where the maximum likelihood was obtained, and speech recognition is then performed.
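Both devices reduce to picking the candidate with the highest acoustic likelihood, which can be sketched as follows. The function names, the microphone and position labels, and the `measure_likelihood` callback (standing in for "move the microphone to this position, record, and score against the acoustic model") are hypothetical.

```python
def select_best_microphone(likelihood_by_mic):
    """Device 1301: return the microphone whose input has the highest likelihood."""
    return max(likelihood_by_mic, key=likelihood_by_mic.get)


def sweep_for_best_position(positions, measure_likelihood):
    """Device 1302: score every reachable position and return the best one.

    measure_likelihood is a caller-supplied function (hypothetical) that returns
    the acoustic likelihood measured with the microphone at the given position.
    """
    scores = {position: measure_likelihood(position) for position in positions}
    return max(scores, key=scores.get)
```

In the sweep case the microphone would then be fixed at the returned position before speech recognition starts.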

(Other Embodiments)
Embodiments of the present invention have been described in detail above. The present invention can be embodied as, for example, a system, an apparatus, a method, a program, or a storage medium. The present invention may be applied to a system composed of a plurality of devices, or to an apparatus consisting of a single device.

The present invention also includes the case in which it is achieved by supplying a software program that realizes the functions of the embodiments described above (a program corresponding to the flowchart shown in any of FIGS. 6 to 8) to a system or apparatus, directly or remotely, and having a computer of that system or apparatus read out and execute the supplied program code. In that case, the form need not be a program as long as it has the functions of the program.

Accordingly, the program code itself that is installed in a computer in order to realize the functional processing of the present invention on that computer also realizes the present invention. That is, the claims of the present invention also cover the computer program itself for realizing the functional processing of the present invention.

In that case, the program may take any form, such as object code, a program executed by an interpreter, or script data supplied to an OS, as long as it has the functions of the program.

Examples of recording media for supplying the program include a flexible disk, hard disk, optical disk, magneto-optical disk, MO, CD-ROM, CD-R, CD-RW, magnetic tape, nonvolatile memory card, ROM, and DVD (DVD-ROM, DVD-R).

As another method of supplying the program, a browser on a client computer can be used to connect to a website on the Internet, and the computer program of the present invention itself, or a compressed file that includes an automatic installation function, can be downloaded from that website to a recording medium such as a hard disk. The program can also be supplied by dividing the program code that constitutes the program of the present invention into a plurality of files and downloading the files from different websites. That is, a WWW server that allows a plurality of users to download the program files for realizing the functional processing of the present invention on a computer is also covered by the claims of the present invention.

It is also possible to encrypt the program of the present invention, store it on a storage medium such as a CD-ROM, and distribute it to users; to let users who satisfy predetermined conditions download key information for decryption from a website via the Internet; and to have them use that key information to execute the encrypted program and install it on a computer.

The functions of the embodiments described above are realized when the computer executes the program it has read out. In addition, an OS or the like running on the computer may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments described above can also be realized by that processing.

Furthermore, after the program read out from the recording medium is written to a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, a CPU or the like provided on the function expansion board or in the function expansion unit may perform part or all of the actual processing based on the instructions of the program, and the functions of the embodiments described above are also realized by that processing.

A block diagram showing the configuration of the speech recognition apparatus in the embodiments.
A diagram showing the module configuration of the speech recognition program involved in the acoustic model creation processing in Embodiment 1.
A flowchart showing the flow of the acoustic model creation processing in Embodiment 1.
A diagram showing the module configuration of the speech recognition program involved in the microphone mounting position confirmation processing in Embodiment 1.
A flowchart showing the flow of the microphone mounting position confirmation processing in Embodiment 1.
A flowchart showing the flow of the speech recognition processing in Embodiment 1.
A diagram showing the module configuration of the speech recognition program involved in the processing for creating the adjustment direction determination table in Embodiment 2.
A flowchart showing the flow of the processing for creating the adjustment direction determination table in Embodiment 2.
A diagram showing the mounting positions of the microphone in the embodiments.
A diagram showing an example of the structure of the adjustment direction determination table in Embodiment 2.
A flowchart showing the microphone mounting position confirmation processing in Embodiment 2.
A flowchart showing the flow of processing for creating an acoustic model that is robust against displacement of the microphone mounting position in Embodiment 6.
A flowchart showing the flow of speech recognition that is robust against displacement of the microphone mounting position in Embodiment 6.
A diagram showing a specific structural example of the voice input unit in Embodiment 13.
A diagram showing a configuration example of the voice input unit in the embodiments.

Claims

A speech recognition apparatus that performs speech recognition using a microphone configured to collect, when worn on a predetermined part of a user, faint speech that is not accompanied by vibration of the user's vocal cords, the apparatus comprising:
an acoustic model created based on speech collected from a speaker with the microphone mounted at an optimum sound collection position;
calculation means for calculating, for speech input by a user wearing the microphone, an acoustic likelihood based on the acoustic model; and
presenting means for presenting to the user, using the acoustic likelihood calculated by the calculation means, information on the suitability of the current mounting position of the microphone.

A speech recognition apparatus that performs speech recognition using a microphone configured to collect, when worn on a predetermined part of a user, faint speech that is not accompanied by vibration of the user's vocal cords, the apparatus comprising:
a plurality of acoustic models, each created, for a respective one of an optimum sound collection position and a plurality of positions around it, based on speech collected from a speaker with the microphone mounted at that position;
calculation means for calculating, for each of the plurality of acoustic models, an acoustic likelihood based on that acoustic model for speech input by a user wearing the microphone;
estimation means for estimating the current mounting position of the microphone based on the acoustic likelihoods calculated by the calculation means; and
presenting means for presenting to the user, when the mounting position estimated by the estimation means is not the optimum sound collection position, the direction from the estimated mounting position toward the optimum sound collection position as the direction in which the mounting position of the microphone should be adjusted.

The speech recognition apparatus according to claim 2, wherein the estimation means selects the acoustic model for which the acoustic likelihood calculated by the calculation means is largest, and estimates, as the current mounting position, the mounting position of the microphone at which the speech underlying that acoustic model was collected.

The speech recognition apparatus according to claim 2, wherein the presenting means determines the direction in which the mounting position of the microphone should be adjusted by referring to a table that associates each of the plurality of acoustic models with the direction from the mounting position of the microphone at which the speech underlying that acoustic model was collected toward the optimum sound collection position.

A method for controlling a speech recognition apparatus that performs speech recognition using a microphone configured to collect, when worn on a predetermined part of a user, faint speech that is not accompanied by vibration of the user's vocal cords, the method comprising:
an input step of inputting speech from a user wearing the microphone;
a calculation step of calculating an acoustic likelihood of the speech input in the input step, based on an acoustic model created from speech collected from a speaker with the microphone mounted at an optimum sound collection position; and
a presenting step of presenting to the user, using the acoustic likelihood calculated in the calculation step, information on the suitability of the current mounting position of the microphone.

A method for controlling a speech recognition apparatus that performs speech recognition using a microphone configured to collect, when worn on a predetermined part of a user, faint speech that is not accompanied by vibration of the user's vocal cords, the method comprising:
an input step of inputting speech from a user wearing the microphone;
a calculation step of calculating, for each of a plurality of acoustic models, an acoustic likelihood of the speech input in the input step based on that acoustic model, each of the plurality of acoustic models having been created, for a respective one of an optimum sound collection position and a plurality of positions around it, from speech collected from a speaker with the microphone mounted at that position;
an estimation step of estimating the current mounting position of the microphone based on the acoustic likelihoods calculated in the calculation step; and
a presenting step of presenting to the user, when the mounting position estimated in the estimation step is not the optimum sound collection position, the direction from the estimated mounting position toward the optimum sound collection position as the direction in which the mounting position of the microphone should be adjusted.

The method for controlling a speech recognition apparatus according to claim 6, wherein the estimation step selects the acoustic model for which the acoustic likelihood calculated in the calculation step is largest, and estimates, as the current mounting position, the mounting position of the microphone at which the speech underlying that acoustic model was collected.

The method for controlling a speech recognition apparatus according to claim 6, wherein the presenting step determines the direction in which the mounting position of the microphone should be adjusted by referring to a table that associates each of the plurality of acoustic models with the direction from the mounting position of the microphone at which the speech underlying that acoustic model was collected toward the optimum sound collection position.

A program for causing a computer to execute the method for controlling a speech recognition apparatus according to any one of claims 5 to 8.

A computer-readable storage medium storing the program according to claim 9.

A speech recognition apparatus that performs speech recognition using a microphone configured to collect, when worn on a predetermined part of a user, faint speech that is not accompanied by vibration of the user's vocal cords, the apparatus comprising:
a single acoustic model created based on speech collected from a speaker with the microphone mounted at each of an optimum sound collection position and a plurality of positions around it; and
extraction means for extracting a feature amount of speech input from a user wearing the microphone,
wherein speech recognition is performed based on the feature amount extracted by the extraction means and the acoustic model.

A speech recognition method for performing speech recognition using a microphone configured to collect, when worn on a predetermined part of a user, faint speech that is not accompanied by vibration of the user's vocal cords, the method comprising:
an input step of inputting speech from a user wearing the microphone;
an extraction step of extracting a feature amount of the speech input in the input step; and
a speech recognition step of performing speech recognition based on the feature amount extracted in the extraction step and a single acoustic model created based on speech collected from a speaker with the microphone mounted at each of an optimum sound collection position and a plurality of positions around it.

A program for causing a computer to execute, in order to perform speech recognition using a microphone configured to collect, when worn on a predetermined part of a user, faint speech that is not accompanied by vibration of the user's vocal cords:
an input step of inputting speech from a user wearing the microphone;
an extraction step of extracting a feature amount of the speech input in the input step; and
a speech recognition step of performing speech recognition based on the feature amount extracted in the extraction step and a single acoustic model created based on speech collected from a speaker with the microphone mounted at each of an optimum sound collection position and a plurality of positions around it.

A computer-readable storage medium storing the program according to claim 13.