JP2016180916A

JP2016180916A - Voice recognition system, voice recognition method, and program

Info

Publication number: JP2016180916A
Application number: JP2015061833A
Authority: JP
Inventors: 智子川瀬; Tomoko Kawase; 和則小林; Kazunori Kobayashi; 仲大室; Hitoshi Omuro
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-03-25
Filing date: 2015-03-25
Publication date: 2016-10-13
Anticipated expiration: 2035-03-25
Also published as: JP6389787B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition system capable of reducing the number of false recognitions and improving the utilization efficiency of the system. SOLUTION: A client device 10 includes: a receiving unit 15 for receiving voice recognition results from a voice recognition server device group 20 selected based on a voice collection condition; a correct answer candidate extracting unit for extracting a voice recognition result of a last signal from a repeated signal group as a correct answer candidate; a transmission unit 14 for transmitting, to a control unit, a pair of the correct answer candidate and a re-learning signal group; and a transmission destination changing unit 18 for changing a relation between a voice recognition server device functioning as a transmission destination of an acoustic signal and the voice collection condition on the basis of transmission destination information. A control unit 30 includes: a voice recognition result receiving part for receiving voice recognition results for the re-learning signal group from all the voice recognition server devices 21; a transmission destination information updating part for updating the transmission destination information on the basis of a similarity between each of the voice recognition results received from all the voice recognition server devices and the correct answer candidate; and a transmission destination information transmitting part for transmitting the updated transmission destination information to the client device. SELECTED DRAWING: Figure 1

Description

The present invention relates to a voice recognition system that includes a client device, a plurality of voice recognition server devices, and a management unit, and to a voice recognition method and a program.

Conventionally, there are server-client voice recognition systems in which a voice recognition server device performs voice recognition on the signal of a speech section detected by a client device and returns the result to the client device (for example, Patent Document 1). Placing the voice recognition server device on a network accessible from any client device allows a large number of client devices to use services built on the voice recognition system.

Patent Document 1: JP 2005-331616 A

In such a system, if the recognition performance of the voice recognition server device is insufficient, the client device must access the voice recognition server device repeatedly until it obtains a correct recognition result. These repeated accesses increase the load on the voice recognition server device and lower the utilization efficiency of the system. Improving the utilization efficiency of the system therefore requires reducing the number of erroneous recognitions.

Accordingly, an object of the present invention is to provide a voice recognition system that can reduce the number of erroneous recognitions and improve the utilization efficiency of the system.

The voice recognition system of the present invention includes a client device, a plurality of voice recognition server devices, and a management unit. The client device includes a receiving unit, a rephrase determination unit, a transmission unit, and a transmission destination changing unit.

The receiving unit receives the voice recognition result for an acoustic signal input to the client device from the voice recognition server device selected based on the sound collection condition of that signal. The rephrase determination unit acquires a repeated signal group, which is a group of signals observed when the user utters the same content multiple times, and extracts the voice recognition result of the last signal in the repeated signal group as a correct answer candidate. The transmission unit treats the entire repeated signal group as a relearning signal group and transmits the pair of the correct answer candidate and the relearning signal group to the management unit. The transmission destination changing unit changes the relationship between the sound collection condition and the voice recognition server device that serves as the transmission destination of the acoustic signal, based on transmission destination information, which is information about that relationship.
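The handling of a repeated signal group described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_relearning_pair` and the representation of each utterance as a `(signal_id, recognition_result)` tuple are assumptions made for the example.

```python
def build_relearning_pair(repeated_group):
    """Return (correct answer candidate, relearning signal group).

    repeated_group is a hypothetical list of (signal_id, recognition_result)
    tuples in utterance order, all judged to be repetitions of one utterance.
    """
    if len(repeated_group) < 2:
        raise ValueError("a repeated signal group needs at least two utterances")
    # The recognition result of the LAST signal is taken as the candidate.
    candidate = repeated_group[-1][1]
    # Every signal in the repeated group becomes part of the relearning group.
    relearning_signals = [signal_id for signal_id, _ in repeated_group]
    return candidate, relearning_signals
```

The last result is used because, in this scheme, the user is assumed to stop rephrasing once the presented result is acceptable.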

The management unit includes a voice recognition result receiving unit, a transmission destination information updating unit, and a transmission destination information transmitting unit.

The voice recognition result receiving unit receives voice recognition results for the relearning signal group from all of the voice recognition server devices. The transmission destination information updating unit updates the transmission destination information based on the similarity between the correct answer candidate and each voice recognition result received from the voice recognition server devices. The transmission destination information transmitting unit transmits the updated transmission destination information to the client device.
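One possible shape of the similarity-based update is sketched below. The text here only states that the update is based on similarity to the correct answer candidate, so everything concrete in this sketch is an assumption: the similarity measure (`difflib.SequenceMatcher` from the Python standard library), the rule "pick the server with the highest mean similarity", and the names `similarity` and `update_destination`.

```python
from difflib import SequenceMatcher


def similarity(text_a, text_b):
    # Character-level similarity in [0, 1]; a stand-in for whatever
    # measure the management unit actually uses.
    return SequenceMatcher(None, text_a, text_b).ratio()


def update_destination(destination_info, group, candidate, results_by_server):
    """Return a copy of destination_info with `group` re-routed.

    destination_info: {group code: server index} (assumed representation)
    results_by_server: {server index: [recognition result per relearning signal]}
    """
    def mean_similarity(texts):
        return sum(similarity(t, candidate) for t in texts) / len(texts)

    # Assumed rule: route the group to the server whose results are,
    # on average, closest to the correct answer candidate.
    best = max(results_by_server,
               key=lambda index: mean_similarity(results_by_server[index]))
    updated = dict(destination_info)
    updated[group] = best
    return updated
```

Returning a new dictionary rather than mutating the old one keeps the previous routing available until the client device has received the update.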

According to the voice recognition system of the present invention, the number of erroneous recognitions can be reduced and the utilization efficiency of the system can be improved.

FIG. 1 is a block diagram showing the configuration of the voice recognition system of Embodiment 1.
FIG. 2 is a block diagram showing the configuration of the rephrase determination unit of the voice recognition system of Embodiment 1.
FIG. 3 is a block diagram showing the configuration of the management unit of the voice recognition system of Embodiment 1.
FIG. 4 is a sequence diagram showing the voice recognition operation of the voice recognition system of Embodiment 1.
FIG. 5 is a sequence diagram showing the information update operation of the voice recognition system of Embodiment 1.
FIG. 6 is a flowchart showing the operation of the rephrase determination unit of the voice recognition system of Embodiment 1.
FIG. 7 is a diagram illustrating the rephrase determination operation of the voice recognition system of Embodiment 1.
FIG. 8 is a diagram illustrating the transmission destination information update operation of the voice recognition system of Embodiment 1.
FIG. 9 is a block diagram showing the configuration of the voice recognition system of Embodiment 2.
FIG. 10 is a block diagram showing the configuration of the management unit of the voice recognition system of Embodiment 2.
FIG. 11 is a sequence diagram showing the information update operation of the voice recognition system of Embodiment 2.
FIG. 12 is a block diagram showing the configuration of the voice recognition system of Embodiment 3.
FIG. 13 is a block diagram showing the configuration of the management unit of the voice recognition system of Embodiment 3.
FIG. 14 is a sequence diagram showing the information update operation of the voice recognition system of Embodiment 3.

Embodiments of the present invention are described in detail below. Components having the same function are given the same reference number, and duplicate description is omitted.

In the following description, the uttered signal that is the target of voice recognition is called the speech signal, and the signal picked up with the speech signal mixed with other signals, such as background noise, is called the acoustic signal.

The configuration of the voice recognition system of this embodiment is described below with reference to FIGS. 1, 2, and 3. FIG. 1 is a block diagram showing the configuration of the voice recognition system 1 of this embodiment. FIG. 2 is a block diagram showing the configuration of the rephrase determination unit 17 of the voice recognition system 1. FIG. 3 is a block diagram showing the configuration of the management unit 30 of the voice recognition system 1.

As shown in FIG. 1, the voice recognition system 1 of this embodiment includes a client device 10, a plurality of voice recognition server devices 21-1, ..., 21-n, ..., 21-N (N is an integer satisfying N ≧ 2, and n is an integer satisfying 1 ≦ n ≦ N), and a management unit 30. Only one client device 10 is shown in FIG. 1, but there may be multiple client devices 10. The voice recognition server devices 21-1, ..., 21-n, ..., 21-N are collectively called the voice recognition server device group 20. The client device 10 and the voice recognition server device group 20 are connected via a network so that they can communicate wirelessly or by wire. The management unit 30 may be configured as standalone hardware (a device), in which case it may be called the management device 30. In that case, the client device 10, the voice recognition server device group 20, and the management unit 30 (management device 30) are connected via a network so that they can communicate wirelessly or by wire. Alternatively, the management unit 30 may be a component of the client device 10 or a component of any one of the voice recognition server devices in the voice recognition server device group 20.

For each of the voice recognition server devices 21-1, ..., 21-n, ..., 21-N, whether it is responsible for voice recognition processing of an acoustic signal is set in advance based on the sound collection condition (described in detail later) of the acoustic signal input to the client device 10, and the server devices store acoustic models with mutually different characteristics. One example of such a characteristic is the noise characteristic. The client device 10 includes a sound collection condition extraction unit 11, a threshold storage unit 111, a selection unit 12, a transmission destination storage unit 121, a signal processing unit 13, a transmission unit 14, a receiving unit 15, a presentation unit 16, a rephrase determination unit 17, a rephrase information storage unit 171, and a transmission destination changing unit 18. As shown in FIG. 2, the rephrase determination unit 17 of this embodiment includes a reaction time measurement unit 17A, a reliability acquisition unit 17B, a similarity calculation unit 17C, and a determination unit 17D. As shown in FIG. 3, the management unit 30 of this embodiment includes a correct answer candidate receiving unit 30A, a relearning signal group transmitting unit 30B, a voice recognition result receiving unit 30C, a transmission destination information updating unit 30D, a transmission destination information transmitting unit 30E, and a correct answer candidate storage unit 30F.

The voice recognition operation of this system is described below with reference to FIG. 4. FIG. 4 is a sequence diagram showing the voice recognition operation of the voice recognition system 1 of this embodiment. First, the sound collection condition extraction unit 11 extracts the sound collection condition of the input acoustic signal (S11). Based on the extracted sound collection condition, the selection unit 12 selects the voice recognition server device (for example, the voice recognition server device 21-1) that is to be the transmission destination of the corresponding acoustic signal (S12). The relationship between sound collection conditions and destination voice recognition server devices is stored in advance in the transmission destination storage unit 121 as transmission destination information.

<Sound collection conditions>
A sound collection condition can be, for example, a condition based on a threshold for at least one of the following feature quantities: a feature quantity related to the S/N ratio (the ratio of the magnitude of the speech signal to the magnitude of the background noise signal), a feature quantity related to distortion of the acoustic signal, a feature quantity related to the spectral shape of the background noise signal, and a feature quantity related to the magnitude of the background noise signal. The thresholds are stored in advance in the threshold storage unit 111.

The background noise signal is the signal observed by the microphone during a fixed period immediately before the uttered speech or target sound is input. The magnitude of the background noise signal is the average of its power spectrum over that fixed period. The spectral shape of the background noise signal refers to the components of each band in its spectrum and their change over time. The S/N ratio of the speech signal to the background noise signal is the ratio of the magnitude of the speech signal contained in the acoustic signal during input of the uttered speech (target sound) to the magnitude of the background noise signal. The speech signal can be taken as the power spectrum obtained by subtracting the time-averaged power spectrum of the background noise signal from the power spectrum of the acoustic signal over a fixed period during input of the uttered speech (target sound). The magnitude of the speech signal is the average of the power spectrum of the speech signal over a fixed period during input of the uttered speech (target sound).
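The definitions above can be turned into a small S/N estimate. The following is a sketch under stated assumptions: frames are given as lists of per-bin power values, the "magnitude" is collapsed to a single number by averaging over bins as well as over time, and a small positive floor (not mentioned in the text) guards against negative or zero power after subtraction. The names `mean_power` and `estimate_snr_db` are made up for the example.

```python
import math


def mean_power(frames):
    """Time-average of power-spectrum frames, per frequency bin."""
    n_frames = len(frames)
    n_bins = len(frames[0])
    return [sum(frame[k] for frame in frames) / n_frames for k in range(n_bins)]


def estimate_snr_db(noise_frames, utterance_frames, floor=1e-12):
    # Background noise: averaged power spectrum observed just before the utterance.
    noise = mean_power(noise_frames)
    # Acoustic signal (speech plus noise) observed during the utterance.
    observed = mean_power(utterance_frames)
    # Speech estimate: observed power minus the averaged noise power.
    speech = [max(o - n, floor) for o, n in zip(observed, noise)]
    speech_level = sum(speech) / len(speech)
    noise_level = max(sum(noise) / len(noise), floor)
    return 10.0 * math.log10(speech_level / noise_level)
```

For instance, a noise power of 1 per bin and an observed power of 11 per bin during the utterance give a speech estimate of 10 per bin, that is, an S/N ratio of 10 dB.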

Distortion of the acoustic signal refers to clipping in the microphone element, the microphone amplifier circuit, or the A/D converter caused by an excessively large input. Sections in which the input signal level has an amplitude at or above a predetermined threshold are detected, and the proportion of time they occupy is calculated. A high proportion means large distortion, and a low proportion means small distortion. If the amplitude never reaches the threshold, the signal can be treated as undistorted. The threshold is set to match the clipping level of the microphone element, the circuit, and the A/D converter.

<Sound collection condition extraction unit 11 (S11), selection unit 12 (S12)>
Examples of the operation of the sound collection condition extraction unit 11 and the selection unit 12 (S11, S12) are given below. The sound collection condition extraction unit 11 extracts, for example, a feature quantity representing the sound collection condition from the input acoustic signal and assigns the acoustic signal to a group (for example, a code representing the sound collection condition) according to the value of the feature quantity.

Next, as shown in Table 1, the selection unit 12 selects the voice recognition server device (for example, the voice recognition server device 21-1) that is to be the transmission destination of the corresponding acoustic signal, based on the relationship between the group (the code representing the sound collection condition) and the index (the code representing the destination voice recognition server device) (S12).
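The group-to-index lookup performed in S12 amounts to a table lookup, as in the sketch below. Table 1 itself is not reproduced here, so the mapping values, the server addresses, and the function name `select_server` are all hypothetical placeholders.

```python
# Hypothetical transmission destination information: group code -> server index.
DESTINATION_INFO = {1: 1, 2: 2, 3: 3, 4: 4}

# Hypothetical server index -> address of a voice recognition server device.
SERVER_ADDRESSES = {
    1: "asr-21-1.example",
    2: "asr-21-2.example",
    3: "asr-21-3.example",
    4: "asr-21-4.example",
}


def select_server(group, destination_info=DESTINATION_INFO,
                  addresses=SERVER_ADDRESSES):
    """Return (index, address) of the destination for a sound collection group."""
    index = destination_info[group]
    return index, addresses[index]
```

Keeping the mapping as data rather than code is what lets the transmission destination changing unit swap in updated transmission destination information later.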

The feature quantity x can be, for example, the S/N ratio (the ratio of the magnitude of the speech signal contained in the acoustic signal to the magnitude of the background noise signal), the presence or frequency of distortion in the acoustic signal, the spectral shape of the background noise signal, or the magnitude of the background noise signal.

When the feature quantity x is the S/N ratio, the thresholds are set, for example, to θ₁ = 0 dB, θ₂ = 10 dB, θ₃ = 20 dB, and so on. If x = 5 dB, the sound collection condition extraction unit 11 extracts group 2 as the sound collection condition, and the selection unit 12 selects index 2.
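The threshold-based grouping can be sketched with a sorted threshold list and a binary search. This is an illustration only: the function name `group_for` is made up, and the treatment of a value exactly equal to a threshold (it goes to the higher group here) is an assumption, since the text does not specify it.

```python
from bisect import bisect_right


def group_for(x, thresholds):
    """Map feature value x to a 1-based group code.

    thresholds must be sorted in ascending order. Values below the first
    threshold fall in group 1; each threshold reached moves x up one group.
    """
    return bisect_right(thresholds, x) + 1


# S/N ratio example from the text: thresholds 0, 10, 20 dB and x = 5 dB.
snr_group = group_for(5.0, [0.0, 10.0, 20.0])  # group 2
```

The same helper reproduces the document's other threshold examples, such as a distortion ratio of 0.9 against a single threshold of 0.8 (group 2) and a noise level of 50 dBA against 40, 55, and 70 dBA (group 2).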

When the feature quantity x is the distortion of the acoustic signal, x is taken, for example, as the proportion of time within 0.5 seconds during which the absolute value of the amplitude is 30000 or more, for a signal quantized at a bit depth of 16 bits. With the threshold set to θ₁ = 0.8, for example, if x = 0 the sound collection condition extraction unit 11 extracts group 1 as the sound collection condition and the selection unit 12 selects index 1, whereas if x = 0.9 the sound collection condition extraction unit 11 extracts group 2 and the selection unit 12 selects index 2.
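The distortion feature can be computed directly from the samples of one analysis window. In this sketch the "proportion of time" is approximated as the fraction of samples at or above the amplitude limit, which is an assumption about how the time ratio is measured; the function name `clipping_ratio` is illustrative.

```python
def clipping_ratio(samples, limit=30000):
    """Fraction of 16-bit samples whose absolute amplitude is >= limit.

    samples is one analysis window (for example 0.5 s of audio) given as
    integers in the signed 16-bit range.
    """
    if not samples:
        return 0.0  # an empty window is treated as undistorted
    clipped = sum(1 for s in samples if abs(s) >= limit)
    return clipped / len(samples)
```

At a hypothetical sampling rate of 16 kHz, a 0.5 second window would hold 8000 samples, so a ratio of 0.9 would mean 7200 samples at or above the limit.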

When the feature quantity x is the spectral shape of the background noise signal, the magnitude of the background noise signal is, for example, evaluated separately by frequency band or duration as x₁, x₂, ..., x_m (m is an integer satisfying m ≧ 2). The sound collection condition extraction unit 11 extracts a group from the combination of the evaluation results, and the selection unit 12 selects the corresponding index. As another way of using the spectral shape of the background noise signal as a feature quantity, models of several types of background noise signal can be stored, and the background noise signal of the input signal can be classified into one of the models. The types of background noise signal are, for example, white noise, pink noise, and burst noise. In this method, a group is assigned to each model in advance, and the group is determined by the model into which the background noise signal of the input signal is classified.

When the feature quantity x is the magnitude of the background noise signal, the thresholds are set, for example, to θ₁ = 40 dBA, θ₂ = 55 dBA, θ₃ = 70 dBA, and so on. If x = 50 dBA, the sound collection condition extraction unit 11 extracts group 2 as the sound collection condition, and the selection unit 12 selects index 2. Here, dBA is the unit of the dB value of a noise level measured with the frequency weighting that reflects human hearing (A-weighting).

<Signal processing unit 13 (S13)>
When the extracted sound collection condition meets a predetermined condition, the signal processing unit 13 applies signal processing to the corresponding acoustic signal (S13). Specifically, the signal processing unit 13 processes the acoustic signal so that its S/N ratio and background noise magnitude fall within the range of feature quantities that the voice recognition server device, determined from the sound collection condition extracted by the sound collection condition extraction unit 11, assumes for its recognition targets. For example, under a sound collection condition near S/N ratio = 1, that is, near 0 dB, the speech signal and the background noise signal have comparable magnitudes, and using such an acoustic signal for voice recognition as is tends to degrade performance. Therefore, when the sound collection condition extraction unit 11 extracts a sound collection condition near S/N ratio = 0 dB, the signal processing unit 13 applies signal processing that suppresses the background noise signal in the acoustic signal. When the sound collection condition extraction unit 11 extracts, for example, a sound collection condition near S/N ratio = 100, that is, near 20 dB, the background noise signal may be suppressed adaptively according to the S/N ratio, as in the 0 dB case above, or no suppression may be applied at all. Under other sound collection conditions as well, the signal processing unit 13 adaptively applies signal processing to the acoustic signal based on the result extracted by the sound collection condition extraction unit 11.

Examples of the operation of the signal processing unit 13 (S13) are given below. Voice recognition often corrects the input speech by signal processing as a preprocessing step. Acoustic characteristics that preprocessing should address include, for example, additive noise and multiplicative noise. Additive noise is a signal observed additively on top of the speech signal, such as the noise that is ubiquitous in the speech input environment. Multiplicative noise is noise (distortion) caused by acoustic characteristics such as the microphone characteristics and the spatial transfer characteristics; it is observed as a convolution with the original speech waveform in the time domain and as a multiplicative distortion in the spectrum. Examples of voice recognition processing that addresses additive noise are methods that suppress the noise in noisy speech before recognition, such as the noise suppression method based on spectral subtraction disclosed in paragraph [0005] of Reference Patent Document 1 and the noise suppression method based on the Wiener filter method (hereinafter, the WF method) disclosed in paragraph [0007] of the same document.
(Reference Patent Document 1: Japanese Patent No. 4464797)
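As a rough illustration of the spectral subtraction idea mentioned above, the sketch below applies the common textbook form of power spectral subtraction to one frame. It is not the specific method of Reference Patent Document 1: the over-subtraction factor `alpha`, the spectral floor `beta`, and the function name `spectral_subtract` are assumptions introduced for the example.

```python
def spectral_subtract(observed_power, noise_power, alpha=1.0, beta=0.01):
    """Suppress additive noise in one power-spectrum frame.

    observed_power: per-bin power of the noisy frame
    noise_power:    per-bin estimate of the noise power
    alpha:          over-subtraction factor (assumed parameter)
    beta:           floor, as a fraction of the observed power, that
                    prevents negative power after subtraction
    """
    return [max(p - alpha * n, beta * p)
            for p, n in zip(observed_power, noise_power)]
```

The floor matters in practice because bins where the noise estimate exceeds the observed power would otherwise become negative.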

An example of voice recognition processing that addresses multiplicative noise in addition to additive noise is the method of Reference Patent Document 1, which generates a noise-superimposed speech model by superimposing a noise model on a speech model from which the influence of multiplicative noise has been removed, and then updates the model based on the multiplicative feature quantity. Another is the method of the invention of Reference Patent Document 2, which also normalizes the noise model based on the multiplicative noise feature quantity and then generates a normalized noise-superimposed speech model.
(Reference Patent Document 2: Japanese Patent No. 5200080)

The signal processing performed by the signal processing unit 13 is typically noise suppression. Signal processing other than noise suppression may also be used, for example AGC (Automatic Gain Control), CMN (Cepstrum Mean Normalization), or an equalizer.

<AGC>
Automatic Gain Control (AGC) detects the input signal level from the short-time average power or short-time average amplitude of the input speech signal and adjusts the gain of the speech input stage so that the difference between the input signal level and the optimum level (target value) becomes small. AGC prevents the speech waveform after A/D conversion from becoming too small or too large, which would make the speech features unclear. AGC is disclosed, for example, in paragraph [0001] of Reference Patent Document 3.
(Reference Patent Document 3: Japanese Patent No. 3588555)
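A block-wise gain control loop along the lines described above might look like the following sketch. It is a simplified software illustration rather than the circuit of Reference Patent Document 3; the block size, target RMS, gain limit, smoothing factor, and the function name `agc` are all assumed values chosen for the example.

```python
import math


def agc(samples, target_rms=0.1, block=160, max_gain=10.0, smoothing=0.5):
    """Scale samples block by block toward a target short-time RMS level."""
    out = []
    gain = 1.0
    for start in range(0, len(samples), block):
        chunk = samples[start:start + block]
        # Short-time level of this block.
        rms = math.sqrt(sum(s * s for s in chunk) / len(chunk))
        # Gain that would bring the block to the target level (capped);
        # silent blocks keep the current gain.
        desired = min(target_rms / rms, max_gain) if rms > 0 else gain
        # Move the gain part of the way toward the desired value.
        gain += smoothing * (desired - gain)
        out.extend(s * gain for s in chunk)
    return out
```

Smoothing the gain instead of jumping to the desired value avoids audible level steps between blocks; setting `smoothing=1.0` disables that behavior.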

<CMN>
Cepstrum Mean Normalization (CMN) is a process that, for the cepstrum used as a voice recognition feature, computes the long-term cepstrum average of the input speech signal and subtracts it from the cepstrum of each frame of the input speech. CMN is used to reduce the influence of multiplicative distortion, typified by the microphone characteristics, the microphone position, and the shape of the room. CMN is disclosed, for example, in paragraph [0010] of Reference Patent Document 1.
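The subtraction described above is short enough to write out in full. The sketch below assumes the cepstra of one utterance are available as a list of equal-length coefficient lists and that the "long-term" average is taken over that whole utterance; the function name `cepstral_mean_normalize` is illustrative.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames from every frame."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    # Long-term cepstrum average, one value per coefficient.
    mean = [sum(frame[d] for frame in frames) / n_frames
            for d in range(n_coeffs)]
    return [[frame[d] - mean[d] for d in range(n_coeffs)] for frame in frames]
```

Because a fixed channel multiplies the spectrum, it adds a constant offset in the cepstral domain, which is why removing the mean reduces its effect.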

When the signal processing unit 13 of the client device 10 performs CMN, the client device 10 can send the voice recognition server device the MFCCs (mel-frequency cepstral coefficients) after CMN as the signal derived from the acoustic signal for voice recognition. This spares the voice recognition server device from repeating the cepstrum analysis.

<Equalizer>
An equalizer adjusts the gain of the input speech signal for each frequency band. For example, if the acoustic characteristics of the microphone used for speech input are known in advance not to be flat, passing the signal through an equalizer allows sound to be picked up with improved acoustic characteristics. Equalizers are disclosed, for example, in paragraphs [0010] and [0016] of Reference Patent Document 4.
(Reference Patent Document 4: Japanese Patent No. 2865268)
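Per-band gain adjustment can be sketched in the frequency domain as below. This is a generic illustration, not the equalizer of Reference Patent Document 4: the representation of bands as `(start_bin, end_bin, gain_db)` tuples and the function name `equalize` are assumptions.

```python
def equalize(magnitudes, bands):
    """Apply a gain in dB to each band of a magnitude spectrum.

    magnitudes: per-bin magnitude values of one frame
    bands:      list of (start_bin, end_bin, gain_db), end_bin exclusive
    """
    out = list(magnitudes)
    for start, end, gain_db in bands:
        # Convert the dB gain to a linear amplitude factor.
        factor = 10.0 ** (gain_db / 20.0)
        for k in range(start, min(end, len(out))):
            out[k] *= factor
    return out
```

To flatten a microphone whose response is, say, 6 dB low in some band, one would pass a gain of +6 dB for the bins of that band.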

Next, the transmission unit 14 transmits the acoustic signal, or a signal derived from it, to the voice recognition server device corresponding to the extracted sound collection condition (the voice recognition server device selected in step S12) (S14). In doing so, the transmission unit 14 uses different transmission destinations depending on whether the signal processing of step S13 was applied, and transmits either the unprocessed acoustic signal or the processed acoustic signal. Alternatively, the transmission destination may be chosen from among the different voice recognition server devices solely according to whether the signal processing of step S13 was applied, regardless of the voice recognition server device selected in step S12. A signal derived from the acoustic signal means, for example, a signal representing feature quantities of the acoustic signal or an acoustic signal that has undergone the signal processing of step S13. When transmitting the acoustic signal or a signal derived from it, the transmission unit 14 may also send the voice recognition server device the sound collection condition (group), its thresholds, and information on whether the signal processing unit 13 applied signal processing. From the sound collection condition (group), its thresholds, and the presence or absence of signal processing, the voice recognition server device can record under which sound collection condition or signal processing condition it was selected.

The voice recognition server devices 21-1, ..., 21-n, ..., 21-N receive the acoustic signal or the signal derived from the acoustic signal from the client device 10 (S21A). The voice recognition server device that has received it (for example, the voice recognition server device 21-1) executes voice recognition processing (S21B).

<Voice recognition processing (S21B)>
The voice recognition processing of step S21B is executed, for example, as follows. The voice recognition server device converts an utterance of one sentence or one word into a character string. As speech features, it uses the power of the speech and its variation, as well as MFCCs (Mel-Frequency Cepstral Coefficients) and their dynamic variation. It searches for a word string using statistical acoustic models, language models, and the like.

The voice recognition server device that has executed the voice recognition processing of step S21B transmits the voice recognition result to the client device 10 (S21C). The receiving unit 15 of the client device 10 receives the voice recognition result (S15A). The presentation unit 16 of the client device 10 presents the received voice recognition result (S16).

The information update operation of the voice recognition system 1 of this embodiment is described below with reference to FIGS. 5 and 6. FIG. 5 is a sequence diagram showing the information update operation of the voice recognition system 1 of this embodiment. FIG. 6 is a flowchart showing the operation of the rephrasing determination unit 17 of the voice recognition system 1 of this embodiment. The rephrasing determination unit 17 monitors for and acquires a repeated signal group, which is a group of signals in which the user is observed repeating an utterance with the same content multiple times (S17). To recognize such repetitions, the components of the rephrasing determination unit 17 (see FIG. 2) execute, for example, the following processing (see FIG. 6). Assume that a total of M acoustic signals (M is an integer satisfying M ≥ 2) have been input to the client device 10, and let m be an integer satisfying 2 ≤ m ≤ M. The following describes the case in which the components of the rephrasing determination unit 17 determine whether the m-th acoustic signal is a rephrasing.

The reaction time measurement unit 17A measures the reaction time, which is the difference between the time at which the client device 10 presented the voice recognition result for the (m-1)-th acoustic signal input to the client device 10 (hereinafter, the (m-1)-th presentation time) and the time at which the m-th acoustic signal was input to the client device 10 (hereinafter, the m-th input time) (S17A).

The reliability acquisition unit 17B acquires the reliability of the voice recognition result for the (m-1)-th acoustic signal input to the client device 10 (hereinafter, the (m-1)-th voice recognition result) (S17B). The reliability is information transmitted from the voice recognition server device to the client device 10 together with the character string information representing the voice recognition result.

The similarity calculation unit 17C calculates at least one of the following: the similarity between the (m-1)-th and m-th acoustic signals input to the client device 10, and the similarity between the voice recognition results for those two acoustic signals (S17C).

Based on at least one of the reaction time, reliability, and similarity obtained in steps S17A to S17C, the determination unit 17D determines whether the m-th acoustic signal input to the client device 10 is a rephrasing, and acquires a repeated signal group based on the result of the determination (S17D). For example, when the user's reaction time is shorter than a predetermined threshold, the reliability of the (m-1)-th voice recognition result is lower than a predetermined threshold, and the similarity is higher than a predetermined threshold, the determination unit 17D determines that the (m-1)-th voice recognition result is a misrecognition and that the utterance contained in the m-th acoustic signal is a rephrasing by the user (a repetition of the same utterance). In that case, the determination unit 17D acquires the (m-1)-th and m-th input acoustic signals as a repeated signal group.

Steps S17A to S17C need not be executed in the order given above; their order may be changed. For example, if steps S17A to S17C are executed starting with the one with the lightest processing load, step S17B is executed first, then step S17A, and finally step S17C. The determination of step S17D may also be made after each of steps S17A to S17C, and the subsequent steps may be skipped once the signal is determined not to be a rephrasing. With the lightest-first order described above (S17B → S17A → S17C), if the first step (S17B) yields a determination that the signal is not a rephrasing, the remaining two steps (S17A, S17C) are omitted; if the second step (S17A) yields that determination, the remaining step (S17C) is omitted. This reduces the load on the client device 10.
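The early-exit ordering described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `is_rephrase`, the `Thresholds` fields, and all numeric values are hypothetical, and each measurement is passed in as a callable so that skipped steps are never evaluated.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Thresholds:
    # All values are illustrative assumptions, not taken from the source.
    max_reliability: float = 0.6   # S17B: below this, the previous result looks wrong
    max_reaction_s: float = 3.0    # S17A: below this, the user reacted quickly
    min_similarity: float = 0.5    # S17C: above this, the two utterances look alike


def is_rephrase(
    get_reliability: Callable[[], float],
    get_reaction_time: Callable[[], float],
    get_similarity: Callable[[], float],
    th: Thresholds = Thresholds(),
) -> bool:
    """Run S17B -> S17A -> S17C, lightest first, stopping at the first 'no'."""
    if get_reliability() >= th.max_reliability:    # S17B
        return False                               # skip S17A and S17C
    if get_reaction_time() >= th.max_reaction_s:   # S17A
        return False                               # skip S17C
    return get_similarity() > th.min_similarity    # S17C
```

Passing callables rather than precomputed values is a design choice made here so that the most expensive step (the similarity computation) runs only when the two cheaper checks have not already ruled out a rephrasing.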

The similarity in step S17C can be, for example, either of the following.
・The reciprocal, or the sign-inverted value, of the Euclidean distance between the feature values of the two acoustic signals (the cepstrum, the power, or their variations can be used as feature values).
・The reciprocal, or the sign-inverted value, of the edit distance between the character strings of the voice recognition results obtained from the voice recognition server device. The character string here is not limited to the written form; it may be a string obtained by converting the reading of the text into kana or phoneme notation.
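The two similarity measures listed above can be sketched as follows. The function names are hypothetical, and the reciprocal form uses 1 / (1 + d) rather than 1 / d; that offset is an assumption added here to avoid division by zero when two inputs are identical, and is not stated in the source.

```python
import math
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def similarity_from_distance(d: float, mode: str = "negate") -> float:
    """Turn a distance into a similarity: larger means more alike."""
    if mode == "negate":
        return -d
    if mode == "reciprocal":
        return 1.0 / (1.0 + d)  # the +1 is an assumption, see lead-in
    raise ValueError(f"unknown mode: {mode}")


def text_similarity(a: str, b: str, mode: str = "negate") -> float:
    """Similarity between two recognition results (or their kana readings)."""
    return similarity_from_distance(edit_distance(a, b), mode)


def feature_similarity(x: Sequence[float], y: Sequence[float],
                       mode: str = "negate") -> float:
    """Similarity between two equal-length feature vectors."""
    return similarity_from_distance(math.dist(x, y), mode)
```

On kana readings, "きりゅう" and "ちりゅう" differ by a single substitution, so their edit distance is 1, whereas "きりゅう" and "くどう" are three edits apart.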

Voice recognition of the repeated signal group is executed by the voice recognition server device selected in step S12 (S21A to S21C).

The reaction time is used for the rephrasing determination because, when the user is rephrasing, the time between seeing the previous recognition result and making the next utterance tends to be shorter than otherwise. The reliability is used because, when a recognition result is wrong, the reliability of that voice recognition result tends to be low. The similarity is used because, when the user is rephrasing, the similarity between the utterances tends to be high. Since the user's reaction time is the time the user takes to read and understand the recognition result, the threshold for judging the reaction time may be set proportional to the number of characters presented as the recognition result. When the recognition result contains kanji, the threshold may be made longer according to the number of kanji characters.
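A reaction-time threshold that scales with the presented text, as suggested above, might look like the following sketch. The per-character constants and the function names are hypothetical, and kanji are detected here only within the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is a simplifying assumption.

```python
def is_kanji(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def reaction_threshold(text: str,
                       per_char_s: float = 0.3,
                       extra_per_kanji_s: float = 0.2) -> float:
    """Threshold in seconds: proportional to length, longer for each kanji."""
    n_kanji = sum(is_kanji(c) for c in text)
    return per_char_s * len(text) + extra_per_kanji_s * n_kanji
```

With these illustrative constants, a two-kanji result such as "桐生" gets a longer threshold than a two-kana result of the same length.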

The rephrasing determination unit 17 (determination unit 17D) extracts the voice recognition result of the last signal in the repeated signal group as the correct answer candidate (S17). The rephrasing determination unit 17 (determination unit 17D) treats the entire repeated signal group as the relearning signal group and stores it in the rephrasing information storage unit 171 in association with the correct answer candidate. The whole repeated signal group is used as the relearning signal group because a voice recognition server device that can produce, for every signal in the group, a voice recognition result equal or highly similar to the correct answer candidate is a suitable transmission destination for the corresponding acoustic signals.

An example of the rephrasing determination operation of the rephrasing determination unit 17 is described below with reference to FIG. 7. FIG. 7 illustrates the rephrasing determination operation of the voice recognition system 1 of this embodiment. The voice recognition system 1 of this embodiment can recognize whole sentences and is not limited to single words, but a single-word example is used here to make the main points easier to follow. As shown in FIG. 7, suppose that the user 9 of the client device 10 says "Kiryu" (桐生) to the client device 10 (hereinafter, utterance 1). The client device 10 transmits the acoustic signal containing utterance 1 to the voice recognition server device selected in step S12 (here, the voice recognition server device 21-n). The voice recognition server device 21-n recognizes the acoustic signal containing utterance 1 and returns the voice recognition result "Chiryu" (知立) to the client device 10 (hereinafter, recognition result 1). The client device 10 presents recognition result 1 to the user 9.

Suppose that the user 9 notices that the presented recognition result 1 is a misrecognition and again says "Kiryu" (桐生) to the client device 10 in the same way as before (hereinafter, utterance 2). The client device 10 transmits the acoustic signal containing utterance 2 to the voice recognition server device 21-n. The voice recognition server device 21-n recognizes the acoustic signal containing utterance 2 and returns the voice recognition result "Kiryu" (桐生) to the client device 10 (hereinafter, recognition result 2). The client device 10 presents recognition result 2 to the user 9.

Suppose that the user 9 looks at the presented recognition result 2, confirms that voice recognition was performed correctly, and this time says "Kudo" (工藤) to the client device 10 (hereinafter, utterance 3). The client device 10 transmits the acoustic signal containing utterance 3 to the voice recognition server device 21-n. The voice recognition server device 21-n recognizes the acoustic signal containing utterance 3 and returns the voice recognition result "Kudo" (工藤) to the client device 10 (hereinafter, recognition result 3). The client device 10 presents recognition result 3 to the user 9.

In the above example, the reaction time measurement unit 17A measures the reaction time that is the difference between the presentation time of recognition result 1 and the input time of the acoustic signal containing utterance 2 (hereinafter, reaction time 1) (S17A). The reliability acquisition unit 17B acquires the reliability of recognition result 1 (S17B). The similarity calculation unit 17C calculates at least one of the similarity between the acoustic signal containing utterance 1 and the acoustic signal containing utterance 2, and the similarity between recognition result 1 and recognition result 2 (S17C). In this case, reaction time 1 is shorter than the predetermined threshold, the reliability of recognition result 1 is lower than the predetermined threshold, and the similarity between the two acoustic signals, or between recognition result 1 and recognition result 2, is higher than the predetermined threshold. From these and similar observations, the determination unit 17D determines that recognition result 1 is a misrecognition and that utterance 2 is a rephrasing by the user (a repetition of the same utterance) (S17D).

Similarly, reaction time 2 (measured in the same way between the presentation of recognition result 2 and the input of the acoustic signal containing utterance 3) is not shorter than the predetermined threshold (reaction time 2 is sufficiently long), the reliability of recognition result 2 is not lower than the predetermined threshold (it is sufficiently high), and the similarity between the acoustic signal containing utterance 2 and the acoustic signal containing utterance 3, or between recognition result 2 and recognition result 3, is not higher than the predetermined threshold (the two acoustic signals or recognition results are sufficiently dissimilar). From these and similar observations, the determination unit 17D determines that recognition result 2 is the correct answer candidate and that utterance 3 is not a rephrasing by the user (not a repetition of the same utterance) (S17D). In this case, the acoustic signal containing utterance 1 and the acoustic signal containing utterance 2 constitute the relearning signal group.
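The grouping in this example, where utterances 1 and 2 form the relearning signal group and recognition result 2 becomes the correct answer candidate, can be sketched as follows. The function name, the signal identifiers, and the flag representation are hypothetical; the flags stand for the outcome of step S17D for each utterance. Emitting a group that is still open at the end of the input is also an assumption made here.

```python
from typing import List, Sequence, Tuple

Utterance = Tuple[str, str]  # (signal identifier, recognition result)


def collect_relearning_groups(
    utterances: Sequence[Utterance],
    rephrase_flags: Sequence[bool],
) -> List[Tuple[List[str], str]]:
    """Group consecutive rephrasings.

    rephrase_flags[m] is True when utterance m was judged a rephrasing of
    utterance m-1 (the first flag is ignored). Each returned pair is
    (relearning signal group, correct answer candidate), where the
    candidate is the recognition result of the last signal in the group.
    """
    if not utterances:
        return []
    groups: List[List[Utterance]] = []
    current = [utterances[0]]
    for utt, flag in zip(utterances[1:], rephrase_flags[1:]):
        if flag:
            current.append(utt)          # same content repeated
        else:
            if len(current) >= 2:
                groups.append(current)   # a repetition run just ended
            current = [utt]
    if len(current) >= 2:                # assumption: emit a trailing run too
        groups.append(current)
    return [([sig for sig, _ in g], g[-1][1]) for g in groups]
```

For the three utterances of FIG. 7 with flags `[False, True, False]`, this yields one group containing the first two signals, paired with "桐生".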

Next, the transmission unit 14 transmits the pair of the correct answer candidate and the relearning signal group to the management unit 30 (S14B).

The correct answer candidate receiving unit 30A of the management unit 30 receives the pair of the correct answer candidate and the relearning signal group from the client device 10 (S30A). The pair is stored in the correct answer candidate storage unit 30F. The relearning signal group transmission unit 30B of the management unit 30 transmits the relearning signal group to the voice recognition server device group 20 (all of the voice recognition server devices) (S30B).

The voice recognition server device group 20 receives the relearning signal group from the management unit 30 (S21D). The voice recognition server device group 20 performs voice recognition on the received relearning signal group (S21E). The voice recognition server device group 20 transmits the voice recognition results to the management unit 30 (S21F).

The voice recognition result receiving unit 30C of the management unit 30 receives the voice recognition results for the relearning signal group from all of the voice recognition server devices (S30C). The transmission destination information update unit 30D of the management unit 30 updates the transmission destination information based on the similarity between the correct answer candidate and each of the voice recognition results received from all of the voice recognition server devices (S30D). The transmission destination information is information on the relationship between the sound collection conditions and the voice recognition server devices that serve as transmission destinations of acoustic signals.

Typically, the transmission destination information update unit 30D updates the transmission destination information so that acoustic signals from the client device 10 are sent to a voice recognition server device for which the similarities between the correct answer candidate and the voice recognition results for the relearning signal group (a group of L signals, L ≥ 2), that is, the first similarity, the second similarity, ..., the L-th similarity, are all high, or for which the reliabilities of the voice recognition results for the relearning signal group are all high (S30D). In other words, in step S30D, a voice recognition server device that can produce a voice recognition result equal (or highly similar) to the correct answer candidate for every relearning signal in the relearning signal group is selected as a suitable transmission destination. For example, if the utterance "Kiryu" (桐生) described above is repeated so that it is made L times in total, there are L relearning signals containing the utterance "Kiryu" (桐生). A suitable transmission destination is then a voice recognition server device that can produce, for every one of these L relearning signals, a voice recognition result equal or highly similar to the correct answer candidate "Kiryu" (桐生). Alternatively, in step S30D, a voice recognition server device that can produce a highly reliable voice recognition result for every relearning signal in the relearning signal group may be selected as a suitable transmission destination. In the example above, this is a voice recognition server device that can produce a highly reliable voice recognition result for every one of the L relearning signals containing the utterance "Kiryu" (桐生).

The transmission destination information update unit 30D may instead update the transmission destination information so that acoustic signals from the client device 10 are sent to the voice recognition server device for which the similarity between the correct answer candidate and the voice recognition results for the relearning signal group (the first similarity, the second similarity, ..., the L-th similarity) exceeds a predetermined threshold the largest number of times, or to the voice recognition server device for which the reliability of the voice recognition results for the relearning signal group exceeds a predetermined threshold the largest number of times (S30D). In other words, in step S30D, the voice recognition server device that can produce the largest number of voice recognition results equal (or highly similar) to the correct answer candidate for the relearning signal group is selected as a suitable transmission destination. In the example of L relearning signals containing the utterance "Kiryu" (桐生), a suitable transmission destination is the voice recognition server device that produces the largest number of voice recognition results equal or highly similar to the correct answer candidate "Kiryu" (桐生). Alternatively, in step S30D, the voice recognition server device that can produce the largest number of highly reliable voice recognition results for the relearning signal group may be selected as a suitable transmission destination; in the example above, this is the voice recognition server device that produces the largest number of highly reliable voice recognition results for the L relearning signals containing the utterance "Kiryu" (桐生).
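The two selection criteria described for step S30D, "high for every relearning signal" and "high for the most relearning signals", can be sketched as follows. The function names, the server identifiers, the default exact-match similarity, and the threshold value are all hypothetical; a reliability-based variant would apply the same counting to reliability scores instead of similarities.

```python
from typing import Callable, Dict, List, Optional


def exact_match(result: str, correct: str) -> float:
    """Simplest possible similarity: 1.0 if identical, otherwise 0.0."""
    return 1.0 if result == correct else 0.0


def pick_destination(
    results: Dict[str, List[str]],
    correct: str,
    strategy: str = "all",
    similarity: Callable[[str, str], float] = exact_match,
    threshold: float = 0.5,
) -> Optional[str]:
    """Choose a server from its results on the L relearning signals.

    results maps a server identifier to its L recognition results.
    "all":  first server whose every result exceeds the threshold.
    "most": server with the largest count of results above the threshold.
    Returns None when no server qualifies.
    """
    hits = {
        server: sum(similarity(r, correct) > threshold for r in rs)
        for server, rs in results.items()
    }
    if strategy == "all":
        for server, rs in results.items():
            if rs and hits[server] == len(rs):
                return server
        return None
    if strategy == "most":
        return max(hits, key=hits.get) if hits else None
    raise ValueError(f"unknown strategy: {strategy}")
```

With a server that returns one "知立" and one "桐生" and another that returns "桐生" twice, both strategies pick the second server when the correct answer candidate is "桐生".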

An example of the transmission destination information update operation of the management unit 30 is described below with reference to FIG. 8. FIG. 8 illustrates the transmission destination information update operation of the voice recognition system 1 of this embodiment. As shown in FIG. 8, the client device 10 transmits to the management unit 30 the relearning signal group, consisting of the acoustic signal containing utterance 1 and the acoustic signal containing utterance 2, paired with recognition result 2, which is the correct answer candidate. The management unit 30 transmits the relearning signal group (the acoustic signals containing utterance 1 and utterance 2) to the voice recognition server device 21-b and the voice recognition server device 21-c. Suppose that the voice recognition server device 21-b returns recognition results 1b and 2b for the acoustic signals containing utterance 1 and utterance 2, and that at least one of recognition results 1b and 2b is the misrecognition "Chiryu" (知立). Suppose, on the other hand, that the voice recognition server device 21-c returns recognition results 1c and 2c for the same two acoustic signals, and that both recognition results 1c and 2c are "Kiryu" (桐生), which is equal to the correct answer candidate.

In the example of FIG. 8, recognition result 2 (the correct answer candidate) is equal to recognition results 1c and 2c, and the reliabilities of recognition results 1c and 2c are at or above the predetermined threshold. The management unit 30 therefore determines that the voice recognition server device 21-c is the optimal transmission destination, and that the recognition settings (referred to as recognition settings C) and acoustic model held by that device are optimal for the corresponding acoustic signals. It updates the transmission destination information so that the voice recognition server device 21-c becomes the transmission destination of the acoustic signals.

Next, the transmission destination information transmission unit 30E of the management unit 30 transmits the updated transmission destination information to the client device 10 (S30E).

The receiving unit 15 of the client device 10 receives the transmission destination information from the management unit 30 (S15B). Based on the received transmission destination information, the transmission destination changing unit 18 of the client device 10 changes the relationship between the sound collection conditions and the voice recognition server devices that serve as transmission destinations of acoustic signals (S18). The transmission destination changing unit 18 can make this change, for example, by overwriting the transmission destination information already stored in the transmission destination storage unit 121 with the newly received transmission destination information.
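One way to picture the stored transmission destination information and its overwrite in step S18 is a small lookup table, sketched below. The class name, the key layout (sound collection condition group plus a flag for whether step S13 signal processing was applied), and the group labels are hypothetical; the source does not specify how the information is represented.

```python
from typing import Dict, Optional, Tuple

# Key: (sound collection condition group, whether S13 processing was applied).
RouteKey = Tuple[str, bool]


class DestinationTable:
    """In-memory stand-in for the transmission destination storage unit."""

    def __init__(self, routes: Dict[RouteKey, str]) -> None:
        self._routes = dict(routes)

    def select(self, group: str, processed: bool) -> Optional[str]:
        """Look up the server for a condition; None if no route is stored."""
        return self._routes.get((group, processed))

    def overwrite(self, new_routes: Dict[RouteKey, str]) -> None:
        """Step S18: replace the stored information with the received one."""
        self._routes = dict(new_routes)
```

After an overwrite, later lookups for the same sound collection condition resolve to the newly assigned server, which is the effect the update is meant to achieve.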

As described above, in the voice recognition system 1 of this embodiment, all of the voice recognition server devices perform voice recognition on the relearning signal group. The management unit 30 updates the transmission destination information so that the new transmission destination is a voice recognition server device that returned a voice recognition result equal (or highly similar) to the correct answer candidate for every relearning signal, or one that returned a highly reliable voice recognition result for every relearning signal. The client device 10 then changes the voice recognition server device used as the transmission destination based on the updated transmission destination information. Because the transmission destination information is thereby optimized (relearned) in the direction that reduces the number of misrecognitions across the whole system, the utilization efficiency of the system can be improved.

The following describes a voice recognition system according to Embodiment 2, which achieves the same effect as Embodiment 1 by updating threshold values instead of updating the transmission destination information. First, the configuration of the voice recognition system of this embodiment is described with reference to FIGS. 9 and 10. FIG. 9 is a block diagram showing the configuration of the voice recognition system 2 of this embodiment. FIG. 10 is a block diagram showing the configuration of the management unit 50 of the voice recognition system 2 of this embodiment. As shown in FIG. 9, the voice recognition system 2 of this embodiment includes a client device 40, a plurality of voice recognition server devices 21-1, ..., 21-n, ..., 21-N, and a management unit 50. There may be more than one client device 40. The client device 40 and the voice recognition server device group 20 are connected via a network so that they can communicate wirelessly or by wire. The management unit 50 may be configured as standalone hardware (a device), in which case it may be called the management device 50. When the management unit 50 is configured as standalone hardware, the client device 40, the voice recognition server device group 20, and the management unit 50 (management device 50) are connected via a network so that they can communicate wirelessly or by wire. The management unit 50 may also be a component within the client device 40, or a component within any one of the voice recognition server devices in the voice recognition server device group 20.

As shown in FIG. 9, the client device 40 of this embodiment includes a threshold changing unit 48 in place of the transmission destination changing unit 18 of the client device 10 of the first embodiment. The other components of the client device 40 are the same as those of the client device 10 of the first embodiment, so their description is omitted.

As shown in FIG. 10, the management unit 50 of this embodiment includes a threshold update unit 50D and a threshold transmission unit 50E in place of the transmission destination information update unit 30D and the transmission destination information transmission unit 30E of the management unit 30 of the first embodiment. The management unit 50 also includes a threshold storage unit 50G and a signal processing unit 50H, which the management unit 30 of the first embodiment does not have. The components other than the threshold update unit 50D, the threshold transmission unit 50E, the threshold storage unit 50G, and the signal processing unit 50H are the same as those of the management unit 30 of the first embodiment, so their description is omitted.

The voice recognition operation of the voice recognition system 2 of this embodiment is exactly the same as that of the first embodiment (S11 to S14A, S21A to S21C, S15A, and S16), so its description is omitted.

The information update operation of the voice recognition system 2 of this embodiment is described below with reference to FIG. 11. FIG. 11 is a sequence diagram showing the information update operation of the voice recognition system 2 of this embodiment.

Steps S17, S14B, and S30A are executed in the same manner as in the first embodiment. Next, the signal processing unit 50H takes every signal processing pattern that is predetermined according to the sound collection condition for the signal processing of step S13 described above (including the pattern in which no signal processing is performed) and applies it to the acoustic signal as it was before the signal processing of step S13. It thereby obtains signal-processed acoustic signals, each subjected to the signal processing for a different sound collection condition. In this embodiment, these signal-processed acoustic signals are used as the relearning signal group (S50H).

If the increased load caused by the signal processing of step S50H is a problem, step S50H may be omitted. In that case, the acoustic signals processed in step S13 are used as the relearning signal group as they are, accepting that the signal processing applied to the relearning signal group may differ from the signal processing applied to acoustic signals in step S13 after the threshold change described later. Step S30B is the same as in the first embodiment, except that when step S50H is executed, each acoustic signal (each relearning signal) processed in S50H for a given sound collection condition is distributed to the voice recognition server device in charge of that sound collection condition. Step S30B in this case can be regarded as the management unit 50 reproducing steps S13 to S14 for every sound collection condition pattern.
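Steps S50H and S30B can be pictured with a small sketch. The condition labels, the list-of-floats signal representation, and the function names are assumptions introduced only for illustration; the document does not specify the actual signal processing patterns.

```python
from typing import Callable, Dict, List

Signal = List[float]
Processor = Callable[[Signal], Signal]


def no_processing(signal: Signal) -> Signal:
    """The pattern in which no signal processing is performed."""
    return list(signal)


def build_relearning_signals(raw: Signal, patterns: Dict[str, Processor]) -> Dict[str, Signal]:
    """S50H: apply every per-condition pattern to the unprocessed signal."""
    return {condition: process(raw) for condition, process in patterns.items()}


def distribute(signals: Dict[str, Signal], dest_info: Dict[str, str]) -> Dict[str, List[Signal]]:
    """S30B: group each processed signal under the server in charge of its condition."""
    per_server: Dict[str, List[Signal]] = {}
    for condition, signal in signals.items():
        per_server.setdefault(dest_info[condition], []).append(signal)
    return per_server
```

With a hypothetical pattern table such as `{"quiet": no_processing, "noisy": some_denoiser}`, each server receives only the variant of the signal it would have seen had that sound collection condition been extracted in step S12.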

Steps S21D to S21F and step S30C are then executed in the same manner as in the first embodiment.

Next, the threshold update unit 50D updates the threshold on the basis of the similarity between the correct answer candidate and each voice recognition result received from all the voice recognition server devices, or on the basis of the reliability of each voice recognition result (S50D). As described above, the threshold is a value set in advance for extracting the sound collection condition. Typically, the threshold update unit 50D updates the threshold so that acoustic signals from the client device 10 described above are transmitted to a voice recognition server device for which the similarities between the correct answer candidate and the voice recognition results for the relearning signal group (the first similarity, the second similarity, ..., the L-th similarity) are all high, or for which the reliabilities of the voice recognition results for the relearning signal group are all high (S50D). Next, the threshold transmission unit 50E of the management unit 50 transmits the updated threshold to the client device 40 (S50E).
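One possible shape of the threshold update in S50D is sketched here under strong simplifying assumptions that are not taken from the document: the sound collection condition is extracted by comparing a single scalar feature (for instance an SNR estimate) against one threshold, there are only two conditions, and the new threshold is chosen from a fixed list of candidates. All names are hypothetical.

```python
from typing import Dict, List


def condition_for(feature: float, threshold: float) -> str:
    """Assumed rule: one scalar feature split into two conditions by one threshold."""
    return "high" if feature >= threshold else "low"


def update_threshold(
    current: float,
    candidates: List[float],
    features: List[float],
    dest_info: Dict[str, str],
    best_server: str,
) -> float:
    """Pick the candidate nearest to `current` that routes every relearning
    signal (represented by its feature value) to `best_server`.
    The current threshold is kept if no candidate achieves that."""
    def routes_to_best(threshold: float) -> bool:
        return all(dest_info[condition_for(f, threshold)] == best_server for f in features)

    usable = [t for t in candidates if routes_to_best(t)]
    if not usable:
        return current
    return min(usable, key=lambda t: abs(t - current))
```

Choosing the nearest usable candidate keeps the change small, so signals far from the relearning signals continue to be routed as before; this is a design choice of the sketch, not a requirement stated in the text.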

The receiving unit 15 of the client device 40 receives the threshold from the management unit 50 (S15B). The threshold changing unit 48 of the client device 40 changes the preset threshold on the basis of the received threshold (S48). The threshold changing unit 48 can make this change by, for example, overwriting the threshold stored in the threshold storage unit 111 with the newly received threshold.

As described above, in the voice recognition system 2 of this embodiment, all the voice recognition server devices perform voice recognition on the relearning signal group. The management unit 50 updates the threshold so that the new transmission destination is a voice recognition server device that returned, for every relearning signal, a voice recognition result equal to (highly similar to) the correct answer candidate, or a voice recognition server device that returned a highly reliable voice recognition result for every relearning signal. The client device 40 then replaces the preset threshold with the new threshold on the basis of the updated threshold. Because the threshold is thereby optimized (relearned) in the direction that reduces the number of misrecognitions in the system as a whole, the utilization efficiency of the system can be improved.

The following describes the voice recognition system of a third embodiment, which, instead of changing the destination voice recognition server device, updates (replaces or relearns) the acoustic model and the voice recognition settings stored in a voice recognition server device. The configuration of the system is described first with reference to FIGS. 12 and 13. FIG. 12 is a block diagram showing the configuration of the voice recognition system 3 of this embodiment. FIG. 13 is a block diagram showing the configuration of the management unit 90 of the voice recognition system 3 of this embodiment. As shown in FIG. 12, the voice recognition system 3 includes a client device 70, a plurality of voice recognition server devices 81-1, ..., 81-n, ..., 81-N, and a management unit 90. The voice recognition server devices 81-1, ..., 81-n, ..., 81-N are collectively referred to as the voice recognition server device group 80. There may be a plurality of client devices 70. The client device 70 and the voice recognition server device group 80 are assumed to be communicably connected via a network, wirelessly or by wire. The management unit 90 may be configured as standalone hardware (a device), in which case it may be called the management device 90. In that case, the client device 70, the voice recognition server device group 80, and the management unit 90 (management device 90) are assumed to be communicably connected via a network, wirelessly or by wire. Alternatively, the management unit 90 may be a component of the client device 70, or a component of any one of the voice recognition server devices in the voice recognition server device group 80.

As shown in FIG. 12, the client device 70 of this embodiment does not include the transmission destination changing unit 18 of the client device 10 of the first embodiment. The other components of the client device 70 are the same as those of the client device 10 of the first embodiment, so their description is omitted.

As shown in FIG. 13, the management unit 90 of this embodiment includes a setting information update unit 90D and a setting information transmission unit 90E in place of the transmission destination information update unit 30D and the transmission destination information transmission unit 30E of the management unit 30 of the first embodiment. The components other than the setting information update unit 90D and the setting information transmission unit 90E are the same as those of the management unit 30 of the first embodiment, so their description is omitted.

The voice recognition operation of the voice recognition system 3 of this embodiment is exactly the same as that of the first embodiment (S11 to S14A, S21A to S21C, S15A, and S16), so its description is omitted.

The information update operation of the voice recognition system 3 of this embodiment is described below with reference to FIG. 14. FIG. 14 is a sequence diagram showing the information update operation of the voice recognition system 3 of this embodiment. Steps S17 to S14B, S30A to S30B, S21D to S21F, and S30C are executed in the same manner as in the first embodiment.

Next, the setting information update unit 90D of the management unit 90 updates the setting information of the voice recognition server device selected in step S12, on the basis of the similarity between the correct answer candidate and each voice recognition result received from all the voice recognition server devices, or on the basis of the reliability of each voice recognition result (S90D). The setting information is information about voice recognition settings and includes information specifying an acoustic model and information specifying settings related to voice recognition. The setting information may include the acoustic model itself. Typically, the setting information update unit 90D updates the setting information of the voice recognition server device selected in step S12 so that it has the same voice recognition settings and the same acoustic model as a voice recognition server device for which the similarities between the correct answer candidate and the voice recognition results for the relearning signal group (the first similarity, the second similarity, ..., the L-th similarity) are all high, or for which the reliabilities of the voice recognition results for the relearning signal group are all high (S90D). The setting information transmission unit 90E of the management unit 90 transmits the updated setting information to the voice recognition server device selected in step S12 (S90E).
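The setting information update of S90D can be sketched as copying the settings of the best-scoring server onto the server selected in step S12. The `SettingInfo` fields, the server ids, and the worst-case scoring rule are assumptions for illustration; the scores stand for either similarities to the correct answer candidate or reliabilities, one per relearning signal.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SettingInfo:
    """Hypothetical container: which acoustic model to load plus recognizer options."""
    acoustic_model: str
    decoder_options: tuple = ()


def update_setting_info(
    settings: Dict[str, SettingInfo],
    scores: Dict[str, List[float]],
    selected: str,
) -> SettingInfo:
    """Return the setting information to send to the `selected` server.

    `scores[server]` holds one score per relearning signal. A server counts as
    better only if its worst score exceeds the selected server's worst score.
    """
    best = max(scores, key=lambda server: min(scores[server]))
    if min(scores[best]) > min(scores[selected]):
        return settings[best]
    return settings[selected]
```

Because the function returns a `SettingInfo` rather than mutating the table, the caller decides when the new settings are transmitted (S90E) and applied.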

The voice recognition server device selected in step S12 receives the setting information (S80G) and changes its own voice recognition settings (the acoustic model and the settings related to voice recognition) on the basis of the received setting information (S80H). The voice recognition server device group 80 may change the settings during a time period in which its workload is low (for example, at night). The voice recognition server device group 80 may also change the settings during some other time period planned in advance.
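Deferring the settings change to a low-workload period could look like the following sketch. The class, the string used to stand in for setting information, and the explicit maintenance window are hypothetical; the document only says the change may be made at night or at another planned time.

```python
from datetime import time
from typing import Optional


def in_window(now: time, start: time, end: time) -> bool:
    """True if `now` falls in [start, end), allowing windows that wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


class RecognizerSettings:
    """Holds the active settings and, optionally, settings waiting to be applied."""

    def __init__(self, active: str) -> None:
        self.active = active
        self.pending: Optional[str] = None

    def receive(self, new_settings: str) -> None:
        """S80G: store the received setting information without applying it yet."""
        self.pending = new_settings

    def maybe_apply(self, now: time, start: time, end: time) -> bool:
        """S80H: apply pending settings only inside the planned window."""
        if self.pending is not None and in_window(now, start, end):
            self.active, self.pending = self.pending, None
            return True
        return False
```

A server would call `maybe_apply` periodically with the current local time; outside the window the previous acoustic model stays in service.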

As described above, in the voice recognition system 3 of this embodiment, the setting information update unit 90D of the management unit 90 updates the setting information, and the voice recognition server device selected in step S12 changes its own acoustic model and voice recognition settings on the basis of the updated setting information. Misrecognition by the voice recognition server device selected in step S12 is therefore reduced, and the utilization efficiency of the system can be improved.

<Supplementary note>
The device of the present invention has, for example as a single hardware entity, an input unit to which a keyboard or the like can be connected, an output unit to which a liquid crystal display or the like can be connected, a communication unit to which a communication device (for example, a communication cable) capable of communicating with the outside of the hardware entity can be connected, a CPU (Central Processing Unit, which may include a cache memory, registers, and the like), RAM and ROM as memory, an external storage device that is a hard disk, and a bus that connects the input unit, output unit, communication unit, CPU, RAM, ROM, and external storage device so that data can be exchanged among them. If necessary, the hardware entity may also be provided with a device (drive) that can read from and write to a recording medium such as a CD-ROM. A general-purpose computer is one example of a physical entity having such hardware resources.

The external storage device of the hardware entity stores the program needed to realize the functions described above and the data needed for the processing of this program. (Storage is not limited to the external storage device; for example, the program may be stored in ROM, which is a read-only storage device.) Data obtained through the processing of these programs is stored as appropriate in RAM, the external storage device, or the like.

In the hardware entity, each program stored in the external storage device (or ROM or the like) and the data needed for the processing of each program are read into memory as needed, and are interpreted, executed, and processed by the CPU as appropriate. As a result, the CPU realizes the predetermined functions (the components referred to above as "... unit", "... means", and so on).

The present invention is not limited to the embodiments described above and can be modified as appropriate without departing from the spirit of the invention. The processes described in the above embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed.

As already stated, when the processing functions of the hardware entity (the device of the present invention) described in the above embodiments are realized by a computer, the processing content of the functions that the hardware entity should have is described by a program. Executing this program on a computer realizes the processing functions of the hardware entity on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk device, a flexible disk, a magnetic tape, or the like can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), CD-R (Recordable)/RW (ReWritable), or the like as the optical disc; an MO (Magneto-Optical disc) or the like as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) or the like as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own recording medium and executes the process according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to that program, or it may execute the process according to the received program sequentially each time the program is transferred to it from the server computer. The processes described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized solely through execution instructions and result acquisition. The program in this form includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this form, the hardware entity is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

Claims

A voice recognition system including a client device, a plurality of voice recognition server devices, and a management unit, wherein
the client device includes:
a receiving unit that receives a voice recognition result for an acoustic signal input to the client device from a voice recognition server device selected based on a sound collection condition;
a rephrase determining unit that acquires a repeated signal group, which is a group of signals in which a user's utterance of the same content is observed repeated a plurality of times, and extracts the voice recognition result of the last signal of the repeated signal group as a correct answer candidate;
a transmission unit that treats the entire repeated signal group as a relearning signal group and transmits a pair of the correct answer candidate and the relearning signal group to the management unit; and
a transmission destination changing unit that, based on transmission destination information, which is information on the relationship between the voice recognition server device serving as the transmission destination of the acoustic signal and the sound collection condition, changes the relationship between the voice recognition server device serving as the transmission destination of the acoustic signal and the sound collection condition, and
the management unit includes:
a voice recognition result receiving unit that receives voice recognition results for the relearning signal group from all the voice recognition server devices;
a transmission destination information update unit that updates the transmission destination information based on the similarity between the correct answer candidate and each voice recognition result received from all the voice recognition server devices; and
a transmission destination information transmission unit that transmits the updated transmission destination information to the client device.

A voice recognition system including a client device, a plurality of voice recognition server devices, and a management unit, wherein
the client device includes:
a receiving unit that receives a voice recognition result for an acoustic signal input to the client device from a voice recognition server device selected based on a sound collection condition;
a rephrase determining unit that acquires a repeated signal group, which is a group of signals in which a user's utterance of the same content is observed repeated a plurality of times, and extracts the voice recognition result of the last signal of the repeated signal group as a correct answer candidate; and
a transmission unit that treats the repeated signal group as a relearning signal group and transmits a pair of the correct answer candidate and the relearning signal group to the management unit,
the management unit includes:
a voice recognition result receiving unit that receives voice recognition results for the relearning signal group from all the voice recognition server devices;
a setting information update unit that updates setting information, which is information on the voice recognition settings of the selected voice recognition server device, based on the similarity between the correct answer candidate and each voice recognition result received from all the voice recognition server devices; and
a setting information transmission unit that transmits the updated setting information to the selected voice recognition server device, and
each of the voice recognition server devices,
upon receiving the setting information, changes its own voice recognition settings based on the received setting information.

A voice recognition system including a client device, a plurality of voice recognition server devices, and a management unit, wherein
the client device includes:
a receiving unit that receives a voice recognition result for an acoustic signal input to the client device from a voice recognition server device selected based on a sound collection condition;
a rephrase determining unit that acquires a repeated signal group, which is a group of signals in which a user's utterance of the same content is observed repeated a plurality of times, and extracts the voice recognition result of the last signal of the repeated signal group as a correct answer candidate;
a transmission unit that treats the entire repeated signal group as a relearning signal group and transmits a pair of the correct answer candidate and the relearning signal group to the management unit; and
a threshold changing unit that changes a threshold, which is a value set in advance for extracting the sound collection condition, and
the management unit includes:
a voice recognition result receiving unit that receives voice recognition results for the relearning signal group from all the voice recognition server devices;
a threshold update unit that updates the threshold based on the similarity between the correct answer candidate and each voice recognition result received from all the voice recognition server devices; and
a threshold transmission unit that transmits the updated threshold to the client device.

The voice recognition system according to any one of claims 1 to 3, wherein
the rephrase determining unit,
where m is an integer of 2 or more, determines whether the m-th acoustic signal input to the client device is a rephrase based on at least one of: a reaction time, which is the difference between the time at which the client device presented the voice recognition result for the (m-1)-th acoustic signal input to the client device and the time at which the m-th acoustic signal was input to the client device; the reliability of the voice recognition result for the (m-1)-th acoustic signal input to the client device; the similarity between the (m-1)-th and m-th acoustic signals input to the client device; and the similarity between the voice recognition results for the (m-1)-th and m-th acoustic signals input to the client device, and acquires the repeated signal group based on the determination result.

A voice recognition method executed by a client device, a plurality of voice recognition server devices, and a management unit, wherein
the client device executes:
a step of receiving a voice recognition result for an acoustic signal input to the client device from a voice recognition server device selected based on a sound collection condition;
a step of acquiring a repeated signal group, which is a group of signals in which a user's utterance of the same content is observed repeated a plurality of times, and extracting the voice recognition result of the last signal of the repeated signal group as a correct answer candidate; and
a step of treating the entire repeated signal group as a relearning signal group and transmitting a pair of the correct answer candidate and the relearning signal group to the management unit,
the management unit executes:
a step of receiving voice recognition results for the relearning signal group from all the voice recognition server devices;
a step of updating transmission destination information, which is information on the relationship between the voice recognition server device serving as the transmission destination of the acoustic signal and the sound collection condition, based on the similarity between the correct answer candidate and each voice recognition result received from all the voice recognition server devices; and
a step of transmitting the updated transmission destination information to the client device, and
the client device executes
a step of changing the relationship between the voice recognition server device serving as the transmission destination of the acoustic signal and the sound collection condition based on the transmission destination information.

A voice recognition method executed by a client device, a plurality of voice recognition server devices, and a management unit, wherein
the client device executes:
a step of receiving a voice recognition result for an acoustic signal input to the client device from a voice recognition server device selected based on a sound collection condition;
a step of acquiring a repeated signal group, which is a group of signals in which a user's utterance of the same content is observed repeated a plurality of times, and extracting the voice recognition result of the last signal of the repeated signal group as a correct answer candidate; and
a step of treating the entire repeated signal group as a relearning signal group and transmitting a pair of the correct answer candidate and the relearning signal group to the management unit,
the management unit executes:
a step of receiving voice recognition results for the relearning signal group from all the voice recognition server devices;
a step of updating setting information, which is information on the voice recognition settings of the selected voice recognition server device, based on the similarity between the correct answer candidate and each voice recognition result received from all the voice recognition server devices; and
a step of transmitting the updated setting information to the selected voice recognition server device, and
each of the voice recognition server devices executes
a step of changing, upon receiving the setting information, its own voice recognition settings based on the received setting information.

A speech recognition method executed by a client device, a plurality of speech recognition server devices, and a management unit, wherein
the client device executes the steps of:
receiving a speech recognition result for an acoustic signal input to the client device from a speech recognition server device selected based on a sound collection condition;
acquiring a repetitive signal group, which is a group of signals in which the user is observed to repeat an utterance with the same content multiple times, and extracting the speech recognition result for the last signal of the repetitive signal group as a correct answer candidate; and
treating the entire repetitive signal group as a re-learning signal group, and transmitting the pair of the correct answer candidate and the re-learning signal group to the management unit,
the management unit executes the steps of:
receiving speech recognition results for the re-learning signal group from all of the speech recognition server devices;
updating a threshold, which is a value set in advance for extracting the sound collection condition, based on the similarity between the correct answer candidate and each of the speech recognition results received from all of the speech recognition server devices; and
transmitting the updated threshold to the client device, and
the client device further executes the step of:
changing the threshold.

A program for causing a computer to function as a speech recognition server device included in the speech recognition system according to any one of claims 1 to 4.

A program for causing a computer to function as the management unit included in the speech recognition system according to any one of claims 1 to 4.