JPWO2014049944A1 - Audio processing device, audio processing method, audio processing program, and noise suppression device

Info

Publication number: JPWO2014049944A1
Application number: JP2014538111A
Authority: JP
Inventors: 健花沢; 剛範辻川; 秀治古明地
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-09-27
Filing date: 2013-08-21
Publication date: 2016-08-22
Also published as: WO2014049944A1

Abstract

Provided is a speech processing apparatus that can perform highly accurate noise suppression even when knowledge of the speech recognition engine is not available in advance. The speech processing apparatus includes: a teacher signal transmission unit 11 that transmits, to a speech recognition engine, a plurality of teacher signals, each of which is speech whose utterance content is known in advance and which has been subjected to noise suppression processing using a parameter for noise suppression processing; a recognition result receiving unit 12 that receives the recognition results of the speech recognition processing performed by the speech recognition engine on the plurality of teacher signals; and a parameter selection unit 13 that, based on the accuracy of the recognition results, selects, from among the parameters used for the plurality of teacher signals, a parameter to be used in the noise suppression processing performed before speech recognition processing by the speech recognition engine.

Description

The present invention relates to a speech processing apparatus, a speech processing method, a speech processing program, and a noise suppression device, and more particularly to a speech processing apparatus, a speech processing method, a speech processing program, and a noise suppression device used for speech recognition.

In recent years, speech recognition technology has increasingly been put into practical use. A typical speech recognition apparatus performs recognition processing on speech input from a microphone. However, ambient noise and other sounds may be mixed in with the target speech, and this is one of the major factors that lower the speech recognition rate.

To solve this problem, noise suppression and speech enhancement techniques have been studied for many years. For example, spectral subtraction, which suppresses noise by estimating the noise component in the input and subtracting it from the input, is already in practical use. In recent years, model-based noise suppression and speech enhancement techniques in particular have advanced, and research has been conducted on reducing speech distortion that cannot be eliminated simply by improving speech quality, and on performing conversions better suited to speech recognition.
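As a rough illustration of the spectral subtraction mentioned above, the following sketch subtracts an estimated noise power spectrum from a noisy power spectrum, one frequency bin at a time. The function name, the `floor` parameter, and the sample values are assumptions made for illustration; they do not come from this document or from Patent Document 1.

```python
def spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Subtract an estimated noise power spectrum from a noisy one, bin by bin.

    `floor` is a hypothetical spectral floor: each output bin keeps at least
    this fraction of the noisy power, so no bin becomes negative.
    """
    cleaned = []
    for y, n in zip(noisy_power, noise_power):
        cleaned.append(max(y - n, floor * y))
    return cleaned


# Illustrative values only: four frequency bins of one frame.
noisy = [4.0, 9.0, 1.0, 0.5]
noise = [1.0, 1.0, 1.0, 1.0]
cleaned = spectral_subtraction(noisy, noise)
```

Bins where the noise estimate exceeds the noisy power fall back to the floor value, which is one common way of avoiding negative power; real systems differ in how they estimate the noise and how they handle such bins.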

Patent Document 1 describes an example of model-based noise suppression and speech enhancement. The technique described in Patent Document 1 uses a means for obtaining a noise average spectrum, a means for obtaining a provisional estimated speech from the input signal and the noise average spectrum, a standard pattern, and a means for obtaining a correction value for the provisional estimated speech using the standard pattern. In this way, the technique described in Patent Document 1 provides a noise suppression system that can remove noise components with high accuracy without losing speech information.

Patent Document 1: Japanese Patent No. 4765461

A typical noise suppression system that performs model-based noise suppression achieves a conversion suited to speech recognition by using a standard pattern, that is, a model of the same kind as the model used by the downstream speech recognition engine. Such a system assumes that the knowledge (model) of the downstream speech recognition engine is available when the noise suppression model is built.

However, when the upstream stage that performs noise suppression processing and the downstream stage that performs speech recognition are built independently, the knowledge of the downstream speech recognition engine is not necessarily available. For example, in a client-server speech recognition system, the client side that performs noise suppression processing and the server side that performs speech recognition may be independent of each other. In such a case, if the client side performs noise suppression without using knowledge of the speech recognition and a mismatch arises between the noise suppression parameter (model) and the speech recognition parameter, appropriate noise suppression cannot be performed and speech recognition accuracy deteriorates.

An object of the present invention is to provide a speech processing apparatus, a speech processing method, a speech processing program, and a noise suppression device that can perform highly accurate noise suppression even when knowledge of the speech recognition is not available in advance.

A speech processing apparatus according to the present invention includes: a teacher signal transmission unit that transmits, to a speech recognition engine, a plurality of teacher signals, each of which is speech whose utterance content is known in advance and which has been subjected to noise suppression processing using a parameter for noise suppression processing; a recognition result receiving unit that receives the recognition results of the speech recognition processing performed by the speech recognition engine on the plurality of teacher signals; and a parameter selection unit that, based on the accuracy of the recognition results, selects, from among the parameters used for the plurality of teacher signals, a parameter to be used in the noise suppression processing performed before speech recognition processing by the speech recognition engine.

A noise suppression device according to the present invention includes: a teacher signal transmission unit that transmits, to a speech recognition engine, a plurality of teacher signals, each of which is speech whose utterance content is known in advance and which has been subjected to noise suppression processing using a parameter for noise suppression processing; a recognition result receiving unit that receives the recognition results of the speech recognition processing performed by the speech recognition engine on the plurality of teacher signals; a parameter selection unit that, based on the accuracy of the recognition results, selects, from among the parameters used for the plurality of teacher signals, a parameter to be used in the noise suppression processing performed before speech recognition processing by the speech recognition engine; and a noise proof processing unit that performs noise suppression processing using the selected parameter.

A noise suppression method according to the present invention includes: transmitting, to a speech recognition engine, a plurality of teacher signals, each of which is speech whose utterance content is known in advance and which has been subjected to noise suppression processing using a parameter for noise suppression processing; receiving the recognition results of the speech recognition processing performed by the speech recognition engine on the plurality of teacher signals; and selecting, based on the accuracy of the recognition results and from among the parameters used for the plurality of teacher signals, a parameter to be used in the noise suppression processing performed before speech recognition processing by the speech recognition engine.

A noise suppression program according to the present invention causes a computer to execute: teacher signal transmission processing that transmits, to a speech recognition engine, a plurality of teacher signals, each of which is speech whose utterance content is known in advance and which has been subjected to noise suppression processing using a parameter for noise suppression processing; recognition result reception processing that receives the recognition results of the speech recognition processing performed by the speech recognition engine on the plurality of teacher signals; and parameter selection processing that, based on the accuracy of the recognition results, selects, from among the parameters used for the plurality of teacher signals, a parameter to be used in the noise suppression processing performed before speech recognition processing by the speech recognition engine.

According to the present invention, highly accurate noise suppression can be performed even when knowledge of the speech recognition is not available in advance.

FIG. 1 is a block diagram showing the configuration of the first embodiment of the speech processing apparatus according to the present invention.
FIG. 2 is a flowchart showing the operation of the first embodiment of the speech processing apparatus according to the present invention.
FIG. 3 is a block diagram showing the configuration of the noise suppression device according to Example 1.
FIG. 4 is a flowchart showing the operation of the noise suppression device according to Example 1.
FIG. 5 is an explanatory diagram showing an example of parameter clustering.
FIG. 6 is a block diagram showing the configuration of the second embodiment of the speech processing apparatus according to the present invention.
FIG. 7 is a flowchart showing the operation of the second embodiment of the speech processing apparatus according to the present invention.
FIG. 8 is a block diagram showing the configuration of the speech recognition system according to Example 2.
FIG. 9 is a flowchart showing the operation of the speech recognition system according to Example 2.
FIG. 10 is a block diagram showing the configuration of the main part of the speech processing apparatus according to the present invention.

Embodiments of the speech processing apparatus according to the present invention will be described with reference to the drawings.

Embodiment 1.
FIG. 1 is a diagram showing the configuration of the speech processing apparatus according to the first embodiment. The speech processing apparatus 10 shown in FIG. 1 outputs teacher signals, receives recognition results, and selects the optimum parameter for noise suppression and speech enhancement.

The speech processing apparatus 10 of this embodiment is a general-purpose computer system and includes, as components not shown, a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), and a nonvolatile storage device. In the speech processing apparatus 10, the CPU loads an OS (operating system) and a program stored in the RAM, the ROM, or the nonvolatile storage device and executes speech processing. The speech processing apparatus 10 need not be a single computer system and may be composed of a plurality of computer systems.

As shown in FIG. 1, the speech processing apparatus 10 of this embodiment includes a teacher signal transmission unit 11, a recognition result receiving unit 12, a parameter selection unit 13, a parameter storage unit 15, and a teacher signal storage unit 14. The parameter storage unit 15 and the teacher signal storage unit 14 may be provided outside the speech processing apparatus 10.

The parameter storage unit 15 stores a plurality of parameters. A parameter is data used in the noise suppression processing described later; for example, when model-based noise suppression is used, it is a model used in a speech recognition engine. The teacher signal storage unit 14 stores a plurality of teacher signals converted with the plurality of parameters stored in the parameter storage unit 15. This conversion is the same processing as the noise suppression processing performed by the noise proof processing unit 102 described later.

The teacher signal transmission unit 11 sequentially transmits the plurality of teacher signals stored in the teacher signal storage unit 14 to the speech recognition engine 103. A teacher signal is speech whose utterance content is known in advance. The teacher signal transmission unit 11 also notifies the recognition result receiving unit 12 of information about each transmitted teacher signal, for example the parameter used for its conversion.

The recognition result receiving unit 12 sequentially receives the recognition results of the speech recognition processing from the speech recognition engine 103 and passes them to the parameter selection unit 13 together with the parameter information notified by the teacher signal transmission unit 11.

Based on the accuracy of the recognition results, the parameter selection unit 13 selects, from among the plurality of parameters stored in the parameter storage unit 15, the parameter to be used in the noise suppression processing performed before recognition processing by the speech recognition engine 103. Specifically, the parameter selection unit 13 compares the recognition results sequentially received by the recognition result receiving unit 12 and selects the one with the highest recognition accuracy. It then selects the parameter that was used for the teacher signal from which the selected recognition result was obtained. For example, the parameter selection unit 13 selects, as the recognition result with the highest recognition accuracy, the one closest at the word level or character level to the known utterance content of the teacher signal. Methods for comparing word-level or character-level distance are well known, so their description is omitted here.

Although the parameter selection unit 13 has been described as comparing recognition accuracy only, it may instead select the optimum parameter based on a combination of the recognition accuracy and the processing time of the speech recognition engine 103. In that case, the recognition result receiving unit 12 obtains the processing time along with each recognition result and notifies the parameter selection unit 13 of it. In speech recognition, processing time generally tends to be shorter the better the parameters or models match. By also taking processing time into account, the parameter selection unit 13 can therefore select a parameter more accurately.
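The document does not specify how accuracy and processing time are combined. One possible reading, sketched below, ranks by accuracy first and uses processing time only to break ties. The `Trial` type, its field names, and the sample values are hypothetical.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    parameter_id: str
    errors: int      # distance between the recognition result and the known utterance
    seconds: float   # processing time reported together with the recognition result


def select_by_accuracy_then_time(trials):
    """Fewer errors wins; among equally accurate results, the faster one wins."""
    return min(trials, key=lambda t: (t.errors, t.seconds)).parameter_id


# Illustrative values only.
trials = [
    Trial("param_a", errors=1, seconds=0.82),
    Trial("param_b", errors=1, seconds=0.47),
    Trial("param_c", errors=3, seconds=0.30),
]
chosen = select_by_accuracy_then_time(trials)
```

Here `param_a` and `param_b` are equally accurate, so the shorter processing time decides in favor of `param_b`; `param_c` is fastest but is not chosen because it is less accurate. A weighted score would be another reasonable combination.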

Next, the operation of the speech processing apparatus of this embodiment will be described. FIG. 2 is a flowchart showing the operation of the first embodiment of the speech processing apparatus according to the present invention.

First, the teacher signal transmission unit 11 transmits teacher signals (step S200). Specifically, the teacher signal transmission unit 11 sequentially transmits, to the speech recognition engine 103, the plurality of teacher signals that were converted with the plurality of parameters and stored in the teacher signal storage unit 14. The teacher signal transmission unit 11 also notifies the recognition result receiving unit 12 of information about the transmitted teacher signals.

Next, the recognition result receiving unit 12 receives recognition results from the speech recognition engine 103 (step S201). Specifically, the recognition result receiving unit 12 sequentially receives a plurality of recognition results from the speech recognition engine 103 and passes them to the parameter selection unit 13 together with the parameter information notified by the teacher signal transmission unit 11.

Next, the parameter selection unit 13 performs parameter selection (step S202). Specifically, the parameter selection unit 13 compares the plurality of recognition results notified by the recognition result receiving unit 12 and selects the one with the highest recognition accuracy. It then selects the parameter that was used for the teacher signal from which the selected recognition result was obtained. For example, suppose that speech uttering 「一番近い駅はどこですか」 ("Where is the nearest station?") is used as the teacher signal, and that the speech recognition engine 103 performs speech recognition on that teacher signal as converted with three different parameters and outputs three recognition results: 「一番近い駅は」 ("The nearest station is"), 「一番近い木はどこですか」 ("Where is the nearest tree?"), and 「千葉駅はどこですか」 ("Where is Chiba Station?"). In this case, compared at the word level, the recognition result with the highest recognition accuracy, that is, the one closest to the correct answer, is the second. The parameter selection unit 13 therefore selects the second recognition result. This selection indicates that the parameter used to convert the teacher signal behind the second recognition result is the best suited to the downstream speech recognition engine.
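The word-level comparison in this example can be sketched with a standard Levenshtein (edit) distance over word tokens. The document only says that such distance methods are well known, so the choice of Levenshtein distance, the hand-made word segmentation, and the parameter identifiers below are all assumptions for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def select_parameter(reference, results):
    """results: list of (parameter_id, recognized_tokens); returns the closest one's id."""
    return min(results, key=lambda item: edit_distance(reference, item[1]))[0]


# Known utterance of the teacher signal, segmented into words by hand.
reference = ["一番", "近い", "駅", "は", "どこ", "です", "か"]
results = [
    ("param_1", ["一番", "近い", "駅", "は"]),
    ("param_2", ["一番", "近い", "木", "は", "どこ", "です", "か"]),
    ("param_3", ["千葉", "駅", "は", "どこ", "です", "か"]),
]
distances = [edit_distance(reference, tokens) for _, tokens in results]
best = select_parameter(reference, results)
```

With this segmentation the three results are 3, 1, and 2 edits away from the reference, so the parameter behind the second result (`param_2`) is selected, matching the outcome described above.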

Next, the parameter selection unit 13 outputs the selected parameter (step S203).

In this embodiment, what is used for the conversion and what is selected are referred to as parameters; a parameter may be, for example, a threshold for spectral subtraction or a gain of a Wiener filter.

When sequentially transmitting the plurality of teacher signals converted with the plurality of parameters, the teacher signal transmission unit 11 may control the order of transmission. For example, when the parameters are clustered by acoustic proximity, the teacher signal transmission unit 11 first applies the teacher signals converted with the node (parameter) that represents each cluster. It then gives priority to the parameters in the cluster associated with the representative node that the parameter selection unit 13 has selected as having high recognition accuracy. This allows the parameter selection unit 13 to select a parameter efficiently, with a small amount of processing, even when there are many parameters.

FIG. 5 is an explanatory diagram showing an example of parameter clustering. As shown in FIG. 5, a large number of parameters (each symbol in the figure represents one parameter) are clustered in advance, and a representative parameter within each cluster serves as its representative node. The teacher signal transmission unit 11 first transmits the teacher signals for the four representative nodes (a, b, c, d) so that speech recognition is performed on them. If the parameter selection unit 13 then selects, for example, a as the parameter giving the best recognition accuracy, the teacher signal transmission unit 11 does not expand the clusters of the remaining representative nodes and recursively transmits teacher signals only for the parameters in cluster A, which contains a. This improves efficiency because processing is skipped for parameters that are unlikely to be selected. The tree structure shown in FIG. 5 may have multiple levels. In addition, instead of only the representative node with the best recognition accuracy, a plurality of representative nodes may be selected in descending order of recognition accuracy.
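The coarse-to-fine search over clustered parameters can be sketched as a recursive descent that scores only the representatives at each level and expands only the winning cluster. The tree layout (dictionaries with `param` and `children` keys), the `error_of` callback standing in for a round trip to the speech recognition engine, and the error counts are hypothetical.

```python
def coarse_to_fine(nodes, error_of, tried):
    """Score the representatives in `nodes`, then descend only into the best cluster.

    Returns (error, parameter) for the best parameter found along the chosen path.
    `tried` collects every parameter that was actually evaluated.
    """
    scored = []
    for node in nodes:
        tried.append(node["param"])
        scored.append((error_of(node["param"]), node))
    best_error, best_node = min(scored, key=lambda pair: pair[0])
    if not best_node["children"]:
        return best_error, best_node["param"]
    # Expand only the winning cluster; the other clusters are never evaluated.
    child_error, child_param = coarse_to_fine(best_node["children"], error_of, tried)
    if child_error < best_error:
        return child_error, child_param
    return best_error, best_node["param"]


# Hypothetical tree: four representative nodes, two of which have cluster members.
tree = [
    {"param": "a", "children": [{"param": "a1", "children": []},
                                {"param": "a2", "children": []}]},
    {"param": "b", "children": [{"param": "b1", "children": []}]},
    {"param": "c", "children": []},
    {"param": "d", "children": []},
]
# Hypothetical error counts that the recognition engine would yield per parameter.
errors = {"a": 2, "b": 4, "c": 5, "d": 3, "a1": 1, "a2": 2, "b1": 0}

tried = []
best_error, best_param = coarse_to_fine(tree, errors.__getitem__, tried)
```

In this run only six of the seven parameters are evaluated and `a1` is returned. The sample data is chosen to show the trade-off as well: `b1` would have scored better but is never tried, because the search is a heuristic that trusts the representative nodes.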

In this way, when performing noise suppression processing, the speech processing apparatus 10 of this embodiment can perform highly accurate noise suppression and speech enhancement without prior knowledge of the downstream speech recognition engine. In particular, when a model-based noise suppression method is used, a mismatch between the model used for noise suppression and the model used for speech recognition causes a large loss of accuracy, so reducing this mismatch yields a large improvement in accuracy.

Furthermore, when there are many parameters for creating teacher signals, the speech processing apparatus 10 of this embodiment can reduce the amount of processing by using a tree structure or the like to apply the parameters efficiently, starting with those that give good accuracy.

<Example 1>
An example of the speech processing apparatus of this embodiment is described below. FIG. 3 is a block diagram showing the configuration of the noise suppression device according to Example 1. The noise suppression device 100 shown in FIG. 3 includes a speech processing apparatus 10b and a noise proof processing unit 102. The noise suppression device 100 is connected to an input unit 101 and a speech recognition engine 103. The noise suppression device 100 operates as a noise suppression engine and outputs speech suitable for speech recognition by applying noise suppression processing to the input speech.

The noise suppression device 100 uses a general-purpose computer system and includes, as components not shown, a CPU, a RAM, a ROM, and a nonvolatile storage device. In the noise suppression device 100, the CPU loads an OS and a noise suppression program stored in the RAM, the ROM, or the nonvolatile storage device and executes noise suppression processing. The noise suppression device 100 can thereby turn the input speech into speech suitable for speech recognition. The noise suppression device 100 need not be a single computer system and may be composed of a plurality of computer systems.

The input unit 101 receives speech and outputs it to the noise proof processing unit 102. The input unit 101 is, for example, a microphone.

Before the input unit 101 receives speech, the speech processing apparatus 10b transmits teacher signals to the speech recognition engine, receives the recognition results, and notifies the noise proof processing unit 102 of the parameter suited to noise suppression processing. The speech processing apparatus 10b has the same functions as the speech processing apparatus 10 shown in FIG. 1, so its description is omitted.

The noise proof processing unit 102 uses the parameter notified by the speech processing apparatus 10b to perform noise suppression processing on the speech input from the input unit 101, and outputs the result to the speech recognition engine 103. As the noise suppression processing, the noise proof processing unit 102 performs, for example, model-based noise suppression.

Next, the operation of the noise suppression device 100 of this example will be described. FIG. 4 is a flowchart showing the operation of the noise suppression device according to Example 1. Because the speech processing apparatus 10b has the same functions as the speech processing apparatus 10 of FIG. 1, detailed description is omitted for operations that are the same as those of the speech processing apparatus 10 shown in the flowchart of FIG. 2.

First, the teacher signal transmission unit 11 transmits teacher signals (step S400). Specifically, the teacher signal transmission unit 11 sequentially transmits a plurality of teacher signals to the speech recognition engine 103.

Next, the recognition result receiving unit 12 receives recognition results (step S401). Specifically, the recognition result receiving unit 12 receives the plurality of recognition results obtained sequentially from the speech recognition engine 103.

Next, the parameter selection unit 13 performs parameter selection (step S402). Specifically, the parameter selection unit 13 selects, from among the plurality of recognition results sequentially received from the recognition result receiving unit 12, the one with the highest recognition accuracy. It then selects the parameter that was used for the teacher signal from which the selected recognition result was obtained.

Next, the parameter selection unit 13 outputs the parameter (step S403). Specifically, the parameter selection unit 13 notifies the noise proof processing unit 102 of the selected parameter.

Next, the input unit 101 receives speech (step S404).

Next, the noise proof processing unit 102 performs noise suppression processing (step S405). Specifically, the noise proof processing unit 102 uses the parameter notified by the speech processing apparatus 10b to apply noise suppression processing to the speech input from the input unit 101, and outputs the noise-suppressed input speech to the speech recognition engine 103.

In this example, the noise suppression device 100 performs parameter selection by transmitting teacher signals and noise suppression processing on the input speech once each. However, once a parameter has been selected, the device may continue to use the selected parameter as long as the conditions remain the same.
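The ordering of steps S400 to S405, together with the reuse of a selected parameter, can be sketched as a small class that selects once and then processes any number of inputs. The class, its method names, and the stand-in callables are hypothetical; in particular, the threshold gate below is only a placeholder and is not the model-based noise suppression described in this document.

```python
class NoiseSuppressionDevice:
    """Toy stand-in for device 100: select a parameter once, then reuse it."""

    def __init__(self, select_parameter, suppress):
        self._select_parameter = select_parameter  # stands in for steps S400 to S403
        self._suppress = suppress                  # stands in for step S405
        self._parameter = None
        self.selection_runs = 0

    def calibrate(self):
        """Run the teacher-signal round trip and remember the selected parameter."""
        self._parameter = self._select_parameter()
        self.selection_runs += 1

    def process(self, audio):
        """Suppress noise in one input (step S404 supplies `audio`)."""
        if self._parameter is None:
            self.calibrate()
        return self._suppress(audio, self._parameter)


# Hypothetical stand-ins: a fixed selection result and a simple threshold gate.
device = NoiseSuppressionDevice(
    select_parameter=lambda: 0.2,
    suppress=lambda audio, threshold: [x if abs(x) >= threshold else 0.0 for x in audio],
)
device.calibrate()
first = device.process([0.5, 0.1, -0.3])
second = device.process([0.05, 0.9])
```

Both inputs are processed with the parameter chosen by the single calibration run. Calling `calibrate()` again would correspond to repeating the selection when the conditions change.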

Embodiment 2.
The speech processing apparatus of this embodiment, described below, is designed to select parameters efficiently even when the parameters are changed to follow an environment that changes from moment to moment.

FIG. 6 is a block diagram showing the configuration of a second embodiment of the speech processing apparatus according to the present invention. The speech processing apparatus 10c shown in FIG. 6 outputs teacher signals, receives recognition results, and selects the optimal model parameter for noise suppression and speech enhancement.

The speech processing apparatus 10c of this embodiment uses a general-purpose computer system and includes, as components not shown, a CPU, RAM, ROM, and a nonvolatile storage device. In the speech processing apparatus 10c, the CPU reads the OS and the speech processing program stored in the RAM, ROM, or nonvolatile storage device and executes speech processing. This allows appropriate noise suppression to be performed efficiently. The speech processing apparatus 10c need not be a single computer system and may be configured from a plurality of computer systems.

As shown in FIG. 6, the speech processing apparatus 10c of this embodiment includes a teacher signal transmission unit 11, a recognition result receiving unit 12, a parameter selection unit 13, and a parameter recording unit 16. It further includes a parameter storage unit 15 that stores a plurality of parameters and a teacher signal storage unit 14 that stores a plurality of teacher signals converted using those parameters. Below, only the points in which the speech processing apparatus 10c differs from the speech processing apparatus 10 shown in FIG. 1 are described.

The parameter recording unit 16 records the parameter notified by the parameter selection unit 13 and, at the time of the next speech processing, notifies the teacher signal transmission unit 11 of the recorded parameter. The recorded parameter is not limited to a single one; history information on a plurality of parameters selected in earlier speech processing may be recorded. The parameter recording unit 16 may likewise notify the teacher signal transmission unit 11 of a plurality of parameters.

The teacher signal transmission unit 11 transmits a plurality of teacher signals, converted using a plurality of parameters, to the speech recognition engine 103. It gives priority to teacher signals converted using the parameters recorded in the parameter recording unit 16.

The parameter selection unit 13 compares the plurality of recognition results received sequentially from the recognition result receiving unit 12 and selects the one with the highest recognition accuracy. It then selects the parameter applied to the teacher signal from which that recognition result was obtained, outputs it to the noise-resistant processing unit 102, and at the same time notifies the parameter recording unit 16.

Next, the operation of the speech processing apparatus of this embodiment is described. FIG. 7 is a flowchart showing that operation.

First, the teacher signal transmission unit 11 transmits teacher signals (step S700). Specifically, the teacher signal transmission unit 11 of the speech processing apparatus 10 sequentially transmits a plurality of teacher signals, converted using a plurality of parameters, to the speech recognition engine 103. It also notifies the recognition result receiving unit 12 of information about the transmitted teacher signals. In doing so, it gives priority to teacher signals converted using the parameters notified by the parameter recording unit 16, that is, parameters selected in the past. For example, among the previously selected parameters recorded in the parameter recording unit 16, it gives priority to the teacher signal converted using the most recently selected parameter.
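The prioritization in step S700 can be pictured as reordering the candidate parameters so that previously selected ones come first, most recent first. The sketch below assumes the history is a list ordered from oldest to newest; the function name `order_candidates` and the single-letter parameter names are hypothetical.

```python
def order_candidates(all_parameters, history):
    """Order parameters for transmission: recently selected ones first.

    all_parameters: every available parameter, in default order.
    history: previously selected parameters, oldest first (an assumption).
    """
    seen = set()
    ordered = []
    # Walk the history backwards so the most recent selection is sent first.
    for p in reversed(history):
        if p in all_parameters and p not in seen:
            ordered.append(p)
            seen.add(p)
    # Append the remaining parameters in their default order.
    for p in all_parameters:
        if p not in seen:
            ordered.append(p)
            seen.add(p)
    return ordered


order = order_candidates(["a", "b", "c", "d"], ["c", "a", "c"])
```

With this history, `c` was selected last, so its teacher signal is sent first, followed by `a` and then the parameters that have never been selected.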

Next, the recognition result receiving unit 12 receives the recognition results (step S701). Specifically, the recognition result receiving unit 12 of the speech processing apparatus 10 sequentially receives the plurality of recognition results obtained from the speech recognition engine 103 and passes them to the parameter selection unit 13 together with the parameter information notified by the teacher signal transmission unit 11.

Next, the parameter selection unit 13 selects a parameter (step S702). Specifically, the parameter selection unit 13 of the speech processing apparatus 10 compares the plurality of recognition results received sequentially by the recognition result receiving unit 12 and selects the recognition result with the highest recognition accuracy. It then selects the parameter applied to the teacher signal from which that recognition result was obtained.

Next, the parameter selection unit 13 outputs the parameter (step S703). Specifically, the parameter selection unit 13 of the speech processing apparatus 10 notifies the parameter recording unit 16 of the selected parameter.

Next, the parameter recording unit 16 records the parameter (step S704). Specifically, the parameter recording unit 16 of the speech processing apparatus 10 records the parameter notified by the parameter selection unit 13 and notifies the teacher signal transmission unit 11 of it at the time of the next speech processing.

The speech processing apparatus 10c of this embodiment may also organize the parameters into a hierarchy as shown in FIG. 5 and control the order in which parameters are submitted for speech recognition. In that case, for example, the parameter recorded in the parameter recording unit 16 is treated as a representative node. The teacher signal transmission unit 11 then gives priority, when transmitting to the speech recognition engine 103, to the teacher signal noise-suppressed with that parameter and to the teacher signals noise-suppressed with the parameters contained in the cluster below it.
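As a rough illustration of this cluster-based ordering, the sketch below assumes a two-level hierarchy stored as a dictionary that maps each representative parameter to the members of its cluster. The data layout, the function name `prioritize_by_cluster`, and the parameter names are assumptions made for the example; the patent does not specify them.

```python
def prioritize_by_cluster(clusters, recorded):
    """Send the recorded representative and its cluster first, then the rest.

    clusters: dict mapping a representative parameter to its cluster members.
    recorded: the parameter held by the parameter recording unit.
    """
    first = [recorded] + [p for p in clusters.get(recorded, []) if p != recorded]
    rest = []
    for representative, members in clusters.items():
        for p in [representative] + list(members):
            if p not in first and p not in rest:
                rest.append(p)
    return first + rest


# Hypothetical parameter names grouped by acoustic environment.
clusters = {
    "quiet": ["quiet_office", "quiet_home"],
    "street": ["street_car", "street_crowd"],
}
order = prioritize_by_cluster(clusters, "street")
```

The recorded parameter `street` and the members of its cluster are placed at the head of the transmission order; the other cluster follows.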

In this way, the speech processing apparatus 10c of this embodiment can reduce the amount of processing by selecting parameters efficiently on the basis of the information available up to the immediately preceding selection.

<Example 2>
An example of the speech processing apparatus of this embodiment is described below. FIG. 8 is a diagram showing the configuration of the speech recognition system according to Example 2. The speech recognition system 800 in FIG. 8 includes a speech processing apparatus 10d, a noise-resistant processing unit 102, and a speech recognition engine 103. The speech recognition system 800 is also connected to an input unit 101 and an output unit 801.

The speech recognition system 800 uses a general-purpose computer system and includes, as components not shown, a CPU, RAM, ROM, and a nonvolatile storage device. In the speech recognition system 800, the CPU reads the OS and the speech recognition program stored in the RAM, ROM, or nonvolatile storage device and executes speech recognition processing. The speech recognition system 800 can thereby achieve speech recognition that operates robustly even under noise. The speech recognition system 800 need not be a single computer system and may be configured from a plurality of computer systems. The input unit 101 accepts the speech to be input and passes it to the noise-resistant processing unit 102. The input unit 101 is, for example, a microphone.

Before the input unit 101 accepts input speech, the speech processing apparatus 10d transmits teacher signals to the speech recognition engine 103, receives the recognition results, and notifies the noise-resistant processing unit 102 of a parameter suited to noise suppression processing. Because the speech processing apparatus 10d has the same functions as the speech processing apparatus 10c of FIG. 6, a detailed description is omitted.

Using the parameter notified by the speech processing apparatus 10d, the noise-resistant processing unit 102 applies noise-resistant processing to the speech input from the input unit 101 and outputs the result to the speech recognition engine 103. The speech recognition engine 103 performs speech recognition processing on the noise-suppressed input speech received from the noise-resistant processing unit 102 and notifies the output unit 801 of the speech recognition result.

The output unit 801 outputs the recognition result. For example, the output unit 801 may be a display that shows text on a screen, or a speaker incorporating a speech synthesizer that outputs the recognition result as speech.

Next, the overall flow of the speech recognition processing according to Example 2 is described. FIG. 9 is a flowchart showing the operation of the speech recognition system according to Example 2. Because the speech processing apparatus 10d of the speech recognition system 800 has the same functions as the speech processing apparatus 10c of FIG. 6, a detailed description of operations that are the same as in the flowchart of FIG. 7 is omitted.

First, the teacher signal transmission unit 11 transmits teacher signals (step S900). Specifically, it sequentially transmits a plurality of teacher signals to the speech recognition engine 103. In doing so, it gives priority to teacher signals converted using the plurality of past parameters notified by the parameter recording unit 16.

Next, the recognition result receiving unit 12 receives the recognition results (step S901). Specifically, it receives the plurality of recognition results obtained sequentially from the speech recognition engine 103.

Next, the parameter selection unit 13 selects a parameter (step S902). Specifically, it compares the plurality of recognition results received sequentially by the recognition result receiving unit 12 and selects the recognition result with the highest recognition accuracy. It then selects the parameter applied to the teacher signal from which that recognition result was obtained.

Next, the parameter selection unit 13 outputs the parameter (step S903). Specifically, the parameter selection unit 13 of the speech processing apparatus 10d notifies both the noise-resistant processing unit 102 and the parameter recording unit 16 of the speech processing apparatus 10d of the selected parameter.

Next, the parameter recording unit 16 records the parameter (step S904). Specifically, the parameter recording unit 16 of the speech processing apparatus 10d records the notified parameter and makes it available for the next speech processing.

Next, the input unit 101 receives input speech (step S905).

Next, the noise-resistant processing unit 102 performs noise suppression processing (step S906). Specifically, using the parameter notified by the speech processing apparatus 10d, it applies noise suppression processing to the speech input from the input unit 101 and passes the noise-suppressed input speech to the speech recognition engine 103.

Next, the speech recognition engine 103 recognizes the speech (step S907). Specifically, it performs speech recognition processing on the noise-suppressed speech received from the noise-resistant processing unit 102 and notifies the output unit 801 of the recognition result.

Next, the speech recognition engine 103 outputs the recognition result (step S908). Specifically, the output unit 801 presents the recognition result notified by the speech recognition engine 103, for example by showing it on a display.

In this way, the speech recognition system of Example 2 can select parameters efficiently by using information on parameters selected in the past, so the increase in processing load can be kept down even when noise suppression processing is performed repeatedly.
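The flow of steps S900 to S907 can be summarized in a short sketch. It is only an outline under stated assumptions: `recognize_teacher`, `suppress`, and `recognize` are hypothetical callables standing in for the speech recognition engine applied to a teacher signal, the noise-resistant processing unit, and the speech recognition engine applied to input speech. None of these names comes from the patent.

```python
def run_cycle(candidates, history, recognize_teacher, suppress, recognize, audio):
    """One pass of parameter selection followed by recognition of input speech."""
    # S900: order teacher signals so previously selected parameters go first.
    ordered = list(dict.fromkeys(
        p for p in [*reversed(history), *candidates] if p in candidates
    ))
    # S901: collect the recognition accuracy obtained for each teacher signal.
    scores = {p: recognize_teacher(p) for p in ordered}
    # S902: pick the parameter with the highest accuracy.
    selected = max(ordered, key=scores.get)
    # S903 and S904: hand the parameter on and record it for the next cycle.
    history.append(selected)
    # S905 to S907: suppress noise in the input speech, then recognize it.
    cleaned = suppress(audio, selected)
    return selected, recognize(cleaned)


# Stub components with made-up accuracies, for illustration only.
history = ["a"]
selected, text = run_cycle(
    candidates=["a", "b"],
    history=history,
    recognize_teacher=lambda p: {"a": 0.7, "b": 0.9}[p],
    suppress=lambda audio, p: f"{audio}|{p}",
    recognize=lambda cleaned: f"text({cleaned})",
    audio="mic",
)
```

This sketch evaluates every candidate; the history affects only the order of transmission.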

FIG. 10 is a block diagram showing the configuration of the main part of the speech processing apparatus according to the present invention. As shown in FIG. 10, the speech processing apparatus according to the present invention includes, as its main components: a teacher signal transmission unit 11 that transmits, to a speech recognition engine, a plurality of teacher signals, each being speech whose utterance content is known in advance and which has been noise-suppressed using a parameter for noise suppression processing; a recognition result receiving unit 12 that receives the recognition results of speech recognition processing performed on the plurality of teacher signals by the speech recognition engine; and a parameter selection unit 13 that selects, based on the accuracy of the recognition results and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine.
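Because the utterance content of each teacher signal is known in advance, the accuracy of a recognition result can be measured against that known text. The text above does not fix a particular metric; the sketch below assumes word accuracy derived from word-level edit distance, and it assumes whitespace-separated words, which would need a tokenizer for Japanese. The function name `word_accuracy` is hypothetical.

```python
def word_accuracy(reference, hypothesis):
    """Return 1 - (word edit distance / reference length), floored at 0."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    errors = prev[-1]
    return max(0.0, 1.0 - errors / len(ref))


exact = word_accuracy("turn on the light", "turn on the light")
one_error = word_accuracy("turn on the light", "turn off the light")
```

A score computed this way for each teacher signal can serve as the accuracy on which the parameter selection unit bases its choice.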

The embodiments above also disclose the speech processing apparatuses and noise suppression device described in (1) to (6) below.

(1) A speech processing apparatus in which the noise suppression processing is model-based noise suppression processing and the parameter is a model for noise suppression processing. Such a speech processing apparatus can greatly improve the accuracy of noise suppression. This is because, when a model-based noise suppression method is used, a mismatch between the model used for noise suppression and the model used for speech recognition is a major cause of accuracy degradation, so reducing this mismatch yields a large gain in accuracy.

(2) The speech processing apparatus may be configured so that the recognition result receiving unit acquires the processing time of the speech recognition processing performed on the plurality of teacher signals by the speech recognition engine, and the parameter selection unit selects, based on the accuracy of the recognition results and that processing time and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine. Such a speech processing apparatus can select parameters more accurately by also taking the processing time into account. This is because, in speech recognition, processing time generally tends to be shorter the better the parameter or model matches.
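Item (2) does not say how accuracy and processing time are combined. One possible reading, shown here purely as an assumption, is to treat results whose accuracy lies within a small tolerance of the best as equivalent and to prefer the shortest processing time among them. The function name `select_with_time` and the `tolerance` value are hypothetical.

```python
def select_with_time(results, tolerance=0.01):
    """Pick a parameter using accuracy first and processing time second.

    results: list of (parameter_id, accuracy, seconds) tuples.
    tolerance: accuracy gap treated as a tie (an assumed value).
    """
    if not results:
        raise ValueError("no recognition results received")
    top = max(accuracy for _, accuracy, _ in results)
    # Keep only the results whose accuracy is close to the best one.
    near_best = [r for r in results if top - r[1] <= tolerance]
    # Among those, a shorter processing time suggests a better match.
    return min(near_best, key=lambda r: r[2])[0]


# Illustrative values only.
results = [
    ("model_a", 0.905, 1.8),
    ("model_b", 0.900, 1.2),
    ("model_c", 0.800, 0.9),
]
selected = select_with_time(results)
```

Here `model_b` is chosen over `model_a` because the two are nearly equal in accuracy and `model_b` was recognized faster, while `model_c` is excluded despite being fastest.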

(3) The speech processing apparatus may be configured so that the plurality of parameters are represented as a tree structure in which they are classified into clusters at a plurality of levels according to acoustic closeness, with a plurality of parameters represented as representative nodes of the clusters, and so that the teacher signal transmission unit first gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters represented as representative nodes of the tree structure were used, selects the parameter used for the teacher signal with the highest recognition accuracy among those teacher signals, and next gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters belonging to the cluster containing that parameter were used. With such a speech processing apparatus, parameters can be selected efficiently with a small amount of processing even when there are many parameters.
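The coarse-to-fine search described in item (3) can be sketched for the simplest case of a two-level tree. The dictionary layout, the function name `coarse_to_fine_select`, and the accuracy values are assumptions for illustration; `evaluate` stands in for sending a teacher signal to the speech recognition engine and scoring the result.

```python
def coarse_to_fine_select(clusters, evaluate):
    """Evaluate representatives first, then only the best representative's cluster.

    clusters: dict mapping a representative parameter to its cluster members.
    evaluate: callable returning the recognition accuracy for a parameter.
    Returns the selected parameter and the list of parameters evaluated.
    """
    evaluated = {}

    def score(p):
        # Each parameter is evaluated at most once.
        if p not in evaluated:
            evaluated[p] = evaluate(p)
        return evaluated[p]

    best_representative = max(clusters, key=score)
    candidates = [best_representative] + list(clusters[best_representative])
    best = max(candidates, key=score)
    return best, list(evaluated)


# Made-up accuracies for six hypothetical parameters.
accuracy = {
    "quiet": 0.60, "quiet_office": 0.65, "quiet_home": 0.55,
    "street": 0.80, "street_car": 0.90, "street_crowd": 0.70,
}
clusters = {
    "quiet": ["quiet_office", "quiet_home"],
    "street": ["street_car", "street_crowd"],
}
best, tried = coarse_to_fine_select(clusters, accuracy.__getitem__)
```

In this example only four of the six parameters are evaluated, which is where the reduction in processing comes from. A deeper tree would repeat the same step at each level.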

(4) The speech processing apparatus may include a parameter recording unit (for example, the parameter recording unit 16) that records parameters selected in the past, and may be configured so that the teacher signal transmission unit gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters recorded in the parameter recording unit were used. Such a speech processing apparatus can reduce the amount of processing by selecting parameters efficiently on the basis of the information available up to the immediately preceding selection.

(5) The speech processing apparatus may include a parameter recording unit that records parameters selected in the past, and may be configured so that the plurality of parameters are represented as a tree structure in which they are classified into clusters at a plurality of levels according to acoustic closeness, with a plurality of parameters represented as representative nodes of the clusters, and so that the teacher signal transmission unit selects one representative node based on the parameters recorded in the parameter recording unit and gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters belonging to the cluster containing the parameter represented as that representative node were used. With such a speech processing apparatus, parameters can be selected efficiently with a small amount of processing even when there are many parameters, and the amount of processing can be reduced by selecting parameters efficiently on the basis of the information available up to the immediately preceding selection.

(6) A noise suppression device including: a teacher signal transmission unit that transmits, to a speech recognition engine, a plurality of teacher signals, each being speech whose utterance content is known in advance and which has been noise-suppressed using a parameter for noise suppression processing; a recognition result receiving unit that receives the recognition results of speech recognition processing performed on the plurality of teacher signals by the speech recognition engine; a parameter selection unit that selects, based on the accuracy of the recognition results and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine; and a noise-resistant processing unit (for example, the noise-resistant processing unit 102) that performs noise suppression processing using the selected parameter.

This application claims priority based on Japanese Patent Application No. 2012-213864, filed on September 27, 2012, the entire disclosure of which is incorporated herein.

The present invention has been described above with reference to the embodiments (and examples), but the present invention is not limited to those embodiments (and examples). Various changes that a person skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

Industrial applicability

The present invention can be applied to uses such as a noise-resistant processing system for running a speech recognition system accurately under noise.

10, 10b, 10c, 10d Speech processing apparatus
11 Teacher signal transmission unit
12 Recognition result receiving unit
13 Parameter selection unit
14 Teacher signal storage unit
15 Parameter storage unit
16 Parameter recording unit
100 Noise suppression device
101 Input unit
102 Noise-resistant processing unit
103 Speech recognition engine
800 Speech recognition system
801 Output unit

Claims

1. A speech processing apparatus comprising:
a teacher signal transmission unit that transmits, to a speech recognition engine, a plurality of teacher signals, each being speech whose utterance content is known in advance and which has been noise-suppressed using a parameter for noise suppression processing;
a recognition result receiving unit that receives recognition results of speech recognition processing performed on the plurality of teacher signals by the speech recognition engine; and
a parameter selection unit that selects, based on the accuracy of the recognition results and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine.

2. The speech processing apparatus according to claim 1, wherein the noise suppression processing is model-based noise suppression processing, and
the parameter is a model for noise suppression processing.

3. The speech processing apparatus according to claim 1 or 2, wherein
the recognition result receiving unit acquires the processing time of the speech recognition processing performed on the plurality of teacher signals by the speech recognition engine, and
the parameter selection unit selects, based on the accuracy of the recognition results and the processing time and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine.

4. The speech processing apparatus according to any one of claims 1 to 3, wherein
the plurality of parameters are represented as a tree structure in which they are classified into clusters at a plurality of levels according to acoustic closeness, with a plurality of parameters represented as representative nodes of the clusters, and
the teacher signal transmission unit gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters represented as representative nodes of the tree structure were used, selects the parameter used for the teacher signal with the highest recognition accuracy among those teacher signals, and next gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters belonging to the cluster containing that parameter were used.

5. The speech processing apparatus according to any one of claims 1 to 3, further comprising a parameter recording unit that records parameters selected in the past, wherein
the teacher signal transmission unit gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters recorded in the parameter recording unit were used.

6. The speech processing apparatus according to any one of claims 1 to 3, further comprising a parameter recording unit that records parameters selected in the past, wherein
the plurality of parameters are represented as a tree structure in which they are classified into clusters at a plurality of levels according to acoustic closeness, with a plurality of parameters represented as representative nodes of the clusters, and
the teacher signal transmission unit selects one representative node based on the parameters recorded in the parameter recording unit and gives priority to transmitting to the speech recognition engine the teacher signals for which the parameters belonging to the cluster containing the parameter represented as that representative node were used.

7. A noise suppression device comprising:
a teacher signal transmission unit that transmits, to a speech recognition engine, a plurality of teacher signals, each being speech whose utterance content is known in advance and which has been noise-suppressed using a parameter for noise suppression processing;
a recognition result receiving unit that receives recognition results of speech recognition processing performed on the plurality of teacher signals by the speech recognition engine;
a parameter selection unit that selects, based on the accuracy of the recognition results and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine; and
a noise-resistant processing unit that performs noise suppression processing using the selected parameter.

8. A speech processing method comprising:
transmitting, to a speech recognition engine, a plurality of teacher signals, each being speech whose utterance content is known in advance and which has been noise-suppressed using a parameter for noise suppression processing;
receiving recognition results of speech recognition processing performed on the plurality of teacher signals by the speech recognition engine; and
selecting, based on the accuracy of the recognition results and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine.

9. A speech processing program for causing a computer to execute:
teacher signal transmission processing of transmitting, to a speech recognition engine, a plurality of teacher signals, each being speech whose utterance content is known in advance and which has been noise-suppressed using a parameter for noise suppression processing;
recognition result receiving processing of receiving recognition results of speech recognition processing performed on the plurality of teacher signals by the speech recognition engine; and
parameter selection processing of selecting, based on the accuracy of the recognition results and from among the parameters used for the plurality of teacher signals, a parameter to be used for noise suppression processing performed before speech recognition processing by the speech recognition engine.