JP2015155982A - Voice section detection device, speech recognition device, method thereof, and program

Publication number: JP2015155982A
Application number: JP2014031276A
Authority: JP (Japan)
Prior art keywords: voice, speech, noise, reverberation, signal
Legal status: Granted; Active
Other versions: JP6106618B2 (granted publication)
Language: Japanese (ja)
Inventors: Noriyoshi Kamado (鎌土 記良), Masakiyo Fujimoto (藤本 雅清), Keisuke Kinoshita (木下 慶介), Yuji Aono (青野 裕司)
Assignee: Nippon Telegraph and Telephone Corp
Application filed by Nippon Telegraph and Telephone Corp

Abstract

PROBLEM TO BE SOLVED: To provide a speech segment detection technique that can recognize the voice of a main speaker with high accuracy, not only in a quiet environment but also in a noisy one, from the speech of a plurality of speakers mixed into a single microphone in a real environment.
SOLUTION: A speech segment detection device 10 includes: a noise suppression unit 110 that uses a speech model to suppress the noise contained in a speech digital signal comprising speech, noise, and reverberation, yielding a noise-suppressed speech digital signal; a reverberation estimation unit 120 that estimates the reverberation component contained in the noise-suppressed speech digital signal, yielding a reverberation signal; a main speaker speech feature enhancement unit 140 that obtains a noise- and reverberation-suppressed speech digital signal as the difference between the noise-suppressed speech digital signal and the reverberation signal; and a main speaker identification unit 150 that uses the speech model to identify, from the noise- and reverberation-suppressed speech digital signal, the main-speaker speech segment, i.e., the segment in which the main speaker is talking.
COPYRIGHT: (C)2015,JPO&INPIT

Description

The present invention relates to a technique for detecting speech segments in a speech digital signal, and to a speech recognition technique applied to the detected speech segments.

Patent Document 1 is known as a noise removal technique in which information such as parameters is shared closely between speech-segment estimation and noise removal; by handling the two in an integrated manner, it enables highly accurate speech-segment estimation and noise removal. However, Patent Document 1 takes no account of main-speaker speech versus other-speaker speech. Here, the main speaker is the person of interest, and main-speaker speech is that person's speech; an other speaker is any person other than the main speaker, and other-speaker speech is such a person's speech.

Non-Patent Document 1 is known as a technique that, in a mobile environment, focuses on the energy ratio between the direct sound and the reflected sound of other-speaker speech, and detects the main-speaker speech segment from reverberation components estimated by multi-step linear prediction.

Patent Document 1: JP 2009-210647 A

Non-Patent Document 1: Noriyoshi Kamado, Satoshi Kobashikawa, Keisuke Kinoshita, Hirokazu Masataki, Satoshi Takahashi, "Application of a dereverberation method to main-speaker speech segment detection in mobile speech recognition," Proceedings of the Acoustical Society of Japan Meeting, 2013, pp. 145-146.

However, although Patent Document 1 can distinguish speech from non-speech sound, it cannot separate other-speaker speech from main-speaker speech and judges all speech as "speech segments"; consequently, it cannot estimate the main-speaker speech segment.

In Non-Patent Document 1, when the input signal contains noise, the accuracy of reverberation estimation degrades under the influence of that noise, and the extraction accuracy for main-speaker speech can degrade as a result. A configuration that performs noise suppression before reverberation estimation is also conceivable, but because noise suppression, reverberation suppression, and speech segment detection would then be carried out independently, the noise may be suppressed too strongly, degrading the reverberation estimate and, in turn, the detection accuracy for main-speaker speech.

An object of the present invention is to provide a speech segment detection technique that can recognize main-speaker speech with high accuracy, not only in a quiet environment but also in a high-noise environment, from multi-speaker speech (other-speaker speech in addition to main-speaker speech) mixed into a single microphone in a real environment.

To solve the above problem, according to one aspect of the present invention, a speech segment detection device includes: a noise suppression unit that uses a speech model to suppress the noise contained in a speech digital signal comprising speech, noise, and reverberation, obtaining a noise-suppressed speech digital signal; a reverberation estimation unit that estimates the reverberation component contained in the noise-suppressed speech digital signal, obtaining a reverberation signal; a main speaker speech feature enhancement unit that obtains a noise- and reverberation-suppressed speech digital signal as the difference between the noise-suppressed speech digital signal and the reverberation signal; and a main speaker identification unit that uses the speech model to identify, from the noise- and reverberation-suppressed speech digital signal, the main-speaker speech segment, i.e., the segment in which the main speaker is talking.
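The four units above form a simple processing chain. The sketch below illustrates only the data flow; every function body is a hypothetical stand-in (energy thresholding, a toy delayed echo, mean removal), not the patent's model-based processing:

```python
import numpy as np

def suppress_noise(x):
    """Stand-in for the noise suppression unit 110 (s110)."""
    return x - np.mean(x)  # hypothetical placeholder for model-based suppression

def estimate_reverberation(x, delay=160):
    """Stand-in for the reverberation estimation unit 120 (s120)."""
    d = np.zeros_like(x)
    d[delay:] = 0.3 * x[:-delay]  # toy "late reverberation" estimate
    return d

def emphasize_main_speaker(x, d, gain=0.9):
    """Feature enhancement unit 140 (s140): difference between the
    noise-suppressed signal and the gain-adjusted reverberation signal."""
    return x - gain * d

def identify_main_speaker(y, frame=160, threshold=0.01):
    """Stand-in for the main speaker identification unit 150 (s150):
    per-frame energy thresholding replaces model-based likelihoods."""
    frames = y[: len(y) // frame * frame].reshape(-1, frame)
    return (frames ** 2).mean(axis=1) > threshold

x = np.random.randn(16000) * 0.1     # 1 s of synthetic input at 16 kHz
x_ns = suppress_noise(x)             # s110
d = estimate_reverberation(x_ns)     # s120
y = emphasize_main_speaker(x_ns, d)  # s130 + s140
segments = identify_main_speaker(y)  # s150: one boolean per 10 ms frame
```

The chain returns a per-frame boolean mask; in the actual device, the identification step reuses the speech model rather than a fixed energy threshold.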

To solve the above problem, according to another aspect of the present invention, a speech segment detection method includes: a noise suppression step in which a noise suppression unit uses a speech model to suppress the noise contained in a speech digital signal comprising speech, noise, and reverberation, obtaining a noise-suppressed speech digital signal; a reverberation estimation step in which a reverberation estimation unit estimates the reverberation component contained in the noise-suppressed speech digital signal, obtaining a reverberation signal; a main speaker speech feature enhancement step in which a main speaker speech feature enhancement unit obtains a noise- and reverberation-suppressed speech digital signal as the difference between the noise-suppressed speech digital signal and the reverberation signal; and a main speaker identification step in which a main speaker identification unit uses the speech model to identify, from the noise- and reverberation-suppressed speech digital signal, the main-speaker speech segment, i.e., the segment in which the main speaker is talking.

For multi-speaker speech mixed into a single microphone in a real environment, main-speaker speech can be recognized with high accuracy not only in a quiet environment but also in a high-noise environment.

FIG. 1: Functional block diagram of the speech segment detection device according to the first embodiment.
FIG. 2: Example of the processing flow of the speech segment detection device according to the first embodiment.
FIG. 3: Functional block diagram of the speech-segment-detection noise suppression unit according to the first embodiment.
FIG. 4: Example of the processing flow of the speech-segment-detection noise suppression unit according to the first embodiment.
FIG. 5A: Image of the speech analog signal corresponding to the main speaker's voice; FIG. 5B: image of the speech analog signal corresponding to non-main-speaker sound.
FIG. 6A: Vector image of the reverberation signal corresponding to the main speaker's voice; FIG. 6B: vector image of the reverberation signal corresponding to non-main-speaker sound.
FIG. 7A: Vector image of the difference corresponding to the main speaker's voice; FIG. 7B: vector image of the difference corresponding to non-main-speaker sound.
FIG. 8: Diagram explaining the arrangement of the speech segment detection device according to the first embodiment and the speech recognition device according to the second embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components with the same function and steps performing the same processing are given the same reference numerals, and duplicate explanation is omitted. Unless otherwise noted, processing described per element of a vector or matrix applies to every element of that vector or matrix.

<Points of the first embodiment>
The main-speaker speech contained in the noise-suppressed signal produced by the processing of Patent Document 1 is emphasized using a reverberation estimation method, for example that of Reference 1.
[Reference 1] International Publication No. WO 2007/100137
Reference 1 exploits the fact that, in a mobile environment, the reverberation components contained in main-speaker speech and in other-speaker speech recorded with a single microphone differ substantially, so the main speaker's voice can be emphasized with high accuracy. Moreover, because the input is not the raw single-microphone signal but a signal in which the noise component has been suppressed, reverberation estimation is more accurate than when the original signal is processed as-is.

Furthermore, in this embodiment, a signal in which both the noise and the reverberation of the original signal have been suppressed is input to the noise removal device of Patent Document 1, and the speech-segment decision is performed once more.

With this configuration, speech segments can be computed statistically with a speech model, based on the speech features of a signal in which the main speaker's voice has been accurately emphasized; this permits main-speaker speech segment detection with higher accuracy than thresholding the reverberation signal (a log signal obtained by smoothing its power), as in Non-Patent Document 1.

<Speech segment detection device according to the first embodiment>
FIG. 1 is a functional block diagram of the speech segment detection device 10, and FIG. 2 shows an example of its processing flow. The speech segment detection device 10 can recognize the main speaker's voice with high accuracy by removing other-speaker speech, silence, and noise segments from the input speech (hereinafter also called the "speech analog signal") used for speech recognition.

First, the main speaker's speech segment contained in the signal input to a single microphone is extracted from the reverberation components contained in that input signal. The main-speaker speech is then converted into speech features, and with these as input a speech likelihood is computed together with a speech model for speech segment detection, extracting the main speaker's speech segment with high accuracy in a statistical framework. Using this speech segment in turn enables highly accurate recognition of the main-speaker speech. Ordinarily, speech and noise from sources other than the main speaker are not judged as non-speech; they are fed into recognition, and the recognition result wells up as misrecognition. Extracting only the main speaker's speech segment with high accuracy therefore reduces the adverse effect of out-of-target speech and noise on the speech recognition system.

The speech segment detection device 10 includes a speech signal acquisition unit 100, a speech-segment-detection noise suppression unit 110, a reverberation estimation unit 120, a gain adjustment unit 130, a main speaker speech feature enhancement unit 140, and a main speaker speech segment extraction unit 160; the speech-segment-detection noise suppression unit 110 includes a main speaker identification unit 150. The device receives the speech analog signal picked up by the microphone 90 and outputs a speech digital signal that corresponds to the main speaker's utterance segments and in which noise components are suppressed (hereinafter also called the "noise-suppressed speech digital signal"). Each unit is described in detail below.

<Speech signal acquisition unit 100>
Input: speech analog signal
Output: speech digital signal
The speech signal acquisition unit 100 receives an analog speech signal (speech analog signal), converts it into a digital speech signal (speech digital signal) (s100), and outputs it.

<Speech-segment-detection noise suppression unit 110>
Input: speech digital signal
Output: noise-suppressed speech digital signal
The speech-segment-detection noise suppression unit 110 receives the speech digital signal, suppresses the noise it contains using a speech model to obtain a noise-suppressed speech digital signal (s110), and outputs it to the main speaker speech feature enhancement unit 140, the reverberation estimation unit 120, and the main speaker speech segment extraction unit 160.

For example, the speech-segment-detection noise suppression unit 110 performs noise suppression and speech segment detection simultaneously and is realized by the noise removal device of Patent Document 1. An overview of its processing follows.

FIG. 3 is a functional block diagram of the speech-segment-detection noise suppression unit 110, and FIG. 4 shows an example of its processing flow.

The speech-segment-detection noise suppression unit 110 includes an acoustic signal analysis unit 111, a model parameter storage unit 112, a forward estimation unit 113, a backward estimation unit 114, a parameter storage unit 115, a state probability ratio calculation unit 116, a speech signal section estimation unit 117, and a noise removal unit 118.

The acoustic signal analysis unit 111 receives the speech digital signal, extracts its speech features frame by frame (a frame being a fixed time interval), and outputs them (s111).

Prior to use, the model parameter storage unit 112 stores, as the speech model mentioned above, the parameters of probability models that express the output probabilities of a clean speech signal and of a silence signal, each as a Gaussian mixture containing multiple normal distributions.

The forward estimation unit 113 receives the speech features and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit 112, sequentially estimates the noise model parameters of the current frame, from past frames toward the current frame, with parallel non-linear Kalman filters, and outputs the result (s113).

The backward estimation unit 114 receives the noise model parameters output by the forward estimation unit 113 and the probability model parameters of the clean speech signal and the silence signal stored in the model parameter storage unit 112. It sequentially estimates the noise model parameters of the current frame backward, from future frames toward the current frame, with a parallel Kalman smoother; based on these backward-estimated noise model parameters, it sequentially estimates the parameters of probability models that express the output probabilities of the speech (noise + clean speech) signal and the non-speech (noise + silence) signal, each as a Gaussian mixture, and computes and outputs the output probabilities of the speech signal and the non-speech signal (s114).

The parameter storage unit 115 stores the intermediate results produced in the processing of the forward estimation unit 113 and the backward estimation unit 114 (s115).

The state probability ratio calculation unit 116 receives the output probabilities of the speech and non-speech signals, computes the speech state probability, the non-speech state probability, and the ratio of the speech state probability to the non-speech state probability, and outputs them (s116).

The speech signal section estimation unit 117 receives the state probability ratio, compares it with a threshold for each frame, and outputs a decision indicating whether each frame belongs to the speech state or the non-speech state (s117).
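The per-frame decision of s117 reduces to comparing the speech/non-speech state probability ratio with a threshold. A minimal sketch of that decision rule; the threshold value and the epsilon guard are illustrative assumptions:

```python
import numpy as np

def vad_decision(p_speech, p_nonspeech, threshold=1.0, eps=1e-12):
    """Label a frame as speech (True) when the ratio of the speech-state
    probability to the non-speech-state probability exceeds the threshold."""
    p_speech = np.asarray(p_speech, dtype=float)
    p_nonspeech = np.asarray(p_nonspeech, dtype=float)
    ratio = p_speech / (p_nonspeech + eps)  # guard against division by zero
    return ratio > threshold

# three frames: clearly speech, clearly non-speech, weakly speech
flags = vad_decision([0.9, 0.2, 0.6], [0.1, 0.8, 0.4])
```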

The noise removal unit 118 receives the per-distribution means that are the probability model parameters of the speech and non-speech signals, the per-distribution means that are the probability model parameters of the clean speech and silence signals, and the speech and non-speech state probabilities. It takes the relative values of the clean-speech and silence means with respect to the speech and non-speech means, forms their weighted average using the speech and non-speech state probabilities to generate a frequency response filter that removes the noise signal, converts the frequency response filter into an impulse response filter, and convolves this impulse response filter with the speech digital signal to generate and output the noise-suppressed speech digital signal (s118).
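The filter construction in s118 can be sketched as follows: a per-frequency gain is the weighted average, by state probability, of the ratios of the clean/silence model means to the speech/non-speech model means; the gain is converted to an impulse response and convolved with the signal. All numeric values, shapes, and the flat spectra below are illustrative assumptions, not the patent's model parameters:

```python
import numpy as np

def build_and_apply_filter(x, mu_clean, mu_speech, mu_silence, mu_nonspeech,
                           p_speech, p_nonspeech):
    """Sketch of s118: probability-weighted per-bin gains, applied via the
    corresponding impulse response."""
    gain = p_speech * (mu_clean / mu_speech) + p_nonspeech * (mu_silence / mu_nonspeech)
    h = np.fft.irfft(gain)            # frequency response -> impulse response
    # full convolution truncated to the input length keeps alignment with x
    return np.convolve(x, h)[:len(x)]

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)         # stand-in speech digital signal
bins = 129                            # e.g. a 256-point FFT
mu_clean = np.full(bins, 1.0)
mu_speech = np.full(bins, 2.0)        # noisy-speech mean exceeds clean mean
mu_silence = np.full(bins, 0.1)
mu_nonspeech = np.full(bins, 1.0)
y = build_and_apply_filter(x, mu_clean, mu_speech, mu_silence, mu_nonspeech,
                           p_speech=0.8, p_nonspeech=0.2)
```

With the flat spectra above the gain is 0.8·0.5 + 0.2·0.1 = 0.42 in every bin, so the filter reduces to a simple attenuation; with frequency-dependent means it shapes the spectrum per bin.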

In Patent Document 1, noise removal is applied only to frames belonging to the speech state, based on the decision of the speech signal section estimation unit 117, so the processing of that unit is required; when noise removal is applied to all frames, the processing of the speech signal section estimation unit 117 may be omitted.

The speech-segment-detection noise suppression unit 110 includes the main speaker identification unit 150, whose processing is described later.

<Reverberation estimation unit 120>
Input: noise-suppressed speech digital signal
Output: reverberation signal
The reverberation estimation unit 120 estimates the reverberation component contained in the noise-suppressed speech digital signal (s120) and obtains the reverberation signal. An outline of the estimation method follows.

As shown in Equation (1), the original speech signal s(z) is a white signal u(z) passed through a short auto-regressive (AR) process. Let the Z-transform of the AR process be v(z) = 1/(1-b(z)), with 1-b(z) a polynomial.

s(z) = v(z) u(z) = u(z) / (1 - b(z))   (1)

This original speech signal s(z) propagates through the room, and the signal x(z) observed at the microphone is expressed from Equation (1) as follows.

x(z) = h(z) s(z) = h(z) v(z) u(z)   (2)

Here, h(z) denotes the room transfer function from the sound source to the microphone. The speech signal has a strong short-term correlation governed by v(z). By applying the pre-whitening process of Equation (3), a linear prediction that removes this short-term correlation, v(z) can be regarded as a nearly white signal, so v(z) ≈ 1 holds.

x~(n) = x(n) - Σ_{p=1}^{P} b(p) x(n-p)   (3)
(where P is the short-term prediction order)

Here, b(p) is a linear prediction coefficient that effectively suppresses v(z), obtained from Equation (4).

Σ_{p=1}^{P} b(p) r(|i-p|) = r(i),   i = 1, ..., P   (4)

Here, r(i) denotes the autocorrelation coefficient of the observed signal x(z) at a lag of i samples. This linear prediction is performed with a filter length of 30 ms and is expected to remove the speech's short-term correlation and the early-reflection components contained within 30 ms.
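The pre-whitening step can be sketched with NumPy: estimate the short-term coefficients b(p) from the autocorrelation of the observed signal by solving the normal equations directly, then subtract the predicted part. The prediction order below is an illustrative assumption (30 ms at 16 kHz would correspond to an order of several hundred; a small order keeps the example light):

```python
import numpy as np

def prewhiten(x, order=16):
    """Remove short-term correlation: x~(n) = x(n) - sum_p b(p) x(n-p)."""
    # autocorrelation r(0), ..., r(order)
    r = np.array([np.dot(x[: len(x) - i], x[i:]) for i in range(order + 1)])
    # normal equations R b = r(1..order), with R[i][j] = r(|i-j|)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    b = np.linalg.solve(R, r[1 : order + 1])
    # predict x(n) from its immediate past, then keep the whitened residual
    pred = np.zeros_like(x)
    for p in range(1, order + 1):
        pred[p:] += b[p - 1] * x[:-p]
    return x - pred

rng = np.random.default_rng(1)
# a strongly correlated AR(1) signal: whitening should shrink lag-1 correlation
x = np.zeros(4000)
for n in range(1, 4000):
    x[n] = 0.9 * x[n - 1] + rng.standard_normal()
xw = prewhiten(x)
```

For the AR(1) input, the solved coefficients are close to (0.9, 0, ..., 0) and the residual is nearly white.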

With D the step size (delay) and L the filter length, the reverberation signal d(n) can be formulated as follows.

d(n) = Σ_{l=0}^{L-1} a(l) x~(n - D - l)   (5)

Here, a(l) (Roman letter l) is a linear prediction coefficient, and x~(n) is the pre-whitened observation obtained by Equation (3). The z-transform a(z) of a(l) is given by Equation (6).

[Equation (6), given as an image in the original publication, expresses a(z) in terms of h(z), its minimum-phase component h_min(z), and min[h_max(z)].]

Here, h_min(z) and h_max(z) denote the minimum-phase component of h(z) (the component corresponding to zeros inside the unit circle of the Z-plane) and the maximum-phase component (the component corresponding to zeros outside the unit circle), respectively, and min[h_max(z)] denotes a function that converts h_max(z) to minimum phase.

In general, D is set to a value corresponding to 10-200 ms, and L to a value corresponding to 100-500 ms. This method is described in detail in, for example, Reference 1.
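Equation (5) is a delayed ("multi-step") linear prediction: the current sample is predicted only from samples at least D in the past, so the prediction captures late reverberation while the direct sound, which cannot be predicted across the gap, remains in the residual. A small least-squares sketch, with D and L far below the 10-200 ms / 100-500 ms values quoted above so the example stays light:

```python
import numpy as np

def multistep_lp_reverb(x, D=32, L=64):
    """Estimate d(n) = sum_{l=0}^{L-1} a(l) x(n - D - l), fitting a(l)
    by least squares over the available samples."""
    N = len(x)
    rows = range(D + L - 1, N)
    # data matrix: column l holds the delayed samples x(n - D - l)
    X = np.array([[x[n - D - l] for l in range(L)] for n in rows])
    y = x[D + L - 1:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    d = np.zeros(N)
    d[D + L - 1:] = X @ a      # reverberation estimate, aligned with x
    return d

rng = np.random.default_rng(3)
u = rng.standard_normal(4000)  # "direct" white excitation
x = u.copy()
x[48:] += 0.6 * u[:-48]        # one late echo at lag 48, inside [D, D+L-1]
d = multistep_lp_reverb(x)
residual = x - d               # reverberation-suppressed signal
```

Subtracting the predicted d(n) removes most of the echo energy while the unpredictable direct component survives in the residual.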

Using the method above or another existing reverberation estimation technique, the reverberation estimation unit 120 estimates the reverberation component contained in the noise-suppressed speech digital signal x(n) and obtains the reverberation signal d(n).

<Gain adjustment unit 130>
Input: reverberation signal
Output: gain-adjusted reverberation signal
The gain adjustment unit 130 receives the reverberation signal, multiplies it by a gain G (s130), and outputs the gain-adjusted reverberation signal. G is a value larger than 0 and smaller than 1, for example 0.8 to 1.0. This reduces the distortion that arises when the main speaker speech feature enhancement unit 140, described below, takes the difference between the noise-suppressed speech digital signal and the reverberation signal.
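The gain adjustment and the subsequent difference in the feature enhancement unit amount to y(n) = x(n) - G·d(n) with 0 < G < 1. A minimal sketch, with G = 0.9 as an example value in the quoted range:

```python
import numpy as np

def emphasize(x, d, G=0.9):
    """Subtract the gain-adjusted reverberation estimate from the
    noise-suppressed signal; G < 1 limits over-subtraction distortion."""
    if not 0.0 < G < 1.0:
        raise ValueError("G must lie strictly between 0 and 1")
    return x - G * d

x = np.array([1.0, 0.5, 0.25])   # toy noise-suppressed samples
d = np.array([0.2, 0.2, 0.2])    # toy reverberation estimate
y = emphasize(x, d)
```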

<Main speaker speech feature enhancement unit 140>
Input: noise-suppressed speech digital signal, gain-adjusted reverberation signal
Output: noise- and reverberation-suppressed speech digital signal
The main speaker speech feature enhancement unit 140 receives the noise-suppressed speech digital signal and the gain-adjusted reverberation signal, computes their difference (s140), and outputs it as the noise- and reverberation-suppressed speech digital signal. This signal may also be described as a speech digital signal in which the main-speaker speech has been emphasized.

FIG. 5A shows an image of the speech analog signal corresponding to the main speaker's voice, and FIG. 5B an image of the signal corresponding to another speaker's voice. As FIG. 5A shows, the signal for the main speaker's voice has a large direct sound D and a small reflected sound R (reverberation component); as FIG. 5B shows, the signal for another speaker's voice has a small direct sound D and a large reflected sound R (reverberation component). FIGS. 6A and 6B show vector images of the reverberation signal corresponding to the main speaker's voice and to another speaker's voice, respectively, and FIGS. 7A and 7B show vector images of the corresponding differences. The small arrows in FIG. 7 represent the reverberation component R' that could not be fully removed.

The main speaker speech feature enhancement unit 140 performs reverberation suppression by calculating the difference between the gain-adjusted reverberation signal and the noise-suppressed speech digital signal. As a result of this subtraction, the features of the main speaker's speech are emphasized (see FIG. 7), and the feature values of the main speaker's speech can be extracted from the speech digital signal with high accuracy.

Here, the characteristic of the main speaker's speech is that the difference between the mel-spectral features of the signal before and after dereverberation is small. By contrast, the features of another speaker's speech either change greatly under dereverberation or, when the direct wave is unclear, cannot be dereverberated at all because of the characteristics of the dereverberation method, in which case the output is silenced. Consequently, in the mel-spectral features of the dereverberated signal, the difference between the main speaker and other speakers is emphasized. Since the mel spectrum is a well-known technique, its description is omitted here. Put differently, for another speaker's speech the mel-spectral feature difference between the signal before and after dereverberation either becomes large (because dereverberation changes it greatly) or becomes 0 (because the reverberation cannot be estimated, the estimated reverberation is silenced and the signals before and after dereverberation are exactly the same). A small difference (main speaker) is clearly distinct from a large or zero difference (another speaker), so the difference between the main speaker and other speakers is emphasized.
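The discrimination just described can be illustrated with a toy frame classifier. This is only a sketch: the `small` and `large` thresholds and the use of a mean absolute mel-feature difference are illustrative assumptions, not values from the patent.

```python
import numpy as np

def classify_speaker(feat_before, feat_after, small=1e-6, large=0.1):
    """Classify a frame from the mel-feature difference before/after
    dereverberation: a zero difference (frame unchanged because reverberation
    could not be estimated) or a large difference means another speaker,
    while a small nonzero difference means the main speaker."""
    diff = float(np.mean(np.abs(np.asarray(feat_after) - np.asarray(feat_before))))
    if diff <= small:  # signals before/after dereverberation are identical
        return "other"
    return "main" if diff < large else "other"
```

With hypothetical feature vectors, an unchanged frame or a strongly changed frame maps to "other", and a slightly changed frame maps to "main".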

The processing of the gain adjustment unit 130 and the main speaker speech feature enhancement unit 140 can be realized jointly by a known technique called the spectral subtraction method (see Reference 2).
[Reference 2] Boll, S. F., "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Trans. Acoust., Speech, Signal Processing, 1979, vol. ASSP-27, pp. 113-120

<Main speaker identification unit 150>
Input: noise-reverberation-suppressed speech digital signal
Output: section information
The main speaker identification unit 150 receives the noise-reverberation-suppressed speech digital signal, identifies the main speaker speech sections from it using a speech model (s150), and outputs the identification result as section information. For example, the main speaker identification unit 150 includes an acoustic signal analysis unit 111, a model parameter storage unit 112, a forward estimation unit 113, a backward estimation unit 114, a parameter storage unit 115, a state probability ratio calculation unit 116, and a speech signal section estimation unit 117. Steps s111 to s117 are performed using the noise-reverberation-suppressed speech digital signal instead of the speech digital signal (see FIG. 4); the speech signal section estimation unit 117 in the main speaker identification unit 150 receives the state probability ratio, compares it with a threshold for each frame, and outputs, as section information, a determination result indicating whether each frame belongs to the speech state or the non-speech state (s117).
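The per-frame thresholding of the state probability ratio in s117, and the collapsing of the per-frame decisions into section information, might be sketched as follows. The function names and the default threshold are hypothetical.

```python
def label_frames(state_prob_ratios, threshold=1.0):
    """Per-frame decision: speech (1) when the speech/non-speech state
    probability ratio exceeds the threshold, non-speech (0) otherwise."""
    return [1 if r > threshold else 0 for r in state_prob_ratios]

def frames_to_sections(labels):
    """Collapse per-frame 0/1 labels into (start_frame, end_frame) pairs,
    end exclusive, as a simple form of section information."""
    sections, start = [], None
    for i, v in enumerate(labels):
        if v and start is None:
            start = i
        elif not v and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:  # speech ran to the final frame
        sections.append((start, len(labels)))
    return sections
```

For example, ratios `[0.5, 2.0, 3.0, 0.2, 4.0]` would yield two speech sections, frames 1-2 and frame 4.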

Since the main speaker's speech is emphasized in the noise-reverberation-suppressed speech digital signal, only frames corresponding to the main speaker's speech tend to be judged as belonging to the speech state, while frames corresponding to non-main-speaker sounds, including other speakers' speech, tend to be judged as belonging to the non-speech state. Here, non-main-speaker sound means any sound other than the main speaker's speech, including other speakers' speech, noise, and silence.

As a result, the noise-reverberation-suppressed speech digital signal, in which the features of the main speaker's speech are emphasized, is input to the noise removal device of Patent Document 1 (more specifically, to the acoustic signal analysis unit 111). Based on speech features in which the main speaker's speech is accurately emphasized, the speech sections can be statistically recomputed using the speech model, enabling main speaker speech section detection with higher accuracy than simple threshold calculation.

Since a delay arises in the reverberation calculation of the reverberation estimation unit 120, the times of the output speech sections are rewound by this delay to compensate for it.
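This delay compensation amounts to shifting every detected section back by the known delay. A sketch, with times in arbitrary samples and the clamping at zero being an illustrative assumption:

```python
def rewind_sections(sections, delay):
    """Shift detected (start, end) section times back by the
    reverberation-estimation delay, clamping at time zero."""
    return [(max(0, s - delay), max(0, e - delay)) for s, e in sections]
```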

In this way, by using a model similar to the speech model used in noise suppression, the speech signal section estimation technique and the noise suppression technique are handled in an integrated manner, achieving highly accurate speech section estimation and noise suppression.

<Main speaker speech section extraction unit 160>
Input: noise-suppressed speech digital signal, section information
Output: noise-suppressed speech digital signal corresponding to the voice of the main speaker
The main speaker speech section extraction unit 160 receives the noise-suppressed speech digital signal and the section information, extracts the portion corresponding to the main speaker's voice from the noise-suppressed speech digital signal using the section information (s160), and outputs it as the output value of the speech section detection device 10.

For example, when a start time and an end time are used as the section information, a value of 1 is assigned to the samples between the start time and the end time. Further, to secure a margin around the start and end times, margin processing is performed that additionally assigns 1 to the N samples (a sample length corresponding to 0.1 to 0.4 ms) immediately before the start time and to the M samples (a sample length corresponding to 0.1 to 0.4 ms) immediately after the end time at which the value switches from 1 to 0. The main speaker's speech can then be extracted by multiplying the noise-suppressed speech digital signal, sample by sample, by this margin-processed main speaker speech section (that is, a time-sample sequence that is 1 from N samples before the start time to M samples after the end time and 0 elsewhere).
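The margin processing and sample-wise multiplication described above can be sketched as below; the function name is illustrative, and `start`, `end`, and the margins are given in samples.

```python
import numpy as np

def extract_main_speaker(signal, start, end, n_margin, m_margin):
    """Build a 0/1 time-sample mask that is 1 from n_margin samples before
    `start` to m_margin samples after `end` (clamped to the signal bounds),
    and multiply it into the noise-suppressed signal sample by sample."""
    signal = np.asarray(signal, dtype=float)
    mask = np.zeros(len(signal))
    lo = max(0, start - n_margin)
    hi = min(len(signal), end + m_margin)
    mask[lo:hi] = 1.0
    return signal * mask
```

Everything outside the margin-extended section is zeroed, leaving only the main speaker's portion of the signal.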

When a signal in which main-speaker-speech-section flags are attached to the noise-suppressed speech digital signal is used as the section information, margin processing is applied to that signal (that is, the main-speaker-speech-section flag is also attached to the N samples and M samples of the noise-suppressed speech digital signal at the beginning and end of the section, respectively), and the noise-suppressed speech digital signal corresponding to the flagged portions is extracted. When a signal in which non-main-speaker-sound-section flags are attached to the noise-suppressed speech digital signal is used as the section information, margin processing is applied to the unflagged portions of the noise-suppressed speech digital signal, and the noise-suppressed speech digital signal corresponding to the portions without the non-main-speaker-sound-section flag is extracted.
<Effect>
With respect to the speech of multiple speakers mixed into a single microphone in a real environment, the main speaker's speech can be recognized with high accuracy not only in quiet environments but also in high-noise environments. As a result, the number of microphones can be reduced, and the hardware configuration can be made lighter.
<Modification>
The main speaker speech section extraction unit 160 may use the original speech digital signal or the noise-reverberation-suppressed speech digital signal instead of the noise-suppressed speech digital signal. Even in that case, the main speaker's speech can be extracted. However, when speech recognition is performed in a later stage, the recognition accuracy is considered to be highest when the noise-suppressed speech digital signal is used.

When the speech section detection device 10 receives a speech digital signal as its input signal, it does not necessarily need to include the speech signal acquisition unit 100.

The speech section detection device 10 does not necessarily need to include the gain adjustment unit 130. In that case, the main speaker speech feature enhancement unit 140 uses the reverberation signal as-is, without gain adjustment.

The speech section detection device 10 does not necessarily need to include the main speaker speech section extraction unit 160. In that case, the output value (section information) of the main speaker identification unit 150 is output as the output value of the speech section detection device 10.

The main speaker identification unit 150 need not be a part of the speech section detection noise suppression unit 110. The essential point is that, by using a model similar to the speech model used in noise suppression, the speech signal section estimation technique and the noise suppression technique can be handled in an integrated manner.

<Second embodiment>
FIG. 8 is a diagram for explaining the arrangement of the speech section detection device 10 and the speech recognition device 800. The speech section detection device 10 is placed in front of the speech recognition device 800.

The speech recognition device 800 takes a speech signal as input and performs speech recognition using the signal obtained by the speech section detection device 10 described above. Here, "speech signal" is a concept that includes the speech analog signal, the speech digital signal, the noise-suppressed speech digital signal, and the noise-reverberation-suppressed speech digital signal.

For example, the speech recognition device 800 receives the noise-suppressed speech digital signal corresponding to the main speaker's voice obtained by the speech section detection device 10 described above and outputs the speech recognition result.

Alternatively, for example, the section information may be output as the output value of the speech section detection device 10 (see the modification of the first embodiment). When the start and end times of at least one of the main speaker speech section and the non-main-speaker sound section are used as the section information, speech recognition is performed on the noise-suppressed speech digital signal corresponding to the section information, and the speech recognition result is output.

Also, for example, when the section information is output as the output value of the speech section detection device 10 (see the modification of the first embodiment) and a signal in which flags for at least one of the main speaker speech section and the non-main-speaker sound section are attached to the speech digital signal is used as the section information, speech recognition is performed on the noise-suppressed speech digital signal flagged as a main speaker speech section, and the speech recognition result is output.
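The overall arrangement of the second embodiment, a detector placed in front of a recognizer, can be sketched as below. Here `detect_sections` and `recognize` stand in for the speech section detection device 10 and the recognition back end; both are purely hypothetical placeholders, not APIs from the patent.

```python
def recognize_main_speaker(signal, detect_sections, recognize):
    """Second-embodiment pipeline sketch: run the speech section detector
    first, then apply the recognizer only to the detected main speaker
    sections of the (noise-suppressed) signal."""
    results = []
    for start, end in detect_sections(signal):
        results.append(recognize(signal[start:end]))
    return results
```

Because the recognizer only ever sees the detected main speaker sections, non-main-speaker sound, silence, and noise never reach it.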

In this way, by using the speech digital signal corresponding to the main speaker's voice (the noise-suppressed speech digital signal or the noise-reverberation-suppressed speech digital signal) or the section information obtained by the speech section detection device 10, non-main-speaker sound, silence, noise, and the like can be removed from the input speech (speech signal) used for speech recognition, and speech recognition can be performed only on the main speaker's speech, improving its accuracy. Normally, non-main-speaker sound and noise are not judged as non-speech and are passed to the recognizer, so spurious recognition results emerge as misrecognitions; by detecting only the main speaker speech sections with high accuracy, the speech section detection device 10 can reduce the adverse effects of out-of-target speech and noise on the speech recognition system.
<Other modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various processes described above may be executed not only in time series as described, but also in parallel or individually according to the processing capability of the device executing them or as needed. Other changes can be made as appropriate without departing from the spirit of the present invention.
<Program and recording medium>
The various processing functions of each device described in the above embodiments and modifications may also be realized by a computer. In that case, the processing content of the functions each device should have is described by a program, and by executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Furthermore, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or transferred from a server computer in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the read program. As another embodiment of the program, the computer may read the program directly from the portable recording medium and execute processes according to it. Furthermore, each time a program is transferred from the server computer to this computer, a process according to the received program may be executed successively. Alternatively, the above processes may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to this computer and the processing functions are realized only through execution instructions and result acquisition. The program here includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In addition, although each device is configured by executing a predetermined program on a computer, at least part of these processing contents may be realized in hardware.

Claims (4)

1. A speech section detection device comprising:
a noise suppression unit that, using a speech model, suppresses noise contained in a speech digital signal containing speech, noise, and reverberation to obtain a noise-suppressed speech digital signal;
a reverberation estimation unit that estimates a reverberation component contained in the noise-suppressed speech digital signal to obtain a reverberation signal;
a main speaker speech feature enhancement unit that obtains a noise-reverberation-suppressed speech digital signal, which is the difference between the noise-suppressed speech digital signal and the reverberation signal; and
a main speaker identification unit that, using the speech model, identifies from the noise-reverberation-suppressed speech digital signal a main speaker speech section, which is a section in which the main speaker is speaking.

2. A speech recognition device that receives a speech signal as input and performs speech recognition on the speech signal using a signal output from the speech section detection device according to claim 1.

3. A speech section detection method comprising:
a noise suppression step in which a noise suppression unit, using a speech model, suppresses noise contained in a speech digital signal containing speech, noise, and reverberation to obtain a noise-suppressed speech digital signal;
a reverberation estimation step in which a reverberation estimation unit estimates a reverberation component contained in the noise-suppressed speech digital signal to obtain a reverberation signal;
a main speaker speech feature enhancement step in which a main speaker speech feature enhancement unit obtains a noise-reverberation-suppressed speech digital signal, which is the difference between the noise-suppressed speech digital signal and the reverberation signal; and
a main speaker identification step in which a main speaker identification unit, using the speech model, identifies from the noise-reverberation-suppressed speech digital signal a main speaker speech section, which is a section in which the main speaker is speaking.

4. A program for causing a computer to function as the speech section detection device according to claim 1 or the speech recognition device according to claim 2.
JP2014031276A 2014-02-21 2014-02-21 Speech section detection device, speech recognition device, method thereof, and program Active JP6106618B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014031276A JP6106618B2 (en) 2014-02-21 2014-02-21 Speech section detection device, speech recognition device, method thereof, and program

Publications (2)

Publication Number Publication Date
JP2015155982A true JP2015155982A (en) 2015-08-27
JP6106618B2 JP6106618B2 (en) 2017-04-05

Family

ID=54775315

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2014031276A Active JP6106618B2 (en) 2014-02-21 2014-02-21 Speech section detection device, speech recognition device, method thereof, and program

Country Status (1)

Country Link
JP (1) JP6106618B2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007100137A1 (en) * 2006-03-03 2007-09-07 Nippon Telegraph And Telephone Corporation Reverberation removal device, reverberation removal method, reverberation removal program, and recording medium
JP2009210647A (en) * 2008-02-29 2009-09-17 Nippon Telegr & Teleph Corp <Ntt> Noise canceler, method thereof, program thereof and recording medium
US20130218560A1 (en) * 2012-02-22 2013-08-22 Htc Corporation Method and apparatus for audio intelligibility enhancement and computing apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017067862A (en) * 2015-09-28 2017-04-06 富士通株式会社 Voice signal processor, voice signal processing method and program
CN110853622A (en) * 2019-10-22 2020-02-28 深圳市本牛科技有限责任公司 Method and system for sentence segmentation by voice
CN110853622B (en) * 2019-10-22 2024-01-12 深圳市本牛科技有限责任公司 Voice sentence breaking method and system

Also Published As

Publication number Publication date
JP6106618B2 (en) 2017-04-05

Similar Documents

Publication Publication Date Title
JP6553111B2 (en) Speech recognition apparatus, speech recognition method and speech recognition program
CN107910011B (en) Voice noise reduction method and device, server and storage medium
JP4842583B2 (en) Method and apparatus for multisensory speech enhancement
EP2381702B1 (en) Systems and methods for own voice recognition with adaptations for noise robustness
KR20170060108A (en) Neural network voice activity detection employing running range normalization
CN111370014A (en) Multi-stream target-speech detection and channel fusion
JPH09212196A (en) Noise suppressor
JP4975025B2 (en) Multisensory speech enhancement using clean speech prior distribution
JP2014502468A (en) Audio signal generation system and method
JP2011191423A (en) Device and method for recognition of speech
JP6464005B2 (en) Noise suppression speech recognition apparatus and program thereof
JP6151236B2 (en) Noise suppression device, method and program thereof
JP6374936B2 (en) Speech recognition method, speech recognition apparatus, and program
JP2015019124A (en) Sound processing device, sound processing method, and sound processing program
JP2011203700A (en) Sound discrimination device
CN110992967A (en) Voice signal processing method and device, hearing aid and storage medium
CN112309417A (en) Wind noise suppression audio signal processing method, device, system and readable medium
KR20220022286A (en) Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder
JP2006349723A (en) Acoustic model creating device, method, and program, speech recognition device, method, and program, and recording medium
JP6265903B2 (en) Signal noise attenuation
JP6106618B2 (en) Speech section detection device, speech recognition device, method thereof, and program
JP4891805B2 (en) Reverberation removal apparatus, dereverberation method, dereverberation program, recording medium
JP2005258158A (en) Noise removing device
JP4098647B2 (en) Acoustic signal dereverberation method and apparatus, acoustic signal dereverberation program, and recording medium recording the program
KR101610708B1 (en) Voice recognition apparatus and method

Legal Events

Date Code Title Description
A621 Written request for application examination

Free format text: JAPANESE INTERMEDIATE CODE: A621

Effective date: 20160222

A977 Report on retrieval

Free format text: JAPANESE INTERMEDIATE CODE: A971007

Effective date: 20170223

TRDD Decision of grant or rejection written
A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

Effective date: 20170228

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20170306

R150 Certificate of patent or registration of utility model

Ref document number: 6106618

Country of ref document: JP

Free format text: JAPANESE INTERMEDIATE CODE: R150