JP2015210423A

JP2015210423A - Specific voice suppressor, specific voice suppression method and program

Info

Publication number: JP2015210423A
Application number: JP2014092670A
Authority: JP
Inventors: 淳司渡邊; Junji Watanabe; 定男廣谷; Sadao Hiroya
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-04-28
Filing date: 2014-04-28
Publication date: 2015-11-24
Anticipated expiration: 2034-04-28
Also published as: JP6169526B2

Abstract

PROBLEM TO BE SOLVED: To provide a specific voice suppressor that suppresses the voice of a specific speaker in a mixed voice signal, together with a specific voice suppression method and a program. SOLUTION: A specific voice suppressor comprises: a sound source separation unit which, where i is an element of {1,...,M}, generates, from a voice signal including the voices of M speakers, an estimated value S_i of the voice of speaker i and a power parameter P_{1,i} corresponding to the voice power of speaker i; a loud voice determination unit which, where j is an element of {1,...,M} other than i, corrects the voice power P_{2,i} of speaker i specified by the power parameter P_{1,i} using the distance d(U, L_{2,i}) between speaker i and the microphone used to record the voice signal, thereby generating a corrected voice power P_{3,i}, and uses the power P_{3,i} to calculate a loudness degree E_i representing how loud speaker i is relative to the other speakers j; and a mixed voice signal generation unit which combines the estimated values S_{i_3} other than the estimated values S_{i_2} whose loudness degree E_{i_2} is equal to or greater than a threshold A, to generate a mixed voice signal.

Description

The present invention relates to a technique for suppressing the voice of a specific speaker in a speech signal that includes the voices of M speakers.

Patent Document 1 is known as a conventional technique for detecting speech segments corresponding to anger in speech data. In Patent Document 1, training data is used to learn the relationship between speech features and the degree of emotional expression, producing a codebook that associates each speech feature with an emotion expression probability. The codebook is then searched using speech features extracted from the input speech data to obtain the emotion expression probability of those features, and it is determined whether or not the segment corresponds to anger.

Patent Document 1: JP 2005-345496 A

However, the conventional technique does not handle speech signals that include the voices of multiple speakers. When the speech signal includes the voices of multiple speakers (hereinafter also referred to as a mixed speech signal), the conventional technique classifies emotion based on speech features corresponding to the mixed speech signal. Consequently, it cannot extract only the angry speech segments of a specific person contained in the mixed speech signal.

An object of the present invention is to provide a specific speech suppression apparatus, a specific speech suppression method, and a program that suppress the voice of a specific speaker in a mixed speech signal.

To solve the above problem, according to one aspect of the present invention, a specific speech suppression apparatus includes: a sound source separation unit that, with i ∈ {1, …, M}, generates, from a speech signal including the voices of M speakers, an estimated value S_i of the voice of speaker i and a power parameter P_{1,i} corresponding to the power of the voice of speaker i; a loud voice determination unit that, with j ∈ {1, …, M} \ i, corrects the power P_{2,i} of the voice of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to pick up the speech signal, to generate a corrected power P_{3,i}, and uses the power P_{3,i} to calculate a loudness E_i representing the degree to which speaker i is loud relative to the other speakers j; and a mixed signal generation unit that generates a mixed speech signal by combining the estimated values S_{i_3} that remain after excluding the estimated values S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A.

To solve the above problem, according to another aspect of the present invention, a specific speech suppression apparatus includes: a sound source separation unit that, with i ∈ {1, …, M}, generates, from a speech signal including the voices of M speakers, an estimated value S_i of the voice of speaker i and a power parameter P_{1,i} corresponding to the power of the voice of speaker i; a loud voice determination unit that, with j ∈ {1, …, M} \ i, corrects the power P_{2,i} of the voice of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to pick up the speech signal, to generate a corrected power P_{3,i}, and uses the power P_{3,i} to calculate a loudness E_i representing the degree to which speaker i is loud relative to the other speakers j; a delta feature calculation unit that generates a vocal tract spectrum v_{i_2} of the estimated value S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A, and calculates a delta feature Δv_{i_2} from the vocal tract spectrum v_{i_2}; and a mixed signal generation unit that generates a mixed signal by combining the estimated values S_{i_4} that remain after excluding the estimated values S_{i_3} corresponding to a delta feature Δv_{i_3} for which the interval where the delta feature is approximately 0 exceeds a threshold B.

To solve the above problem, according to another aspect of the present invention, a specific speech suppression method includes: a sound source separation step of generating, with i ∈ {1, …, M}, from a speech signal including the voices of M speakers, an estimated value S_i of the voice of speaker i and a power parameter P_{1,i} corresponding to the power of the voice of speaker i; a loud voice determination step of, with j ∈ {1, …, M} \ i, correcting the power P_{2,i} of the voice of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to pick up the speech signal, to generate a corrected power P_{3,i}, and using the power P_{3,i} to calculate a loudness E_i representing the degree to which speaker i is loud relative to the other speakers j; and a mixed signal generation step of generating a mixed speech signal by combining the estimated values S_{i_3} that remain after excluding the estimated values S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A.

To solve the above problem, according to another aspect of the present invention, a specific speech suppression method includes: a sound source separation step of generating, with i ∈ {1, …, M}, from a speech signal including the voices of M speakers, an estimated value S_i of the voice of speaker i and a power parameter P_{1,i} corresponding to the power of the voice of speaker i; a loud voice determination step of, with j ∈ {1, …, M} \ i, correcting the power P_{2,i} of the voice of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to pick up the speech signal, to generate a corrected power P_{3,i}, and using the power P_{3,i} to calculate a loudness E_i representing the degree to which speaker i is loud relative to the other speakers j; a delta feature calculation step of generating a vocal tract spectrum v_{i_2} of the estimated value S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A, and calculating a delta feature Δv_{i_2} from the vocal tract spectrum v_{i_2}; and a mixed signal generation step of generating a mixed signal by combining the estimated values S_{i_4} that remain after excluding the estimated values S_{i_3} corresponding to a delta feature Δv_{i_3} for which the interval where the delta feature is approximately 0 exceeds a threshold B.

According to the present invention, the voice of a specific speaker can be suppressed in a mixed speech signal.

FIG. 1 is a functional block diagram of the specific speech suppression apparatus according to the first embodiment.
FIG. 2 shows an example of the processing flow of the specific speech suppression apparatus according to the first embodiment.
FIG. 3 is a functional block diagram of the specific speech suppression apparatus according to the second embodiment.
FIG. 4 shows an example of the processing flow of the specific speech suppression apparatus according to the second embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and redundant description is omitted.

<Specific Speech Suppression Apparatus According to the First Embodiment>
FIG. 1 is a functional block diagram of the specific speech suppression apparatus 100 according to the first embodiment, and FIG. 2 shows an example of its processing flow.

The specific speech suppression apparatus 100 includes a sound source separation unit 110, a loud voice determination unit 120, and a mixed signal generation unit 140.

The specific speech suppression apparatus 100 receives a mixed speech signal X(t) that includes the voices of M speakers, generates a mixed speech signal ^X(t) in which the voice of a specific speaker is suppressed, and outputs it. Here, t is an index representing time.

The input mixed speech signal X(t) may be a speech signal picked up in real time, or a prerecorded speech signal such as that of a television program or sports video.

<Sound Source Separation Unit 110>
The sound source separation unit 110 receives the mixed speech signal X(t) and, using a conventional sound source separation technique, calculates (s110) and outputs, for each speaker i (sound source), an estimated value S_i(t) of the speech signal (source signal), a power parameter P_{1,i}(t) corresponding to the power of the voice of speaker i, and a sound source position parameter L_{1,i}(t) corresponding to the position of speaker i. Here, i is an index identifying a speaker, with i ∈ {1, …, M}. The technique of Reference 1, for example, can be used as the conventional sound source separation technique.
(Reference 1) JP 2012-173592 A

<Loud Voice Determination Unit 120>
The loud voice determination unit 120 receives the M power parameters P_{1,i}(t) and the M sound source position parameters L_{1,i}(t), determines whether or not the voice of each speaker i is loud (s120), and outputs the set of indices i_2(t) of the speakers who are speaking loudly. The determination is performed for the voices of all speakers i.

For example, the loud voice determination unit 120 includes a distance calculation unit 121, a power correction unit 122, a loudness calculation unit 123, and a first determination unit 124.

(Distance Calculation Unit 121)
The distance calculation unit 121 receives the M sound source position parameters L_{1,i}(t) and uses each parameter L_{1,i}(t) to identify the position L_{2,i} of speaker i. Using the position U of the microphone that picked up the mixed speech signal X(t) and the position L_{2,i}, it calculates (s121) and outputs the distance d(U, L_{2,i}, t) between the microphone and speaker i. The sound source position parameter L_{1,i}(t) may be any parameter from which the position L_{2,i} of speaker i can be identified, and may be the position L_{2,i} itself. The Euclidean distance, for example, may be used as the distance d(U, L_{2,i}, t). The microphone position U is assumed to be given in advance by a user or the like.
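The Euclidean-distance option mentioned above can be illustrated with a minimal sketch. The function name `mic_speaker_distance` and the three-dimensional coordinates are hypothetical and chosen only for illustration; the text does not specify a coordinate system.

```python
import math

def mic_speaker_distance(mic_pos, speaker_pos):
    """Euclidean distance d(U, L_{2,i}) between microphone position U
    and speaker position L_{2,i}, both given as coordinate tuples."""
    return math.dist(mic_pos, speaker_pos)

# Hypothetical layout: microphone at the origin, two speakers.
mic = (0.0, 0.0, 0.0)
speakers = [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)]
distances = [mic_speaker_distance(mic, p) for p in speakers]
```

If the speakers do not move, these distances need to be computed only once and can then be reused for every time index t.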

It may be assumed that the positional relationship between the microphone and speaker i does not depend on time. In that case, the distance may be given in advance and the distance calculation unit 121 need not be provided; the sound source separation unit 110 may then also be configured not to obtain the M sound source position parameters L_{1,i}(t). Alternatively, the distance may be calculated only once and that distance reused in the repeated processing described below.

(Power Correction Unit 122)
The power correction unit 122 receives the M power parameters P_{1,i}(t) and the M distances d(U, L_{2,i}, t). It uses each power parameter P_{1,i}(t) to identify the power P_{2,i}(t) of the voice of speaker i, corrects the power P_{2,i}(t) using the distance d(U, L_{2,i}, t), and generates (s122) and outputs the corrected power P_{3,i}(t).

When a speaker i_A at distance A from the microphone and a speaker i_B at distance B (> A) from the microphone speak at the same volume, the power P_{2,i_A}(t) of the nearer speaker is larger than the power P_{2,i_B}(t) of the farther speaker. The distance d(U, L_{2,i}, t) is used to correct for this. Note that an underscore "_" within a subscript indicates that the character immediately after the underscore is a subscript of the character immediately before it. That is, X_Y within a subscript denotes X with subscript Y.

For example, the power of the voice is corrected so that it becomes larger as the distance d(U, L_{2,i}, t) becomes larger; in other words, so that the corrected power increases monotonically with the distance d(U, L_{2,i}, t). For example, the power P_{2,i}(t) is corrected by normalizing it with respect to the distance d(U, L_{2,i}, t) according to the following equation:
P_{3,i}(t) = P_{2,i}(t) / d'(U, L_{2,i}, t)
where d'(U, L_{2,i}, t) is a function that decreases monotonically as the distance d(U, L_{2,i}, t) increases. It is known that the sound power decreases by about 6 dB when the distance doubles. The power P_{3,i}(t) may be obtained based on this characteristic.
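One possible choice of d' consistent with the 6 dB-per-doubling characteristic is an inverse-square factor. The sketch below assumes d' = (d_ref / d)^2 with a hypothetical reference distance `ref_distance`; the text only requires that d' decrease monotonically with distance, so this is one illustrative option rather than the prescribed function.

```python
import math

def attenuation(distance, ref_distance=1.0):
    """Hypothetical d'(.): inverse-square factor that decreases
    monotonically as the distance grows (assumption, not from the text)."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return (ref_distance / distance) ** 2

def correct_power(p2, distance, ref_distance=1.0):
    """P_3 = P_2 / d': the power the voice would have at ref_distance."""
    return p2 / attenuation(distance, ref_distance)

def to_db(power):
    """Convert a linear power value to decibels."""
    return 10.0 * math.log10(power)

# The same voice observed at 1 m and at 2 m: under the inverse-square
# assumption the observed power falls by a factor of 4 (about 6 dB).
p_near, p_far = 1.0, 0.25
corrected_near = correct_power(p_near, 1.0)
corrected_far = correct_power(p_far, 2.0)
drop_db = to_db(p_near) - to_db(p_far)
```

After correction, both observations map to the same power, so two speakers talking equally loudly are no longer distinguished by how far they stand from the microphone.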

(Loudness Calculation Unit 123)
The loudness calculation unit 123 receives the M powers P_{3,i}(t), uses them to calculate (s123) a loudness E_i(t) representing the degree to which speaker i is loud relative to the other speakers j, and outputs it. For example, the loudness E_i(t) is calculated by the following equation.
[Equation not reproduced in this text.]

(First Determination Unit 124)
The first determination unit 124 receives the M loudness values E_i(t), compares each loudness E_i(t) with a threshold A, and outputs the set of indices i_2(t) corresponding to the loudness values E_{i_2}(t) that are equal to or greater than the threshold A.

This processing calculates the difference between the corrected power P_{3,i}(t) of one voice and the corrected powers P_{3,j}(t) of the remaining voices, and identifies each speaker i whose power P_{3,i}(t) exceeds the remaining powers P_{3,j}(t) by the predetermined threshold A or more. A speech signal whose power is 30 dB or more greater than that of the other speakers' speech signals (Reference 2) is likely to be a harsh, unpleasant sound such as a scream. Exploiting this, the first determination unit 124 identifies the index of any speaker whose speech signal is markedly louder than the other speech signals. For example, the threshold A is set to 30 dB.
(Reference 2) Nanjo, Kunimatsu, Kawano, Nakayama, Nishiura, "Fundamental study of screams for acoustic security systems", 2008 Acoustical Society Spring Meeting, 1-Q-17, 2008.
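The loudness comparison and thresholding can be sketched as follows. Because the loudness equation itself is not available in this text, the definition used here is an assumption: E_i is taken to be the level of speaker i in dB above the loudest of the other speakers j ≠ i. The function names are hypothetical.

```python
import math

def loudness_db(corrected_powers):
    """Assumed E_i: level of speaker i in dB above the loudest other
    speaker j != i. The actual equation is not given in the text."""
    if len(corrected_powers) < 2:
        raise ValueError("at least two speakers are required")
    levels = [10.0 * math.log10(p) for p in corrected_powers]
    result = []
    for i, level in enumerate(levels):
        others = levels[:i] + levels[i + 1:]
        result.append(level - max(others))
    return result

def loud_speaker_indices(corrected_powers, threshold_a_db=30.0):
    """Indices i_2 whose assumed loudness E_i is at least threshold A."""
    scores = loudness_db(corrected_powers)
    return {i for i, e in enumerate(scores) if e >= threshold_a_db}

# Speaker 2 is roughly 34 dB above the next loudest speaker.
powers = [1.0, 2.0, 5000.0]
loud = loud_speaker_indices(powers)
quiet = loud_speaker_indices([1.0, 2.0, 3.0])
```

Comparing against the loudest other speaker is the strictest reading of "greater than the remaining powers by A or more"; comparing against the mean of the others would be a looser alternative.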

<Mixed Signal Generation Unit 140>
The mixed signal generation unit 140 receives the set of indices i_2(t) and the M estimated values S_i(t), excludes from the M estimated values S_i(t) the estimated values S_{i_2}(t) of the speech signals corresponding to the indices i_2(t), combines the estimated values S_{i_3}(t) of the remaining speech signals to generate (s140) the mixed speech signal ^X(t), and outputs it.
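The exclusion-and-combination step can be sketched as below. The text does not say how the remaining estimates are combined, so sample-wise summation is assumed here, and the per-speaker estimates are represented as plain lists of samples of equal length.

```python
def remix_without(estimates, excluded):
    """Combine the per-speaker estimates S_i(t) by sample-wise summation
    (an assumed combination rule), skipping the excluded indices."""
    kept = [s for i, s in enumerate(estimates) if i not in excluded]
    if not kept:
        # Every speaker was excluded: return silence of the same length.
        return [0.0] * len(estimates[0])
    return [sum(samples) for samples in zip(*kept)]

# Three separated speakers, three samples each; speaker 2 is suppressed.
estimates = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.1], [0.9, 0.9, 0.9]]
mixed = remix_without(estimates, excluded={2})
```

In the second embodiment the same step would apply with the indices i_3(t) in place of i_2(t).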

<Effect>
With this configuration, the voice of a specific speaker can be suppressed in a mixed speech signal, and a mixed speech signal can be generated in which voices that give the listener an unpleasant impression (for example, loud voices such as heckling and screams) are suppressed.

<Modification>
In this embodiment, the sound source separation unit 110 calculates the sound source position parameter L_{1,i}(t) corresponding to the position of speaker i from the mixed speech signal X(t), and the distance calculation unit 121 calculates the distance d(U, L_{2,i}, t) between the microphone and speaker i using the sound source position parameter L_{1,i}(t) and the microphone position U given in advance. However, some known sound source separation methods can calculate the distance d(U, L_{2,i}, t) between the microphone and speaker i from the mixed speech signal X(t) without being given the microphone position U in advance. Using such a method, the sound source separation unit may receive the mixed speech signal X(t) and calculate and output, for each speaker i (sound source), the estimated value S_i(t) of the speech signal (source signal), the power parameter P_{1,i}(t) corresponding to the power of the voice of speaker i, and the distance d(U, L_{2,i}, t) between the microphone and speaker i. In that case, the distance calculation unit 121 need not be provided.

<Specific Speech Suppression Apparatus 200 According to the Second Embodiment>
The description focuses on the differences from the first embodiment.

FIG. 3 is a functional block diagram of the specific speech suppression apparatus 200, and FIG. 4 shows an example of its processing flow.

The specific speech suppression apparatus 200 includes a sound source separation unit 110, a loud voice determination unit 120, a scream determination unit 230, and a mixed signal generation unit 240.

<Scream Determination Unit 230>
The scream determination unit 230 receives the set of indices i_2(t) and the M estimated values S_i(t), determines whether or not the voice of each speaker i_2(t) is a scream (s230), and outputs the indices i_3(t) of the speakers who are screaming. The determination is performed only for the voices of the speakers corresponding to the indices i_2(t), not for all speakers i.

For example, the scream determination unit 230 includes a vocal tract spectrum generation unit 231, a delta feature calculation unit 232, and a second determination unit 233.

(Vocal Tract Spectrum Generation Unit 231)
The vocal tract spectrum generation unit 231 receives the set of indices i_2(t) and the M estimated values S_i(t), generates (s231) the vocal tract spectrum v_{i_2}(t) of the estimated value S_{i_2}(t) of the speech signal corresponding to each index i_2(t), and outputs the set of vocal tract spectra v_{i_2}(t).

(Delta Feature Calculation Unit 232)
The delta feature calculation unit 232 receives the set of vocal tract spectra v_{i_2}(t), uses these values to calculate (s232) the delta features Δv_{i_2}(t), and outputs the set of delta features Δv_{i_2}(t). For example, the delta feature Δv_{i_2}(t) is calculated by the following equation (see Reference 3).
[Equation not reproduced in this text.]

Here, w_1 is set so that 2w_1 + 1 corresponds to, for example, 50 milliseconds.
(Reference 3) FURUI S., "Speaker-independent isolated word recognition using dynamic features of speech spectrum", IEEE Trans. Acoust., Speech and Signal Processing ASSP-34(1), 1986, pp. 52-59.
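A delta feature of this kind is commonly computed as a regression slope over a window of 2w + 1 frames. The sketch below uses that common form, Δv(t) = Σ_k k·v(t+k) / Σ_k k² for k = −w, …, w; it is assumed rather than taken from this text, since the equation is not available here. Clamping at the sequence edges and the 10 ms frame shift mentioned in the comment are likewise assumptions.

```python
def delta_features(frames, w):
    """Regression-slope delta over a window of 2*w + 1 frames.

    frames: list of feature vectors (lists of floats), one per frame.
    Frames beyond either end are replaced by the nearest valid frame.
    """
    if w < 1:
        raise ValueError("w must be at least 1")
    n = len(frames)
    denom = sum(k * k for k in range(-w, w + 1))
    deltas = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = sum(
                k * frames[min(max(t + k, 0), n - 1)][d]
                for k in range(-w, w + 1)
            )
            row.append(num / denom)
        deltas.append(row)
    return deltas

# With a 10 ms frame shift (assumed), w = 2 gives a 5-frame, 50 ms window.
ramp = [[float(t)] for t in range(7)]   # feature rising by 1 per frame
flat = [[2.0] for _ in range(7)]        # sustained, unchanging feature
ramp_delta = delta_features(ramp, w=2)
flat_delta = delta_features(flat, w=2)
```

A steadily rising feature yields a delta equal to its slope, while a sustained feature yields a delta of zero, which is the property the second determination unit relies on.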

(Second Determination Unit 233)
The second determination unit 233 receives the set of delta features Δv_{i_2}(t), determines whether or not the interval over which each delta feature Δv_{i_2}(t) is approximately 0 exceeds a threshold B, and outputs the set of indices i_3(t) corresponding to the delta features Δv_{i_3}(t) for which the threshold B is exceeded.

It may be determined whether or not the interval over which the absolute value of the delta feature Δv_{i_2}(t) is at most a sufficiently small positive value ε exceeds the threshold B; alternatively, it may be determined whether or not the following value is at most a sufficiently small positive value ε.
[Equation not reproduced in this text.]

Here, w_2 is set so that 2w_2 + 1 corresponds to, for example, the threshold B (for example, 300 milliseconds).

The delta feature represents the change in sound over each predetermined time interval; the larger its value, the larger the change in sound. A delta feature of approximately 0 indicates no change in sound, which is assumed to mean either that no voice is being produced (silence) or that a sound is being sustained. In this embodiment, however, only speech signals found by the first determination unit 124 to have high power are analyzed by the scream determination unit 230, so silence is ruled out. The scream determination unit 230 therefore determines whether a sound is being sustained, and extracts the indices of the speakers whose speech signals show a strong tendency to sustain sounds.

Screams generally tend to sustain vowels. Taking the length of long vowels in a speech database as a reference, a sound is almost never sustained longer than that in a calm state, so this processing can extract speech signals that are likely to be screams.
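The first variant of the second determination, checking whether the near-zero interval exceeds the threshold B, can be sketched as a run-length test. The sketch assumes one scalar delta magnitude per frame, a 10 ms frame shift, and ε = 0.01; none of these values comes from the text, and the function names are hypothetical. Only the 300 ms example for the threshold B is from the text.

```python
def longest_flat_run_ms(delta_magnitudes, epsilon, frame_shift_ms):
    """Duration in ms of the longest run of consecutive frames whose
    delta magnitude is at most epsilon."""
    longest = current = 0
    for value in delta_magnitudes:
        current = current + 1 if abs(value) <= epsilon else 0
        longest = max(longest, current)
    return longest * frame_shift_ms

def is_sustained(delta_magnitudes, epsilon=0.01,
                 frame_shift_ms=10.0, threshold_b_ms=300.0):
    """True if the near-zero interval exceeds threshold B."""
    run = longest_flat_run_ms(delta_magnitudes, epsilon, frame_shift_ms)
    return run > threshold_b_ms

# 40 flat frames (400 ms) versus repeated short flat stretches (100 ms).
scream_like = [0.5] * 5 + [0.001] * 40 + [0.4] * 5
speech_like = ([0.3] * 5 + [0.002] * 10) * 4
```

Here `is_sustained(scream_like)` flags the long sustained stretch, while ordinary speech with short vowels stays below the threshold.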

<Mixed Signal Generation Unit 240>
The mixed signal generation unit 240 receives the set of indices i_3(t) and the M estimated values S_i(t), excludes the estimated values S_{i_3}(t) of the speech signals corresponding to the indices i_3(t), combines the estimated values S_{i_4}(t) of the remaining speech signals to generate (s240) the mixed speech signal ^X(t), and outputs it.

<Effect>
With this configuration, only loud voices, in particular harsh voices such as heckling and screams that carry no information important to the viewer, can be suppressed accurately.

<Other Modifications>
The present invention is not limited to the above embodiments and modifications. For example, the various kinds of processing described above need not be executed in time series in the order described; they may be executed in parallel or individually according to the processing capability of the apparatus executing them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and Recording Medium>
The various processing functions of each apparatus described in the above embodiments and modifications may be realized by a computer. In that case, the processing content of the functions that each apparatus should have is described by a program. Executing this program on a computer realizes the various processing functions of each apparatus on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on a portable recording medium or transferred from a server computer in its own storage unit. When executing the processing, the computer reads the program from its own storage unit and performs processing according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and perform processing according to it. The computer may also perform processing according to the received program sequentially, each time a program is transferred to it from the server computer. Alternatively, the processing described above may be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized solely through execution instructions and retrieval of results. The term "program" here includes information that is provided for processing by a computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device is described as being configured by executing a predetermined program on a computer, at least part of the processing may instead be realized in hardware.

The present invention can be applied, for example, to the audio output of a television. When watching a program such as a live broadcast, unpleasant sounds such as heckling and screaming can be suppressed so that the program can be viewed comfortably.

For example, the invention can be implemented on the receiving side of a television broadcast: a specific voice suppression device is built into the television set, and heckling and screaming are suppressed at the receiver. The receiving side is assumed to be able to obtain, in advance and as data, either the microphone position U or the distance between the microphone and each speaker. If the receiving side can obtain the microphone position U as data, it performs the processing described in the first embodiment. If the receiving side can obtain the distance d(U, L_{2,i}, t) between the microphone and the speaker as data, the server on the broadcasting station side includes the sound source separation unit and the distance calculation unit, and the receiving side includes the remaining components.
In that arrangement, the server on the broadcasting station side runs the sound source separation unit and the distance calculation unit on the collected speech signal (the mixed speech signal) to obtain the estimated value S_i(t) of the speech signal of speaker i, the sound source power parameter P_{1,i}(t), and the distance d(U, L_{2,i}, t) between the microphone and speaker i, and distributes these three quantities to the receiving side. The receiving side uses them to perform the processing from the loudness determination unit onward, which yields an effect equivalent to that of the first embodiment. As described in the modification of the first embodiment, if the sound source separation unit can calculate the distance d(U, L_{2,i}, t) from the mixed speech signal X(t), the server on the broadcasting station side need only distribute at least the mixed speech signal X(t), and the receiving side performs the processing described in the first embodiment and its modification.
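The division of work between the broadcasting station and the receiver can be sketched as a per-speaker record that the server distributes and the receiver consumes. This is only an illustrative sketch: the names `SpeakerRecord`, `receiver_mix`, and `is_loud` are hypothetical, each quantity is reduced to a single frame or a short sample list, and the loudness decision is passed in as a callable because this passage does not define it.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SpeakerRecord:
    """Per-speaker data distributed by the server (hypothetical layout)."""
    estimate: List[float]  # S_i(t): separated speech estimate, one value per sample
    power_param: float     # P_{1,i}(t): sound source power parameter (one frame)
    distance: float        # d(U, L_{2,i}, t): microphone-to-speaker distance


def receiver_mix(
    records: Sequence[SpeakerRecord],
    is_loud: Callable[[int, Sequence[SpeakerRecord]], bool],
) -> List[float]:
    """Receiver side: sum the estimates of every speaker not judged loud.

    Assumes all estimates have the same length.
    """
    if not records:
        return []
    n = len(records[0].estimate)
    mixed = [0.0] * n
    for i, rec in enumerate(records):
        if is_loud(i, records):
            continue  # suppress this speaker
        for t in range(n):
            mixed[t] += rec.estimate[t]
    return mixed
```

Passing the decision as a callable keeps the record format independent of how loudness is judged, so the same receiver code would serve either variant of the determination.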

The invention can also be implemented on the broadcasting station side. The mixed speech signal output from the microphone is input to the specific voice suppression device, and the broadcasting station transmits the output signal of the device. In this case, heckling and screaming are suppressed at the point of distribution. The microphone position U may be entered by the user or obtained from camera video or the like. The speaker position need not be obtained by the sound source separation unit 110; camera video or the like may be used instead.

The invention can also be applied, for example, to a hearing aid. In that case the specific voice suppression device is incorporated in the hearing aid, the microphone position U is the position of the hearing aid, and the distance d(U, L_{2,i}, t) is determined by the position of speaker i relative to the hearing aid.

Claims

A specific voice suppression device comprising:
a sound source separation unit that, where i ∈ {1, ..., M}, generates, from a speech signal including the speech of M speakers, an estimated value S_i of the speech of speaker i and a power parameter P_{1,i} corresponding to the speech power of speaker i;
a loudness determination unit that, where j ∈ {1, ..., M} \ {i}, corrects the power P_{2,i} of the speech of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to collect the speech signal, to generate a corrected speech power P_{3,i}, and uses the power P_{3,i} to calculate a loudness E_i representing how loud speaker i is relative to the other speakers j; and
a mixed signal generation unit that generates a mixed speech signal by combining the speech estimates S_{i_3} other than the estimated value S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A.

The specific voice suppression device according to claim 1,
wherein the threshold A is 30 dB.
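As a rough illustration of the loudness determination and mixing recited above, the sketch below works through one frame. The claims do not state the correction formula or the definition of E_i, so both are assumptions here: the correction adds 20·log10(d/d_ref) dB (inverse-square spreading back to a reference distance), P_{2,i} is taken to be the given power in dB, and E_i is the corrected power of speaker i minus the mean corrected power of the other speakers j. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

THRESHOLD_A_DB = 30.0  # threshold A as recited in the claim


def corrected_power_db(power_db: float, distance_m: float, ref_m: float = 1.0) -> float:
    """P_{3,i}: ASSUMED inverse-square correction to a reference distance."""
    return power_db + 20.0 * math.log10(distance_m / ref_m)


def loudness_db(p3: Sequence[float], i: int) -> float:
    """E_i: ASSUMED to be P_{3,i} minus the mean P_{3,j} over j != i (needs M >= 2)."""
    others = [p for j, p in enumerate(p3) if j != i]
    return p3[i] - sum(others) / len(others)


def mix_without_loud(
    estimates: Sequence[Sequence[float]],
    powers_db: Sequence[float],
    distances_m: Sequence[float],
) -> Tuple[List[float], List[int]]:
    """Combine the estimates of all speakers whose loudness is below threshold A."""
    p3 = [corrected_power_db(p, d) for p, d in zip(powers_db, distances_m)]
    keep = [i for i in range(len(p3)) if loudness_db(p3, i) < THRESHOLD_A_DB]
    n = len(estimates[0])
    mixed = [sum(estimates[i][t] for i in keep) for t in range(n)]
    return mixed, keep
```

For example, with three speakers at the same distance and powers of 60, 62, and 95 dB, only the third has a loudness of 30 dB or more under these assumptions, so the mix contains the first two estimates only.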

A specific voice suppression device comprising:
a sound source separation unit that, where i ∈ {1, ..., M}, generates, from a speech signal including the speech of M speakers, an estimated value S_i of the speech of speaker i and a power parameter P_{1,i} corresponding to the speech power of speaker i;
a loudness determination unit that, where j ∈ {1, ..., M} \ {i}, corrects the power P_{2,i} of the speech of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to collect the speech signal, to generate a corrected speech power P_{3,i}, and uses the power P_{3,i} to calculate a loudness E_i representing how loud speaker i is relative to the other speakers j;
a delta feature calculation unit that generates a vocal tract spectrum v_{i_2} of the estimated value S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A, and calculates a delta feature Δv_{i_2} from the vocal tract spectrum v_{i_2}; and
a mixed signal generation unit that generates a mixed speech signal by combining the speech estimates S_{i_4} other than the estimated value S_{i_3} corresponding to a delta feature Δv_{i_3} for which the interval during which it is substantially zero exceeds a threshold B.

The specific voice suppression device according to claim 3,
wherein the threshold A is 30 dB and the threshold B is 300 milliseconds.
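The delta feature test recited above can be sketched as a search for the longest stretch of frames in which the vocal tract spectrum barely changes. This is a minimal sketch under stated assumptions: the vocal tract spectra are taken as already computed, one vector per frame; the delta feature is a plain first difference between consecutive frames; and the 10 ms frame shift, the tolerance `eps` for "substantially zero", and the function names are all hypothetical.

```python
from typing import Sequence

THRESHOLD_B_MS = 300.0  # threshold B as recited in the claim


def longest_static_run_ms(
    frames: Sequence[Sequence[float]],
    frame_shift_ms: float = 10.0,  # ASSUMED frame shift
    eps: float = 1e-3,             # ASSUMED tolerance for "substantially zero"
) -> float:
    """Duration of the longest run of consecutive near-zero deltas."""
    longest = 0
    current = 0
    for prev, cur in zip(frames, frames[1:]):
        # Delta feature: largest absolute change in any spectral coefficient.
        delta = max(abs(c - p) for p, c in zip(prev, cur))
        if delta < eps:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest * frame_shift_ms


def is_sustained(frames: Sequence[Sequence[float]]) -> bool:
    """True when the near-zero-delta interval exceeds threshold B."""
    return longest_static_run_ms(frames) > THRESHOLD_B_MS
```

Under these assumptions, a speaker whose spectrum stays fixed for 40 frames gives a 390 ms static interval and would be excluded from the mix, while a speaker whose spectrum changes at every frame would be kept.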

A specific voice suppression method comprising:
a sound source separation step of, where i ∈ {1, ..., M}, generating, from a speech signal including the speech of M speakers, an estimated value S_i of the speech of speaker i and a power parameter P_{1,i} corresponding to the speech power of speaker i;
a loudness determination step of, where j ∈ {1, ..., M} \ {i}, correcting the power P_{2,i} of the speech of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to collect the speech signal, to generate a corrected speech power P_{3,i}, and using the power P_{3,i} to calculate a loudness E_i representing how loud speaker i is relative to the other speakers j; and
a mixed signal generation step of generating a mixed speech signal by combining the speech estimates S_{i_3} other than the estimated value S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A.

A specific voice suppression method comprising:
a sound source separation step of, where i ∈ {1, ..., M}, generating, from a speech signal including the speech of M speakers, an estimated value S_i of the speech of speaker i and a power parameter P_{1,i} corresponding to the speech power of speaker i;
a loudness determination step of, where j ∈ {1, ..., M} \ {i}, correcting the power P_{2,i} of the speech of speaker i specified by the power parameter P_{1,i}, using the distance d(U, L_{2,i}) between speaker i and the microphone used to collect the speech signal, to generate a corrected speech power P_{3,i}, and using the power P_{3,i} to calculate a loudness E_i representing how loud speaker i is relative to the other speakers j;
a delta feature calculation step of generating a vocal tract spectrum v_{i_2} of the estimated value S_{i_2} corresponding to a loudness E_{i_2} equal to or greater than a threshold A, and calculating a delta feature Δv_{i_2} from the vocal tract spectrum v_{i_2}; and
a mixed signal generation step of generating a mixed speech signal by combining the speech estimates S_{i_4} other than the estimated value S_{i_3} corresponding to a delta feature Δv_{i_3} for which the interval during which it is substantially zero exceeds a threshold B.

A program for causing a computer to function as the specific voice suppression device according to any one of claims 1 to 4.