JP7216348B2 - Speech processing device, speech processing method, and speech processing program

Info

Publication number: JP7216348B2
Application number: JP2021029416A
Authority: JP
Inventors: 仁山本; 孝文越仲
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2023-02-01
Anticipated expiration: 2037-03-07
Also published as: JP2021092809A

Description

The present invention relates to a speech processing device, a speech processing method, and a speech processing program.

Speech processing devices are known that calculate, from a speech signal, a speaker feature representing the individuality needed to identify the speaker who uttered the speech. Speaker recognition devices that use this speaker feature to estimate the speaker of a speech signal are also known.

To identify a speaker, a speaker recognition device using this type of speech processing device evaluates the similarity between a first speaker feature extracted from a first speech signal and a second speaker feature extracted from a second speech signal. The speaker recognition device then determines, based on the result of the similarity evaluation, whether the two speech signals were uttered by the same speaker.
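The comparison described above can be sketched as follows. This is a minimal illustration only: the text does not say which similarity measure is used, so cosine similarity, the function names, and the threshold value of 0.5 are all assumptions made for this example.

```python
import numpy as np

def cosine_similarity(f1, f2):
    """Cosine of the angle between two speaker feature vectors (assumed measure)."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def same_speaker(f1, f2, threshold=0.5):
    """Judge the two signals to share a speaker when the similarity
    reaches the threshold (the threshold is a hypothetical value)."""
    return cosine_similarity(f1, f2) >= threshold
```

In practice the threshold would be tuned on held-out data, and a trained scoring back end could replace the cosine measure.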

Non-Patent Document 1 describes a technique for extracting speaker features from a speech signal. The technique uses a speech model to calculate speech statistics of the speech signal. It then processes the speech statistics using factor analysis and expresses the result as a speaker feature vector with a predetermined number of elements. In other words, Non-Patent Document 1 uses the speaker feature vector as the speaker feature representing the speaker's individuality.

Najim Dehak, Patrick Kenny, Reda Dehak, Pierre Dumouchel, and Pierre Ouellet, “Front-End Factor Analysis for Speaker Verification,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 19, No. 4, pp. 788-798, 2011.

However, the technique described in Non-Patent Document 1 has the problem that speaker recognition using the extracted speaker features is not sufficiently accurate.

The technique described in Non-Patent Document 1 performs predetermined statistical processing on the speech signal input to a speaker feature extraction device and calculates a speaker feature vector. Specifically, it calculates the speaker feature vector by applying uniform statistical processing to the entire input speech signal. As a result, even when a partial section of the speech signal contains a signal that is unsuitable as a basis for calculating the speaker's individuality, the technique still calculates the speaker feature vector from the entire speech signal, which may impair the accuracy of speaker recognition. Accuracy may be impaired, for example, when a partial section of the speech signal contains unclear utterances by the speaker, sounds other than speech such as the speaker's coughing or laughter, or noise.

The present invention has been made in view of the above problem, and its object is to provide a speech processing device, a speech processing method, and a speech processing program that improve the accuracy of speaker recognition.

A speech processing device according to a first aspect of the present invention comprises: reception means for accepting input of a plurality of speech signals representing speech; quality estimation means for calculating, for the plurality of speech signals, two types of quality, namely speech for which speaker recognition is correct and speech that causes an error in speaker recognition; information processing means for calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals; and speech statistic calculation means for calculating a speech statistic representing the degree of appearance of the types of sounds contained in the plurality of speech signals. The information processing means calculates the recognition feature based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

A speech processing method according to a second aspect of the present invention comprises: accepting input of a plurality of speech signals representing speech; calculating, for the plurality of speech signals, two types of quality, namely speech for which speaker recognition is correct and speech that causes an error in speaker recognition; calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals; calculating a speech statistic representing the degree of appearance of the types of sounds contained in the plurality of speech signals; and calculating the recognition feature based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

A speech processing program according to a third aspect of the present invention causes a computer to execute: a process of accepting input of a plurality of speech signals representing speech; a process of calculating, for the plurality of speech signals, two types of quality, namely speech for which speaker recognition is correct and speech that causes an error in speaker recognition; a process of calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals; a process of calculating a speech statistic representing the degree of appearance of the types of sounds contained in the plurality of speech signals; and a process of calculating the recognition feature based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

According to the present invention, it is possible to provide a speech processing device, a speech processing method, and a program that improve the accuracy of speaker recognition.

FIG. 1 is a block diagram showing the configuration of a speech processing device according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation flow of the speech processing device according to the first embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a speech processing device according to a second embodiment of the present invention.
FIG. 4 is a flowchart showing the operation flow of the speech processing device according to the second embodiment of the present invention.
FIG. 5 is a block diagram showing the configuration of a speech processing device according to a third embodiment of the present invention.
FIG. 6 is a block diagram showing the configuration of a speech processing device according to another embodiment of the present invention.

Embodiments of the speech processing device and related aspects, and of a speaker feature extraction device, are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated explanation of them may be omitted.

<First embodiment>
FIG. 1 is a block diagram showing the configuration of a speech processing device according to the first embodiment of the present invention.

The speech processing device 100 includes a contribution estimation unit 11 and a speaker feature calculation unit 12.

The contribution estimation unit 11 receives a speech signal representing speech from the outside. Based on the received speech signal, the contribution estimation unit 11 calculates a contribution, which is a numerical value expressing the degree of quality of a partial section of the speech signal.

The speaker feature calculation unit 12 uses the contribution of each partial section of the speech signal, as calculated by the contribution estimation unit 11, as the weight of that partial section, and calculates a recognition feature for recognizing specific attribute information from the speech signal.

Here, the specific attribute information is information indicating, for example, the speaker who uttered the speech signal, the language of the speech signal, the emotional expression contained in the speech signal, or the personality of the speaker estimated from the speech signal.

The operation flow of the speech processing device 100 is described with reference to FIG. 2. FIG. 2 is a flowchart showing the operation flow of the speech processing device according to the first embodiment of the present invention.

First, the contribution estimation unit 11 calculates the contribution of each partial section of the speech signal based on the speech signal received from the outside (step S101). The contribution estimation unit 11 then outputs the calculated contribution of each partial section to the speaker feature calculation unit 12.

Next, the speaker feature calculation unit 12 calculates the recognition feature based on the contribution received from the contribution estimation unit 11 (step S102).

<Second embodiment>
FIG. 3 is a block diagram of a speech processing device 200 according to the second embodiment. The speech processing device 200 includes a contribution estimation unit 11, a speaker feature calculation unit 12, a speech section detection unit 21, and a speech statistic calculation unit 22. The speech processing device 200 may further include a contribution storage unit 23 and a contribution learning unit 24.

The speech section detection unit 21 receives a speech signal from the outside. It detects the speech sections contained in the received speech signal and segments the signal accordingly. The speech section detection unit 21 may segment the speech signal into sections of a fixed length or into sections of different lengths. For example, it may judge a section in which the volume stays below a predetermined value for a certain period to be silence, and treat the portions before and after that section as separate speech sections. The speech section detection unit 21 then outputs the segmented speech signal, which is the result of this segmentation (the processing result of the speech section detection unit 21), to the contribution estimation unit 11 and the speech statistic calculation unit 22. Here, receiving a speech signal means, for example, receiving a speech signal from an external device or another processing device, or being handed the result of speech signal processing by another program. Output means, for example, transmission to an external device or another processing device, or handing the processing result of the speech section detection unit 21 to another program.
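The silence-based segmentation given as an example above might look like the following sketch. The frame length, the RMS volume measure, the threshold, and the minimum silence duration are all assumed values chosen for illustration; the text only requires that the volume stay below a predetermined value for a certain time.

```python
import numpy as np

def segment_by_silence(signal, sample_rate, threshold=0.01,
                       min_silence_sec=0.2, frame_sec=0.01):
    """Return (start, end) sample indices of speech sections.

    A run of frames whose RMS volume stays below `threshold` for at least
    `min_silence_sec` is treated as silence that separates two sections.
    """
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_sec))
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    loud = np.sqrt((frames ** 2).mean(axis=1)) >= threshold
    min_silence = int(round(min_silence_sec / frame_sec))

    segments, start, quiet_run = [], None, 0
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i          # a new speech section begins
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_silence:
                end = i - quiet_run + 1   # frame after the last loud frame
                segments.append((start * frame_len, end * frame_len))
                start, quiet_run = None, 0
    if start is not None:          # close a section that runs to the end
        segments.append((start * frame_len, (n_frames - quiet_run) * frame_len))
    return segments
```

Short pauses below the minimum silence duration stay inside a section, so a brief hesitation does not split an utterance.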

The speech statistic calculation unit 22 receives the segmented speech signal from the speech section detection unit 21. Based on the received segmented speech signal, it calculates speech statistics representing the types of sounds contained in that signal. Here, a type of sound is, for example, a phoneme or word determined by the language, or a group of sounds obtained by clustering speech signals based on similarity. The speech statistic calculation unit 22 then outputs the speech statistics to the speaker feature calculation unit 12. Hereinafter, the speech statistics calculated for a given speech signal are called the speech statistics of that speech signal.

An example of how the speech statistic calculation unit 22 calculates the speech statistics is as follows. Based on the segmented speech signal received from the speech section detection unit 21, the speech statistic calculation unit 22 calculates acoustic features, which are expressed as the result of frequency analysis of the segmented speech signal, and outputs the result. For example, it converts the segmented speech signal into a time series of short-time frames, performs frequency analysis on each frame, and outputs the result as the acoustic features. In this case, it generates, for example, frames of 25 milliseconds every 10 milliseconds as the short-time frame time series. As the acoustic features resulting from frequency analysis, it calculates, for example, frequency filter bank features obtained by fast Fourier transform (FFT) and filter bank processing, or Mel-frequency cepstrum coefficient (MFCC) features obtained by additionally applying a discrete cosine transform.
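The framing and frequency analysis steps can be sketched as below. This is an illustrative simplification: it stops at a log power spectrum and omits the mel filter bank and discrete cosine transform stages, and the Hamming window and the small constant added before the logarithm are assumptions not stated in the text.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_sec=0.025, hop_sec=0.010):
    """Cut a signal into 25 ms frames taken every 10 ms (one frame per row)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(sample_rate * frame_sec))
    hop_len = int(round(sample_rate * hop_sec))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row t holds samples [t * hop_len, t * hop_len + frame_len).
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def log_power_spectrum(frames):
    """FFT-based log power spectrum of each frame (window choice is assumed)."""
    window = np.hamming(frames.shape[1])
    spectrum = np.fft.rfft(frames * window, axis=1)
    return np.log(np.abs(spectrum) ** 2 + 1e-10)
```

For a one-second signal sampled at 16 kHz this yields 98 frames of 400 samples, each mapped to 201 frequency bins.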

The speech statistic calculation unit 22 then calculates a time series of numerical information representing the types of sounds, using the time series of acoustic features and a speech model that stores the correspondence between acoustic features and types of sounds. For example, when the speech model is a Gaussian mixture model (GMM), the speech statistic calculation unit 22 calculates the posterior probability of each component distribution based on the mean, variance, and mixture coefficient of each component distribution of the GMM. Here, the posterior probability of each component distribution is the degree of appearance of the corresponding type of sound contained in the speech signal. When the speech model is a neural network, the speech statistic calculation unit 22 calculates the degree of appearance of the types of sounds contained in the speech signal based on the acoustic features and the weight coefficients of the neural network.
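For the GMM case, the per-frame posterior probabilities can be computed as in the following sketch. It assumes diagonal covariances (a per-dimension variance for each component), which the text does not specify, and the function and argument names are illustrative.

```python
import numpy as np

def gmm_posteriors(features, means, variances, weights):
    """Posterior probability of each GMM component for each frame.

    features: (T, D) acoustic features, means/variances: (C, D),
    weights: (C,) mixture coefficients. Returns a (T, C) array whose
    rows sum to 1.
    """
    features = np.asarray(features, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)

    diff = features[:, None, :] - means[None, :, :]            # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_joint = np.log(weights) + log_gauss                    # (T, C)
    log_joint -= log_joint.max(axis=1, keepdims=True)          # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)
```

Working in the log domain and subtracting the row maximum avoids underflow when a frame lies far from every component.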

The contribution storage unit 23 stores one or more contribution estimators. A contribution estimator is configured to sort speech signals into a plurality of types according to signal quality. It outputs, for example, numerical information representing the quality of a speech signal. The types of signal quality are, for example, speech, non-speech, and silence. As another example, the types of signal quality are speech for which speaker recognition is correct and speech that causes an error in speaker recognition.

Specifically, the contribution storage unit 23 stores the parameters held by the contribution estimator. For example, when the contribution estimator is a neural network, the contribution storage unit 23 stores as parameters a set of values such as the number of nodes in the network and the connection weight coefficients between nodes.

Although FIG. 3 shows an example in which the contribution storage unit 23 is built into the speech processing device 200, the present invention is not limited to this. The contribution storage unit 23 may be realized by a storage device provided outside the speech processing device 200.

The contribution estimation unit 11 receives the segmented speech signal from the speech section detection unit 21. Using the contribution estimator stored in the contribution storage unit 23, it calculates numerical information representing the quality of the segmented speech signal. Like the speech statistic calculation unit 22, the contribution estimation unit 11 converts the segmented speech signal into a time series of short-time frames and calculates the acoustic features of each frame, yielding a time series of acoustic features. It then calculates a numerical value representing the quality of each frame using the acoustic features of that frame and the parameters of the contribution estimator. Hereinafter, the numerical value representing signal quality calculated for a given speech signal is called the contribution of that speech signal.

Specifically, when the contribution estimator is a neural network, for example, the contribution estimation unit 11 calculates the contribution of an acoustic feature based on the acoustic feature and the weight coefficients of the neural network. Suppose, for example, that the contribution estimator is a neural network whose output layer corresponds to two types of signal quality: "signal for which speaker recognition is correct" and "signal that causes a speaker recognition error". The contribution estimator then calculates the probability that the acoustic feature is a signal for which speaker recognition is correct and the probability that it is a signal that causes a speaker recognition error, and outputs as the contribution, for example, the probability of "signal for which speaker recognition is correct". In addition, before speaker recognition is executed, the contribution estimation unit 11 may identify whether a partial section of the speech signal is speech and calculate the probability that it is speech.
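A two-output estimator of this kind can be sketched as a small feed-forward network. The single hidden layer, the tanh activation, the softmax output, and all parameter names are assumptions for illustration; the text only states that the output layer corresponds to the two quality types and that the probability of the "correct" type is used as the contribution.

```python
import numpy as np

def contribution_from_network(features, w_hidden, b_hidden, w_out, b_out):
    """Per-frame contribution from a hypothetical two-class network.

    features: (T, D) acoustic features. Output column 0 is taken to be
    "speaker recognition is correct", column 1 "causes a recognition error".
    Returns the (T,) probabilities of column 0.
    """
    features = np.asarray(features, dtype=float)
    hidden = np.tanh(features @ w_hidden + b_hidden)
    logits = hidden @ w_out + b_out
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 0]
```

Because the two output probabilities sum to one, the contribution always lies between 0 and 1 and can be used directly as a frame weight.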

The speaker feature calculation unit 12 receives the speech statistics output by the speech statistic calculation unit 22 and the contribution output by the contribution estimation unit 11. Using the speech statistics and the contribution, it calculates a recognition feature for recognizing specific attribute information from the speech signal.

An example is described in which the speaker feature calculation unit 12 calculates a feature vector F(x) based on the i-vector as the recognition feature of a speech signal x. The feature vector F(x) calculated by the speaker feature calculation unit 12 may be any vector that can be calculated by applying a predetermined operation to the speech signal x; the i-vector is one example.

The speaker feature calculation unit 12 receives from the speech statistic calculation unit 22, as information on the statistics of the speech signal x, for example, the acoustic posterior probability Pt(x) and the acoustic feature At(x) calculated for each short-time frame (t = {1…T}, where T is a natural number equal to or greater than 1). The speaker feature calculation unit 12 also receives from the contribution estimation unit 11, as information on the contribution of the speech signal x, for example, the contribution Ct(x) calculated for each short-time frame. As in equation (1), the speaker feature calculation unit 12 multiplies each element of the acoustic posterior probability Pt(x) by the contribution Ct(x) and takes the result as Qt(x).

Using the acoustic posterior probability Qt(x) weighted by the contribution and the acoustic feature At(x), the speaker feature calculation unit 12 calculates the zeroth-order statistic S0(x) of the speech signal x based on equation (2) and the first-order statistic S1(x) based on equation (3).

The speaker feature calculation unit 12 then calculates F(x), the i-vector of the speech signal x, based on equation (4).

In equations (1) to (4), C is the number of elements of the statistics S0(x) and S1(x), D is the number of elements (dimensions) of the acoustic feature At(x), mc is the mean vector of the acoustic features in the c-th region of the acoustic feature space, I is the identity matrix, and 0 is the zero matrix. T is a parameter for i-vector calculation, and Σ is the covariance matrix of the acoustic features in the acoustic feature space.
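The images of equations (1) to (4) did not survive in this text. The following is a reconstruction from the symbol definitions above and from the standard i-vector formulation of Non-Patent Document 1, not a copy of the published equations; the component index c and the block layout of S0(x) and S1(x) are assumptions. Note that the text uses T both for the number of frames and for the i-vector parameter matrix.

```latex
\begin{align}
Q_{t,c}(x) &= C_t(x)\, P_{t,c}(x) \tag{1} \\
S_0(x) &=
\begin{pmatrix}
S_{0,1} I_D & & 0 \\
 & \ddots & \\
0 & & S_{0,C} I_D
\end{pmatrix},
\qquad S_{0,c} = \sum_{t=1}^{T} Q_{t,c}(x) \tag{2} \\
S_1(x) &= \left(S_{1,1}^{\top}, \ldots, S_{1,C}^{\top}\right)^{\top},
\qquad S_{1,c} = \sum_{t=1}^{T} Q_{t,c}(x)\,\bigl(A_t(x) - m_c\bigr) \tag{3} \\
F(x) &= \left(I + T^{\top} \Sigma^{-1} S_0(x)\, T\right)^{-1} T^{\top} \Sigma^{-1} S_1(x) \tag{4}
\end{align}
```

Under this reading, setting every C_t(x) to 1 reduces Q to P, which matches the statement that the procedure then coincides with the unweighted i-vector calculation.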

When the speaker feature calculation unit 12 calculates the feature vector F(x) by the above procedure, if the contribution Ct(x) is 1 at every time t of the speech signal x (t = {1…T}, where T is a natural number equal to or greater than 1), the procedure is equivalent to the i-vector calculation procedure described in Non-Patent Document 1. In this embodiment, by using the contribution Ct(x) that the contribution estimation unit 11 estimates for each time t of the speech signal x, the speaker feature calculation unit 12 can calculate a feature vector F(x) that differs from the i-vector described in Non-Patent Document 1.
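The contribution-weighted calculation can be sketched in code as follows. This follows the standard i-vector formula with each frame's posterior scaled by its contribution; the diagonal form of Σ, the array layouts, and all names are assumptions made for this example rather than details taken from the text.

```python
import numpy as np

def weighted_ivector(posteriors, features, contributions, means, t_matrix, sigma_diag):
    """Contribution-weighted i-vector (sketch with an assumed diagonal Sigma).

    posteriors: (T, C), features: (T, D), contributions: (T,),
    means: (C, D), t_matrix: (C*D, R), sigma_diag: (C*D,).
    Returns the (R,) feature vector.
    """
    q = posteriors * contributions[:, None]          # weight each frame
    s0 = q.sum(axis=0)                               # zeroth-order statistic, (C,)
    s1 = q.T @ features - s0[:, None] * means        # sum_t q_tc (A_t - m_c), (C, D)

    n_dims = features.shape[1]
    s0_diag = np.repeat(s0, n_dims)                  # diagonal of block-diagonal S0
    s1_vec = s1.reshape(-1)                          # stacked first-order statistic

    tt_sinv = t_matrix.T / sigma_diag                # T^T Sigma^-1, (R, C*D)
    precision = np.eye(t_matrix.shape[1]) + (tt_sinv * s0_diag) @ t_matrix
    return np.linalg.solve(precision, tt_sinv @ s1_vec)
```

A frame with contribution 0 drops out of both statistics, so it has no effect on the result, while contributions of 1 everywhere give the unweighted calculation.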

In this way, in the speech processing device 200, the speaker feature calculation unit 12 calculates the feature vector F(x) for the speech signal x using the contribution Ct(x), which reflects the quality of each partial section of the speech signal, and can thereby output a feature vector that reflects the quality of the speech signal.

The contribution learning unit 24 uses training speech signals to learn a contribution estimator that can be stored in the contribution storage unit 23. For example, when the contribution estimator is a neural network, the contribution learning unit 24 optimizes parameters such as the connection weight coefficients between its nodes according to a general optimization criterion. The training speech signals used by the contribution learning unit 24 are a collection of speech signals, each of which is associated with one of the types of signal quality output by the contribution estimation unit 11.

The following describes an example of how the contribution learning unit 24 learns a contribution estimator whose input is acoustic features and whose output is two types of signal quality: "speech for which speaker recognition is correct" and "speech that causes an error in speaker recognition".

(a) First, the contribution learning unit 24 uses a plurality of speech signals with speaker labels to learn a classifier capable of identifying the speaker label of a speech signal.
(b) Next, the contribution learning unit 24 converts each of the speaker-labeled speech signals into a time series of acoustic features calculated for each short-time frame, and uses the classifier learned in (a) to identify the speaker label of each frame.
(c) Next, the contribution learning unit 24 treats each frame for which the speaker label assigned in advance matches the speaker label identified by the classifier as "speech for which speaker recognition is correct", and every other frame as "speech that causes an error in speaker recognition".
(d) The contribution learning unit 24 then learns the contribution estimator using the "speech for which speaker recognition is correct" and the "speech that causes an error in speaker recognition" as training speech signals.
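Steps (a) to (c) can be sketched as follows. The text does not specify the classifier, so a nearest-centroid classifier stands in for it here purely for illustration, and the function names are hypothetical. Step (d), training the contribution estimator on the resulting labels, is left out.

```python
import numpy as np

def train_centroids(frames, speaker_ids):
    """Step (a), stand-in classifier: one mean feature vector per speaker."""
    speakers = np.unique(speaker_ids)
    centroids = np.stack([frames[speaker_ids == s].mean(axis=0) for s in speakers])
    return speakers, centroids

def classify_frames(frames, speakers, centroids):
    """Step (b): assign each frame the label of its nearest centroid."""
    dists = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=2)
    return speakers[dists.argmin(axis=1)]

def quality_labels(frames, speaker_ids):
    """Step (c): 1 where the predicted label matches the given label
    ("recognition is correct"), 0 otherwise ("causes a recognition error")."""
    frames = np.asarray(frames, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    speakers, centroids = train_centroids(frames, speaker_ids)
    predicted = classify_frames(frames, speakers, centroids)
    return (predicted == speaker_ids).astype(int)
```

The 0/1 labels returned here, paired with the frame features, would form the training set for the two-class contribution estimator.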

As described above, in the speech processing device 200 according to this embodiment, the contribution estimation unit 11 can calculate the contribution of the speech signal as an index of the quality of each partial section of the speech signal. The speaker feature calculation unit 12 calculates the feature vector based on the acoustic statistics and the contribution of the speech signal. This makes it possible to output, for a speech signal, a feature vector that takes the quality of each partial section into account. In other words, the speech processing device 200 according to this embodiment can calculate speaker features suited to improving the accuracy of speaker recognition.

The contribution storage unit 23 in the speech processing device 200 according to this embodiment is preferably a non-volatile recording medium, but it can also be realized with a volatile recording medium. The process by which the contribution estimator comes to be stored in the contribution storage unit 23 is not particularly limited. For example, the contribution estimator may be stored in the contribution storage unit 23 via a recording medium, or a contribution estimator transmitted over a communication line or the like may be stored there. Alternatively, a contribution estimator entered through an input device may be stored in the contribution storage unit 23.

(Operation of the Second Embodiment)
Next, the operation of the speech processing device 200 according to the second embodiment will be described with reference to the flowchart of FIG. 4. FIG. 4 is a flowchart showing an example of the operation of the speech processing device 200.

The speech processing device 200 receives one or more speech signals from the outside and provides them to the speech segment detection unit 21. Specifically, the speech segment detection unit 21 segments the received speech signals and outputs the segmented speech signals to the contribution estimation unit 11 and the speech statistic calculation unit 22 (step S201).

The speech statistic calculation unit 22 performs short-time frame analysis on each of the one or more received segmented speech signals and calculates time series of acoustic features and speech statistics (step S202).

The contribution estimation unit 11 performs short-time frame analysis on each of the one or more received segmented speech signals and calculates a time series of contributions (step S203).

The speaker feature calculation unit 12 calculates and outputs a speaker recognition feature based on the one or more received time series of acoustic features, speech statistics, and contributions (step S204). The speech processing device 200 ends the series of processes when reception of speech signals from the outside is complete.

(Effect of the Second Embodiment)
As described above, the speech processing device 200 according to the present embodiment can improve the accuracy of speaker recognition that uses the speaker features it calculates. This is because the contribution estimation unit 11 calculates the quality of the speech signal as a contribution, and the speaker feature calculation unit 12 calculates a feature vector that takes the contribution into account, so that the speech processing device 200 outputs a feature vector that places weight on the high-quality partial sections of the speech signal.

In this way, the speech processing device 200 according to the present embodiment calculates, for a speech signal, a feature vector that takes into account a contribution corresponding to the quality of each partial section. As a result, a recognition feature suitable for speaker recognition can be obtained even when a partial section of the speech signal contains unclear utterances by the speaker, sounds other than speaking voice such as the speaker's coughing or laughter, or noise.
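One way to illustrate how a feature vector can place weight on high-quality partial sections is to scale each frame's statistics by its contribution before accumulating them. The following sketch assumes per-frame acoustic features, per-frame posterior probabilities over sound types, and per-frame contributions between 0 and 1. The function name and the exact weighting form are assumptions made for illustration and are not the formulas of the embodiment.

```python
from typing import List, Sequence, Tuple


def contribution_weighted_statistics(
    features: Sequence[Sequence[float]],    # [frames][feature_dim]
    posteriors: Sequence[Sequence[float]],  # [frames][num_sound_types]
    contributions: Sequence[float],         # [frames], assumed in [0, 1]
) -> Tuple[List[float], List[List[float]]]:
    """Accumulate zeroth- and first-order statistics per sound type,
    with each frame scaled by its contribution (hypothetical weighting)."""
    num_types = len(posteriors[0])
    dim = len(features[0])
    zeroth = [0.0] * num_types
    first = [[0.0] * dim for _ in range(num_types)]
    for x, p, c in zip(features, posteriors, contributions):
        for k in range(num_types):
            weight = c * p[k]  # a low-contribution frame counts for less
            zeroth[k] += weight
            for d in range(dim):
                first[k][d] += weight * x[d]
    return zeroth, first
```

A frame with contribution 0, for example a cough, then has no effect on the accumulated statistics, while a frame with contribution 1 is counted in full.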

<Third Embodiment>
FIG. 5 is a block diagram showing an example of the configuration of a speech processing device according to a third embodiment of the present invention.

As shown in FIG. 5, the speech processing device 300 includes a contribution estimation unit 11, a speaker feature calculation unit 12, and an attribute recognition unit 13. The speech processing device 300 is a speech processing device capable of recognizing attribute information.

The contribution estimation unit 11 and the speaker feature calculation unit 12 are the same as in the first and second embodiments, so their description is omitted.

The attribute recognition unit 13 receives, from the speaker feature calculation unit 12, a recognition feature for recognizing attribute information. Based on the recognition feature, the attribute recognition unit 13 recognizes the speaker who uttered the speech signal, the language of the speech signal, the emotional expression contained in the speech signal, the personality of the speaker estimated from the speech signal, and the like. Specifically, the attribute recognition unit 13 refers to, for example, a storage device (not shown) that stores comparison speech data against which the recognition feature is compared. In this case, the attribute recognition unit 13 can recognize the attribute information by calculating, for example, the degree of similarity between the recognition feature and the comparison speech data.

<Specific Example of the Third Embodiment>
Next, a specific application example of the speech processing device 300 according to the third embodiment of the present invention will be described.

The speaker features calculated by the speech processing device 300 according to the third embodiment of the present invention can be used for speaker recognition, which estimates the speaker of a speech signal. For example, the cosine similarity between a first speaker feature calculated from a first speech signal and a second speaker feature calculated from a second speech signal is calculated as an index of the similarity of the two speaker features. When the purpose is speaker verification, accept/reject decision information based on this similarity may be output. When the purpose is speaker identification, a plurality of second speech signals may be prepared for the first speech signal, the similarity calculated for each, and the pairs with high similarity values output.
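The cosine-similarity comparison described above can be sketched as follows. The function names, the decision threshold of 0.5, and the choice to return only the single best-matching enrolled speaker are assumptions made for illustration; the description itself does not fix a threshold or the number of pairs to output.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two speaker feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def verify(first: Sequence[float], second: Sequence[float],
           threshold: float = 0.5) -> bool:
    """Speaker verification: accept if similarity reaches the threshold
    (the threshold value is a hypothetical choice)."""
    return cosine_similarity(first, second) >= threshold


def identify(first: Sequence[float],
             enrolled: Dict[str, Sequence[float]]) -> str:
    """Speaker identification: name of the enrolled feature most similar
    to the first speaker feature."""
    return max(enrolled, key=lambda name: cosine_similarity(first, enrolled[name]))
```

In practice the threshold would be tuned on held-out data, and identification could return the top several candidates instead of one.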

The speech processing device 300 according to the third embodiment of the present invention is an example of a feature calculation device that calculates a recognition feature for recognizing specific attribute information from a speech signal. When the specific attribute is the speaker who uttered the speech signal, the speech processing device 300 can be used as a speaker feature extraction device. The speech processing device 300 can also be used as part of a speech recognition device equipped with a mechanism that, for a speech signal of a sentence utterance for example, adapts to the characteristics of the speaker's way of speaking based on speaker information estimated using the speaker features. Here, the information indicating the speaker may be information indicating the speaker's gender, or information indicating the speaker's age or age group.

When the specific attribute is information indicating the language conveyed by the speech signal (the language of which the speech signal is composed), the speech processing device 300 according to the third embodiment of the present invention can be used as a language feature calculation device. The speech processing device 300 can also be used as part of a speech translation device equipped with a mechanism that, for a speech signal of a sentence utterance for example, selects the language to be translated based on language information estimated using the language features.

When the specific attribute is information indicating the speaker's emotion at the time of utterance, the speech processing device 300 according to the third embodiment of the present invention can be used as an emotion feature calculation device. The speech processing device 300 can also be used as part of a speech search device or speech display device equipped with a mechanism that, for a large number of stored utterance speech signals for example, identifies the speech signals corresponding to a specific emotion based on emotion information estimated using the emotion features. This emotion information includes, for example, information indicating emotional expression and information indicating the personality of the speaker.

As described above, the specific attribute information in the present embodiment is information representing at least one of the speaker who uttered the speech signal, the language of the speech signal, the emotional expression contained in the speech signal, and the personality of the speaker estimated from the speech signal.

(Description of Hardware Configuration)
Although the present invention has been described using the embodiments, the present invention is not limited to the above embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope. That is, the present invention is not limited to the above embodiments, various modifications are possible, and such modifications are also included within the scope of the present invention.

As described above, the speech processing device and the like according to one aspect of the present invention have the effect of extracting a feature vector that takes the quality of the speech signal into account and thereby improving the accuracy of speaker recognition, and are useful as speech processing devices and speaker recognition devices. When information about users is acquired or used in the present invention, this shall be done lawfully.

<Other Embodiments>
The speech processing device may be realized by hardware or by software. The speech processing device may also be realized by a combination of hardware and software.

FIG. 6 is a block diagram showing an example of an information processing device (computer) that constitutes the speech processing device.

As shown in FIG. 6, the information processing device 400 includes a control unit (CPU: Central Processing Unit) 410, a storage unit 420, a ROM (Read Only Memory) 430, a RAM (Random Access Memory) 440, a communication interface 450, and a user interface 460.

The control unit (CPU) 410 can realize the various functions of the speech processing device and the speaker recognition device by loading a program stored in the storage unit 420 or the ROM 430 into the RAM 440 and executing it. The control unit (CPU) 410 may also include an internal buffer capable of temporarily storing data and the like.

The storage unit 420 is a large-capacity storage medium capable of holding various data, and can be realized by a storage medium such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). When the information processing device 400 is connected to a communication network via the communication interface 450, the storage unit 420 may be cloud storage located on the communication network. The storage unit 420 may also hold a program readable by the control unit (CPU) 410.

The ROM 430 is a non-volatile storage device that can be configured with flash memory or the like having a smaller capacity than the storage unit 420. The ROM 430 may also hold a program readable by the control unit (CPU) 410. It is sufficient that at least one of the storage unit 420 and the ROM 430 holds the program readable by the control unit (CPU) 410.

The program readable by the control unit (CPU) 410 may be supplied to the information processing device 400 in a state of being stored non-transitorily in any of various computer-readable storage media. Examples of such storage media are magnetic tape, magnetic disks, magneto-optical disks, CD-ROM, CD-R, CD-R/W, and semiconductor memory.

The RAM 440 is a semiconductor memory such as DRAM (Dynamic Random Access Memory) or SRAM (Static Random Access Memory), and can be used as an internal buffer for temporarily storing data and the like.

The communication interface 450 is an interface that connects the information processing device 400 to a communication network by wire or wirelessly.

The user interface 460 is, for example, a display unit such as a display, and an input unit such as a keyboard, mouse, or touch panel.

Some or all of the above embodiments may also be described as in the following appendices, but are not limited to them.

[Appendix 1]
A speech processing device comprising:
receiving means for receiving input of a plurality of speech signals representing speech; and
information processing means for calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals.

[Appendix 2]
The speech processing device according to Appendix 1, further comprising speech statistic calculation means for calculating a speech statistic representing the ratio of the types of sounds contained in the plurality of speech signals,
wherein the information processing means calculates the recognition feature based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

[Appendix 3]
The speech processing device according to Appendix 1 or 2, wherein the quality is at least one of:
a value representing speech likelihood, calculated by identifying whether a part of the plurality of speech signals is speech;
a value representing the likelihood of correct speaker recognition, calculated by identifying whether a part of the plurality of speech signals is speech that is correctly recognized in speaker recognition; and
a value representing the likelihood of speaker recognition error, calculated by identifying whether a part of the plurality of speech signals is speech that causes speaker recognition errors.

[Appendix 4]
The speech processing device according to Appendix 3, further comprising quality estimation means for calculating the quality of the plurality of speech signals using a neural network.

[Appendix 5]
The speech processing device according to Appendix 3 or 4, wherein the information processing means calculates an i-vector as the recognition feature.

[Appendix 6]
The speech processing device according to any one of Appendices 1 to 5, comprising attribute recognition means for recognizing the attribute information based on the recognition feature.

[Appendix 7]
The speech processing device according to any one of Appendices 1 to 6, wherein the specific attribute information is information representing at least one of the speaker who uttered the speech signal, the language of the speech signal, the emotional expression contained in the speech signal, and the personality of the speaker estimated from the speech signal.

[Appendix 8]
A speech processing method comprising:
accepting input of a plurality of speech signals representing speech; and
calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals.

[Appendix 9]
The speech processing method according to Appendix 8, further comprising calculating a speech statistic representing the ratio of the types of sounds contained in the plurality of speech signals,
wherein the recognition feature is calculated based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

[Appendix 10]
The speech processing method according to Appendix 8 or 9, wherein the quality is at least one of:
a value representing speech likelihood, calculated by identifying whether a part of the plurality of speech signals is speech;
a value representing the likelihood of correct speaker recognition, calculated by identifying whether a part of the plurality of speech signals is speech that is correctly recognized in speaker recognition; and
a value representing the likelihood of speaker recognition error, calculated by identifying whether a part of the plurality of speech signals is speech that causes speaker recognition errors.

[Appendix 11]
The speech processing method according to Appendix 10, wherein the quality of the plurality of speech signals is calculated using a neural network.

[Appendix 12]
The speech processing method according to Appendix 10 or 11, wherein an i-vector is calculated as the recognition feature.

[Appendix 13]
The speech processing method according to any one of Appendices 8 to 12, wherein the attribute information is recognized based on the recognition feature.

[Appendix 14]
The speech processing method according to any one of Appendices 8 to 13, wherein the specific attribute information is information representing at least one of the speaker who uttered the speech signal, the language of the speech signal, the emotional expression contained in the speech signal, and the personality of the speaker estimated from the speech signal.

[Appendix 15]
A speech processing program causing a computer to execute:
a process of accepting input of a plurality of speech signals representing speech; and
a process of calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals.

[Appendix 16]
The speech processing program according to Appendix 15, further causing the computer to execute:
a process of calculating a speech statistic representing the ratio of the types of sounds contained in the plurality of speech signals; and
a process of calculating the recognition feature based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

[Appendix 17]
The speech processing program according to Appendix 15 or 16, wherein the quality is at least one of:
a value representing speech likelihood, calculated by identifying whether a part of the plurality of speech signals is speech;
a value representing the likelihood of correct speaker recognition, calculated by identifying whether a part of the plurality of speech signals is speech that is correctly recognized in speaker recognition; and
a value representing the likelihood of speaker recognition error, calculated by identifying whether a part of the plurality of speech signals is speech that causes speaker recognition errors.

[Appendix 18]
The speech processing program according to Appendix 17, causing the computer to execute a process of calculating the quality of the plurality of speech signals using a neural network.

[Appendix 19]
The speech processing program according to Appendix 17 or 18, causing the computer to execute a process of calculating an i-vector as the recognition feature.

[Appendix 20]
The speech processing program according to any one of Appendices 15 to 19, causing the computer to execute a process of recognizing the attribute information based on the recognition feature.

[Appendix 21]
The speech processing program according to any one of Appendices 15 to 20, wherein the specific attribute information is information representing at least one of the speaker who uttered the speech signal, the language of the speech signal, the emotional expression contained in the speech signal, and the personality of the speaker estimated from the speech signal.

REFERENCE SIGNS LIST
11 ... Contribution estimation unit
12 ... Speaker feature calculation unit
13 ... Attribute recognition unit
21 ... Speech segment detection unit
22 ... Speech statistic calculation unit
23 ... Contribution storage unit
24 ... Contribution learning unit
100, 200, 300 ... Speech processing device
400 ... Information processing device
410 ... Control unit (CPU)
420 ... Storage unit
430 ... ROM
440 ... RAM
450 ... Communication interface
460 ... User interface

Claims

1. A speech processing device comprising:
receiving means for receiving input of a plurality of speech signals representing speech;
quality estimation means for calculating two types of quality in the plurality of speech signals, one for speech that is correctly recognized in speaker recognition and one for speech that causes errors in speaker recognition;
information processing means for calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals; and
speech statistic calculation means for calculating a speech statistic representing the degree of occurrence of the types of sounds contained in the plurality of speech signals,
wherein the information processing means calculates the recognition feature based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

2. The speech processing device according to claim 1, wherein the quality is a value representing speech likelihood, calculated by identifying whether a part of the plurality of speech signals is speech.

3. The speech processing device according to claim 2, wherein the information processing means calculates an i-vector as the recognition feature.

4. The speech processing device according to claim 1, further comprising attribute recognition means for recognizing the attribute information based on the recognition feature.

5. The speech processing device according to any one of claims 1 to 4, wherein the specific attribute information is information representing at least one of the speaker who uttered the speech signal, the language of the speech signal, the emotional expression contained in the speech signal, and the personality of the speaker estimated from the speech signal.

6. A speech processing method comprising:
accepting input of a plurality of speech signals representing speech;
calculating two types of quality in the plurality of speech signals, one for speech that is correctly recognized in speaker recognition and one for speech that causes errors in speaker recognition;
calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals; and
calculating a speech statistic representing the frequency of occurrence of the types of sounds contained in the plurality of speech signals,
wherein the recognition feature is calculated based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.

7. The speech processing method according to claim 6, wherein the quality is a value representing speech likelihood, calculated by identifying whether a part of the plurality of speech signals is speech.

8. A speech processing program causing a computer to execute:
a process of accepting input of a plurality of speech signals representing speech;
a process of calculating two types of quality in the plurality of speech signals, one for speech that is correctly recognized in speaker recognition and one for speech that causes errors in speaker recognition;
a process of calculating, based on the quality of the plurality of speech signals, a recognition feature for recognizing specific attribute information from the plurality of speech signals; and
a process of calculating a speech statistic representing the frequency of occurrence of the types of sounds contained in the plurality of speech signals,
wherein the recognition feature is calculated based on the speech statistic of the plurality of speech signals and the quality of the plurality of speech signals.