JP6553015B2

JP6553015B2 - Speaker attribute estimation system, learning device, estimation device, speaker attribute estimation method, and program

Info

Publication number: JP6553015B2
Application number: JP2016222351A
Authority: JP
Inventors: 歩相名神山; 哲小橋川; 山口　義和
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-11-15
Filing date: 2016-11-15
Publication date: 2019-07-31
Anticipated expiration: 2036-11-15
Also published as: JP2018081169A

Description

The present invention relates to a technique for estimating speaker attributes from speech.

Voice interaction robots and marketing data collection in call centers require a technique for identifying speaker attributes (for example, gender and age group) from speech. Conventional techniques for identifying speaker attributes convert the voice quality of the input speech into an i-vector and then classify that i-vector with a support vector machine (SVM) or with a Gaussian mixture model (GMM) (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Shoko Miyamori et al., "Study of a child-user discrimination method based on speech recognition of a brief utterance", Proceedings of the Information Science and Technology Forum, vol. 9(3), pp. 469-472, 2010.

A challenge for conventional speaker attribute estimation is improving the identification rate. With conventional techniques, the identification rate is about 80 to 90% for three classes: adult male, adult female, and child (male or female). In particular, misidentification must be prevented when features that appear in the learning data but are unrelated to the attribute (for example, noise or clipping of the speech) are also present in the speech to be identified.

A further problem is that conventional techniques cannot provide the certainty of an identification result. For example, a voice interaction robot that responds on the basis of an incorrect identification result may make the user uncomfortable, so when the identification result is not certain the robot needs to respond as if the attribute were neutral. In addition, if attributes could be identified while speech is still being input, with the certainty obtained sequentially, a quick response based on the identification result would become possible.

In view of the above, an object of the present invention is to provide a speaker attribute estimation technique that can estimate speaker attributes with higher accuracy than before.

To solve the above problems, the speaker attribute estimation system of the present invention includes a learning device and an estimation device. The learning device includes an attribute label creation unit, which creates an attribute label sequence representing the speaker attribute of each frame of the learning speech from attribute information representing the speaker attribute of each learning speech, and a deep learning unit, which learns a deep neural network model using the per-frame acoustic feature sequence of the learning speech and the attribute label sequence. The estimation device includes a posterior probability calculation unit, which calculates a per-frame posterior probability sequence from the per-frame acoustic feature sequence of the input speech using the deep neural network model, and an identification unit, which identifies the speaker attribute based on the log-sum of the posterior probability sequence obtained for each speaker attribute.

According to the present invention, estimation is robust to noise and to spectral features unrelated to the attribute (for example, clipped sound), so speaker attributes (for example, gender and age group) can be estimated with high accuracy. Moreover, because the reliability of the identification result can be obtained, a voice interaction robot, for example, can respond quickly.

FIG. 1 illustrates the functional configuration of the learning device of the first embodiment.
FIG. 2 illustrates the functional configuration of the estimation device of the first embodiment.
FIG. 3 illustrates the processing procedure of the speaker attribute estimation method of the first embodiment.
FIG. 4 explains how a clip sound is created.
FIG. 5 illustrates the functional configuration of the learning device of the second embodiment.
FIG. 6 illustrates the processing procedure of the speaker attribute estimation method of the second embodiment.
FIG. 7 illustrates the functional configuration of the learning device of the third embodiment.
FIG. 8 illustrates the processing procedure of the speaker attribute estimation method of the third embodiment.
FIG. 9 illustrates the functional configuration of the learning device of the fourth embodiment.
FIG. 10 illustrates the functional configuration of the estimation device of the fourth embodiment.
FIG. 11 illustrates the processing procedure of the speaker attribute estimation method of the fourth embodiment.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference number, and duplicate description is omitted.

[First Embodiment]
In the first embodiment, a deep learning model is used to obtain sequential (frame-by-frame) posterior probabilities of the input speech, and the speaker attribute is estimated using the log posterior probability obtained by summing them over frames. This makes it possible to estimate speaker attributes with higher accuracy than before.

The speaker attribute estimation system of the first embodiment includes, for example, a learning device that learns a deep neural network (DNN) model (hereinafter, DNN model) from learning data, and an estimation device that estimates the speaker attribute of input speech using the learned DNN model. As shown in FIG. 1, the learning device of the first embodiment includes a learning data storage unit 10, a feature extraction unit 11, an attribute label creation unit 12, a deep learning unit 13, and a DNN model storage unit 20. As shown in FIG. 2, the estimation device of the first embodiment includes the DNN model storage unit 20, a feature extraction unit 21, a posterior probability calculation unit 22, and an identification unit 23. The speaker attribute estimation method of the first embodiment is realized by the learning device and the estimation device performing the steps shown in FIG. 3.

The learning device and the estimation device are each a special device configured by loading a special program into a known or dedicated computer having, for example, a central processing unit (CPU) and a main storage device (RAM: random access memory). Each device executes its processing under the control of the central processing unit, for example. Data input to the devices and data obtained in each process are stored in, for example, the main storage device, and are read out as needed for use in other processing. At least some of the processing units of the learning device and the estimation device may be implemented in hardware such as an integrated circuit. Each storage unit of the devices can be implemented by, for example, a main storage device such as RAM, an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store. The storage units need only be logically separated and may reside on a single physical storage device.

The learning data storage unit 10 stores the learning data used to learn the DNN model. The learning data consist of learning speech s(k, t) and attribute information L(k). k (= 0, 1, ..., K) is the index of the learning speech. t (= 0, 1, ..., T_k - 1) is the sample time, and T_k is the length of the k-th learning speech. s(k, t) is the amplitude of the k-th learning speech at sample time t for a sampling frequency of f_s [Hz]. L(k) is a numerical value indicating the speaker attribute of the k-th learning speech; for example, L(k) = 0 denotes "adult male", L(k) = 1 denotes "adult female", and L(k) = 2 denotes "child".

The processing procedure of the speaker attribute estimation method executed by the learning device and the estimation device of the first embodiment is described below with reference to FIG. 3.

In step S1, the feature extraction unit 11 of the learning device reads the learning speech s(k, t) stored in the learning data storage unit 10, extracts from it an acoustic feature sequence c(k, i, j) of mel-frequency cepstrum coefficients (MFCC), and outputs the sequence. i (= 0, 1, 2, ..., I_k - 1) is the frame number, I_k is the number of frames of the k-th learning speech, j (= 0, 1, 2, ..., J_k - 1) is the index of the acoustic feature dimension, and J_k is the number of dimensions of the acoustic feature of the k-th learning speech. The mel-frequency cepstrum coefficients may be extracted by any known method. For example, 12 dimensions together with their Δ features and the Δ power feature may be used. An analysis frame width of about 10 milliseconds is suitable. The extracted acoustic feature sequence c(k, i, j) is sent to the attribute label creation unit 12 and the deep learning unit 13.

In step S2, the attribute label creation unit 12 of the learning device reads the attribute information L(k) stored in the learning data storage unit 10 and creates an attribute label sequence l(k, i) whose length equals the number of frames I_k of the learning speech. Specifically, it sets l(k, i) = L(k) for all frames (i = 0, 1, ..., I_k - 1). The created attribute label sequence l(k, i) is sent to the deep learning unit 13.

In step S3, the deep learning unit 13 of the learning device learns the DNN model λ given by Equation (1), using the acoustic feature sequence c(k, i, j) of the learning speech s(k, t) received from the feature extraction unit 11 and the attribute label sequence l(k, i) of the learning speech s(k, t) received from the attribute label creation unit 12.

DNN models are used in image recognition and speech recognition and can learn fine-grained features. p(m | λ, c(k, i, j)) is the posterior probability that the feature c(k, i, j) belongs to attribute m (= 0, 1, ..., M). For example, m = 0 denotes "adult male", m = 1 denotes "adult female", and m = 2 denotes "child". The DNN model is learned over all frames (i = 0, 1, 2, ..., I_k - 1) of all learning speech (k = 0, 1, 2, ..., K) using the attribute labels l(k, i). The learned DNN model λ is stored in the DNN model storage unit 20.

In step S4, the feature extraction unit 21 of the estimation device extracts an acoustic feature sequence c'(i, j) of mel-frequency cepstrum coefficients from the input speech s'(t) and outputs it, where i = 0, 1, ..., I - 1, j = 0, 1, ..., J - 1, I is the number of frames of the input speech, and J is the number of dimensions of its acoustic feature. The extracted acoustic feature sequence c'(i, j) is sent to the posterior probability calculation unit 22.

In step S5, the posterior probability calculation unit 22 of the estimation device uses the DNN model λ stored in the DNN model storage unit 20 to calculate, from the acoustic feature sequence c'(i, j) of the input speech s'(t) received from the feature extraction unit 21, the posterior probability sequence q(i, m) = p(m | λ, c'(i, j)) (i = 0, 1, ..., I - 1; m = 0, 1, ..., M). The calculated posterior probability sequence q(i, m) is sent to the identification unit 23.

In step S6, the identification unit 23 of the estimation device identifies the speaker attribute L' from the posterior probability sequence q(i, m) received from the posterior probability calculation unit 22 and outputs it. To identify the speaker attribute, the log-sum of the posterior probabilities over all frames is computed for each attribute according to Equation (2), and the speaker attribute with the highest value is output as the identification result.
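The decision rule of step S6 can be sketched as follows. This is a minimal illustration, assuming that Equation (2) takes the form L' = argmax_m Σ_i log q(i, m), which is what the surrounding description implies; the function name `identify_speaker_attribute` and the `floor` parameter (a guard against log(0)) are hypothetical and do not come from the source.

```python
import math


def identify_speaker_attribute(posteriors, floor=1e-12):
    """Pick the attribute with the largest sum of log posteriors.

    posteriors[i][m] plays the role of q(i, m): the posterior of
    attribute m at frame i. `floor` is an assumed safeguard so that a
    zero posterior does not produce log(0).
    """
    if not posteriors:
        raise ValueError("at least one frame is required")
    num_attributes = len(posteriors[0])
    log_sums = [0.0] * num_attributes
    for frame in posteriors:
        for m, q in enumerate(frame):
            log_sums[m] += math.log(max(q, floor))
    # Attribute index with the highest log-sum is the identification result.
    best = max(range(num_attributes), key=lambda m: log_sums[m])
    return best, log_sums
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow when an utterance spans many frames.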

[Second Embodiment]
With a voice interaction robot or the like, the speaker may get too close to the microphone, so that a clip sound (speech whose amplitude is saturated) is input. If some of the learning speech in the learning data contains clip sound, then when similarly clipped speech is input it may be identified not as its true attribute but as the attribute whose learning data carried the clipping characteristics. In the second embodiment, therefore, clip sounds are created by clipping the learning data, as shown in FIG. 4, and are added to the learning data, so that input speech is identified as its true attribute without being misled by the characteristics of clipping.

As shown in FIG. 5, the learning device of the second embodiment includes, as in the first embodiment, the learning data storage unit 10, the feature extraction unit 11, the attribute label creation unit 12, the deep learning unit 13, and the DNN model storage unit 20, and further includes a clip sound synthesis unit 14. The speaker attribute estimation method of the second embodiment is realized by this learning device and the estimation device of the first embodiment performing the steps shown in FIG. 6.

The processing procedure of the speaker attribute estimation method of the second embodiment is described below with reference to FIG. 6, focusing on the differences from the first embodiment.

In step S7, the clip sound synthesis unit 14 of the learning device reads the learning speech s(k, t) stored in the learning data storage unit 10, amplifies its amplitude, and synthesizes a clip sound S(k, t) by limiting any amplitude that exceeds a predetermined threshold to that threshold. The synthesized clip sound S(k, t) is stored in the learning data storage unit 10.

Specifically, the clip sound synthesis unit 14 synthesizes the clip sound as follows.

1. Create the speech S(k, t) = a * s(k, t) (k = 0, 1, ..., K; t = 0, 1, ..., T_k - 1) by multiplying the amplitude of the learning speech s(k, t) by a.

2. To limit the values of S(k, t) that exceed a predetermined threshold ±h (h > 0), apply the following to all samples (t = 0, 1, ..., T_k - 1) of all learning speech (k = 0, 1, ..., K).

(a) If S(k, t) > h, set S(k, t) = h.
(b) If S(k, t) < -h, set S(k, t) = -h.

It is advisable to run this procedure with several values of a, for example a = 1, 3, and 6. A clip sound synthesized in this way has a waveform like the one shown in FIG. 4.
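The two steps above (amplify by a, then limit to ±h) can be sketched in a few lines. The function names and the default threshold of 1.0 are illustrative assumptions; only the gains 1, 3, and 6 are taken from the text.

```python
def synthesize_clip_sound(samples, gain, threshold):
    """Return S(t) = clip(gain * s(t), -threshold, +threshold)."""
    if threshold <= 0:
        raise ValueError("threshold h must be positive")
    clipped = []
    for s in samples:
        amplified = gain * s  # step 1: S = a * s
        # step 2: limit values beyond +h or -h to the threshold
        clipped.append(max(-threshold, min(threshold, amplified)))
    return clipped


def augment_with_clip_sounds(samples, gains=(1, 3, 6), threshold=1.0):
    """Create one clipped copy of an utterance per gain value."""
    return {a: synthesize_clip_sound(samples, a, threshold) for a in gains}
```

Each clipped copy keeps the attribute information L(k) of the utterance it was derived from, so it can be added to the learning data without extra labeling.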

In the subsequent processing of the learning device, the added clip sound S(k, t) is used in the same way as the learning speech s(k, t). As a result, the speaker attribute can be estimated correctly even when the input speech s'(t) of the estimation device is clipped.

[Third Embodiment]
Speaker attributes are hard to discern in unvoiced sounds, so unvoiced sounds can cause misidentification. In addition, when the learning data include sections in which no speech is uttered, the ambient noise in those sections becomes part of the learning data and can also cause misidentification. Attribute identification is therefore best restricted to voiced sounds. In the third embodiment, label data are assigned to the unvoiced or silent portions of the learning data, and frames with a high probability of being unvoiced or silent are excluded from identification, which makes it possible to raise the identification rate.

As shown in FIG. 7, the learning device of the third embodiment includes, as in the second embodiment, the learning data storage unit 10, the feature extraction unit 11, the attribute label creation unit 12, the deep learning unit 13, the clip sound synthesis unit 14, and the DNN model storage unit 20, and further includes a voiced/unvoiced determination unit 15. The speaker attribute estimation method of the third embodiment is realized by this learning device and the estimation device of the first embodiment performing the steps shown in FIG. 8.

The processing procedure of the speaker attribute estimation method of the third embodiment is described below with reference to FIG. 8, focusing on the differences from the second embodiment.

In step S8, the voiced/unvoiced determination unit 15 of the learning device reads the learning speech s(k, t) stored in the learning data storage unit 10, determines its voiced and unvoiced sections, and generates voiced/unvoiced information v(k, i). The generated voiced/unvoiced information v(k, i) is stored in the learning data storage unit 10 in association with the clip sound S(k, t) synthesized by the clip sound synthesis unit 14. For example, v(k, i) = 1 if the i-th frame of the k-th learning speech s(k, t) is voiced, and v(k, i) = 0 if it is unvoiced. The voiced/unvoiced determination uses the same frame width as the feature extraction unit 11 and may be performed with any common fundamental frequency extraction method.

In step S2, the attribute label creation unit 12 of the learning device reads the attribute information L(k) and the voiced/unvoiced information v(k, i) stored in the learning data storage unit 10 and creates an attribute label sequence l(k, i) with one label per frame of the learning speech. Specifically, for all frames (i = 0, 1, ..., I_k - 1), it sets l(k, i) = L(k) for voiced frames (v(k, i) = 1) and l(k, i) = -1 for unvoiced frames (v(k, i) = 0).
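The labeling rule of this step is small enough to show directly. In the sketch below, the function name and the constant `UNVOICED_LABEL` are hypothetical; the values 1, 0, and -1 follow the text.

```python
UNVOICED_LABEL = -1  # label assigned to unvoiced frames, as in the text


def make_attribute_labels(attribute, voicing):
    """Build l(k, i) for one utterance.

    attribute: L(k), the utterance-level speaker attribute.
    voicing:   v(k, i) per frame, 1 for voiced and 0 for unvoiced.
    """
    return [attribute if v == 1 else UNVOICED_LABEL for v in voicing]
```

For an utterance labeled "child" (L(k) = 2) with voicing [1, 0, 1, 1, 0], this yields [2, -1, 2, 2, -1]. Passing an all-ones voicing sequence reproduces the first embodiment, where every frame receives L(k).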

In step S3, the deep learning unit 13 of the learning device learns the DNN model λ in the same way as in the first embodiment. In the DNN model λ of the third embodiment, unvoiced frames (l(k, i) = -1) are treated as attribute m = -1, and the model gives the posterior probability that the feature c(k, i, j) belongs to attribute m (= -1, 0, 1, ..., M).

In step S5, the posterior probability calculation unit 22 of the estimation device calculates the posterior probability sequence q(i, m) (i = 0, 1, ..., I - 1; m = -1, 0, 1, ..., M) in the same way as in the first embodiment.

In step S6, the identification unit 23 of the estimation device identifies the speaker attribute L' from the posterior probability sequence q(i, m) received from the posterior probability calculation unit 22 and outputs it. Because the unvoiced portions have been learned as attribute m = -1, the identification unit 23 of the third embodiment uses the function f(i, m) shown in Equation (3) to restrict identification to the voiced portions only.
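One way to restrict identification to voiced frames is sketched below. The exact form of f(i, m) in Equation (3) is not reproduced in this text, so the gating rule here is an assumption: a frame is skipped when its unvoiced posterior q(i, -1) reaches a threshold. The function name, the threshold of 0.5, and the `floor` guard are all hypothetical.

```python
import math


def identify_voiced_only(posteriors, unvoiced_threshold=0.5, floor=1e-12):
    """Sum log posteriors over frames judged to be voiced.

    posteriors is a list of dicts mapping attribute m (including -1
    for "unvoiced") to q(i, m). Frames whose unvoiced posterior is at
    or above `unvoiced_threshold` are excluded (assumed gating rule).
    """
    attributes = sorted(m for m in posteriors[0] if m != -1)
    log_sums = {m: 0.0 for m in attributes}
    used_frames = 0
    for frame in posteriors:
        if frame[-1] >= unvoiced_threshold:
            continue  # likely unvoiced or silent: excluded from the decision
        used_frames += 1
        for m in attributes:
            log_sums[m] += math.log(max(frame[m], floor))
    if used_frames == 0:
        return None, log_sums  # no voiced frame, so no decision is made
    best = max(attributes, key=lambda m: log_sums[m])
    return best, log_sums
```

Returning `None` when every frame is excluded is a design choice of this sketch; a caller could instead fall back to a neutral attribute.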

[Fourth Embodiment]
In the fourth embodiment, a reliability indicating the certainty of the identification result is obtained from the distributions of the posterior probability sequence at learning time and the posterior probability sequence at estimation time. The reliability is a value from 0 to 1, and the closer it is to 1, the more certain the identification result L' is. Using the reliability, a voice interaction robot, for example, can select an appropriate response according to how reliable the result is.

As shown in FIG. 9, the learning device of the fourth embodiment includes, as in the first embodiment, the learning data storage unit 10, the feature extraction unit 11, the attribute label creation unit 12, the deep learning unit 13, and the DNN model storage unit 20, and further includes a learning data posterior probability calculation unit 16, a reliability parameter learning unit 17, and a reliability parameter storage unit 30. As shown in FIG. 10, the estimation device of the fourth embodiment includes, as in the first embodiment, the DNN model storage unit 20, the feature extraction unit 21, the posterior probability calculation unit 22, and the identification unit 23, and further includes a reliability calculation unit 24 and the reliability parameter storage unit 30. The speaker attribute estimation method of the fourth embodiment is realized by this learning device and estimation device performing the steps shown in FIG. 11.

FIG. 9 shows a configuration in which the idea of the fourth embodiment is applied to the learning device of the first embodiment, but the same idea can also be applied to the second and third embodiments. That is, the learning device of the fourth embodiment may include one or both of the clip sound synthesis unit 14 and the voiced/unvoiced determination unit 15.

The processing procedure of the speaker attribute estimation method of the fourth embodiment is described below with reference to FIG. 11, focusing on the differences from the first embodiment.

In step S9, the learning data posterior probability calculation unit 16 of the learning device uses the DNN model λ learned by the deep learning unit 13 to calculate, from the acoustic feature sequence c(k, i, j) that the feature extraction unit 11 extracted from the learning speech s(k, t), the posterior probability sequence q'(k, i, m) = p(m | λ, c(k, i, j)) (k = 0, 1, ..., K; i = 0, 1, ..., I_k - 1; m = 0, 1, ..., M). The calculated posterior probability sequence q'(k, i, m) is sent to the reliability parameter learning unit 17.

In step S10, the reliability parameter learning unit 17 of the learning device calculates, from the posterior probability sequence q'(k, i, m) received from the learning data posterior probability calculation unit 16 and the attribute label sequence l(k, i) created by the attribute label creation unit 12, the mean μ(m), standard deviation σ(m), and number of frames n(m) of the posterior probability sequence, which are needed to obtain the reliability. These are collectively called the reliability parameters below. The calculated reliability parameters μ(m), σ(m), and n(m) are stored in the reliability parameter storage unit 30.

Specifically, the reliability parameter learning unit 17 obtains the reliability parameters μ(m), σ(m), and n(m) as follows.

1. Obtain the number of frames n(m) and the posterior probability total s(m) by Equation (4).

2. For all attributes (m = 0, 1, ..., M), obtain the mean μ(m) = s(m) / n(m).

3. For all attributes (m = 0, 1, ..., M), obtain the total deviation from the mean d(m) by Equation (5).

4. For all attributes (m = 0, 1, ..., M), obtain the standard deviation σ(m) by Equation (6).
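The four steps above can be sketched as follows. Equations (4) to (6) are not reproduced in this text, so the sketch makes three assumptions: n(m) counts the frames whose label l(k, i) equals m, s(m) sums q'(k, i, m) over those frames, and σ(m) is the sample standard deviation sqrt(d(m) / (n(m) - 1)) with d(m) a sum of squared deviations. The function name is hypothetical.

```python
import math


def learn_reliability_parameters(posteriors, labels, attributes):
    """Compute mu(m), sigma(m), n(m) from learning-data posteriors.

    posteriors[k][i][m] plays the role of q'(k, i, m) and
    labels[k][i] the role of l(k, i). Frames whose label is not in
    `attributes` (for example -1 for unvoiced) are ignored.
    """
    n = {m: 0 for m in attributes}
    s = {m: 0.0 for m in attributes}
    for q_k, l_k in zip(posteriors, labels):
        for q, label in zip(q_k, l_k):
            if label in n:
                n[label] += 1          # step 1: frame count n(m)
                s[label] += q[label]   # step 1: posterior total s(m)
    if any(count < 2 for count in n.values()):
        raise ValueError("each attribute needs at least two frames")
    mu = {m: s[m] / n[m] for m in attributes}  # step 2
    d = {m: 0.0 for m in attributes}
    for q_k, l_k in zip(posteriors, labels):
        for q, label in zip(q_k, l_k):
            if label in d:
                d[label] += (q[label] - mu[label]) ** 2  # step 3 (assumed squared)
    # step 4: sample standard deviation (the n - 1 divisor is an assumption)
    sigma = {m: math.sqrt(d[m] / (n[m] - 1)) for m in attributes}
    return mu, sigma, n
```

The statistics are taken only over frames whose true label is m, so μ(m) describes how confidently the model scores attribute m on speech that really has that attribute.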

In step S11, the reliability calculation unit 24 of the estimation device obtains the reliability r from the posterior probability sequence q(i, m) output by the posterior probability calculation unit 22 and the speaker attribute L' output by the identification unit 23, using the reliability parameters μ(m), σ(m), and n(m) stored in the reliability parameter storage unit 30. The obtained reliability r is output together with the identification result L'. The reliability r is computed from two distributions: the distribution given by the mean μ', standard deviation σ', and number of frames n' of the posterior probability sequence q(i, L') for the identified attribute m = L', and the distribution given by the reliability parameters μ(m), σ(m), and n(m) obtained in advance.

Specifically, the reliability calculation unit 24 obtains the reliability r as follows.

1. The number of frames n' and the total posterior probability s' are obtained by equation (7).

2. The mean μ' = s'/n' is obtained.

3. The total deviation from the mean d' is obtained by equation (8).

4. The standard deviation σ' is obtained by equation (9).

5. The statistic t is obtained by equation (10).

6. With t > 0, the reliability r of equation (11) is obtained from the statistic t obtained in step 5 under the t-distribution T(x) with n' + n(L') - 2 degrees of freedom.
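The procedure above can be illustrated with the following Python sketch. Equations (7) to (11) are not reproduced in this text, so the sketch makes several assumptions: the statistic t is taken to be the pooled-variance two-sample Student t statistic (which is consistent with the stated n' + n(L') - 2 degrees of freedom), σ' is taken as the population standard deviation, the reliability r is taken to be the one-sided cumulative probability T(t), and r is set to 0 when t is not positive. The helper names are hypothetical, and the t-distribution CDF is approximated numerically so that only the standard library is needed.

```python
import math

def t_cdf(x, dof, steps=2000):
    """Student t CDF at x, by Simpson's rule on the density from 0 to x."""
    log_c = (math.lgamma((dof + 1) / 2) - math.lgamma(dof / 2)
             - 0.5 * math.log(dof * math.pi))

    def pdf(u):
        return math.exp(log_c - (dof + 1) / 2 * math.log1p(u * u / dof))

    h = x / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return 0.5 + total * h / 3

def reliability(q_identified, mu_ref, sigma_ref, n_ref):
    """q_identified: posteriors q(i, L') of the input speech for the
    identified attribute L'; mu_ref, sigma_ref, n_ref: the stored
    reliability parameters mu(L'), sigma(L'), n(L')."""
    n = len(q_identified)                              # step 1
    mu = sum(q_identified) / n                         # step 2
    d = sum((q - mu) ** 2 for q in q_identified)       # step 3 (ASSUMED squared)
    var = d / n                                        # step 4 (sigma' squared)
    dof = n + n_ref - 2
    # ASSUMPTION: pooled-variance two-sample t statistic for equation (10).
    pooled = (n * var + n_ref * sigma_ref ** 2) / dof
    t = (mu - mu_ref) / math.sqrt(pooled * (1 / n + 1 / n_ref))
    if t <= 0:
        return 0.0                                     # ASSUMED handling of t <= 0
    return min(1.0, t_cdf(t, dof))                     # ASSUMED one-sided r = T(t)

if __name__ == "__main__":
    print(reliability([0.9, 0.8, 0.85, 0.95], 0.2, 0.1, 100))
```

Under these assumptions, r close to 1 means the input speech scores the identified attribute far above the reference level stored at learning time.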

The reliability calculation unit 24 can also obtain the reliability r without using the reliability parameters μ(m), σ(m), and n(m), as follows. In this case, the reliability is a value for determining, from the mean and variance of the posterior probability sequence of each attribute, whether the mean of the identified attribute is significantly higher. The learning device then need not include the learning data posterior probability calculation unit 16, the reliability parameter learning unit 17, or the reliability parameter storage unit 30, and the estimation device need not include the reliability parameter storage unit 30.

1. The number of frames n' and the total posterior probability s'(m) of each attribute are obtained by equation (12).

2. The mean μ'(m) = s'(m)/n' of each attribute is obtained.

3. The total deviation from the mean d'(m) of each attribute is obtained by equation (13).

4. The standard deviation σ'(m) of each attribute is obtained by equation (14).

5. The statistic t(m) between the identified speaker attribute L' and each other speaker attribute is obtained by equation (15).

6. With t(m) > 0, the mean of the reliability r of equation (16) is obtained from the statistics t(m) obtained in step 5 under the t-distribution T_m(x) with 2n' - 2 degrees of freedom.
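Steps 1 to 5 of this variant can be sketched in Python as below. Equations (12) to (15) are not reproduced in this text, so the sketch assumes that d'(m) is the sum of squared deviations, that σ'(m) is the population standard deviation, and that t(m) is the pooled-variance two-sample t statistic for two samples of equal size n', which matches the stated 2n' - 2 degrees of freedom. The function name is hypothetical. Step 6, converting each t(m) to a probability under the t-distribution and averaging, is left out because equation (16) is not available.

```python
import math

def pairwise_t_statistics(posteriors, identified):
    """Return {m: t(m)} for every attribute m other than the identified one.

    posteriors[i][m] is the posterior q(i, m) of the input speech for
    frame i and attribute m; identified is the attribute index L'.
    """
    n = len(posteriors)                     # step 1: number of frames n'
    num_attrs = len(posteriors[0])
    mean, var = [], []
    for m in range(num_attrs):
        column = [frame[m] for frame in posteriors]
        mu = sum(column) / n                            # step 2: mean
        d = sum((q - mu) ** 2 for q in column)          # step 3: ASSUMED squared
        mean.append(mu)
        var.append(d / n)                               # step 4: sigma'(m) squared
    stats = {}
    for m in range(num_attrs):
        if m == identified:
            continue
        # ASSUMPTION: with equal sample sizes the pooled-variance statistic
        # reduces to (mu_L - mu_m) / sqrt((var_L + var_m) / (n - 1)).
        denom = math.sqrt((var[identified] + var[m]) / (n - 1))
        stats[m] = (mean[identified] - mean[m]) / denom  # step 5
    return stats

if __name__ == "__main__":
    q = [[0.7, 0.2, 0.1], [0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
    print(pairwise_t_statistics(q, identified=0))
```

A large positive t(m) for every other attribute m indicates that the identified attribute clearly dominates throughout the utterance.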

1 - r represents the probability, in the t-test, that the posterior probability has the mean μ. For example, when 1 - r < 0.05, the mean μ of the posterior probability sequence can be said to be significantly higher, at the 5% significance level, than the mean posterior probability μ(m) obtained in advance for speaker attributes other than attribute m. The value obtained by subtracting this probability of occurrence from 1 can be used as the reliability, that is, as a measure of the certainty of the identification result L'.

Embodiments of the present invention have been described above, but the specific configuration is not limited to these embodiments. Needless to say, design changes and the like made as appropriate without departing from the spirit of the present invention are included in the present invention. The various processes described in the embodiments may be executed not only in time series in the order described, but also in parallel or individually, depending on the processing capability of the device executing the processes or as necessary.

[Program, recording medium]
When the various processing functions of each device described in the above embodiments are realized by a computer, the processing content of the functions that each device should have is described by a program. By executing this program on a computer, the various processing functions of each device are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example, a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program each time a program is transferred to it from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer, but at least part of the processing content may instead be realized in hardware.

10 Learning data storage unit
11 Feature amount extraction unit
12 Attribute label creation unit
13 Deep learning unit
14 Clip sound synthesis unit
15 Voiced/unvoiced determination unit
16 Learning data posterior probability calculation unit
17 Reliability parameter learning unit
20 DNN model storage unit
21 Feature amount extraction unit
22 Posterior probability calculation unit
23 Identification unit
24 Reliability calculation unit
30 Reliability parameter storage unit

Claims

A speaker attribute estimation system including a learning device and an estimation device, wherein
the learning device includes:
an attribute label creation unit that creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech;
a clip sound synthesis unit that synthesizes a clip sound by amplifying the amplitude of the learning speech and rounding any amplitude exceeding a predetermined threshold to that threshold; and
a deep learning unit that learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the clip sound, together with the attribute label sequence, and
the estimation device includes:
a posterior probability calculation unit that calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech; and
an identification unit that identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute.

The speaker attribute estimation system according to claim 1, wherein
the learning device further includes a voiced/unvoiced determination unit that generates voiced/unvoiced information indicating whether each frame of the learning speech is voiced or unvoiced, and
the attribute label creation unit creates the attribute label sequence by setting, based on the voiced/unvoiced information, the value of the attribute information for voiced frames and a value indicating an unvoiced frame for unvoiced frames.

A speaker attribute estimation system including a learning device and an estimation device, wherein
the learning device includes:
an attribute label creation unit that creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech;
a deep learning unit that learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the attribute label sequence;
a learning data posterior probability calculation unit that calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of the learning speech; and
a reliability parameter learning unit that calculates a reliability parameter representing the distribution of the posterior probability sequence of the learning speech, and
the estimation device includes:
a posterior probability calculation unit that calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech;
an identification unit that identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute; and
a reliability calculation unit that calculates a reliability based on the distribution represented by the reliability parameter and the distribution of the posterior probability sequence of the input speech.

A speaker attribute estimation system including a learning device and an estimation device, wherein
the learning device includes:
an attribute label creation unit that creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech; and
a deep learning unit that learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the attribute label sequence, and
the estimation device includes:
a posterior probability calculation unit that calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech;
an identification unit that identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute; and
a reliability calculation unit that calculates a reliability based on the distribution of the posterior probability sequence of the input speech for the identified speaker attribute and the distribution of the posterior probability sequence of the input speech for the other speaker attributes.

A learning device including:
an attribute label creation unit that creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech;
a clip sound synthesis unit that synthesizes a clip sound by amplifying the amplitude of the learning speech and rounding any amplitude exceeding a predetermined threshold to that threshold; and
a deep learning unit that learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the clip sound, together with the attribute label sequence.

A learning device including:
an attribute label creation unit that creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech;
a deep learning unit that learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the attribute label sequence;
a learning data posterior probability calculation unit that calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of the learning speech; and
a reliability parameter learning unit that calculates a reliability parameter representing the distribution of the posterior probability sequence of the learning speech.

An estimation device including:
a posterior probability calculation unit that calculates, using a deep neural network model learned in advance, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech; and
an identification unit that identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute, wherein
the deep neural network model is learned by creating, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech, synthesizing a clip sound by amplifying the amplitude of the learning speech and rounding any amplitude exceeding a predetermined threshold to that threshold, and using the acoustic feature amount sequence of each frame of the learning speech and the clip sound, together with the attribute label sequence.

An estimation device including:
a posterior probability calculation unit that calculates, using a deep neural network model learned in advance, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech;
an identification unit that identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute; and
a reliability calculation unit that calculates a reliability based on the distribution represented by a reliability parameter calculated in advance and the distribution of the posterior probability sequence of the input speech, wherein
the deep neural network model is learned by creating, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech, and using the acoustic feature amount sequence of each frame of the learning speech and the attribute label sequence, and
the reliability parameter represents the distribution of the posterior probability sequence for each frame calculated, using the deep neural network model, from the acoustic feature amount sequence of each frame of the learning speech.

An estimation device including:
a posterior probability calculation unit that calculates, using a deep neural network model learned in advance, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech;
an identification unit that identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute; and
a reliability calculation unit that calculates a reliability based on the distribution of the posterior probability sequence of the input speech for the identified speaker attribute and the distribution of the posterior probability sequence of the input speech for the other speaker attributes, wherein
the deep neural network model is learned by creating, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech, and using the acoustic feature amount sequence of each frame of the learning speech and the attribute label sequence.

A speaker attribute estimation method in which:
an attribute label creation unit creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech;
a clip sound synthesis unit synthesizes a clip sound by amplifying the amplitude of the learning speech and rounding any amplitude exceeding a predetermined threshold to that threshold;
a deep learning unit learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the clip sound, together with the attribute label sequence;
a posterior probability calculation unit calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech; and
an identification unit identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute.

A speaker attribute estimation method in which:
an attribute label creation unit creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech;
a deep learning unit learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the attribute label sequence;
a learning data posterior probability calculation unit calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of the learning speech;
a reliability parameter learning unit calculates a reliability parameter representing the distribution of the posterior probability sequence of the learning speech;
a posterior probability calculation unit calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech;
an identification unit identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute; and
a reliability calculation unit calculates a reliability based on the distribution represented by the reliability parameter and the distribution of the posterior probability sequence of the input speech.

A speaker attribute estimation method in which:
an attribute label creation unit creates, from attribute information representing a speaker attribute of each learning speech, an attribute label sequence representing the speaker attribute of each frame of the learning speech;
a deep learning unit learns a deep neural network model using the acoustic feature amount sequence of each frame of the learning speech and the attribute label sequence;
a posterior probability calculation unit calculates, using the deep neural network model, a posterior probability sequence for each frame from the acoustic feature amount sequence of each frame of input speech;
an identification unit identifies a speaker attribute based on the logarithmic sum of the posterior probability sequence obtained for each speaker attribute; and
a reliability calculation unit calculates a reliability based on the distribution of the posterior probability sequence of the input speech for the identified speaker attribute and the distribution of the posterior probability sequence of the input speech for the other speaker attributes.

A program for causing a computer to function as the learning device according to claim 5 or 6, or the estimation device according to any one of claims 7 to 9.