JP7141641B2

JP7141641B2 - Paralinguistic information estimation device, learning device, method thereof, and program

Info

Publication number: JP7141641B2
Application number: JP2019149021A
Authority: JP
Inventors: 厚志安藤; 歩相名神山; 哲小橋川; 智基戸田
Original assignee: Nippon Telegraph and Telephone Corp; Tokai National Higher Education and Research System NUC
Current assignee: Nippon Telegraph and Telephone Corp; Tokai National Higher Education and Research System NUC
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2022-09-26
Anticipated expiration: 2039-08-15
Also published as: JP2021032920A

Description

Application of Article 30, Paragraph 2 of the Patent Act. (1) Date of posting on the website: February 19, 2019. Website address: website of the 2019 Spring Research Presentation Meeting of the Acoustical Society of Japan, http://www.asj.gr.jp/annualmeeting/index.html. (2) Dates held: March 5 to March 7, 2019 (date of public disclosure: March 6, 2019). Name of meeting: 2019 Spring Research Presentation Meeting of the Acoustical Society of Japan. Venue: The University of Electro-Communications.

The present invention relates to a technique for estimating paralinguistic information from speech.

Techniques for estimating paralinguistic information from uttered speech are known. Paralinguistic information is the peripheral, non-verbal part of the information that a speaker conveys to a listener. Examples include the speaker's emotions (joy, sadness, anger, calmness, etc.), the speaker's attitude (polite, overbearing, etc.), and the speaker's intention (positive, negative, etc.).

For example, Non-Patent Document 1 discloses a paralinguistic information estimation technique based on representation learning. Representation learning refers to a method that automatically extracts the factors that represent data and solves an estimation problem based on the extracted factors. For example, in paralinguistic information estimation based on representation learning, a plurality of factors representing the speech are extracted from the speech (speech data), and paralinguistic information is estimated with these factors as input. A factor representing speech (speech data) is an element that expresses one characteristic of that speech. Examples of factors are the content of the spoken words (phonology), the identity of the speaker (speaker characteristics), the manner of speaking (speaking style), the characteristics of background noise, and the characteristics of reverberation.

Representation learning allows semi-supervised learning, so a highly accurate model can be obtained when labeled data is scarce but unlabeled data is plentiful. In conventional representation learning with semi-supervised learning, a factor extraction model for extracting factors is first trained on a large amount of unlabeled data, and then a reconstruction model that reconstructs the original data from the factors is trained on a small amount of labeled data. Because the conventional factor extraction model is trained on the criterion of whether the data can be reconstructed from the extracted factors, unlabeled data can be used for its training. In addition, because the reconstruction model is trained on the criterion of minimizing the reconstruction error of the data, a highly accurate reconstruction model can be obtained even from a small number of labels. Speech relevant to paralinguistic information is easy to collect, so a very large amount of unlabeled data can be gathered. On the other hand, creating labels for paralinguistic information requires multiple listeners, so often only a small amount of labeled data is available. Representation learning is therefore well suited to paralinguistic information estimation.

S. E. Eskimez, Z. Duan and W. Heinzelman, “Unsupervised Learning Approach to Feature Analysis for Automatic Speech Emotion Recognition,” in Proc. of ICASSP, 2018, pp. 5099 - 5103.

However, the factor sequence extracted by a conventional factor extraction model may contain factors that are unnecessary for estimating paralinguistic information. For example, even when the spoken words are identical, the speaking style may differ depending on whether the speaker is angry or happy. Therefore, when emotion is estimated as paralinguistic information, the speaking-style element is likely to be highly necessary, whereas the phonological element may be of little use. The conventional method, however, estimates paralinguistic information using all factors extracted from the speech, so it is highly likely that factors unnecessary for the estimation are also used. Because information unnecessary for estimation often acts as noise, the conventional method may suffer reduced accuracy in estimating paralinguistic information.

The present invention has been made in view of the above, and its object is to improve the estimation accuracy of paralinguistic information in paralinguistic information estimation based on representation learning.

To solve the above problem, the acoustic features of speech are taken as input, the factors necessary for estimating the paralinguistic information of the speech are extracted from among the factors representing the speech, and the paralinguistic information of the speech is estimated from the extracted factors and output.

In the present invention, paralinguistic information is estimated using the factors necessary for its estimation, so the estimation accuracy of paralinguistic information can be improved.

FIG. 1 is a block diagram illustrating the functional configuration of the learning device of the first embodiment.
FIG. 2 is a block diagram illustrating the functional configuration of the paralinguistic information estimation device of the first and second embodiments.
FIG. 3 is a conceptual diagram for explaining the processing of the factor extraction model learning unit of the first embodiment.
FIG. 4 is a block diagram illustrating the functional configuration of the learning device of the second embodiment.
FIG. 5 is a conceptual diagram for explaining the processing of the factor extraction model learning unit of the second embodiment.

Embodiments of the present invention are described below.
[Principle]
The principle is described first. In the paralinguistic information estimation based on representation learning of each embodiment, the factors necessary for estimating the paralinguistic information of speech are extracted, from among the factors representing that speech, from the acoustic features of the speech, and the paralinguistic information of the speech is estimated from the extracted factors and output. Factors unnecessary for estimating paralinguistic information are thereby removed, and paralinguistic information is estimated preferably using only the factors necessary for its estimation, which improves the estimation accuracy of paralinguistic information.

Here, the factors necessary for estimating the paralinguistic information of speech are extracted with a factor extraction model that takes the acoustic features of the speech as input and extracts and outputs, from among the factors representing the speech, the factors necessary for estimating its paralinguistic information. Paralinguistic information is estimated with a paralinguistic information estimation model that takes one or more factors as input and estimates and outputs the paralinguistic information of the speech. Both the factor extraction model and the paralinguistic information estimation model are obtained by machine learning. The key point is to train a factor extraction model that removes, from among the factors representing speech, elements considered unnecessary for paralinguistic information estimation based on representation learning, such as phonology, speaker characteristics, background noise characteristics, and reverberation characteristics.

However, when a factor sequence representing speech is extracted from acoustic features by a conventional technique such as that of Non-Patent Document 1, the extracted factor sequence is expressed as a single factor vector, and it is not possible to determine which dimension of this factor vector corresponds to which factor (for example, which dimension of the factor vector represents the phonological factor). Consequently, the factors necessary or unnecessary for estimating paralinguistic information cannot simply be selected from a factor sequence extracted by the conventional technique.

In contrast, each embodiment assumes that when a specific factor can no longer be estimated from the factor sequence extracted by the factor extraction model, that factor can be regarded as having been completely removed from the factor sequence. For example, when phonology can no longer be estimated from the factor sequence extracted by the factor extraction model, phonology is regarded as having been removed from that factor sequence.

Under this assumption, each embodiment trains the factor extraction model so that a specific element unnecessary for estimating paralinguistic information (hereinafter, "removal element") becomes difficult to estimate (preferably, impossible to estimate) from the factor sequence extracted by the factor extraction model. That is, the factor extraction model learning unit of the embodiments takes as input the acoustic features of training speech and either the correct label of a removal element, which is a single factor unnecessary for estimating the paralinguistic information of the training speech among the factors representing the training speech, or the correct labels of a plurality of removal elements, which are a plurality of such unnecessary factors. From these inputs it trains (1) a factor extraction model that takes the acoustic features of speech as input and extracts and outputs, from among the factors representing the speech, the factors necessary for estimating the paralinguistic information of the speech, (2) a reconstruction model that takes as input the factors output from the factor extraction model and the correct label of the removal element and reconstructs the acoustic features of the speech, and (3) a removal element estimation model that takes as input the factors output from the factor extraction model and estimates either the removal element (a single unnecessary factor) or the plurality of removal elements (a plurality of unnecessary factors).

In doing so, the factor extraction model learning unit trains the factor extraction model so that the error between the removal element estimated by the removal element estimation model and the correct label becomes large when the acoustic features of the training speech are input to the factor extraction model. Preferably, the factor extraction model learning unit trains the factor extraction model so as to maximize this error. A large error means that the removal element is difficult to estimate from the factors extracted by the factor extraction model, and maximizing the error means that the removal element is as difficult as possible (for example, impossible) to estimate from those factors. A factor extraction model obtained by such training is therefore presumed to extract, from among the factors representing speech, the factors necessary for estimating the paralinguistic information of the speech.

By inputting the acoustic features of speech into the factor extraction model obtained in this way, the factors necessary for estimating the paralinguistic information of the speech are extracted from among the factors representing the speech. By inputting the extracted factors into the paralinguistic information estimation model, the paralinguistic information of the speech is estimated.
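The two-stage flow described above (acoustic features, then factor extraction, then estimation) can be sketched as a small composition of callables. This is only an illustration: the function names, the toy extractor that keeps the first dimension of each frame, and the toy threshold estimator are hypothetical stand-ins, not the trained models of the embodiments.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def estimate_paralinguistic_info(
    acoustic_features: Sequence[Vector],
    factor_extractor: Callable[[Sequence[Vector]], List[Vector]],
    estimator: Callable[[Sequence[Vector]], str],
) -> str:
    """Extract factors from acoustic features, then estimate a label from the factors only."""
    factors = factor_extractor(acoustic_features)
    return estimator(factors)


def toy_factor_extractor(features: Sequence[Vector]) -> List[Vector]:
    # Hypothetical stand-in for the factor extraction model:
    # keep the first dimension of each frame and discard the rest.
    return [[frame[0]] for frame in features]


def toy_estimator(factors: Sequence[Vector]) -> str:
    # Hypothetical stand-in for the paralinguistic information estimation model:
    # threshold the mean of the retained factor.
    mean = sum(f[0] for f in factors) / len(factors)
    return "positive" if mean >= 0.0 else "negative"
```

For example, `estimate_paralinguistic_info([[1.0, 5.0], [0.5, -9.0]], toy_factor_extractor, toy_estimator)` never exposes the second dimension to the estimator, which mirrors the idea that discarded factors cannot act as noise in the estimation step.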

[First Embodiment]
Next, the first embodiment is described. This embodiment illustrates the case where there is a single factor (removal element) unnecessary for estimating paralinguistic information among the factors representing speech.
<Configuration>
As illustrated in FIG. 1, the learning device 11 of this embodiment has acoustic feature extraction units 111-1 and 111-2, a factor extraction model learning unit 112, a factor extraction model storage unit 113, a factor extraction unit 114, a paralinguistic information estimation model learning unit 115, and a paralinguistic information estimation model storage unit 116. As illustrated in FIG. 2, the paralinguistic information estimation device 12 of this embodiment has an acoustic feature extraction unit 121, a factor extraction model storage unit 123, a factor extraction unit 124, a paralinguistic information estimation unit 125, and a paralinguistic information estimation model storage unit 126.

<Learning process>
Next, the learning process of this embodiment is described with reference to FIGS. 1 and 3.
≪Training data≫
As a prerequisite for the learning process, training speech V_t1, the correct label LA_ta of the removal element, training speech V_t2, and the correct label LA_tp of paralinguistic information are prepared. Training speech V_t1 is time-series data of uttered speech and is unlabeled data for training the factor extraction model M_f described above. The correct label LA_ta of the removal element is the correct label of the removal element, which is the single factor, among the factors FA_t1 representing training speech V_t1, that is unnecessary for estimating the paralinguistic information of V_t1. Examples of removal elements are phonology, speaker characteristics, background noise characteristics, and reverberation characteristics. When the removal element is phonology, the correct label LA_ta is, for example, information representing the phoneme sequence corresponding to each utterance of V_t1. When the removal element is speaker characteristics, LA_ta is, for example, a speaker ID or index corresponding to each utterance of V_t1, or a continuous representation of speaker information estimated with a pre-trained model (for example, an i-vector (Reference 1)). When the removal element is background noise characteristics, LA_ta is, for example, a background noise ID or index representing the correct type of background noise in V_t1 (for example, 0 for no noise, 1 for in-vehicle noise, 2 for crowd noise, and 3 otherwise). When the removal element is reverberation characteristics, LA_ta is, for example, the correct value of the reverberation time (RT60) (for example, 0 for short reverberation with a reverberation time under 500 ms, 1 for typical reverberation of 500 ms or more and under 1000 ms, and 2 for long reverberation of 1000 ms or more). Training speech V_t2 is time-series data of uttered speech, and the correct label LA_tp of paralinguistic information is the correct label representing the paralinguistic information of each utterance of V_t2. Training speech V_t2 may be identical to training speech V_t1 or may differ from it.
Reference 1: N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, “Front-End Factor Analysis for Speaker Verification,” in IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788 - 798, 2011.
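The example label encodings given for the reverberation and background noise removal elements can be written as two small functions. The thresholds and IDs follow the examples in the text; the function names and the noise-type strings ("none", "car", "crowd") are hypothetical names chosen for this sketch.

```python
def reverberation_label(rt60_ms: float) -> int:
    """Map a reverberation time (RT60, in milliseconds) to the example class index."""
    if rt60_ms < 0:
        raise ValueError("reverberation time must be non-negative")
    if rt60_ms < 500:
        return 0  # short reverberation: under 500 ms
    if rt60_ms < 1000:
        return 1  # typical reverberation: 500 ms or more and under 1000 ms
    return 2      # long reverberation: 1000 ms or more


# Hypothetical string keys; the IDs follow the example in the text.
BACKGROUND_NOISE_IDS = {"none": 0, "car": 1, "crowd": 2}


def background_noise_label(noise_type: str) -> int:
    """Map a background noise type to its example ID; any other type maps to 3."""
    return BACKGROUND_NOISE_IDS.get(noise_type, 3)
```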

≪Processing of acoustic feature extraction unit 111-1≫
The acoustic feature extraction unit 111-1 takes training speech V_t1 as input and extracts and outputs an acoustic feature vector sequence F_t1, which is a sequence (for example, a time series) of acoustic features of V_t1 (step S111-1). The elements of F_t1 are not limited. For example, F_t1 is a sequence of vectors whose elements include one or more of the following acoustic features obtained by analyzing the speech at short intervals: MFCCs (Mel-frequency cepstral coefficients), log mel filterbank outputs, fundamental frequency, and short-time power. For example, F_t1 has the same sequence length as the correct label LA_ta of the removal element.
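As a rough illustration of short-interval analysis, the sketch below frames a waveform and computes one of the listed features, short-time power (here in the log domain), giving one feature vector per frame. The frame length of 400 samples, the hop of 160 samples (25 ms and 10 ms at an assumed 16 kHz sampling rate), the log compression, and the floor value are all assumptions of this sketch; the text does not specify them.

```python
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_length: int, hop_length: int) -> List[Sequence[float]]:
    """Split a waveform into overlapping frames; a trailing partial frame is dropped."""
    return [
        samples[start:start + frame_length]
        for start in range(0, len(samples) - frame_length + 1, hop_length)
    ]


def short_time_log_power(
    samples: Sequence[float],
    frame_length: int = 400,   # assumed: 25 ms at 16 kHz
    hop_length: int = 160,     # assumed: 10 ms at 16 kHz
    floor: float = 1e-10,      # assumed floor to avoid log(0) on silent frames
) -> List[List[float]]:
    """Return a sequence of one-dimensional feature vectors, one per frame."""
    features = []
    for frame in frame_signal(samples, frame_length, hop_length):
        power = sum(x * x for x in frame) / frame_length
        features.append([math.log(max(power, floor))])
    return features
```

A fuller front end would append MFCCs, log mel filterbank outputs, or fundamental frequency to each per-frame vector in the same way.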

≪Processing of factor extraction model learning unit 112≫
This embodiment illustrates processing that uses a framework based on deep learning; however, this does not limit the present invention. The factor extraction model learning unit 112 trains a factor extraction model M_f that outputs the factors (a low-dimensional vector) remaining after one specific removal element considered unnecessary for estimating the paralinguistic information of the input speech (for example, any one of phonology, speaker characteristics, background noise characteristics, or reverberation characteristics) has been removed from all the factors representing that speech (step S112).

As illustrated in FIG. 3, the factor extraction model learning unit 112 takes as input the acoustic feature vector sequence F_t1 (acoustic features) of training speech V_t1 and the correct label LA_ta of the removal element, which is a single factor unnecessary for estimating the paralinguistic information of V_t1 among the factors representing V_t1. It simultaneously trains (1) a factor extraction model M_f that takes an acoustic feature vector sequence (acoustic features) of speech as input and extracts and outputs, from among the factors representing the speech, the factors FA_t1 necessary for estimating the paralinguistic information of the speech, (2) a reconstruction model M_r that takes as input the factors FA_t1 output from M_f and the correct label of the removal element and reconstructs the acoustic feature vector sequence (acoustic features) of the speech, and (3) a removal element estimation model M_a that takes as input the factors FA_t1 output from M_f and estimates the removal element LA_a, which is a single factor unnecessary for estimating the paralinguistic information of the speech among the factors representing the speech.

For example, the factor extraction model learning unit 112 trains M_f, M_r, and M_a using an overall loss function value L defined as the weighted sum of two terms. The first term is the loss function value L_r of the reconstruction model M_r, which corresponds to the error between F_t1 and the acoustic feature vector sequence (acoustic features) reconstructed by M_r when F_t1 is input to M_f and the resulting factors FA_t1 are input to M_r together with the correct label LA_ta of the removal element. The second term is the loss function value L_a of the removal element estimation model M_a, which corresponds to the error between the correct label LA_ta and the removal element LA_a estimated by M_a when F_t1 is input to M_f and the resulting factors FA_t1 are input to M_a.

These models are trained by, for example, error backpropagation. That is, the factor extraction model learning unit 112 forward-propagates the acoustic feature vector sequence F_t1 through the two models M_f and M_r, forward-propagates the correct label LA_ta of the removal element into M_r, and takes the value corresponding to the error between the resulting reconstructed acoustic feature vector sequence and F_t1 as the reconstruction loss function value L_r. It also forward-propagates F_t1 through the two models M_f and M_a, supplies the correct label LA_ta of the removal element to M_a, and takes the value corresponding to the error between the resulting removal element LA_a and the correct label LA_ta as the removal element loss function value L_a. The three models are then trained with the weighted sum of the two loss function values L_r and L_a as the overall loss function value L. The correct label LA_ta of the removal element is forward-propagated into the reconstruction model M_r because, ideally, the factors FA_t1 output from M_f contain no removal element, so the acoustic feature vector sequence cannot be reconstructed from FA_t1 alone.

An example of the error between the acoustic feature vector sequence reconstructed by the reconstruction model M_r and the acoustic feature vector sequence F_t1 is their squared error; accordingly, an example of the loss function value L_r is the squared error between the sequence reconstructed by M_r and F_t1. An example of the error between the removal element LA_a estimated by the removal element estimation model M_a and the correct label LA_ta of the removal element is their cross entropy; accordingly, an example of the loss function value L_a is the cross entropy between LA_a and LA_ta. An example of the overall loss function value L is expressed as follows.
L = (1 − α) L_r + α L_a    (1)
where α is a loss weight, a constant satisfying 0 ≤ α ≤ 1.
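Equation (1) and the two example error measures can be sketched in a few lines. The sketch assumes that the squared error is summed over all frames and dimensions and that the cross entropy is summed over frames from predicted class probabilities; whether the errors are summed or averaged is not stated in the text, and the function names and the probability floor are choices made for this illustration.

```python
import math
from typing import Sequence


def squared_error(reconstructed: Sequence[Sequence[float]], target: Sequence[Sequence[float]]) -> float:
    """Example L_r: squared error between reconstructed and original feature sequences (summed)."""
    return sum(
        (r - t) ** 2
        for r_frame, t_frame in zip(reconstructed, target)
        for r, t in zip(r_frame, t_frame)
    )


def cross_entropy(predicted_probs: Sequence[Sequence[float]], label_ids: Sequence[int]) -> float:
    """Example L_a: cross entropy between predicted class probabilities and correct labels (summed)."""
    eps = 1e-12  # assumed floor to avoid log(0)
    return -sum(math.log(max(p[y], eps)) for p, y in zip(predicted_probs, label_ids))


def total_loss(l_r: float, l_a: float, alpha: float) -> float:
    """Overall loss L = (1 - alpha) * L_r + alpha * L_a, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    return (1.0 - alpha) * l_r + alpha * l_a
```

With α = 0 only reconstruction is optimized, and with α = 1 only the removal element term is; intermediate values trade the two off.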

ただし、因子抽出モデル学習部１１２は、学習用音声Ｖ_ｔ１の音響特徴ベクトル系列Ｆ_ｔ１（音響特徴量）が因子抽出モデルＭ_ｆに入力された際に除去要素推定モデルＭ_ａで推定される除去要素ＬＡ_ａと正解ラベルＬＡ_ｔａとの間の誤差（例えば、除去要素推定モデルＭ_ａで推定される除去要素ＬＡ_ａと正解ラベルＬＡ_ｔａとの交差エントロピー）が大きくなるように因子抽出モデルＭ_ｆを学習する。例えば、誤差逆伝搬法で学習が行われる場合、因子抽出モデルＭ_ｆと除去要素推定モデルＭ_ａとの間に勾配反転層（Gradient Reversal Layer: GRL）が配置される。勾配反転層は、順伝搬の際には恒等変換をおこなう。すなわち、勾配反転層は、順伝搬された入力値をそのまま出力値として出力する。つまり、勾配反転層は、順伝搬時には何も行わない。しかし、勾配反転層は、誤差逆伝搬法での逆伝搬の際に入力値の勾配を反転させる。すなわち、勾配反転層は、入力された誤差に対応する偏微分値に負定数を乗じた値を出力する。つまり、誤差逆伝搬法での除去要素推定モデルＭ_ａから因子抽出モデルＭ_ｆへの逆伝搬の際に、除去要素推定モデルＭ_ａから出力される除去要素ＬＡ_ａと正解ラベルＬＡ_ｔａとの間の誤差に対応する偏微分値（誤差の更新対象の重みやバイアスでの偏微分値）が除去要素推定モデルＭ_ａから勾配反転層に逆伝搬され、この勾配反転層がこの偏微分値に負定数を乗じた値を因子抽出モデルＭ_ｆへ逆伝搬する。勾配反転層の詳細は、例えば、参考文献２に記載されている。この勾配反転層の働きにより、因子抽出モデルＭ_ｆで抽出された因子から除去要素の推定が困難になるように学習される（好ましくは、当該因子から除去要素の推定が出来なくなるように学習される）ため、３つのモデルの学習が進むことで、再構成モデルＭ_ｒによる音響特徴ベクトル系列の再構成は正しくできるが、除去要素推定モデルＭ_ａによる除去要素の推定が困難となるような因子を抽出する因子抽出モデルＭ_ｆを得ることができる。すなわち、因子抽出モデル学習部１１２は、パラ言語情報推定に不要となる除去要素を取り除いた因子を抽出できる因子抽出モデルＭ_ｆを学習できる。
However, the factor extraction model learning unit 112 trains the factor extraction model M_f so that, when the acoustic feature vector sequence F_t1 (acoustic features) of the training speech V_t1 is input to M_f, the error between the removed element LA_a estimated by the removed element estimation model M_a and the correct label LA_ta (for example, the cross entropy between LA_a and LA_ta) becomes large. For example, when training uses error backpropagation, a gradient reversal layer (GRL) is placed between the factor extraction model M_f and the removed element estimation model M_a. During forward propagation the gradient reversal layer performs the identity transformation: it outputs the forward-propagated input value unchanged and otherwise does nothing. During backpropagation, however, it reverses the gradient: it outputs the partial derivative corresponding to the incoming error multiplied by a negative constant.
That is, during backpropagation from M_a to M_f, the partial derivative corresponding to the error between the removed element LA_a output by M_a and the correct label LA_ta (the partial derivative of the error with respect to the weight or bias being updated) is propagated back from M_a to the gradient reversal layer, and the gradient reversal layer propagates this partial derivative, multiplied by a negative constant, back to M_f. Details of gradient reversal layers are described, for example, in Reference 2. Through this layer, training proceeds so that the removed element becomes difficult to estimate from the factors extracted by M_f (preferably, so that it cannot be estimated from them at all). Therefore, as training of the three models proceeds, a factor extraction model M_f is obtained that extracts factors from which the reconstruction model M_r can correctly reconstruct the acoustic feature vector sequence but from which the removed element estimation model M_a has difficulty estimating the removed element. In other words, the factor extraction model learning unit 112 can learn a factor extraction model M_f that extracts factors from which the removed elements unnecessary for paralinguistic information estimation have been removed.
Reference 2: Yaroslav Ganin, Victor Lempitsky, “Unsupervised Domain Adaptation by Backpropagation,” Skolkovo Institute of Science and Technology (Skoltech), Moscow Region, Russia
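The behavior of the gradient reversal layer described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation of the embodiment: the class name, the `scale` parameter (the magnitude of the negative constant), and the numeric values are assumptions introduced here.

```python
class GradientReversalLayer:
    """Identity on the forward pass; multiplies gradients by -scale on the backward pass."""

    def __init__(self, scale=1.0):
        # scale is a hypothetical parameter: the magnitude of the negative constant.
        if scale <= 0:
            raise ValueError("scale must be positive so that the gradient sign is reversed")
        self.scale = scale

    def forward(self, values):
        # Forward propagation: output the input values unchanged.
        return list(values)

    def backward(self, upstream_gradient):
        # Backpropagation: return the incoming gradient multiplied by a negative constant.
        return [-self.scale * g for g in upstream_gradient]


grl = GradientReversalLayer(scale=0.5)
factors = [0.2, -1.0, 3.0]          # stand-in for factors output by M_f
grad_from_m_a = [0.4, -0.2, 1.0]    # stand-in for the gradient arriving from M_a
forwarded = grl.forward(factors)                # what M_a receives
grad_into_m_f = grl.backward(grad_from_m_a)     # what M_f receives
```

Because M_f receives the sign-reversed gradient, an update step that lowers the error of M_a for the parameters of M_a raises that same error with respect to the parameters of M_f.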

Training by the factor extraction model learning unit 112 is repeated until a predetermined end condition is satisfied. For example, when the number of epochs (the number of times all of the acoustic feature vector sequences F_t1 and correct labels LA_ta have been used for training) reaches a fixed value (for example, 100), the factor extraction model learning unit 112 regards training as complete and outputs the factor extraction model M_f at that point. The factor extraction model M_f is stored in the factor extraction model storage unit 113 (step S113). The reconstruction model M_r and the removed element estimation model M_a are not used in subsequent processing.

≪Processing of Acoustic Feature Extraction Unit 111-2≫
The acoustic feature extraction unit 111-2 receives the training speech V_t2 as input, and extracts and outputs the acoustic feature vector sequence F_t2, which is a sequence (for example, a time series) of acoustic features of V_t2 (step S111-2). The type of operation by which the acoustic feature extraction unit 111-2 extracts F_t2 from V_t2 is the same as the type of operation by which the acoustic feature extraction unit 111-1 extracts the acoustic feature vector sequence F_t1 from the training speech V_t1. In other words, the acoustic features that make up F_t2 are of the same type as those that make up F_t1. For example, F_t2 has the same sequence length as the correct label LA_ta of the removed element. F_t2 and F_t1 may or may not have the same length.

≪Processing of Factor Extraction Unit 114≫
The factor extraction unit 114 receives the acoustic feature vector sequence F_t2 output from the acoustic feature extraction unit 111-2 and the factor extraction model M_f read from the factor extraction model storage unit 113. It inputs F_t2 to M_f, and extracts and outputs the factor sequence FA_t2 that, among the factors representing the training speech V_t2, is needed to estimate the paralinguistic information of V_t2 (step S114). The factor sequence is a sequence of multiple factors, for example a time series of factor vectors whose elements are the factors at each time point. For example, FA_t2 has the same length as the acoustic feature vector sequence F_t2 and the correct label LA_tp of the paralinguistic information.

≪Processing of Paralinguistic Information Estimation Model Learning Unit 115≫
The paralinguistic information estimation model learning unit 115 receives the correct label LA_tp of the paralinguistic information and the factor sequence FA_t2 output from the factor extraction unit 114. It trains the paralinguistic information estimation model M_p using pairs of the factor sequence FA_t2 for each utterance and the corresponding correct label LA_tp (step S115). M_p is a model that can handle multi-class classification problems. One example of M_p is a model based on deep learning. This does not limit the invention, however; M_p may be another multi-class classification model, such as multi-class logistic regression. When M_p is based on deep learning, for example, the factor sequence FA_t2 is input to a model based on a recurrent neural network, and the model is trained by error backpropagation using the cross entropy between the final output of the model and the correct label LA_tp as the loss function. This, too, does not limit the invention. The paralinguistic information estimation model learning unit 115 outputs the paralinguistic information estimation model M_p obtained by training and stores it in the paralinguistic information estimation model storage unit 116 (step S116).
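The cross-entropy loss mentioned above can be illustrated with a short sketch. It assumes that the final output of the model is a vector of per-class scores turned into probabilities by a softmax and that the correct label LA_tp is a single class index; the scores and the class count are hypothetical values, and the recurrent network itself is not modeled here.

```python
import math


def softmax(scores):
    # Convert raw class scores into probabilities (shifted by the maximum for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, correct_class):
    # Cross entropy against a one-hot correct label reduces to -log p(correct class).
    return -math.log(probabilities[correct_class])


final_scores = [2.0, 0.5, -1.0, 0.0]    # hypothetical final output for four classes
probs = softmax(final_scores)
loss_if_label_0 = cross_entropy(probs, 0)   # label agrees with the highest score
loss_if_label_2 = cross_entropy(probs, 2)   # label disagrees with the highest score
```

The loss is small when the model assigns high probability to the correct class and large otherwise, which is what backpropagation then reduces.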

＜Paralinguistic information estimation processing＞
The paralinguistic information estimation processing of this embodiment will be described with reference to FIG. 2.
As preprocessing, the factor extraction model M_f stored in the factor extraction model storage unit 113 of the learning device 11 is stored in the factor extraction model storage unit 123 of the paralinguistic information estimation device 12, and the paralinguistic information estimation model M_p stored in the paralinguistic information estimation model storage unit 116 is stored in the paralinguistic information estimation model storage unit 126.

≪Processing of Acoustic Feature Extraction Unit 121≫
The acoustic feature extraction unit 121 receives as input the speech V_in whose paralinguistic information is to be estimated, and extracts and outputs the acoustic feature vector sequence F_in, which is a sequence (for example, a time series) of acoustic features of V_in (step S121). The speech V_in is time-series data of uttered speech. The type of operation by which the acoustic feature extraction unit 121 extracts F_in from V_in is the same as the type of operation by which the acoustic feature extraction unit 111-2 extracts the acoustic feature vector sequence F_t2 from the training speech V_t2. That is, the acoustic features that make up F_in are of the same type as those that make up F_t2.

≪Processing of Factor Extraction Unit 124≫
The factor extraction unit 124 receives the acoustic feature vector sequence F_in output from the acoustic feature extraction unit 121 and the factor extraction model M_f read from the factor extraction model storage unit 123. It inputs F_in to M_f, and extracts and outputs the factor sequence FA_in that, among the factors representing the speech V_in, is needed to estimate the paralinguistic information of V_in (step S124).

≪Processing of Paralinguistic Information Estimation Unit 125≫
The paralinguistic information estimation unit 125 receives the factor sequence FA_in output from the factor extraction unit 124 and the paralinguistic information estimation model M_p read from the paralinguistic information estimation model storage unit 126. It inputs FA_in to M_p, estimates the paralinguistic information of the speech V_in, and outputs it as the paralinguistic information estimation result P (step S125). Because M_p in this embodiment is a model based on deep learning, the paralinguistic information estimation unit 125 obtains the estimation result P by inputting FA_in to M_p and forward-propagating it. When the output of M_p is obtained as a probability for each paralinguistic information class, the unit outputs, for example, the class for which the output of M_p is largest as the estimation result P. This does not limit the invention, however; for example, the paralinguistic information estimation unit 125 may instead output the per-class probabilities produced by M_p.
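The two output options of the paralinguistic information estimation unit 125 can be sketched as follows. The class names are taken from the emotion examples given earlier in the description; the probability values and function names are hypothetical, and the model M_p itself is replaced by a fixed output vector.

```python
def most_probable_class(class_probabilities, class_names):
    # Option 1: output the class whose probability is largest as the result P.
    best_index = max(range(len(class_probabilities)), key=class_probabilities.__getitem__)
    return class_names[best_index]


def labelled_probabilities(class_probabilities, class_names):
    # Option 2: output the probability of every class instead of a single decision.
    return dict(zip(class_names, class_probabilities))


class_names = ["joy", "sadness", "anger", "calm"]
m_p_output = [0.10, 0.15, 0.60, 0.15]   # stand-in for the output of M_p for one utterance
result_p = most_probable_class(m_p_output, class_names)
result_probabilities = labelled_probabilities(m_p_output, class_names)
```

Returning the full probability table leaves the decision rule, such as a threshold or a reject option, to whatever consumes the result.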

[Second embodiment]
Next, a second embodiment will be described. This embodiment illustrates the case where multiple factors (removed elements) among the factors representing speech are unnecessary for estimating paralinguistic information. The following description focuses on differences from what has been described so far; items already described are given the same reference numbers and their description is simplified.

＜Configuration＞
As illustrated in FIG. 4, the learning device 21 of this embodiment includes acoustic feature extraction units 111-1 and 111-2, a factor extraction model learning unit 212, a factor extraction model storage unit 113, a factor extraction unit 114, a paralinguistic information estimation model learning unit 115, and a paralinguistic information estimation model storage unit 116. The paralinguistic information estimation device 12 of this embodiment is the same as that of the first embodiment.

＜Learning processing＞
Next, the learning processing of this embodiment will be described with reference to FIGS. 4 and 5.
≪Learning data≫
In this embodiment, as a premise of the learning processing, the training speech V_t1, the correct labels LA_{ta,1}, ..., LA_{ta,N} of multiple removed elements, the training speech V_t2, and the correct label LA_tp of the paralinguistic information are prepared. Here N is an integer of 2 or more that represents the number of removed elements, and n = 1, ..., N. Each correct label LA_{ta,n} is the correct label of a removed element, that is, of a single factor, among the factors FA_t1 representing the training speech V_t1, that is unnecessary for estimating the paralinguistic information of V_t1. Specific examples of removed elements are as given in the first embodiment. In this embodiment, for example, two or more of phonological characteristics, speaker characteristics, background noise characteristics, and reverberation characteristics are used as removed elements. LA_{ta,1}, ..., LA_{ta,N} include at least correct labels of mutually different removed elements. For example, LA_{ta,1}, ..., LA_{ta,N} all differ from one another.

≪Processing of Factor Extraction Model Learning Unit 212≫
The processing of the acoustic feature extraction unit 111-1, the acoustic feature extraction unit 111-2, the factor extraction unit 114, and the paralinguistic information estimation model learning unit 115 is as described in the first embodiment. Only the processing of the factor extraction model learning unit 212, which differs from the first embodiment, is described below.

This embodiment also illustrates processing that uses a framework based on deep learning. This does not limit the invention, however. The factor extraction model learning unit 212 trains a factor extraction model M_f that outputs the factors remaining after removing, from all the factors representing the input speech, multiple specific removed elements considered unnecessary for estimating the paralinguistic information of that speech (for example, any of phonological characteristics, speaker characteristics, background noise characteristics, or reverberation characteristics) (step S212).

As illustrated in FIG. 5, the factor extraction model learning unit 212 receives as input the acoustic feature vector sequence F_t1 (acoustic features) of the training speech V_t1 and the correct labels LA_{ta,1}, ..., LA_{ta,N} of the multiple removed elements, which are the multiple factors, among the factors representing V_t1, that are unnecessary for estimating the paralinguistic information of V_t1. It trains (1) the factor extraction model M_f, (2) the reconstruction model M_r, and (3) multiple removed element estimation models M_{a,1}, ..., M_{a,N}. Each removed element estimation model M_{a,n} (n = 1, ..., N) receives the factor FA_t1 output from M_f and estimates the removed element LA_{a,n}, a single factor among the factors representing the speech that is unnecessary for estimating its paralinguistic information. For example, the factor extraction model learning unit 212 trains M_f, M_r, and M_{a,1}, ..., M_{a,N} using as the overall loss function value L a weighted sum of (a) the loss function value L_r of the reconstruction model M_r and (b) the loss function value L_{a,n} of each removed element estimation model M_{a,n}. Here L_r corresponds to the error between F_t1 and the acoustic feature vector sequence (acoustic features) that M_r reconstructs when it receives the correct labels LA_{ta,1}, ..., LA_{ta,N} together with the factor FA_t1 that M_f outputs for the input F_t1. L_{a,n} corresponds to the error between the correct label LA_{ta,n} and the removed element LA_{a,n} that M_{a,n} estimates when it receives the factor FA_t1 that M_f outputs for the input F_t1.

As in the first embodiment, these models are trained by, for example, error backpropagation. That is, the factor extraction model learning unit 212 forward-propagates the acoustic feature vector sequence F_t1 through the factor extraction model M_f and the reconstruction model M_r, forward-propagates the correct labels LA_{ta,1}, ..., LA_{ta,N} of the removed elements to M_r, and takes a value corresponding to the error between the resulting reconstructed acoustic feature vector sequence and F_t1 as the reconstruction loss function value L_r. It also forward-propagates F_t1 through M_f and each removed element estimation model M_{a,n}, and takes a value corresponding to the error between each resulting removed element LA_{a,n} and its correct label LA_{ta,n} as the loss function value L_{a,n} of that removed element. It then trains M_f, M_r, and M_{a,1}, ..., M_{a,N} using the weighted sum of the loss function values L_r, L_{a,1}, ..., L_{a,N} as the overall loss function value L.

An example of the error between each removed element LA_{a,n} estimated by each removed element estimation model M_{a,n} and the correct label LA_{ta,n} of that removed element is their cross entropy. An example of each loss function value L_{a,n} is accordingly the cross entropy between LA_{a,n} and LA_{ta,n}. An example of the overall loss function value L combines these loss function values using the following weights.

α is the loss weight, a constant satisfying 0 ≤ α ≤ 1. β_n is the removed element weight of the n-th removed element LA_{a,n}, a constant satisfying 0 ≤ β_n ≤ 1 together with a further condition involving a positive constant c (for example, c = 1).

That is, the factor extraction model learning unit 212 trains the factor extraction model M_f, the reconstruction model M_r, and the removed element estimation models M_{a,1}, ..., M_{a,N} that estimate the multiple removed elements LA_{a,1}, ..., LA_{a,N}, using as the overall loss function value L a weighted sum of, for example, the loss function value L_r of M_r and the products β_n L_{a,n} of the removed element weight β_n for each removed element LA_{a,n} and the loss function value L_{a,n} of the corresponding removed element estimation model.
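The weighted sum described above can be sketched as follows. The text states only that L is a weighted sum of L_r and the products β_n L_{a,n}, with a loss weight α between 0 and 1; the specific form L = α·L_r + (1 − α)·Σ_n β_n·L_{a,n} used here is an assumption, as are the function name and the numeric values. The condition involving the constant c is not modeled.

```python
def overall_loss(l_r, l_a, beta, alpha):
    """Assumed form: L = alpha * L_r + (1 - alpha) * sum_n(beta_n * L_{a,n})."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha <= 1")
    if len(l_a) != len(beta):
        raise ValueError("one weight beta_n is needed per removed element")
    if any(not 0.0 <= b <= 1.0 for b in beta):
        raise ValueError("each beta_n must satisfy 0 <= beta_n <= 1")
    removed_term = sum(b * loss for b, loss in zip(beta, l_a))
    return alpha * l_r + (1.0 - alpha) * removed_term


# Hypothetical values for N = 2 removed elements.
l_total = overall_loss(l_r=2.0, l_a=[1.0, 4.0], beta=[0.5, 0.25], alpha=0.5)
```

No sign change appears in this sum. The adversarial effect on M_f comes from the gradient reversal layers, so the removed element estimation models still minimize their own loss terms.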

All removed element weights β_1, ..., β_N may be set to the same value (for example, β_1 = ... = β_N = 1), but by adjusting the magnitude of each β_n it is also possible to train a factor extraction model M_f that removes particular removed elements more strongly. That is, the factor extraction model M_f obtained in this embodiment extracts a factor sequence in which a removed element is removed more strongly the larger the removed element weight β_n applied to its loss function value L_{a,n}.

The removed element weight β_n may also be made larger for removed elements that are harder to estimate from the acoustic feature vector sequence (acoustic features) of the speech. That is, the multiple removed elements may include a first removed element and a second removed element that is easier to estimate from the acoustic feature vector sequence than the first, and the removed element weight for the first removed element may be larger than that for the second. This allows removed elements that are hard to estimate from the acoustic feature vector sequence, and therefore hard to remove, to be removed sufficiently. Criteria for how easy or hard a removed element is to estimate are illustrated below.
Criterion 1: A removed element with fewer classes is easier to estimate, and one with more classes is harder to estimate. That is, if the first removed element is one of CL1 classes, the second removed element is one of CL2 classes, and CL1 > CL2, then estimating the second removed element is easier than estimating the first.
Criterion 2: Estimating the speaker-characteristics removed element is easier than estimating the phonological removed element.
Criterion 3: The ease or difficulty of estimating a removed element may be determined experimentally. For example, for each n = 1, ..., N, the processing of the acoustic feature extraction unit 111-1 and the factor extraction model learning unit 112 of the learning device 11 of the first embodiment is performed with LA_ta = LA_{ta,n}, using the training speech V_t1, the correct label LA_{ta,n} of the removed element, the training speech V_t2, and the correct label LA_tp of the paralinguistic information, and the loss function L of equation (1) is obtained. The smaller the loss function L, the more appropriately the removed element is judged to have been estimated, and hence the easier it is judged to be to estimate. That is, if the loss function for the first removed element is L_n1 (n1 ∈ {1, ..., N}), the loss function for the second removed element is L_n2 (n2 ∈ {1, ..., N}, n1 ≠ n2), and L_n1 > L_n2, then estimating the second removed element is easier than estimating the first.
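As one illustration of Criterion 1, removed element weights can be derived from class counts so that an element with more classes receives a larger weight. The text requires only this ordering and the range 0 ≤ β_n ≤ 1; the proportional mapping, the function name, and the class counts below are assumptions made for this sketch.

```python
def weights_from_class_counts(class_counts):
    # More classes -> harder to estimate -> larger weight, scaled so the largest weight is 1.
    largest = max(class_counts.values())
    return {name: count / largest for name, count in class_counts.items()}


# Hypothetical class counts: 40 phoneme classes and 10 speaker classes.
class_counts = {"phoneme": 40, "speaker": 10}
beta = weights_from_class_counts(class_counts)
```

With these counts the phonological element receives the larger weight, which also agrees with Criterion 2.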

As with the factor extraction model learning unit 112 of the first embodiment, the factor extraction model learning unit 212 trains the factor extraction model M_f so that, when the acoustic feature vector sequence F_t1 (acoustic features) of the training speech V_t1 is input to M_f, the error between the removed element LA_{a,n} estimated by each removed element estimation model M_{a,n} and the correct label LA_{ta,n} becomes large. That is, M_f is trained so that each M_{a,n} has difficulty estimating its removed element from the factor sequence extracted by M_f. For example, when training uses error backpropagation, a gradient reversal layer GRL_n is placed between M_f and each removed element estimation model M_{a,n}, as described in the first embodiment. Each gradient reversal layer GRL_n is the same as the one described in the first embodiment.

The factor extraction model learning unit 212 may train the N removed element estimation models M_{a,1}, ..., M_{a,N} all at once, together with the factor extraction model M_f and the reconstruction model M_r, or it may train M_{a,1}, ..., M_{a,N} in stages. Because training that removes multiple removed elements at once may be difficult, the removed element estimation models may instead be trained so that the removed elements are removed step by step. This can yield a factor extraction model M_f that removes each removed element well, and may improve the accuracy of paralinguistic information estimation.
In this case, for example, let M be an integer of 2 or more, let m = 1, ..., M, let Subset_m ⊆ {M_{a,1}, ..., M_{a,N}} be a subset of the set whose elements are the N removed element estimation models, and let Subset_1 ⊂ Subset_2 ⊂ ... ⊂ Subset_M = {M_{a,1}, ..., M_{a,N}}. The factor extraction model learning unit 212 first trains the removed element estimation models belonging to Subset_1 together with M_f and M_r, next trains those belonging to Subset_2 together with M_f and M_r, and so on in stages, and finally trains the removed element estimation models M_{a,1}, ..., M_{a,N} belonging to Subset_M together with M_f and M_r. When training the removed element estimation models M_{a,g(m,1)}, ..., M_{a,g(m,h(m))} belonging to Subset_m = {M_{a,g(m,1)}, ..., M_{a,g(m,h(m))}}, the unit sets the removed element weights β_{g(m,1)}, ..., β_{g(m,h(m))} of the corresponding loss function values L_{a,g(m,1)}, ..., L_{a,g(m,h(m))} to values greater than 0 and sets all other removed element weights β_n to 0. Here g and h are functions, g(m,1), ..., g(m,h(m)) are function values satisfying {g(m,1), ..., g(m,h(m))} ⊆ {1, ..., N}, and h(m) is an integer function value satisfying 1 ≤ h(m) ≤ N. There is no restriction on how Subset_1, ..., Subset_M are chosen, that is, on the order in which the removed element estimation models are trained, but it is desirable to train them in order starting from the model that estimates the removed element that is easiest to estimate. For example, it is desirable that the removed element estimation performed by any model belonging to Subset_ν be easier than that performed by the model newly added to the training targets in Subset_{ν+1}. This yields a factor extraction model M_f that removes each removed element well and improves the accuracy of paralinguistic information estimation. Criteria 1 to 3 above can be used, for example, as the criteria for how easy or hard a removed element is to estimate.
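The staged schedule above amounts to choosing, for each stage m, which removed element weights β_n are positive and which are 0. A minimal sketch follows, assuming the removed elements have already been ordered from easiest to hardest and that every active element uses the same positive weight. The function name, the `base_weight` parameter, and the use of 0-based indices (the text numbers elements from 1) are assumptions.

```python
def staged_weights(order, stage_sizes, base_weight=1.0):
    """Return one list of weights beta_n per stage.

    order: removed element indices (0-based) from easiest to hardest to estimate.
    stage_sizes: h(1), ..., h(M); strictly increasing and ending at N so that
        Subset_1 is strictly contained in Subset_2, and so on up to the full set.
    """
    n = len(order)
    sizes = list(stage_sizes)
    if sorted(order) != list(range(n)):
        raise ValueError("order must be a permutation of 0..N-1")
    if not sizes or sizes != sorted(set(sizes)) or sizes[0] < 1 or sizes[-1] != n:
        raise ValueError("stage sizes must increase strictly and end at N")
    schedule = []
    for size in sizes:
        active = set(order[:size])   # elements whose estimation models train in this stage
        schedule.append([base_weight if i in active else 0.0 for i in range(n)])
    return schedule


# Hypothetical case with N = 3, where element 1 is easiest and element 2 is hardest.
schedule = staged_weights(order=[1, 0, 2], stage_sizes=[1, 2, 3])
```

Each inner list can be passed as the weights β_1, ..., β_N for the training run of the corresponding stage, so the harder elements join only after the easier ones are already being removed.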

Training by the factor extraction model learning unit 212 is repeated until a predetermined end condition is satisfied. For example, when the number of epochs (the number of times all of the acoustic feature vector sequences F_t1 and correct labels LA_{ta,1}, ..., LA_{ta,N} have been used for training) reaches a fixed value (for example, 100), the factor extraction model learning unit 212 regards training as complete and outputs the factor extraction model M_f at that point. The factor extraction model M_f is stored in the factor extraction model storage unit 113 (step S113). As in the first embodiment, the reconstruction model M_r and the removed element estimation models M_{a,1}, ..., M_{a,N} are not used in subsequent processing. Everything else is the same as in the first embodiment.

[Features of each embodiment]
As described above, each embodiment extracts, from the factors representing speech, the factors needed to estimate the paralinguistic information of the speech, and estimates the paralinguistic information using the extracted factors. Factors unnecessary for the estimation are thereby removed, enabling more accurate paralinguistic information estimation than before. The second embodiment estimates the paralinguistic information of speech using a factor sequence obtained by removing, from the factors representing the speech, multiple removed elements that are unnecessary for the estimation, enabling still more accurate estimation.

Each embodiment also assumes that, once a specific factor can no longer be estimated from the factor sequence extracted by the factor extraction model, that factor can be regarded as completely removed from the factor sequence. On this assumption, the factor extraction model is trained so that the removed elements unnecessary for paralinguistic information estimation become difficult to estimate from the factor sequence it extracts. This yields a factor extraction model that extracts the factors needed to estimate paralinguistic information.

［その他の変形例等］ [Other modifications, etc.]
本発明は上述の実施形態に限定されるものではない。例えば、上述の実施形態では、因子抽出モデル、再構成モデル、および除去要素推定モデルとして、深層学習に基づくモデルを例示した。しかしながら、因子抽出モデル、再構成モデル、および除去要素推定モデルとして、その他の推定モデルが用いられてもよい。 The invention is not limited to the embodiments described above. For example, in the above-described embodiments, models based on deep learning were exemplified as the factor extraction model, the reconstruction model, and the removed element estimation model. However, other estimation models may be used as the factor extraction model, the reconstruction model, and the removed element estimation model.

上述の各種の処理は、記載に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。その他、本発明の趣旨を逸脱しない範囲で適宜変更が可能であることはいうまでもない。 The various types of processing described above may not only be executed in chronological order according to the description, but may also be executed in parallel or individually according to the processing capacity of the device that executes the processing or as necessary. In addition, it goes without saying that appropriate modifications are possible without departing from the gist of the present invention.

上記の各装置は、例えば、ＣＰＵ（central processing unit）等のプロセッサ（ハードウェア・プロセッサ）およびＲＡＭ（random-access memory）・ＲＯＭ（read-only memory）等のメモリ等を備える汎用または専用のコンピュータが所定のプログラムを実行することで構成される。このコンピュータは１個のプロセッサやメモリを備えていてもよいし、複数個のプロセッサやメモリを備えていてもよい。このプログラムはコンピュータにインストールされてもよいし、予めＲＯＭ等に記録されていてもよい。また、ＣＰＵのようにプログラムが読み込まれることで機能構成を実現する電子回路（circuitry）ではなく、プログラムを用いることなく処理機能を実現する電子回路を用いて一部またはすべての処理部が構成されてもよい。１個の装置を構成する電子回路が複数のＣＰＵを含んでいてもよい。 Each of the above devices is configured, for example, by a general-purpose or dedicated computer, equipped with a processor (hardware processor) such as a CPU (central processing unit) and memory such as RAM (random-access memory) and ROM (read-only memory), executing a predetermined program. This computer may have a single processor and memory, or may have multiple processors and memories. The program may be installed on the computer, or may be recorded in ROM or the like in advance. Also, some or all of the processing units may be configured using electronic circuitry that realizes processing functions without using a program, rather than electronic circuitry that, like a CPU, realizes a functional configuration by reading a program. The electronic circuitry that constitutes one device may include a plurality of CPUs.

上述の構成をコンピュータによって実現する場合、各装置が有すべき機能の処理内容はプログラムによって記述される。このプログラムをコンピュータで実行することにより、上記処理機能がコンピュータ上で実現される。この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体の例は、非一時的な（non-transitory）記録媒体である。このような記録媒体の例は、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等である。 When the above configuration is implemented by a computer, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the above processing functions are realized on the computer. A program describing the contents of this processing can be recorded in a computer-readable recording medium. An example of a computer-readable recording medium is a non-transitory recording medium. Examples of such recording media are magnetic recording devices, optical discs, magneto-optical recording media, semiconductor memories, and the like.

このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ－ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させる構成としてもよい。 The distribution of this program is carried out, for example, by selling, assigning, lending, etc. portable recording media such as DVDs and CD-ROMs on which the program is recorded. Further, the program may be distributed by storing the program in the storage device of the server computer and transferring the program from the server computer to other computers via the network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の記憶装置に格納する。処理の実行時、このコンピュータは、自己の記憶装置に格納されたプログラムを読み取り、読み取ったプログラムに従った処理を実行する。このプログラムの別の実行形態として、コンピュータが可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することとしてもよく、さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるＡＳＰ（Application Service Provider）型のサービスによって、上述の処理を実行する構成としてもよい。 A computer that executes such a program, for example, first temporarily stores the program recorded on a portable recording medium, or the program transferred from the server computer, in its own storage device. When executing the processing, this computer reads the program stored in its own storage device and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or may execute processing according to the received program each time the program is transferred to this computer from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, which does not transfer the program from the server computer to this computer and realizes the processing functions only through execution instructions and result acquisition.

コンピュータ上で所定のプログラムを実行させて本装置の処理機能が実現されるのではなく、これらの処理機能の少なくとも一部がハードウェアで実現されてもよい。 At least a part of these processing functions may be realized by hardware instead of executing a predetermined program on a computer to realize the processing functions of the present apparatus.

本発明によって推定されるパラ言語情報は、例えば、音声対話における話し相手の感情を考慮した対話制御（相手が怒っていれば話題を変えるなど）や、音声を用いたメンタルヘルス診断（毎日の音声を収録し、悲しみや怒り音声の頻度からメンタルヘルス状況を予測するなど）に応用可能である。 The paralinguistic information estimated by the present invention can be applied, for example, to dialogue control that takes the other party's emotions into account in a spoken dialogue (such as changing the topic if the other party is angry) and to mental health diagnosis using speech (such as recording speech every day and predicting mental health status from the frequency of sad or angry speech).

１１，２１学習装置 11, 21 learning device
１２パラ言語情報推定装置 12 paralinguistic information estimation device

Claims

A paralinguistic information estimation device comprising:
a factor extraction unit that receives acoustic features of speech as input, and extracts and outputs the factors necessary for estimating paralinguistic information of the speech from among factors that are elements representing each feature of the speech; and
a paralinguistic information estimation unit that receives the factors extracted by the factor extraction unit as input, and estimates and outputs the paralinguistic information of the speech.

A learning device comprising a factor extraction model learning unit that receives as input acoustic features of training speech and either the correct label of a removed element, which is a single factor unnecessary for estimating the paralinguistic information of the training speech among the factors that are elements representing each feature of the training speech, or the correct labels of a plurality of removed elements, which are a plurality of factors unnecessary for estimating the paralinguistic information of the training speech among those factors, and that learns
(1) a factor extraction model that receives acoustic features of speech as input, and extracts and outputs the factors necessary for estimating paralinguistic information of the speech from among factors that are elements representing each feature of the speech; (2) a reconstruction model that receives the factors output from the factor extraction model and the correct labels of the removed elements as input and reconstructs the acoustic features of the speech; and (3) a removed element estimation model that receives the factors output from the factor extraction model as input and estimates the removed element, which is a single factor unnecessary for estimating the paralinguistic information of the speech among the factors that are elements representing each feature of the speech, or the plurality of removed elements, which are a plurality of factors unnecessary for estimating the paralinguistic information of the speech among those factors,
wherein the factor extraction model learning unit
learns the factor extraction model so that the error between the correct label and the removed element estimated by the removed element estimation model when the acoustic features of the training speech are input to the factor extraction model becomes large.

The learning device of claim 2,
wherein the factor extraction model learning unit
learns the factor extraction model, the reconstruction model, and the removed element estimation model using, as the total loss function value, a weighted sum of (a) the loss function value of the reconstruction model, which corresponds to the error between the acoustic features of the training speech and the acoustic features reconstructed by the reconstruction model when the factors output from the factor extraction model in response to the acoustic features of the training speech, together with the correct labels of the removed elements, are input to the reconstruction model, and (b) the loss function value of the removed element estimation model, which corresponds to the error between the correct label of the removed element and the removed element estimated by the removed element estimation model when the factors output from the factor extraction model in response to the acoustic features of the training speech are input to the removed element estimation model.

The learning device according to claim 2 or 3,
wherein the factor extraction model learning unit learns the factor extraction model and the removed element estimation model using error backpropagation, and
when backpropagating from the removed element estimation model to the factor extraction model in the error backpropagation, the partial derivative corresponding to the error between the removed element output from the removed element estimation model and the correct label is back-propagated from the removed element estimation model to a gradient inversion layer, and the gradient inversion layer back-propagates to the factor extraction model a value obtained by multiplying the partial derivative by a negative constant.

The learning device according to any one of claims 2 to 4,
wherein the factor extraction model learning unit
receives as input the acoustic features of the training speech and the correct labels of the plurality of removed elements, and learns the factor extraction model, the reconstruction model, and a plurality of removed element estimation models that estimate the plurality of removed elements, and
performs the learning in order, starting from the removed element estimation model that estimates a removed element that is easy to estimate among the removed element estimation models that estimate the plurality of removed elements.

The learning device according to any one of claims 2 to 5,
wherein the factor extraction model learning unit
receives as input the acoustic features of the training speech and the correct labels of the plurality of removed elements, and
learns the factor extraction model, the reconstruction model, and the removed element estimation models that estimate the plurality of removed elements using, as the total loss function value, a weighted sum of the loss function value of the reconstruction model and, for each removed element, the product of the removed element weight corresponding to that removed element and the loss function value of the removed element estimation model corresponding to that removed element.

The learning device of claim 6,
wherein the plurality of removed elements includes a first removed element and a second removed element that is easier to estimate from the acoustic features of speech than the first removed element, and
the value of the removed element weight corresponding to the first removed element is greater than the value of the removed element weight corresponding to the second removed element.

A paralinguistic information estimation method comprising:
a factor extraction step of receiving acoustic features of speech as input, and extracting and outputting the factors necessary for estimating paralinguistic information of the speech from among factors that are elements representing each feature of the speech; and
a paralinguistic information estimation step of receiving the factors extracted in the factor extraction step as input, and estimating and outputting the paralinguistic information of the speech.

A learning method comprising a factor extraction model learning step of receiving as input acoustic features of training speech and either the correct label of a removed element, which is a single factor unnecessary for estimating the paralinguistic information of the training speech among the factors that are elements representing each feature of the training speech, or the correct labels of a plurality of removed elements, which are a plurality of factors unnecessary for estimating the paralinguistic information of the training speech among those factors, and learning
(1) a factor extraction model that receives acoustic features of speech as input, and extracts and outputs the factors necessary for estimating paralinguistic information of the speech from among factors that are elements representing each feature of the speech; (2) a reconstruction model that receives the factors output from the factor extraction model and the correct labels of the removed elements as input and reconstructs the acoustic features of the speech; and (3) a removed element estimation model that receives the factors output from the factor extraction model as input and estimates the removed element, which is a single factor unnecessary for estimating the paralinguistic information of the speech among the factors that are elements representing each feature of the speech, or the plurality of removed elements, which are a plurality of factors unnecessary for estimating the paralinguistic information of the speech among those factors,
wherein the factor extraction model learning step
includes a step of learning the factor extraction model so that the error between the correct label and the removed element estimated by the removed element estimation model when the acoustic features of the training speech are input to the factor extraction model becomes large.

A program for causing a computer to function as the paralinguistic information estimation device according to claim 1.

A program for causing a computer to function as the learning device according to any one of claims 2 to 7.