JP2021026045A

JP2021026045A - Storage device, storage method and program

Info

Publication number: JP2021026045A
Application number: JP2019141515A
Authority: JP
Inventors: 勇気太刀岡; Yuki Tachioka
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2021-02-22

Abstract

To provide a technology for storing utterance data used for training an acoustic model with its content concealed. SOLUTION: A storage device 1 stores utterance data that consists of voice-related data and text corresponding to the voice and that is used for training an acoustic model. The storage device 1 comprises: an input unit 10 that accepts input of a plurality of pieces of original utterance data, each consisting of voice and the text corresponding to the voice; a fragment data generation unit 11 that divides the original utterance data and generates a plurality of pieces of fragment data, each consisting of voice containing one or more phrases and the text corresponding to that voice; a fragment data connection unit 12 that randomly connects the pieces of fragment data generated from the plurality of pieces of original utterance data and generates a plurality of pieces of utterance data of a prescribed length; and a storage unit 13 that stores the pieces of utterance data generated by the fragment data connection unit 12. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for storing utterance data used for training an acoustic model for speech recognition.

When building an acoustic model for speech recognition, it is effective to train on in-domain speech data, that is, data from the domain in which the model will actually be used. Doing so can significantly improve the performance of the acoustic model. However, in-domain data contains personal information, so it may infringe on the privacy of the speakers. The data may also reveal that a speaker belongs to a certain group. For these reasons, in-domain data is generally discarded once its usage period has expired.

Once the data has been discarded, however, the model cannot be retrained even if a more effective model structure is proposed in the future. It is therefore desirable to be able to conceal in-domain speech data and store it in a privacy-protected state.

One general approach is to add specific noise to the data. For example, Patent Document 1 discloses a method for determining the conditions under which data is degraded at the time of disclosure, in order to reduce the risk to both the data owner and the data analyst when the owner discloses data to the analyst. Non-Patent Documents 1 and 2 disclose methods for concealing the steps of a computation.

Patent Document 1: Japanese Unexamined Patent Publication No. 2018-128913

Non-Patent Document 1: P. Smaragdis and M. Shashanka, “A framework for secure speech recognition,” IEEE Transactions on Audio, Speech, Language Processing, 15, 1404-1413 (2007).
Non-Patent Document 2: M.A. Pathak, B. Raj, S. Rane, and P. Smaragdis, “Privacy-preserving speech processing,” IEEE Signal Processing Magazine, 62-74 (2013).

The method of adding noise to data described in Patent Document 1 is effective when the data provider and the data user are different parties and the data does not need to be protected from the data provider. However, when the data provider and the data user are the same, the protection is meaningless once the added noise is known.

The methods described in Non-Patent Documents 1 and 2 require more computation than processing without concealment, and the operation protocol must be changed whenever the model is changed. Moreover, the methods described in Non-Patent Documents 1 and 2 cannot be used for data protection.

In view of the above background, an object of the present invention is to provide a technique capable of storing utterance data used for training an acoustic model with its content concealed.

The storage device of the present invention stores utterance data that consists of voice-related data and text corresponding to the voice and that is used for training an acoustic model. The storage device comprises: an input unit that receives a plurality of pieces of original utterance data, each consisting of voice and text corresponding to the voice; a fragment data generation unit that divides the original utterance data to generate a plurality of pieces of fragment data, each consisting of voice containing one or more phrases and text corresponding to that voice; a fragment data combining unit that randomly combines the pieces of fragment data generated from the plurality of pieces of original utterance data to generate a plurality of pieces of utterance data of a predetermined length; and a storage unit that stores the pieces of utterance data generated by the fragment data combining unit.

By dividing the input original utterance data into fragment data and randomly combining the fragments to generate new utterance data in this way, the content of the original utterance data can be obscured. Because each piece of fragment data contains one or more phrases, the time series of acoustic features is preserved within it, so the generated utterance data can be used for training an acoustic model. Here, a "phrase" (bunsetsu) is a unit of sentence segmentation consisting of one or more independent words and zero or more attached words.

The storage device of the present invention may include an acoustic feature generation unit that generates acoustic features from the voice input through the input unit, and the storage unit may store the acoustic features of the utterance data as the voice-related data. Storing acoustic features as the voice-related data makes it difficult to infer, from background noise and the like, the positions at which fragment data were joined within the utterance data.

The storage device of the present invention may include a feature normalization unit that normalizes the acoustic features generated by the acoustic feature generation unit, and the storage unit may store the normalized acoustic features of the utterance data as the voice-related data. Normalizing the acoustic features in this way makes it even more difficult to infer the positions at which fragment data were joined within the utterance data.

In the storage device of the present invention, the input unit may accept, together with the plurality of pieces of original utterance data, the identifiers of the speakers of that utterance data, and the storage unit may store the utterance data generated by the fragment data combining unit in association with the speaker identifiers. Storing the speaker identifiers of the newly generated utterance data in this way broadens the range of uses of the stored data.

In the storage device of the present invention, the input unit may accept, together with the plurality of pieces of original utterance data, the identifiers of the speakers of that utterance data, and the fragment data combining unit may, based on the speaker identifiers, include fragment data obtained from a plurality of speakers in each piece of generated utterance data. Including fragment data from the utterance data of multiple speakers in each newly generated utterance makes it difficult to identify a speaker from the utterance data.
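The multi-speaker constraint described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `combine_multi_speaker`, the `(speaker_id, text)` fragment representation, and the retry-until-valid strategy are hypothetical and do not come from the publication, which does not specify how the constraint is enforced.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical fragment representation: (speaker_id, fragment_text).
# Audio is omitted to keep the sketch short.
Fragment = Tuple[str, str]


def combine_multi_speaker(fragments: List[Fragment],
                          fragments_per_utterance: int,
                          seed: Optional[int] = None,
                          max_attempts: int = 1000) -> List[List[Fragment]]:
    """Shuffle and regroup fragments until every new utterance mixes 2+ speakers.

    Assumption: simple rejection sampling. Raises if no valid grouping is
    found, for example when a leftover group holds a single fragment.
    """
    if fragments_per_utterance < 2:
        raise ValueError("need at least 2 fragments per utterance to mix speakers")
    rng = random.Random(seed)
    pool = list(fragments)
    k = fragments_per_utterance
    for _ in range(max_attempts):
        rng.shuffle(pool)
        utterances = [pool[i:i + k] for i in range(0, len(pool), k)]
        if all(len({spk for spk, _ in u}) >= 2 for u in utterances):
            return utterances
    raise RuntimeError("could not find a grouping with 2+ speakers per utterance")
```

Called with twelve fragments from three speakers and `fragments_per_utterance=3`, the function returns four utterances, each of which contains fragments from at least two speakers.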

The storage device of the present invention may include a speaker feature calculation unit that applies a speaker identification technique to the plurality of pieces of original utterance data to obtain speaker features, and a clustering unit that clusters the speakers based on those features. The fragment data combining unit may then include, in each piece of generated utterance data, fragment data obtained from the utterance data of a plurality of speakers belonging to the same cluster. With this configuration, utterance data is generated by combining the voices of multiple speakers with similar features, which makes it even more difficult to identify a speaker from the utterance data. As the speaker features, for example, an i-vector obtained by factor analysis, or an x-vector or d-vector obtained from the output of an intermediate layer of a speaker identification network, can be used.
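As a rough illustration of the clustering step, the sketch below groups speaker vectors with a plain k-means loop. The publication does not name a clustering algorithm, so k-means is an assumption made here, as are the function names and the choice of initializing centroids from the first `k` vectors. Real speaker vectors (i-vectors, x-vectors, d-vectors) would have hundreds of dimensions; the toy vectors in the usage note are two-dimensional.

```python
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors: List[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Return a cluster label for each speaker vector.

    Assumption: centroids start at the first k vectors, which keeps the
    sketch deterministic; a practical system would use a better seeding.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [min(range(k), key=lambda j: sq_dist(v, centroids[j]))
                  for v in vectors]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)` places the first three vectors in one cluster and the last three in the other. Fragments would then be combined only among speakers that share a label.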

The storage device of the present invention includes an acoustic feature generation unit that generates acoustic features from the voice, and a feature normalization unit that normalizes the acoustic features of the utterance data of the plurality of speakers belonging to the same cluster. Normalizing the acoustic features in this way makes it even more difficult to identify a speaker from the utterance data.

In the storage device of the present invention, the fragment data generation unit may perform morphological analysis and syntactic analysis on the input text to find phrases and, based on the phrase boundary positions, generate fragment data each containing one or more phrases. Applying morphological and syntactic analysis to the text in this way allows the phrases of the utterance data to be determined appropriately.

In the storage device of the present invention, the fragment data generation unit may detect silent sections in the input voice, treat the silent sections as phrase boundaries, and generate fragment data each containing one or more phrases based on those boundaries. Silent sections in voice data often coincide with phrase boundaries, so they can be used as the cut positions for fragment data. Cutting at silent sections also reduces the influence of the phonological changes that arise where words run together in speech.

In the storage device of the present invention, the fragment data combining unit may set the predetermined length of the generated utterance data based on the distribution of the lengths of the plurality of pieces of original utterance data. Fragment data generated from utterance data loses the information about phonemes that were contiguous in the original utterance data. Specifically, in utterance data with no silent section, the phonemes are continuous from the beginning to the end of the utterance, so NULL, which indicates that there is no preceding or following phoneme, occurs at only two places: the beginning and the end of the utterance. If this utterance data is divided into N fragments, for example, NULL appears before and after each fragment, so 2 × N NULLs are produced. Because the present invention generates utterance data of a predetermined length based on the distribution of the lengths of the input utterance data, the number of NULLs can be made almost the same as in the original utterance data. Here, "based on the distribution of the lengths of the plurality of pieces of original utterance data" may mean using the average length of the utterance data, or giving the generated utterance data the same length distribution as the original utterance data.
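The NULL-count argument above can be written out as a short calculation. The symbols M (number of original utterances, each assumed to have no internal silent section) and K (number of regenerated utterances) are introduced here for illustration and do not appear in the publication; N is the number of fragments per utterance as in the text.

```latex
\begin{align*}
\text{original data: } & M \text{ utterances} \;\Rightarrow\; 2M \text{ NULLs} \\
\text{after splitting each into } N \text{ fragments: } & MN \text{ fragments} \;\Rightarrow\; 2MN \text{ NULLs} \\
\text{after recombining into } K \text{ utterances: } & 2K \text{ NULLs}, \qquad K \approx M \;\Rightarrow\; 2K \approx 2M
\end{align*}
```

Matching the length of the regenerated utterances to the original lengths keeps K close to M, which is why the NULL count stays close to that of the original data.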

The storage device of the present invention may include a language model generation unit that generates a language model based on the text stored in the storage unit, and a language prediction unit that applies the language model to the text of the original utterance data to perform language prediction. This configuration makes it possible to quantify the degree of concealment of the generated utterance data. If the language model generated from the text of the generated utterance data predicts the original utterance data accurately, the degree of concealment is low; conversely, if it cannot predict the original utterance data, the degree of concealment is high. As the prediction accuracy, for example, the reciprocal of the perplexity of the language model can be used.
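A minimal sketch of this measurement is shown below, assuming a bigram language model with add-one smoothing. The publication does not specify the type of language model, so the model choice, the function names, and the padding symbols are all assumptions made for illustration. The prediction accuracy is taken as the reciprocal of the perplexity, as the text suggests.

```python
import math
from collections import Counter
from typing import List, Tuple

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-boundary symbols

Model = Tuple[Counter, Counter, int]


def train_bigram(sentences: List[List[str]]) -> Model:
    """Count bigrams and their contexts over tokenized sentences."""
    bigrams, contexts, vocab = Counter(), Counter(), {BOS, EOS}
    for words in sentences:
        tokens = [BOS] + list(words) + [EOS]
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    # +1 reserves probability mass for words never seen in training.
    return bigrams, contexts, len(vocab) + 1


def perplexity(model: Model, sentences: List[List[str]]) -> float:
    """Perplexity of the sentences under the add-one smoothed bigram model."""
    bigrams, contexts, v = model
    log_prob, n = 0.0, 0
    for words in sentences:
        tokens = [BOS] + list(words) + [EOS]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)


def prediction_accuracy(model: Model, sentences: List[List[str]]) -> float:
    """Reciprocal of perplexity; higher means the text is easier to predict."""
    return 1.0 / perplexity(model, sentences)
```

In use, the model would be trained on the stored (recombined) text and evaluated on the original text. A low `prediction_accuracy` suggests that the word order of the originals is hard to recover from what was stored.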

The storage device of the present invention may include a language model generation unit that generates a language model based on the text of the original utterance data, and a language prediction unit that applies the language model to the text stored in the storage unit to perform language prediction. This configuration makes it possible to quantify the degree of concealment of the generated utterance data. If the language model generated from the text of the original utterance data predicts the generated utterance data accurately, the degree of concealment is low; conversely, if it cannot predict the generated utterance data, the degree of concealment is high.

In the storage device of the present invention, when the prediction accuracy of the language prediction unit is higher than a predetermined threshold, the fragment data generation unit may shorten the fragment data. With this configuration, when the prediction accuracy is high, that is, when the utterance data is not sufficiently concealed, the fragment data is made finer so that the utterance data is concealed. The predetermined threshold may be an absolute value of the prediction accuracy or a relative value. When a relative value is used, a language model generated from the original utterance data and a language model generated from the generated utterance data are prepared, each language model is used to predict either the original text or the data obtained by combining fragment data, and the two prediction accuracies are compared. If the difference in prediction accuracy is larger than a predetermined threshold, the utterance data can be judged to be sufficiently concealed.
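The feedback loop described above, using an absolute threshold, might look like the following sketch. The function name `choose_fragment_length` and the `evaluate` callback are hypothetical: `evaluate(n)` stands for the whole pipeline of splitting into fragments of at most `n` phrases, recombining, training a language model, and returning the prediction accuracy.

```python
from typing import Callable, Optional


def choose_fragment_length(max_phrases: int,
                           evaluate: Callable[[int], float],
                           threshold: float) -> Optional[int]:
    """Shorten fragments step by step until prediction accuracy is low enough.

    Returns the largest fragment length (in phrases) whose accuracy does not
    exceed the threshold, or None if even single-phrase fragments fail.
    """
    for n in range(max_phrases, 0, -1):
        if evaluate(n) <= threshold:
            return n
    return None
```

Starting from the longest fragments keeps as much acoustic context as possible, and the loop only shortens fragments when concealment is judged insufficient.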

The storage method of the present invention is a method for storing, in a storage device, utterance data that consists of voice-related data and text corresponding to the voice and that is used for training an acoustic model. The method comprises the steps of: inputting a plurality of pieces of original utterance data, each consisting of voice and text corresponding to the voice; dividing the original utterance data to generate a plurality of pieces of fragment data, each consisting of voice containing one or more phrases and text corresponding to that voice; randomly combining the pieces of fragment data generated from the plurality of pieces of original utterance data to generate a plurality of pieces of utterance data of a predetermined length; and storing the generated pieces of utterance data in the storage device.

The program of the present invention is a program for storing, in a storage device, utterance data that consists of voice-related data and text corresponding to the voice and that is used for training an acoustic model. The program causes a computer to execute the steps of: inputting a plurality of pieces of original utterance data, each consisting of voice and text corresponding to the voice; dividing the original utterance data to generate a plurality of pieces of fragment data, each consisting of voice containing one or more phrases and text corresponding to that voice; randomly combining the pieces of fragment data generated from the plurality of pieces of original utterance data to generate a plurality of pieces of utterance data of a predetermined length; and storing the generated pieces of utterance data in the storage device.

According to the present invention, utterance data can be stored in a state in which the content of the original utterance data cannot be understood.

A diagram showing the configuration of the storage device of the first embodiment.
A diagram showing the configuration of the storage device of the second embodiment.
A diagram showing the configuration of the storage device of the third embodiment.
A diagram showing the configuration of the storage device of the fourth embodiment.
A diagram showing the configuration of the storage device of the fifth embodiment.
A diagram showing the configuration of the storage device of the sixth embodiment.
(a) A conceptual diagram showing how speaker vectors v1 to v9 are obtained from the utterance data of speakers 1 to 9. (b) A diagram showing an example in which speakers 1 to 9 are clustered based on the speaker vectors v1 to v9.
A diagram showing the configuration of the storage device of the seventh embodiment.
A diagram showing the configuration of the storage device of the eighth embodiment.

Hereinafter, storage devices according to embodiments of the present invention will be described with reference to the drawings.

(First Embodiment)
FIG. 1 is a diagram showing the configuration of the storage device 1 of the first embodiment. The storage device 1 of the first embodiment stores utterance data that consists of voice-related data and text corresponding to the voice and that is used for training an acoustic model.

The storage device 1 includes an input unit 10 that receives original utterance data, a fragment data generation unit 11 that divides the original utterance data to generate a plurality of pieces of fragment data, a fragment data combining unit 12 that randomly combines the pieces of fragment data to generate a plurality of pieces of utterance data of a predetermined length, and a storage unit 13 that stores the pieces of utterance data generated by the fragment data combining unit 12.

The details of each component of the storage device 1 are described below together with its operation. The storage device 1 accepts original utterance data through the input unit 10. The original utterance data is, for example, in-domain utterance data acquired in the target domain, and consists of voice and text corresponding to the voice. The text corresponding to the voice is, for example, a transcription of the voice.

Next, the fragment data generation unit 11 divides the input original utterance data to generate a plurality of pieces of fragment data. When dividing the original utterance data, the fragment data generation unit 11 uses the phrase as the minimum unit; that is, it divides the data so that each piece of fragment data contains at least one phrase. One conceivable way for the fragment data generation unit 11 to cut the utterance data into fragment data is to perform morphological analysis on the text of the utterance data to split it into words, then perform syntactic analysis and cut the data into parts that each belong to the same syntax tree. Instead of syntactic analysis, the positions where attached words such as particles appear may be used as the cut positions.
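The simpler variant above, cutting where attached words such as particles appear, can be sketched as follows. The sketch assumes the text has already been tokenized and tagged by a morphological analyzer; the tag names in `ATTACHED_POS`, the function name, and the romanized example tokens are hypothetical and serve only to illustrate the rule "a new phrase starts when an independent word follows an attached word".

```python
from typing import List, Tuple

# Assumed part-of-speech labels for attached words; a real analyzer
# would use its own tag set.
ATTACHED_POS = {"particle", "auxiliary"}


def split_into_phrases(tokens: List[Tuple[str, str]]) -> List[List[str]]:
    """Group (surface, pos) tokens into phrases.

    Each phrase is a run of independent words followed by zero or more
    attached words; a cut is placed after each run of attached words.
    """
    phrases: List[List[str]] = []
    current: List[str] = []
    prev_attached = False
    for surface, pos in tokens:
        is_attached = pos in ATTACHED_POS
        if current and prev_attached and not is_attached:
            phrases.append(current)  # an independent word opens a new phrase
            current = []
        current.append(surface)
        prev_attached = is_attached
    if current:
        phrases.append(current)
    return phrases
```

For the tagged tokens `watashi/noun wa/particle gakkou/noun ni/particle iki/verb masu/auxiliary`, the function yields three phrases: `[watashi, wa]`, `[gakkou, ni]`, and `[iki, masu]`. Fragment data would then be formed from one or more consecutive phrases.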

The fragment data generation unit 11 may also detect silent sections (also called "pauses") in the input voice, treat the silent sections as phrase boundaries, and generate fragment data each containing one or more phrases based on those boundaries. Specifically, the fragment data generation unit 11 performs phoneme alignment, identifies the silent sections as phrase breaks, and divides the utterance data there. The phrase breaks may also be determined by taking the OR or the AND of the cut positions obtained by the methods described above.
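The sketch below illustrates pause-based cutting and the OR/AND combination of cut positions. Note that it swaps in a simple frame-energy threshold in place of the phoneme alignment described above, so it is a stand-in rather than the method of the publication. The function names, the threshold, and the minimum pause length are all assumptions, and cut positions are assumed to be expressed as frame indices.

```python
from typing import List, Tuple


def find_silent_sections(energies: List[float],
                         threshold: float = 0.01,
                         min_frames: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges whose energy stays below the threshold.

    Runs shorter than min_frames are ignored so that brief dips inside a
    word are not treated as pauses.
    """
    sections = []
    start = None
    for i, energy in enumerate(energies):
        if energy < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_frames:
                sections.append((start, i))
            start = None
    if start is not None and len(energies) - start >= min_frames:
        sections.append((start, len(energies)))
    return sections


def merge_cut_points(text_cuts: List[int], pause_cuts: List[int],
                     mode: str = "or") -> List[int]:
    """Combine cut positions from text analysis and pause detection."""
    a, b = set(text_cuts), set(pause_cuts)
    return sorted(a | b if mode == "or" else a & b)
```

Using OR yields more, shorter fragments, while AND keeps only the cuts on which both methods agree.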

Subsequently, the fragment data combining unit 12 randomly combines the fragment data generated by the fragment data generation unit 11 to generate new utterance data. If only the fragment data obtained from a single piece of original utterance data were randomly combined, the result would merely be a reordering within that utterance, and the original utterance data would be likely to be inferred. The fragment data combining unit 12 therefore conceals the content of the original utterance data by randomly combining fragment data obtained from a plurality of pieces of original utterance data.
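A minimal sketch of this pooling and recombination step is given below. The `Fragment` class, the function names, and the choice to measure utterance length as a number of fragments are assumptions made for illustration; the publication does not prescribe a data layout.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Fragment:
    audio: List[float]  # samples (or feature frames) covering one or more phrases
    text: str           # transcript of the same span


def recombine(fragment_lists: List[List[Fragment]],
              fragments_per_utterance: int,
              seed: Optional[int] = None) -> List[List[Fragment]]:
    """Pool fragments from all original utterances, shuffle, and regroup."""
    if fragments_per_utterance < 1:
        raise ValueError("fragments_per_utterance must be at least 1")
    rng = random.Random(seed)
    # Pooling across utterances is what prevents a mere in-utterance reorder.
    pool = [f for fragments in fragment_lists for f in fragments]
    rng.shuffle(pool)
    k = fragments_per_utterance
    return [pool[i:i + k] for i in range(0, len(pool), k)]


def render(utterance: List[Fragment]) -> Tuple[List[float], str]:
    """Concatenate the audio and text of the fragments in their new order."""
    audio = [sample for f in utterance for sample in f.audio]
    text = " ".join(f.text for f in utterance)
    return audio, text
```

Only the output of `render` would be stored. The mapping from fragments back to their original utterances is never kept.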

The length of the utterance data newly generated by the fragment data combining unit 12 is set based on the lengths of the plurality of pieces of original utterance data. For example, the average length of the original utterance data may be calculated and rounded to the nearest integer. Alternatively, the distribution (mean and variance) of the lengths of the original utterance data may be obtained, and the lengths of the new utterance data may be set so that they follow the same distribution.

Finally, the new utterance data generated by the fragment data combining unit 12 is stored in the storage unit 13.
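The two ways of setting the target length mentioned above can be sketched as follows. The unit of length (here, an integer count such as phrases per utterance), the function names, and the use of a normal distribution to reproduce the mean and variance are assumptions. Rounding is done half-up with `int(x + 0.5)` because Python's built-in `round` rounds halves to the nearest even number.

```python
import random
import statistics
from typing import List, Optional


def mean_length(lengths: List[int]) -> int:
    """Average original length, rounded half-up, and never below 1."""
    return max(1, int(statistics.mean(lengths) + 0.5))


def sample_lengths(lengths: List[int], count: int,
                   seed: Optional[int] = None) -> List[int]:
    """Draw target lengths from a normal distribution fitted to the originals.

    Assumption: a normal distribution with the same mean and (population)
    standard deviation stands in for "the same distribution".
    """
    rng = random.Random(seed)
    mu = statistics.mean(lengths)
    sigma = statistics.pstdev(lengths)
    return [max(1, int(rng.gauss(mu, sigma) + 0.5)) for _ in range(count)]
```

With `mean_length`, every new utterance has the same length. With `sample_lengths`, the new lengths vary around the original mean.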

The storage device 1 of the present embodiment divides a plurality of pieces of original utterance data into fragment data, randomly combines the fragment data to generate new utterance data, and stores the new utterance data. The utterance data can therefore be stored in a state in which the content of the original utterance data cannot be understood. Each piece of fragment data contains one or more phrases, so the time series of acoustic features is retained in the new utterance data as well, and the stored utterance data can be used to train an acoustic model.

In addition, the storage device 1 of the present embodiment sets the lengths of the newly generated utterance data so that they match the average or the distribution of the lengths of the original utterance data. The number of utterance-initial and utterance-final characters contained in the newly generated utterance data can therefore be kept at about the same level as in the original utterance data. As a result, training with the stored utterance data can proceed in the same way as training with the original utterance data.

(Second Embodiment)
FIG. 2 is a diagram showing the configuration of the storage device 2 of the second embodiment, which is a modification of the first embodiment. The basic configuration of the storage device 2 is the same as that of the first embodiment, but the storage device 2 additionally includes an acoustic feature generation unit 14 that generates acoustic features from the voice input through the input unit 10. The acoustic feature generation unit 14 inputs the acoustic features generated from the voice to the fragment data generation unit 11. The fragment data generation unit 11 determines the positions at which to cut the utterance data, based on the text of the original utterance data or on silent sections, and cuts the acoustic feature data at those positions to generate fragment data.

The fragment data generation unit 11 passes the fragment data (consisting of acoustic features and the corresponding text) to the fragment data combining unit 12. As in the storage device 1 of the first embodiment, the fragment data combining unit 12 randomly combines the fragment data to generate new utterance data and stores it in the storage unit 13. The utterance data stored by the storage device 2 of the present embodiment contains acoustic features as the voice-related data.
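Cutting a sequence of acoustic feature frames at the chosen positions can be sketched as follows. The function name, the representation of cut positions as times in seconds, and the 10 ms frame shift are assumptions made for illustration; the publication does not state how cut positions are mapped to feature frames.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def split_features(frames: Sequence[Frame],
                   cut_times_sec: List[float],
                   frame_shift_sec: float = 0.010) -> List[Sequence[Frame]]:
    """Split feature frames at cut times converted to frame indices."""
    # Convert each cut time to the nearest frame index and drop duplicates.
    indices = sorted({round(t / frame_shift_sec) for t in cut_times_sec})
    # Ignore cuts that fall on or outside the ends of the sequence.
    inner = [i for i in indices if 0 < i < len(frames)]
    bounds = [0] + inner + [len(frames)]
    return [frames[a:b] for a, b in zip(bounds, bounds[1:])]
```

For ten frames and cut times of 0.03 s and 0.07 s, the result is three pieces of 3, 4, and 3 frames. Each piece would be paired with the text of the same span to form one piece of fragment data.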

The storage device 2 of the second embodiment has been described above. Like the first embodiment, the storage device 2 of the second embodiment can store utterance data with its content concealed. In addition, storing acoustic features as the voice-related data makes it difficult to infer, from background noise and the like, the positions at which fragment data were joined within the utterance data.

(Third Embodiment)
FIG. 3 is a diagram showing the configuration of a storage device 3 of a third embodiment, which is a modification of the first embodiment. The basic configuration of the storage device 3 of the third embodiment is the same as that of the second embodiment, but the storage device 3 additionally includes a feature amount normalization unit 15 that normalizes the acoustic feature amounts generated by the acoustic feature amount generation unit 14. As the normalized feature amounts, for example, feature amounts linearly transformed by feature-space maximum likelihood linear regression (fMLLR) or bottleneck features of a neural network can be used.

The feature amount normalization unit 15 inputs the normalized acoustic feature amounts to the fragment data generation unit 11. The fragment data generation unit 11 determines the positions at which to divide the utterance data, based on the text of the original utterance data or on silent sections, and divides the normalized acoustic feature data at those positions to generate the fragment data.
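As a rough illustration of what normalizing acoustic features means, the sketch below applies per-utterance mean and variance normalization. This is a deliberately simpler stand-in chosen for brevity: it is not the feature-space maximum likelihood linear regression or bottleneck-feature approach named in the embodiment, and the function name `mean_variance_normalize` and the sample values are assumptions made for this example.

```python
import math
from typing import List


def mean_variance_normalize(frames: List[List[float]],
                            eps: float = 1e-8) -> List[List[float]]:
    """Scale each feature dimension to zero mean and unit variance."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)]
    # eps avoids division by zero when a dimension is constant
    return [[(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
            for f in frames]


normalized = mean_variance_normalize([[1.0, 10.0], [3.0, 30.0]])
```

Removing per-recording offsets and scales in this way reduces the channel and background differences between fragments, which is the property the third embodiment relies on.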

The fragment data generation unit 11 passes the fragment data (each consisting of normalized acoustic feature amounts and the corresponding text) to the fragment data combining unit 12. As in the storage device 1 of the first embodiment, the fragment data combining unit 12 randomly combines the fragment data to generate new utterance data and stores it in the storage unit 13. The utterance data stored by the storage device 3 of the present embodiment contains normalized acoustic feature amounts as the data related to voice.

The storage device 3 of the third embodiment has been described above. Like the first embodiment, the storage device 3 of the third embodiment can store utterance data with its content concealed. In addition, storing normalized acoustic feature amounts as the data related to voice makes it even more difficult to infer, from background noise or the like, the positions at which the fragment data constituting the utterance data were joined.

(Fourth Embodiment)
FIG. 4 is a diagram showing the configuration of a storage device 4 of a fourth embodiment, which is a modification of the first embodiment. The basic configuration of the storage device 4 of the fourth embodiment is the same as that of the third embodiment, but in the storage device 4, the input unit 10 accepts, as the original utterance data, utterance data that includes a speaker ID identifying the speaker. In the present embodiment, it is assumed that the plurality of utterance data input at one time all come from a single speaker.

Like the storage device 3 of the third embodiment, the storage device 4 of the fourth embodiment generates a plurality of fragment data from the plurality of input utterance data and randomly combines the generated fragment data to generate new utterance data. In the fourth embodiment, the fragment data combining unit 12 adds the speaker ID to the newly generated utterance data and stores it in the storage unit 13.
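The flow of the fourth embodiment can be sketched as follows. For brevity the fragments are represented by their text only, and the length of each new utterance is fixed as a number of fragments; both are simplifying assumptions, and `combine_for_speaker`, the dictionary layout and the sample data are hypothetical.

```python
import random
from typing import Dict, List, Optional


def combine_for_speaker(fragments: List[str], speaker_id: str,
                        fragments_per_utterance: int,
                        rng: Optional[random.Random] = None
                        ) -> List[Dict[str, object]]:
    """Randomly regroup one speaker's fragments and tag each result with the speaker ID."""
    if fragments_per_utterance < 1:
        raise ValueError("fragments_per_utterance must be at least 1")
    rng = rng or random.Random()
    shuffled = list(fragments)
    rng.shuffle(shuffled)  # random order conceals the original utterance content
    utterances = []
    for i in range(0, len(shuffled), fragments_per_utterance):
        utterances.append({
            "speaker_id": speaker_id,  # kept so data can later be selected per speaker
            "fragments": shuffled[i:i + fragments_per_utterance],
        })
    return utterances


stored = combine_for_speaker(
    ["turn left", "at the station", "play", "some music", "call home"],
    speaker_id="spk001", fragments_per_utterance=2, rng=random.Random(0))
```

Every fragment appears exactly once in the stored data, and only the grouping and the order differ from the original utterances.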

The storage device 4 of the fourth embodiment has been described above. Like the first embodiment, the storage device 4 of the fourth embodiment can store utterance data with its content concealed. In addition, because a speaker ID is attached to the stored utterance data, the uses of the stored data are broadened: for example, training can be performed using only the utterance data of the same speaker or, conversely, using the utterance data of a plurality of different speakers.

(Fifth Embodiment)
FIG. 5 is a diagram showing the configuration of a storage device 5 of a fifth embodiment. The first to fourth embodiments described devices that store utterance data with the utterance content concealed. The fifth embodiment describes a device that, in addition to concealing the utterance content, makes it difficult to identify the speaker and thus stores utterance data with consideration for anonymity.

The basic configuration of the storage device 5 of the fifth embodiment is the same as that of the storage device 4 of the fourth embodiment. In the storage device 5 of the fifth embodiment, the plurality of utterance data input at one time include utterance data from a plurality of speakers, and each utterance data is given a speaker ID that identifies its speaker.

In the fifth embodiment, whenever the fragment data combining unit 12 generates new utterance data, it always uses fragment data of a plurality of different speakers. That is, the fragment data combining unit 12 includes fragment data obtained from a plurality of speakers in each utterance data that it generates. Unlike in the fourth embodiment, the fragment data combining unit 12 does not attach a speaker ID to the generated utterance data.

The storage device 5 of the fifth embodiment has been described above. Like the first to fourth embodiments, the storage device 5 of the fifth embodiment can store utterance data with its content concealed. Furthermore, because each stored utterance data contains fragment data obtained from a plurality of different speakers, it becomes difficult to identify the speaker from the utterance data. If each utterance contains fragment data from k speakers, k-anonymization is achieved, in which the candidate speakers of that utterance data cannot be narrowed down to fewer than k.
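One way to enforce the multi-speaker condition is sketched below. It assumes that fragments are grouped by speaker ID, that each new utterance takes exactly one fragment from each of k distinct speakers, and that leftover fragments which cannot satisfy the condition are discarded; these choices, the function name `combine_across_speakers` and the sample data are assumptions made for illustration.

```python
import random
from typing import Dict, List, Optional


def combine_across_speakers(fragments_by_speaker: Dict[str, List[str]], k: int,
                            rng: Optional[random.Random] = None) -> List[List[str]]:
    """Build utterances that each contain one fragment from k distinct speakers."""
    if k < 2:
        raise ValueError("k must be at least 2 to mix speakers")
    rng = rng or random.Random()
    pools = {spk: list(frags) for spk, frags in fragments_by_speaker.items()}
    for frags in pools.values():
        rng.shuffle(frags)
    utterances = []
    while True:
        available = [spk for spk, frags in pools.items() if frags]
        if len(available) < k:
            break  # remaining fragments cannot satisfy the k-speaker condition
        chosen = rng.sample(available, k)
        utterance = [pools[spk].pop() for spk in chosen]
        rng.shuffle(utterance)
        utterances.append(utterance)  # stored without any speaker ID
    return utterances


fragments_by_speaker = {
    "A": ["A:turn left", "A:at the station"],
    "B": ["B:play", "B:some music"],
    "C": ["C:call", "C:home"],
}
utterances = combine_across_speakers(fragments_by_speaker, k=2,
                                     rng=random.Random(0))
```

The speaker prefix inside each sample string exists only so that the example can be checked; the stored utterances themselves carry no speaker ID, in line with the fifth embodiment.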

(Sixth Embodiment)
FIG. 6 is a diagram showing the configuration of a storage device 6 of a sixth embodiment, which is a modification of the second embodiment. The basic configuration of the storage device 6 of the sixth embodiment is the same as that of the first embodiment; in addition to the configuration of the storage device 5 of the fifth embodiment, the storage device 6 includes a speaker feature amount calculation unit 16 and a clustering unit 17.

The speaker feature amount calculation unit 16 has a function of calculating a feature amount of the speaker from the voice of the utterance data by using a speaker identification technique. In the present embodiment, a speaker embedding vector (distributed representation) is used as the feature amount of the speaker. As this vector, for example, an i-vector, which is a quantity derived from factor analysis, or an x-vector or d-vector obtained from the output of an intermediate layer of a speaker identification network can be used. FIG. 7(a) is a conceptual diagram showing that speaker vectors v1 to v9 are obtained from the utterance data of speakers 1 to 9.

The clustering unit 17 clusters the speakers of the original utterance data based on the embedding vectors obtained by the speaker feature amount calculation unit 16, so that speakers with similar acoustic feature amounts are placed in the same cluster.

FIG. 7(b) is a diagram showing an example in which speakers 1 to 9 are clustered based on the speaker vectors v1 to v9. Speaker 1, speaker 4, and speaker 5 are placed in the same cluster c1 because the speaker feature amounts represented by the vectors v1, v4, and v5 are similar.

The clustering unit 17 passes each cluster ID and the speaker IDs of the speakers included in that cluster to the fragment data combining unit 12. When generating new utterance data, the fragment data combining unit 12 uses fragment data of a plurality of speakers, and in doing so uses fragment data of a plurality of speakers included in the same cluster.
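The clustering step can be sketched with plain k-means over the speaker vectors. The embodiment does not name a particular clustering algorithm, so k-means, the Euclidean distance, the two-dimensional vectors and the initial centroids below are all assumptions chosen to keep the example small; `kmeans_speakers` is a hypothetical name.

```python
import math
from typing import Dict, List


def kmeans_speakers(vectors: Dict[str, List[float]],
                    initial_centroids: List[List[float]],
                    iterations: int = 10) -> Dict[int, List[str]]:
    """Group speaker IDs by the nearest centroid of their embedding vectors."""
    centroids = [list(c) for c in initial_centroids]
    clusters: Dict[int, List[str]] = {}
    for _ in range(iterations):
        # assignment step: each speaker joins the closest centroid
        clusters = {i: [] for i in range(len(centroids))}
        for spk, vec in vectors.items():
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(vec, centroids[i]))
            clusters[nearest].append(spk)
        # update step: move each centroid to the mean of its members
        for i, members in clusters.items():
            if members:  # an empty cluster keeps its previous centroid
                dim = len(centroids[i])
                centroids[i] = [sum(vectors[s][d] for s in members) / len(members)
                                for d in range(dim)]
    return clusters


# Made-up two-dimensional vectors; real i-vectors or x-vectors have many more dimensions.
vectors = {
    "spk1": [0.0, 0.1], "spk4": [0.1, 0.0], "spk5": [0.0, 0.0],
    "spk2": [5.0, 5.1], "spk3": [5.1, 5.0],
}
clusters = kmeans_speakers(vectors, initial_centroids=[[0.0, 0.0], [5.0, 5.0]])
```

The resulting mapping from cluster ID to speaker IDs corresponds to what the clustering unit 17 hands to the fragment data combining unit 12, which then mixes fragments only among speakers listed under the same cluster ID.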

The storage device 6 of the sixth embodiment has been described above. Because the storage device 6 of the sixth embodiment includes, in each utterance data, fragment data of a plurality of speakers with similar acoustic feature amounts, it can make it difficult to identify the speaker from the utterance data. Moreover, because the acoustic feature amounts are normalized, identifying the speaker from the utterance data becomes even more difficult.

In the storage device 6 of the sixth embodiment, the clustering unit 17 may pass each cluster ID and the speaker IDs of the speakers included in that cluster to the feature amount normalization unit 15, and the feature amount normalization unit 15 may use this cluster data to normalize the feature amounts among the speakers included in the same cluster. Using the cluster as the unit of normalization reduces the difference between each speaker's acoustic feature amounts and the normalized acoustic feature amounts.

(Seventh Embodiment)
FIG. 8 is a diagram showing the configuration of a storage device 7 of a seventh embodiment. The storage device 7 of the seventh embodiment quantitatively evaluates the degree of concealment of the utterance data stored in the storage unit 13 and generates utterance data based on the evaluation result.

In addition to the configuration of the storage device 2 of the second embodiment, the storage device 7 of the seventh embodiment includes a language model generation unit 18 that generates a language model based on the utterance data stored in the storage device 7, a language model storage unit 19 that stores the generated language model, and a language prediction unit 20 that performs language prediction based on the language model.

The language model generation unit 18 generates a language model based on the plurality of utterance data stored in the storage unit 13. As the language model, for example, an n-gram language model or a language model based on a recurrent neural network can be used.

The language prediction unit 20 uses the generated language model to perform language prediction on the text of the input utterance data and obtains the prediction accuracy. As the prediction accuracy, for example, the reciprocal of the perplexity can be used. When the reciprocal of the perplexity obtained by the language prediction unit 20 is equal to or greater than a certain threshold, it can be determined that the input utterance data has not been sufficiently concealed. The language prediction unit 20 may also determine the prediction accuracy by another method. For example, a language model may be generated from the text of the original utterance data, the prediction accuracy for the original utterance data may be obtained using that language model, and this accuracy may be compared with the prediction accuracy of the language model generated from the stored utterance data. When the difference in prediction accuracy between the two language models is equal to or less than a threshold, it can be determined that the input utterance data has not been sufficiently concealed.
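The first evaluation method can be sketched with a small bigram language model. The sketch assumes whitespace-tokenized text and add-one smoothing, neither of which is specified in the embodiment; `train_bigram`, `perplexity`, `insufficiently_concealed` and the sample sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigram(sentences: List[List[str]]) -> Model:
    """Count bigrams and their histories, with sentence boundary markers."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = {"</s>"}
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        vocab.update(words)
        for prev, cur in zip(tokens, tokens[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return history_counts, bigram_counts, len(vocab) + 1  # +1 slot for unseen words


def perplexity(model: Model, sentences: List[List[str]]) -> float:
    """Perplexity of the sentences under the add-one-smoothed bigram model."""
    history_counts, bigram_counts, vocab_size = model
    log_prob, n = 0.0, 0
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            p = (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + vocab_size)
            log_prob += math.log(p)
            n += 1
    if n == 0:
        raise ValueError("no text to evaluate")
    return math.exp(-log_prob / n)


def insufficiently_concealed(model: Model, original_text: List[List[str]],
                             threshold: float) -> bool:
    """True when the model built from stored data predicts the originals too well."""
    return 1.0 / perplexity(model, original_text) >= threshold


# Language model built from (hypothetical) stored text, applied to other text.
stored_model = train_bigram([["play", "music", "now"]])
pp_same_order = perplexity(stored_model, [["play", "music", "now"]])
pp_other_order = perplexity(stored_model, [["now", "music", "play"]])
```

A low perplexity, and hence a high reciprocal, means that the word order of the original utterances is still largely predictable from the stored data, which is the condition treated as insufficient concealment.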

When concealment is insufficient, the language prediction unit 20 notifies the fragment data generation unit 11 to that effect. On receiving this notification, the fragment data generation unit 11 shortens the fragment data that it generates. For example, if fragment data of around three clauses were originally being generated and concealment is determined to be insufficient, the fragment data generation unit 11 generates fragment data of two clauses or fewer. Using shorter fragment data divides the original utterance data more finely, which raises the degree of concealment.
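The feedback from the language prediction unit 20 to the fragment data generation unit 11 amounts to a simple loop, sketched below. The callback `inverse_perplexity_for`, which is assumed to regenerate the stored data with the given fragment length and return the resulting prediction accuracy, as well as the scores and the threshold, are hypothetical values used only to show the control flow.

```python
from typing import Callable


def choose_fragment_length(initial_clauses: int,
                           inverse_perplexity_for: Callable[[int], float],
                           threshold: float) -> int:
    """Shorten fragments until the prediction accuracy falls below the threshold."""
    clauses = initial_clauses
    # stop at one clause, since a fragment contains at least one clause
    while clauses > 1 and inverse_perplexity_for(clauses) >= threshold:
        clauses -= 1
    return clauses


# Made-up accuracies: shorter fragments make the originals harder to predict.
scores = {3: 0.30, 2: 0.12, 1: 0.05}
chosen_length = choose_fragment_length(3, lambda c: scores[c], threshold=0.2)
```

With these made-up scores the loop settles on fragments of two clauses, mirroring the example in the text where fragments of around three clauses are shortened to two clauses or fewer.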

The storage device 7 of the seventh embodiment has been described above. The storage device 7 of the seventh embodiment can quantitatively determine the degree of concealment of the utterance data stored in the storage unit 13. When the degree of concealment is low, it generates more finely divided fragment data and generates new utterance data from them, so that utterance data maintaining the required level of concealment can be stored.

(Eighth Embodiment)
FIG. 9 is a diagram showing the configuration of a storage device 8 of an eighth embodiment, which is a modification of the seventh embodiment. The basic configuration of the storage device 8 of the eighth embodiment is the same as that of the seventh embodiment, except that the storage device 8 generates a language model based on the input original utterance data and performs language prediction by applying that language model to the utterance data stored in the storage unit 13. This configuration also makes it possible to evaluate the degree of concealment of the utterance data stored in the storage unit 13.

The storage devices of the embodiments have been described above. An example of the hardware of these storage devices is a computer including a CPU, RAM, ROM, a hard disk, a display, a keyboard, a mouse, a communication interface, and the like. The storage devices described above are realized by storing, in the RAM or ROM, a program having modules that implement the functions described above and having the CPU execute the program. Such a program is also included in the scope of the present invention.

The present invention is useful as a technique for storing utterance data used for learning an acoustic model.

1 to 8 Storage device
10 Input unit
11 Fragment data generation unit
12 Fragment data combining unit
13 Storage unit
14 Acoustic feature amount generation unit
15 Feature amount normalization unit
16 Speaker feature amount calculation unit
17 Clustering unit
18 Language model generation unit
19 Language model storage unit
20 Language prediction unit

Claims

A storage device for storing utterance data that consists of data related to voice and text corresponding to the voice and that is used for learning an acoustic model, the storage device comprising:
an input unit that accepts input of a plurality of original utterance data each consisting of voice and text corresponding to the voice;
a fragment data generation unit that divides the original utterance data to generate a plurality of fragment data each consisting of voice containing one or more clauses and text corresponding to the voice;
a fragment data combining unit that randomly combines the plurality of fragment data generated from the plurality of original utterance data to generate a plurality of utterance data of a predetermined length; and
a storage unit that stores the plurality of utterance data generated by the fragment data combining unit.

The storage device according to claim 1, further comprising an acoustic feature amount generation unit that generates acoustic feature amounts from the voice input from the input unit,
wherein the storage unit stores the acoustic feature amounts of the utterance data as the data related to the voice.

The storage device according to claim 2, further comprising a feature amount normalization unit that normalizes the acoustic feature amounts generated by the acoustic feature amount generation unit,
wherein the storage unit stores the normalized acoustic feature amounts of the utterance data as the data related to the voice.

The storage device according to any one of claims 1 to 3, wherein the input unit accepts input of an identifier of the speaker of the utterance data together with the plurality of original utterance data, and
the storage unit stores the utterance data generated by the fragment data combining unit in association with the identifier of the speaker.

The storage device according to claim 1, wherein the input unit accepts input of identifiers of the speakers of the utterance data together with the plurality of original utterance data, and
the fragment data combining unit, based on the identifiers of the speakers, includes fragment data obtained from a plurality of speakers in each utterance data that it generates.

The storage device according to claim 5, further comprising:
a speaker feature amount calculation unit that obtains feature amounts of the speakers by applying a speaker identification technique to the plurality of original utterance data; and
a clustering unit that clusters the speakers based on the feature amounts of the speakers,
wherein the fragment data combining unit includes, in each utterance data that it generates, fragment data obtained from the utterance data of a plurality of speakers included in the same cluster.

The storage device according to claim 6, further comprising:
an acoustic feature amount generation unit that generates acoustic feature amounts from the voice; and
a feature amount normalization unit that normalizes the acoustic feature amounts of the utterance data of the plurality of speakers included in the same cluster.

The storage device according to any one of claims 1 to 7, wherein the fragment data generation unit performs morphological analysis and syntactic analysis on the input text to obtain clauses, and generates fragment data each containing one or more clauses based on the boundary positions of the clauses.

The storage device according to any one of claims 1 to 8, wherein the fragment data generation unit detects silent sections in the input voice, treats the silent sections as clause boundary positions, and generates fragment data each containing one or more clauses based on those boundary positions.

The storage device according to any one of claims 1 to 9, wherein the fragment data combining unit sets the predetermined length of the utterance data to be generated based on the distribution of the lengths of the plurality of original utterance data.

The storage device according to any one of claims 1 to 10, further comprising:
a language model generation unit that generates a language model based on the text stored in the storage unit; and
a language prediction unit that applies the language model to the text of the original utterance data to perform language prediction.

The storage device according to any one of claims 1 to 10, further comprising:
a language model generation unit that generates a language model based on the text of the original utterance data; and
a language prediction unit that applies the language model to the text stored in the storage unit to perform language prediction.

The storage device according to claim 11 or 12, wherein the fragment data generation unit shortens the length of the fragment data when the prediction accuracy obtained by the language prediction unit is higher than a predetermined threshold.

A storage method for storing, in a storage device, utterance data that consists of data related to voice and text corresponding to the voice and that is used for learning an acoustic model, the method comprising:
a step of accepting input of a plurality of original utterance data each consisting of voice and text corresponding to the voice;
a step of dividing the original utterance data to generate a plurality of fragment data each consisting of voice containing one or more clauses and text corresponding to the voice;
a step of randomly combining the plurality of fragment data generated from the plurality of original utterance data to generate a plurality of utterance data of a predetermined length; and
a step of storing the plurality of generated utterance data in the storage device.

A program for storing, in a storage device, utterance data that consists of data related to voice and text corresponding to the voice and that is used for learning an acoustic model, the program causing a computer to execute:
a step of accepting input of a plurality of original utterance data each consisting of voice and text corresponding to the voice;
a step of dividing the original utterance data to generate a plurality of fragment data each consisting of voice containing one or more clauses and text corresponding to the voice;
a step of randomly combining the plurality of fragment data generated from the plurality of original utterance data to generate a plurality of utterance data of a predetermined length; and
a step of storing the plurality of generated utterance data in the storage device.