JP5706368B2

JP5706368B2 - Speech conversion function learning device, speech conversion device, speech conversion function learning method, speech conversion method, and program

Info

Publication number: JP5706368B2
Application number: JP2012113439A
Authority: JP
Inventors: Hideyuki Mizuno (水野 秀之); Yusuke Ijima (井島 勇祐)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-05-17
Filing date: 2012-05-17
Publication date: 2015-04-22
Anticipated expiration: 2032-05-17
Also published as: JP2013238819A

Description

The present invention relates to a speech conversion technique that, when two parties with different pronunciation tendencies converse, converts the speaker's speech into speech that is easy for the listener to understand.

When a Japanese person talks abroad with someone whose native language is English or another language, the conversation is generally held in English or in the native language of that country. Although most Japanese people today can speak English to some extent, the great majority have only beginner-level conversational ability. Because it is difficult for such beginners to communicate in English abroad, automatic speech translation has long been expected to provide mutual translation between Japanese and English or other languages. Universities and companies have in fact pursued research on automatic speech translation, and field trials in real environments have been carried out. As a result, the technology has now reached a reasonably practical level when the usage scenario is restricted, for example to travel or reception desks.

Although automatic speech translation has thus advanced remarkably, many technical problems remain. Automatic speech translation requires three entirely different technologies to be integrated and operated as a whole: speech recognition to convert speech into text, machine translation to translate text in one language into text in another, and speech synthesis to convert the translated text into speech. This makes it difficult to improve the accuracy of the system as a whole. As noted above, each technology has been tuned by restricting the usage scenario so as to raise overall accuracy, but because the usage scenario is restricted, the technology has not come into general use.

A speaker can, on the other hand, improve conversational ability through language study. However, because pronunciation in a second language often differs fundamentally from that in the native language, a learner's pronunciation in the early stages of study may be hard for native speakers of that language to understand, or may be heard as a different sound. In addition, because a learner of a second language has limited listening ability in that language, the learner may be entirely unable to catch the pronunciation of a native speaker.

One conceivable approach is therefore to convert the speaker's pronunciation into speech that is easy for the listener to understand. Various methods for converting voice quality have been proposed. For example, Non-Patent Document 1 describes a voice quality conversion technique that records speech in which a specific individual and a different individual utter the same text, learns a conversion function representing the correspondence between the two sets of speech, and thereby converts speech uttered by the specific individual into speech resembling that uttered by the other individual.

Non-Patent Document 1: G. Baudoin, Y. Stylianou, “On the transformation of the speech spectrum for voice conversion”, Proc. of ICSLP 1996, Vol. 3, pp. 1405-1408, 1996.

However, the voice quality conversion technique described in Non-Patent Document 1 is intended to convert voice quality between one specific individual and another. If it is applied as is to a conversation between a speaker who is still learning a language and a native speaker of that language, even the speaker's voice quality is converted into a different one, which sounds unnatural to the listener.

The present invention has been made in view of this point, and its object is to provide a speech conversion technique that can convert speech into speech that is easy for the listener to understand while preserving the speaker's voice quality.

To solve the above problem, the speech conversion function learning device of the present invention groups a plurality of speakers by pronunciation tendency and learns speech conversion functions that convert speech uttered by a first speaker belonging to one group into speech resembling that uttered by a second speaker belonging to the other group, and that convert speech uttered by the second speaker into speech resembling that uttered by the first speaker. The speech conversion function learning device comprises a first speaker average voice model storage unit, a second speaker average voice model storage unit, a text storage unit, a first speaker speech synthesis unit, a second speaker speech synthesis unit, and a conversion function learning unit. The first speaker average voice model storage unit stores a first speaker average voice model generated by learning from first speaker speech uttered by a plurality of first speakers. The second speaker average voice model storage unit stores a second speaker average voice model generated by learning from second speaker speech uttered by a plurality of second speakers. The text storage unit stores arbitrary text. The first speaker speech synthesis unit synthesizes speech from the text using the first speaker average voice model to generate first speaker average voice synthesized speech. The second speaker speech synthesis unit synthesizes speech from the text using the second speaker average voice model to generate second speaker average voice synthesized speech. The conversion function learning unit uses the first speaker average voice synthesized speech and the second speaker average voice synthesized speech to learn the correspondence from first speaker speech to second speaker speech, and generates a first speech conversion function that takes speech uttered by a first speaker as input and outputs second-speaker-like speech resembling speech uttered by a second speaker. Using the same two sets of synthesized speech, it also learns the correspondence from second speaker speech to first speaker speech, and generates a second speech conversion function that takes speech uttered by a second speaker as input and outputs first-speaker-like speech resembling speech uttered by a first speaker.

The speech conversion device of the present invention groups a plurality of speakers by pronunciation tendency, converts speech uttered by a first speaker belonging to one group into speech resembling that uttered by a second speaker belonging to the other group, and converts speech uttered by the second speaker into speech resembling that uttered by the first speaker. The speech conversion device comprises a first speech conversion function storage unit, a second speech conversion function storage unit, a first speaker speech conversion unit, and a second speaker speech conversion unit. The first speech conversion function storage unit stores a first speech conversion function that takes speech uttered by a first speaker as input and outputs second-speaker-like speech resembling speech uttered by a second speaker. The second speech conversion function storage unit stores a second speech conversion function that takes speech uttered by a second speaker as input and outputs first-speaker-like speech resembling speech uttered by a first speaker. If the input speech was uttered by a first speaker, the first speaker speech conversion unit executes the first speech conversion function to convert the input speech into second-speaker-like speech. If the input speech was uttered by a second speaker, the second speaker speech conversion unit executes the second speech conversion function to convert the input speech into first-speaker-like speech. Here, the first speech conversion function is obtained by learning the correspondence from first speaker speech to second speaker speech using a first speaker average voice model, generated by learning from first speaker speech uttered by a plurality of first speakers, and a second speaker average voice model, generated by learning from second speaker speech uttered by a plurality of second speakers. The second speech conversion function is obtained by learning the correspondence from second speaker speech to first speaker speech using the first speaker average voice model and the second speaker average voice model.

According to the speech conversion technique of the present invention, when two parties with different pronunciation tendencies converse, speech can be converted into speech that is easy for the listener to understand while the speaker's voice quality is preserved, which enables smooth communication between the two parties.

FIG. 1 is a block diagram showing a configuration example of the speech conversion function learning device according to the first embodiment.
FIG. 2 is a block diagram showing a configuration example of the speech conversion device according to the first embodiment.
FIG. 3 is a flowchart showing an operation example of the speech conversion function learning device according to the first embodiment.
FIG. 4 is a flowchart showing an operation example of the speech conversion device according to the first embodiment.
FIG. 5 is a block diagram showing a configuration example of the speech conversion function learning device according to the second embodiment.
FIG. 6 is a block diagram showing a configuration example of the speech conversion device according to the second embodiment.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and duplicate descriptions are omitted.

[First Embodiment]
<Overview>
First, an overview of the first embodiment of the present invention is given. This embodiment uses a speech conversion function learning device 10 and a speech conversion device 20. Speakers are first divided into groups by pronunciation tendency, and speech from a plurality of speakers belonging to each group is collected. The speech conversion function learning device 10 treats speakers belonging to any one group as first speakers and learns from their speech to generate a first speaker average voice model. It treats speakers belonging to a different group as second speakers and learns from their speech to generate a second speaker average voice model. Using the first speaker average voice model and the second speaker average voice model, it then learns a first speech conversion function, which converts speech uttered by a first speaker into speech resembling that uttered by a second speaker, and a second speech conversion function, which converts speech uttered by a second speaker into speech resembling that uttered by a first speaker.

When the first and second speech conversion functions are learned, a sufficient number of texts are synthesized with the first speaker average voice model and with the second speaker average voice model, and a conversion function representing the correspondence between the two resulting sets of synthesized speech is learned.

When speech uttered by a first speaker is input, the speech conversion device 20 uses the first speech conversion function to convert it into speech resembling that uttered by a second speaker. When speech uttered by a second speaker is input, it uses the second speech conversion function to convert it into speech resembling that uttered by a first speaker.

An average voice model is an acoustic model of average voice quality built from the voices of many speakers. Therefore, if a sufficient amount of first speaker speech and second speaker speech can be collected, the voice quality of the first speaker average voice model and that of the second speaker average voice model can be made homogeneous. As a result, the difference between the two models can be expected to capture only the difference in pronunciation tendency between first speakers and second speakers. In other words, a conversion function representing the correspondence between the first speaker average voice model and the second speaker average voice model reflects each group's pronunciation tendency onto the input speech, in either direction. Consequently, when a first speaker and a second speaker converse, the speaker's speech is converted into speech that is easy for the listener to understand, which enables smooth communication between the two parties.

<Configuration>
A configuration example of the speech conversion function learning device 10 according to the first embodiment is described in detail with reference to FIG. 1. The speech conversion function learning device 10 comprises a first speaker model learning unit 110, a second speaker model learning unit 115, a first speaker speech synthesis unit 120, a second speaker speech synthesis unit 125, a conversion function learning unit 130, a first speaker speech storage unit 910, a second speaker speech storage unit 915, a first speaker average voice model storage unit 920, a second speaker average voice model storage unit 925, a text storage unit 930, a first speaker average voice synthesized speech storage unit 940, a second speaker average voice synthesized speech storage unit 945, a first speech conversion function storage unit 950, and a second speech conversion function storage unit 955. Each of the storage units 910, 915, 920, 925, 930, 940, 945, 950, and 955 can be implemented by, for example, an auxiliary storage device composed of a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.

A configuration example of the speech conversion device 20 according to the first embodiment is described in detail with reference to FIG. 2. The speech conversion device 20 comprises sound collection means 201, sound generation means 202, a first speaker speech conversion unit 210, a second speaker speech conversion unit 215, a first speech conversion function storage unit 950, and a second speech conversion function storage unit 955. The first speech conversion function storage unit 950 and the second speech conversion function storage unit 955 are configured in the same way as the corresponding storage units of the speech conversion function learning device 10.

<Speech conversion function learning process>
An operation example of the speech conversion function learning device 10 is described in detail with reference to FIG. 3, in the order in which the steps are actually performed.

The first speaker speech storage unit 910 of the speech conversion function learning device 10 stores first speaker speech, which is a plurality of utterances by a plurality of first speakers. A first speaker is a speaker belonging to one group selected from among a plurality of groups into which speakers have been divided in advance by pronunciation tendency. First speaker speech consists of actually recorded speech data and context information assigned to that speech data in advance, either automatically or by hand. Specifically, the context information includes morphemes, phonemes, accents, and the like. Many methods for automatically assigning context information to speech data have been proposed, so a detailed description is omitted here.
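As an illustration only, one way to hold such speech in memory is sketched below in Python. The class and field names (`ContextInfo`, `Utterance`, `group`, and so on) and the integer accent encoding are hypothetical; the specification states only that each recording is paired with context information such as morphemes, phonemes, and accents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ContextInfo:
    """Context information attached to one recording (hypothetical layout)."""
    morphemes: List[str]
    phonemes: List[str]
    accents: List[int]  # assumed encoding: one accent value per morpheme


@dataclass
class Utterance:
    """One recorded utterance together with its context information."""
    speaker_id: str
    group: str            # pronunciation-tendency group, e.g. "first" or "second"
    samples: List[float]  # recorded speech data
    context: ContextInfo


def utterances_for_group(corpus: List[Utterance], group: str) -> List[Utterance]:
    """Select every utterance whose speaker belongs to the given group."""
    return [u for u in corpus if u.group == group]
```

A storage unit such as 910 would then correspond to the subset returned by `utterances_for_group(corpus, "first")`.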

The second speaker speech storage unit 915 of the speech conversion function learning device 10 stores second speaker speech, which is a plurality of utterances by a plurality of second speakers. A second speaker is a speaker belonging to one group selected from among a plurality of groups into which speakers have been divided in advance by pronunciation tendency. The group to which the second speakers belong must differ from the group to which the first speakers belong. The first speakers and the second speakers are therefore two groups of speakers whose pronunciation tendencies differ from each other. Second speaker speech has the same structure as the first speaker speech described above, so its description is omitted here.

The text storage unit 930 of the speech conversion function learning device 10 stores arbitrary text supplied in advance. The text is preferably chosen with the intended application of the speech conversion technique in mind. Because the amount of text affects learning accuracy, as much text as possible is preferable.

The first speaker model learning unit 110 of the speech conversion function learning device 10 learns from the first speaker speech to generate the first speaker average voice model (S110). Various methods of learning an average voice have been proposed; for example, the method described in J. Yamagishi, M. Tamura, T. Masuko, K. Tokuda, T. Kobayashi, “A Training Method of Average Voice Model for HMM-Based Speech Synthesis”, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Vol. E86-A, No. 8, pp. 1956-1963 (Reference 1) can be used. The generated first speaker average voice model is stored in the first speaker average voice model storage unit 920.

The second speaker model learning unit 115 of the speech conversion function learning device 10 learns from the second speaker speech to generate the second speaker average voice model (S115). As with the first speaker average voice model, the average voice can be learned by any of various methods. The generated second speaker average voice model is stored in the second speaker average voice model storage unit 925.

The first speaker speech synthesis unit 120 of the speech conversion function learning device 10 synthesizes speech from the text stored in the text storage unit 930 using the first speaker average voice model, and generates first speaker average voice synthesized speech (S120). The first speaker average voice synthesized speech consists of the speech data generated by speech synthesis and the phoneme labels corresponding to that speech data. Various speech synthesis methods have been proposed; for example, the method described in K. Tokuda, H. Zen, A. W. Black, “An HMM-based speech synthesis system applied to English”, Proc. of 2002 IEEE SSW, 2002 (Reference 2) can be used. A phoneme label is information indicating the temporal position of each phoneme contained in the speech data. Because the temporal position of each phoneme is determined during speech synthesis, the labels are easily obtained from the synthesis process. The generated first speaker average voice synthesized speech is stored in the first speaker average voice synthesized speech storage unit 940.

The second speaker speech synthesis unit 125 of the speech conversion function learning device 10 synthesizes speech from the text stored in the text storage unit 930 using the second speaker average voice model, and generates second speaker average voice synthesized speech (S125). The second speaker average voice synthesized speech has the same structure as the first speaker average voice synthesized speech described above. As with the first speaker average voice synthesized speech, any of various speech synthesis methods can be used. The generated second speaker average voice synthesized speech is stored in the second speaker average voice synthesized speech storage unit 945.
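Because each synthesized utterance carries phoneme labels, the two synthesized versions of the same text can be paired phoneme by phoneme. The following Python sketch illustrates one way to do so under assumptions that the specification does not state: labels are `(phoneme, start, end)` tuples in seconds, both utterances contain the same phoneme sequence, and time points inside a phoneme are matched by linear interpolation. The name `paired_times` is hypothetical.

```python
from typing import List, Tuple

# (phoneme, start time in seconds, end time in seconds); an assumed label format
Label = Tuple[str, float, float]


def paired_times(labels_a: List[Label], labels_b: List[Label],
                 points_per_phoneme: int = 3) -> List[Tuple[float, float]]:
    """Return matched (time_in_a, time_in_b) pairs, phoneme by phoneme.

    Inside each phoneme the same relative positions are sampled in both
    utterances, so a phoneme that is longer in one utterance is stretched
    linearly onto its counterpart.
    """
    if [p for p, _, _ in labels_a] != [p for p, _, _ in labels_b]:
        raise ValueError("the two utterances must share one phoneme sequence")
    pairs = []
    for (_, a0, a1), (_, b0, b1) in zip(labels_a, labels_b):
        for k in range(points_per_phoneme):
            r = (k + 0.5) / points_per_phoneme  # relative position in the phoneme
            pairs.append((a0 + r * (a1 - a0), b0 + r * (b1 - b0)))
    return pairs
```

Feature vectors sampled at the paired times could then serve as the matched input and output samples for conversion function learning.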

The conversion function learning unit 130 of the speech conversion function learning device 10 learns the first speech conversion function using the first speaker average voice synthesized speech and the second speaker average voice synthesized speech. It also learns the second speech conversion function using the same two sets of synthesized speech. The first speech conversion function takes speech uttered by a first speaker as input and outputs second-speaker-like speech. Second-speaker-like speech resembles speech uttered by a second speaker; more precisely, it reflects the pronunciation tendency of the second speakers while preserving the voice quality of the first speaker. The second speech conversion function is the reverse of the first: it takes speech uttered by a second speaker as input and outputs first-speaker-like speech. First-speaker-like speech resembles speech uttered by a first speaker; more precisely, it reflects the pronunciation tendency of the first speakers while preserving the voice quality of the second speaker.

The method of learning the first and second speech conversion functions is now described in detail. Any of various known voice quality conversion techniques can be applied to learn the conversion functions; here the method described in Non-Patent Document 1 is used as an example. Non-Patent Document 1 discusses various acoustic models, but the description here assumes that the speech features are modeled by a multidimensional Gaussian mixture model (GMM).

Let x be the p-dimensional feature vector of the input speech, μ the mean of the input speech x, Σ the covariance matrix of the input speech x, α_i the weight of class i, and m the number of classes. The probability distribution p(x) of the input speech x modeled by a multidimensional Gaussian mixture can then be expressed by the following equation.
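The equation itself is not reproduced in this text. Under the definitions above, the standard Gaussian mixture density would read as follows; the class subscripts on μ and Σ and the normalization constraint on the weights are assumptions taken from the usual form of the model.

```latex
\begin{align*}
p(x) &= \sum_{i=1}^{m} \alpha_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0, \\
\mathcal{N}(x;\, \mu, \Sigma) &= \frac{1}{(2\pi)^{p/2} \, |\Sigma|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu)^{\top} \Sigma^{-1} (x - \mu) \right)
\end{align*}
```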

Here, let x be the input speech, y the output speech, μ_i^(x) the class-i mean of the input speech x, μ_i^(y) the class-i mean of the output speech y, Σ_i^(xx) the class-i covariance matrix of the input speech x, and Σ_i^(yx) the class-i cross-covariance matrix between the output speech y and the input speech x. The conversion function y = F(x) can then be expressed by the following equation.
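The equation is likewise not reproduced in this text. The form commonly used for GMM-based conversion, consistent with the symbols defined above, is sketched below as a reconstruction rather than a quotation. N(x; μ, Σ) denotes the p-dimensional normal density, h_i(x) is the posterior probability that x belongs to class i, and Σ_i^{(yx)} is the cross-covariance of output and input (the transpose of Σ_i^{(xy)}).

```latex
\begin{align*}
F(x) &= \sum_{i=1}^{m} h_i(x) \left[ \mu_i^{(y)}
      + \Sigma_i^{(yx)} \left( \Sigma_i^{(xx)} \right)^{-1}
        \left( x - \mu_i^{(x)} \right) \right], \\
h_i(x) &= \frac{\alpha_i \, \mathcal{N}\!\left(x;\, \mu_i^{(x)}, \Sigma_i^{(xx)}\right)}
               {\sum_{j=1}^{m} \alpha_j \, \mathcal{N}\!\left(x;\, \mu_j^{(x)}, \Sigma_j^{(xx)}\right)}
\end{align*}
```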

The parameters of the conversion function F(x), namely α_i, μ_i^(x), μ_i^(y), Σ_i^(xx), and Σ_i^(yx), can be estimated by the EM algorithm using a joint feature vector, as follows.
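The definition of the joint feature vector is not reproduced in this text. In the usual joint-density formulation, which is assumed here, time-aligned input and output vectors are stacked into one vector z, a Gaussian mixture is fitted to z by the EM algorithm, and the parameters of F(x) are read off from the blocks of the fitted means and covariances. The symbols z, μ_i^{(z)}, Σ_i^{(z)}, and Σ_i^{(yy)} are introduced only for this sketch.

```latex
\begin{align*}
z &= \begin{bmatrix} x \\ y \end{bmatrix}, \qquad
p(z) = \sum_{i=1}^{m} \alpha_i \, \mathcal{N}\!\left(z;\, \mu_i^{(z)}, \Sigma_i^{(z)}\right), \\
\mu_i^{(z)} &= \begin{bmatrix} \mu_i^{(x)} \\ \mu_i^{(y)} \end{bmatrix}, \qquad
\Sigma_i^{(z)} = \begin{bmatrix}
  \Sigma_i^{(xx)} & \Sigma_i^{(xy)} \\
  \Sigma_i^{(yx)} & \Sigma_i^{(yy)}
\end{bmatrix}
\end{align*}
```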

The first speech conversion function can be learned by taking the first speaker average voice synthesized speech as the input speech x and the second speaker average voice synthesized speech as the output speech y. Conversely, the second speech conversion function can be learned by taking the second speaker average voice synthesized speech as the input speech x and the first speaker average voice synthesized speech as the output speech y. Thus, as long as phoneme labels can be associated between one body of speech and another, the conversion functions describing their mutual correspondence can be learned easily, simply by swapping the input speech and the output speech. See Non-Patent Document 1 for a more detailed description of the conversion function learning method.
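The role swap can be illustrated with a deliberately simplified Python sketch. It assumes scalar features (p = 1) and a single class (m = 1), in which case EM reduces to the sample mean, variance, and covariance of the aligned pairs. `make_conversion` evaluates the usual GMM conversion formula for any number of scalar classes. All function names are hypothetical, and a practical system would use multidimensional spectral features and several classes.

```python
import math
from typing import Callable, List, Sequence, Tuple

# One class: (alpha, mu_x, mu_y, var_xx, cov_yx), all scalars in this sketch
ClassParams = Tuple[float, float, float, float, float]


def gaussian(x: float, mean: float, var: float) -> float:
    """One-dimensional normal density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def make_conversion(classes: Sequence[ClassParams]) -> Callable[[float], float]:
    """Build F(x) = sum_i h_i(x) * (mu_y_i + cov_yx_i / var_xx_i * (x - mu_x_i))."""
    def convert(x: float) -> float:
        weights = [a * gaussian(x, mx, vxx) for a, mx, _, vxx, _ in classes]
        total = sum(weights)
        if total == 0.0:  # x is far from every class: fall back to equal weights
            weights, total = [1.0] * len(classes), float(len(classes))
        return sum((w / total) * (my + cyx / vxx * (x - mx))
                   for w, (_, mx, my, vxx, cyx) in zip(weights, classes))
    return convert


def fit_single_class(xs: List[float], ys: List[float]) -> ClassParams:
    """Maximum-likelihood parameters for m = 1 from aligned pairs (x_t, y_t)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vxx = sum((x - mx) ** 2 for x in xs) / n
    cyx = sum((y - my) * (x - mx) for x, y in zip(xs, ys)) / n
    return (1.0, mx, my, vxx, cyx)


def learn_both(first: List[float], second: List[float]):
    """Learn both directions by swapping the roles of input and output."""
    first_fn = make_conversion([fit_single_class(first, second)])   # first -> second
    second_fn = make_conversion([fit_single_class(second, first)])  # second -> first
    return first_fn, second_fn
```

Calling `learn_both` once with aligned features from the two sets of synthesized speech yields both conversion functions, which mirrors the observation that swapping the two data sets is the only change needed.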

<Speech conversion processing>
With reference to FIG. 4, an operation example of the speech conversion device 20 is described in detail, following the order of the procedures actually performed.

The first speech conversion function storage unit 950 of the speech conversion device 20 stores the first speech conversion function learned by the speech conversion function learning device 10.

The second speech conversion function storage unit 955 of the speech conversion device 20 stores the second speech conversion function learned by the speech conversion function learning device 10.

The sound collection means 201 of the speech conversion device 20 converts the speech uttered by the speaker into a speech signal and inputs it to the speech conversion device 20 via an input terminal (not shown) (S201). The sound collection means 201 is typically a microphone.

The first speaker speech conversion unit 210 of the speech conversion device 20 determines whose utterance the speech signal input via the sound collection means 201 is (S205). Various methods are conceivable for determining the speaker of the input speech; for example, the speaker may be set manually. If the input speech was uttered by the first speaker, the unit executes the first speech conversion function stored in the first speech conversion function storage unit 950 to convert the input speech into second speaker-similar speech (S210). For details of the conversion method, see Non-Patent Document 1. The generated second speaker-similar speech is output to the sound generation means 202.

The second speaker speech conversion unit 215 of the speech conversion device 20 likewise determines whose utterance the speech signal input via the sound collection means 201 is (S205). Various methods are conceivable for determining the speaker of the input speech; for example, the speaker may be set manually. If the input speech was uttered by the second speaker, the unit executes the second speech conversion function stored in the second speech conversion function storage unit 955 to convert the input speech into first speaker-similar speech (S215). For details of the conversion method, see Non-Patent Document 1. The generated first speaker-similar speech is output to the sound generation means 202.
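
Executing a stored conversion function on one frame might look like the following sketch. It assumes scalar features and the posterior-weighted regression form commonly used for Gaussian mixture conversion. `ClassParams` and `apply_conversion` are hypothetical names, and the actual conversion method is the one described in Non-Patent Document 1.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClassParams:
    """Stored parameters of one class i (scalar-feature assumption)."""
    alpha: float   # class weight
    mu_x: float    # mean of the input speech
    mu_y: float    # mean of the output speech
    var_xx: float  # variance of the input speech
    cov_yx: float  # covariance between output and input speech


def _normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def apply_conversion(x: float, classes: Sequence[ClassParams]) -> float:
    """Convert one input frame x with the stored class parameters."""
    weighted = [c.alpha * _normal_pdf(x, c.mu_x, c.var_xx) for c in classes]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("input is too far from every class to weight reliably")
    # Sum over classes of posterior weight times the per-class regression.
    return sum(
        (w / total) * (c.mu_y + c.cov_yx / c.var_xx * (x - c.mu_x))
        for w, c in zip(weighted, classes)
    )
```

The same routine serves both S210 and S215; only the stored parameter set differs.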

The sound generation means 202 of the speech conversion device 20 converts the speech signal output by the speech conversion device 20 via an output terminal (not shown) into sound and emits it to the surroundings (S202). The sound generation means 202 is typically a loudspeaker. The speech signal output here is second speaker-similar speech if the input speech was uttered by the first speaker, and first speaker-similar speech if the input speech was uttered by the second speaker.
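
The flow from S205 through S210 or S215 can be sketched as a simple dispatch. The sketch assumes the manually set speaker mentioned above and treats the two learned conversion functions as opaque callables on scalar frames. The class and parameter names are hypothetical.

```python
from typing import Callable, List, Sequence

ConversionFunction = Callable[[float], float]


class SpeechConverter:
    """Holds both learned conversion functions and picks one per utterance."""

    def __init__(self, first_function: ConversionFunction,
                 second_function: ConversionFunction) -> None:
        self.first_function = first_function    # first speaker -> second speaker-similar
        self.second_function = second_function  # second speaker -> first speaker-similar

    def convert(self, frames: Sequence[float], speaker: str) -> List[float]:
        # S205: the speaker is supplied by the caller (manual setting).
        if speaker == "first":
            function = self.first_function      # S210
        elif speaker == "second":
            function = self.second_function     # S215
        else:
            raise ValueError(f"unknown speaker: {speaker!r}")
        return [function(frame) for frame in frames]
```

For example, `SpeechConverter(f, g).convert(frames, "first")` returns the frames that would be passed on to the sound generation means.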

<Effect>
In the first embodiment of the present invention, the speech conversion function learning device 10 synthesizes speech from the same text with the first speaker average voice model and with the second speaker average voice model, and learns conversion functions that represent the correspondence between the generated synthesized sounds. The speech conversion device 20 uses the conversion functions learned by the speech conversion function learning device 10 to convert speech uttered by the first speaker into speech similar to speech uttered by the second speaker, and to convert speech uttered by the second speaker into speech similar to speech uttered by the first speaker.

With this configuration, when two parties with different pronunciation tendencies converse, the speaker's speech can be converted into speech that is easy for the listener to understand while the speaker's voice quality is preserved, which enables smooth communication between the two parties.

[Second Embodiment]
<Overview>
First, an overview of the second embodiment of the present invention is given. This embodiment assumes that the invention is applied to a dialogue between a learner, for whom a certain language is not the native language and who has not sufficiently mastered it, and a native speaker of that language. That is, the first speaker of the first embodiment is taken to be the learner, and the second speaker of the first embodiment is taken to be the native speaker.

This embodiment uses a speech conversion function learning device 11 and a speech conversion device 21. First, speech of learners, for whom the target language is not the native language and who have not sufficiently mastered it, and speech of native speakers of that language are each collected in advance. The speech conversion function learning device 11 learns from the speech uttered by a plurality of learners to generate a learner average voice model, and learns from the speech uttered by a plurality of native speakers to generate a native speaker average voice model. Using the learner average voice model and the native speaker average voice model, it then learns a first speech conversion function that converts speech uttered by a learner into speech similar to speech uttered by a native speaker, and a second speech conversion function that converts speech uttered by a native speaker into speech similar to speech uttered by a learner.

When speech uttered by a learner is input, the speech conversion device 21 uses the first speech conversion function to convert it into speech similar to speech uttered by a native speaker. When speech uttered by a native speaker is input, it uses the second speech conversion function to convert it into speech similar to speech uttered by a learner.

As described above, given the characteristics of average voice models, the difference between the learner average voice model and the native speaker average voice model can be expected to represent only the difference in mastery of the target language. In other words, the conversion function representing the correspondence between the learner average voice model and the native speaker average voice model can be regarded as a function that applies the level of mastery of the target language to the input speech, in either direction. Consequently, when a learner and a native speaker converse, the speaker's speech is converted into speech that is easy for the listener to understand, which enables smooth communication between the two.

<Configuration>
With reference to FIG. 5, a configuration example of the speech conversion function learning device 11 according to the second embodiment is described in detail. The speech conversion function learning device 11 includes a learner model learning unit 111, a native speaker model learning unit 116, a learner speech synthesis unit 121, a native speaker speech synthesis unit 126, a conversion function learning unit 131, a learner speech storage unit 911, a native speaker speech storage unit 916, a learner average voice model storage unit 921, a native speaker average voice model storage unit 926, a text storage unit 931, a learner average voice synthesized sound storage unit 941, a native speaker average voice synthesized sound storage unit 946, a learner speech conversion function storage unit 951, and a native speaker speech conversion function storage unit 956. Each of the storage units 911, 916, 921, 926, 931, 941, 946, 951, and 956 can be implemented, for example, by an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.

With reference to FIG. 6, a configuration example of the speech conversion device 21 according to the second embodiment is described in detail. The speech conversion device 21 includes the sound collection means 201, the sound generation means 202, a learner speech conversion unit 211, a native speaker speech conversion unit 216, a learner speech conversion function storage unit 951, and a native speaker speech conversion function storage unit 956. The learner speech conversion function storage unit 951 and the native speaker speech conversion function storage unit 956 are configured in the same way as the corresponding storage units of the speech conversion function learning device 11.

<Differences from the first embodiment>
The differences between this embodiment and the first embodiment are as follows. The speech conversion function learning processing and the speech conversion processing are basically the same in the first and second embodiments. In the second embodiment, the first speaker of the first embodiment is a learner, that is, a speaker for whom a certain language is not the native language and who has not sufficiently mastered it, and the second speaker of the first embodiment is a native speaker, that is, a speaker whose native language is that language. Learners who have not sufficiently mastered a language tend to pronounce it inaccurately, in a way close to the pronunciation of their own native language, so they can be treated as a group with similar pronunciation tendencies. Likewise, native speakers of a language pronounce it appropriately, so they too can be treated as a group with similar pronunciation tendencies. For example, if the target language is English, the learners may be Japanese speakers who have not sufficiently mastered English, and the native speakers may be Americans whose native language is English.

Specifically, the learner speech stored in the learner speech storage unit 911, the native speaker speech stored in the native speaker speech storage unit 916, and the arbitrary text stored in the text storage unit 931 must all be in the same target language, which is the native language of the native speakers and not of the learners. The speech input from the sound collection means 201 of the speech conversion device 21 must also be uttered in that language. In the example above, the learner speech and the native speaker speech must both be uttered in English, and the speaker's speech input to the speech conversion device 21 must also be English.
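
This constraint could be checked before learning starts. The sketch below assumes each data source carries a language tag; the `Corpus` type, the tag values, and the function name are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Corpus:
    name: str      # e.g. "learner speech", "native speaker speech", "text"
    language: str  # hypothetical language tag such as "en"


def check_target_language(target_language: str, corpora: Iterable[Corpus]) -> None:
    """Raise ValueError if any data source is not in the target language."""
    mismatched = [c.name for c in corpora if c.language != target_language]
    if mismatched:
        raise ValueError(
            f"not in target language {target_language!r}: {', '.join(mismatched)}"
        )
```

With English as the target language, a learner corpus tagged `"ja"` would be rejected, while learner speech, native speaker speech, and text all tagged `"en"` would pass.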

<Effect>
In this embodiment, the speech conversion function learning device 11 synthesizes speech from the same text with the learner average voice model and with the native speaker average voice model, and learns conversion functions that represent the correspondence between the generated synthesized sounds. The speech conversion device 21 uses the conversion functions learned by the speech conversion function learning device 11 to convert speech uttered by a learner into speech similar to speech uttered by a native speaker, and to convert speech uttered by a native speaker into speech similar to speech uttered by a learner.

With this configuration, speech uttered by a learner, for whom a certain language is not the native language and who has not sufficiently mastered it, can be converted into speech that is easy for a native speaker of that language to understand, while the speaker's voice quality is preserved. Even when the learner is at an early stage of acquiring the language, this enables smooth communication from the learner to the native speaker.

Similarly, speech uttered by a native speaker of a certain language can be converted into speech that is easy to understand for a learner, for whom that language is not the native language and who has not sufficiently mastered it, while the speaker's voice quality is preserved. Even when the learner is at an early stage of acquiring the language, this enables smooth communication from the native speaker to the learner.

[Program, recording medium]
Needless to say, the present invention is not limited to the embodiments described above and can be modified as appropriate without departing from its spirit. The various kinds of processing described in the embodiments need not be executed in time series in the order described; they may also be executed in parallel or individually, depending on the processing capacity of the device that executes them or as needed.

When the various processing functions of each device described in the embodiments are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the various processing functions of each device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disk, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or it may execute processing according to the received program each time a program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which no program is transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (for example, data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the device is configured by executing a predetermined program on a computer; however, at least part of the processing content may instead be implemented in hardware.

10, 11 Speech conversion function learning device
20, 21 Speech conversion device
110 First speaker model learning unit
111 Learner model learning unit
115 Second speaker model learning unit
116 Native speaker model learning unit
120 First speaker speech synthesis unit
121 Learner speech synthesis unit
125 Second speaker speech synthesis unit
126 Native speaker speech synthesis unit
130, 131 Conversion function learning unit
201 Sound collection means
202 Sound generation means
210 First speaker speech conversion unit
211 Learner speech conversion unit
215 Second speaker speech conversion unit
216 Native speaker speech conversion unit
910 First speaker speech storage unit
911 Learner speech storage unit
915 Second speaker speech storage unit
916 Native speaker speech storage unit
920 First speaker average voice model storage unit
921 Learner average voice model storage unit
925 Second speaker average voice model storage unit
926 Native speaker average voice model storage unit
930, 931 Text storage unit
940 First speaker average voice synthesized sound storage unit
941 Learner average voice synthesized sound storage unit
945 Second speaker average voice synthesized sound storage unit
946 Native speaker average voice synthesized sound storage unit
950 First speech conversion function storage unit
951 Learner speech conversion function storage unit
955 Second speech conversion function storage unit
956 Native speaker speech conversion function storage unit

Claims

A speech conversion function learning device that, with a plurality of speakers grouped according to pronunciation tendency, learns speech conversion functions for converting speech uttered by a first speaker belonging to one group into speech similar to speech uttered by a second speaker belonging to the other group, and for converting speech uttered by the second speaker into speech similar to speech uttered by the first speaker, the device comprising:
a first speaker average voice model storage unit that stores a first speaker average voice model generated by learning first speaker speech uttered by a plurality of the first speakers;
a second speaker average voice model storage unit that stores a second speaker average voice model generated by learning second speaker speech uttered by a plurality of the second speakers;
a text storage unit that stores arbitrary text;
a first speaker speech synthesis unit that synthesizes speech from the text using the first speaker average voice model to generate a first speaker average voice synthesized sound;
a second speaker speech synthesis unit that synthesizes speech from the text using the second speaker average voice model to generate a second speaker average voice synthesized sound; and
a conversion function learning unit that learns the correspondence from the first speaker speech to the second speaker speech using the first speaker average voice synthesized sound and the second speaker average voice synthesized sound to generate a first speech conversion function that takes speech uttered by the first speaker as input and outputs second speaker-similar speech similar to speech uttered by the second speaker, and that learns the correspondence from the second speaker speech to the first speaker speech using the first speaker average voice synthesized sound and the second speaker average voice synthesized sound to generate a second speech conversion function that takes speech uttered by the second speaker as input and outputs first speaker-similar speech similar to speech uttered by the first speaker.

The speech conversion function learning device according to claim 1, wherein
for the first speaker average voice synthesized sound and the second speaker average voice synthesized sound, the probability distribution of the feature vectors is modeled by a multidimensional mixed normal distribution, and
the conversion function learning unit estimates the parameters of the first speech conversion function using a combined feature vector obtained by combining the first speaker average voice synthesized sound and the second speaker average voice synthesized sound, with the first speaker average voice synthesized sound as input, and estimates the parameters of the second speech conversion function using the combined feature vector, with the second speaker average voice synthesized sound as input.

A speech conversion device that, with a plurality of speakers grouped according to pronunciation tendency, converts speech uttered by a first speaker belonging to one group into speech similar to speech uttered by a second speaker belonging to the other group, and converts speech uttered by the second speaker into speech similar to speech uttered by the first speaker, the device comprising:
a first speech conversion function storage unit that stores a first speech conversion function that takes speech uttered by the first speaker as input and outputs second speaker-similar speech similar to speech uttered by the second speaker;
a second speech conversion function storage unit that stores a second speech conversion function that takes speech uttered by the second speaker as input and outputs first speaker-similar speech similar to speech uttered by the first speaker;
a first speaker speech conversion unit that, if input speech is speech uttered by the first speaker, converts the input speech into the second speaker-similar speech by executing the first speech conversion function; and
a second speaker speech conversion unit that, if the input speech is speech uttered by the second speaker, converts the input speech into the first speaker-similar speech by executing the second speech conversion function,
wherein
the first speech conversion function is obtained by learning the correspondence from first speaker speech to second speaker speech using a first speaker average voice model generated by learning the first speaker speech uttered by a plurality of the first speakers and a second speaker average voice model generated by learning the second speaker speech uttered by a plurality of the second speakers, and
the second speech conversion function is obtained by learning the correspondence from the second speaker speech to the first speaker speech using the first speaker average voice model and the second speaker average voice model.

The speech conversion device according to claim 3, wherein
the first speech conversion function is obtained by learning the correspondence from the first speaker speech to the second speaker speech using a first speaker average voice synthesized sound obtained by synthesizing speech from arbitrary text using the first speaker average voice model and a second speaker average voice synthesized sound obtained by synthesizing speech from the text using the second speaker average voice model, and
the second speech conversion function is obtained by learning the correspondence from the second speaker speech to the first speaker speech using the first speaker average voice synthesized sound and the second speaker average voice synthesized sound.

The speech conversion device according to claim 4, wherein
for the first speaker average voice synthesized sound and the second speaker average voice synthesized sound, the probability distribution is modeled by a multidimensional mixed normal distribution,
the first speech conversion function uses parameters estimated using a combined feature vector obtained by combining the first speaker average voice synthesized sound and the second speaker average voice synthesized sound, with the first speaker average voice synthesized sound as input, and
the second speech conversion function uses parameters estimated using the combined feature vector, with the second speaker average voice synthesized sound as input.

A speech conversion function learning method that, with a plurality of speakers grouped according to pronunciation tendency, learns speech conversion functions for converting speech uttered by a first speaker belonging to one group into speech similar to speech uttered by a second speaker belonging to the other group, and for converting speech uttered by the second speaker into speech similar to speech uttered by the first speaker, the method comprising:
a first speaker speech synthesis step of synthesizing speech from arbitrary text using a first speaker average voice model generated by learning first speaker speech uttered by a plurality of the first speakers, to generate a first speaker average voice synthesized sound;
a second speaker speech synthesis step of synthesizing speech from the text using a second speaker average voice model generated by learning second speaker speech uttered by a plurality of the second speakers, to generate a second speaker average voice synthesized sound; and
a conversion function learning step of learning the correspondence from the first speaker speech to the second speaker speech using the first speaker average voice synthesized sound and the second speaker average voice synthesized sound to generate a first speech conversion function that takes speech uttered by the first speaker as input and outputs second speaker-similar speech similar to speech uttered by the second speaker, and of learning the correspondence from the second speaker speech to the first speaker speech using the first speaker average voice synthesized sound and the second speaker average voice synthesized sound to generate a second speech conversion function that takes speech uttered by the second speaker as input and outputs first speaker-similar speech similar to speech uttered by the first speaker.

A speech conversion method that, with a plurality of speakers grouped according to pronunciation tendency, converts speech uttered by a first speaker belonging to one group into speech similar to speech uttered by a second speaker belonging to the other group, and converts speech uttered by the second speaker into speech similar to speech uttered by the first speaker, the method comprising:
a first speaker speech conversion step of, if input speech is speech uttered by the first speaker, converting the input speech into second speaker-similar speech by executing a first speech conversion function that takes speech uttered by the first speaker as input and outputs the second speaker-similar speech similar to speech uttered by the second speaker; and
a second speaker speech conversion step of, if the input speech is speech uttered by the second speaker, converting the input speech into first speaker-similar speech by executing a second speech conversion function that takes speech uttered by the second speaker as input and outputs the first speaker-similar speech similar to speech uttered by the first speaker,
wherein
the first speech conversion function is obtained by learning the correspondence from first speaker speech to second speaker speech using a first speaker average voice model generated by learning the first speaker speech uttered by a plurality of the first speakers and a second speaker average voice model generated by learning the second speaker speech uttered by a plurality of the second speakers, and
the second speech conversion function is obtained by learning the correspondence from the second speaker speech to the first speaker speech using the first speaker average voice model and the second speaker average voice model.

A program for causing a computer to function as the speech conversion function learning device according to claim 1 or 2 or the speech conversion device according to any one of claims 3 to 5.