JP2007192931A

JP2007192931A - Voice pattern conversion/dubbing system, and program

Info

Publication number: JP2007192931A
Application number: JP2006009161A
Authority: JP
Inventors: Akihiro Okamoto; 明浩岡本
Original assignee: Asahi Kasei Corp
Current assignee: Asahi Kasei Corp
Priority date: 2006-01-17
Filing date: 2006-01-17
Publication date: 2007-08-02
Anticipated expiration: 2026-01-17
Also published as: JP4769086B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice pattern conversion/dubbing system capable of performing dubbing accompanied by voice pattern conversion between different languages, and a program.

SOLUTION: A conversion filter generation section 102 of the voice pattern conversion/dubbing system 100 generates a voice pattern conversion filter 1 by learning voice data 3 for a filter uttered and collected in English by an actor 2 and voice data 5 for a filter uttered and collected in English by a voice actor 4. A voice pattern conversion section 103 converts the voice uttered in Japanese by the voice actor 4 to the voice of the actor 2 by using the voice pattern conversion filter 1.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a voice quality conversion dubbing system, and a program, for performing dubbing accompanied by voice quality conversion.

Conventionally, a voice quality conversion technique is known that converts speech uttered by one speaker (the source speaker) into the voice of another speaker (the target speaker) (see, for example, Patent Document 1). In Patent Document 1, the voice of a person watching a drama is converted into the voice of an actor playing a character in that drama. The technique matches speech feature parameters between the source speaker and the target speaker, and synthesizes speech from the target speaker's feature parameters so that it is aligned in time with the source speaker's speech.
Patent Document 1: JP-A-4-240900

When the lines of actors appearing in foreign films and dramas are rendered in Japanese, the dubbing is generally performed by Japanese voice actors. Viewers, however, also want dubbing in the voice quality of the actor who actually plays the role rather than that of the Japanese voice actor. A well-known foreign actor's distinctive voice is often part of that actor's character, so when the dubbed voice differs greatly from the actor's own voice, viewers feel a sense of incongruity or are disappointed. The conventional voice quality conversion technique described above converts the speech of a source speaker who utters the same lines as the target speaker into the target speaker's voice. Rendering the lines in Japanese therefore requires the target speaker, that is, the foreign actor, to utter the Japanese lines. If the foreign actor does utter the Japanese lines, any unnatural intonation is replaced by the intonation of the Japanese voice actor, so the unnaturalness is eliminated. In most cases, however, foreign actors do not speak Japanese, and even if they can, having them utter every line in Japanese is not practical.

The present invention has been made to solve the conventional problem described above, and provides a voice quality conversion dubbing system, and a program, that make it possible to perform dubbing accompanied by voice quality conversion between different languages.

To solve the above problem, the invention provides a voice quality conversion dubbing system comprising: conversion filter creation means for creating, based on speech of a first speaker and a second speaker uttered in a first language, a first speaker conversion filter for converting the second speaker's voice into the first speaker's voice; and voice quality conversion means for converting speech uttered by the second speaker in a second language, different from the first language, into the first speaker's voice using the first speaker conversion filter.

According to the present invention, the first speaker conversion filter is created by learning from speech of the first and second speakers uttered in the first language, so that speech uttered by the second speaker in the second language can be converted into the first speaker's voice using that filter. Therefore, even if the first speaker cannot speak the second language, speech uttered by the second speaker in the second language can easily be converted into the first speaker's voice, provided the second speaker can speak both the first and second languages. Dubbing accompanied by voice quality conversion between different languages thus becomes possible.

The invention according to claim 2 is the voice quality conversion dubbing system according to claim 1, wherein the conversion filter creation means creates, based on speech of the second speaker and a third speaker uttered in the second language, a second speaker conversion filter for converting the third speaker's voice into the second speaker's voice, and the voice quality conversion means converts speech uttered by the third speaker in the second language into the second speaker's voice using the second speaker conversion filter, and then converts that second-speaker voice into the first speaker's voice using the first speaker conversion filter.

According to this invention, speech uttered by the third speaker in the second language can be converted into the first speaker's voice even if the third speaker can speak only the second language, which differs from the first language spoken by the first speaker, and even if the first speaker cannot speak the second language. This is achieved by creating the first speaker conversion filter from speech of the first and second speakers uttered in the first language, and the second speaker conversion filter from speech of the second and third speakers uttered in the second language. Dubbing accompanied by voice quality conversion between different languages thus becomes possible.

The invention according to claim 3 is characterized in that the voice quality conversion means converts speech uttered by the third speaker in the second language into the first speaker's voice using a conversion filter obtained by combining the second speaker conversion filter and the first speaker conversion filter.
According to this invention, performing voice quality conversion with the combined conversion filter shortens the conversion processing time compared with applying the second speaker conversion filter and the first speaker conversion filter in sequence.

The invention according to claim 4 is the voice quality conversion dubbing system according to any one of claims 1 to 3, wherein the conversion filter creation means is provided in a server device and the voice quality conversion means is provided in a client device.
According to this invention, the client device can perform voice quality conversion using a conversion filter created by the server device.

The invention according to claim 5 provides a program for causing a computer to execute: a conversion filter creation step of creating, based on speech of a first speaker and a second speaker uttered in a first language, a first speaker conversion filter for converting the second speaker's voice into the first speaker's voice; and a voice quality conversion step of converting speech uttered by the second speaker in a second language, different from the first language, into the first speaker's voice using the first speaker conversion filter.

According to this invention, storing the program in a computer makes it easy to perform dubbing accompanied by voice quality conversion between different languages.
The invention according to claim 6 is the program according to claim 5, further causing the computer to execute: in the conversion filter creation step, a process of creating, based on speech of the second speaker and a third speaker uttered in the second language, a second speaker conversion filter for converting the third speaker's voice into the second speaker's voice; and, in the voice quality conversion step, a process of converting speech uttered by the third speaker in the second language into the second speaker's voice using the second speaker conversion filter, and then converting that second-speaker voice into the first speaker's voice using the first speaker conversion filter.

According to this invention, storing the program in a computer makes it possible to create the first and second speaker conversion filters and, using these filters, to convert speech uttered by the third speaker in the second language into the voice of the first speaker, who speaks the first language. Dubbing accompanied by voice quality conversion between different languages thus becomes possible.

Furthermore, even if the third speaker can speak only the second language, which differs from the first language spoken by the first speaker, speech uttered by the third speaker in the second language can be converted into the first speaker's voice. This is achieved by creating the first speaker conversion filter from speech of the first and second speakers uttered in the first language, and the second speaker conversion filter from speech of the second and third speakers uttered in the second language.

Embodiments of the present invention are described below with reference to the drawings.
(First Embodiment)
First, a first embodiment of the present invention will be described. The voice quality conversion dubbing system 100 according to the first embodiment consists of one or more devices.
FIG. 1 shows the functional configuration of the voice quality conversion dubbing system 100. As shown in the figure, the system includes a voice collection unit 101, a conversion filter creation unit 102, and a voice quality conversion unit 103.

The voice collection unit 101 includes a microphone and a recording device. It collects, with the microphone, speech uttered by the first speaker and by the second speaker in the same language (the first language), and records it in the recording device as voice data.
The conversion filter creation unit 102 performs learning using the voice data of the first and second speakers collected by the voice collection unit 101, and creates voice quality conversion filter 1 (the first speaker conversion filter) for converting the second speaker's voice into the first speaker's voice. Here, a conversion filter may take the form of a function that performs the computation, a conversion table, or the like. As the learning method, for example, the known feature conversion method based on a Gaussian mixture model (GMM) can be used. Any other known method may also be used to create filter 1.

The voice quality conversion unit 103 uses voice quality conversion filter 1, created by the conversion filter creation unit 102, to convert speech uttered by the second speaker in a language different from the first language (the second language) into the first speaker's voice.
The functions of the conversion filter creation unit 102 and the voice quality conversion unit 103 are realized by the CPU of each device constituting the voice quality conversion dubbing system 100 executing processing according to a program stored in memory.

Next, an operation example of the voice quality conversion dubbing system 100 according to this embodiment is described with reference to FIG. 2. The example creates Japanese dubbed voice data, taking the voice of actor 2 (the first speaker), who speaks only English, as the target voice and a bilingual voice actor 4 (the second speaker) as the source speaker.
First, voice quality conversion filter 1 is created for converting the voice of the bilingual voice actor 4 into the voice of actor 2.

Specifically, about 50 English sentences are prepared for creating voice quality conversion filter 1. Actor 2 reads these sentences aloud, and the voice collection unit 101 obtains filter voice data 3 (step S101).
According to Carnegie Mellon University (http://www.speech.cs.cmu.edu/cgi-bin/cmudict), English has 39 phonemes, as shown in FIG. 4. Japanese, on the other hand, is taken to have 19 phonemes, as shown in FIG. 5, and each Japanese phoneme is associated with an English phoneme, as shown in FIG. 6. The roughly 50 English sentences are designed to contain every English phoneme that is associated with a Japanese phoneme.
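The coverage requirement on the training script can be checked mechanically. The following is a minimal sketch, assuming each candidate sentence has already been transcribed into phoneme symbols; the function names, the three toy sentences, and the `required` set are all hypothetical and are not taken from FIGS. 4 to 6.

```python
def missing_phonemes(sentences, required):
    """Return the required phonemes that no sentence in the script contains."""
    covered = set()
    for phonemes in sentences.values():
        covered.update(phonemes)
    return set(required) - covered


def greedy_script(sentences, required):
    """Greedily pick sentences until every required phoneme is covered."""
    remaining = set(required)
    chosen = []
    while remaining:
        # Pick the sentence that covers the most still-missing phonemes.
        best = max(sentences, key=lambda s: len(remaining & set(sentences[s])))
        gain = remaining & set(sentences[best])
        if not gain:
            break  # the candidate pool cannot cover what is left
        chosen.append(best)
        remaining -= gain
    return chosen, remaining


# Toy candidate pool with ARPAbet-style symbols (invented for illustration).
candidates = {
    "see a cat": ["S", "IY", "AH", "K", "AE", "T"],
    "so many": ["S", "OW", "M", "EH", "N", "IY"],
    "a cat": ["AH", "K", "AE", "T"],
}
# Stand-in for "English phonemes associated with a Japanese phoneme".
required = {"S", "K", "T", "M", "N", "IY", "OW"}

script, uncovered = greedy_script(candidates, required)
```

A greedy pick is only one way to build such a script; the source states the coverage goal but not how the sentences were chosen.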

Next, the bilingual voice actor 4, who speaks both English and Japanese fluently, reads the prepared English sentences aloud, and the voice collection unit 101 obtains filter voice data 5 (step S102).
The conversion filter creation unit 102 then uses the feature conversion method based on a Gaussian mixture model (GMM) to create voice quality conversion filter 1, which converts the voice quality of filter voice data 5 into that of filter voice data 3 (step S103).
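Training on two recordings of the same sentences presupposes that their frames are paired, because the two speakers will not read at the same speed. The source does not say how this pairing is done; dynamic time warping (DTW) is one common choice and is assumed here purely for illustration. The sketch works on scalar features, and `dtw_pairs` is a hypothetical helper name.

```python
def dtw_pairs(source, target):
    """Align two 1-D feature sequences and return matched (i, j) frame pairs."""
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(source[i - 1] - target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the warping path.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        pairs.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # advance both sequences
            (cost[i - 1][j], i - 1, j),          # advance source only
            (cost[i][j - 1], i, j - 1),          # advance target only
        ]
        _, i, j = min(steps)
    pairs.reverse()
    return pairs


# The target speaker holds the middle sound for one extra frame.
alignment = dtw_pairs([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

Each returned pair names one source frame and one target frame that would be treated as a training sample (x, y).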

The following describes in detail how a voice quality conversion filter is created using the feature conversion method based on a Gaussian mixture model (GMM) (see, for example, A. Kain and M. W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, pp. 285-288, Seattle, U.S.A., May 1998).
The feature vector x of the source speaker's speech and the feature vector y of the target speaker's speech, paired frame by frame in the time domain, are written as

$$x = [x_1, x_2, \ldots, x_p]^T, \qquad y = [y_1, y_2, \ldots, y_p]^T$$

where p is the number of feature dimensions and T denotes transposition. In the GMM, the probability distribution p(x) of the speech feature x is written as

$$p(x) = \sum_{i=1}^{m} \alpha_i \, N(x; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0$$

where α_i is the weight of class i and m is the number of classes. N(x; μ_i, Σ_i) is the normal distribution with mean vector μ_i and covariance matrix Σ_i for class i:

$$N(x; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{p/2} \, |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right]$$

The conversion function F(x), which converts the feature x of the source speaker's speech into the feature y of the target speaker's speech, is written as

$$F(x) = \sum_{i=1}^{m} h_i(x) \left[ \mu_i^{(y)} + \Sigma_i^{(yx)} \left( \Sigma_i^{(xx)} \right)^{-1} \left( x - \mu_i^{(x)} \right) \right]$$

where μ_i^(x) and μ_i^(y) are the mean vectors of x and y in class i, Σ_i^(xx) is the covariance matrix of x in class i, and Σ_i^(yx) is the cross-covariance matrix of y and x in class i. h_i(x) is

$$h_i(x) = \frac{\alpha_i \, N(x; \mu_i^{(x)}, \Sigma_i^{(xx)})}{\sum_{j=1}^{m} \alpha_j \, N(x; \mu_j^{(x)}, \Sigma_j^{(xx)})}$$

The conversion function F(x) is learned by estimating the conversion parameters (α_i, μ_i^(x), μ_i^(y), Σ_i^(xx), Σ_i^(yx)). The joint feature vector z of x and y is defined as

$$z = [x^T, y^T]^T$$

The probability distribution p(z) of z is expressed by the GMM as

$$p(z) = \sum_{i=1}^{m} \alpha_i \, N(z; \mu_i^{(z)}, \Sigma_i^{(z)}), \qquad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0$$

where the covariance matrix Σ_i^(z) and the mean vector μ_i^(z) of z in class i are

$$\Sigma_i^{(z)} = \begin{bmatrix} \Sigma_i^{(xx)} & \Sigma_i^{(xy)} \\ \Sigma_i^{(yx)} & \Sigma_i^{(yy)} \end{bmatrix}, \qquad \mu_i^{(z)} = \begin{bmatrix} \mu_i^{(x)} \\ \mu_i^{(y)} \end{bmatrix}$$

The conversion parameters (α_i, μ_i^(x), μ_i^(y), Σ_i^(xx), Σ_i^(yx)) can be estimated with the well-known EM algorithm.
No linguistic information such as text is used in learning; feature extraction and GMM learning are performed fully automatically by computer.
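To make the role of the conversion parameters concrete, the sketch below evaluates the conversion function F(x) for scalar features (p = 1), where each covariance matrix reduces to a single number. It assumes the parameters have already been estimated; the dictionary keys and the numeric values are hypothetical, and EM training itself is not shown.

```python
import math


def gauss(x, mean, var):
    """Density of the 1-D normal distribution N(x; mean, var)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def convert(x, classes):
    """Evaluate F(x) = sum_i h_i(x) * [mu_y + s_yx / s_xx * (x - mu_x)]."""
    # Unnormalised class posteriors alpha_i * N(x; mu_x, s_xx).
    weights = [c["alpha"] * gauss(x, c["mu_x"], c["s_xx"]) for c in classes]
    total = sum(weights)
    y = 0.0
    for w, c in zip(weights, classes):
        h = w / total  # posterior probability h_i(x) of class i
        y += h * (c["mu_y"] + c["s_yx"] / c["s_xx"] * (x - c["mu_x"]))
    return y


# Hypothetical parameters for m = 2 classes (made-up numbers).
classes = [
    {"alpha": 0.5, "mu_x": 0.0, "mu_y": 1.0, "s_xx": 1.0, "s_yx": 0.5},
    {"alpha": 0.5, "mu_x": 4.0, "mu_y": 6.0, "s_xx": 1.0, "s_yx": 0.5},
]

midpoint = convert(2.0, classes)       # both classes equally likely here
single = convert(3.0, classes[:1])     # one class: F reduces to a straight line
```

At x = 2.0 the two classes have equal posterior weight, so the result is the average of the two per-class predictions. A production version would work on vectors and guard against `total` underflowing to zero for inputs far from every class mean.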

Next, the lines spoken by actor 2 are translated into Japanese, voice actor 4 reads the translated lines aloud in Japanese, and dubbed voice data 6 in Japanese is obtained (step S104).
The voice quality conversion unit 103 passes dubbed voice data 6 through voice quality conversion filter 1 to obtain dubbed voice data 7 (step S105). Dubbed voice data 7 is Japanese dubbed speech that resembles the voice of actor 2, who speaks only English.

As described above, creating voice quality conversion filter 1 from English voice data of actor 2 and voice actor 4 makes it possible to convert speech uttered in Japanese by voice actor 4 into the voice of actor 2 using that filter. If a voice actor 4 who can speak both English and Japanese is available, speech uttered by voice actor 4 in Japanese can be converted into the voice of actor 2 even though actor 2 cannot speak Japanese. In other words, dubbing accompanied by voice quality conversion between different languages becomes possible.
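The order of steps S101 to S105 can be sketched end to end. The example below is deliberately simplified: it assumes scalar features that are already frame-paired, and it fits a one-class filter (a straight line, which is what the GMM conversion function reduces to when m = 1) instead of a full GMM. All numbers are invented.

```python
def mean(values):
    return sum(values) / len(values)


def train_filter(source_feats, target_feats):
    """Steps S101-S103: fit a one-class filter from frame-paired features."""
    mu_x, mu_y = mean(source_feats), mean(target_feats)
    s_xx = mean([(x - mu_x) ** 2 for x in source_feats])
    s_yx = mean([(x - mu_x) * (y - mu_y)
                 for x, y in zip(source_feats, target_feats)])
    slope = s_yx / s_xx
    return lambda x: mu_y + slope * (x - mu_x)


# Toy scalar "features" of the same English sentences read by both speakers.
voice_actor_english = [1.0, 2.0, 3.0, 4.0]  # stands in for filter voice data 5
actor_english = [3.0, 5.0, 7.0, 9.0]        # stands in for filter voice data 3
filter_1 = train_filter(voice_actor_english, actor_english)

# Steps S104-S105: the filter trained on English is applied to Japanese speech.
voice_actor_japanese = [1.5, 2.5]           # stands in for dubbed voice data 6
dubbed = [filter_1(x) for x in voice_actor_japanese]  # dubbed voice data 7
```

The point of the sketch is that training and conversion use different languages but the same source speaker, which is why a bilingual voice actor is needed.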

(Second Embodiment)
Next, a second embodiment of the present invention will be described.
Like the voice quality conversion dubbing system 100 of the first embodiment, the voice quality conversion dubbing system 100 of the second embodiment includes a voice collection unit 101, a conversion filter creation unit 102, and a voice quality conversion unit 103.
The voice collection unit 101 of this embodiment collects speech of the first and second speakers uttered in the first language, and speech of the second and third speakers uttered in a second language different from the first language.

In addition to the function of the conversion filter creation unit 102 of the first embodiment (creating voice quality conversion filter 1), the conversion filter creation unit 102 of this embodiment creates, based on speech of the second and third speakers uttered in the second language, voice quality conversion filter 8 (the second speaker conversion filter) for converting the third speaker's voice into the second speaker's voice.
The voice quality conversion unit 103 converts speech uttered by the third speaker in the second language into the second speaker's voice using voice quality conversion filter 8, and then converts that second-speaker voice into the first speaker's voice using voice quality conversion filter 1.

Next, an operation example of the voice quality conversion dubbing system 100 according to this embodiment is described with reference to FIG. 3. The example creates Japanese dubbed voice data, taking the voice of actor 2, who speaks only English, as the target voice and voice actor 10, who speaks only Japanese, as the source speaker.
The procedure up to the creation of voice quality conversion filter 1 (up to step S103 above) is the same as in the first embodiment.

Next, about 50 Japanese sentences are prepared in order to create voice quality conversion filter 8, which converts the voice of voice actor 10, who speaks only Japanese, into the voice of the bilingual voice actor 4. The bilingual voice actor 4 reads these Japanese sentences aloud, and the voice collection unit 101 obtains filter voice data 9 (step S201).
Voice actor 10, who speaks only Japanese, then reads the same prepared Japanese sentences aloud, and the voice collection unit 101 obtains filter voice data 11 (step S202).

The conversion filter creation unit 102 then uses the feature conversion method based on a Gaussian mixture model (GMM) to create voice quality conversion filter 8, which converts the voice quality of filter voice data 11 into that of filter voice data 9 (step S203).
Next, the lines spoken by actor 2 are translated into Japanese, voice actor 10 reads the translated lines aloud, and dubbed voice data 12 is obtained (step S204).

The voice quality conversion unit 103 passes dubbed voice data 12 through voice quality conversion filter 8 and then through voice quality conversion filter 1 to obtain dubbed voice data 13 (step S205). Dubbed voice data 13 is Japanese dubbed speech that resembles the voice of actor 2, who speaks only English.
Thus, even if voice actor 10 can speak only Japanese, the voice quality conversion dubbing system 100 creates voice quality conversion filter 1 by learning from speech uttered in English by actor 2 and voice actor 4, and creates voice quality conversion filter 8 by learning from speech uttered in Japanese by voice actor 4 and voice actor 10. Using filters 1 and 8, speech uttered in Japanese by voice actor 10 can be converted into the voice of actor 2, so dubbing between different languages can be performed easily.

It is known that voice quality conversion filter 8 and voice quality conversion filter 1 can be combined to create a new voice quality conversion filter, and this new filter can be used to convert the voice of voice actor 10 directly into the voice of actor 2. Adopting this conversion method makes it possible to halve the conversion processing time.
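Why a combined filter saves time is easiest to see in the simplest case. The sketch below assumes each filter is a one-class conversion, which is just a scalar affine map; two such maps collapse exactly into one, so each frame needs one pass instead of two. The class `AffineFilter` and its coefficients are hypothetical, and the source does not state how multi-class GMM filters are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffineFilter:
    """Stand-in filter y = slope * x + offset (a one-class conversion)."""
    slope: float
    offset: float

    def __call__(self, x: float) -> float:
        return self.slope * x + self.offset

    def then(self, nxt: "AffineFilter") -> "AffineFilter":
        """Return one filter equivalent to applying self first, then nxt."""
        return AffineFilter(nxt.slope * self.slope,
                            nxt.slope * self.offset + nxt.offset)


filter_8 = AffineFilter(slope=0.5, offset=1.0)    # voice actor 10 -> voice actor 4
filter_1 = AffineFilter(slope=2.0, offset=-3.0)   # voice actor 4 -> actor 2
combined = filter_8.then(filter_1)                # voice actor 10 -> actor 2

two_pass = filter_1(filter_8(4.0))
one_pass = combined(4.0)
```

Both paths give the same output for every input, while the combined filter performs a single multiply-add per frame.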
The functions of the voice quality conversion dubbing system 100 (the voice collection unit 101, the conversion filter creation unit 102, and the voice quality conversion unit 103) can be distributed across devices in various ways.

For example, the voice quality conversion dubbing system 100 can consist of a server device and a client device, with the conversion filter creation unit 102 placed in the server device and the voice quality conversion unit 103 placed in the client device. With this configuration, voice quality conversion filters 1 and 8 created by the server device can be downloaded to the client device, and voice quality conversion can be performed on the client device.
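Downloading a filter implies some serialised form of its parameters. The source does not specify one; the sketch below assumes a JSON payload with a hypothetical `version` field and a list of per-class parameter dictionaries, and the transport between server and client is left out.

```python
import json


def export_filter(classes):
    """Server side: serialise conversion parameters to a JSON string."""
    return json.dumps({"version": 1, "classes": classes})


def import_filter(payload):
    """Client side: parse a payload and return the per-class parameters."""
    data = json.loads(payload)
    if data.get("version") != 1:
        raise ValueError("unsupported filter format")
    return data["classes"]


# Hypothetical parameters of a trained filter (made-up numbers).
trained = [
    {"alpha": 0.5, "mu_x": 0.0, "mu_y": 1.0, "s_xx": 1.0, "s_yx": 0.5},
    {"alpha": 0.5, "mu_x": 4.0, "mu_y": 6.0, "s_xx": 1.0, "s_yx": 0.5},
]

payload = export_filter(trained)     # created once on the server
received = import_filter(payload)    # loaded on each client before conversion
```

Because only the parameters travel, the client needs the conversion routine but none of the training data or training code.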

In the first and second embodiments above, Japanese dubbed voice data is created with a single voice actor as the source speaker, but the invention is not limited to this, and a plurality of voice actors may serve as source speakers.
In that case there are several third speakers, each a source speaker, for one second speaker, so voice quality conversion filter 8 must be created for each source speaker. The conversion filter creation unit 102 creates voice quality conversion filter 8 for each source speaker using the filter voice data collected for that speaker by the voice collection unit 101. By providing one voice actor 4 who can speak both English and Japanese and dividing among several voice actors the reading of Japanese lines that voice actor 10 performed alone, the reading time required of each voice actor can be shortened and the work made more efficient.

Hereinafter, an operation example of the voice quality conversion dubbing system 100 with a plurality of conversion source speakers will be described in detail with reference to FIGS. 1, 3, and 7. In this case, there are a plurality of the third speakers (voice actors 10) shown in FIG. 3. FIG. 7 is a diagram showing the learning process for creating the conversion functions F and G and the voice quality conversion process using the conversion functions F and G.
When voices (voice sample data) have been recorded from the speakers to be recorded, the conversion filter creation unit 102 performs learning based on the voice of the third speaker (referred to as the source speaker) and the voice of the second speaker (referred to as the intermediate speaker), and thereby generates a conversion function F (voice quality conversion filter 8) for converting the voice of the source speaker into the voice of the intermediate speaker. The voices of the source speaker and the intermediate speaker used here are recorded in advance by having both speakers utter the same approximately 50 Japanese sentences (one set of voice contents). There is one intermediate speaker (the bilingual voice actor 4) and a plurality of source speakers, so a conversion function F is created for each conversion source speaker. That is, learning is performed between the voice of each of the plurality of source speakers and the voice of the one intermediate speaker. For this purpose, the voice of the one intermediate speaker is shared by all of the source speakers. As the learning method, for example, a feature conversion method based on a Gaussian mixture model (GMM) can be used, as in the first and second embodiments. Any other known method can also be used.
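The per-source-speaker learning described above can be illustrated with a minimal sketch. It is not the implementation of this publication: it assumes scalar feature frames that are already time-aligned, and it uses the single-Gaussian special case of a GMM mapping, which reduces to a least-squares linear map. The names `fit_filter`, `source_set_a`, and `intermediate_set_a` and all numeric values are hypothetical.

```python
from statistics import fmean


def fit_filter(src, dst):
    """Fit y = a * x + b by least squares on time-aligned scalar frames.

    With one Gaussian component, the GMM mapping E[y | x] reduces to
    this linear regression (assumption: 1-D features, aligned frames).
    """
    mean_x, mean_y = fmean(src), fmean(dst)
    var_x = sum((x - mean_x) ** 2 for x in src)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(src, dst))
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Toy features for the shared set A (hypothetical values).
intermediate_set_a = [1.0, 2.0, 3.0, 4.0]
source_set_a = {
    "Src.1": [0.0, 1.0, 2.0, 3.0],
    "Src.2": [2.0, 4.0, 6.0, 8.0],
}

# One conversion function F per source speaker, all trained against
# the same intermediate-speaker recording.
filters_f = {
    name: fit_filter(frames, intermediate_set_a)
    for name, frames in source_set_a.items()
}
```

The dictionary keyed by source speaker mirrors the point made in the text: the intermediate speaker's recording is reused unchanged, and only the source side varies.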

Next, the conversion filter creation unit 102 creates a conversion function G (voice quality conversion filter 1) for converting the voice of the intermediate speaker into the voice of the first speaker (referred to as the target speaker). The voices of the intermediate speaker and the target speaker used here are recorded in advance by having the intermediate speaker (the bilingual voice actor 4) and the target speaker (the actor 2) utter the same approximately 50 English sentences (one set of voice contents).

Note that the conversion functions F and G are not limited to mathematical expressions and may instead be expressed in the form of a conversion table.
Next, the learning process of the conversion functions F and G and the conversion process using them will be described with reference to FIG. 7. Here, it is assumed that there are two source speakers, and that the training text for the conversion function F is one set of approximately 50 Japanese sentences (setA) and the training text for the conversion function G is one set of approximately 50 English sentences (setB).

First, as shown in FIG. 7, the conversion filter creation unit 102 performs learning based on the setA voice of the source speaker (Src.1) and the setA voice of the intermediate speaker (In.), and creates a conversion function F(Src.1(A)) (step S301). Similarly, the conversion filter creation unit 102 performs learning based on the setA voice of the source speaker (Src.2) and the setA voice of the intermediate speaker (In.), and creates a conversion function F(Src.2(A)) (step S302).

Next, as shown in FIG. 7, the conversion filter creation unit 102 performs learning based on the intermediate speaker's (In.) voice for the approximately 50 English sentences (setB) and the setB voice of the target speaker (Tag.1), and creates a conversion function G1(In.(B)) (step S303).
In the conversion process, as shown in FIG. 7, the voice quality conversion unit 103 converts arbitrary Japanese speech of the source speaker (Src.1) into the voice of the intermediate speaker (In.) using the conversion function F(Src.1(A)) (step S304). Next, the voice quality conversion unit 103 converts the voice of the intermediate speaker (In.) into Japanese speech of the target speaker (Tag.1) using the conversion function G1(In.(B)) (step S305).

Similarly, as shown in FIG. 7, the voice quality conversion unit 103 converts arbitrary Japanese speech of the source speaker (Src.2) into the voice of the intermediate speaker (In.) using the conversion function F(Src.2(A)) (step S306). Next, the voice quality conversion unit 103 converts the voice of the intermediate speaker (In.) into Japanese speech of the target speaker (Tag.1) using the conversion function G1(In.(B)) (step S307).
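The two-stage conversion of steps S304 to S307 can be sketched as follows. This is only an illustration under stated assumptions: the filters are modeled as simple affine maps on scalar feature frames, and the names `make_affine`, `F`, `G1`, and `dub`, as well as the coefficients, are hypothetical rather than taken from this publication.

```python
def make_affine(a, b):
    """Return a toy filter that maps each frame x to a * x + b."""
    def convert(frames):
        return [a * x + b for x in frames]
    return convert


# One F per source speaker (stand-ins for F(Src.1(A)) and F(Src.2(A))).
F = {
    "Src.1": make_affine(1.0, 1.0),
    "Src.2": make_affine(0.5, 0.0),
}

# A single G shared by every source speaker (stand-in for G1(In.(B))).
G1 = make_affine(2.0, -1.0)


def dub(source_name, frames):
    """Convert source-speaker frames to target-speaker frames in two stages."""
    intermediate = F[source_name](frames)  # steps S304 / S306
    return G1(intermediate)                # steps S305 / S307


target_from_src1 = dub("Src.1", [0.0, 1.0])
target_from_src2 = dub("Src.2", [2.0, 4.0])
```

Only the first stage depends on who is speaking. The second stage is the same for every source speaker, which is why one bilingual intermediate speaker is enough.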

Note that the system may also have a function of converting the voice of the source speaker directly into the voice of the target speaker using a function obtained by combining the conversion function F and the conversion function G.
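As a hedged illustration of such a combined function, the sketch below composes two affine filters into one and checks that the direct conversion matches the two-stage result. It assumes the filters are affine maps on scalar frames, which is a simplification: a full GMM mapping is a soft mixture of such maps and does not in general collapse to a single affine map. The names `compose`, `apply_filter`, `f_src1`, `g1`, and `h_src1` and the coefficients are hypothetical.

```python
def compose(g, f):
    """Return the affine filter equivalent to applying f first, then g."""
    a_f, b_f = f
    a_g, b_g = g
    # g(f(x)) = a_g * (a_f * x + b_f) + b_g
    return (a_g * a_f, a_g * b_f + b_g)


def apply_filter(filt, frames):
    """Apply an affine filter (a, b) to every scalar frame."""
    a, b = filt
    return [a * x + b for x in frames]


f_src1 = (1.0, 1.0)   # source speaker -> intermediate speaker
g1 = (2.0, -1.0)      # intermediate speaker -> target speaker
h_src1 = compose(g1, f_src1)  # source speaker -> target speaker

frames = [0.0, 1.0, 2.5]
two_stage = apply_filter(g1, apply_filter(f_src1, frames))
direct = apply_filter(h_src1, frames)
```

Precomputing the combined filter avoids producing the intermediate speaker's features at conversion time, at the cost of storing one combined filter per source speaker.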

The invention can be used for dubbing accompanied by voice quality conversion between different languages, without being restricted by the language spoken by the target speaker.

FIG. 1 is a block diagram showing the functional configuration of the voice quality conversion dubbing system according to the first embodiment of the present invention.
FIG. 2 is a diagram for explaining an operation example of the voice quality conversion dubbing system according to the first embodiment.
FIG. 3 is a diagram for explaining an operation example of the voice quality conversion dubbing system according to the second embodiment of the present invention.
FIG. 4 is a diagram showing the types of English phonemes.
FIG. 5 is a diagram showing the types of Japanese phonemes.
FIG. 6 is a diagram showing the correspondence between Japanese phonemes and English phonemes.
FIG. 7 is a diagram showing the learning process for creating the conversion functions F and G and the voice quality conversion process using the conversion functions F and G.

Explanation of symbols

100 Voice quality conversion dubbing system
101 Voice collection unit
102 Conversion filter creation unit
103 Voice quality conversion unit

Claims

A voice quality conversion dubbing system comprising: conversion filter creation means for creating, based on voices of a first speaker and a second speaker uttered in a first language, a first speaker conversion filter for converting the voice of the second speaker into the voice of the first speaker; and
voice quality conversion means for converting a voice uttered by the second speaker in a second language different from the first language into the voice of the first speaker using the first speaker conversion filter.

The voice quality conversion dubbing system according to claim 1, wherein
the conversion filter creation means creates, based on voices of the second speaker and a third speaker uttered in the second language, a second speaker conversion filter for converting the voice of the third speaker into the voice of the second speaker, and
the voice quality conversion means converts the voice uttered by the third speaker in the second language into the voice of the second speaker using the second speaker conversion filter, and converts that voice of the second speaker into the voice of the first speaker using the first speaker conversion filter.

The voice quality conversion dubbing system according to claim 2, wherein
the voice quality conversion means converts the voice uttered by the third speaker in the second language into the voice of the first speaker using a conversion filter obtained by combining the second speaker conversion filter and the first speaker conversion filter.

The voice quality conversion dubbing system according to any one of claims 1 to 3, wherein
the conversion filter creation means is provided in a server device, and the voice quality conversion means is provided in a client device.

A program for causing a computer to execute:
a conversion filter creation step of creating, based on voices of a first speaker and a second speaker uttered in a first language, a first speaker conversion filter for converting the voice of the second speaker into the voice of the first speaker; and
a voice quality conversion step of converting a voice uttered by the second speaker in a second language different from the first language into the voice of the first speaker using the first speaker conversion filter.

The program according to claim 5, further causing the computer to execute:
in the conversion filter creation step,
a process of creating, based on voices of the second speaker and a third speaker uttered in the second language, a second speaker conversion filter for converting the voice of the third speaker into the voice of the second speaker; and
in the voice quality conversion step,
a process of converting the voice uttered by the third speaker in the second language into the voice of the second speaker using the second conversion filter, and converting that voice of the second speaker into the voice of the first speaker using the first conversion filter.