JP2009216723A

JP2009216723A - Similar speech selection device, speech creation device, and computer program

Info

Publication number: JP2009216723A
Application number: JP2008056938A
Authority: JP
Inventors: Yoshihiro Adachi (足立 吉広); Shinichi Kawamoto (川本 真一); Satoru Nakamura (中村 哲)
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2008-03-06
Filing date: 2008-03-06
Publication date: 2009-09-24

Abstract

PROBLEM TO BE SOLVED: To accurately select, from a plurality of sample speeches, speech that is similar to a target speech.

SOLUTION: A similar speech selection device includes: a voice actor speech database (DB) for storing the plurality of sample speeches; a distance calculation processing section (Steps 902 to 910) for calculating the distance between the target speech and each of the plurality of sample speeches as a weighted linear sum of distance measures between two pieces of speech, each measure being calculated for one of one or more types of acoustic features of speech; and a speech selection section (Step 912) for selecting, from the plurality of sample speeches, the speech for which the distance calculated by the distance calculation processing section is smallest, as the speech similar to the target speech.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a dialogue speech creation apparatus for multimedia productions, such as movies and animations, in which performers speak according to a scenario, and in particular to a dialogue speech creation apparatus capable of efficiently recording and reproducing dialogue in accordance with a predetermined scenario.

With advances in computer technology, particularly video and audio processing, systems that can produce a multimedia production featuring its users in a very short time are coming into practical use. For example, if such a system were introduced as an attraction at an exposition, so that a movie starring visitors to the exposition could be produced and screened on the spot, it could be expected to draw more visitors.

Such an attraction system is disclosed in Patent Document 1. The system disclosed there includes: a plurality of three-dimensional scanners and an image processing personal computer (hereinafter simply "PC") that capture three-dimensional face images of participants; a scenario storage server that stores a movie scenario prepared in advance, together with character images, background images, and the like; an attraction video generation device that generates a movie in which the participants appear as characters, by replacing the faces of the movie characters stored in the scenario storage server with the participants' faces based on the three-dimensional face images captured by the scanners; and a video transmission device for projecting the generated movie.

Each of the participants can appear in the movie as a character by designating the desired character in the movie.

Patent Document 1: JP 2005-115740 A

However, although the system described above can replace a character's face image with a participant's face image, it cannot replace the character's voice. Once a participant's face has been captured with a three-dimensional scanner, the face image can be used in any scene, but nothing comparable is possible for speech.

For speech, the participant must read the lines aloud in accordance with both the scenario and the video. This work is not only difficult but also time-consuming. Participants in an attraction, who have only limited time, cannot be asked to do it, and as a result the attraction system described above could not use the participants' voices.

The same problem can arise outside movies. For example, when creating something like a radio drama that uses audio only, it is difficult to create a long drama with a participant's voice if the participant can spare only a short time. A similar problem arises when dubbing animation, or when dubbing a human voice onto live-action footage of animals.

Nor is the problem limited to using the voices of temporary visitors such as attraction participants. It can also arise with people who dub professionally, that is, voice actors. Dubbing an entire scenario of a given length with a voice actor's voice takes a certain minimum amount of time, and when the available time is severely limited it may be impossible to complete the dubbing at all.

If a large number of lines have already been recorded in other people's voices, it may be possible to dub with the voice of a person whose voice closely resembles the participant's. That, however, would require recording the voices of as many people as possible, covering differences in sex, age, voice quality, and so on, which is very difficult. To solve these problems, when producing a multimedia production in which the characters' lines are known, it is desirable to be able to replace a character's voice with a user's voice easily and quickly. To that end it is desirable, for example, to be able to accurately generate speech close to a target speech from sample speeches prepared in advance. That in turn makes it desirable to be able to select, from the sample speeches prepared in advance, speech close to the target speech.

An object of the present invention is therefore to provide a speech selection device capable of accurately selecting, from a plurality of sample speeches, speech similar to a target speech.

Another object of the present invention is to provide a speech generation device capable of accurately selecting, from a plurality of sample speeches, speech similar to a target speech, and of generating from the selected speech a speech that closely resembles the target speech.

A similar speech selection device according to a first aspect of the present invention is a device for selecting, from a plurality of sample speeches, speech similar to a target speech. It includes: means for storing the plurality of sample speeches; distance calculation means for calculating the distance between the target speech and each of the plurality of sample speeches as a weighted linear sum of distance measures between two speeches, one measure being calculated for each of one or more types of acoustic features of speech; and speech selection means for selecting, from the plurality of sample speeches, the one for which the distance calculated by the distance calculation means is smallest, as the speech similar to the target speech.

There are cases where one wants to produce, in a particular speaker's voice, an utterance whose content that speaker has never recorded. The similar speech selection device is effective in such cases. Specifically, the distance calculation means calculates a distance measure between two speeches for each of one or more types of acoustic features and, from the weighted linear sum of those measures, calculates the distance between the target speech and each sample speech. The speech selection means selects the sample speech for which this distance is smallest.

Using a linear sum of distance measures calculated for one or more types of acoustic features between two speeches makes it possible to identify speech close to the target speech more accurately than when only a single acoustic feature is used.
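The selection rule of the first aspect can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the feature names (`f0`, `energy`), the absolute-difference measure, and the weight values are all hypothetical stand-ins for the per-feature distance measures and combination coefficients described in the text.

```python
from typing import Callable, Dict, Sequence

# Hypothetical representation: one utterance is a dict of scalar features.
Utterance = Dict[str, float]
DistanceFn = Callable[[Utterance, Utterance], float]


def abs_diff(feature: str) -> DistanceFn:
    """Build a toy per-feature distance measure (absolute difference)."""
    return lambda a, b: abs(a[feature] - b[feature])


def combined_distance(target: Utterance, sample: Utterance,
                      measures: Sequence[DistanceFn],
                      weights: Sequence[float]) -> float:
    """Weighted linear sum of the per-feature distance measures."""
    if len(measures) != len(weights):
        raise ValueError("one weight is required per distance measure")
    return sum(w * d(target, sample) for d, w in zip(measures, weights))


def select_most_similar(target: Utterance, samples: Dict[str, Utterance],
                        measures: Sequence[DistanceFn],
                        weights: Sequence[float]) -> str:
    """Return the name of the sample with the smallest combined distance."""
    return min(samples, key=lambda name: combined_distance(
        target, samples[name], measures, weights))
```

In the embodiment the per-feature measures would be DTW-based or GMM-based distances over frame sequences rather than scalar differences, and the weights would come from the stored combination coefficients.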

Preferably, the distance measure is a distance measure calculated by dynamic time warping (DTW) between the same acoustic feature of the two speeches.
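For reference, a textbook DTW distance between two sequences of feature vectors can be sketched as below. The Euclidean local cost, the three-way step pattern, and the normalization by the summed sequence lengths are assumptions made for this sketch. The text does not specify which DTW variant is used.

```python
import math
from typing import List, Sequence


def dtw_distance(a: Sequence[Sequence[float]],
                 b: Sequence[Sequence[float]]) -> float:
    """Classic DTW between two frame sequences, length-normalized."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost: List[List[float]] = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = local + min(cost[i - 1][j],      # a advances
                                     cost[i][j - 1],      # b advances
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)
```

Because the alignment may repeat frames, two utterances with the same feature trajectory but different speaking rates can still receive a distance of zero, which is why DTW suits comparing the same line spoken by different people.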

More preferably, the distance measure is a distance measure calculated by a Gaussian mixture model (GMM) between the same acoustic feature of the two speeches.
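The text does not state how a distance is derived from a GMM. One common choice, assumed here purely for illustration, is to model one speaker's feature frames with a diagonal-covariance GMM and use the negative average log-likelihood of the other speaker's frames as the distance. The sketch below only evaluates an already-trained model. Training (for example by EM) is outside its scope, and all parameter names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gmm_log_likelihood(frame: Vector, weights: Sequence[float],
                       means: Sequence[Vector],
                       variances: Sequence[Vector]) -> float:
    """Log-likelihood of one frame under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_pdf = sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
                      for x, m, v in zip(frame, mu, var))
        log_terms.append(math.log(w) + log_pdf)
    # log-sum-exp over mixture components for numerical stability
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def gmm_distance(frames: Sequence[Vector], weights: Sequence[float],
                 means: Sequence[Vector],
                 variances: Sequence[Vector]) -> float:
    """Assumed distance: negative mean log-likelihood of the frames."""
    total = sum(gmm_log_likelihood(f, weights, means, variances)
                for f in frames)
    return -total / len(frames)
```

Unlike DTW, this measure ignores frame order, so it compares overall voice quality rather than the time course of a particular utterance.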

Experiments confirm that using a linear sum of distance measures calculated with DTW or GMM in this way gives higher accuracy than using only a single acoustic feature.

Still more preferably, the combination coefficients of the weighted linear sum of distance measures are determined in advance so as to maximize the correlation between the result of humans ranking a plurality of speeches by similarity to each of a plurality of target speeches and the result of the same ranking performed by the distance calculation means.

The combination coefficients of the weighted linear sum are determined in advance so that the result correlates highly with the ranking assigned by human judgment. Because a linear sum of distance measures over one or more types of acoustic features is used, this correlation can be made high. The result is a speech selection device whose selections are close to those a human would make when selecting similar speech.
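One way such coefficients could be fitted is sketched below. The choice of Spearman rank correlation and of an exhaustive grid search are assumptions of this sketch, since the text says only that the correlation with the human ranking is maximized. For brevity the sketch handles a single target speech and assumes no tied human ranks. The embodiment considers a plurality of targets.

```python
from itertools import product
from typing import List, Sequence, Tuple


def ranks(values: Sequence[float]) -> List[int]:
    """Rank positions (1 = smallest). Ties are broken by input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (valid for n >= 2 without ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def fit_weights(per_feature: Sequence[Sequence[float]],
                human_ranks: Sequence[float],
                grid: Sequence[float]) -> Tuple[Tuple[float, ...], float]:
    """per_feature[k][i] is the distance of sample i under feature k."""
    best_w: Tuple[float, ...] = ()
    best_rho = -2.0
    for w in product(grid, repeat=len(per_feature)):
        if not any(w):
            continue  # skip the all-zero weight vector
        combined = [sum(wk * feat[i] for wk, feat in zip(w, per_feature))
                    for i in range(len(human_ranks))]
        rho = spearman(combined, human_ranks)
        if rho > best_rho:
            best_w, best_rho = w, rho
    return best_w, best_rho
```

A grid search grows exponentially with the number of features, so for the eight features of the embodiment a gradient-free optimizer or a coarser search would likely be needed in practice.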

A computer program according to a second aspect of the present invention, when executed by a computer, causes the computer to function as: means for storing a plurality of sample speeches; distance calculation means for calculating the distance between a target speech and each of the plurality of sample speeches as a weighted linear sum of distance measures, one measure being calculated for each of one or more types of acoustic features of speech; and speech selection means for selecting, from the plurality of sample speeches, the one for which the distance calculated by the distance calculation means is smallest, as the speech similar to the target speech.

A speech generation device according to a third aspect of the present invention is a device for generating speech of predetermined content in a voice similar to a target speech. It includes: means for storing a plurality of sample speeches of the predetermined content; distance calculation means for calculating the distance between the target speech and each of the plurality of sample speeches as a weighted linear sum of distance measures between two speeches, one measure being calculated for each of one or more types of acoustic features of speech; speech selection means for selecting, from the plurality of sample speeches, the one for which the distance calculated by the distance calculation means is smallest, as the speech similar to the target speech; and speech generation means for generating the speech of the predetermined content using the speech selected by the speech selection means.

In this speech generation device, speech similar to the target speech is selected from the plurality of sample speeches by the same configuration as in the speech selection device according to the first aspect, and speech uttering the predetermined content is generated from the selected speech. As with the speech selection device according to the first aspect, speech close to the target speech can therefore be identified accurately and used to generate speech uttering the predetermined content.

Preferably, the speech selection means includes means for selecting, from the plurality of sample speeches, the several for which the distance calculated by the distance calculation means is smallest, as speeches similar to the target speech, and the speech generation means includes speech morphing means for generating new speech by morphing the plurality of speeches selected by the speech selection means.
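Selecting the several closest samples rather than only the single closest one can be sketched with the standard library's `heapq.nsmallest`. The default `k=3` mirrors the three voice actors used in the embodiment described later. The dictionary of precomputed distances is a hypothetical input format.

```python
import heapq
from typing import Dict, List


def select_k_nearest(distances: Dict[str, float], k: int = 3) -> List[str]:
    """Return the names of the k samples with the smallest distance,
    ordered from most to least similar."""
    nearest = heapq.nsmallest(k, distances.items(), key=lambda item: item[1])
    return [name for name, _ in nearest]
```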

The speeches most similar to the target speech are selected from the plurality of sample speeches, and speech is then synthesized by morphing them. Morphing makes it possible to generate speech that is even closer to the target speech than any of the selected sample speeches.

More preferably, the speech generation device further includes optimization means for optimizing the morphing ratio of the plurality of speakers used by the speech morphing means, such that the distance between a predetermined feature vector of the morphed speech, obtained by morphing the speeches of the plurality of speakers selected by the selecting means, and the feature vector of the target speech for the corresponding feature is minimized.

Because the morphing ratio is optimized, the speech obtained by morphing is closer to the target speech.
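The ratio optimization can be illustrated under a strong simplifying assumption: that the feature vector of the morphed speech is the ratio-weighted average of the selected speakers' feature vectors. Real speech morphing is not linear in this way, so the sketch below shows only the shape of the search (ratios constrained to be non-negative and to sum to one, with Euclidean distance to the target minimized), not the embodiment's procedure.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def simplex_grid(n: int, steps: int) -> Iterator[Tuple[float, ...]]:
    """Yield all n-tuples of multiples of 1/steps that sum to 1."""
    def rec(remaining: int, slots: int) -> Iterator[Tuple[int, ...]]:
        if slots == 1:
            yield (remaining,)
            return
        for k in range(remaining + 1):
            for rest in rec(remaining - k, slots - 1):
                yield (k,) + rest
    for combo in rec(steps, n):
        yield tuple(k / steps for k in combo)


def optimize_morphing_ratio(speaker_vectors: Sequence[Sequence[float]],
                            target_vector: Sequence[float],
                            steps: int = 10) -> Tuple[Tuple[float, ...], float]:
    """Grid-search the ratio whose (assumed linear) morph is nearest the target."""
    best_ratio: Tuple[float, ...] = ()
    best_dist = math.inf
    for ratio in simplex_grid(len(speaker_vectors), steps):
        morphed: List[float] = [
            sum(r * vec[d] for r, vec in zip(ratio, speaker_vectors))
            for d in range(len(target_vector))
        ]
        dist = math.dist(morphed, target_vector)
        if dist < best_dist:
            best_ratio, best_dist = ratio, dist
    return best_ratio, best_dist
```

With three speakers the target lies inside a triangle spanned by their feature vectors whenever an exact match exists, which is one intuition for why a morph can come closer to the target than any single selected speaker.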

A computer program according to a fourth aspect of the present invention, when executed by a computer, causes the computer to function as: means for storing a plurality of sample speeches of predetermined content; distance calculation means for calculating the distance between a target speech and each of the plurality of sample speeches as a weighted linear sum of distance measures between two speeches, one measure being calculated for each of one or more types of acoustic features of speech; speech selection means for selecting, from the plurality of sample speeches, the one for which the distance calculated by the distance calculation means is smallest, as the speech similar to the target speech; and speech generation means for generating the speech of the predetermined content using the speech selected by the speech selection means.

A multimedia production system that performs similar speech selection and speech generation, according to one embodiment of the dialogue speech creation apparatus of the present invention, is described below. In the following description and drawings, identical components are given identical names and reference numerals. Their functions are also identical. Detailed descriptions of them are therefore not repeated.

FIG. 1 is a block diagram of a multimedia production system 50 according to one embodiment of the present invention. Referring to FIG. 1, the multimedia production system 50 includes: a three-dimensional scanner group 60 consisting of a plurality of three-dimensional scanners similar to those described in Patent Document 1; an image processing PC 62 for creating three-dimensional models of the participants' face images captured by the three-dimensional scanner group 60; a scenario storage server (not shown) for storing the movie scenario together with the face images of the characters and other images; a video generation device 64 that uses the participants' face images generated by the image processing PC 62 to replace the characters' face images stored in the scenario storage server, generates video in which persons with the participants' faces appear, and outputs it as video data 66; and a video data storage device for storing the video data 66.

The multimedia production system 50 further includes: a video material DB (database) 70 that stores the video material for creating the final video data 66; a dialogue information storage unit 72 that stores dialogue information on the lines of those characters in the movie that are to be dubbed by participants; a standard speech storage unit 74 that stores standard speech in which the characters' lines in the movie are spoken in a standard voice; and a cut information storage unit 76 that stores cut information indicating in what kind of scene each line in the movie is spoken and, accordingly, what acoustic effect should be applied to the speech of that line.

The multimedia production system 50 further includes a dialogue speech data creation unit 90. Using the video stored in the video material DB 70, the dialogue information stored in the dialogue information storage unit 72, the standard-speech utterance data of the lines stored in the standard speech storage unit 74, and the cut information stored in the cut information storage unit 76, the dialogue speech data creation unit 90 records a participant's (user's) speech and, based on it, replaces the speech of a particular movie character's lines with the user's speech (processing similar to dubbing). It outputs dialogue speech data 86, consisting of the lines spoken in the user's voice, and a dialogue speech table 88, which stores as a table the utterance start time, utterance duration, corresponding speech file name, and so on of each line in the dialogue speech data 86.
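The dialogue speech table 88 is described only by the kinds of fields it holds. A possible in-memory representation is sketched below. The field names, the use of seconds as the time unit, and the `lines_active_at` helper are all hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class DialogueSpeechEntry:
    """One row of a table like the dialogue speech table 88 (assumed layout)."""
    line_number: int
    start_time_s: float   # utterance start time within the production
    duration_s: float     # utterance duration
    audio_file: str       # name of the corresponding speech file

    @property
    def end_time_s(self) -> float:
        return self.start_time_s + self.duration_s


def lines_active_at(table: Sequence[DialogueSpeechEntry],
                    t: float) -> List[int]:
    """Line numbers whose speech should be playing at time t."""
    return [e.line_number for e in table
            if e.start_time_s <= t < e.end_time_s]
```

A playback component could use such a lookup to keep the speech files aligned with the video timeline.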

Like the three-dimensional scanner group 60, the dialogue speech data creation unit 90 is configured to process the speech of a plurality of users. As described later, each user is distinguished by an identifier (ID), and the video processing system consisting of the three-dimensional scanner group 60, the image processing PC 62, and the video generation device 64 and the dialogue speech data creation unit 90 assign and manage the same ID for the same user. This makes it possible to replace the faces and voices of a plurality of characters in the movie with the faces and voices of particular users at the same time.

The multimedia production system 50 further includes: a method list table 78 that stores, for each line, information indicating which methods the dialogue speech data creation unit 90 uses when creating a character's dialogue speech from the user's speech; a voice actor speech DB 80 that stores sample speech data in which each line of the movie has been spoken in advance by various voice actors, used by the dialogue speech data creation unit 90 in place of the user's utterance to create dialogue speech data for lines the user could not record; a segment DB 82 that stores, together with their feature data, the speech segments used when the dialogue speech data creation unit 90, in place of the user's utterance, generates by speech synthesis dialogue speech data with a voice quality resembling the user's; and a linear combination coefficient storage unit 94 that stores the linear combination coefficients for calculating a composite acoustic feature, obtained by linearly combining a plurality of acoustic features, which is used, as described later, to select from the voice actor speech stored in the voice actor speech DB 80 speech whose voice quality resembles the user's. In this embodiment, as described later, the composite acoustic feature is calculated as a linear combination of eight types of acoustic features. The linear combination coefficient storage unit 94 therefore stores eight coefficients.

The multimedia production system 50 further includes a video and audio playback device 92 that plays back the video data 66 output from the video generation device 64 and the dialogue speech data 86 output from the dialogue speech data creation unit 90 in synchronization with each other, using the dialogue speech table 88, and thereby presents a multi-character production in which the face images and voices of some of the characters have been replaced with the users' face images and voices.

As described above, the dialogue speech data creation unit 90 records the speech of a plurality of users and, based on it, generates dialogue speech for separate characters. To this end, the dialogue speech data creation unit 90 includes: a plurality of user information input units 100, 100A, ..., 100N, each of which receives user information about the user to be processed, including identification information, sex, name, age, and information specifying the character to be dubbed; a plurality of character speech creation units 102, 102A, ..., 102N, each of which records the corresponding user's speech based on the user information received by the user information input units 100, 100A, ..., 100N and, based on the recorded speech, generates and outputs the corresponding character's dialogue speech in the user's voice quality by various methods; and a speech integration unit 104 that integrates the dialogue speech of the various characters, replaced with the users' voice qualities and output by the character speech creation units 102, 102A, ..., 102N, in order of line number based on the dialogue information stored in the dialogue information storage unit 72 so as to form the audio of a single multimedia production, and outputs the result as the dialogue speech data 86 and the dialogue speech table 88.
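The integration step performed by the speech integration unit 104 amounts to merging the per-character outputs into a single sequence ordered by line number. A minimal sketch follows, in which each output item is a hypothetical `(line_number, audio_file)` pair.

```python
from typing import List, Sequence, Tuple

LineAudio = Tuple[int, str]  # (line number, speech file name), assumed format


def integrate(outputs_per_character: Sequence[Sequence[LineAudio]]
              ) -> List[LineAudio]:
    """Merge the outputs of all character speech creation units and
    order them by line number."""
    merged = [item for outputs in outputs_per_character for item in outputs]
    return sorted(merged, key=lambda item: item[0])
```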

The user information entered through the user information input units 100, 100A, ..., 100N is also supplied to the image processing PC 62 and used to manage the users' face images.

The character speech creation units 102, 102A, ..., 102N all have the same configuration. The configuration of the character speech creation unit 102 is therefore described below as representative.

FIG. 2 is a functional block diagram of the character speech creation unit 102. Referring to FIG. 2, the character speech creation unit 102 includes: a speech recording unit 114 that receives the user information and, using the video material stored in the video material DB 70, the dialogue information stored in the dialogue information storage unit 72, and the standard-speech dialogue speech stored in the standard speech storage unit 74, has the user speak the lines of the character the user is to dub and records the utterances in a user speech DB 120; and an input/output device 112 used by an attendant to operate the speech recording unit 114 in order to control the recording of utterances, and to assist the user in speaking.

A movie generally contains a large number of lines, and even when only one character's lines are considered, recording the user's utterances of them can be expected to take considerable time. For movie audio, the utterances must also match the character's movements, which is likely to make recording take even longer. In attractions in particular, time limits often make it difficult to record every utterance. Even when a line can be recorded, the utterance is often too short or too long, so the recording frequently cannot be used as is. The character speech creation unit 102 of this embodiment therefore aims to generate dialogue speech with a voice quality as close as possible to the user's, applying predetermined speech generation methods both to the lines for which the user's utterance was recorded and to those for which it was not. The method list table 78 stores, for each line, a method list indicating which methods are to be used and in what priority order, and the character speech creation unit 102 uses the method list table 78 for speech generation.
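The priority-ordered lookup in the method list table 78 can be sketched as below. The method names, the per-line state flags, and the applicability conditions are all hypothetical. This passage states only that each line has a list of methods with a priority order and that a method satisfying the conditions is chosen.

```python
from typing import Callable, Dict, Sequence

LineState = Dict[str, bool]  # assumed per-line flags, e.g. "recorded"

# Hypothetical applicability conditions for illustrative method names.
CONDITIONS: Dict[str, Callable[[LineState], bool]] = {
    "use_recording": lambda s: s["recorded"] and s["duration_ok"],
    "time_stretch_recording": lambda s: s["recorded"],
    "morph_voice_actors": lambda s: True,  # fallback that always applies
}


def choose_method(priority_list: Sequence[str], state: LineState) -> str:
    """Return the first method in priority order whose condition holds."""
    for method in priority_list:
        if CONDITIONS[method](state):
            return method
    raise LookupError("no applicable method for this line")
```

Placing an always-applicable method last in each list guarantees that every line receives some speech, which matches the stated goal of covering both recorded and unrecorded lines.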

The character speech creation unit 102 further includes: a synthesis method determination unit 116 that, for the user speech stored in the user speech DB 120 by the speech recording unit 114, refers to the method list table 78 to determine, for each line of the character the user is dubbing, the method that satisfies the conditions, and that further refers to the linear combination coefficients from the linear combination coefficient storage unit 94 shown in FIG. 1 to determine the three voice actors whose speech, among the voice actor speech stored in the voice actor speech DB 80, is similar to the user speech, and calculates the morphing rate vector to be used when generating speech similar to the user speech by morphing the speech of those three voice actors; a similar voice actor storage unit 130 that stores the identifiers of the three voice actors' speech determined by the synthesis method determination unit 116; a morphing rate storage unit 132 that stores the morphing rate vector determined by the synthesis method determination unit 116; and a speech creation unit 118 that creates the character's dialogue speech to match the user's voice quality using the method determined by the synthesis method determination unit 116 and outputs it as a speech file 110 for each line.

音声作成部１１８はこの合成の際に、手法によって声優音声ＤＢ８０、ユーザ音声ＤＢ１２０、素片ＤＢ８２、標準音声記憶部７４等に記憶された音声、類似声優記憶部１３０に記憶された識別子に対応する声優音声、又はモーフィング率記憶部１３２に記憶されたモーフィング率ベクトルを適宜利用する。また音声作成部１１８は、生成された台詞の発話音声に対し、カット情報記憶部７６に記憶されたカット情報に基づいて決定される音響効果を加えて最終的な音声ファイル１１０を出力する。 The voice creating unit 118 corresponds to the voice stored in the voice actor voice DB 80, the user voice DB 120, the segment DB 82, the standard voice storage unit 74, and the identifier stored in the similar voice actor storage unit 130 according to the method at the time of synthesis. A voice actor voice or a morphing rate vector stored in the morphing rate storage unit 132 is appropriately used. In addition, the voice creation unit 118 adds a sound effect determined based on the cut information stored in the cut information storage unit 76 to the generated speech voice of the line, and outputs a final voice file 110.

合成手法決定部１１６において行なわれる、ユーザ音声に類似した声優音声の選択処理の詳細と、それらを使用したモーフィングの詳細と、モーフィングのためのモーフィング率ベクトルの算出方法の詳細については後述する。 Details of selection processing of voice actor speech similar to the user speech, details of morphing using them, and details of a method of calculating a morphing rate vector for morphing performed by the synthesis method determination unit 116 will be described later.

キャラクタ音声作成部１０２はさらに、ユーザ音声ＤＢ１２０に格納されたユーザの音声を声優音声ＤＢ８０に新たな声優音声として登録する処理を行なう音声ＤＢ更新部１２２と、ユーザ音声ＤＢ１２０に記憶されたユーザ音声を音素片（素片）に分解し、それらの所定の音響データ、音素ラベル、及びユーザＩＤとともに素片ＤＢ８２に追加するための素片ＤＢ更新部１２４とを含む。素片ＤＢ更新部１２４による音声の素片への分解においては、音声認識技術を利用し、台詞情報記憶部７２に記憶された台詞情報にあわせてユーザ音声ＤＢ１２０に記憶されたユーザの音声を細分化するセグメンテーションを行なう。 The character voice creation unit 102 further includes a voice DB update unit 122 that performs processing for registering a user voice stored in the user voice DB 120 as a new voice actor voice in the voice actor voice DB 80, and a user voice stored in the user voice DB 120. A unit DB update unit 124 is included for disassembling into phonemes (units) and adding them to the unit DB 82 together with their predetermined acoustic data, phoneme labels, and user IDs. In the segmentation of speech into segments by the segment DB update unit 124, speech recognition technology is used to subdivide the user's speech stored in the user speech DB 120 in accordance with the speech information stored in the speech information storage unit 72. Perform segmentation.

図３は、台詞情報記憶部７２に記憶される台詞情報テーブルの構成を示す。図３を参照して、台詞情報記憶部７２は、作成対象となる映画の台詞の全てを通し番号（Ｎｏ）で管理するためのものである。各台詞情報は、その台詞の通し番号（以下「台詞番号」と呼ぶ。）と、その台詞を発話する映画のキャラクタを識別するキャラクタＩＤと、台詞の内容であるテキストデータと、その台詞を標準音声で発話したものを記録した、標準音声記憶部７４内の音声ファイルのファイル名と、映画の進行経過の中でその台詞の発話が開始される時点を示す開始時刻と、その発話の継続時間を示す発話時間とを含む。台詞情報記憶部７２の台詞情報テーブルがこのような構成を有しているため、同じキャラクタＩＤの台詞を抽出することにより、あるキャラクタの台詞を全てリスト化することができる。また、ある台詞について、ユーザによる音声が利用できないときに、対応する標準音声を音声ファイル名により示される音声ファイルから得ることができる。 FIG. 3 shows the configuration of a dialogue information table stored in the dialogue information storage unit 72. Referring to FIG. 3, the dialogue information storage unit 72 is for managing all dialogues of a movie to be created with serial numbers (No). Each line information includes a serial number of the line (hereinafter referred to as “line number”), a character ID for identifying a movie character that utters the line, text data that is the content of the line, and the line as a standard voice. The file name of the audio file in the standard audio storage unit 74 in which the utterance is recorded, the start time indicating when the dialogue is started during the progress of the movie, and the duration of the utterance And the utterance time shown. Since the dialogue information table of the dialogue information storage unit 72 has such a configuration, it is possible to list all dialogues of a certain character by extracting dialogues with the same character ID. Further, when a user's voice cannot be used for a certain line, the corresponding standard voice can be obtained from the voice file indicated by the voice file name.
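The dialogue information table described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name `LineInfo`, its field names, the helper functions, and the sample rows are all hypothetical and do not come from the patent, which specifies only which items each record contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineInfo:
    """One record of the dialogue information table (field names are assumptions)."""
    line_no: int              # serial number of the line ("line number")
    character_id: str         # character who speaks the line
    text: str                 # content of the line
    standard_voice_file: str  # file name in the standard voice storage
    start_time: float         # start of the utterance within the movie, in seconds
    duration: float           # utterance time, in seconds


def lines_for_character(table, character_id):
    """List every line of one character, ordered by line number."""
    rows = [row for row in table if row.character_id == character_id]
    return sorted(rows, key=lambda row: row.line_no)


def standard_voice_for(table, line_no):
    """Return the standard voice file to fall back on for a given line."""
    for row in table:
        if row.line_no == line_no:
            return row.standard_voice_file
    raise KeyError(line_no)


# Invented sample data, for illustration only.
DIALOGUE_TABLE = [
    LineInfo(1, "C01", "Hello.", "std_0001.wav", 12.0, 1.0),
    LineInfo(2, "C02", "Who are you?", "std_0002.wav", 14.0, 2.0),
    LineInfo(3, "C01", "A traveler.", "std_0003.wav", 17.0, 2.0),
]
```

Filtering on the character ID yields the full list of a character's lines, and the file name field gives the standard voice to use when no user recording is available.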

図４に、本実施の形態に係るマルチメディア製作システム５０における、ユーザによる録音状況としてあり得るいくつかの場合を示す。なお、たとえば図４（Ａ）を参照して、あるユーザについて録音すべき発話全体が発話集合１４０を形成するものとする。この発話集合１４０は、音声合成、声質変換などのために必要で、必ず収録すべき発話からなる必須発話部分１４２と、対応するキャラクタの台詞全体からなる台詞部分１４４とからなる。ユーザの収録にかかる時間、ユーザの発話の巧拙などにより、必須発話部分１４２はともかく、台詞部分１４４については、全て収録できる場合、一部のみしか収録できないとき、全く収録できないとき、の３通りがあり得る。図４には、それらの場合を分けて、収録できた部分に斜線を付し、収録できなかった部分は白抜きのままで例示してある。 FIG. 4 shows several recording situations that can arise for a user in the multimedia production system 50 according to the present embodiment. Referring, for example, to FIG. 4(A), all the utterances to be recorded for a given user form an utterance set 140. The utterance set 140 consists of an essential utterance portion 142, made up of utterances that are needed for speech synthesis, voice quality conversion and the like and must always be recorded, and a dialogue portion 144, made up of all the lines of the corresponding character. Depending on the time the recording takes, the user's skill at speaking and so on, and setting aside the essential utterance portion 142, three outcomes are possible for the dialogue portion 144: all of it is recorded, only part of it is recorded, or none of it is recorded. FIG. 4 illustrates these cases separately, with recorded portions hatched and unrecorded portions left blank.

たとえば図４（Ａ）には、発話集合１４０の全てを収録できた場合を示す。図４（Ｂ）には、必須発話部分１４２と、一部の台詞部分１４６のみが収録でき、残りの部分１４８が収録できなかった場合を示す。図４（Ｃ）には、必須発話部分１４２の部分のみが収録でき、他の台詞部分１５０が全く収録できなかった場合を示す。 For example, FIG. 4A shows a case where the entire utterance set 140 has been recorded. FIG. 4B shows a case where only the essential utterance portion 142 and some dialogue portions 146 can be recorded, and the remaining portion 148 cannot be recorded. FIG. 4C shows a case where only the essential utterance portion 142 can be recorded and the other dialogue portion 150 cannot be recorded at all.

図４（Ａ）に示す場合には、基本的にはユーザの音声のみを用いて台詞音声を作成することができる。ただしこの場合にも、ユーザの巧拙によって話速を変換したり、発話レベルを調整したりする加工が必要なときがある。それらは台詞ごとに異なる。 In the case shown in FIG. 4A, the speech can be created basically using only the user's voice. However, even in this case, there is a case where it is necessary to change the speech speed or adjust the speech level by the skill of the user. They are different for each line.

図４（Ｂ）に示す場合には、収録できた台詞部分１４６については、図４（Ａ）に示す場合と同様に処理できるが、収録できなかった台詞部分１４８については何らかの手法を用いてユーザの音声以外からユーザの音声に似た台詞音声を生成する必要がある。 In the case shown in FIG. 4(B), the recorded dialogue portion 146 can be processed in the same way as in the case of FIG. 4(A), but for the unrecorded dialogue portion 148, speech resembling the user's voice must be generated by some method from sources other than the user's voice.

図４（Ｃ）に示す場合には、台詞部分１５０の全てについて台詞音声を生成する必要がある。その場合、たとえば必須発話部分１４２からユーザの声質を表す特徴量を抽出し、声優音声ＤＢから類似の声質の声優の台詞音声を抽出したり、標準音声の声質をユーザの声質に近くなるように変換したりする処理（声質変換）を行なったりする必要がある。 In the case illustrated in FIG. 4C, it is necessary to generate speech for all of the speech parts 150. In that case, for example, a feature amount representing the voice quality of the user is extracted from the essential utterance portion 142, and speech voices of voice actors of similar voice quality are extracted from the voice actor voice DB, or the voice quality of the standard voice is made close to the voice quality of the user. It is necessary to perform processing (voice quality conversion) for conversion.
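The three recording situations of FIG. 4 can be told apart from per-line recording flags. The following is a minimal sketch under assumptions: the function name `recording_case` and its arguments are hypothetical, and the labels simply reuse the panel letters of FIG. 4.

```python
def recording_case(essential_recorded, line_flags):
    """Classify a recording session as case 'A', 'B', or 'C' of FIG. 4.

    essential_recorded: True if the essential utterance portion was recorded.
    line_flags: one boolean per line of the character (True = recorded).
    """
    if not essential_recorded:
        # The description treats the essential portion as always recorded.
        raise ValueError("the essential utterance portion must be recorded")
    recorded = sum(1 for flag in line_flags if flag)
    if recorded == len(line_flags):
        return "A"  # every line recorded
    if recorded == 0:
        return "C"  # only the essential portion recorded
    return "B"      # some lines recorded, others missing
```

In case A the user's own voice suffices, in case B the missing lines need generated speech, and in case C every line has to be generated from the essential utterances and other voice sources.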

図２に示す手法リストテーブル７８には、台詞ごとに、どのような優先順位でそうした手法を使用するかが示されている。本実施の形態では、９種類の手法を用いて台詞音声を生成する。それら手法の詳細については後述する。 The technique list table 78 shown in FIG. 2 shows in what priority order such techniques are used for each dialogue. In the present embodiment, speech speech is generated using nine types of techniques. Details of these methods will be described later.

図５及び図６は、図２に示す音声収録部１１４で行なわれる音声収録処理を、コンピュータハードウェア上で実現するコンピュータプログラムのフローチャートである。既に述べたように、映画のキャラクタの台詞の吹替えを行なうことは難しい。たとえばある台詞について、決められた時間で明瞭に発話する必要がある。通常、発話時間が長すぎても短すぎても吹替えとして不適当になる場合がある。ましてや、声優ではないユーザに台詞の吹替えを間違いなく行なわせるのは困難である。そこで、本実施の形態では、様々な方策を講じてできるだけ正確に所望の台詞音声を収録することができるようにしている。たとえば、図７に示されるように、台詞音声収録時にユーザに提示される入出力装置１１２の画面に、台詞の発話時の映像２４６と、発話すべき台詞のテキスト２４０とを表示し、発話の進行にあわせて伸びるプログレスバー２４２を表示したり、台詞のテキスト２４０のうち、発話が終了しているべき部分２４４の色を、これから発話すべき部分の色と違う色で表示したりする、という方法を採用する。 FIGS. 5 and 6 are flowcharts of a computer program that implements, on computer hardware, the voice recording process performed by the voice recording unit 114 shown in FIG. 2. As already mentioned, dubbing the lines of a movie character is difficult. A line must, for example, be spoken clearly within a set time, and an utterance that is too long or too short is usually unsuitable as a dub. It is all the more difficult to have a user who is not a voice actor dub lines without error. The present embodiment therefore takes various measures so that the desired speech can be recorded as accurately as possible. For example, as shown in FIG. 7, the screen of the input/output device 112 presented to the user during recording displays the video 246 of the scene in which the line is spoken and the text 240 of the line to be spoken, together with a progress bar 242 that extends as the utterance proceeds; within the text 240, the portion 244 that should already have been spoken is shown in a color different from that of the portion still to be spoken.

図５を参照して、このプログラムは、ユーザ情報をユーザ情報入力部１００から受信し所定の記憶領域に保存するステップ１７０と、ステップ１７０に続き、受信したユーザ情報にしたがって、処理対象のユーザに対し、指定されたキャラクタを割当てるステップ１７２と、ステップ１７２に続き、共通の練習用台詞及び対応する標準音声、ステップ１７２で割当てられたキャラクタの台詞及び対応する標準音声を図２に示す台詞情報記憶部７２及び標準音声記憶部７４から抽出するステップ１７４と、ステップ１７４に続き、ユーザ音声テーブルと呼ばれる、ユーザの台詞音声を管理するためのテーブルを生成し、全ての台詞について未収録状態に初期化するステップ１７６とを含む。 Referring to FIG. 5, this program receives user information from user information input unit 100 and saves it in a predetermined storage area. On the other hand, in step 172 for assigning the designated character, and following step 172, the dialogue information storage shown in FIG. 2 shows the common practice dialogue and the corresponding standard voice, the dialogue of the character assigned in step 172 and the corresponding standard voice. Step 174 extracted from the unit 72 and the standard voice storage unit 74, and following step 174, a table for managing the user's speech is created, which is called a user speech table, and all dialogues are initialized to an unrecorded state. Step 176.

ユーザ音声テーブルは、図２に示すユーザ音声ＤＢ１２０の一部を構成する。図８を参照して、ユーザ音声ＤＢ１２０は、ユーザの発話を台詞ごとに収録した音声ファイルを記憶するユーザ音声記憶部２６２と、ユーザ音声記憶部２６２に記憶された音声ファイルの管理を行なうためのユーザ音声テーブル２６０とを含む。 The user voice table constitutes a part of the user voice DB 120 shown in FIG. Referring to FIG. 8, the user voice DB 120 manages a voice file stored in the user voice storage unit 262 and a user voice storage unit 262 that stores a voice file in which a user's speech is recorded for each line. User voice table 260.

ユーザ音声テーブル２６０は、ユーザが吹替えを行なうキャラクタの台詞と、対応するユーザ音声とを管理するためのものであって、先頭にはユーザＩＤが付され、さらに、このキャラクタの台詞の各々について、抽出された台詞の台詞番号と、ユーザによるその台詞の発話の収録が完了したか否かを示す録音フラグと、収録した発話音声データを格納した音声ファイルの名称と、その発話時間とを記憶するためのものである。録音フラグは、１のときに発話音声が収録済であることを示し、０のときには未収録であることを示す。なお、実際には発話開始時間、発話時間などは１秒よりも細かい単位で管理する必要があるが、以下の説明及び図面では、理解を容易にするため、これら時間は秒単位で管理するものとする。 The user voice table 260 manages the lines of the character the user is dubbing and the corresponding user voice. It is headed by the user ID and stores, for each of the character's lines, the line number of the extracted line, a recording flag indicating whether recording of the user's utterance of that line is complete, the name of the voice file containing the recorded utterance data, and the utterance time. A recording flag of 1 indicates that the utterance has been recorded; 0 indicates that it has not. In practice, utterance start times, utterance times and the like must be managed in units finer than one second, but in the following description and drawings they are managed in seconds for ease of understanding.

再び図５を参照して、ステップ１７６では、上記したユーザ音声テーブル２６０が新たに作成され、台詞番号には抽出された台詞に付されている通し番号が、録音フラグには全て０が、音声ファイル名には全て空白が、発話時間には全て０が、それぞれ代入される。 Referring to FIG. 5 again, in step 176, the above-described user voice table 260 is newly created, and the serial number attached to the extracted dialogue is set as the dialogue number, all the recording flags are 0, and the voice file. Blanks are assigned to the names, and 0s are assigned to the utterance times.
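The initialization performed in step 176 can be sketched as follows. The class and function names (`UserVoiceRow`, `UserVoiceTable`, `init_user_voice_table`, `mark_recorded`) are hypothetical; only the fields and their initial values (flag 0, blank file name, utterance time 0) follow the description. `mark_recorded` is an assumed helper showing the kind of update a completed recording would make to a row.

```python
from dataclasses import dataclass, field


@dataclass
class UserVoiceRow:
    line_no: int          # serial number of the extracted line
    recorded: int = 0     # recording flag: 1 = recorded, 0 = not recorded
    voice_file: str = ""  # blank until a recording is saved
    duration: int = 0     # utterance time in seconds (whole seconds, as in the text)


@dataclass
class UserVoiceTable:
    user_id: str
    rows: list = field(default_factory=list)


def init_user_voice_table(user_id, line_numbers):
    """Create the table with every line in the 'not recorded' state."""
    return UserVoiceTable(user_id, [UserVoiceRow(n) for n in line_numbers])


def mark_recorded(table, line_no, voice_file, duration):
    """Store the file name and utterance time for a line and set its flag to 1."""
    for row in table.rows:
        if row.line_no == line_no:
            row.voice_file = voice_file
            row.duration = duration
            row.recorded = 1
            return
    raise KeyError(line_no)
```

For example, `init_user_voice_table("U001", [1, 3, 8])` gives three rows with flag 0, and a later `mark_recorded(table, 3, "u001_0003.wav", 2)` updates only the row for line 3.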

このプログラムはさらに、ステップ１７６に続き、収録に要した時間を測定するためのタイマを起動するステップ１７８と、ユーザ音声テーブル２６０内の先頭の台詞を選択するステップ１８０と、直前のステップで選択された台詞のテキストをユーザの前に置かれたモニタに表示するステップ１８２と、この台詞に対応する標準音声を標準音声記憶部７４から取出し、再生するステップ１８４とを含む。ステップ１８２及び１８４においても、図７に示したような表示が行なわれる。 The program further includes: step 178, following step 176, of starting a timer for measuring the time spent on recording; step 180 of selecting the first line in the user voice table 260; step 182 of displaying the text of the line selected in the preceding step on a monitor placed in front of the user; and step 184 of retrieving the standard voice corresponding to this line from the standard voice storage unit 74 and playing it back. In steps 182 and 184, too, a display like the one shown in FIG. 7 is presented.

このプログラムはさらに、ステップ１８４に続いて、ユーザの発話練習の時間として設けられたステップ１８６と、ステップ１８２に戻って再度練習を行なうか、次のステップに進んでもよいかをアテンダントが判定して入力する判定結果にしたがい、制御の流れを分岐させるステップ１８８とを含む。ステップ１８８での判定結果が再度練習を行なうべきことを示すときには、制御はステップ１８２に戻る。 The program further includes: step 186, following step 184, provided as time for the user to practice the utterance; and step 188 of branching the control flow according to the attendant's input as to whether to return to step 182 and practice again or to proceed to the next step. When the result in step 188 indicates that the practice should be repeated, control returns to step 182.

このプログラムはさらに、ステップ１８８で練習を終了しても良いことを示す入力がされたことに応答して実行され、選択中の台詞を再度表示するステップ１９０と、選択中の台詞の通常の発話速度にしたがって変化するプログレスバーの表示を開始するステップ１９２とを含む。 The program further includes: step 190, executed in response to an input in step 188 indicating that practice may end, of displaying the selected line again; and step 192 of starting the display of a progress bar that advances at the normal utterance speed of the selected line.
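How the progress bar and the two-color text might be driven can be sketched as below. This is an assumption-laden illustration: the patent does not say how progress is computed, so the sketch assumes progress grows linearly with elapsed time over the line's utterance time, and the function names are hypothetical.

```python
def progress_fraction(elapsed, duration):
    """Fraction of the line that should have been spoken so far, clamped to [0, 1]."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    return min(max(elapsed / duration, 0.0), 1.0)


def split_text(text, elapsed, duration):
    """Split the line into (already spoken, still to speak) for two-color display.

    Assumes every character takes the same time, which is a simplification.
    """
    n = int(len(text) * progress_fraction(elapsed, duration))
    return text[:n], text[n:]
```

A display loop would call `split_text` on each refresh and draw the first part in the "already spoken" color; a real system would more plausibly align the split with phoneme timings of the standard voice.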

続いて図６を参照して、このプログラムは、ステップ１９２の次に配置され、ユーザの発話する台詞音声を録音するステップ１９４と、ステップ１９４で録音された台詞音声を再生するステップ１９６と、ステップ１９６で再生された台詞音声の発話時間、発話の明瞭さ及び自然さなどに基づいて、この台詞の収録を完了するか否かについてアテンダントが判定して入力した結果にしたがい、制御の流れを分岐させるステップ１９８と、ステップ１９８においてこの台詞の収録を完了することを示す入力が行なわれたことに応答して実行され、ステップ１９４で録音された音声を音声ファイルとしてユーザ音声記憶部２６２内に保存し、ユーザ音声テーブル２６０内の当該台詞の音声ファイル名欄にその音声ファイル名を、発話時間欄に録音音声の持続時間を、それぞれ代入するステップ２００と、収録フラグに「１」を代入するステップ２０１とを含む。 Referring next to FIG. 6, the program includes: step 194, placed after step 192, of recording the line spoken by the user; step 196 of playing back the speech recorded in step 194; step 198 of branching the control flow according to the attendant's input, made on the basis of the utterance time, clarity, naturalness and so on of the speech played back in step 196, as to whether recording of this line is complete; step 200, executed in response to an input in step 198 indicating that recording of this line is complete, of saving the speech recorded in step 194 as a voice file in the user voice storage unit 262 and writing the voice file name into the voice file name field, and the duration of the recorded speech into the utterance time field, of the corresponding line in the user voice table 260; and step 201 of setting the recording flag to "1".

このプログラムはさらに、ステップ２００の後、対象のキャラクタの次の台詞の選択を試みるステップ２０２と、ステップ２０２で選択を試みた次の台詞が存在しているか否か、すなわち対象のキャラクタの台詞を全て処理したか否かを判定し、その判定結果にしたがって制御の流れを分岐させるステップ２０４と、ステップ２０４においてまだ台詞が残っていると判定されたことに応答して、タイマを参照し、録音開始から所定時間が経過したか否かを判定し、判定結果にしたがって制御の流れを分岐させるステップ２１２とを含む。ステップ２１２においてまだ所定時間が経過していないと判定された場合には、制御は図５のステップ１８２に戻る。 The program further, after step 200, attempts to select the next line of the target character in step 202, and whether or not the next line tried to be selected in step 202 exists, that is, the line of the target character. It is determined whether or not all processing has been performed, and in response to the determination in step 204 that the flow of control is branched in accordance with the determination result, and in step 204, it is determined that speech is still remaining, the timer is referred to, and recording is performed. Determining whether or not a predetermined time has elapsed from the start, and branching the control flow according to the determination result. If it is determined in step 212 that the predetermined time has not yet elapsed, control returns to step 182 in FIG.

このプログラムはさらに、ステップ２０４で対象のキャラクタの全台詞について収録が完了したと判定された場合、及びステップ２１２において所定時間が経過したと判定されたことに応答して実行され、録音した全音声を、対応する台詞のテキストに基づいてセグメンテーションし、音声素片に分解するステップ２０６と、ステップ２０６で生成された素片の各々について、Ｆ０，スペクトル分布など、所定の音響特徴量を算出するステップ２０８と、ステップ２０６で作成された素片を、ステップ２０８で算出された音響特徴量、対応する音素のラベル、及び話者のＩＤとともに素片ＤＢ８２に追加して処理を終了するステップ２１０とを含む。 This program is further executed when it is determined in step 204 that the recording has been completed for all dialogues of the target character, and in response to the determination that the predetermined time has elapsed in step 212, and all recorded voices are recorded. Are segmented on the basis of the corresponding dialogue text and decomposed into speech segments, and for each of the segments generated in step 206, a predetermined acoustic feature quantity such as F0, spectral distribution, etc. is calculated. 208 and the step 210 where the segment created in step 206 is added to the segment DB 82 together with the acoustic feature amount calculated in step 208, the corresponding phoneme label, and the speaker ID. Including.

このプログラムはさらに、ステップ１９８において、録音をやり直すことを示す入力がアテンダントにより行なわれたことに応答して実行され、ステップ１９４で録音された音声データを破棄するステップ２１４と、ステップ２１４の後に配置され、タイマの時間を参照して所定時間が経過したか否かを判定し、判定結果にしたがって制御の流れを分岐させるステップ２１６と、ステップ２１６においてまだ所定時間が経過していないと判定されたときに実行され、どこから処理を再開するかを決めるアテンダントの入力にしたがって、台詞音声の収録から再開するときにはステップ１９０に、発話の練習から再開するときにはステップ１８２に、それぞれ制御の流れを分岐させるステップ２２０と、ステップ２１６で既に所定時間が経過していると判定されたことに応答して実行され、現在収録中の台詞が必須部分であればステップ２２０に、それ以外であればステップ２０６に、それぞれ制御を分岐させるステップ２１８とを含む。 The program further includes: step 214, executed in response to the attendant entering, in step 198, an input indicating that the recording is to be redone, of discarding the voice data recorded in step 194; step 216, placed after step 214, of referring to the timer to determine whether the predetermined time has elapsed and branching the control flow according to the result; step 220, executed when it is determined in step 216 that the predetermined time has not yet elapsed, of branching control, according to an attendant input that decides where processing resumes, to step 190 when resuming from recording of the line and to step 182 when resuming from utterance practice; and step 218, executed in response to a determination in step 216 that the predetermined time has already elapsed, of branching control to step 220 if the line currently being recorded belongs to the essential portion and to step 206 otherwise.

図９は、図２に示す音声作成部１１８のより詳細なブロック図を示す。図９を参照して、音声作成部１１８は、それぞれ第１の手法〜第９の手法によって台詞音声を生成するための第１〜第９の音声生成部３００，３０２，３０４，３０６，３０８，３１０，３１２，３１４，及び３１６と、合成手法決定部１１６によって決定された手法にしたがって、第１〜第９の音声生成部３００，３０２，３０４，３０６，３０８，３１０，３１２，３１４，及び３１６のいずれかを選択的に能動化し、ユーザ音声を与えて指定した手法で音声を生成させる分岐部２８０と、合成手法決定部１１６によって決定された手法にしたがい、分岐部２８０によって選択された音声生成部の出力である台詞音声データを選択して共通の出力に出力する合流部２９２と、合流部２９２により出力される台詞音声データに対し、カット情報記憶部７６に記憶されたカット情報にしたがって指定される音響効果を付加して出力する音声信号処理部３２０とを含む。 FIG. 9 is a more detailed block diagram of the voice creation unit 118 shown in FIG. 2. Referring to FIG. 9, the voice creation unit 118 includes: first to ninth voice generation units 300, 302, 304, 306, 308, 310, 312, 314, and 316 for generating speech by the first to ninth methods, respectively; a branching unit 280 that, in accordance with the method determined by the synthesis method determination unit 116, selectively activates one of the first to ninth voice generation units, supplies it with the user voice, and causes it to generate speech by the designated method; a merging unit 292 that, in accordance with the method determined by the synthesis method determination unit 116, selects the speech data output by the voice generation unit selected by the branching unit 280 and passes it to a common output; and a voice signal processing unit 320 that adds, to the speech data output by the merging unit 292, the sound effect designated by the cut information stored in the cut information storage unit 76 and outputs the result.

第１の音声生成部３００は、ある台詞についてユーザの台詞音声を収録することができたときの手法である。この場合には、原則として収録した音声をそのまま使用する。 The first sound generation unit 300 is a technique when a user's line sound can be recorded for a certain line. In this case, in principle, the recorded voice is used as it is.

第２の音声生成部３０２も、ある台詞についてユーザの台詞音声を収録することができたときの手法である。ただし、この手法では、収録した台詞音声の発話速度を調整して台詞音声を生成する。 The second sound generation unit 302 is also a technique when the user's line sound can be recorded for a certain line. However, in this method, speech is generated by adjusting the utterance speed of the recorded speech.

第３の音声生成部３０４は、台詞のうち、一部についてユーザの台詞音声を収録することができなかったときにも有効な手法である。この手法では、収録することができた台詞についてはユーザの台詞音声の話速変換をして台詞音声を生成する。収録することができなかった台詞については、ユーザの音声を使用せず、標準音声記憶部７４に記憶された標準音声のうち、ユーザ情報に合致した台詞音声（性、年齢など）を用いる。 The third sound generation unit 304 is an effective technique even when the user's speech is not recorded for some of the speech. In this method, the speech that has been recorded is converted to the speech speed of the user's speech, and speech is generated. For speech that could not be recorded, speech of the user's voice (sex, age, etc.) that matches the user information out of the standard speech stored in the standard speech storage unit 74 is used without using the user's speech.

第４の音声生成部３０６も、台詞のうち、一部についてユーザの台詞音声を収録することができなかったときにも有効な手法である。この手法では、収録することができた台詞についてはユーザの台詞音声の話速変換をして台詞音声を生成する。収録することができなかった台詞については、声優音声ＤＢ８０に記憶されている声優による台詞音声のうち、ユーザの音声に最も近い声質を持つ声優の台詞音声が採用される。このときの声優音声の決定には、練習用台詞から得られたユーザ音声の８種の特徴量（基本周波数、スペクトル分布など）について、比較対象となる音声との間で算出される距離の線形和を用いた声質間の距離比較が用いられる。この際の、これら特徴量の線形和の係数には、図１に示す線形結合係数記憶部９４に記憶された線形結合係数が使用される。ただし、本実施の形態では、この比較は図２に示す合成手法決定部１１６によって行なわれ、その結果である声優音声の識別子が類似声優記憶部１３０に記憶されている。第４の音声生成部３０６はこの情報を利用する。後述する第５の音声生成部３０８、第７の音声生成部３１２、及び第８の音声生成部３１４でも同様である。この類似した声優音声の決定の際には、ユーザの発話のうち、必須発話部分の音声と、声優音声の対応する台詞音声とが使用される。 The fourth voice generation unit 306 likewise implements a method that remains effective when the user's speech could not be recorded for some of the lines. For lines that were recorded, this method generates the speech by converting the speech rate of the user's recording. For lines that could not be recorded, it adopts, from the speech by voice actors stored in the voice actor voice DB 80, the speech of the voice actor whose voice quality is closest to the user's. The voice actor voice is chosen by comparing distances between voice qualities, using a linear sum of the distances computed, for eight kinds of features (fundamental frequency, spectral distribution and so on) of the user voice obtained from the practice lines, between the user voice and the voice being compared. The coefficients of this linear sum are the linear combination coefficients stored in the linear combination coefficient storage unit 94 shown in FIG. 1. In the present embodiment, however, the comparison is performed by the synthesis method determination unit 116 shown in FIG. 2, and the resulting voice actor voice identifiers are stored in the similar voice actor storage unit 130. The fourth voice generation unit 306 uses this information. The same applies to the fifth voice generation unit 308, the seventh voice generation unit 312, and the eighth voice generation unit 314 described later. To determine the similar voice actor voices, the speech of the essential utterance portion among the user's utterances and the corresponding speech of each voice actor are used.
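The selection by weighted linear sum of per-feature distances can be sketched as follows. Several things here are assumptions made for illustration: the per-feature distance is taken to be Euclidean, only two features are used instead of the eight in the description, and the feature names, weights, and voice data are invented. The function names are hypothetical.

```python
import math


def feature_distance(a, b):
    """Distance between two feature vectors of one category (Euclidean, by assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def voice_distance(user, actor, weights):
    """Weighted linear sum of the per-feature distances between two voices."""
    return sum(w * feature_distance(user[name], actor[name])
               for name, w in weights.items())


def most_similar(user, actors, weights, k=1):
    """Return the identifiers of the k voice actor voices closest to the user voice."""
    ranked = sorted(actors, key=lambda vid: voice_distance(user, actors[vid], weights))
    return ranked[:k]


# Invented example data: weights play the role of the stored linear combination coefficients.
WEIGHTS = {"f0": 0.7, "spectrum": 0.3}
USER = {"f0": [120.0], "spectrum": [0.2, 0.4]}
ACTORS = {
    "va1": {"f0": [200.0], "spectrum": [0.2, 0.4]},
    "va2": {"f0": [125.0], "spectrum": [0.3, 0.4]},
    "va3": {"f0": [118.0], "spectrum": [0.9, 0.1]},
}
```

With `k=1` this corresponds to picking the single closest voice, and with `k=3` to picking the three voices used for morphing. In practice the features have very different scales, so the stored coefficients would also have to absorb that normalization.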

第５の音声生成部３０８も、台詞のうち、一部についてユーザの台詞音声を収録することができなかったときに有効な手法である。この手法では、収録することができた台詞についてはユーザの台詞音声の話速変換をして台詞音声を生成する。収録することができなかった台詞については、声優音声ＤＢ８０に記憶されている声優による台詞音声のうち、ユーザ音声と声質が最も類似のもの３個を特定し、その台詞音声にさらにユーザの声質を反映させた声質変換を行なって台詞音声とする。類似した声優音声の特定には、前述したとおり類似声優記憶部１３０に記憶された声優音声の識別子を用いる。声質変換は、類似声優記憶部１３０に格納された識別子に対応する３名の声優音声に対し、モーフィング率記憶部１３２に記憶された、合成手法決定部１１６によって決定されたモーフィング率ベクトルを用いた、音声分析変換合成システムＳＴＲＡＩＧＨＴ（http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/）によるモーフィングによって行なわれる。 The fifth voice generation unit 308 likewise implements a method that is effective when the user's speech could not be recorded for some of the lines. For lines that were recorded, this method generates the speech by converting the speech rate of the user's recording. For lines that could not be recorded, it identifies, among the speech by voice actors stored in the voice actor voice DB 80, the three voices whose quality is most similar to the user voice, and applies voice quality conversion that further reflects the user's voice quality to produce the speech. As described above, the similar voice actor voices are identified using the voice actor voice identifiers stored in the similar voice actor storage unit 130. The voice quality conversion is performed by morphing the three voice actor voices corresponding to those identifiers with the speech analysis, conversion and synthesis system STRAIGHT (http://www.wakayama-u.ac.jp/~kawahara/STRAIGHTadv/), using the morphing rate vector determined by the synthesis method determination unit 116 and stored in the morphing rate storage unit 132.

第６の音声生成部３１０も、台詞のうち、一部についてユーザの台詞音声を収録することができなかったときに有効な手法である。この手法では、収録することができた台詞についてはユーザの台詞音声の話速変換をして台詞音声を生成する。収録することができなかった台詞については、収録できたユーザ音声から生成した音声素片のうち、母音の音声素片と、素片ＤＢ８２に記憶されている全子音の音声素片のうち、ユーザの音声に類似した特徴量を持つ音声素片とを用いて音声合成をする。発話の個人的特徴は主として母音に現れるので、このような音声合成をすることによって、かなりユーザの音声に似た声質の合成音声を生成することができる。 The sixth sound generation unit 310 is also an effective technique when the user's speech is not recorded for some of the speech. In this method, the speech that has been recorded is converted to the speech speed of the user's speech, and speech is generated. For dialogue that could not be recorded, among the speech segments generated from the recorded user speech, among the speech units of the vowels and the speech units of all consonants stored in the segment DB 82, the user Speech synthesis is performed using speech segments having feature quantities similar to the speech. Since the personal characteristics of the utterance appear mainly in the vowels, by performing such speech synthesis, it is possible to generate synthesized speech with a voice quality much like the user's speech.

第７の音声生成部３１２は、必須発話部分以外の台詞音声が全く収録できなかったときに採用される手法である。この手法では、声優音声ＤＢ８０に記憶されている声優音声のうち、ユーザ音声と最も声質が類似した音声が台詞音声として使用される。 The seventh voice generation unit 312 is a method employed when no speech other than the essential utterance part can be recorded. In this method, a voice whose voice quality is most similar to the user voice among voice actor voices stored in the voice actor voice DB 80 is used as a speech voice.

第８の音声生成部３１４も、必須発話部分以外の台詞音声が全く収録できなかったときに有効な手法である。この手法では、声優音声ＤＢ８０に記憶されている声優音声のうち、ユーザ音声と最も声質が類似した音声を用い、その声優音声にさらにユーザ音声の声質を用いた声質変換を行なって台詞音声を生成する。 The eighth voice generation unit 314 likewise implements a method that is effective when no speech other than the essential utterance portion could be recorded. This method takes, from the voice actor voices stored in the voice actor voice DB 80, the voice whose quality is most similar to the user voice, and generates the speech by further applying voice quality conversion based on the user's voice quality to that voice actor voice.

第９の音声生成部３１６も、必須発話部分以外の台詞音声が全く収録できなかったときに有効な手法である。この手法では、必須発話部分について収録したユーザ音声から生成した音声素片のうち、母音の音声素片と、素片ＤＢ８２に記憶されている子音の音声素片のうち、ユーザの音声に類似した特徴量を持つ音声素片とを用いて音声合成をする。前述のとおり、このような音声合成をすることによって、かなりユーザの音声に似た声質の合成音声を生成することができる。 The ninth voice generation unit 316 is also an effective technique when no speech other than the essential utterance part can be recorded. In this method, among speech units generated from user speech recorded for essential utterance parts, vowel speech units and consonant speech units stored in the unit DB 82 are similar to the user's speech. Speech synthesis is performed using speech segments having feature quantities. As described above, by performing such speech synthesis, synthesized speech having a voice quality much similar to that of the user can be generated.

以上の各手法の説明から明らかなように、台詞情報記憶部７２に記憶された台詞情報は第１〜第９の音声生成部３００，３０２，３０４，３０６，３０８，３１０，３１２，３１４，及び３１６の全てにより参照される。標準音声記憶部７４に記憶された標準音声は、第３の音声生成部３０４に参照される。声優音声ＤＢ８０に記憶された声優音声は、第４の音声生成部３０６、第５の音声生成部３０８、第７の音声生成部３１２、及び第８の音声生成部３１４により参照される。素片ＤＢ８２は、第６の音声生成部３１０、及び第９の音声生成部３１６により参照される。類似声優記憶部１３０は、第４の音声生成部３０６、第５の音声生成部３０８、第７の音声生成部３１２、及び第８の音声生成部３１４により参照される。モーフィング率記憶部１３２は、第５の音声生成部３０８及び第８の音声生成部３１４により参照される。 As is clear from the description of each method described above, the dialogue information stored in the dialogue information storage unit 72 is the first to ninth speech generation units 300, 302, 304, 306, 308, 310, 312, 314, and Referenced by all of 316. The standard voice stored in the standard voice storage unit 74 is referred to by the third voice generation unit 304. The voice actor voice stored in the voice actor voice DB 80 is referred to by the fourth voice generator 306, the fifth voice generator 308, the seventh voice generator 312, and the eighth voice generator 314. The element DB 82 is referred to by the sixth sound generation unit 310 and the ninth sound generation unit 316. The similar voice actor storage unit 130 is referred to by the fourth voice generation unit 306, the fifth voice generation unit 308, the seventh voice generation unit 312, and the eighth voice generation unit 314. The morphing rate storage unit 132 is referred to by the fifth sound generation unit 308 and the eighth sound generation unit 314.

図１０は、図２に示す合成手法決定部１１６で行なわれる音声の生成手法の決定処理を、コンピュータハードウェア上で実現するコンピュータプログラムのフローチャートである。図１０を参照して、このプログラムは、ユーザ音声から得られる８種類の音響特徴量の線形和からなる複合音響特徴量と、声優音声ＤＢ８０に記憶された各声優音声から得られる同じ複合音響特徴量とを用いた距離比較により、ユーザ音声に最もよく似た３つの声優音声を特定し、これら声優音声の識別子を図９に示す類似声優記憶部１３０に記憶するステップ３３０と、ステップ３３０で特定された３つの声優音声を用いたモーフィングにより、ユーザ音声の声質に似た音声を合成する際の、モーフィング率ベクトルを算出し、図９に示すモーフィング率記憶部１３２に記憶させるステップ３３２とを含む。ステップ３３０及びステップ３３２の詳細については後述する。 FIG. 10 is a flowchart of a computer program that implements, on computer hardware, the process performed by the synthesis method determination unit 116 shown in FIG. 2 for determining the voice generation method. Referring to FIG. 10, the program includes: step 330 of identifying the three voice actor voices most similar to the user voice by comparing distances between a composite acoustic feature, formed as a linear sum of eight kinds of acoustic features obtained from the user voice, and the same composite acoustic feature obtained from each voice actor voice stored in the voice actor voice DB 80, and storing the identifiers of those voice actor voices in the similar voice actor storage unit 130 shown in FIG. 9; and step 332 of calculating the morphing rate vector to be used when a voice resembling the user's voice quality is synthesized by morphing the three voice actor voices identified in step 330, and storing it in the morphing rate storage unit 132 shown in FIG. 9. Steps 330 and 332 are described in detail later.

このプログラムはさらに、以下の繰返しを制御するための変数ｉに０を代入するステップ３４０と、変数ｉに１を加算するステップ３４２と、変数ｉの値が台詞の数ＭＡＸを超えたか否かを判定し、超えた場合には処理を終了するステップ３４４と、ステップ３４４で変数ｉの値がＭＡＸ以下であると判定されたことに応答して実行され、台詞番号がｉの台詞（以下これを「台詞（ｉ）」と書く。）に対応する手法リストを手法リストテーブル７８から読出し、作業用のリスト変数ＷＬＩＳＴに格納するステップ３４６とを含む。 The program further includes step 340 for substituting 0 for variable i for controlling the following iteration, step 342 for adding 1 to variable i, and whether or not the value of variable i exceeds the number of lines MAX. If it is determined that the value of the variable i is determined to be less than or equal to MAX in step 344, the process is terminated. And a step 346 of reading out a method list corresponding to “line (i)” from the method list table 78 and storing it in the work list variable WLIST.

手法リストテーブル７８の詳細を図１１に示す。図１１を参照して、手法リストテーブル７８は、台詞番号ごとに、利用可能な手法の識別子をリストした手法リストを含む。通常は、この手法リストにリストされた手法のいずれかを用いれば必ず台詞を処理できるように手法リストテーブル７８は予め作成されている。ただし、手法リストの中に、利用可能なものが含まれない場合も含めて、たとえば標準音声の台詞音声を出力する、というデフォルトの手法が予め準備されている。 Details of the technique list table 78 are shown in FIG. Referring to FIG. 11, method list table 78 includes a method list that lists identifiers of available methods for each line number. Normally, the method list table 78 is created in advance so that dialogue can be processed without fail using any of the methods listed in this method list. However, a default method of outputting, for example, standard speech speech is prepared in advance, including cases where usable methods are not included in the method list.

Referring again to FIG. 10, the program further includes step 348, placed after step 346, of assigning the number of elements of the list variable WLIST to a variable CMAX, and step 350, following step 348, of assigning 0 to a variable j that controls the following iteration. Note that the elements of a list variable are generally indexed from 0.

The program further includes step 352, following step 350, of determining whether j + 1 exceeds CMAX and branching the control flow according to the result, and step 354, executed when step 352 finds j + 1 to be CMAX or less, of determining whether the method indicated by list element WLIST[j] can be carried out with the given user speech and branching control according to the result. Whether each method can be adopted depends on the recording status of the line speech being processed. Basically, the first and second methods can be used only if the corresponding line speech has been recorded, whereas the other methods can be used even if it has not. The reason will become clear from the description of each method.

The program further includes: step 356, executed when step 354 finds that the method indicated by WLIST[j] is not usable, of adding 1 to j and returning control to step 352; step 358, executed when step 354 finds that the method indicated by WLIST[j] is usable, of processing line (i) with that method and returning control to step 342; and step 360, executed when step 352 finds j + 1 to be greater than CMAX, of processing line (i) with the default method and returning control to step 342.
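The selection loop of FIG. 10 (steps 340 to 360) can be sketched as follows. This is only an illustrative sketch: the names `choose_method`, `plan_all_lines`, `DEFAULT_METHOD`, and the `is_available` callback are hypothetical and do not appear in the specification, and the availability check of step 354 is left to the caller.

```python
from typing import Callable, Dict, Sequence

# Hypothetical identifier for the default method (e.g. output the standard voice).
DEFAULT_METHOD = "default"


def choose_method(wlist: Sequence[str], is_available: Callable[[str], bool]) -> str:
    """Return the first usable method in wlist, or DEFAULT_METHOD if none is usable."""
    cmax = len(wlist)                  # step 348: number of elements in WLIST
    j = 0                              # step 350
    while j + 1 <= cmax:               # step 352: stop once j + 1 exceeds CMAX
        if is_available(wlist[j]):     # step 354: can WLIST[j] be carried out?
            return wlist[j]            # step 358: process line (i) with this method
        j += 1                         # step 356: try the next method
    return DEFAULT_METHOD              # step 360: fall back to the default method


def plan_all_lines(
    method_table: Dict[int, Sequence[str]],
    is_available: Callable[[int, str], bool],
) -> Dict[int, str]:
    """Pick a method for every line number in the table (outer loop, steps 340 to 346)."""
    plan = {}
    for i in sorted(method_table):
        plan[i] = choose_method(method_table[i], lambda m, i=i: is_available(i, m))
    return plan
```

Because the list is scanned from index 0, the order of identifiers in each method list acts as a priority order, and an empty list falls through to the default method.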

図１２は、図２に示すカット情報記憶部７６の構成を示す。図１２を参照して、カット情報記憶部７６は、台詞ごとに、台詞番号と、その台詞に対して適用すべき音響効果を列挙した音響効果リストとを記憶している。ある台詞について音響効果を加えようとする場合、音声信号処理部３２０は、このカット情報記憶部７６の、処理対象の台詞番号に対応する音響効果リストを調べ、それを順に先頭から実行する。 FIG. 12 shows the configuration of the cut information storage unit 76 shown in FIG. Referring to FIG. 12, the cut information storage unit 76 stores, for each dialogue, a dialogue number and an acoustic effect list listing the acoustic effects to be applied to the dialogue. When an acoustic effect is to be applied to a certain dialogue, the audio signal processing unit 320 examines the acoustic effect list corresponding to the dialogue number to be processed in the cut information storage unit 76, and sequentially executes them from the top.

FIG. 13 is a flowchart of a program that implements the first speech generation unit 300 shown in FIG. 9. Referring to FIG. 13, the program includes step 380 of reading line speech (i) from the user speech DB, and the process ends with step 380. The line speech (i) thus read is passed to the audio signal processing unit 320 and processed there. The processing of the audio signal processing unit 320 is described in detail later with reference to FIG. 22.

この第１の手法は、対象となる台詞についてユーザの音声を収録することができたときの手法であり、台詞音声としてユーザの音声をそのまま使用する。 This first technique is a technique when the user's voice can be recorded for the target dialogue, and the user's voice is used as it is as the dialogue voice.

FIG. 14 is a flowchart showing the control structure of a program that implements the second speech generation unit 302 shown in FIG. 9. Referring to FIG. 14, the program includes: step 410 of reading the user's line speech (i) and its utterance time ti from the user speech DB 120; step 412, following step 410, of reading the utterance time Ti of line (i) from the line information table; and step 414 of converting the speech rate of the user's line speech (i), using ti read in step 410 and Ti read in step 412, so that its utterance time changes from ti to Ti, and then ending the process.
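The duration change from ti to Ti can be illustrated with a deliberately naive sketch. The specification relies on existing speech-rate conversion techniques, which preserve pitch; the linear-interpolation resampling below does not, and is shown only to make the ti-to-Ti relationship concrete. The function name `stretch_to_duration` and the list-of-floats sample format are assumptions.

```python
from typing import List, Sequence


def stretch_to_duration(samples: Sequence[float], t_i: float, T_i: float) -> List[float]:
    """Resample a clip lasting t_i seconds so that it lasts T_i seconds at the same sample rate.

    Naive linear interpolation: this also shifts pitch, unlike a real
    speech-rate conversion method.
    """
    if t_i <= 0 or T_i <= 0 or len(samples) == 0:
        raise ValueError("durations must be positive and samples non-empty")
    n_in = len(samples)
    n_out = max(1, round(n_in * T_i / t_i))   # output length scales with Ti / ti
    if n_in == 1 or n_out == 1:
        return [float(samples[0])] * n_out
    step = (n_in - 1) / (n_out - 1)           # input position advanced per output sample
    out = []
    for n in range(n_out):
        pos = n * step
        lo = min(int(pos), n_in - 1)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

For example, a 2-second clip stretched to 4 seconds yields twice as many samples, with the first and last samples unchanged.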

FIG. 15 is a flowchart of a program that implements the third speech generation unit 304 shown in FIG. 9. Referring to FIG. 15, the program includes step 440 of reading recording flag (i) from the user speech table 260 of the user speech DB 120, and step 442 of determining whether the value of the recording flag read in step 440 is 1 and branching the control flow according to the result.

The program further includes: step 444, executed when step 442 finds that the recording flag is not 1 (that is, the user's speech could not be recorded for this line), of reading the standard speech for line (i) from the standard speech storage unit 74, outputting it as line speech (i), and ending the process; step 446, executed when step 442 finds that the recording flag is 1, of reading line speech (i) and its utterance time ti from the user speech DB 120; step 448 of reading the utterance time Ti of line (i) from the line information table stored in the line information storage unit 72; and step 450 of converting the speech rate of the user's line speech (i), using ti and Ti read in steps 446 and 448, so that its utterance time becomes Ti, outputting the result, and ending the process.

FIG. 16 is a flowchart showing the control structure of a program that implements the fourth speech generation unit 306 shown in FIG. 9. Referring to FIG. 16, the program includes: step 470 of reading recording flag (i) for the i-th line speech from the user speech table 260 of the user speech DB 120; step 472 of branching the control flow according to whether the recording flag (i) read in step 470 is 1; and step 474, executed when step 472 finds that recording flag (i) is not 1 (that is, it is 0), of reading, from among the voice actor voices for line (i) stored in the voice actor speech DB 80, the one whose voice quality is most similar to the user's voice, outputting it as line speech (i), and ending the process. As described above, the voice actor voice identifiers stored in the similar voice actor storage unit 130 (FIG. 9) are used to read a voice actor voice of similar quality. The same holds for step 504 in FIG. 17, step 560 in FIG. 19, and step 580 in FIG. 20, described below, so the explanation is not repeated for those steps.

The program further includes: step 476, executed when step 472 finds that the recording flag is 1, of reading line speech (i) and its utterance time ti from the user speech DB 120; step 478, following step 476, of reading the utterance time Ti of line (i) from the line information table in the line information storage unit 72; and step 480 of converting the speech rate of the user's line speech (i) read in step 476, using ti and Ti, so that its utterance time changes from ti to Ti, outputting the result as line speech (i), and ending the process.

FIG. 17 is a flowchart showing the control structure of a program that implements the fifth speech generation unit 308 shown in FIG. 9. Referring to FIG. 17, the program includes: step 500 of reading recording flag (i) from the user speech DB 120; step 502 of determining whether the value of the recording flag is 1 and branching the control flow according to the result; step 504, executed when step 502 finds that the recording flag is not 1, of identifying, among the voice actor voices for line (i) stored in the voice actor speech DB 80, the one most similar to the user's voice quality; and step 506 of converting the voice quality of the voice actor speech for line (i) identified in step 504 using the features of the user's speech, outputting the result as line speech (i), and ending the process.

The program also includes: step 508, executed when step 502 finds that the recording flag is 1, of reading line speech (i) and its utterance time ti from the user speech DB 120; step 510 of reading the utterance time Ti of line (i) from the line information table in the line information storage unit 72; and step 512 of converting the speech rate of the user's line speech (i) so that its utterance time changes from ti to Ti, outputting the result as line speech (i), and ending the process.

FIG. 18 is a flowchart of a program that implements the sixth speech generation unit 310 shown in FIG. 9. Referring to FIG. 18, the program includes: step 530 of reading recording flag (i) from the user speech DB 120; step 532 of determining whether the value of the recording flag is 1 and branching the control flow according to the result; and step 534, executed when step 532 finds that the recording flag is not 1, of synthesizing speech from line (i), the feature values of the user's speech, the user's vowel speech units, and the consonant speech units in the unit DB 82, and generating and outputting line speech (i).

The program further includes: step 536, executed when step 532 finds that the recording flag is 1, of reading line speech (i) and its utterance time ti from the user speech DB 120; step 538 of reading the utterance time Ti of line (i) from the line information table in the line information storage unit 72; and step 540 of converting the speech rate of the user's line speech (i) so that its utterance time changes from ti to Ti and outputting the result as line speech (i).

図１９は、図９に示す第７の音声生成部３１２を実現するプログラムの制御構造を示すフローチャートである。図１９を参照して、このプログラムは、声優音声ＤＢ８０の台詞（ｉ）の音声の中で、ユーザ音声の声質と最も類似した音声を読出し、台詞音声（ｉ）として出力し、処理を終了するステップ５６０を含む。 FIG. 19 is a flowchart showing a control structure of a program that implements the seventh sound generation unit 312 shown in FIG. Referring to FIG. 19, this program reads the speech most similar to the voice quality of the user speech from speech (i) in voice actor speech DB 80, outputs it as speech speech (i), and ends the processing. Step 560 is included.

FIG. 20 is a flowchart showing the control structure of a program that implements the eighth speech generation unit 314 shown in FIG. 9. Referring to FIG. 20, the program includes step 580 of identifying and reading, from among the voice actor voices for line (i) stored in the voice actor speech DB 80, the one most similar to the voice quality of the user's speech, and step 582 of converting the speech (i) read in step 580 to a voice quality close to the user's, using the features of the speech from the user's mandatory utterance portion, thereby generating and outputting the user's line speech (i).

FIG. 21 is a flowchart of a program that implements the ninth speech generation unit 316 shown in FIG. 9. Referring to FIG. 21, the program includes step 600 of synthesizing speech for line (i) from line (i), the feature values of the user's speech, the user's vowel speech units, and the speech units for all consonants stored in the unit DB 82, outputting the result as line speech (i), and ending the process.

FIG. 22 is a flowchart of a program that implements the audio signal processing unit 320 shown in FIG. 9. The audio signal processing unit 320 processes the line speech (i) output from the merging unit 292 as follows. The program includes: step 382 of reading the acoustic effect list ELIST for line (i) from the cut information storage unit 76; step 384, after step 382, of assigning the number of elements of ELIST to a variable EMAX; step 386, after step 384, of assigning 0 to a variable k that controls the following iteration; step 390, reached after step 386 and after step 388, of determining whether k + 1 is greater than EMAX and branching control according to the result; step 392, executed when step 390 finds k + 1 to be EMAX or less, of applying the acoustic effect ELIST[k] to line speech (i); and step 388, after step 392, of adding 1 to k and returning control to step 390.

The program further includes step 394, executed when step 390 finds k + 1 to be greater than EMAX, of writing line speech (i) to an audio file, and step 396, after step 394, of updating the audio file name for line (i) in the line speech table 88 with the new file name and ending the process.
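The effect loop of FIG. 22 amounts to applying each entry of ELIST in order. The sketch below assumes that an acoustic effect can be modelled as a function from a list of samples to a list of samples; `apply_effects`, `gain`, and `clip` are hypothetical names, and writing the audio file and updating the table (steps 394 and 396) are omitted.

```python
from typing import Callable, List, Sequence

Effect = Callable[[List[float]], List[float]]


def gain(factor: float) -> Effect:
    """Hypothetical effect: multiply every sample by a constant."""
    return lambda samples: [s * factor for s in samples]


def clip(limit: float) -> Effect:
    """Hypothetical effect: limit every sample to the range [-limit, limit]."""
    return lambda samples: [max(-limit, min(limit, s)) for s in samples]


def apply_effects(speech: Sequence[float], elist: Sequence[Effect]) -> List[float]:
    """Apply the effects in elist to speech, in order from the head of the list."""
    result = list(speech)
    emax = len(elist)              # step 384
    k = 0                          # step 386
    while k + 1 <= emax:           # step 390: exit once k + 1 exceeds EMAX
        result = elist[k](result)  # step 392: apply ELIST[k]
        k += 1                     # step 388
    return result
```

Order matters: `apply_effects([1.0, 2.0], [gain(2.0), clip(1.5)])` gives `[1.5, 1.5]`, while the reversed list gives `[2.0, 3.0]`, which is why the list is executed from the head.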

FIG. 24 shows the structure of the line speech table 88 updated in step 396. Referring to FIG. 24, the line speech table 88 contains, for each line, the line number, the playback start time, the playback (utterance) time, the name of the file in the line speech data 86 that stores the speech for that line, and a playback flag. The playback start time is the time at which playback of the line is to begin, measured from the beginning of the movie being created, which is taken as a predetermined reference time. The playback time is the duration for which the line is played. The playback file name is, as already noted, the name of the file in the line speech data 86 that stores the line speech. A playback flag of 0 indicates that the speech is played when the movie is played back, and 1 indicates that it is not. As described later, this flag is used to handle overlapping speech (two or more characters speaking at the same time).

FIG. 23 is a block diagram of a playback system that plays a movie created by the multimedia production system 50 according to the present embodiment. Referring to FIG. 23, the playback system includes: a video signal playback unit 620 that outputs a video signal, a video synchronization signal, and a sound-effect audio signal from the video data 66; a display device 622 that displays video from the video signal played back by the video signal playback unit 620; a sound effect output device 624 that converts the sound-effect audio signal output by the video signal playback unit 620 into sound; and a simultaneous speech integration unit 632. Prior to playback, the simultaneous speech integration unit 632 receives the line speech data 86 and the line speech table 88, detects, from the utterance start time and utterance time of each line stored in the table, combinations of lines whose utterances overlap in time, mixes the speech of their audio files into a new audio file, substitutes it for the audio file of one of the overlapping lines, and sets the playback flag of the other lines to "1". In this way it updates the line speech data 86 and the line speech table 88 so that simultaneously spoken lines are integrated.

再生システムはさらに、再生時に映像信号再生部６２０からの同期信号を受け、台詞音声テーブル８８を参照して、同期信号により示される時刻と一致する発話開始時刻の台詞音声であって、かつ対応する再生フラグが「０」であるものを検出して台詞音声データ８６から読出し、再生して音声信号を出力するための同期再生部６３８と、同期再生部６３８の出力する音声信号を音声に変換して出力するための台詞音声出力装置６４０とを含む。 The reproduction system further receives a synchronization signal from the video signal reproduction unit 620 at the time of reproduction, refers to the dialogue sound table 88, and is a speech sound at an utterance start time coinciding with the time indicated by the synchronization signal, and corresponds. A synchronous reproduction unit 638 for detecting a reproduction flag of “0”, reading out from the speech audio data 86, reproducing and outputting the audio signal, and converting the audio signal output from the synchronous reproduction unit 638 into audio And a speech output device 640 for outputting.

すなわち、この再生システムは、効果音と、台詞音声とを完全に分離して生成し、台詞音声をその発話開始時間の順番にしたがって、順に再生する。そのため、効果音を活かしながら、登場人物の音声と顔画像とをユーザのものに置換した映画を再生できる。 In other words, this playback system generates sound effects and line speech completely separately, and plays the speech in order according to the order of the utterance start times. Therefore, it is possible to play a movie in which the sound of the characters and the face image are replaced with those of the user while utilizing the sound effects.

As stated above, FIG. 24 shows the structure of the line speech table 88. FIG. 25 shows the structure of the line speech table 88 after the simultaneous speech integration unit 632 has integrated the lines in FIG. 24 whose utterance times overlap (lines 1, 2, and 3).

図２５を参照して、台詞音声テーブル８８の構成自体は更新前と同様である。異なっているのは、台詞１の再生時間が７秒から１１秒に増加していること、台詞１の再生ファイル名が「ｗａｖｅ０００１．ｗａｖ」から「ｃｏｍｂ０００１．ｗａｖ」に変更されていること、及び台詞２及び３の再生フラグが「０」から「１」に変更されていることである。これは以下の理由による。 Referring to FIG. 25, the structure itself of dialogue speech table 88 is the same as that before the update. The difference is that the playback time of dialogue 1 has increased from 7 seconds to 11 seconds, the playback file name of dialogue 1 has been changed from “wave0001.wav” to “comb0001.wav”, and This means that the playback flags of lines 2 and 3 have been changed from “0” to “1”. This is due to the following reason.

In the line speech table 88 shown in FIG. 24, line 1 has a playback start time of 0:00:03 and a playback time of 7 seconds, so its playback ends at 0:00:10. Line 2 has a playback start time of 0:00:08 and a playback time of 5 seconds, so its playback ends at 0:00:13. The utterance times of lines 1 and 2 therefore partly overlap. In the present embodiment, lines whose utterance periods overlap in this way have their speech integrated into a new audio file, which replaces the audio file of one of the lines (normally the one with the earlier playback start time), and the utterance time of that line is updated to that of the new audio file. The playback flag of the other line is set to 1.

In the example of FIG. 24, the playback times of lines 1, 2, and 3 overlap, so they are integrated; as a result, as shown in FIG. 25, the playback time of line 1 becomes 11 seconds and the playback flags of lines 2 and 3 become 1 (that is, not played).

FIG. 26 is a flowchart of a program that implements the simultaneous speech integration unit 632. Referring to FIG. 26, the program includes: step 660 of assigning an initial value of 0 to a variable X representing the line number of the line being processed; step 662 of adding 1 to X; and step 664 of determining, after step 662, whether line speech exists for the X-th line (X) (that is, whether all line speech has been processed) and branching control according to the result. If step 664 finds that all line speech has been processed, the process ends.

このプログラムはさらに、ステップ６６４において台詞音声（Ｘ）が存在すると判定されたことに応答して実行され、台詞音声テーブル８８のその台詞音声（Ｘ）の再生フラグの値が０か否かを判定し、判定結果に応じて制御を分岐させるステップ６６６を含む。ステップ６６６において再生フラグが０でないと判定された場合、台詞音声（Ｘ）を再生する必要はない。したがってこの場合、制御はステップ６６２に戻り、次の台詞音声の処理に移る。 This program is further executed in response to the determination that the speech line (X) is present in step 664, and determines whether or not the value of the playback flag of the speech line (X) in the speech line table 88 is 0. And step 666 of branching the control according to the determination result. If it is determined in step 666 that the playback flag is not 0, it is not necessary to play the speech (X). Therefore, in this case, the control returns to step 662 and proceeds to the next dialogue speech processing.

The program further includes: step 668, executed when step 666 finds that the playback flag of line speech (X) is 0, of assigning the value of X to a variable Y, which holds the line number of the line speech to be checked for overlap with line speech (X); step 670, after step 668, of adding 1 to Y; and step 672 of determining, after step 670, whether line speech (Y) exists, that is, whether every line speech has been checked for overlap with line speech (X), and branching the control flow according to the result. If step 672 finds that the Y-th line does not exist, control returns to step 662.

このプログラムはさらに、ステップ６７２においてＹ番目の台詞音声が存在すると判定されたことに応答して実行され、台詞音声（Ｙ）の再生フラグの値が０か否かを判定し、判定結果に応じて制御の流れを分岐させるステップ６７４を含む。ステップ６７４で台詞音声（Ｙ）の再生フラグの値が０でないと判定されたときには、制御はステップ６７０に戻り、次の台詞音声に対して台詞音声（Ｘ）との重なりを調べる処理に移る。 This program is further executed in response to the determination that the Yth speech line is present in step 672, determines whether the value of the playback flag of the speech line (Y) is 0, and depends on the determination result. Step 674 for branching the control flow. When it is determined in step 674 that the value of the playback flag of the speech line (Y) is not 0, the control returns to step 670, and the process shifts to a process for examining the overlap of the next speech line with the speech line (X).

このプログラムはさらに、ステップ６７４において台詞音声（Ｙ）の再生フラグの値が０であると判定されたことに応答して実行され、台詞音声テーブル８８に記憶された双方の台詞音声の発話開始時間及び発話時間の値に基づき、台詞（Ｘ）と台詞（Ｙ）との発話時間の少なくとも一部が重なっているか否かを判定し、判定結果に応じて制御を分岐させるステップ６７６を含む。ステップ６７６で発話時間が重なっていないと判定された場合には、制御はステップ６７０に戻る。 This program is further executed in response to determining that the value of the playback flag of the speech line (Y) is 0 in step 674, and the speech start times of both speech lines stored in the speech line table 88. And a step 676 of determining whether or not at least a part of the speech time of the dialogue (X) and the dialogue (Y) overlaps based on the value of the speech time and branching the control according to the judgment result. If it is determined in step 676 that the speech times do not overlap, control returns to step 670.

The program further includes: step 678, executed when step 676 finds that the utterance times of line (X) and line (Y) overlap at least in part, of mixing line speech (X) and line speech (Y) into a new line speech and updating the line speech data 86 with it as line speech (X); step 680 of calculating the utterance time t of the new line speech (X) from the utterance time tx of line speech (X) before the overlap correction and the utterance time ty of line speech (Y), and updating the line speech table 88 with t as the new utterance time tx of line speech (X); and step 682, following step 680, of setting the playback flag of line speech (Y) in the line speech table 88 to "1" and returning control to step 670.
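The table updates of FIG. 26 can be sketched as below. Several points are assumptions rather than statements from the specification: the duration rule in step 680 is taken to be "latest end time minus earliest start time", which agrees with the 7-second to 11-second change described for line 1; intervals that merely touch are treated as non-overlapping; the `LineEntry` class, the function names, and the `comb%04d.wav` naming are illustrative; and the timing of lines 3 and 4 in the usage example is hypothetical. Mixing of the audio itself (step 678) is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineEntry:
    number: int
    start: float       # playback start time, in seconds from the start of the movie
    duration: float    # playback (utterance) time in seconds
    filename: str
    flag: int = 0      # playback flag: 0 = play, 1 = do not play


def overlaps(a: LineEntry, b: LineEntry) -> bool:
    """True if the two playback intervals share at least part of their time."""
    return a.start < b.start + b.duration and b.start < a.start + a.duration


def merge_overlaps(table: List[LineEntry]) -> List[LineEntry]:
    """Update the table in place so that overlapping lines are folded into one entry."""
    for x_idx, x in enumerate(table):              # steps 660 to 664
        if x.flag != 0:                            # step 666: already merged elsewhere
            continue
        for y in table[x_idx + 1:]:                # steps 668 to 672
            if y.flag != 0 or not overlaps(x, y):  # steps 674 and 676
                continue
            end = max(x.start + x.duration, y.start + y.duration)
            x.start = min(x.start, y.start)
            x.duration = end - x.start             # step 680 (assumed rule)
            x.filename = f"comb{x.number:04d}.wav"  # stands in for the mixed file of step 678
            y.flag = 1                             # step 682
    return table


# Lines 1 and 2 use the times given for FIG. 24; lines 3 and 4 are hypothetical.
table = merge_overlaps([
    LineEntry(1, 3.0, 7.0, "wave0001.wav"),
    LineEntry(2, 8.0, 5.0, "wave0002.wav"),
    LineEntry(3, 12.0, 2.0, "wave0003.wav"),
    LineEntry(4, 20.0, 3.0, "wave0004.wav"),
])
```

Because line (X) is updated before the next line (Y) is examined, a chain of overlaps such as lines 1, 2, and 3 collapses into a single entry even when line 3 does not overlap the original line 1.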

FIG. 27 is a flowchart of a program that implements the synchronized playback unit 638 shown in FIG. 23. Referring to FIG. 27, the program includes step 700 of reading the synchronization signal supplied by the video signal playback unit 620 shown in FIG. 23, and step 702 of determining whether the time indicated by that signal has reached the utterance start time of any line stored in the line speech table 88 whose playback flag is 0, and branching the control flow according to the result. If step 702 finds that the indicated time is not the playback start time of any line speech, control returns to step 700 and the synchronization signal is read again.

このプログラムはさらに、ステップ７０２において、同期信号により示される時刻がいずれかの台詞音声の発話開始時刻になったと判定されたことに応答して実行され、その台詞音声の再生を開始し、制御をステップ７００に戻すステップ７０４とを含む。 This program is further executed in response to the determination in step 702 that the time indicated by the synchronization signal has reached the speech start time of any speech, starts playback of that speech, and performs control. And step 704 which returns to step 700.
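The check performed in step 702 can be sketched as a function that is called each time a synchronization time is read. The tuple layout of the table rows, the function name `lines_to_start`, and the `started` set are assumptions; the set is added here only so that a line is not started twice when the synchronization time is polled repeatedly.

```python
from typing import List, Set, Tuple

# Each row: (line number, playback start time in seconds, playback flag)
Row = Tuple[int, float, int]


def lines_to_start(table: List[Row], sync_time: float, started: Set[int]) -> List[int]:
    """Return the line numbers whose playback should begin at sync_time (steps 702 and 704)."""
    due = []
    for number, start, flag in table:
        # Only lines with playback flag 0 are played; flag 1 means merged into another line.
        if flag == 0 and number not in started and start <= sync_time:
            due.append(number)
            started.add(number)
    return due
```

A caller would loop over incoming synchronization times (step 700), start playback of every returned line, and then read the next synchronization time.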

図２８は、音声信号処理部３２０が実行する音響効果処理のうち、話速変換と音量正規化処理の内容を説明するための図である。 FIG. 28 is a diagram for explaining the contents of speech speed conversion and volume normalization processing among the sound effect processing executed by the audio signal processing unit 320.

図２８（Ａ）を参照して、話速変換処理とは、台詞の発話時間の基準となる参照音声７２０での発話時間と比較して、収録音声７２２の収録時間が図２８（Ａ）に示されるように短すぎたり、逆に長すぎたりした場合に、この収録音声７２２の話速を変換して参照音声７２０の発話時間と等しい発話時間の補正音声７２４を生成する処理のことである。話速変換には、既存の話速変換技術を使用することができる。 Referring to FIG. 28 (A), the speech speed conversion process is the recording time of recorded voice 722 in FIG. 28 (A) compared to the utterance time of reference voice 720, which is the standard of speech time of dialogue. This is a process of generating a corrected speech 724 having a speech time equal to the speech time of the reference speech 720 by converting the speech speed of the recorded speech 722 when it is too short or too long as shown. . The existing speech speed conversion technology can be used for the speech speed conversion.

FIG. 28(B) illustrates volume normalization. Volume normalization corrects the level of the recorded speech 742 when its average level L1 is too low, as shown in FIG. 28(B), or conversely too high, compared with the average level L0 of the reference speech 740, so that the corrected average level L3 is approximately equal to L0. This processing is performed because in some cases the loudness of speech recorded by different users must not vary, and conversely, in some scenes the loudness must differ from user to user. Existing techniques can be used for volume normalization as well.
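A minimal sketch of volume normalization follows, assuming that the "average level" is the mean absolute amplitude of the samples; the specification does not say how the level is measured, and the names `mean_level` and `normalize_volume` are hypothetical.

```python
from typing import List, Sequence


def mean_level(samples: Sequence[float]) -> float:
    """Average level, taken here as the mean absolute amplitude (an assumption)."""
    return sum(abs(s) for s in samples) / len(samples)


def normalize_volume(recorded: Sequence[float], reference_level: float) -> List[float]:
    """Scale the recorded speech so that its average level matches reference_level (L0)."""
    level = mean_level(recorded)          # L1 in FIG. 28(B)
    if level == 0:
        return list(recorded)             # silence cannot be scaled to a target level
    gain = reference_level / level        # L0 / L1
    return [s * gain for s in recorded]   # result has average level L3, equal to L0
```

The same gain is applied to every sample, so the waveform shape is preserved and only its overall level changes.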

[Realization by computer]
FIG. 29 shows the external appearance of the hardware of the line voice data creation unit 90, which records users' voices in the multimedia production system 50. Referring to FIG. 29, the line voice data creation unit 90 substantially consists of a computer system 830. FIG. 30 shows the internal configuration of the computer system 830.

図２９を参照して、コンピュータシステム８３０は、リムーバブルメモリ用のメモリポート８５２及びＤＶＤ（ＤｉｇｉｔａｌＶｅｒｓａｔｉｌｅＤｉｓｃ）ドライブ８５０を有するコンピュータ８４０と、文字情報及びコマンド操作の入力を行うためのキーボード８４６と、ポインティングデバイスであるマウス８４８と、２台のモニタ８４２及び８４４と、２台のマイクロフォン８６８及び８７０と、２組のスピーカセット８７２及び８７４とを含む。これらのうち、モニタ８４４、スピーカセット８７４、及びマイクロフォン８６８は、コンピュータシステム８３０の本体部分と分離して図２９に示すようにユーザの録音用ブースに設置されており、ユーザの台詞音声の録音時にユーザとの入出力インタフェースとして用いられる。 Referring to FIG. 29, a computer system 830 includes a computer 840 having a memory port 852 for removable memory and a DVD (Digital Versatile Disc) drive 850, a keyboard 846 for inputting character information and command operations, and pointing. The device includes a mouse 848, two monitors 842 and 844, two microphones 868 and 870, and two sets of speakers 872 and 874. Among these, the monitor 844, the speaker set 874, and the microphone 868 are separated from the main body of the computer system 830 and are installed in the user's recording booth as shown in FIG. Used as an input / output interface with the user.

Referring to FIG. 30, the computer 840 includes, in addition to the memory port 852, the DVD drive 850, the microphones 868 and 870, and the speaker sets 872 and 874: a CPU (central processing unit) 856; a bus 866 connected to the CPU 856, the memory port 852, and the DVD drive 850; a read-only memory (ROM) 358 that stores a boot-up program and the like; a random access memory (RAM) 860 that is connected to the bus 866 and stores program instructions, system programs, work data, and the like; and a sound board 884 connected to the bus 866, the microphones 868 and 870, and the speaker sets 872 and 874.

コンピュータ８４０はさらに、他のコンピュータと通信を行なうためのローカルエリアネットワーク（ＬＡＮ）８７６への接続を提供するネットワークインターフェイスカード（ＮＩＣ）８７８を含む。 Computer 840 further includes a network interface card (NIC) 878 that provides a connection to a local area network (LAN) 876 for communicating with other computers.

The computer programs described above, which cause the computer system 830 to operate as the multimedia production system 50, are stored on a DVD 862 or a removable memory 864 inserted into the DVD drive 850 or the memory port 852, respectively, and are then transferred to the hard disk 854. Alternatively, the programs may be transmitted to the computer 840 over a network (not shown) and stored on the hard disk 854. The programs are loaded into the RAM 860 at execution time. The programs may also be loaded into the RAM 860 directly from the DVD 862, from the removable memory 864, or via the network 876.

These programs include a plurality of instructions that cause the computer 840 to operate as the multimedia production system 50 of this embodiment. Some of the basic functions needed for this operation are provided by the operating system (OS) running on the computer 840, by third-party programs, or by modules of the various toolkits for speech processing and statistical model processing installed on the computer 840. The programs therefore need not include every function required to realize the system and method of this embodiment. They need only include those instructions that carry out the operation of the line voice creation device described above by calling the appropriate functions, or the appropriate “tools” in a so-called “toolkit” (a collection of computer programs prepared in advance), in a controlled manner so as to obtain the desired result. The operation of the computer system 830 is well known and is not repeated here.

In the system shown in FIG. 1, the line voice data creation unit 90 includes a plurality of computer systems, each configured like the computer system 830, for recording each user, and one computer system for realizing the voice integration unit 104. The computer system that realizes the voice integration unit 104 has the same hardware configuration as the computer system 830 but needs no microphones or speakers.

In this embodiment, within the video/audio playback device 92 shown in FIG. 23, the video signal playback unit 620 is realized by one computer system, and the simultaneous audio integration processing unit 632 and the synchronous playback unit 638 are realized by another, separate computer system.

本システムで使用されるコンピュータシステムは、いずれもネットワーク８７６を介して互いに通信を行ない、最終的に映像データ６６、台詞音声データ８６、及び台詞音声テーブル８８を映像・音声再生装置９２のハードディスクに作成し、そこから再生を行なう。 All of the computer systems used in this system communicate with each other via the network 876, and finally create the video data 66, the speech audio data 86, and the speech audio table 88 on the hard disk of the video / audio reproduction device 92. And play from there.

[Operation]
The multimedia production system 50 configured as described above operates as follows. The operation of the multimedia production system 50 when selecting the voice actor voice most similar to the user voice, and when determining the morphing rate, is described in detail later. The calculation of the linear combination coefficients used to compute a composite acoustic feature by linearly combining the eight types of acoustic features is also described later. The following description assumes that these coefficients have already been calculated and stored in the linear combination coefficient storage unit 94.

図１を参照して、複数のユーザがマルチメディア製作システム５０を利用するものとして、予め各ユーザには識別情報が割当てられているものとする。また各ユーザには、映画の登場人物の誰と入替わるかが決定されているものとする。 Referring to FIG. 1, it is assumed that a plurality of users use the multimedia production system 50, and identification information is assigned to each user in advance. Further, it is assumed that each user is determined as to whom a character in the movie is to be replaced.

In the multimedia production system 50, the following are stored in advance: video material in the video material DB 70; dialogue information in the dialogue information storage unit 72; for each line, standard voices according to sex (male or female) and age in the standard voice storage unit 74; and sound effect information in the cut information storage unit 76. The voice actor voice DB 80 stores utterances of each line by a plurality of voice actors, organized by line and by voice actor. The voice of each voice actor has been acoustically analyzed in advance, and eight types of acoustic features representing its voice quality have been calculated and stored in the voice actor voice DB 80. The segment DB 82 stores speech segments created by segmenting the standard voices and the voice actor voices. Each speech segment is labeled with the phoneme label of the corresponding phoneme, its acoustic features, the identifier of the original voice, and the identifier of the speaker.

The user information of each user is input via the user information input units 100, 100A, ..., 100N and sent to the image processing PC 62 and to the corresponding one of the character voice creation units 102, 102A, ..., 102N.

三次元スキャナ群６０は、各ユーザの顔をスキャンし、画像処理ＰＣ６２に３次元スキャンデータを送る。以下、画像処理ＰＣ６２はユーザの３次元スキャンデータを用いてユーザの三次元顔モデルを作成し、さらに任意の角度からの３次元顔画像を作成して映像生成装置６４に与える。映像生成装置６４は、登場人物の顔画像を、画像処理ＰＣ６２で作成されたユーザの顔画像で置換し、映像データ６６として出力する。なお、映像データ６６には、音声との同期をとるための同期信号再生用のデータが含まれている。 The three-dimensional scanner group 60 scans each user's face and sends three-dimensional scan data to the image processing PC 62. Thereafter, the image processing PC 62 creates the user's three-dimensional face model using the user's three-dimensional scan data, further creates a three-dimensional face image from an arbitrary angle, and gives it to the video generation device 64. The video generation device 64 replaces the character's face image with the user's face image created by the image processing PC 62, and outputs it as video data 66. Note that the video data 66 includes synchronization signal reproduction data for synchronizing with audio.

Meanwhile, each of the character voice creation units 102, 102A, ..., 102N records the line voices of the corresponding user as described below and, based on the recorded voice, uses the first voice generation unit 300 through the ninth voice generation unit 316 to create and output voice data for the movie that makes use of the user's voice. The processing performed by the first voice generation unit 300 through the ninth voice generation unit 316 is the same in every character voice creation unit. The operation of the character voice creation unit 102 is described below.

図２を参照して、音声収録部１１４は、ユーザ情報をユーザ情報入力部１００から受信し（図５のステップ１７０）、以後の処理ではこのユーザ情報を用いる。続いて、ユーザに割当てられたキャラクタに関する情報が入力される（図５のステップ１７２）。音声収録部１１４は、入力されたキャラクタの台詞に関する台詞情報を台詞情報記憶部７２から読出し、対応する標準音声を標準音声記憶部７４から、対応する映像がもしあれば映像素材ＤＢ７０から、それぞれ読出す（図５のステップ１７４）。音声収録部１１４はさらに、ユーザ音声テーブル２６０を作成し、全ての台詞情報の録音フラグを０に初期化する。 With reference to FIG. 2, the sound recording unit 114 receives user information from the user information input unit 100 (step 170 in FIG. 5), and uses this user information in the subsequent processing. Subsequently, information relating to the character assigned to the user is input (step 172 in FIG. 5). The voice recording unit 114 reads the dialogue information related to the input dialogue of the character from the dialogue information storage unit 72, reads the corresponding standard voice from the standard voice storage unit 74, and if there is a corresponding video from the video material DB 70, respectively. (Step 174 in FIG. 5). The voice recording unit 114 further creates a user voice table 260 and initializes the recording flags of all dialogue information to zero.

音声収録部１１４は、続いてタイマをスタートさせ（ステップ１７８）、台詞の収録を開始する。台詞の収録では、発話対象の台詞を選択し（ステップ１８０）、映像と、台詞情報の表示とを行ない（ステップ１８２）、同時に標準音声の再生を開始する。その結果、入出力装置１１２の画面（モニタ８４４の画面）に図７に示すような表示が行なわれる。この後、ユーザが標準音声をまねて、練習としてその発話を行なう（ステップ１８６）。 Next, the audio recording unit 114 starts a timer (step 178) and starts recording lines. In the recording of dialogue, the dialogue target speech is selected (step 180), video and dialogue information are displayed (step 182), and reproduction of standard audio is started at the same time. As a result, the display as shown in FIG. 7 is performed on the screen of the input / output device 112 (the screen of the monitor 844). Thereafter, the user imitates the standard voice and utters it as practice (step 186).

The attendant, who listens to the user's utterance while operating the computer system 830, decides whether practice of that utterance may end (step 188). If more practice is needed (NO in step 188), the attendant performs an operation that repeats the same processing for that utterance. When it is decided that practice may end (YES in step 188), the selected line and the corresponding video are displayed again (step 190), display of the progress bar begins (step 192), and the user's voice is recorded (step 194).

If the recorded voice has the correct content, is clearly articulated, and has an utterance time within the allowable range, the attendant saves the recorded voice as a voice file in the user voice storage unit 262. In the user voice table 260, whose structure is shown in FIG. 8, the name of the saved voice file is entered in the voice file name field of the row for the line being processed, and the actual utterance time (ti) of the user's line voice is entered in the utterance time field (step 200). The voice recording unit 114 then updates the recording flag of that row to 1 (step 201) and selects the next line (step 202). If the user's line voices have been recorded for all lines (YES in step 204), all of the user's recorded utterances are segmented into phonemes to form speech segments (step 206), the acoustic features of each speech segment are calculated (step 208), and the segments are added to the segment DB 82.

If it is determined in step 204 that line voices have not yet been recorded for all lines, the timer is consulted in step 212 to determine whether the time allotted in advance for recording has been exceeded. If it has, control proceeds to step 206, and the same processing is performed as when recording has been completed for all lines. If the allotted time has not yet been reached, control returns to step 182 in FIG. 5, and the processing described above is repeated for the next line of the character corresponding to this user.

If the attendant determines in step 198 that the recorded voice is unacceptable (for example, its content differs markedly from the intended utterance text, the utterance is unclear, or the utterance time is outside the allowable range), the recorded voice is discarded in step 214. The timer is then checked to determine whether the time allotted for recording has been exceeded (step 216). If it has not, the attendant decides whether to start over from playback of the standard voice for the line being processed (step 182) or simply from recording of the user's utterance (step 190), and enters an instruction accordingly. The voice recording unit 114 branches control according to that instruction (step 220), so that processing resumes from step 182 or step 190.

If, on the other hand, it is determined in step 216 that the time already spent on recording exceeds the allotted time, it is determined in step 218 whether the line currently being recorded belongs to the essential part. If it does, the line must be recorded, so control proceeds to step 220 and recording resumes according to the attendant's decision. If it does not, the recording session should end, so control proceeds to step 206. Thereafter, the same operations as when all lines have been recorded are performed in steps 206, 208, and 210.
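The branching around the timer and the essential part (steps 204 to 220) can be summarized in a short sketch. The function names, the string return values, and the use of seconds are assumptions made for illustration; the attendant's choice between step 182 and step 190 is not modeled.

```python
def after_accepted_take(all_lines_done, elapsed_s, limit_s):
    """Next action after a take has been saved (steps 204 and 212)."""
    if all_lines_done:
        return "finish"        # step 204 YES: go on to step 206
    if elapsed_s > limit_s:
        return "finish"        # step 212: allotted time exceeded
    return "next_line"         # back to step 182 for the next line


def after_rejected_take(elapsed_s, limit_s, is_essential):
    """Next action after a take has been discarded (steps 214 to 220)."""
    if elapsed_s <= limit_s:
        return "retry"         # step 216: time remains, so re-record
    if is_essential:
        return "retry"         # step 218: essential lines are always recorded
    return "finish"            # non-essential and out of time: step 206
```

The sketch makes the asymmetry explicit: once the allotted time has run out, only a line in the essential part can still trigger a retake.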

こうして、音声収録部１１４によって、図８に示すユーザ音声記憶部２６２には、あるキャラクタの台詞についてのユーザの台詞音声の音声ファイルが格納され、ユーザ音声テーブル２６０には各台詞について、録音できたか否かを示す録音フラグと、ユーザ音声記憶部２６２中の対応する音声ファイルの名称と、ユーザによる発話時間とが記録される。 In this way, the voice recording unit 114 stores the voice file of the user's speech for a certain character's speech in the user speech storage unit 262 shown in FIG. 8, and the user speech table 260 can record each speech. The recording flag indicating whether or not, the name of the corresponding voice file in the user voice storage unit 262, and the utterance time by the user are recorded.
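One possible in-memory representation of the user voice table 260 is sketched below. The field names and the use of a dictionary keyed by line number are assumptions; the text only states that each row holds a recording flag, a voice file name, and the utterance time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserVoiceRow:
    line_id: int
    recorded: int = 0                   # recording flag: 0 = not recorded, 1 = recorded
    file_name: Optional[str] = None     # name of the file in the user voice storage
    duration_s: Optional[float] = None  # utterance time ti of the user's take


def new_table(line_ids):
    """Create the table with every recording flag initialized to 0."""
    return {i: UserVoiceRow(i) for i in line_ids}


def store_take(table, line_id, file_name, duration_s):
    """Record an accepted take (steps 200 and 201)."""
    row = table[line_id]
    row.file_name = file_name
    row.duration_s = duration_s
    row.recorded = 1
```

Rows whose flag remains 0 at the end of the session are the lines for which a later stage must fall back to a standard voice, a voice actor voice, or synthesis.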

As a result of each of the character voice creation units 102, 102A, ..., 102N executing the processing described above, each outputs the line voices of its character in the form of the user voice DB 120 (the user voice table 260 and the user voice storage unit 262). The voice integration unit 104 integrates the line voices of the users of these characters so that they can be read out in a predetermined order based on the dialogue information stored in the dialogue information storage unit 72, and outputs the line voice data 86 and the line voice table 88. When recording of the target user's voice is complete, the voice recording unit 114 instructs the synthesis method determination unit 116 to start generating line voices.

In response to this instruction, the corresponding synthesis method determination unit 116 executes the following processing. Referring to FIG. 10, in step 330, the three voice actor voices most similar to the user voice are selected from the voice actor voices stored in the voice actor voice DB 80, and their identifiers are stored in the similar voice actor storage unit 130 (FIG. 2). In step 332, a morphing rate vector r for generating, by morphing from these three voice actor voices, a voice similar in quality to the user voice is estimated and stored in the morphing rate storage unit 132 (FIG. 2). The details of steps 330 and 332 are described later, together with the corresponding processing performed in other processing units. Then, in steps 340 to 344, the first of the lines to be processed is selected. The method list table 78 is searched using the line number of that line as a key, and the method list WLIST for that line is obtained.
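The selection in step 330 can be pictured with the following sketch, which ranks voice actors by a weighted linear sum of per-feature distances, as summarized in the abstract. How each per-feature distance is computed is not covered here; the input dictionary, the function names, and the example values are hypothetical.

```python
def combined_distance(per_feature_distances, weights):
    """Weighted linear sum of the distances computed for each acoustic feature."""
    if len(per_feature_distances) != len(weights):
        raise ValueError("one weight is required per feature distance")
    return sum(w * d for w, d in zip(weights, per_feature_distances))


def most_similar(distances_by_actor, weights, k=3):
    """Return the identifiers of the k voice actors closest to the user voice."""
    scored = sorted(
        (combined_distance(d, weights), actor)
        for actor, d in distances_by_actor.items()
    )
    return [actor for _, actor in scored[:k]]


# Illustrative values only: two features instead of the eight in the text.
example = {"a": [1.0, 1.0], "b": [0.0, 0.2], "c": [2.0, 2.0], "d": [0.4, 0.4]}
print(most_similar(example, [0.5, 0.5]))
```

In the embodiment the weights would be the linear combination coefficients held in the linear combination coefficient storage unit 94, with one coefficient per acoustic feature.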

Next, in steps 348 to 354, the methods listed in the method list WLIST are examined in order from the top, and the first method found to be available is chosen for processing the target line. Information identifying that method is given to the voice creation unit 118, which carries out the processing. The method list is always prepared so that it contains an available method, but even if it does not, a default method is provided so that a line voice can still be generated.
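The search through the method list can be sketched as a first-match loop with a default. The availability test is passed in as a callable because its criteria are not specified here; `is_available` and `default_method` are hypothetical names.

```python
def choose_method(wlist, is_available, default_method):
    """Return the first available method in WLIST, or the default if none is."""
    for method in wlist:              # examined in order from the top
        if is_available(method):
            return method
    return default_method             # safety net when the list has no usable entry
```

For example, `choose_method([5, 4, 7], lambda m: m != 5, 1)` would return 4, because method 5 is unavailable and method 4 is the next entry in the list.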

In this way, for the first of the lines to be processed, the one of the first to ninth voice generation units 300 to 316 in the voice creation unit 118 that corresponds to the selected method is instructed to generate a line voice based on the user's recorded voice. At this time, the synthesis method determination unit 116 controls the branching unit 280 so that the user voice is given to the selected voice generation unit, and controls the merging unit 292 so that the line voice output by that unit is selected and output. Having started generation of the line voice for the first line, the synthesis method determination unit 116 resumes processing from step 342, determines the line voice generation method for the next line, and causes the corresponding voice generation unit to generate the line voice. When line voices have been generated for all lines of the target character, the processing of the synthesis method determination unit 116 ends.

図９を参照して、音声作成部１１８は以下のように動作する。分岐部２８０は、合成手法決定部１１６からの指示にしたがい、指定された音声生成部を能動化し、ユーザ音声を与える。第１の音声生成部３００〜第９の音声生成部３１６のうち、能動化されたものは、与えられたユーザ音声に基づき、それぞれの手法を用いて台詞音声を生成する。出力される台詞音声は合流部２９２によって選択され、音声信号処理部３２０に与えられる。 With reference to FIG. 9, the voice creation unit 118 operates as follows. The branching unit 280 activates the designated voice generation unit according to an instruction from the synthesis method determination unit 116, and gives a user voice. Among the first voice generation unit 300 to the ninth voice generation unit 316, the activated one generates a speech line using each method based on the given user voice. The output speech is selected by the merging unit 292 and given to the audio signal processing unit 320.

ここで、第１の手法が選択された場合、図９に示す第１の音声生成部３００は、ユーザ音声ＤＢ１２０から台詞音声（ｉ）を読出す（ステップ３８０）。ステップ３８０によってこの処理は終了する。 Here, when the first method is selected, the first voice generation unit 300 shown in FIG. 9 reads the line voice (i) from the user voice DB 120 (step 380). Step 380 ends the process.

When the second method is selected, the second voice generation unit 302 shown in FIG. 9 operates as follows. Referring to FIG. 14, the second voice generation unit 302 first reads the user's line voice (i) and its utterance time ti from the user voice DB 120 (step 410). It then reads the utterance time Ti of line (i) from the dialogue information table (step 412). Finally, using the utterance time ti read in step 410 and the utterance time Ti read in step 412, it performs speech rate conversion so that the utterance time of the user's line voice (i) changes from ti to Ti (step 414).
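The quantity that drives the speech rate conversion of step 414 is the ratio of the two utterance times. The sketch below computes only that ratio and the resulting length in samples; the conversion itself is left to an existing technique, as the text states. The function names and the sampling rate used in the example are assumptions.

```python
def rate_factor(t_user, t_target):
    """Time-scale factor: output duration = input duration * factor."""
    if t_user <= 0 or t_target <= 0:
        raise ValueError("utterance times must be positive")
    return t_target / t_user


def stretched_length(n_samples, factor):
    """Number of samples the converted voice should have."""
    return round(n_samples * factor)


# A 2.0 s take (ti) that must fill a 3.0 s slot (Ti) at an assumed 8 kHz rate.
factor = rate_factor(2.0, 3.0)
print(factor, stretched_length(16000, factor))
```

A factor above 1 slows the utterance down and a factor below 1 speeds it up. Simple resampling by this factor would also shift the pitch, which is why a dedicated speech rate conversion technique is needed.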

第３の手法が選択された場合、図９に示す第３の音声生成部３０４は以下のように動作する。図１５を参照して、第３の音声生成部３０４は、まずユーザ音声ＤＢ１２０のユーザ音声テーブル２６０から録音フラグ（ｉ）を読出す（ステップ４４０）。次に第３の音声生成部３０４は、読出された録音フラグの値が１か否かを判定し、録音フラグが１でないときには標準音声記憶部７４から台詞（ｉ）の標準音声を読出し、台詞音声（ｉ）として出力し、処理を終了する（ステップ４４４）。ステップ４４２において録音フラグが１であると判定された場合には、ユーザ音声ＤＢ１２０から台詞音声（ｉ）と発話時間ｔｉとを読出し（ステップ４４６）、台詞情報記憶部７２に記憶された台詞情報テーブルから台詞（ｉ）の発話時間Ｔｉを読出す（ステップ４４８）。そして、ステップ４４６及び４４８でそれぞれ読出された発話時間ｔｉ及びＴｉを用いて、ユーザの台詞音声（ｉ）の発話時間がＴｉとなるように、話速変換を行なって出力する（ステップ４５０）。 When the third method is selected, the third sound generation unit 304 shown in FIG. 9 operates as follows. Referring to FIG. 15, first, the third voice generation unit 304 reads the recording flag (i) from the user voice table 260 of the user voice DB 120 (step 440). Next, the third sound generation unit 304 determines whether or not the value of the read recording flag is 1. When the recording flag is not 1, the third sound generation unit 304 reads the standard sound of the line (i) from the standard sound storage unit 74, and the line The voice (i) is output, and the process ends (step 444). When it is determined in step 442 that the recording flag is 1, the speech information (i) and the speech time ti are read from the user speech DB 120 (step 446), and the speech information table stored in the speech information storage unit 72 is stored. The speech time Ti of the line (i) is read from (step 448). Then, using the utterance times ti and Ti read in steps 446 and 448, respectively, the speech speed is converted and output so that the utterance time of the user's speech (i) becomes Ti (step 450).

When the fourth method is selected, the fourth voice generation unit 306 shown in FIG. 9 operates as follows. Referring to FIG. 16, the fourth voice generation unit 306 reads the recording flag (i) for the i-th line voice from the user voice table 260 of the user voice DB 120 (step 470). If the value of the recording flag (i) read in step 470 is not 1, the voice actor voice of line (i) stored in the voice actor voice DB 80 whose voice quality is most similar to the user voice is read and output as line voice (i) (step 474). The voice actor voice is selected using the voice actor voice identifiers stored in the similar voice actor storage unit 130. The same applies to the processing described below, so the details are not repeated. If it is determined in step 472 that the recording flag is 1, the fourth voice generation unit 306 reads line voice (i) and the utterance time ti from the user voice DB 120 (step 476). It then reads the utterance time Ti of line (i) from the dialogue information table of the dialogue information storage unit 72 (step 478) and, using the utterance times ti and Ti, performs speech rate conversion so that the utterance time of the user's line voice (i) changes from ti to Ti, and outputs the result as line voice (i) (step 480).

When the fifth method is selected, the fifth voice generation unit 308 shown in FIG. 9 operates as follows. Referring to FIG. 17, the fifth voice generation unit 308 reads the recording flag (i) from the user voice DB 120 (step 500). If the value of the flag is not 1, the voice actor voice of line (i) stored in the voice actor voice DB 80 that is most similar to the user's voice quality is identified (step 504). That voice is then converted in voice quality using the features of the user voice and output as line voice (i), and processing ends (step 506). The operation of the voice generation unit 308 when selecting the voice actor voice in step 504 and when converting voice quality in step 506 is described later. If the recording flag is 1 in step 502, the fifth voice generation unit 308 reads line voice (i) and the utterance time ti from the user voice DB 120 (step 508). Next, it reads the utterance time Ti of line (i) from the dialogue information table of the dialogue information storage unit 72 (step 510). Finally, it performs speech rate conversion so that the utterance time of the user's line voice (i) changes from ti to Ti, outputs the result as line voice (i), and ends processing (step 512).

第６の手法が選択された場合、図９に示す第６の音声生成部３１０は以下のように動作する。図１８を参照して、第６の音声生成部３１０は、ユーザ音声ＤＢ１２０から録音フラグ（ｉ）を読出す（ステップ５３０）。この録音フラグの値が１でなければ、台詞（ｉ）、ユーザ音声の特徴量、ユーザの母音の音声素片、素片ＤＢ８２の子音の音声素片を使用して音声合成を行なって台詞音声（ｉ）を生成し出力する（ステップ５３４）。録音フラグ＝１であれば、第６の音声生成部３１０は、ユーザ音声ＤＢ１２０から台詞音声（ｉ）と発話時間ｔｉとを読出す（ステップ５３６）。次に、台詞情報記憶部７２の台詞情報テーブルから台詞（ｉ）の発話時間Ｔｉを読出す（ステップ５３８）。最後に、ユーザの台詞音声（ｉ）の発話時間がｔｉからＴｉとなるようにユーザの台詞音声（ｉ）の話速変換を行なって台詞音声（ｉ）として出力する（ステップ５４０）。 When the sixth method is selected, the sixth sound generation unit 310 illustrated in FIG. 9 operates as follows. Referring to FIG. 18, the sixth sound generation unit 310 reads the recording flag (i) from the user sound DB 120 (step 530). If the value of this recording flag is not 1, speech synthesis is performed by using speech (i), user speech features, user vowel speech units, and consonant speech units of the unit DB 82 to perform speech synthesis. (I) is generated and output (step 534). If the recording flag = 1, the sixth voice generation unit 310 reads the line voice (i) and the utterance time ti from the user voice DB 120 (step 536). Next, the speech time Ti of the dialogue (i) is read from the dialogue information table of the dialogue information storage unit 72 (step 538). Finally, the speech speed of the user's speech (i) is converted so that the speech time of the user's speech (i) is changed from ti to Ti and output as speech (i) (step 540).
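The third to sixth methods share one control flow and differ only in what is output when the line was not recorded: the standard voice, the most similar voice actor voice, a voice-quality-converted voice actor voice, or synthesized speech. A minimal sketch of that shared flow follows. The row layout and the three callables are hypothetical stand-ins for the storage units and converters described above.

```python
def generate_line_voice(row, target_duration, load_user_voice, convert_rate, fallback):
    """Shared flow of the third to sixth methods for one line (i)."""
    if row["recorded"] != 1:
        # Not recorded: each method supplies its own substitute voice.
        return fallback()
    audio = load_user_voice(row["file_name"])   # the user's line voice (i)
    # Recorded: stretch or compress the take from ti to Ti.
    return convert_rate(audio, row["duration_s"], target_duration)
```

Passing the substitute as a callable means it is only evaluated when needed, which matters when the substitute involves voice quality conversion or synthesis.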

第７の手法が選択された場合、図９に示す第７の音声生成部３１２は以下のように動作する。図１９を参照して、第７の音声生成部３１２は、声優音声ＤＢ８０の台詞（ｉ）の音声の中で、ユーザ音声の声質と最も類似した音声を読出し、台詞音声（ｉ）として出力し、処理を終了する（ステップ５６０）。 When the seventh method is selected, the seventh sound generation unit 312 illustrated in FIG. 9 operates as follows. Referring to FIG. 19, the seventh voice generation unit 312 reads the voice most similar to the voice quality of the user voice among the voices of the voice (i) in the voice actor voice DB 80 and outputs the voice as the voice (i). The process is terminated (step 560).

第８の手法が選択された場合、第８の音声生成部３１４は以下のように動作する。図２０を参照して、第８の音声生成部３１４は、声優音声ＤＢ８０に記憶されている、台詞（ｉ）の声優音声のうち、ユーザ音声の声質と最もよく類似した音声を特定し読出す（ステップ５８０）。次に、ステップ５８０で読出された音声（ｉ）を、ユーザの発話必須部分の音声の特徴を用いて、ユーザの声質に近い声質に声質変換することにより、ユーザの台詞音声（ｉ）を生成し出力する（ステップ５８２）。声質変換時の音声生成部３１４の動作は、第５の手法のステップ５０６の場合と同様である。 When the eighth method is selected, the eighth sound generation unit 314 operates as follows. Referring to FIG. 20, the eighth voice generation unit 314 identifies and reads the voice most similar to the voice quality of the user voice among the voice actor voices of line (i) stored in the voice actor voice DB 80. (Step 580). Next, the speech (i) read in step 580 is converted to a voice quality close to the user's voice quality using the voice characteristics of the user's utterance essential part, thereby generating the user's speech voice (i). And output (step 582). The operation of the voice generation unit 314 at the time of voice quality conversion is the same as that in step 506 of the fifth method.

When the ninth method is selected, the ninth speech generation unit 316 shown in FIG. 9 operates as follows. Referring to FIG. 21, the ninth speech generation unit 316 synthesizes speech for line (i) using line (i), the feature quantities of the user's speech, the user's vowel speech units, and the consonant units stored in the unit DB 82, outputs the result as line speech (i), and ends the process (step 600).

The audio signal processing unit 320 shown in FIG. 9 adds, to every line speech output from the merging unit 292, the acoustic effects specified in the cut information storage unit 76, as follows. Referring to FIG. 22, for line speech (i) output by the merging unit 292, the audio signal processing unit 320 reads the acoustic effect list ELIST of line (i) from the cut information storage unit 76 (step 382). It then examines the elements of ELIST in order, applies all the acoustic effects they specify to line speech (i), and writes the processed line speech (i) to an audio file (step 394). Processing such as volume normalization (FIG. 28(B)) is performed at the same time. The audio signal processing unit 320 then updates the audio file name of line (i) in the line speech table 88 to the new file name and ends the process (step 396).

Through this function of the audio signal processing unit 320, the line speech table 88 shown in FIG. 24 and the line speech data 86 with acoustic effects applied are generated.

Once line speech has been created in this way for every line of every character, and the corresponding line speech data 86 and line speech table 88 have been created, the video/audio playback device 92 can play the movie by combining them with the video data 66. At that time, the video signal reproduction unit 620, the simultaneous speech integration processing unit 632, and the synchronous reproduction unit 638 shown in FIG. 23 operate as follows.

First, the simultaneous speech integration processing unit 632, using a program with the control structure shown in FIG. 26, merges the speech of overlapping lines into a single file, updates the audio file names in the line speech table 88 accordingly, and sets to 1 the playback flag of each audio file that no longer needs to be played because of the merge. This yields the final playable line speech data 86 and line speech table 88. The playback start time field of each line in the line speech table 88 records the time at which playback of that line is to begin.

When playback of the movie starts, the video signal reproduction unit 620 reproduces the video signal and an audio signal carrying sound effects such as background sound, and supplies them to the display device 622 and the sound effect output device 624, respectively. The display device 622 reproduces the video signal and displays the video. The sound effect output device 624 converts the sound-effect audio signal into sound. The face images of the characters in this movie have been replaced with the users' face images.

一方、映像信号再生部６２０は、映像信号の再生と同期して映像データ中に記録されている同期データに基づいて、同期信号を生成し同期再生部６３８に与える。 On the other hand, the video signal reproduction unit 620 generates a synchronization signal based on the synchronization data recorded in the video data in synchronization with the reproduction of the video signal and supplies the synchronization signal to the synchronization reproduction unit 638.

The synchronous reproduction unit 638 constantly monitors this synchronization signal. When the time represented by the signal matches the playback start time of a line speech stored in the line speech table 88, it reproduces that line speech and supplies it to the line speech output device 640, which plays it. Each line speech is speech reproduced or synthesized by one of the methods described above. It is basically either the user's own speech as recorded, that speech after speech-rate conversion, or speech selected or synthesized to resemble the user's voice quality as closely as possible. Some lines may of course be played with the standard speech unchanged, but taken as a whole, each character's voice is perceived as resembling the voice quality of the corresponding user.

なお、上の実施の形態の説明では、ユーザの音声を素片化し、素片ＤＢ８２に追加している。しかし本発明はそのような実施の形態には限定されない。例えば、ユーザの音声のうち、高品質に録音できた台詞音声は、声優音声ＤＢ８０に登録するようにしてもよい。こうすることで、多数のユーザの音声を声優音声ＤＢ８０に追加することが可能になり、さまざまな音声を効率よく収集することが可能になる。 In the description of the above embodiment, the user's voice is segmented and added to the segment DB 82. However, the present invention is not limited to such an embodiment. For example, speech speech that has been recorded with high quality among user speech may be registered in the voice actor speech DB 80. By doing so, it becomes possible to add the voices of many users to the voice actor voice DB 80, and it is possible to efficiently collect various voices.

Therefore, the multimedia production system 50 can create and screen a movie in which not only the face images of the characters of a previously prepared movie but also their lines appear to have been replaced with the user's own. As a result, in a multimedia production in which the characters' lines are known, a character's speech can be replaced with the user's voice easily and quickly. A character's speech can also be replaced easily and quickly with speech of a voice quality close to the user's. Furthermore, the voices of many users can be collected and used in that replacement.

[Configuration for selecting similar speech and converting voice quality]
The following describes the process, executed in step 330 of FIG. 10, that determines the voice actor speech similar to the user's speech, and the process, executed in step 506 of FIG. 17 and step 582 of FIG. 20, that converts voice quality (voice quality conversion process). The variables and constants used in the following description (I, i, J, j, M, N, S, s, and so on) are all local to the programs described here and are distinct from variables of the same name appearing in other figures.

First, the principle of determining similar speech is outlined; details follow later. In this embodiment, when determining the voice actor speech similar to the user's speech, the criterion is not simply the difference in one particular acoustic feature but one chosen to be as close as possible to human perception. To that end, several types of acoustic features are computed for each speech, and the similarity between two speeches (called the "perceptual distance" in this specification) is expressed as a linear combination of the distance measures computed from those features. The coefficients α_i of this linear combination are optimized in advance so that the perceptual distance selects similar speech in a way close to human perception. The optimization process is described later.

For the voice actor speech, each of the acoustic features above is computed in advance and stored in association with the identifier of the voice actor speech. When the user's speech is obtained, the same acoustic features are computed from it. The perceptual distance is then computed between the user's features and the features stored for each voice actor, and the voice actor speeches are ranked as similar to the user's speech in ascending order of perceptual distance.

In the embodiment described below, the voice actor speech whose voice quality resembles the user's speech is described as being determined for each utterance. In practice, however, the voice actor speech resembling the user's speech is determined from the recorded utterances as soon as recording of the practice utterances is finished. This allows the morphing rates (described later) used in voice quality conversion to be computed in advance as well. In this embodiment, because of the voice quality conversion process described later, the similar-speech determination process selects the speech of three voice actors as similar to the user's speech. Eight types of acoustic features are used, so the linear combination coefficients α_i are α_1 to α_8.

The acoustic features used in this embodiment are as follows. As noted earlier, this embodiment uses the speech analysis, transformation, and synthesis system STRAIGHT for morphing, so features specific to STRAIGHT are included.
(1) a 25-dimensional feature vector consisting of 12 MFCCs (mel-frequency cepstral coefficients), 12 ΔMFCCs, and 1 Δpower,
(2) the high-order STRAIGHT cepstrum of order 35 and above (CepH), representing the frequency characteristics of the glottal source,
(3) the first-order STRAIGHT cepstrum (Cep1), representing the tilt of the frequency characteristics of the glottal source,
(4) the log spectrum at 2.6 kHz and above (Spectrum),
(5) the aperiodicity index, a STRAIGHT analysis parameter, at 2.6 kHz and below (Ap),
(6) the fundamental frequency (F0),
(7) the first to fourth formant frequencies (Formant), which are important for expressing voice quality, and
(8) the spectral slope of the log spectrum over 0 to 3 kHz (SpecSlope).

２つの音声の間で、これら音響特徴量の各々に関して算出される距離尺度としては、ＤＴＷ（ＤｙｎａｍｉｃＴｉｍｅＷａｒｐｉｎｇ）距離又はＧＭＭ（ＧａｕｓｓｉａｎＭｉｘｔｕｒｅＭｏｄｅｌ）尤度（混合数＝１６）が用いられる。これらのいずれを使用して音響特徴量の距離を組合わせた場合も、音響特徴量を単独で用いた場合よりも好ましい結果が得られた。中でもＤＴＷはＧＭＭを用いた場合よりもよい結果をもたらした。したがって、本実施の形態ではＤＴＷを用いる。 As a distance measure calculated for each of these acoustic features between two speeches, a DTW (Dynamic Time Warping) distance or a GMM (Gaussian Mixture Model) likelihood (mixing number = 16) is used. When any of these is used to combine the distances of the acoustic feature amounts, a more preferable result is obtained than when the acoustic feature amounts are used alone. Above all, DTW gave better results than using GMM. Therefore, DTW is used in this embodiment.
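As an illustration of the DTW distance named above, the following is a minimal sketch. The source does not state the local frame cost, the step pattern, or any path normalization; Euclidean frame distance, the basic three-way step, and no normalization are assumptions made here, and `dtw_distance` is a hypothetical name.

```python
import math

def dtw_distance(x, y):
    """DTW distance between two sequences of feature vectors.

    x and y are lists of equal-dimension float vectors (one per frame).
    Local cost: Euclidean distance between frames (an assumption).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # x advances
                                 cost[i][j - 1],      # y advances
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

Because the alignment may repeat frames, two utterances that differ only in timing get a small distance, which is why DTW suits comparing recordings of different lengths.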

次に、声質変換処理の原理について説明する。本実施の形態で行なう声質変換は、声優音声の中から、上記した知覚的距離が近い音声であるとして選択された複数の声優音声を混合（モーフィング）することにより、さらにユーザの声質に近い音声を合成する処理である。概念的には次のように表すことができる。 Next, the principle of voice quality conversion processing will be described. The voice quality conversion performed in the present embodiment is performed by mixing (morphing) a plurality of voice actor voices selected as voices having a short perceptual distance from voice actor voices, thereby further reproducing voices closer to the voice quality of the user. Is a process of synthesizing. Conceptually, it can be expressed as follows.

図３７に示すように、選択された声優音声が本実施の形態のように３個（第１〜第３の声優音声）の場合を考える。説明を簡単にするために、知覚的距離を算出するために使用される音響特徴量が８個ではなく３個の場合について説明する。この場合、選択された３個の声優音声は、３次元の音響特徴量空間（話者空間）の中の３つの点９９０，９９２及び９９４に対応するということができる。これら音声を混合（モーフィング）した音声は、３点９９０、９９２及び９９４を頂点とする三角形の平面内の点となる。本実施の形態では、これらモーフィング後の音声のうち、ユーザ（ターゲット話者）の音声に対応する点９９８との距離９９７が最も小さくなる点９９６に対応するものを採用する。 As shown in FIG. 37, consider the case where there are three selected voice actor voices (first to third voice actor voices) as in the present embodiment. In order to simplify the description, a case will be described in which the number of acoustic feature values used for calculating the perceptual distance is three instead of eight. In this case, it can be said that the selected three voice actor voices correspond to the three points 990, 992, and 994 in the three-dimensional acoustic feature space (speaker space). A sound obtained by mixing (morphing) these sounds becomes a point in a triangular plane having three points 990, 992, and 994 as vertices. In this embodiment, among the morphed voices, the voice corresponding to the point 996 where the distance 997 to the point 998 corresponding to the voice of the user (target speaker) is the smallest is adopted.

このような処理により、予め声優によって録音されていた台詞音声から、ユーザの声質に近い台詞音声を合成することができる。 Through such processing, speech speech close to the voice quality of the user can be synthesized from speech speech previously recorded by a voice actor.

The number of voice actor speeches selected here is not limited to three; it may be two, or four or more. With two, morphing synthesizes the speech corresponding to the point on the line segment joining the two that is closest to the target speaker. With four or more, the candidates lie not in a plane but inside the tetrahedron defined by the points corresponding to those four voice actor speeches, and morphing synthesizes the speech at the interior point closest to the target speaker's speech. The same applies to five or more.
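The two-voice case can be sketched geometrically as follows. This assumes that speaker space is treated as a Euclidean space and that the mixing ratio is the clamped projection of the target onto the segment; the embodiment's actual morphing-rate estimation is described elsewhere in the source, and `closest_point_on_segment` is a hypothetical helper.

```python
def closest_point_on_segment(p, q, target):
    """Return (t, point) where point = (1 - t) * p + t * q is the point on
    the segment p-q closest to target, with 0 <= t <= 1."""
    d = [qi - pi for pi, qi in zip(p, q)]
    denom = sum(di * di for di in d)
    if denom == 0.0:
        # p and q coincide, so any mixing ratio gives the same point
        return 0.0, list(p)
    t = sum((ti - pi) * di for pi, ti, di in zip(p, target, d)) / denom
    t = max(0.0, min(1.0, t))  # stay on the segment
    return t, [pi + t * di for pi, di in zip(p, d)]
```

Under these assumptions, t plays the role of the mixing ratio of the second voice and 1 - t that of the first.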

The acoustic features need not be plural; a single feature may be used. In that case the linear combination coefficient is a single number rather than a vector. Even then, because the coefficient is chosen so that the resulting ranking correlates highly with the similarity ranking based on human perception, speech of the same voice quality as a human would choose can be selected.

以下、システムの稼動に先立って、上記した類似音声の判定に使用される音響特徴量の距離尺度の線形結合係数を算出する手順について説明する。 Hereinafter, a procedure for calculating a linear combination coefficient of the distance scale of the acoustic feature amount used for the above-described determination of similar speech prior to the operation of the system will be described.

FIG. 32 shows, in flowchart form, the procedure for creating the basic data from which these linear combination coefficients are computed: data indicating the order of similarity of speech as perceived by humans. Referring to FIG. 32, the procedure includes step 920 of preparing previously recorded speech from M speakers and computing their acoustic features, and step 922 of executing the following process 924 for each of the M speakers' speech.

Process 924 includes step 926 of taking the speech being processed as the target speech and having a human listener rank the M speakers' speech by perceived similarity to the target speech, and step 928 of storing that similarity ranking in association with the target speech.

この順位データ作成処理を実行することにより、１つのターゲット音声に対し、１〜Ｍ番までの順位を示すデータが１セットずつ作成される。ターゲット話者として選ばれる話者は全体でＭ人なので、全体では順位データはＭセット作成される。 By executing this rank data creation process, one set of data indicating ranks 1 to M is created for each target voice. Since M speakers are selected as target speakers, M sets of rank data are created as a whole.

本実施の形態では、線形結合係数の信頼性を高めるために、上記した順位データ作成処理を少なくとも２回行なう。すなわち、最終的には１つのターゲット音声に対し、各々が１〜Ｍ番までの順位を示す順位データが２セット、全体では２Ｍセット作成される。 In the present embodiment, the rank data creation process described above is performed at least twice in order to increase the reliability of the linear combination coefficient. That is, finally, two sets of rank data each indicating the ranks of Nos. 1 to M are created for one target voice, and 2M sets in total are created.

このように順位データが作成されると、線形結合係数を算出することができる。図３３に、線形結合係数算出処理を実現するプログラムのフローチャートを示す。 When rank data is created in this way, a linear combination coefficient can be calculated. FIG. 33 shows a flowchart of a program that realizes linear combination coefficient calculation processing.

Referring to FIG. 33, this program includes step 930 of making the acoustic feature data of the M speakers prepared in the rank data creation process available in a storage device in computer-readable form, and step 932 of selecting each of the M speakers in turn as the target speaker and executing process 934 described below.

Process 934 includes step 935 of selecting N speakers (N ≤ M) from the M speakers' speech, and step 936 of reading the two rank data sets created for the current target speaker in the rank data creation process and, from each, computing the relative ranking by which the N speakers selected in step 935 were perceived as similar to the target speaker. Because there are two rank data sets, two relative rankings are produced.

Process 934 further includes step 938 of determining whether the two relative rankings computed in step 936 match, and branching the flow of control according to the result. If the result of step 938 is NO, the rank data for these N speakers is unreliable, so no further processing is performed for this set of N speakers and control jumps to the end of process 934. If the result is YES, the following processing is executed. Because a similarity ranking made by human perception is affected by the subjectivity of the person making it, raising its objectivity as far as possible in this way increases the reliability of the linear combination coefficients finally obtained.

Process 934 further includes step 940, executed when the result of step 938 is YES, of storing in RAM the target speaker, the combination of N speakers selected in step 935, and the ranking of those N speakers, and then ending process 934.
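The consistency check of steps 936 to 940 can be sketched as below. The data layout is an assumption: each rank data set is represented as a list of speaker identifiers ordered from most to least similar, and the function names are hypothetical.

```python
def relative_ranks(full_ranking, subset):
    """Order the speakers in `subset` as they appear in `full_ranking`
    (most similar first); this is the relative ranking of step 936."""
    chosen = set(subset)
    return [speaker for speaker in full_ranking if speaker in chosen]

def consistent_subset(ranking_a, ranking_b, subset):
    """Return the shared relative ranking if both rating sessions agree
    on the order of `subset` (step 938 YES), otherwise None (step 938 NO)."""
    rel_a = relative_ranks(ranking_a, subset)
    rel_b = relative_ranks(ranking_b, subset)
    return rel_a if rel_a == rel_b else None
```

Only subsets for which a ranking is returned would be stored for the later coefficient optimization.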

The program further includes step 942 of setting the linear combination coefficients to an initial value α = (a1, a2, …, a8) prepared in advance with suitable values; step 944 of optimizing α by the Newton method so that the correlation between the ranking by perceptual distance and the relative order of similarity judged by humans is maximized; and step 946 of saving the α optimized in step 944 in the linear combination coefficient storage unit 94 and ending the process. Because the perceptual distance is based on acoustic features, this specification calls it the acoustic similarity. The relative order of similarity judged by humans, by contrast, rests entirely on human perception, and this specification calls it the perceptual similarity.

FIG. 34 is a flowchart of the program, executed in step 944 of FIG. 33, that optimizes the linear combination coefficients α = (a1, a2, …, a8) by the Newton method. Referring to FIG. 34, the program includes step 1000 of preparing neighboring values of α in accordance with the Newton method. This embodiment uses 16 neighboring values. Specifically, for element a1 of α, the first neighboring value α1 is obtained by adding 1 to a1 while keeping all other elements unchanged, and the second neighboring value α2 is obtained by subtracting 1 from a1 while keeping all other elements unchanged. The same is done for each of the remaining elements a2 to a8, giving neighboring values α3 to α16. Step 1000 thus yields 16 neighboring values α1 to α16 of the linear combination coefficients α.
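The construction of the 16 neighboring values in step 1000 can be sketched as follows. The function name `neighbors` and the `step` parameter are illustrative; the source fixes the step at 1.

```python
def neighbors(alpha, step=1.0):
    """Return the neighboring coefficient vectors of `alpha`: for each
    element in turn, one copy with that element increased by `step` and
    one with it decreased by `step`, all other elements unchanged."""
    result = []
    for i in range(len(alpha)):
        for delta in (step, -step):
            candidate = list(alpha)
            candidate[i] += delta
            result.append(candidate)
    return result
```

For an eight-element α this produces α1 to α16 in the order given in the text: the +1 and −1 variants of a1 first, then those of a2, and so on.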

The program further includes step 1002, executed after step 1000, of reading all the combinations of N speakers stored in step 940 of FIG. 33, and step 1004 of executing process 1006, described below, for every combination read.

Process 1006 includes step 1008 of executing step 1010, for each speaker in the combination of N speakers being processed, with each of the linear combination coefficients α and its neighboring values α1 to α16. In step 1010, each of these is used as the linear combination coefficients to compute the perceptual distance between the target speaker and the speaker being processed by the following equations:

$$L(K) = \sum_{i=1}^{8} a_i L_i, \qquad L^{j}(K) = \sum_{i=1}^{8} a_{ij} L_i \quad (1 \le j \le 16)$$

where L(K) is the perceptual distance computed for the K-th speaker with the linear combination coefficients α, L^j(K) is the perceptual distance computed for the K-th speaker with the j-th neighboring value αj, L_i is the distance computed for the i-th acoustic feature between the target speaker and the speaker whose distance is being computed, a_i is the i-th element of α, and a_ij is the i-th element of the j-th neighboring value αj.

ステップ１０１０を実行することにより、Ｎ名の話者の各々について１７種類の知覚的距離が算出される。 By executing step 1010, 17 different perceptual distances are calculated for each of the N speakers.

The program further includes step 1012 of computing, from the 17 perceptual distances (acoustic similarities) computed in step 1008, 17 rankings of the N speakers being processed, and step 1014 of computing and storing Spearman's rank correlation coefficient ρ between each of the 17 rankings computed in step 1012 and the relative order of similarity judged by humans (perceptual similarity), by the following equation:

$$\rho = 1 - \frac{6\sum_{k=1}^{N}(a_k - b_k)^2}{N(N^2-1)}$$

where a_k and b_k are the ranks of the k-th speaker by perceptual similarity and by acoustic similarity, respectively.
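Spearman's rank correlation between the human ranking and a distance-based ranking can be computed as in the sketch below. It uses the standard closed form ρ = 1 − 6 Σ d² / (N(N² − 1)), which assumes that neither ranking contains ties; the function name is hypothetical.

```python
def spearman_rho(a, b):
    """Spearman's rank correlation coefficient between two rankings.

    a[k] and b[k] are the ranks (1..N) of the k-th speaker by perceptual
    similarity and by acoustic similarity. Assumes no tied ranks.
    """
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("rankings must have equal length of at least 2")
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))
```

A value of 1 means the two rankings are identical, and −1 means one is the exact reverse of the other.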

Executing process 1006 for each combination of N speakers yields at most 17 × C(M, N) Spearman rank correlation coefficients ρ, where C(M, N) is the number of ways of choosing N speakers from M.

The program further includes: step 1016, executed after process 1006 completes, of reading from the at most 17 × C(M, N) Spearman rank correlation coefficients ρ stored in step 1014 those computed with the same coefficients, that is, with α or with the same neighboring value αj (1 ≤ j ≤ 16), and computing their mean for α and for each αj; step 1018 of determining whether the mean computed for α is greater than the mean computed for every neighboring value αj, and branching the flow of control according to the result; and step 1020, executed when the result of step 1018 is NO, of replacing α with the neighboring value αx (x = 1 to 16) that had the highest mean rank correlation coefficient and returning control to step 1000. If the result of step 1018 is YES, the program ends and control returns to step 946 of FIG. 33, so the α at that point is stored in the linear combination coefficient storage unit 94.
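The search loop of steps 1000 to 1020 can be sketched as follows. The source calls this the Newton method; as described, it evaluates α and its unit-step neighbors and moves to the best neighbor until α scores strictly higher than all of them. Here `score` is a hypothetical callable standing in for the mean Spearman correlation of step 1016, and `max_iter` is a safeguard added for this sketch, not part of the source, since ties between α and a neighbor could otherwise loop indefinitely.

```python
def optimize(alpha, score, step=1.0, max_iter=1000):
    """Neighbor search over coefficient vectors.

    alpha: initial coefficients; score: callable mapping a coefficient
    vector to the value to maximize (an assumption of this sketch).
    """
    alpha = list(alpha)
    for _ in range(max_iter):
        # Step 1000: +step / -step variants of each element.
        candidates = []
        for i in range(len(alpha)):
            for delta in (step, -step):
                candidate = list(alpha)
                candidate[i] += delta
                candidates.append(candidate)
        best = max(candidates, key=score)
        # Step 1018: stop when alpha beats every neighbor.
        if score(alpha) > score(best):
            return alpha
        # Step 1020: move to the best neighbor and repeat.
        alpha = best
    return alpha
```

With a fixed step of 1, the search can only reach coefficient vectors that differ from the initial value by whole numbers in each element, so the choice of initial value matters.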

The linear combination coefficients obtained in this way are based on similarity rankings made by only one or two people, but they are nonetheless reliable, because there are M target speakers and generally many ways of choosing N speakers from the M.

以下、このようにして決定され保存された線形結合係数を用い、どのようにしてユーザに類似した声優音声が選択されるかについて説明する。この選択は、以下に説明する類似音声決定プログラムにより行なわれる。 Hereinafter, how the voice actor speech similar to the user is selected using the linear combination coefficient determined and stored in this manner will be described. This selection is performed by a similar voice determination program described below.

Referring to FIG. 31, the similar speech determination program includes step 900 of reading the previously computed and stored linear combination coefficients α_i (i = 1 to 8) from the storage device; step 902 of initializing the variable J, which controls the following loop, to 0; step 904 of adding 1 to J; and step 906, executed after step 904, of determining whether J is greater than the number J_MAX of voice actors in the voice actor speech DB 80 and branching the flow of control according to the result.

The program further includes step 908, executed when the result of step 906 is NO, of reading the J-th voice actor speech data from the voice actor speech DB 80, and step 910 of computing the perceptual distance L(J) between the J-th voice actor speech data and the user's speech data by the following equation:

$$L(J) = \sum_{i=1}^{8} \alpha_i L_i$$

where L_i is the distance computed for the i-th acoustic feature. Control then returns to step 904.

The program further includes step 912, executed when the result of step 906 is YES, of sorting the perceptual distances computed for all the voice actors stored in the voice actor speech DB 80 in ascending order, selecting the speech of a predetermined number S of voice actors (S = 3 in this embodiment) in order from the smallest perceptual distance as the speech most similar to the user's, and ending the process.
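Steps 900 to 912 amount to a weighted sum followed by a sort, which can be sketched as below. The per-feature distances are assumed to have been computed already (for example by DTW), and the dictionary layout and function names are hypothetical.

```python
def perceptual_distance(weights, feature_distances):
    """Weighted linear sum of per-feature distances (step 910)."""
    return sum(w * d for w, d in zip(weights, feature_distances))

def select_similar(weights, distances_by_actor, s=3):
    """Return the identifiers of the s voice actors with the smallest
    perceptual distance to the user, closest first (step 912).

    distances_by_actor maps an actor identifier to its list of
    per-feature distances from the user's speech.
    """
    scored = sorted(
        (perceptual_distance(weights, d), actor)
        for actor, d in distances_by_actor.items()
    )
    return [actor for _, actor in scored[:s]]
```

For example, with weights [1.0, 2.0] and four actors, the three with the smallest weighted sums are returned in ascending order of distance.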

Referring to FIG. 35, the voice quality conversion program includes: step 950 of reading from the storage device, for the voices of the S voice actors (S = 3 in the present embodiment) selected as having voice qualities similar to that of the user, the eight types of acoustic features described above, which were calculated and stored in advance, and the linear combination coefficients stored in advance; step 952 of assigning feature points (start time, end time, intermediate times and the like, and the first to fourth formant positions) to the time series of these acoustic features so that the phoneme timing and formant positions coincide during morphing; step 953 of estimating, from the acoustic features of the selected voice actor voices and those of the user voice, the morphing rate (mixing rate) of each voice actor voice for mixing the voice actor voices into a voice similar in quality to the user voice; step 954 of determining the post-morphing positions of the feature points based on the estimated morphing rates (mixing rates); step 956 of stretching and compressing the time and frequency axes by piecewise linear interpolation, between the feature points, of the acoustic features at points other than the feature points; and step 958 of calculating the morphed voice X_M by linearly combining the acoustic features at each time point according to the morphing rate given to each voice actor voice. The method of determining the morphing rates is described later. In the present embodiment, the feature points in step 952 are assigned by an attendant.
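
Steps 954 and 956 can be illustrated for the time axis alone. The sketch below rests on assumptions that the text does not confirm: the post-morphing position of each feature point is taken to be the morphing-rate-weighted mean of the corresponding positions in each voice, and the feature-point times are assumed to be strictly increasing. The function names are hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def morphed_feature_points(rates: Sequence[float],
                           points_per_voice: Sequence[Sequence[float]]) -> List[float]:
    """Post-morphing feature-point positions as a rate-weighted mean (assumption)."""
    n_points = len(points_per_voice[0])
    return [
        sum(r * pts[k] for r, pts in zip(rates, points_per_voice))
        for k in range(n_points)
    ]


def piecewise_linear_map(x: float,
                         src: Sequence[float],
                         dst: Sequence[float]) -> float:
    """Map x from the src axis to the dst axis, linearly between feature points."""
    if x <= src[0]:
        return dst[0]
    if x >= src[-1]:
        return dst[-1]
    k = bisect_right(src, x) - 1  # segment [src[k], src[k + 1]] that contains x
    t = (x - src[k]) / (src[k + 1] - src[k])
    return dst[k] + t * (dst[k + 1] - dst[k])
```

With `src` set to the morphed feature points and `dst` set to one voice's own feature points, `piecewise_linear_map` gives the time in that voice that corresponds to a time on the morphed axis. The same mapping could be applied along the frequency axis using the formant positions.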

In step 958, the morphed voice X_M is calculated by the following equation.

r = [r_1 r_2 ... r_s]^T is called the morphing rate vector.
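
The equation for step 958 is not reproduced in this text. The description states only that the acoustic features at each time point are linearly combined according to the morphing rates, so the following is a sketch of that reading, X_M[t] = Σ_s r_s · X_s[t]. It assumes one scalar feature track per voice, already aligned to a common time axis; `morph_features` is a hypothetical name.

```python
from typing import List, Sequence


def morph_features(rates: Sequence[float],
                   features_per_voice: Sequence[Sequence[float]]) -> List[float]:
    """Combine time-aligned feature tracks: X_M[t] = sum over s of r_s * X_s[t]."""
    if len(rates) != len(features_per_voice):
        raise ValueError("one morphing rate is needed per voice")
    n_frames = len(features_per_voice[0])
    if any(len(track) != n_frames for track in features_per_voice):
        raise ValueError("tracks must be time-aligned to the same length")
    return [
        sum(r * track[t] for r, track in zip(rates, features_per_voice))
        for t in range(n_frames)
    ]
```

In practice each frame would hold a vector per acoustic feature rather than a single number, and the same weighted sum would be applied to every element.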

Referring to FIG. 36, the program that implements step 953 of FIG. 35, which calculates the morphing rates used in steps 954 and 958 of FIG. 35, is described next. As explained below, the equation for estimating the morphing rates is difficult to solve analytically, so the optimal solution is obtained by iterative processing using Newton's method. This iterative processing is described below.

When morphing the voices of multiple speakers (in the present embodiment, voice actor voices similar to the user voice), various voice qualities can be realized by varying the morphing rates. The morphing rates are therefore estimated so that the result has the voice quality of the target speaker (the user, in the present embodiment). In the present embodiment, the morphing rates are determined so as to satisfy the following equation (1).

In this method, the morphing rate vector r̂ that satisfies equation (2) is estimated so as to minimize the following equation (3).

As described above, the speech analysis, conversion and synthesis system STRAIGHT is used for voice morphing in the present embodiment. In this voice morphing, the feature quantities and the post-morphing time-frequency plane are controlled by the same morphing rates, so the gradient of equation (3) is nonlinear and difficult to solve analytically. The morphing rates are therefore obtained iteratively by Newton's method using the following equations (5) and (6).

Here, ε′(r̂_n) denotes an approximation of equation (3) in which the morphing rates applied to the time and frequency components are treated as constants. β is a suitable constant; for example, any value in the range 0.1 to 1 may be chosen. Also,

is the feature quantity obtained by applying time-frequency warping with the morphing rate r̂_n obtained at the n-th update, and

holds.

Referring to FIG. 36, the morphing rate estimation program includes: step 970 of first storing, in the variable that holds ε(r) described above, the maximum value that the variable can hold; step 972, following step 970, of setting a suitable initial value for the morphing rate vector r; step 974 of calculating ε(r) by equation (3) using the currently set morphing rate vector r; and step 976 of determining whether the value of ε(r) calculated in step 974 has converged and branching the control flow according to the result. The criterion in step 976 is whether the absolute value of the difference between the previously calculated value of ε(r) and the value calculated in step 974 is smaller than a predetermined threshold.

The program further includes: step 978, executed when the determination in step 976 is NO, of calculating E_n by equation (6); and step 980 of calculating a new morphing rate vector r by equation (5) from E_n and the currently set morphing rate vector r, and returning control to step 974.

The program further includes step 982, executed when the determination in step 976 is YES, of outputting the morphing rate vector r at that time as the desired value and ending the process.

The algorithm shown in FIG. 36 is one example of an implementation of Newton's method, and other optimization algorithms may be used instead. Many numerical computation libraries provide optimization algorithms such as Newton's method, and using them makes this calculation easy to implement.
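
The control flow of FIG. 36 (steps 970 to 982) can be sketched as follows. Equations (3), (5) and (6) are not reproduced in this text, so the objective and the update rule are passed in as functions. The toy problem is purely an assumption for illustration: it uses a squared-error objective and a plain gradient step with step size `beta` as a stand-in for the Newton update. All names are hypothetical.

```python
import sys
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def estimate_morphing_rate(objective: Callable[[Sequence[float]], float],
                           update: Callable[[Sequence[float]], Vector],
                           r_init: Sequence[float],
                           tol: float = 1e-12,
                           max_iter: int = 1000) -> Vector:
    """Iterate until the objective value stops changing (control flow of FIG. 36)."""
    eps_prev = sys.float_info.max      # step 970: largest storable value
    r = list(r_init)                   # step 972: initial morphing rate vector
    for _ in range(max_iter):
        eps = objective(r)             # step 974: evaluate the objective
        if abs(eps_prev - eps) < tol:  # step 976: convergence test
            break
        r = update(r)                  # steps 978 and 980 (stand-in update rule)
        eps_prev = eps
    return r                           # step 982: output the estimate


def make_toy_problem(target: Sequence[float],
                     voices: Sequence[Sequence[float]],
                     beta: float = 0.25) -> Tuple[Callable, Callable]:
    """Squared-error objective and gradient step; NOT equations (3), (5), (6)."""
    def predict(r: Sequence[float]) -> Vector:
        return [sum(rs * v[k] for rs, v in zip(r, voices))
                for k in range(len(target))]

    def objective(r: Sequence[float]) -> float:
        return sum((t - p) ** 2 for t, p in zip(target, predict(r)))

    def update(r: Sequence[float]) -> Vector:
        residual = [t - p for t, p in zip(target, predict(r))]
        grad = [-2.0 * sum(e * v[k] for k, e in enumerate(residual))
                for v in voices]
        return [rs - beta * g for rs, g in zip(r, grad)]

    return objective, update
```

The `max_iter` guard is an addition for safety and does not appear in the flowchart described in the text. An off-the-shelf optimizer could replace the loop, as the text itself notes.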

[Operation of the multimedia production system 50 during similar voice selection and voice quality conversion]
During similar voice selection and voice quality conversion, the multimedia production system 50 operates as follows. The similar voice selection process is described first, followed by the voice quality conversion process. It is assumed that, as already described, the linear combination coefficients for the eight types of acoustic features used to evaluate voice similarity have been obtained beforehand and stored in the storage device.

First, among the voice actor voices for line (i) stored in the voice actor voice DB 80, the one whose voice quality is most similar to the user voice is determined as described below, read out, and output as dialogue voice (i) (steps 474, 504, 560 and 580). This process is executed by the fourth voice generation unit 306, the fifth voice generation unit 308, the seventh voice generation unit 312 and the eighth voice generation unit 314. In the following, these units are not distinguished and are referred to simply as the voice generation unit.

In the similar voice determination process, the voice generation unit operates as follows. Referring to FIG. 31, the voice generation unit reads the linear combination coefficients α_i (i = 1 to 8), obtained and stored in advance, from the storage device (step 900). Through steps 902 to 910, the perceptual distance from the user's voice is calculated for every voice actor voice stored in the voice actor voice DB 80 (step 910). Once the perceptual distance has been calculated for all of them, a predetermined number of voice actor voices, starting from the one with the smallest perceptual distance, are selected as those most similar in voice quality to the user voice (step 912).

When only one voice actor voice is to be determined as having the most similar voice quality, the voice actor voice with the smallest perceptual distance is selected in step 912. When several are to be determined, that number of voice actor voices is selected in step 912 in ascending order of perceptual distance.

Referring to FIG. 33, the multimedia production system 50 operates as follows during voice quality conversion. Here, three voice actor voices similar to the user's are assumed to be selected. Once these voice actor voices have been selected, the voice actor voice of the line to be processed is converted in voice quality using the features of the user voice (steps 506 and 582). This process is executed by the fifth voice generation unit 308 and the eighth voice generation unit 314. In the following, these units are not distinguished and are referred to simply as the voice generation unit.

In the voice quality conversion process, the voice generation unit operates as follows. The similar voice determination process shown in FIG. 31 and the morphing rate estimation process shown in FIG. 36 are performed once recording of the practice voice has finished. It is also assumed that feature points for morphing have been assigned in advance to all of the voice actor voices.

Referring to FIG. 35, the voice generation unit reads the acoustic features (eight types) of the three selected voice actor voices from the storage device (step 950). An attendant assigns feature points to the user voice (step 952). Next, the post-morphing positions of the feature points are determined from the morphing rates calculated from the acoustic features of the three voice actor voices and those of the user voice (step 954). The time and frequency axes are then stretched and compressed by piecewise linear interpolation, between the feature points, of the acoustic features at points other than the feature points (step 956). Finally, the acoustic features of the voice actor voices are linearly combined using the estimated morphing rates (step 958).

The multimedia production system 50 can therefore select the voice actor voices most similar to the user's voice in the same way a person would perceive them, and can morph the voices of those multiple speakers to generate a voice similar in quality to the user voice. As a result, in a multimedia production in which the characters' lines are known, a character's voice can be accurately replaced with a voice that people perceive as closest to the user's voice, even when only a small amount of the user's voice is available.

In the embodiment described above, the present invention is applied to a multimedia production system for producing a movie. The systems to which the present invention can be applied are not limited to this. It can be applied to any production, such as a television program or a radio drama, that proceeds according to a scenario in which the timing and length of each line are determined for each speaker.

The embodiment disclosed herein is merely an example, and the present invention is not limited to the embodiment described above. The scope of the present invention is indicated by each of the claims, read in light of the detailed description of the invention, and includes all modifications within the meaning and scope equivalent to the wording of the claims.

FIG. 1 is a functional block diagram of the multimedia production system 50 according to an embodiment of the present invention.
FIG. 2 is a functional block diagram of the character voice creation unit 102.
FIG. 3 shows the structure of the dialogue information table stored in the dialogue information storage unit 72.
FIG. 4 shows an example of the recording state of dialogue voices at the end of recording.
FIG. 5 is a flowchart of the first half of a computer program that implements, on computer hardware, the voice recording process performed by the voice recording unit 114 shown in FIG. 2.
FIG. 6 is a flowchart of the second half of the computer program that implements, on computer hardware, the voice recording process performed by the voice recording unit 114 shown in FIG. 2.
FIG. 7 shows an example of the display that the user information input unit 100 presents on the screen of the input/output device 112 when recording the user's voice.
FIG. 8 is a block diagram showing the structure of the user voice DB 120.
FIG. 9 is a functional block diagram showing the structure of the voice creation unit 118.
FIG. 10 is a flowchart of a computer program that implements, on computer hardware, the voice generation method determination process performed by the synthesis method determination unit 116 shown in FIG. 2.
FIG. 11 shows the structure of the method list table 78.
FIG. 12 is a schematic diagram showing the structure of the acoustic effect list table stored in the cut information storage unit 76.
FIG. 13 is a flowchart of a program that implements the first voice generation unit 300 shown in FIG. 9.
FIG. 14 is a flowchart showing the control structure of a program that implements the second voice generation unit 302 shown in FIG. 9.
FIG. 15 is a flowchart of a program that implements the third voice generation unit 304 shown in FIG. 9.
FIG. 16 is a flowchart showing the control structure of a program that implements the fourth voice generation unit 306 shown in FIG. 9.
FIG. 17 is a flowchart showing the control structure of a program that implements the fifth voice generation unit 308 shown in FIG. 9.
FIG. 18 is a flowchart of a program that implements the sixth voice generation unit 310 shown in FIG. 9.
FIG. 19 is a flowchart showing the control structure of a program that implements the seventh voice generation unit 312 shown in FIG. 9.
FIG. 20 is a flowchart showing the control structure of a program that implements the eighth voice generation unit 314 shown in FIG. 9.
FIG. 21 is a flowchart of a program that implements the ninth voice generation unit 316 shown in FIG. 9.
FIG. 22 is a flowchart of a program that implements the audio signal processing unit 320 shown in FIG. 9.
FIG. 23 is a block diagram of a playback system for playing back a movie created by the multimedia production system 50.
FIG. 24 schematically shows an example structure of the dialogue voice table 88.
FIG. 25 shows an example structure of the dialogue voice table 88 after simultaneous voice integration processing by the simultaneous voice integration processing unit 632 shown in FIG. 23.
FIG. 26 is a flowchart of a program that implements the simultaneous voice integration processing unit 632.
FIG. 27 is a flowchart of a program that implements the synchronized playback unit 638 shown in FIG. 23.
FIG. 28 explains the speech rate conversion and volume normalization processes among the acoustic effect processes executed by the audio signal processing unit 320.
FIG. 29 is an external view of the hardware of the computer 840 that implements the dialogue voice data creation unit 90 for recording the user's voice in the multimedia production system 50.
FIG. 30 is a block diagram showing the internal structure of the computer 840.
FIG. 31 is a flowchart showing the control structure of the similar voice determination program executed by the voice generation unit.
FIG. 32 is a flowchart showing the control structure of a program that implements the similar-voice ranking data creation process used in the linear combination coefficient calculation process performed prior to the similar voice determination process.
FIG. 33 is a flowchart showing the control structure of the linear combination coefficient calculation program, which calculates the linear combination coefficients of the acoustic features used to obtain results that closely resemble human perception.
FIG. 34 is a flowchart of a program for optimizing the linear combination coefficients by Newton's method, executed in step 944 of FIG. 33.
FIG. 35 is a flowchart showing the control structure of the voice quality conversion program executed by the voice generation unit.
FIG. 36 is a flowchart showing the control structure of the morphing rate estimation program executed by the voice generation unit.
FIG. 37 conceptually shows a speaker space for explaining the estimation of the morphing rates.

Explanation of symbols

50 Multimedia production system
60 Three-dimensional scanner group
62 Image processing PC
64 Video generation device
66 Video data
70 Video material DB
72 Dialogue information storage unit
74 Standard voice storage unit
76 Cut information storage unit
78 Method list table
80 Voice actor voice DB
82 Segment DB
86 Dialogue voice data
88 Dialogue voice table
90 Dialogue voice data creation unit
92 Video and audio playback device
94 Linear combination coefficient storage unit
100 to 100N User information input units
102 to 102N Character voice creation units
104 Voice integration unit
112 Input/output device
114 Voice recording unit
116 Synthesis method determination unit
118 Voice creation unit
120 User voice DB
122 Voice DB update unit
124 Segment DB update unit
130 Similar voice actor storage unit
132 Morphing rate storage unit
280 Branching unit
292 Merging unit
300 to 316 First to ninth voice generation units
320 Audio signal processing unit
620 Video signal playback unit
622 Display device
624 Sound effect output device
632 Simultaneous voice integration processing unit
638 Synchronized playback unit
640 Dialogue voice output device

Claims

A similar voice selection device for selecting a voice similar to a target voice from a plurality of sample voices, comprising:
means for storing the plurality of sample voices;
distance calculation means for calculating a distance between the target voice and each of the plurality of sample voices as a weighted linear sum of distance measures between two voices, each distance measure being calculated for one of one or more types of acoustic features of the voices; and
voice selection means for selecting, from among the plurality of sample voices, the voice for which the distance calculated by the distance calculation means is smallest as the voice similar to the target voice.

The similar voice selection device according to claim 1, wherein the distance measure is a distance measure calculated by dynamic time warping between the same acoustic features of two voices.

The similar voice selection device according to claim 1, wherein the distance measure is a distance measure calculated using a Gaussian mixture model between the same acoustic features of two voices.

The similar voice selection device according to any one of claims 1 to 3, wherein the combination coefficients of the weighted linear sum of the distance measures are determined in advance so as to maximize the correlation between a ranking, made by humans, of the similarity among a plurality of voices and the result produced by the distance calculation means.

A computer program which, when executed by a computer, causes the computer to function as:
means for storing a plurality of sample voices;
distance calculation means for calculating a distance between a target voice and each of the plurality of sample voices as a weighted linear sum of distance measures, each distance measure being calculated for one of one or more types of acoustic features of the voices; and
voice selection means for selecting, from among the plurality of sample voices, the voice for which the distance calculated by the distance calculation means is smallest as the voice similar to the target voice.

A voice generation device for generating a voice of predetermined content in a voice similar to a target voice, comprising:
means for storing a plurality of sample voices of the predetermined content;
distance calculation means for calculating a distance between the target voice and each of the plurality of sample voices as a weighted linear sum of distance measures between two voices, each distance measure being calculated for one of one or more types of acoustic features of the voices;
voice selection means for selecting, from among the plurality of sample voices, the voice for which the distance calculated by the distance calculation means is smallest as the voice similar to the target voice; and
voice generation means for generating the voice of the predetermined content using the voice selected by the voice selection means.

The voice generation device according to claim 6, wherein the voice selection means includes means for selecting, from among the plurality of sample voices, a plurality of voices in ascending order of the distance calculated by the distance calculation means as voices similar to the target voice, and
the voice generation means includes voice morphing means for generating a new voice by morphing the plurality of voices selected by the voice selection means.

The voice generation device according to claim 7, further comprising optimization means for optimizing the morphing rates of the plurality of speakers used in morphing by the voice morphing means, so as to minimize the distance between a predetermined feature vector of the morphed voice, obtained by morphing the voices of the plurality of speakers selected by the voice selection means with the voice morphing means, and the feature vector of the target voice corresponding to the predetermined feature.

A computer program which, when executed by a computer, causes the computer to function as:
means for storing a plurality of sample voices of predetermined content;
distance calculation means for calculating a distance between a target voice and each of the plurality of sample voices as a weighted linear sum of distance measures between two voices, each distance measure being calculated for one of one or more types of acoustic features of the voices;
voice selection means for selecting, from among the plurality of sample voices, the voice for which the distance calculated by the distance calculation means is smallest as the voice similar to the target voice; and
voice generation means for generating a voice of the predetermined content using the voice selected by the voice selection means.