JP2518683B2 - Image combining method and apparatus thereof

Info

Publication number: JP2518683B2
Application number: JP1053899A
Authority: JP
Inventors: Masahide Kaneko; Atsushi Koike; Yoshinori Hatori; Seiichi Yamamoto; Norio Higuchi
Original assignee: Kokusai Denshin Denwa KK
Current assignee: KDDI Corp
Priority date: 1989-03-08
Filing date: 1989-03-08
Publication date: 1996-07-24
Anticipated expiration: 2011-07-24
Also published as: GB2231246A; GB9005142D0; JPH02234285A; GB2231246B

Description

[Detailed Description of the Invention] (Technical Field of the Invention) The present invention relates to an image synthesis method using digital processing, and in particular to a method of synthesizing a face image (still or moving) that expresses the mouth shape changes accompanying speech.

(Prior Art) When a person speaks, speech information is generated by the articulatory organs and, at the same time, the mouth moves (changes shape) as a visible accompaniment to the utterance. Converting a sentence entered as a character string into speech information and outputting it, rather than having a person speak directly, is called speech synthesis, and many results have been obtained in that field. In contrast, there is little prior art on generating a face image whose mouth shape changes in correspondence with an input sentence; the only known work is the following report by Kiyotoshi Matsuoka and Kenji Kurosu.

The method of Matsuoka and Kurosu is described in [Kiyotoshi Matsuoka, Kenji Kurosu: "A Moving-Image Program for Speech-Reading Training of Hearing-Impaired Persons", IEICE Transactions, vol. J70-D, no. 11, pp. 2167-2171 (November 1987)]. It is implemented as a program, but the basic idea of how it obtains the mouth shape changes corresponding to an input sentence can be summarized as shown in Fig. 6.

In Fig. 6, 50 is a syllable separation unit, 51 is a unit that associates syllables with mouth shape patterns, 52 is a syllable-to-mouth-shape-pattern correspondence table, 53 is a mouth shape selection unit, and 54 is a mouth shape memory. The operation of each unit is briefly as follows. The syllable separation unit 50 divides an input sentence (character string) into syllables. For example, the input "kuma" is divided into the two syllables "ku" and "ma". The correspondence table 52 stores the correspondence, prepared in advance, between syllables and mouth shape patterns. A syllable represents a unit of sound such as "a" or "ka". Mouth shape patterns indicate the type of mouth shape and comprise large mouth shapes (<A>, <I>, <U>, <E>, <K>, etc.) and small mouth shapes (<u>, <o>, <k>, <s>, etc.). Using these, the table records, for example, <A><*><A> for "a" and <K><*><A> for "ka", where <*> denotes an intermediate mouth shape. For each syllable sent from the syllable separation unit 50, the association unit 51 refers to the correspondence table 52 and reads out the corresponding mouth shape pattern. The mouth shape memory 54 stores a concrete mouth shape, as a figure or as shape parameters, for each of the mouth shape patterns. The mouth shape selection unit 53 takes the sequence of mouth shape patterns sent from the association unit 51, refers to the mouth shape memory 54 for each in turn, selects the concrete mouth shape, and outputs it as an image. Where necessary, intermediate shapes (shapes between the preceding and following mouth shapes) are also generated. For output as a moving image, a fixed four frames of mouth shape are generated for each syllable.
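The prior-art lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the program of Matsuoka and Kurosu: the function names, the vowel-based syllable splitter for romanized input, and the table layout are hypothetical, and only the two table entries quoted in the text are filled in.

```python
# Hypothetical sketch of the prior-art scheme: split text into syllables,
# look up a mouth shape pattern per syllable, assign a fixed 4 frames each.

VOWELS = "aiueo"
FRAMES_PER_SYLLABLE = 4  # fixed allocation stated in the text

# Only the two entries quoted in the text; a real table would cover all syllables.
PATTERN_TABLE = {
    "a": ["<A>", "<*>", "<A>"],
    "ka": ["<K>", "<*>", "<A>"],
}


def split_syllables(text):
    """Split romanized text into syllables, closing each one at a vowel (assumed rule)."""
    syllables, current = [], ""
    for ch in text:
        current += ch
        if ch in VOWELS:
            syllables.append(current)
            current = ""
    if current:  # trailing consonants with no vowel
        syllables.append(current)
    return syllables


def mouth_patterns(text):
    """Return (syllable, pattern, frame count) for each syllable of the input."""
    return [(s, PATTERN_TABLE[s], FRAMES_PER_SYLLABLE) for s in split_syllables(text)]
```

Because every syllable receives the same frame count regardless of how long it is actually spoken, the sketch also makes the timing limitation of this scheme visible.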

In addition, as related prior art, a method has been reported that estimates the corresponding mouth shape changes from speech input rather than text input. It is described in [Shigeo Morishima, Kiyoharu Aizawa, Hiroshi Harashima: "A Study on Automatic Synthesis of Facial Expressions Based on Speech Information", Proceedings of the 4th NICOGRAPH Paper Contest, pp. 139-146, Japan Computer Graphics Association (November 1988)]. Two methods are proposed there: one computes the logarithmic mean power of the input speech to control the degree of mouth opening, and the other computes linear prediction coefficients, which correspond to the formant characteristics of the vocal tract, to estimate the mouth shape.

(Problems to Be Solved by the Invention) The method of Matsuoka and Kurosu was presented above as prior art for taking a sentence (character string) as input and generating a face image with corresponding mouth shape changes, but it has the following problems. First, although speech output and mouth shape are closely related in utterance, the method basically divides the sentence into syllables and selects mouth shape patterns from correspondences at the character level, so the link between the speech production mechanism and mouth shape generation is insufficient. It is therefore difficult to generate mouth shapes that correspond accurately to the speech output. Second, the duration of a phoneme (the smallest unit of utterance; a syllable is a combination of phonemes) varies with, among other things, its connection to the preceding and following phonemes, yet the method assigns a fixed four frames to every syllable, which makes it difficult to express natural mouth shape changes that suit the input sentence. Moreover, when speech and mouth shape images are to be output with the same timing for an input sentence, it is difficult to keep the two matched.

Furthermore, the method of Morishima, Aizawa, and Harashima estimates the mouth shape from input speech information, so it cannot be applied to the purpose of taking a sentence as input and generating a moving image with corresponding mouth shape changes.

(Object of the Invention) The present invention has been made to solve the above problems of the prior art. Its object is to provide an image synthesis method and apparatus that can express mouth shape changes that correspond accurately to the speech output and that match the duration of each phoneme.

(Constitution of the Invention) A first feature of the present invention is an image synthesis method that takes as input a sentence expressed as a character string and generates a moving face image with corresponding mouth shape changes, wherein: a rule-based speech synthesis technique is used that can divide the character string into a phoneme sequence and output speech features and a duration for each phoneme; a mouth shape feature corresponding to each phoneme is determined from the speech features; values of mouth shape parameters that express a concrete mouth shape are determined according to the mouth shape feature; and the mouth shape parameter values supplied for each frame of the moving image are controlled, for each phoneme, according to that phoneme's duration, so that a moving face image with mouth shape changes matched to the speech output is synthesized.

A second feature of the present invention is an apparatus comprising: an input terminal for inputting a sentence expressed as a character string; a rule-based speech synthesis unit that can divide the character string input from the input terminal into a phoneme sequence and output speech features and a duration for each phoneme; a conversion unit that converts the speech features of each phoneme into a mouth shape feature; a conversion table that associates various mouth shape features with mouth shape parameters expressing concrete mouth shapes; a mouth shape parameter acquisition unit that retrieves from the conversion table the mouth shape parameters corresponding to the mouth shape feature obtained by the conversion unit for each phoneme; a time adjustment unit that, in order to generate a moving image given as an image sequence at fixed time intervals, controls the output of the mouth shape parameter values obtained from the mouth shape parameter acquisition unit according to the duration of each phoneme supplied by the rule-based speech synthesis unit; and an image generation unit that generates images according to the mouth shape parameter values output from the mouth shape parameter acquisition unit under the control of the time adjustment unit.

A third feature of the present invention is an apparatus comprising: an input terminal for inputting a sentence expressed as a character string; a rule-based speech synthesis unit that can divide the character string input from the input terminal into a phoneme sequence and output speech features and a duration for each phoneme; a conversion unit that converts the speech features of each phoneme into a mouth shape feature; a conversion table that associates various mouth shape features with mouth shape parameters expressing concrete mouth shapes; a mouth shape parameter acquisition unit that retrieves from the conversion table the mouth shape parameters corresponding to the mouth shape feature obtained by the conversion unit for each phoneme; a time adjustment unit that, in order to generate a moving image given as an image sequence at fixed time intervals, controls the output of the mouth shape parameter values obtained from the mouth shape parameter acquisition unit according to the duration of each phoneme supplied by the rule-based speech synthesis unit; and an image generation unit that generates images according to the mouth shape parameter values output from the mouth shape parameter acquisition unit under the control of the time adjustment unit; and, in addition to these, further comprising: a transition detection unit that detects the transition from one phoneme to the next according to the output of the time adjustment unit; a memory that can hold the mouth shape parameter values used by the image generation unit for at least one frame time; and a mouth shape parameter correction unit that obtains an intermediate value between the mouth shape parameter values held in the memory and those supplied by the mouth shape parameter acquisition unit; whereby an intermediate mouth shape is generated at the transition from one phoneme to the next and a moving face image with smooth mouth shape changes is produced.

(Embodiment 1) Fig. 1 is a block diagram for explaining a first embodiment of the present invention. The input is assumed to be a character string (sentence) obtained from a keyboard or from a file device such as a magnetic disk. In Fig. 1, 1 is a rule-based speech synthesis unit, 2 is a time adjustment unit, 3 is a unit that converts speech features into mouth shape features, 4 is a conversion table from mouth shape features to mouth shape parameters, 5 is a mouth shape parameter acquisition unit, 6 is an image generation unit, 10 is a gate, 900 is a terminal for character string input, and 901 is a terminal for image output.

Next, the operation of each unit is described. The rule-based speech synthesis unit 1 synthesizes the speech output corresponding to the input character string. Various speech synthesis schemes have been proposed; here, because of its good compatibility with mouth shape generation, an existing rule-based speech synthesis technique that uses a Klatt-type formant speech synthesizer as the vocal tract model is assumed. This technique is described in detail in [Seiichi Yamamoto, Norio Higuchi, Tohru Shimizu: "Prototype of a Rule-Based Speech Synthesizer with Text Editing Function", IEICE Technical Report SP87-137 (March 1988)]. The rule-based speech synthesis unit itself is existing technology and is not the subject of the present invention, so a detailed description is omitted. However, to obtain an accurate correspondence between speech production and mouth shape, information on the phonological features and the duration of each phoneme must be output. The technique of Yamamoto, Higuchi, and Shimizu outputs phonological features such as manner of articulation, place of articulation, the voiced/unvoiced distinction, and pitch control information, together with duration information based on them, and so satisfies this requirement. Any other rule-based speech synthesis unit may be used as long as this information can be obtained from it.

Next, the time adjustment unit 2 controls the delivery of mouth shape parameters to the image generation unit 6 on the basis of the duration of each phoneme obtained from the rule-based speech synthesis unit 1 (the duration of the i-th phoneme is denoted t_i). To output images (in particular moving images) as a television signal, the NTSC system, for example, uses 30 frames per second (1/30 second per frame), so the images must be generated from information converted to 1/30-second steps. The detailed operation of the time adjustment unit 2 is described later.

Next, the conversion unit 3 converts the phonological features obtained from the rule-based speech synthesis unit 1 into the mouth shape feature corresponding to the phoneme concerned. The mouth shape features considered are, for example, (1) degree of mouth opening (from wide open to completely closed), (2) degree of lip rounding (from rounded to spread sideways), (3) height of the lower jaw (from raised to lowered), and (4) visibility of the tongue. The correspondence between phonological features and mouth shape features is formulated as rules based on observation of how people actually utter each phoneme.

For example, when the sentence "konnichiwa" is input, it is converted into mouth shape features in the following form (the listing is not reproduced in this text). Here, 1v, 1h, and jaw denote the degree of mouth opening, the degree of lip rounding, and the height of the lower jaw, respectively, and the numbers indicate the degree. An x indicates that the degree is determined by the preceding and following phonemes. tbck indicates the visibility of the tongue (in this case, that the back of the tongue is slightly visible).

The conversion table 4 from mouth shape features to mouth shape parameters gives, for each of the mouth shape features obtained by the conversion unit 3, the parameter values that express a concrete mouth shape. Fig. 2 shows an example of parameters for expressing the mouth shape. Fig. 2(a) is a front view of the mouth: the positions of the eight points P₁ to P₈ give the mouth shape, the positions of the points Q₁ and Q₂ give the visibility of the upper and lower teeth, and the values h₁ and h₂ give the thickness of the upper and lower lips. Fig. 2(b) is a side view of the mouth: the angles θ₁ and θ₂ give the degree to which the upper and lower lips are turned outward. For each of the mouth shape features, the conversion table 4 holds, in table form, a set of values of the parameters P₁ to P₈, Q₁, Q₂, h₁, h₂, θ₁, θ₂ determined in advance with reference to measurements of mouth shapes made while people actually speak.

The mouth shape parameter acquisition unit 5 takes the mouth shape feature for the phoneme concerned from the conversion unit 3, refers to the conversion table 4, and obtains the set of mouth shape parameter values for that phoneme.
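One way to picture the conversion table 4 and the lookup performed by the acquisition unit 5 is as a dictionary keyed by mouth shape feature. The sketch below is an assumption-laden illustration: the class names, field names, and all numeric values are hypothetical and do not come from the patent, which only states which quantities (P₁ to P₈, Q₁, Q₂, h₁, h₂, θ₁, θ₂) are stored.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class MouthFeature:
    """Mouth shape feature of one phoneme (field names are hypothetical)."""
    opening: int      # degree of mouth opening
    rounding: int     # degree of lip rounding
    jaw: int          # height of the lower jaw
    tongue: str = ""  # tongue visibility label, empty if not visible


@dataclass(frozen=True)
class MouthShapeParams:
    """Concrete mouth shape parameters as listed for Fig. 2."""
    p: Tuple[Point, ...]         # P1..P8: outline points of the mouth
    q: Tuple[Point, Point]       # Q1, Q2: visibility of upper and lower teeth
    h: Tuple[float, float]       # h1, h2: upper and lower lip thickness
    theta: Tuple[float, float]   # theta1, theta2: lip turn-out angles


# Placeholder values only; real entries would come from measurements of speakers.
CLOSED = MouthShapeParams(
    p=((-2.0, 0.0), (-1.0, 0.3), (0.0, 0.4), (1.0, 0.3),
       (2.0, 0.0), (1.0, -0.3), (0.0, -0.4), (-1.0, -0.3)),
    q=((0.0, 0.0), (0.0, 0.0)),
    h=(0.4, 0.5),
    theta=(0.0, 0.0),
)

CONVERSION_TABLE: Dict[MouthFeature, MouthShapeParams] = {
    MouthFeature(opening=0, rounding=0, jaw=0): CLOSED,
}


def acquire_params(feature: MouthFeature) -> MouthShapeParams:
    """Look up the parameter set for a mouth shape feature (role of unit 5)."""
    return CONVERSION_TABLE[feature]
```

Frozen dataclasses are used here so that a feature can serve directly as a dictionary key.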

The gate 10 controls whether the mouth shape parameters for the phoneme are sent to the image generation unit 6. It sends them the number of times specified by the time adjustment unit 2; this count multiplied by 1/30 second is the display time of the mouth shape for the phoneme.

The image generation unit 6 generates a mouth shape image from the mouth shape parameters sent every 1/30 second from the mouth shape parameter acquisition unit 5 through the gate 10. Where necessary, it generates an image that includes the whole face. Details of generating a mouth shape image or a face image from given mouth shape parameters are described in, for example, [Masahide Kaneko, Yoshinori Hatori, Atsushi Koike: "Detection of Shape Changes and Coding of Moving Face Images Based on a Three-Dimensional Shape Model", IEICE Transactions B, vol. J71-B, no. 12, pp. 1554-1563 (December 1988)]. In outline, a three-dimensional wireframe model representing the three-dimensional shape of a person's head is prepared in advance. The shape of the mouth region of the model (specifically the lips, teeth, jaw, and so on) is deformed according to the given mouth shape parameters. By attaching to the deformed model, pixel by pixel, information representing the shading and color of each part, a realistic mouth shape image or face image is obtained.

The operation of the time adjustment unit 2 is now described in detail. Fig. 3 is a block diagram for explaining it. In Fig. 3, 21 is a delay unit, 22 is a magnitude determination unit, 23 and 24 are memories, 25 and 26 are adders, 27 is a switch, 28 and 29 are branch points, 30 is a time normalization unit, 201 and 202 are output lines of the magnitude determination unit 22, 902 is a terminal for the initial reset, 903 is a terminal for inputting the constant 1/30, and 920 and 921 are terminals of the switch 27. The operation of each part is as follows. The memory 23 stores the total duration up to the I-th phoneme, $\sum_{i=1}^{I} t_i$. Before image synthesis starts, it is set to zero by the initial reset signal applied at terminal 902. When the duration $t_I$ of the I-th phoneme is supplied from the rule-based speech synthesis unit 1, the adder 25 adds it to the total duration up to the (I-1)-th phoneme held in the memory 23, $\sum_{i=1}^{I-1} t_i$, to obtain $\sum_{i=1}^{I} t_i$. The delay unit 21 holds the total duration up to the (I-1)-th phoneme, $\sum_{i=1}^{I-1} t_i$, until processing of the (I+1)-th phoneme begins. For the output of the delay unit 21, the time normalization unit 30 finds the N that satisfies $\frac{1}{30} N \le \sum_{i=1}^{I-1} t_i < \frac{1}{30}(N+1)$ and outputs the value 1/30 × N. Here N is an integer and 1/30 is the constant giving the frame time of 1/30 second. When processing of the I-th phoneme begins, the switch 27 is connected to terminal 920 by the output line 202 from the magnitude determination unit 22. At this point the adder 26 computes t, the sum of the output 1/30 × N of the time normalization unit 30 and the constant 1/30. The magnitude determination unit 22 compares t with $\sum_{i=1}^{I} t_i$ and outputs a signal on output line 201 if $t \le \sum_{i=1}^{I} t_i$ and on output line 202 if $t > \sum_{i=1}^{I} t_i$.

The case $t > \sum_{i=1}^{I} t_i$ means that the duration of the I-th phoneme has ended, and the following instructions are issued via output line 202: to the rule-based speech synthesis unit 1, to output the information for the (I+1)-th phoneme; to the memory 24, to reset its contents; to the switch 27, to connect to terminal 920; and to the delay unit 21, to output the delayed value $\sum_{i=1}^{I} t_i$. The memory 24 temporarily stores the output of the adder 26. The switch 27 is connected to terminal 921 while $t \le \sum_{i=1}^{I} t_i$ holds, and the adder 26 successively adds 1/30 to the current t to form the new t. As a result, while $t \le \sum_{i=1}^{I} t_i$ holds, the magnitude determination unit 22 outputs a signal on output line 201; this signal controls the gate 10 in Fig. 1, so that the mouth shape parameters corresponding to the I-th phoneme are supplied to the image generation unit 6 for the duration of the I-th phoneme.
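The timing logic of Fig. 3 can be sketched as a short loop that counts how many 1/30-second frames fall within each phoneme. This is an illustrative reading, not the circuit itself: the function name is hypothetical, and because the inequalities are not legible in this text, the sketch assumes that N is the largest integer with N/30 not exceeding the start time of the phoneme and that a frame is emitted while t does not exceed the cumulative duration.

```python
from fractions import Fraction
from math import floor

FRAME = Fraction(1, 30)  # one frame time in seconds (constant at terminal 903)


def frames_per_phoneme(durations):
    """Return the number of frames emitted for each phoneme.

    durations: phoneme durations in seconds, as Fractions for exact arithmetic.
    """
    counts = []
    total = Fraction(0)            # memory 23, reset to zero before synthesis
    for d in durations:
        start = total              # delay unit 21: total up to the previous phoneme
        total += d                 # adder 25: total up to the current phoneme
        n = floor(start / FRAME)   # time normalization unit 30 (assumed floor)
        t = n * FRAME + FRAME      # adder 26 with switch 27 at terminal 920
        count = 0
        while t <= total:          # assumed condition for a signal on line 201
            count += 1             # gate 10 passes the parameters for one frame
            t += FRAME             # adder 26 with switch 27 at terminal 921
        counts.append(count)       # t > total: line 202, move to the next phoneme
    return counts
```

For durations of 1/10 s, 1/20 s, and 1/6 s the sketch yields 3, 1, and 5 frames. Because each phoneme starts counting from the frame grid rather than from its own start time, the frame total tracks the cumulative duration and rounding error does not accumulate across phonemes.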

The above is the description of the first embodiment of the present invention. In the first embodiment, when moving from the I-th phoneme to the (I+1)-th phoneme, the mouth shape parameters change discontinuously from those for the I-th phoneme to those for the (I+1)-th phoneme. Unless the two sets of parameters differ greatly, the synthesized moving image does not look very unnatural. However, when a person speaks, the mouth shape changes continuously, so it is desirable that the mouth shape also change continuously at the move from the I-th to the (I+1)-th phoneme.

(Embodiment 2) Fig. 4 is a block diagram for explaining a second embodiment of the present invention, which satisfies this requirement. In Fig. 4, 7 is a mouth shape parameter correction unit, 8 is a transition detection unit, 9 is a memory, 40 is a switch, and 910 and 911 are terminals of the switch 40; the rest is the same as in Fig. 1. The operation of the newly added parts is described next.

The transition detection unit 8 detects the transition from one phoneme (for example, the I-th) to the next (the (I+1)-th). Fig. 5 is a block diagram for explaining the operation of the transition detection unit 8, in which 81 is a counter, 82 is a decision circuit, and 210 and 211 are output lines. The counter 81 is reset to 0 when a signal is output on output line 202 from the magnitude determination unit 22, and counts up by one each time the magnitude determination unit 22 outputs a signal on output line 201. The decision circuit 82 checks whether the output of the counter 81 is 1. If it is 1, a transition from one phoneme to the next has just occurred, so a signal is output on output line 210. If it is 2 or more, the current phoneme is continuing, so a signal is output on output line 211.
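The counter and decision circuit of Fig. 5 amount to a very small state machine, sketched below. The class and method names are hypothetical; the line numbers 201, 202, 210, and 211 are those of the figures, and returning the output line number as an integer is simply a convenience of this sketch.

```python
class TransitionDetector:
    """Sketch of transition detection unit 8 (counter 81 plus decision circuit 82)."""

    def __init__(self):
        self.count = 0  # counter 81

    def on_line_202(self):
        """A signal on line 202 (phoneme ended) resets the counter to 0."""
        self.count = 0

    def on_line_201(self):
        """A signal on line 201 (one frame) counts up and selects the output line.

        Returns 210 on the first frame after a reset (transition just occurred),
        and 211 on later frames (current phoneme continuing).
        """
        self.count += 1
        return 210 if self.count == 1 else 211
```

Feeding it three frame signals, a reset, and one more frame signal gives 210, 211, 211, and then 210 again for the first frame of the next phoneme.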

The memory 9 stores, for at least one frame period, the mouth shape parameters used to synthesize the image of the previous frame. The mouth shape parameter correction unit 7 takes the mouth shape parameters of the previous frame held in the memory 9 and the mouth shape parameters for the current phoneme supplied by the mouth shape parameter acquisition unit 5, and from them derives, for example, their intermediate value, which becomes the mouth shape parameters for synthesizing the image of the current frame. The switch 40 is connected to terminal 910 or 911 depending on whether the transition detection unit 8 outputs a signal on output line 210 or 211. When connected to terminal 910, it passes to the image generation unit 6 the intermediate value of the mouth shape parameters of the two phonemes obtained from the mouth shape parameter correction unit 7; when connected to terminal 911, it passes the mouth shape parameters for the current phoneme. In this example the intermediate value between the mouth shape parameters of one phoneme and those of the next is generated for only one frame, but a smoother mouth shape change can also be achieved by generating intermediate values in several steps according to, for example, the value of the counter 81.
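The single-frame smoothing of the second embodiment can be sketched as follows. The sketch makes several assumptions that are not in the patent: the function name is hypothetical, each parameter set is flattened to a list of numbers, the intermediate value is taken as the arithmetic mean, the number of frames per phoneme is supplied by the caller, and the very first frame is not blended because the memory holds no previous frame yet.

```python
def synthesize_parameter_frames(phoneme_params, frame_counts):
    """Return one parameter list per frame, blending the first frame of each phoneme.

    phoneme_params: one list of parameter values per phoneme.
    frame_counts:   number of frames to emit for each phoneme.
    """
    frames = []
    previous = None  # memory 9: parameters used for the previous frame
    for params, n_frames in zip(phoneme_params, frame_counts):
        for k in range(n_frames):
            if k == 0 and previous is not None:
                # Transition frame (switch 40 at terminal 910): correction unit 7
                # takes the midpoint of the previous frame and the new phoneme.
                current = [(a + b) / 2 for a, b in zip(previous, params)]
            else:
                # Phoneme continuing (switch 40 at terminal 911): pass through.
                current = list(params)
            frames.append(current)
            previous = current
    return frames
```

With parameters [0.0, 0.0] held for two frames followed by [1.0, 2.0] for three frames, the third frame comes out as [0.5, 1.0], a mouth shape halfway between the two phonemes. Generating intermediate values over several frames, as the text suggests, would replace the fixed midpoint with a weight that depends on the frame index within the phoneme.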

As described above, the present invention relates to a method for synthesizing a face moving image whose mouth shape changes correspond to an input sentence expressed as a character string. Speech input can also be handled, however. If a speech recognition technique is available that can divide the input speech information into a phoneme string and output a speech feature and a duration for each phoneme, then replacing the speech synthesis unit 1 of the present invention with a speech recognition unit operating in this way makes it possible to synthesize a face moving image whose mouth shape changes correspond to the input speech information.
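
The substitution described above works because the later stages only need a phoneme string with a speech feature and a duration per phoneme. The following Python sketch illustrates that idea under stated assumptions: all names are hypothetical, the phoneme label stands in for the speech feature, and both front ends are stubs rather than a real synthesizer or recognizer.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class PhonemeSegment:
    phoneme: str        # stands in for the per-phoneme speech feature
    duration_ms: float  # duration of the phoneme


class PhonemeSource(Protocol):
    """Anything that yields per-phoneme features and durations."""

    def segments(self, data) -> List[PhonemeSegment]: ...


class StubTextFrontEnd:
    """Stub for a text-driven front end: one fixed-length segment
    per non-space character (an illustrative simplification)."""

    def segments(self, data: str) -> List[PhonemeSegment]:
        return [PhonemeSegment(ch, 100.0) for ch in data if not ch.isspace()]


class StubSpeechFrontEnd:
    """Stub for a speech-driven front end: returns canned segments
    in place of real recognition results."""

    def __init__(self, canned: List[PhonemeSegment]) -> None:
        self._canned = list(canned)

    def segments(self, data: bytes) -> List[PhonemeSegment]:
        return list(self._canned)


def total_duration_ms(source: PhonemeSource, data) -> float:
    """Downstream code depends only on the segment list, not on
    whether it came from text or from speech."""
    return sum(seg.duration_ms for seg in source.segments(data))
```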

(Effects of the Invention) As described above, the present invention takes as input a sentence expressed as a character string and makes it possible to synthesize a moving image whose mouth shape changes are accurately associated with the speech output and follow the duration of each phoneme, and which therefore shows natural mouth shape changes matched to the speech output.

Whereas text input has until now been used only to synthesize speech, the present invention makes it easy to additionally output a moving image with natural mouth shape changes matched to that speech. The present invention is therefore applicable to generating realistic moving images without live-action shooting (for example, in the production of broadcast programs and movies), to automatic response devices using speech and images, to use as a means of man-machine interface, and to media conversion from text to speech and moving images, and its effect is extremely large.

[Brief description of drawings]

FIG. 1 is a block diagram corresponding to the first embodiment of the present invention. FIG. 2 is a diagram showing an example of parameters for expressing a mouth shape. FIG. 3 is a block diagram corresponding to an example of the operation of the time adjustment unit 2 in the present invention. FIG. 4 is a block diagram corresponding to the second embodiment of the present invention. FIG. 5 is a block diagram corresponding to an example of the operation of the transition detection unit 8 in the second embodiment of the present invention. FIG. 6 is a block diagram corresponding to the operation of a conventional image synthesizing method.

Continuation of the front page
(72) Inventor: 山本誠一 (Seiichi Yamamoto), c/o Kokusai Denshin Denwa Co., Ltd., 2-3-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(72) Inventor: 樋口宜男 (Yoshio Higuchi), c/o Kokusai Denshin Denwa Co., Ltd., 2-3-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo
(56) References:
JP-A-63-225875 (JP, A)
J. D. Bagley et al., "Method for Computer Animation of Lip Movements", IBM Technical Disclosure Bulletin, Vol. 14, No. 10, pp. 3039-3040
Kiyotoshi Matsuoka et al., "Video program for speechreading training for the hearing impaired", IEICE Transactions D, Vol. J70-D, No. 11, pp. 2167-2171

Claims

(57) [Claims]

1. An image synthesizing method for receiving as input a sentence expressed as a character string and generating a face moving image having a mouth shape change corresponding to the sentence, characterized by: using a speech rule synthesis technique capable of dividing the character string into a phoneme string and outputting a speech feature and a duration for each phoneme; determining, based on the speech feature, a mouth shape feature corresponding to each phoneme; determining, according to the mouth shape feature, the value of a mouth shape parameter for expressing a concrete mouth shape; and controlling, with respect to the value of the mouth shape parameter for each phoneme, the value of the mouth shape parameter given to each frame of the moving image based on the duration of each phoneme, thereby synthesizing a face moving image having a mouth shape change suited to the speech output.

2. An image synthesizing apparatus characterized by comprising: an input terminal for inputting a sentence expressed as a character string; a speech rule synthesis unit capable of dividing the character string input from the input terminal into a phoneme string and outputting a speech feature and a duration for each phoneme; a conversion unit for converting the speech feature of each phoneme into a mouth shape feature; a conversion table in which various mouth shape features are associated with mouth shape parameters expressing concrete mouth shapes; a mouth shape parameter acquisition unit for extracting from the conversion table the mouth shape parameter corresponding to the mouth shape feature obtained for each phoneme by the conversion unit; a time adjustment unit for controlling, in order to generate a moving image given as an image sequence at a fixed time interval, the output of the mouth shape parameter value obtained from the mouth shape parameter acquisition unit according to the duration of each phoneme given by the speech rule synthesis unit; and an image generation unit for generating an image according to the mouth shape parameter value output from the mouth shape parameter acquisition unit under the control of the time adjustment unit.

3. The image synthesizing apparatus according to claim 2, characterized by further comprising: a transition detection unit for detecting a transition from one phoneme to the next phoneme according to the output of the time adjustment unit; a memory capable of holding, for at least one frame time, the value of the mouth shape parameter used in the image generation unit; and a mouth shape parameter correction unit for obtaining an intermediate value between the value of the mouth shape parameter held in the memory and the value of the mouth shape parameter given by the mouth shape parameter acquisition unit, wherein an intermediate mouth shape is generated at the time of transition from one phoneme to the next phoneme so as to generate a face moving image having a smooth mouth shape change.
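
As an informal illustration of the time adjustment recited in the apparatus claims, the following Python sketch expands per-phoneme durations into a per-frame sequence of mouth shape parameters. It is a sketch under stated assumptions, not the circuit of FIG. 3: the 40 ms frame interval, the feature names, and the table values are hypothetical, and a frame is simply assigned to whichever phoneme is active at its start time.

```python
# Hypothetical conversion table: mouth shape feature -> (opening, width).
CONVERSION_TABLE = {
    "a": (0.8, 0.5),
    "o": (0.5, 0.3),
    "closed": (0.0, 0.4),
}


def frames_for(segments, table=CONVERSION_TABLE, frame_ms=40):
    """Expand (mouth_shape_feature, duration_ms) pairs into one
    parameter tuple per frame at a fixed frame interval."""
    frames = []
    next_frame_time = 0  # start time of the next frame to emit, in ms
    phoneme_end = 0      # end time of the current phoneme, in ms
    for feature, duration_ms in segments:
        phoneme_end += duration_ms
        # Emit frames while the frame start falls inside this phoneme.
        while next_frame_time < phoneme_end:
            frames.append(table[feature])
            next_frame_time += frame_ms
    return frames
```

Because frame times are carried across phoneme boundaries, a phoneme shorter than the frame interval may receive no frame at all; whether that is acceptable depends on the application.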