JPH09274666A - Portrait synthesizer

Info

Publication number: JPH09274666A
Application number: JP8082179A
Authority: JP
Inventors: Kenji Sakamoto; 憲治坂本; Haruo Hide; 晴夫日出
Original assignee: GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Sharp Corp
Current assignee: GIJUTSU KENKYU KUMIAI SHINJOHO SHIYORI KAIHATSU KIKO; Sharp Corp
Priority date: 1996-04-04
Filing date: 1996-04-04
Publication date: 1997-10-21
Anticipated expiration: 2016-04-04
Also published as: JP3830200B2

Abstract

PROBLEM TO BE SOLVED: To provide a human interface that does not feel unnatural by taking voice data as input, generating mouth shape changes and nodding actions corresponding to the data in synchronization with the voice, and displaying an image in which a portrait appears to speak. SOLUTION: Data from a voice input part 1 is temporarily stored in a buffer 2; the contents of the buffer are read out, converted to a voice signal and output from a voice output part 3. A mouth shape deciding part 4 decides the mouth shape corresponding to the voice data read from the buffer, and a nodding action deciding part 5 decides the nodding action from the silent period of the voice data. On the basis of these results, an image synthesizing part 6 synthesizes an image of the mouth shape and nodding action, which is displayed on an image display part 7.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to an image synthesizing device based on digital processing, and more particularly to such a device that synthesizes a person image expressing the mouth shape and nodding motion that accompany utterance.

[0002]

[Prior Art] The following can be cited as conventional examples of this type of technology. Japanese Unexamined Patent Publication No. H2-234285 relates to an image synthesizing method that takes as input a sentence expressed as a character string and generates a face moving image with the corresponding mouth shape changes. In that method, the character string is divided into a phoneme string, and a speech synthesis technique capable of outputting a speech feature and a duration for each phoneme is used. A mouth shape feature corresponding to each phoneme is determined from the speech feature, and the values of the mouth shape parameters that express a concrete mouth shape are then determined according to that mouth shape feature. The publication discloses that, starting from the mouth shape parameter values of each phoneme, the parameter values given to each frame of the moving image are controlled on the basis of the duration of each phoneme, so that a face moving image whose mouth shape changes match the speech output is displayed.

[0003] A method of estimating the corresponding mouth shape changes from speech input is described in Shigeo Morishima, Kiyoharu Aizawa, Hiroshi Harashima: "Study on Automatic Synthesis of Facial Expressions Based on Speech Information," Proceedings of the 4th NICOGRAPH Paper Contest, pp. 139-146, Japan Computer Graphics Association (November 1988). Two methods are proposed there for the input speech information: one computes the logarithmic mean power to control the degree of mouth opening, and the other computes linear prediction coefficients corresponding to the formant features of the vocal tract to estimate the mouth shape.

[0004]

[Problems to be Solved by the Invention] In the prior-art method that takes a sentence (character string) as input and determines the corresponding mouth shape changes, the output speech data is produced by converting a sentence (character string) prepared in advance into speech data. The input must therefore be a character string, and the mouth shape cannot be determined when speech data is input directly or for speech data that has no character string information. The method of Morishima et al. can determine the mouth shape, but it does not disclose control of facial movement and the like. The present invention has been made in view of these problems of the prior art. Its object is to provide an image synthesizing device that can synthesize the facial expression during utterance from input data other than a character string, and that can express a mouth shape and a nodding motion of the face image that are accurately matched to the speech output.

[0005]

[Means for Solving the Problems] The invention of claim 1 is an image synthesizing device for a person or the like having voice data input means, voice output means, image generation means for generating a face moving image, image display means, and processing/control means that processes and synthesizes the voice signal input from the input means and the image from the image generation means and outputs and displays them through the voice output means and the image display means. The device comprises a buffer that temporarily holds the voice data from the input means. The processing/control means estimates, from the voice data read from the buffer, the mouth shape corresponding to the voice and the nodding motion accompanying the utterance, and synthesizes the image from the image generation means according to the estimation result. By choosing the read position on the buffer into which voice data is sequentially input, the mouth shape is estimated from the read voice data before the voice is output, and the nodding motion is estimated from the silent period of the voice data. These results make it possible to display the change in expression that accompanies the utterance of the person image.

[0006] The invention of claim 2 is the invention of claim 1 in which the nodding motion is estimated using the temporal change of the logarithmic power of the voice signal, which provides an effective concrete means.

[0007] The invention of claim 3 is the invention of claim 1 in which the position at which voice data is input to the buffer is made to coincide with the position at which the voice data used for estimating the nodding motion is read from the buffer, which enables real-time operation.

[0008]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of an embodiment of the person image synthesizing device of the present invention. In FIG. 1, the voice input unit 1 either takes in voice through a microphone or the like and performs AD conversion to create voice data, or reads data from a recording medium on which voice data has been stored in advance. The buffer 2 temporarily stores the voice data input from the voice input unit 1. The storage format is FIFO (First In First Out: the data input first is output first); voice data is input sequentially from the voice input unit 1 and output sequentially to the voice output unit 3. The voice output unit 3 converts the voice data into a voice signal by DA conversion and outputs the voice.

[0009] FIG. 2 is a conceptual diagram for explaining the operation of one example of the buffer used in the present invention, and FIG. 3 is a conceptual diagram for explaining the operation of another example of the buffer. The buffer 2 may be a FIFO, as shown in FIG. 2, in which the data shifts one position to the right each time one item of data is input at "Ps" and data is output one item at a time from the rightmost position "Pe". Alternatively, as shown in FIG. 3, it may be a ring-shaped buffer in which each pointer advances by one each time data is input. The mouth shape determination unit 4 reads data from a predetermined position "P2" on the buffer 2 and extracts a feature of the voice data for each frame. To determine the mouth shape from the voice feature, a technique such as the one disclosed in Japanese Unexamined Patent Publication No. H5-135755 is used; that is, the mouth shape is determined according to the low-frequency and high-frequency components of the voice. In the example, one frame is 10 ms and the number of samples in one frame is N. If the position "P2" on the buffer is set to

P2 = Pe - β × N

the mouth shape can be determined β frames before the voice is actually output. The value of β is determined from the processing time required to decide the mouth shape, and is adjusted so that the mouth shape is displayed in synchronization with the voice actually being output. In this example, β = 5.
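The tap position described above can be illustrated with a minimal Python sketch. It is not taken from the publication: the class name `RingBuffer`, the method `frame_ahead`, and the toy frame size N = 4 are assumptions made purely for illustration. The sketch models a ring buffer in the manner of FIG. 3 and checks that the frame read at P2 = Pe - β × N reaches the output β frames later.

```python
class RingBuffer:
    """Fixed-length FIFO delay line, in the style of the ring buffer of FIG. 3."""

    def __init__(self, length):
        self.buf = [0] * length   # pre-filled with silence
        self.pos = 0              # index of Pe: the next sample to be output

    def push(self, sample):
        """Store one sample at the input position Ps and return the sample leaving at Pe."""
        out = self.buf[self.pos]
        self.buf[self.pos] = sample
        self.pos = (self.pos + 1) % len(self.buf)
        return out

    def frame_ahead(self, frames_ahead, n):
        """Return the n-sample frame that starts frames_ahead * n samples upstream of Pe."""
        if (frames_ahead + 1) * n > len(self.buf):
            raise ValueError("tap lies outside the buffer")
        start = self.pos + frames_ahead * n
        return [self.buf[(start + i) % len(self.buf)] for i in range(n)]


N = 4        # samples per frame; a toy value, the source only names it N
BETA = 5     # lead of the mouth-shape tap in frames, as in the example

rb = RingBuffer((BETA + 1) * N)
for s in range((BETA + 1) * N):      # fill the buffer with samples 0..23
    rb.push(s)

at_p2 = rb.frame_ahead(BETA, N)      # frame currently sitting at P2 = Pe - BETA * N
out = [rb.push(0) for _ in range((BETA + 1) * N)]
print(at_p2 == out[BETA * N:])       # the P2 frame leaves the buffer BETA frames later
```

With a 10 ms frame and β = 5, this lead corresponds to 50 ms available for deciding the mouth shape before the matching audio is played.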

[0010] The nodding motion determination unit 5 reads data from a predetermined position "P1" on the buffer and extracts a feature of the voice data for each frame. Nodding motions are often seen at sentence breaks and when uttering a word the speaker wants to emphasize. An example that detects sentence breaks and performs a nodding motion is described below. FIG. 4 shows an example in which logarithmic power is used as the feature. The solid line is the temporal change of logarithmic power when "Watakushi, Sharp no Sakamoto to moushimasu" ("I am Sakamoto of Sharp") is uttered. The vertical axis is logarithmic power and the horizontal axis is time. The threshold is set to a value that can distinguish whether or not voice is being input. After voice input has started (the time "Ts" at which the threshold is exceeded), the voice input is judged to have ended at the time "Te + Fs", when the logarithmic power of the voice has remained at or below the threshold for Fs or more consecutive frames. The nodding motion is started at the point when the voice is judged to have ended. In the example, one frame is 10 ms. The value of Fs is set longer than the silent portions within a sentence (the duration of the closure before a plosive); in the example, Fs = 30. In the example of FIG. 4, when the utterance is spoken continuously, that is, when

T1' - T1 < 10 × Fs
T2' - T2 < 10 × Fs

no nodding motion is generated in the interval [T1, T1'] or the interval [T2, T2']. When the utterance is broken after "watakushi" ("I") or "Sharp no" ("of Sharp"), that is, when

T1' - T1 > 10 × Fs
T2' - T2 > 10 × Fs

a nodding motion is generated in the interval [T1, T1'] and the interval [T2, T2']. With N the number of samples in one frame, if the position "P1" on the buffer is set to

P1 = Pe - (Fs + α) × N

the nodding motion can be determined α frames before the time "Te" at which the voice output ends. The value of α is adjusted so that the nodding motion is performed just before the utterance ends at a sentence break. In this example, α = 20. Because the output of the voice input unit 1 reaches the voice output unit 3 through the buffer 2, a lag arises between voice input and voice output when voice is input in real time. To keep this lag as small as possible, "Ps" and "P1" should be made to coincide. The image synthesis unit 6 combines the mouth shape determined by the mouth shape determination unit 4 with the nodding motion determined by the nodding motion determination unit 5 to generate a face image. Specifically, multiple animation frames of the nodding motion, such as those shown in FIG. 5, are prepared in advance, and the nodding motion is expressed by playing them back in succession.
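The end-of-speech rule described above can be sketched in Python as follows. This is an illustrative reading of the text, not the publication's implementation: the function names `log_power` and `detect_nods`, the base-10 logarithm, the floor value, and the toy power track are all assumptions. Only the frame length (10 ms), Fs = 30 and α = 20 come from the example.

```python
import math

FRAME_MS = 10   # frame length in ms, from the example
FS = 30         # consecutive silent frames needed to declare end of speech
ALPHA = 20      # lead of the nod over the end of the output speech, in frames


def log_power(frame):
    """Log of the mean squared amplitude of one frame (floored to avoid log(0))."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return math.log10(max(mean_sq, 1e-10))


def detect_nods(powers, threshold, fs=FS):
    """Return the frame indices at which a nodding motion would start.

    A nod starts once speech has begun (power above threshold) and the power
    has then stayed at or below threshold for fs consecutive frames.
    """
    nods = []
    speaking = False
    silent_run = 0
    for i, p in enumerate(powers):
        if p > threshold:
            speaking = True
            silent_run = 0
        elif speaking:
            silent_run += 1
            if silent_run >= fs:
                nods.append(i)
                speaking = False   # wait for the next utterance
                silent_run = 0
    return nods


# Offset of tap P1 upstream of the output position Pe, in frames and in ms.
p1_lead_frames = FS + ALPHA
p1_lead_ms = p1_lead_frames * FRAME_MS

# Toy power track: speech, a 3-frame pause, speech, then a long silence.
powers = [0.0] * 5 + [5.0] * 10 + [0.0] * 3 + [5.0] * 10 + [0.0] * 40
print(detect_nods(powers, threshold=1.0))
print(p1_lead_frames, p1_lead_ms)
```

In this toy track the 3-frame pause is shorter than Fs and triggers nothing, while the long silence triggers a single nod. Under the example values, P1 sits Fs + α = 50 frames (500 ms) upstream of Pe: the end of speech is confirmed Fs frames after Te at the tap, at which moment the output is still α frames short of Te. If Ps is made to coincide with P1, the buffer would span those same 50 frames, so the input-to-output delay would be about 500 ms.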

[0011]

[Effects of the Invention] According to the present invention, for voice data sent in real time over a communication link or for stored voice data, the expression of a person image can be displayed as an image in synchronization with the voice, as if the person image were speaking.

[Brief description of drawings]

FIG. 1 is a block diagram of an embodiment of the person image synthesizing device of the present invention.

FIG. 2 is a conceptual diagram for explaining the operation of one example of the buffer used in the present invention.

FIG. 3 is a conceptual diagram for explaining the operation of another example of the buffer used in the present invention.

FIG. 4 is a diagram for explaining an example in which logarithmic power is used as the feature for the nodding motion.

FIG. 5 is a diagram showing an example of a screen for creating the person images that perform the nodding motion.

[Explanation of symbols]

1 ... voice input unit, 2 ... buffer, 3 ... voice output unit, 4 ... mouth shape determination unit, 5 ... nodding motion determination unit, 6 ... image synthesis unit, 7 ... image display unit.

Claims

[Claims]

1. An image synthesizing device for a person or the like, having voice data input means, voice output means, image generation means for generating a face moving image, image display means, and processing/control means that processes and synthesizes a voice signal input from the input means and an image from the image generation means and outputs and displays them through the voice output means and the image display means, the device comprising a buffer that temporarily holds voice data from the input means, wherein the processing/control means estimates, from the voice data read from the buffer, the mouth shape corresponding to the voice and the nodding motion accompanying the utterance, and synthesizes the image from the image generation means according to the estimation result.

2. The image synthesizing device for a person or the like according to claim 1, wherein the nodding motion is estimated using the temporal change of the logarithmic power of the voice signal.

3. The image synthesizing device for a person or the like according to claim 1, wherein the position at which voice data is input to the buffer and the position at which the voice data used for estimating the nodding motion is read from the buffer are made to coincide.