JP3830200B2 - Human image synthesizer

Info

Publication number: JP3830200B2
Application number: JP08217996A
Authority: JP
Inventors: 憲治坂本; 晴夫日出
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1996-04-04
Filing date: 1996-04-04
Publication date: 2006-10-04
Anticipated expiration: 2016-04-04
Also published as: JPH09274666A

Description

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a digital image synthesizing apparatus, and more particularly to a human image synthesizing apparatus that synthesizes a person image expressing the mouth shapes and nodding motions that accompany speech.
[0002]
[Prior art]
The following are conventional examples of this type of technology.
Japanese Laid-Open Patent Publication No. 2-234285 relates to an image synthesis method that takes as input a sentence expressed as a character string and generates a face moving image whose mouth shape changes to match the sentence.
In this method, a speech synthesis technique is used that divides the character string into a phoneme string and can output speech features and a duration for each phoneme. A mouth shape feature corresponding to each phoneme is determined from the speech features, and the values of the mouth shape parameters that express a concrete mouth shape are then determined according to the mouth shape feature.
The publication discloses that the mouth shape parameter values given to each frame of the moving image are then controlled, using the parameter values and the duration of each phoneme, so that a face moving image is displayed whose mouth shape changes match the audio output.
[0003]
In addition, a method for estimating the corresponding mouth shape changes from input speech is described in Shigeo Morishima, Kiyoharu Aizawa, Hiroshi Harashima: “Research on automatic facial expression synthesis based on speech information”, Proceedings of the 4th NICOGRAPH Paper Contest, pp. 139-146, Japan Computer Graphics Association (November 1988). Two methods are proposed there for the input speech information: one calculates the logarithmic average power and uses it to control the degree of mouth opening, and the other calculates linear prediction coefficients corresponding to the formant features of the vocal tract and uses them to estimate the mouth shape.
[0004]
[Problems to be solved by the invention]
In the prior-art method that takes a sentence (character string) as input and determines the corresponding mouth shape changes, the output voice data is generated from a sentence (character string) prepared in advance. The input must therefore be a character string, and the mouth shape cannot be determined when voice data is input directly or for voice data that has no character string information.
Further, the method of Morishima et al. can determine the mouth shape, but does not disclose control of facial movements and the like.
The present invention has been made in view of the above problems of the prior art. Its object is to provide an image synthesizing apparatus that can synthesize facial expressions during speech from input data other than character strings, and that can express a mouth shape accurately matched to the audio output together with a nodding motion of the face image.
[0005]
[Means for Solving the Problems]
The invention of claim 1 is a human image synthesizing apparatus that receives voice data and generates and outputs a face moving image having mouth shape changes and facial movements corresponding to the voice data, comprising: voice input means for inputting the voice data; a buffer for temporarily holding the voice data input by the voice input means; mouth shape determination means for determining a mouth shape corresponding to the voice data read from the buffer and determining a mouth shape image corresponding to the mouth shape; nodding operation determination means for determining a silent portion of the voice data read from the buffer and determining that a nodding motion occurs in the silent portion; image synthesizing means for generating a face moving image by synthesizing, in synchronization with the silent portion determined by the nodding operation determination means, the mouth shape image determined by the mouth shape determination means with a nodding face moving image; and image display means for outputting the face moving image generated by the image synthesizing means. By choosing the read positions on the buffer into which voice data is sequentially input, the mouth shape is determined from the voice data that is read, the position of the nodding motion is determined from the silent portions of the voice data, and the changes in expression that accompany the person image's speech can be displayed based on these results.
[0006]
The invention of claim 2 is the invention of claim 1 in which the nodding operation determination means determines the silent portion using the time variation of the logarithmic power of the voice data, thereby providing an effective concrete means.
[0007]
The invention of claim 3 is the invention of claim 1 in which the input position of the voice data into the buffer is made to coincide with the position from which the voice data used by the nodding operation determination means is read from the buffer, thereby enabling real-time operation.
[0008]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
FIG. 1 is a block diagram of an embodiment of the human image synthesizing apparatus according to the present invention.
In FIG. 1, a voice input unit 1 inputs voice with a microphone and performs AD conversion to create voice data, or reads data from a recording medium in which voice data is stored in advance.
The buffer 2 temporarily stores the audio data input from the audio input unit 1. The storage format is FIFO (First In First Out: the first input data is output first). The audio data is sequentially input from the audio input unit 1, and the audio data is sequentially output to the audio output unit 3. The audio output unit 3 converts audio data into an audio signal by DA conversion and outputs audio.
[0009]
FIG. 2 is a conceptual diagram for explaining the operation of an example of the buffer used in the present invention. FIG. 3 is a conceptual diagram for explaining the operation of another example of the buffer used in the present invention.
As shown in FIG. 2, the buffer 2 is a FIFO in which data is shifted to the right every time one piece of data is input from “Ps”, and data is output one by one from the rightmost “Pe”. Alternatively, as shown in FIG. 3, a ring-shaped buffer may be used in which each pointer advances by one each time data is input.
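The two buffer forms behave identically from the outside; only the bookkeeping differs. As a minimal sketch of the ring form of FIG. 3 (the class name `RingBuffer` and its methods are hypothetical, not taken from the patent), the input position Ps can be modeled as a write index that advances by one per sample, and the sample leaving at Pe as the one the write is about to overwrite:

```python
class RingBuffer:
    """Fixed-length ring buffer: one sample in at Ps, one sample out at Pe per push."""

    def __init__(self, length):
        self._data = [0] * length   # starts filled with silence
        self._head = 0              # slot that receives the next input sample (Ps)
        self._length = length

    def push(self, sample):
        """Store one sample and return the oldest sample, which leaves at Pe."""
        oldest = self._data[self._head]
        self._data[self._head] = sample
        self._head = (self._head + 1) % self._length
        return oldest

    def peek(self, age):
        """Return the sample pushed `age` pushes ago (0 = newest, length - 1 = oldest)."""
        return self._data[(self._head - 1 - age) % self._length]
```

In this model a read position such as “P2” or “P1” corresponds to a fixed `age` passed to `peek`, so the determination units can look at samples that have not yet reached the output.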
The mouth shape determination unit 4 reads data from a predetermined position “P2” on the buffer 2 and extracts a feature amount of audio data for each frame.
As a method for determining the mouth shape from the speech features, a technique such as the one disclosed in Japanese Patent Application No. 5-135755 (Japanese Laid-Open Patent Publication No. 6-348811) is used. That is, the mouth shape is determined according to the low-frequency and high-frequency components of the voice.
In the example, one frame is 10 ms, and the number of samples in one frame is N.
If position “P2” on the buffer is set to
P2 = Pe − β × N
the mouth shape can be determined β frames before the time at which the corresponding voice is actually output.
The value of β is determined from the processing time required to determine the mouth shape, and is adjusted so that the displayed mouth shape is synchronized with the voice actually being output. In this example, β = 5.
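As a worked example of this read position, the sketch below evaluates P2 = Pe − β × N. The 8 kHz sampling rate (and hence N = 80) and the buffer index chosen for Pe are assumptions for illustration; the text fixes only the 10 ms frame length and β = 5. The function names are hypothetical.

```python
FRAME_MS = 10  # frame length given in the text


def samples_per_frame(sample_rate_hz: int, frame_ms: int = FRAME_MS) -> int:
    """N: number of samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def mouth_read_position(pe: int, beta: int, n: int) -> int:
    """P2 = Pe - beta * N, measured in samples along the buffer."""
    return pe - beta * n


# Assumed values: 8 kHz audio and an output position Pe at buffer index 4000.
N = samples_per_frame(8000)           # 80 samples per 10 ms frame
P2 = mouth_read_position(4000, 5, N)  # 4000 - 5 * 80 = 3600
lookahead_ms = 5 * FRAME_MS           # mouth shape is decided 50 ms ahead of output
```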
[0010]
The nodding operation determination unit 5 reads data from a predetermined position “P1” on the buffer, and extracts the feature amount of the audio data for each frame.
Nodding motions are often seen at sentence breaks and when the speaker utters a word he or she wants to emphasize.
In the following, an example of detecting a sentence break and performing a nodding operation will be described.
An example in the case of using logarithmic power as a feature quantity is shown in FIG.
The solid line shows the time variation of the logarithmic power when the phrase “Watakushi, Sharp no Sakamoto to moushimasu” (“I am Sakamoto of Sharp”) is uttered. The vertical axis is logarithmic power, and the horizontal axis is time.
The threshold value is set to a value that can be used to determine whether or not there is voice input.
After voice input has started (the time “Ts” at which the threshold is exceeded), the voice input is judged to have ended at the time “Te + 10 × Fs”, when the logarithmic power of the voice has remained at or below the threshold for Fs or more consecutive frames. The nodding motion is started at the moment the voice is judged to have ended.
In the example, one frame is 10 ms.
The value of Fs is set longer than the silent portions that occur within a sentence (the duration of the closure before a plosive); in the example, Fs = 30.
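The end-of-speech rule can be sketched as a small state machine over per-frame logarithmic power. This is an illustrative sketch rather than the patent's implementation: the decibel scaling, the `eps` guard, and the function names are assumptions, and the threshold value must be chosen by the caller.

```python
import math


def log_power(frame, eps=1e-12):
    """Logarithmic power of one frame in dB; eps avoids log(0) on digital silence."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + eps)


def detect_speech_ends(log_powers, threshold, fs_frames=30):
    """Return the frame indices at which the end of speech is declared.

    Speech starts when a frame exceeds the threshold (Ts). The end is declared
    once the power has stayed at or below the threshold for fs_frames
    consecutive frames, i.e. at Te + 10 * Fs milliseconds for 10 ms frames.
    """
    ends = []
    in_speech = False
    silent_run = 0
    for i, p in enumerate(log_powers):
        if p > threshold:
            in_speech = True
            silent_run = 0
        elif in_speech:
            silent_run += 1
            if silent_run >= fs_frames:
                ends.append(i)  # a nodding motion would start here
                in_speech = False
                silent_run = 0
    return ends
```

A short dip below the threshold, such as the closure before a plosive, resets nothing and triggers nothing as long as it lasts fewer than `fs_frames` frames.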
In the example of FIG. 4, when the utterance is pronounced continuously, that is, when
T1′ − T1 < 10 × Fs
T2′ − T2 < 10 × Fs
no nodding motion is generated in the sections [T1, T1′] and [T2, T2′]. When the speaker pauses after “Watakushi” (“I”) or “Sharp no” (“of Sharp”), that is, when
T1′ − T1 > 10 × Fs
T2′ − T2 > 10 × Fs
a nodding motion is generated in the sections [T1, T1′] and [T2, T2′].
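Both cases reduce to comparing each pause length with 10 × Fs = 300 ms. A minimal sketch follows, with hypothetical pause timings chosen only for illustration (FIG. 4 gives no numeric values) and a hypothetical function name:

```python
FRAME_MS = 10   # frame length from the text
FS_FRAMES = 30  # Fs from the text


def pause_triggers_nod(t_start_ms, t_end_ms, fs_frames=FS_FRAMES, frame_ms=FRAME_MS):
    """True when the silent section [t_start, t_end] lasts at least 10 * Fs ms.

    The inequalities in the text cover only the strict cases; equality is
    treated here according to the "Fs frames or more" rule.
    """
    return (t_end_ms - t_start_ms) >= frame_ms * fs_frames


# Hypothetical timings (ms) for the pauses [T1, T1'] and [T2, T2'].
continuous = [(600, 680), (1500, 1590)]   # short closures inside fluent speech
segmented = [(600, 1050), (1900, 2400)]   # deliberate pauses between phrases

nods_continuous = [pause_triggers_nod(a, b) for a, b in continuous]
nods_segmented = [pause_triggers_nod(a, b) for a, b in segmented]
```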
If the number of samples in one frame is N and position “P1” on the buffer is set to
P1 = Pe − (Fs + α) × N
the nodding motion can be determined α frames before the time “Te” at which the output of the voice ends.
The value of α is adjusted so that the nodding motion is performed just before the utterance ends at a sentence break.
In this example, α = 20.
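The timing can be checked numerically with the example values. The sketch below assumes 8 kHz sampling (N = 80) and a buffer index of 4000 for Pe, neither of which the text specifies, and uses hypothetical function names. It shows why the judgment falls α frames before the output reaches Te: the data read at P1 leads the data at Pe by Fs + α frames, while the judgment itself needs Fs frames of silence after Te.

```python
FRAME_MS = 10
FS_FRAMES = 30   # Fs from the text
ALPHA = 20       # alpha from the text


def nod_read_position(pe: int, fs_frames: int, alpha: int, n: int) -> int:
    """P1 = Pe - (Fs + alpha) * N, in samples along the buffer."""
    return pe - (fs_frames + alpha) * n


def output_time_at_judgment(te_ms: int, fs_frames: int, alpha: int,
                            frame_ms: int = FRAME_MS) -> int:
    """Time stamp of the audio being output when the end of speech is judged.

    The judgment happens when the data read at P1 has time stamp Te + Fs frames;
    the data at Pe is (Fs + alpha) frames older than the data at P1.
    """
    judged_at_ms = te_ms + fs_frames * frame_ms
    return judged_at_ms - (fs_frames + alpha) * frame_ms


N = 80  # assumed: 8 kHz sampling, 10 ms frames
P1 = nod_read_position(4000, FS_FRAMES, ALPHA, N)            # 4000 - 50 * 80 = 0
out_ms = output_time_at_judgment(2000, FS_FRAMES, ALPHA)     # 2000 + 300 - 500 = 1800
```

With Te = 2000 ms, the nod is triggered while the audio from 1800 ms is being output, that is, α × 10 = 200 ms before the utterance is heard to end.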
Since the output from the voice input unit 1 reaches the audio output unit 3 through the buffer 2, a delay arises between audio input and audio output when voice is input in real time. To keep this delay as small as possible, “Ps” and “P1” are preferably made to coincide.
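The size of this delay follows from the distance between the input and output positions. A rough sketch, assuming Ps is placed exactly at P1 so that the buffer spans Fs + α frames between input and output (the function name is hypothetical):

```python
def input_to_output_delay_ms(fs_frames: int, alpha: int, frame_ms: int = 10) -> int:
    """Approximate delay between a sample entering at Ps and leaving at Pe when Ps == P1."""
    return (fs_frames + alpha) * frame_ms


delay_ms = input_to_output_delay_ms(30, 20)  # (30 + 20) * 10 = 500 ms
```

With the example values, P2 lies between P1 and Pe, so placing Ps any further from Pe than P1 would add delay without giving either determination unit data it needs.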
The image synthesizing unit 6 generates a face moving image by combining, in synchronization with the nodding motion determined by the nodding operation determination unit 5, the mouth shape image determined by the mouth shape determination unit 4 with a nodding face moving image.
Specifically, the nodding motion is expressed by preparing a plurality of nodding animation frames, as shown in FIG. 5, and playing them back in succession.
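Playback of the prepared nod frames can be pictured as indexing into a fixed pose sequence once a nod is triggered. The following is a minimal sketch only: the pose labels, the mouth shape labels, and the function `compose_sequence` are hypothetical stand-ins for the actual images of FIG. 5, and the text does not specify how the frames are stored or combined.

```python
NOD_POSES = ["down1", "down2", "lowest", "up2", "up1"]  # hypothetical nod animation
NEUTRAL = "neutral"


def compose_sequence(mouth_shapes, nod_starts, nod_poses=NOD_POSES):
    """Pair each display frame's mouth shape with the head pose for that frame.

    mouth_shapes: one mouth shape label per display frame.
    nod_starts:   display-frame indices at which a nod was triggered.
    Returns a list of (head_pose, mouth_shape) tuples, one per frame.
    """
    poses = [NEUTRAL] * len(mouth_shapes)
    for start in nod_starts:
        for offset, pose in enumerate(nod_poses):
            if start + offset < len(poses):
                poses[start + offset] = pose
    return list(zip(poses, mouth_shapes))
```

Each tuple stands for one composited output frame; in a real renderer the two labels would select a head image and a mouth image to overlay.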
[0011]
[Effects of the invention]
According to the present invention, for voice data sent in real time via communication or for stored voice data, the facial expression can be displayed as an image in synchronization with the voice, as if the person image were speaking.
[Brief description of the drawings]
FIG. 1 is a block diagram of an embodiment of a human image composition device of the present invention.
FIG. 2 is a conceptual diagram for explaining the operation of an example of a buffer used in the present invention.
FIG. 3 is a conceptual diagram for explaining the operation of another example of a buffer used in the present invention.
FIG. 4 is a diagram for explaining an example in the case where logarithmic power is used as a feature amount of a nodding operation.
FIG. 5 is a diagram showing an example of a screen for creating a person image that performs a nodding motion.
[Explanation of symbols]
1 ... voice input unit, 2 ... buffer, 3 ... audio output unit, 4 ... mouth shape determination unit, 5 ... nodding operation determination unit, 6 ... image synthesizing unit, 7 ... image display unit.

Claims

A human image synthesizing apparatus that receives voice data and generates and outputs a face moving image having a mouth shape change or a face motion corresponding to the voice data, the apparatus comprising: voice input means for inputting the voice data; a buffer for temporarily holding the voice data input by the voice input means; mouth shape determination means for determining a mouth shape corresponding to the voice data read from the buffer and determining a mouth shape image corresponding to the mouth shape; nodding motion determination means for determining a silent portion of the voice data and determining that a nodding motion occurs in the silent portion; image synthesizing means for generating a face moving image by synthesizing, in synchronization with the silent portion determined by the nodding motion determination means, the mouth shape image determined by the mouth shape determination means with a nodding face moving image; and image display means for outputting the face moving image generated by the image synthesizing means.

The human image synthesizing apparatus according to claim 1, wherein the nodding motion determination means determines the silent portion using the time variation of the logarithmic power of the voice data.

The human image synthesizing apparatus according to claim 1, wherein the input position of the voice data into the buffer is made to coincide with the position from which the voice data used by the nodding motion determination means is read from the buffer.