JP2001034785A - Virtual transformation device

Info

Publication number: JP2001034785A
Application number: JP11202514A
Authority: JP
Inventors: Tatsumi Sakaguchi; 竜己坂口; Atsushi Otani; 淳大谷
Original assignee: ATR Media Integration and Communication Research Laboratories
Current assignee: ATR Media Integration and Communication Research Laboratories
Priority date: 1999-07-16
Filing date: 1999-07-16
Publication date: 2001-02-09

Abstract

PROBLEM TO BE SOLVED: To reproduce, in a virtual environment, a three-dimensional model that resembles a subject. SOLUTION: A color camera 16 photographs a full-length image of the subject, and an image processing circuit 18 generates physical information on the subject from the photographed image. A microphone 20 picks up the subject's voice, and a voice processing circuit 22 generates attribute information on the subject from the picked-up voice and the physical information. The generated information is transmitted to a three-dimensional model selecting circuit 26 through a communication line 24. The three-dimensional model selecting circuit 26 selects, from the three-dimensional models recorded in a memory 26a, a three-dimensional model matching the physical information and the attribute information. The selected three-dimensional model is displayed on an image display device 30.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a virtual transformation device, and more particularly to a virtual transformation device that reproduces a person in a virtual environment in real time as a three-dimensional CG (computer graphics) model.

[0002]

[Prior Art] Conventionally, the three-dimensional CG model (avatar) reproduced in a virtual environment has been chosen arbitrarily from several kinds of three-dimensional CG models prepared in advance.

[0003]

[Problems to Be Solved by the Invention] The prior art allows interesting uses, such as a man operating a female character. On the other hand, when the person and the character are mismatched, the sense of immersion is reduced.

[0004] It is also difficult to reproduce a person's emotions, such as anger, sadness, joy, surprise, and fear, in a virtual environment. In particular, emotional connotations and subtly nuanced expressions could not be reproduced in a virtual environment.

[0005] Therefore, a principal object of the present invention is to provide a virtual transformation device capable of reproducing, in a virtual environment, a three-dimensional model that resembles the figure of a person.

[0006] Another object of the present invention is to provide a virtual transformation device capable of reproducing a person's emotions in a virtual environment.

[0007]

[Means for Solving the Problems] The present invention is a virtual transformation device comprising: voice input means for inputting a person's voice; voice feature information creating means for creating voice feature information based on the voice; three-dimensional model storage means for storing a plurality of three-dimensional models in advance; model selecting means for selecting any one of the plurality of three-dimensional models based on the voice feature information; and reproducing means for reproducing the selected three-dimensional model in a virtual environment.

[0008]

[Operation] When a person's voice is input, the voice feature information creating means creates voice feature information based on that voice. The three-dimensional model storage means stores a plurality of three-dimensional models in advance, and the model selecting means selects any one of them based on the voice feature information. The selected three-dimensional model is reproduced in the virtual environment by the reproducing means. The voice feature information includes, for example, a pitch frequency.

[0009] In one aspect of the present invention, when a full-body image of the person is input, physical information creating means creates physical information of the person based on the full-body image. Attribute information creating means creates attribute information of the person based on the physical information and the voice feature information. The model selecting means selects the three-dimensional model based on the physical information and the attribute information created in this way.

[0010] In one embodiment of the present invention, the attribute information indicates whether the person is a man, a woman, or a child, and the plurality of three-dimensional models include a plurality of men with different body types, a plurality of women with different body types, and a plurality of children with different body types. The model selecting means selects, based on the physical information and the attribute information, a three-dimensional model whose body type and attribute match those of the person.

[0011] In another aspect of the present invention, emotion information creating means creates emotion information of the person based on the voice feature information, and model deforming means deforms the facial expression of the selected three-dimensional model based on the emotion information.

[0012] In one embodiment of the present invention, the voice feature information includes at least one of pitch frequency, dynamic range, spectral envelope component, power, and speech rate. The emotion information creating means therefore creates the emotion information based on at least one of these parameters.

[0013] In another embodiment of the present invention, when a face image of the person is input, face information creating means creates face information based on the face image. The model deforming means deforms the facial expression based on at least one of the emotion information and the face information.

[0014] In yet another embodiment of the present invention, the model deforming means changes the facial expression by rendering processing such as scaling the three-dimensional model, changing its color, or deforming part of it.

[0015]

[Effects of the Invention] According to the present invention, the avatar model to be reproduced in the virtual environment is selected automatically based on the person's voice or the person's full-body image, so a three-dimensional model resembling the person (subject) can be reproduced in the virtual environment. In addition, emotion information is created from the person's voice and the facial expression of the three-dimensional model is deformed based on it, so the subject's emotions can be reproduced in the virtual environment.

[0016] The above object, other objects, features, and advantages of the present invention will become more apparent from the following detailed description of embodiments given with reference to the drawings.

[0017]

[Embodiments] The virtual transformation device 10 of this embodiment, shown in FIG. 1, includes color cameras (hereinafter simply "cameras") 12 and 16. The camera 12 photographs the face of a person (subject) from the front and inputs the captured face image to an image processing device 14. The camera 16 photographs the person's whole body and inputs the captured full-body image to an image processing device 18.

[0018] Each of the image processing devices 14 and 18 is a computer with processing power comparable to a personal computer. Based on the input face image, the image processing circuit 14 generates face information that includes face posture data, such as the tilt of the face axis and the rotation about the face axis, and facial expression data, such as joy and sadness. Based on the input full-body image, the image processing device 18 generates physical information that includes body type data, such as the person's height, arm length, and torso width, and body posture data, such as the positions of the limbs. To estimate the person's posture more accurately, it is preferable to provide a plurality of color cameras that photograph the whole body.

[0019] The virtual transformation device 10 also includes a microphone 20 that picks up the person's voice. A voice processing circuit 22 takes in the voice signal output from the microphone 20 and the physical information from the image processing circuit 18, and based on these generates emotion information of the person (joy, anger, sorrow, and pleasure) and attribute information of the person (man, woman, or child).

[0020] The outputs of the image processing devices 14 and 18 and of the voice processing circuit 22 are supplied through a communication line 24 to a three-dimensional model selecting device 26 and a three-dimensional model deforming device 28. Specifically, the attribute information output from the voice processing device 22 and the physical information output from the image processing device 18 are supplied to the three-dimensional model selecting device 26. The face information output from the image processing device 14, the physical information output from the image processing device 18, and the emotion information output from the voice processing device 22 are supplied to the three-dimensional model deforming device 28.

[0021] The three-dimensional model selecting device 26 and the three-dimensional model deforming device 28 are likewise each constituted by a computer with processing power comparable to a personal computer. The two devices may be combined into a single computer.

[0022] A plurality of three-dimensional models are stored in advance in a memory 26a provided in the three-dimensional model selecting device 26. These three-dimensional models include a plurality of adult men with different body types, a plurality of adult women with different body types, and a plurality of children with different body types. The three-dimensional model selecting device 26 identifies, from among these models, the three-dimensional model that matches the input attribute information and physical information (in particular the body type data), and outputs the character data of the identified model to the three-dimensional model deforming device 28.
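The selection described above can be illustrated with a minimal sketch. The record layout (`attribute`, `body`), the field names, and the squared-distance matching rule are assumptions made for illustration; the specification only states that the model matching the attribute information and body type data is identified.

```python
def select_model(models, attribute, body):
    """Pick the stored model with the same attribute and the closest body type.

    `models` is a list of dicts with hypothetical keys "name", "attribute", "body".
    """
    candidates = [m for m in models if m["attribute"] == attribute]
    if not candidates:
        return None

    def distance(model):
        # Squared Euclidean distance over the measured body-type values (assumed metric).
        return sum((model["body"][key] - body[key]) ** 2 for key in body)

    return min(candidates, key=distance)


# Illustrative contents of memory 26a (values are placeholders, in centimeters).
MODELS = [
    {"name": "man_tall", "attribute": "man", "body": {"height": 185, "arm": 80, "torso": 40}},
    {"name": "man_short", "attribute": "man", "body": {"height": 165, "arm": 70, "torso": 42}},
    {"name": "child_a", "attribute": "child", "body": {"height": 120, "arm": 50, "torso": 28}},
]

chosen = select_model(MODELS, "man", {"height": 168, "arm": 72, "torso": 41})
```

In this example `chosen` is the `man_short` record, because its body type values are nearest to the measured ones.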

[0023] The three-dimensional model deforming device 28 modifies the character data input from the three-dimensional model selecting device 26 based on the face information and the emotion information. Specifically, it estimates the person's emotion from the face information (in particular the facial expression data) and the emotion information, and modifies the character data so that when the person smiles the three-dimensional model smiles too. It also estimates the person's posture from the face information (in particular the face posture data) and the physical information (in particular the body posture data), and modifies the character data so that when the person turns sideways the three-dimensional model turns sideways too. The modified character data is then output to the image display device 30.

[0024] The image display device 30 composites the three-dimensional model into the virtual environment based on the input character data. A three-dimensional model resembling the person is thereby reproduced in the virtual environment, and the posture and facial expression of the model change according to those of the person. The alter-ego image of the subject reproduced in the virtual environment in this way is called an "avatar". The image display device 30 is constituted by, for example, a computer equipped with a 3D graphics accelerator.

[0025] The image processing device 14 connected to the color camera 12, which photographs the person's face, operates as shown in FIG. 2. First, in step S1, the face image of the subject is captured. Next, in step S3, the face image is binarized, with the skin-color region set to "0" and everything else set to "1". The input color face image is thereby converted into a black-and-white binary image. In step S5, the eye region is tracked: a template passing through the center of the pupil is created in the eye region obtained by preprocessing, and the eye position is tracked by template matching. Then, in step S7, the position of the mouth is estimated based on the relative relationship between the eyes and the mouth obtained by preprocessing.
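As an illustration only, the binarization of step S3 might be sketched as follows. The skin test `is_skin` and its RGB thresholds are a common rule of thumb assumed here; the specification does not state how the skin-color region is decided.

```python
def is_skin(r, g, b):
    """Hypothetical skin-color test on 8-bit RGB values (assumed rule, not from the specification)."""
    return r > 95 and g > 40 and b > 20 and r > g and r > b and (r - g) > 15


def binarize_skin(image):
    """Return a binary image: 0 for skin-colored pixels, 1 for all other pixels."""
    return [[0 if is_skin(*pixel) else 1 for pixel in row] for row in image]


# A 2x2 test image given as rows of (R, G, B) tuples.
image = [
    [(200, 140, 120), (30, 30, 30)],
    [(10, 200, 10), (220, 170, 150)],
]
mask = binarize_skin(image)  # [[0, 1], [1, 0]]
```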

[0026] In step S9, the facial expression is recognized. The recognition uses the method published by the present inventors in IEICE Transactions D-II, Vol. J80-D-II, No. 6, pp. 1547-1554 (June 1997). In this method the image is transformed into the spatial frequency domain by a two-dimensional discrete cosine transform, and the facial expression is recognized by capturing the change in power in each frequency band that corresponds to a change in a facial part. In other words, the facial expression is recognized by detecting changes in the shape of the eyes and the mouth. This enables fast detection that is robust to region misalignment and to deformation of the facial parts.
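The idea of tracking per-band power in the DCT domain can be sketched as below. This is not the cited method itself: the orthonormal DCT-II, the grouping of coefficients into bands by `u + v`, and the comparison against a neutral-expression block are assumptions chosen to keep the example small.

```python
import math


def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(u):
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def band_power(coeffs):
    """Sum of squared coefficients per band, where band index = u + v (assumed grouping)."""
    n = len(coeffs)
    bands = [0.0] * (2 * n - 1)
    for u in range(n):
        for v in range(n):
            bands[u + v] += coeffs[u][v] ** 2
    return bands


def band_power_change(neutral_block, current_block):
    """Per-band power difference between a neutral-expression block and the current block."""
    before = band_power(dct2(neutral_block))
    after = band_power(dct2(current_block))
    return [b - a for a, b in zip(before, after)]


flat = [[2.0] * 4 for _ in range(4)]
flat_power = band_power(dct2(flat))  # all power falls in band 0 (the DC term)
```

A classifier would then look at which bands gain or lose power when the eye or mouth region changes shape; that decision rule is outside this sketch.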

[0027] In step S11, the tilt of the face axis and the rotation about the face axis are detected. The angle that the line connecting the two eyes makes with the horizontal axis is taken as the tilt of the face. The rotation angle about the face axis is found from the distance between the right-eye region and the left-eye region: the smaller the distance, the more the face is regarded as rotated.
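The two measurements of step S11 can be sketched as follows. The tilt follows directly from the description. The rotation formula is an assumption: it treats the observed eye spacing as the frontal spacing scaled by the cosine of the rotation angle, which the specification does not state. `frontal_distance` is a hypothetical calibration value measured while the subject faces the camera.

```python
import math


def face_tilt_deg(left_eye, right_eye):
    """Angle in degrees between the line joining the eye centers and the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def face_rotation_deg(eye_distance, frontal_distance):
    """Rotation about the face axis, assuming spacing shrinks with the cosine of the angle."""
    ratio = max(0.0, min(1.0, eye_distance / frontal_distance))
    return math.degrees(math.acos(ratio))


tilt = face_tilt_deg((100, 120), (160, 120))   # eyes level
rotation = face_rotation_deg(50.0, 100.0)      # spacing halved
```

The eye spacing alone does not give the direction of rotation, so a sign would have to come from another cue.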

[0028] In step S13, the face information obtained in this way, that is, face information including the facial expression data and the face posture data, is transmitted to the three-dimensional model deforming device 28 through the communication line 24.

[0029] The image processing device 18 connected to the color camera 16, which photographs the person's whole body, operates as shown in FIG. 3. First, in step S21, a color image containing the full-body image of the subject is captured from the camera 16, and in step S23 the full-body image is extracted by background subtraction. Background subtraction is a method of extracting only the person image by subtracting the image of everything other than the person (the background image) from an image that contains the person. The background image is acquired in advance by preprocessing.
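Background subtraction as described above can be sketched on grayscale images as follows. The per-pixel absolute difference and the value of `threshold` are assumptions; the specification only says that the background image is subtracted.

```python
def subtract_background(frame, background, threshold=30):
    """Mark pixels that differ from the pre-captured background by more than `threshold`.

    Both images are lists of rows of 8-bit gray levels. The result holds 1 for
    person pixels and 0 for background pixels.
    """
    return [
        [1 if abs(f - b) > threshold else 0 for f, b in zip(frame_row, back_row)]
        for frame_row, back_row in zip(frame, background)
    ]


background = [[50, 50, 50], [50, 50, 50]]
frame = [[52, 200, 49], [50, 190, 60]]
person = subtract_background(frame, background)  # [[0, 1, 0], [0, 1, 0]]
```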

[0030] In step S25, the input image is binarized, with the person image (full-body image) set to "1" and the background image set to "0". That is, a binary image forming the silhouette of the person is extracted. In step S27, the contour of the person is detected by applying morphological processing to this binary image, and in step S29 the feature points of the contour are determined. In step S31, the body posture is estimated based on the determined feature points, and in steps S33 to S37 the person's height, arm length, and torso width are calculated from the same feature points.
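A much simplified stand-in for steps S33 to S37 is sketched below. Instead of contour feature points it reads the measurements straight off the silhouette, in pixels: height as the vertical extent and torso width as the horizontal extent at the middle row. Both choices are assumptions made for illustration, and converting pixels to real lengths would need a camera calibration that is not described here.

```python
def body_measurements(mask):
    """Estimate height and torso width (in pixels) from a binary silhouette (1 = person)."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None  # no person in the image
    height = rows[-1] - rows[0] + 1
    middle = (rows[0] + rows[-1]) // 2
    cols = [x for x, value in enumerate(mask[middle]) if value]
    width = cols[-1] - cols[0] + 1 if cols else 0
    return {"height_px": height, "torso_width_px": width}


silhouette = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
measured = body_measurements(silhouette)
```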

[0031] When the physical information including the body posture data and the body type data has been obtained in this way, it is transmitted to the three-dimensional model selecting device 26 in step S39.

[0032] The voice processing device 22 operates as shown in FIGS. 4 to 6. First, in step S41, the person's speech is input from the microphone 20. Next, in step S43, the speech is divided into frames, that is, cut into segments of a fixed duration, and its frequency content is analyzed by FFT (Fast Fourier Transform). In step S45, the dynamic range of the speech is calculated based on the analysis result of step S43. Specifically, the speech waveform is arranged and its average value is taken as the strength of the sound, that is, the volume.
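Frame division and the volume estimate can be sketched as follows. Taking the mean absolute amplitude of each frame as the volume is an assumption about what averaging the waveform means here, and the FFT analysis of step S43 is omitted.

```python
def split_frames(samples, frame_len):
    """Cut the signal into consecutive frames of fixed length, dropping a short tail."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


def frame_level(frame):
    """Mean absolute amplitude of one frame, used here as the volume (assumed definition)."""
    return sum(abs(s) for s in frame) / len(frame)


samples = [0, 0, 0, 0, 4, -4, 4, -4, 1]
frames = split_frames(samples, 4)            # two full frames; the last sample is dropped
levels = [frame_level(f) for f in frames]    # [0.0, 4.0]
```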

[0033] In step S47, the calculated dynamic range is compared with a predetermined threshold to determine whether the person is speaking. If not, "no utterance" data is transmitted to the three-dimensional model deforming device 28 in step S51. If the person is speaking, the person's emotion is estimated from the speech in step S49, and the person's attribute is estimated in step S53 based on the speech and the height information taken in from the image processing circuit 18. Then, in step S51, the emotion information of the person is transmitted to the three-dimensional model deforming device 28 and the attribute information of the person to the three-dimensional model selecting device 26.
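The branching of steps S47 to S53 can be summarized in a short sketch. The message dictionaries and the two callables passed in as estimators are hypothetical; they stand in for the emotion and attribute estimation and for whatever format the communication line 24 carries.

```python
def route_frame(level, threshold, estimate_emotion, estimate_attribute):
    """Return (message for the model deforming device, message for the model selecting device)."""
    if level < threshold:
        # Not speaking: only a "no utterance" notice goes to the deforming device.
        return {"utterance": False}, None
    to_deformer = {"utterance": True, "emotion": estimate_emotion()}
    to_selector = {"attribute": estimate_attribute()}
    return to_deformer, to_selector


silent = route_frame(0.1, 1.0, lambda: "joy", lambda: "woman")
speaking = route_frame(2.0, 1.0, lambda: "joy", lambda: "woman")
```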

[0034] The emotion estimation in step S49 is performed as shown in FIG. 5. First, in step S61, linear prediction coefficients (LPC; Linear Prediction Coefficient) are calculated. Because linear prediction coefficients represent the properties of the speech waveform and spectrum efficiently, they are used as features for phoneme identification and speaker identification. Linear prediction is a method that minimizes the mean square of the prediction error expressed as a linear weighted sum. Because it can estimate a spectrum from a short span of data, it is widely used in speech signal processing. In step S63, a predicted waveform is generated based on the calculated linear prediction coefficients. The vocal tract resonance characteristics are found from this predicted waveform.
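One standard way to obtain linear prediction coefficients is the autocorrelation method solved with the Levinson-Durbin recursion, sketched below together with the predicted waveform of step S63. The specification does not say which solver is used, so this choice is an assumption.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a finite frame."""
    n = len(x)
    return [sum(x[i] * x[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def lpc(x, order):
    """Levinson-Durbin recursion.

    Returns (coeffs, error) with x[n] ~= sum(coeffs[k] * x[n - 1 - k]).
    Assumes a non-silent frame (r[0] > 0).
    """
    r = autocorr(x, order)
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)               # remaining prediction error energy
    return a[1:], err


def predict(x, coeffs):
    """Predicted samples for n >= len(coeffs), each from the preceding samples."""
    p = len(coeffs)
    return [sum(coeffs[k] * x[n - 1 - k] for k in range(p)) for n in range(p, len(x))]


# A decaying signal in which every sample is half the previous one.
signal = [0.5 ** n for n in range(50)]
coeffs, err = lpc(signal, 1)       # coeffs[0] is close to 0.5
predicted = predict(signal, coeffs)
```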

[0035] In the following step S65, the spectral envelope component and the fundamental frequency component are extracted separately from the predicted waveform of the speech signal by cepstrum calculation. The cepstrum method uses a parameter (the cepstrum) obtained by Fourier-transforming the speech signal and then inverse-Fourier-transforming the logarithm of the absolute value of the resulting amplitude spectrum. The variable of the cepstrum is called quefrency and has the dimension of time. In cepstrum analysis, setting a threshold on the quefrency allows the spectral envelope component and the fundamental frequency component to be extracted separately.
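The cepstrum calculation and the quefrency split can be sketched as below. A direct DFT is used instead of an FFT to keep the example dependency-free, and the small floor inside the logarithm and the `cutoff` index are assumptions.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(n^2), adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def real_cepstrum(x):
    """Inverse DFT of the log magnitude spectrum of x."""
    n = len(x)
    # The floor avoids log(0) for bins with no energy (assumed safeguard).
    log_mag = [math.log(max(abs(v), 1e-12)) for v in dft(x)]
    # The log magnitude of a real signal is real and symmetric, so the result is real.
    return [sum(log_mag[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def split_by_quefrency(cepstrum, cutoff):
    """Low quefrencies describe the spectral envelope, higher ones the pitch structure."""
    envelope_part = cepstrum[:cutoff]
    pitch_part = cepstrum[cutoff:len(cepstrum) // 2]
    return envelope_part, pitch_part


# An impulse of height e has a flat spectrum with log magnitude 1 in every bin.
impulse = [math.e] + [0.0] * 7
cep = real_cepstrum(impulse)
envelope_part, pitch_part = split_by_quefrency(cep, 2)
```

For voiced speech, the index of the largest value in `pitch_part` (offset by `cutoff`) gives the pitch period in samples.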

[0036] In step S67, a residual waveform is calculated from the original waveform of the speech signal input from the microphone 20 and the predicted waveform generated in step S63. This residual waveform is considered to represent the sound source component that remains after the frequency components expressing the phonemes are removed from the original waveform. In step S69, the pitch frequency is calculated from this residual waveform. That is, the correlation coefficients are obtained from the residual waveform, their local maximum is found, and the reciprocal of the lag that gives the maximum is output as the pitch frequency of the frame in question.
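A minimal sketch of the autocorrelation pitch estimate follows. It takes the largest autocorrelation value inside a plausible lag range rather than searching for a local maximum, and the search limits `f_min` and `f_max` are assumptions. In the flow above the input would be the residual waveform; a plain sine wave is used here so the result can be checked.

```python
import math


def pitch_from_autocorr(signal, sample_rate, f_min=50.0, f_max=500.0):
    """Return the pitch in Hz from the lag with the largest autocorrelation, or None."""
    n = len(signal)
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), n - 1)
    best_lag, best_value = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        value = sum(signal[i] * signal[i + lag] for i in range(n - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 100 * t / SAMPLE_RATE) for t in range(800)]
pitch = pitch_from_autocorr(tone, SAMPLE_RATE)  # period of 80 samples, so 100 Hz
```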

[0037] In step S71, the average power is calculated for each frame of the divided speech signal. In step S73, the length of time taken to utter one phoneme, that is, the speech rate, is measured. Specifically, the duration of the part judged to be a single phoneme (a segment) is measured. The shorter this duration, the faster the speech rate is judged to be.
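Both measurements are simple to sketch. Average power is the mean of the squared samples of a frame. The specification measures the duration of a phoneme segment; expressing that as phonemes per second, as done below, is an assumption, and the segmentation itself is taken as given.

```python
def average_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def speech_rate(segment_durations_s):
    """Phonemes per second from per-phoneme durations; shorter segments mean a higher rate."""
    return len(segment_durations_s) / sum(segment_durations_s)


power = average_power([1, -1, 2, -2])    # 2.5
rate = speech_rate([0.1, 0.1, 0.2])      # about 7.5 phonemes per second
```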

[0038] In step S71, the subject's emotion is estimated by a neural network using the dynamic range calculated in step S45, the spectral envelope component obtained in step S65, the pitch frequency obtained in step S69, the power obtained in step S71, and the speech rate measured in step S73. The dynamic range, spectral envelope component, pitch frequency, power, and speech rate are defined as the voice feature information.
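The specification does not describe the network's structure, so the following is only a sketch of how five voice features could be mapped to an emotion label by a small feedforward network. The layer sizes, the activation functions, the feature scaling, and above all the weights are untrained placeholders invented for the example.

```python
import math

EMOTIONS = ["anger", "sadness", "joy", "surprise", "fear"]


def dense(inputs, weights, biases):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def estimate_emotion(features, w1, b1, w2, b2):
    """Forward pass: 5 features -> tanh hidden layer -> softmax over the emotion labels."""
    hidden = [math.tanh(v) for v in dense(features, w1, b1)]
    probs = softmax(dense(hidden, w2, b2))
    return EMOTIONS[probs.index(max(probs))], probs


# Placeholder parameters (NOT trained): 3 hidden units, 5 outputs.
W1 = [[0.1] * 5 for _ in range(3)]
B1 = [0.0] * 3
W2 = [[0.1 * i] * 3 for i in range(5)]
B2 = [0.0] * 5

# Features in the order: dynamic range, spectral envelope, pitch, power, speech rate,
# assumed to be already scaled to comparable ranges.
label, probs = estimate_emotion([1.0, 1.0, 1.0, 1.0, 1.0], W1, B1, W2, B2)
```

With real use the weights would come from training on labeled speech, and the spectral envelope would normally contribute several inputs rather than one.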

[0039] Step S53 specifically performs the processing shown in FIG. 6. First, in step S81, the pitch frequency is compared with a predetermined threshold. If the pitch frequency is lower than the threshold, the subject is estimated to be an adult man in step S85. If the pitch frequency is equal to or higher than the threshold, the process proceeds from step S81 to step S83, where the height data included in the physical information taken in from the image processing device 18 is compared with a predetermined threshold. If the subject's height is lower than the threshold, the subject is estimated to be a child in step S89. If the subject's height is equal to or higher than the threshold, the subject is estimated to be an adult woman in step S87.
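The decision of steps S81 to S89 translates directly into code. The two threshold values below are hypothetical; the specification only calls them predetermined.

```python
def estimate_attribute(pitch_hz, height_cm,
                       pitch_threshold_hz=165.0, height_threshold_cm=140.0):
    """Classify the subject as "man", "woman", or "child" (threshold values are assumed)."""
    if pitch_hz < pitch_threshold_hz:
        return "man"        # step S85: low pitch
    if height_cm < height_threshold_cm:
        return "child"      # step S89: high pitch and short
    return "woman"          # step S87: high pitch and tall


examples = [
    estimate_attribute(120.0, 175.0),
    estimate_attribute(250.0, 120.0),
    estimate_attribute(220.0, 160.0),
]
```

Note that the height is consulted only when the pitch is at or above its threshold, so a short adult man is still classified as a man.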

[0040] The three-dimensional model selecting device 26 and the three-dimensional model deforming device 28 operate as shown in FIG. 7. First, in step S91, the attribute information and the physical information are supplied to the three-dimensional model selecting device 26, and the emotion information and the face information are supplied to the three-dimensional model deforming device 28. In step S93, the three-dimensional model selecting device 26 reads from the memory 26a the character data of the three-dimensional model (avatar) that matches the attribute information and the physical information (body type data). The avatar character data read out is supplied to the three-dimensional model deforming device 28.

[0041] In step S95, the three-dimensional model deforming device 28 identifies the subject's emotion based on the emotion information and the face information (facial expression data) and applies rendering processing to the avatar. The avatar's facial expression changes as a result. Specifically, when the identified emotion is "anger", the avatar's face is reddened and kumadori-style shading lines are drawn on it. When the emotion is "sadness", tears flow from the avatar's eyes. For "joy", the avatar's eyes are drawn in the shape of the character "ヘ", and for "surprise", an exclamation mark appears above the avatar's head. For "fear", the avatar's face is given a bluish tint and its body is made thinner. When the subject is not speaking, no emotion estimation is performed, so the rendering processing is then based only on the face information (facial expression data).
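The emotion-to-rendering correspondence can be written as a lookup table. The effect names are hypothetical labels for the renderer, and preferring the voice-derived emotion whenever one is available is an assumption; the specification does not say how the two sources are combined when both are present.

```python
RENDER_EFFECTS = {
    "anger": ["redden_face", "draw_kumadori_lines"],
    "sadness": ["tears_from_eyes"],
    "joy": ["arched_eyes"],
    "surprise": ["exclamation_mark_above_head"],
    "fear": ["bluish_face", "thin_body"],
}


def effects_for(voice_emotion, face_expression):
    """Choose rendering effects; fall back to the facial expression when there is no utterance."""
    emotion = voice_emotion if voice_emotion is not None else face_expression
    return RENDER_EFFECTS.get(emotion, [])


no_speech = effects_for(None, "joy")     # face information only
with_speech = effects_for("fear", "joy")
```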

[0042] Also in step S95, the person's posture is estimated based on the face information (face posture data) and the physical information (body posture data), and the avatar's posture is changed according to the estimation result.

[0043] In step S97, the avatar character data that has undergone the rendering processing and the posture change processing is output to the image display device 30. As a result, an avatar resembling the subject is reproduced three-dimensionally in the virtual environment, and the avatar's facial expression and posture change in real time in response to the subject's image and voice.

[0044] According to this embodiment, the subject's characteristics are detected from the voice feature information obtained by voice processing or from the physical information obtained by image processing, and a three-dimensional model with matching characteristics is selected from the plurality of three-dimensional models. The three-dimensional model reproduced in the virtual environment, that is, the avatar, can therefore be made to resemble the subject. In addition, the subject's emotion is identified from the voice feature information and from the face information obtained by image processing, and the avatar is deformed accordingly, so the subject's emotion can be reproduced in the virtual environment.

[0045] In this embodiment only one color camera is used to capture the full-body image of the subject. To deform the avatar in response to the movement of the subject's whole body, a plurality of color cameras may be provided to photograph the subject's whole body from different directions. In that case, each captured color image is processed as follows. The color image is first converted into a binary image, and feature points are detected from the binary image. The three-dimensional model deforming device deforms the shape of the avatar based on the feature points detected in this way.

[0046] When the subject's torso and hand overlap as seen from a camera, so that a feature point can be neither detected nor predicted from the binary image, a two-dimensional skin-color region is extracted from the color image, and the extracted skin-color region is restored as a three-dimensional volume. The portion with the largest restored volume is then taken to be the hand region, and its center of gravity is used as the feature point.
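
The final step of this fallback, choosing the largest restored portion and taking its center of gravity, can be sketched as below. The sketch assumes the volume restoration has already produced a set of occupied voxel coordinates; that representation, the 6-neighbor connectivity rule, and the function name `largest_component_centroid` are assumptions for illustration, and the restoration itself is not shown.

```python
from collections import deque

# 6-connected neighborhood: voxels sharing a face belong to one portion.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def largest_component_centroid(voxels):
    """Given occupied (x, y, z) integer voxels, return the centroid of the
    connected component with the most voxels, or None if there are none."""
    remaining = set(voxels)
    best = []
    while remaining:
        seed = remaining.pop()
        component = [seed]
        queue = deque([seed])
        # Breadth-first flood fill over face-adjacent voxels.
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBORS:
                neighbor = (x + dx, y + dy, z + dz)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.append(neighbor)
                    queue.append(neighbor)
        if len(component) > len(best):
            best = component
    if not best:
        return None
    count = len(best)
    return tuple(sum(v[i] for v in best) / count for i in range(3))


# Example: a 2x2x2 block (8 voxels) plus one isolated noise voxel.
block = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
centroid = largest_component_centroid(block + [(10, 10, 10)])
print(centroid)
```

Voxel count is used here as the measure of volume, so small isolated fragments of skin color do not displace the feature point.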

[0047] Furthermore, if each color camera is mounted on a motorized pan head and the pan head is controlled based on the image of the subject, the color cameras can be made to follow the subject's movement.
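
One simple way to drive a pan head from the subject's image is proportional control on the subject's offset from the image center. The sketch below is an assumption, not the control method of this embodiment: the function `pan_tilt_step`, the field-of-view values, and the gain are all hypothetical, and the subject position is assumed to be available as a pixel centroid.

```python
def pan_tilt_step(centroid, image_size, fov_deg=(60.0, 45.0), gain=0.5):
    """Return (pan, tilt) corrections in degrees that move the camera
    toward the subject.

    centroid   -- (x, y) pixel position of the subject in the image
    image_size -- (width, height) of the image in pixels
    fov_deg    -- assumed horizontal and vertical field of view
    gain       -- fraction of the angular error corrected per step
    """
    cx, cy = centroid
    width, height = image_size
    # Offset from the image center as a fraction of the image size.
    off_x = cx / width - 0.5
    off_y = cy / height - 0.5
    pan = gain * off_x * fov_deg[0]      # positive: turn right
    tilt = -gain * off_y * fov_deg[1]    # positive: tilt up (image y grows downward)
    return pan, tilt


# Subject is right of center and above center in a 640x480 image.
pan, tilt = pan_tilt_step((480, 120), (640, 480))
print(pan, tilt)
```

A gain below 1 corrects only part of the error on each frame, which damps overshoot when the subject position estimate is noisy.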

[Brief description of the drawings]

FIG. 1 is a block diagram showing a virtual transformation device according to an embodiment of the present invention.

FIG. 2 is a flowchart for explaining the operation of the voice processing device.

FIG. 3 is a flowchart showing a method of estimating emotion.

FIG. 4 is a flowchart showing a method of detecting the character of a person.

FIG. 5 is a flowchart for explaining the operation of the image processing device.

FIG. 6 is a flowchart for explaining the selection and deformation of a three-dimensional model.

FIG. 7 is an illustrative view showing a neural network.

[Explanation of symbols]

10 ... Virtual transformation device
12 ... Camera
14 ... Image processing device
16 ... Microphone
18 ... Voice processing device
20 ... Three-dimensional model selection device
22 ... Three-dimensional model deformation device
24 ... Image display device

Continued from the front page

(72) Inventor: Atsushi Otani, c/o ATR Media Integration and Communication Research Laboratories, 5 Sanpeidani, Inuidani, Seika-cho, Soraku-gun, Kyoto

F-term (reference):
5B050 BA07 BA08 BA09 CA07 EA19 FA02
5D015 AA01 KK01
5L096 AA02 AA03 AA07 CA02 FA02 FA67 GA38
9A001 DD12 GG07 HH16 HH29 KK60

Claims

1. A virtual transformation device comprising: voice input means for inputting a voice of a person; voice feature information creating means for creating voice feature information based on the voice; three-dimensional model storage means storing a plurality of three-dimensional models in advance; model selecting means for selecting any one of the plurality of three-dimensional models based on the voice feature information; and reproducing means for reproducing, in a virtual environment, the three-dimensional model selected by the model selecting means.

2. The virtual transformation device according to claim 1, further comprising: whole-body image input means for inputting a whole-body image of the person; physical information creating means for creating physical information of the person based on the whole-body image; and attribute information creating means for creating attribute information of the person, wherein the model selecting means selects a three-dimensional model based on the physical information and the attribute information.

3. The virtual transformation device according to claim 2, wherein the attribute information indicates whether the person is a man, a woman, or a child; the plurality of three-dimensional models include a plurality of men having different body types, a plurality of women having different body types, and a plurality of children having different body types; and the model selecting means selects, based on the physical information and the attribute information, a three-dimensional model whose body type and attribute match those of the person.

4. The virtual transformation device according to claim 1, further comprising: emotion information creating means for creating emotion information of the person based on the voice feature information; and model deformation means for deforming, based on the emotion information, an expression of the three-dimensional model selected by the model selecting means.

5. The virtual transformation device according to claim 4, wherein said voice feature information includes at least one of a pitch frequency, a dynamic range, a spectral envelope component, power, and a speech rate.

6. The virtual transformation device according to claim 5, further comprising: face image input means for inputting a face image of the person; and face information creating means for creating face information based on the face image, wherein the model deformation means deforms the expression based on at least one of the emotion information and the face information.

7. The virtual transformation apparatus according to claim 4, wherein said model transformation means changes said expression by a rendering process.

8. The virtual transformation device as described, wherein the rendering process includes enlarging or reducing the three-dimensional model, changing a color of the three-dimensional model, and partially deforming the three-dimensional model.