JP2005151231A

JP2005151231A - Video communication method, video communication apparatus, video creation program used for apparatus, and recording medium with program recorded thereon

Info

Publication number: JP2005151231A
Application number: JP2003386820A
Authority: JP
Inventors: Tadashi Mori; 忠毛利; Yoshito Nanjo; 義人南條; Hitomi Sato; 仁美佐藤; Yoshimasa Yanagihara; 義正柳原; Tamotsu Machino; 保町野; Hiroaki Kawada; 博昭河田; Joji Nakayama; 丈二中山; Takayoshi Mochizuki; 崇由望月
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-11-17
Filing date: 2003-11-17
Publication date: 2005-06-09

Abstract

PROBLEM TO BE SOLVED: To reliably protect the privacy of a speaker during communication by always transmitting video signals of an optimal expression, even for an unspecified speaker and regardless of the degree of fatigue, without needing to prepare image data of the face at ordinary times in advance.

SOLUTION: A video adjustment circuit 4 morphs, according to a preset expression adjustment ratio, the signals for the areas of specific parts that indicate the speaker's expression (eyebrows, eyes, mouth and the like) between the video signal of the speaker imaged during a conversation and the video signal of a serious look stored when the communication starts. The video signal whose expression has been adjusted by the morphing is transmitted to the apparatus of the communicating party. During a training period performed at the start of communication, the image of the speaker's serious look is fetched into the video adjustment circuit 4 and stored in a serious look information storage section 422.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

The present invention relates to a video communication method and a video communication apparatus used for, for example, television telephone communication and video conference communication, to a video creation program used in the apparatus, and to a recording medium on which the program is recorded.

In recent years, television telephone communication systems and video conference communication systems have come into frequent use for communicating with remote locations. Because such systems let the speakers see each other's facial expressions while they talk, they convey far more information than voice-only communication and enable more effective calls and conferences.

However, this type of communication system is generally configured to transmit and receive the speaker's face image exactly as captured by the camera. As a result, the speaker has to tidy his or her appearance before communicating, and privacy-related information such as the speaker's emotions, degree of fatigue, and tension becomes known to the other party from the transmitted face image.

To address this, a technique has been proposed in which image data representing the speaker's normal expression is stored in advance. Image processing is applied to the image data of the speaker captured during communication to extract the speaker's degree of fatigue, and, depending on the extracted degree of fatigue, the captured image data of the expression is replaced with the previously stored image data representing the normal expression (see, for example, Patent Document 1).

Patent Document 1: JP-A-8-44861

However, the conventionally proposed technique requires image data representing the speaker's normal expression to be registered in advance. Consequently, if the speaker is replaced by an unregistered speaker immediately before or during a call, or if an unregistered speaker joins unexpectedly, the image data replacement processing is not performed for these speakers, and face image data showing their expression during the call is transmitted as it is. Moreover, even for a registered speaker, if the degree of fatigue is difficult to extract, the replacement processing is not performed and face image data showing the expression during the call is transmitted as it is.

The present invention has been made in view of the above circumstances. Its object is to provide a video communication method and a video communication apparatus that can always transmit a video signal with an optimal facial expression, even for an unspecified speaker and regardless of the degree of fatigue, without normal-time face image data having to be prepared in advance, and that can thereby reliably protect the privacy of the speaker during communication, together with a video creation program used in the apparatus and a recording medium on which the program is recorded.

To achieve the above object, the present invention stores a video signal of a part including the subject's face, imaged during a non-conversation period of the communication, as a first video signal representing the subject's true face. Meanwhile, from a second video signal of the part including the subject's face imaged during a conversation period of the communication, a first partial video signal of specific parts representing the subject's facial expression is extracted. The extracted first partial video signal and a second partial video signal, which corresponds to the same specific parts in the stored first video signal, are combined according to information representing a preset facial expression adjustment ratio to generate a third partial video signal. The generated third partial video signal is then combined with the partial video signal of the second video signal other than the specific parts to generate a third video signal, and this third video signal is transmitted.

Therefore, according to the present invention, the video signal imaged during a non-conversation period of the communication is stored as a true face image representing the subject's normal expression, and the signal of the parts representing the expression in the video signal imaged during conversation is combined with the stored true face image according to a preset facial expression adjustment ratio. The speaker therefore no longer needs to register image data representing his or her normal expression in advance. Even when the speaker is replaced by an unregistered speaker immediately before or during a call, or when an unregistered speaker joins unexpectedly, the privacy of these unregistered speakers can also be reliably protected. That is, the privacy of an unspecified number of speakers can be protected.
In addition, the expression adjustment processing is performed according to the preset facial expression adjustment ratio regardless of who the speaker is. The expression of every speaker can therefore be adjusted to an optimal expression regardless of how much fatigue the expression shows.

The present invention is also characterized by having the following functions.
In the first function, when the information representing the facial expression adjustment ratio is set, information representing the subject's facial expression adjustment ratio is set in association with each of a plurality of assumed communication partners. When the third partial video signal is generated, the information representing the facial expression adjustment ratio that corresponds to the current communication partner is selected from among the plurality of set values, based on the identification information of the communication partner used for the communication, and the first partial video signal and the second partial video signal are combined according to the selected information.

In this way, a facial expression adjustment ratio is selected according to the communication partner, and the expression during conversation is combined with the true face expression stored during non-conversation according to the selected facial expression adjustment ratio. For example, when the communication partner is a family member or a close colleague, setting the combination ratio of the conversation expression high and that of the true face to zero or low allows the speaker's expression to be conveyed to the partner as faithfully as possible. Conversely, when the communication partner is a business negotiation partner or someone being met for the first time, setting the combination ratio of the conversation expression low and that of the true face high keeps the speaker's expression from being conveyed to the partner as far as possible. That is, a video signal with the expression best suited to the communication partner can be transmitted.

In the second function, when the first video signal is stored, a video signal obtained by imaging the part including the subject's face during a training period for video communication is stored as the first video signal representing the true face.
In this way, a video signal representing the subject's true face is stored during the preparation period between the start of communication and the actual start of conversation. The true face video signal can therefore be reliably prepared before the conversation begins.

In the third function, when the first video signal is stored, the first video signal representing the true face is generated from a plurality of video signals obtained during non-conversation periods and is then stored.
In this way, the true face image is successively learned and corrected at each non-conversation period. The true face image can therefore be brought as close as possible to the speaker's normal appearance.
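The text does not state how the video signals from several non-conversation periods are combined. One simple possibility, shown here purely as an illustration, is an exponential moving average that nudges the stored true face toward each newly captured non-conversation frame. The function name `update_true_face`, the flat list-of-values frame representation, and the `weight` parameter are assumptions introduced for this sketch and do not come from the patent.

```python
from typing import List, Optional


def update_true_face(
    stored: Optional[List[float]],
    new_frame: List[float],
    weight: float = 0.2,  # hypothetical learning rate; not specified in the source
) -> List[float]:
    """Blend a newly captured non-conversation frame into the stored true face.

    Frames are flat lists of numeric values (for example, pixel intensities
    or landmark coordinates). This representation is an assumption.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if stored is None:
        # First non-conversation frame: adopt it as the initial true face.
        return list(new_frame)
    if len(stored) != len(new_frame):
        raise ValueError("frames must have the same length")
    # Exponential moving average: old estimate corrected toward the new frame.
    return [(1.0 - weight) * s + weight * n for s, n in zip(stored, new_frame)]
```

A small `weight` makes the stored true face change slowly, so a single atypical frame has little effect; a larger value tracks recent frames more closely.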

In short, in the present invention, a video signal of a part including the subject's face imaged during a non-conversation period of the communication is stored as a first video signal representing the subject's true face. From a second video signal of the part including the subject's face imaged during a conversation period of the communication, a first partial video signal of specific parts representing the subject's expression is extracted. The extracted first partial video signal and a second partial video signal corresponding to the same specific parts in the stored first video signal are combined according to information representing a preset facial expression adjustment ratio, and the resulting third partial video signal is combined with the background image and transmitted.
Therefore, the present invention can provide a video communication method and a video communication apparatus that can always transmit a video signal with an optimal facial expression, even for an unspecified speaker and regardless of the degree of fatigue, without normal-time face image data having to be prepared in advance, and that can thereby reliably protect the privacy of the speaker during communication, together with a video creation program used in the apparatus and a recording medium on which the program is recorded.

(First embodiment)
FIG. 1 is a block diagram showing the main part of a first embodiment of a video communication apparatus according to the present invention. In the figure, an analog video signal AS of a subject imaged by a camera 1 is converted into a digital video signal BS by an analog/digital conversion circuit (A/D) 2 and then input to a format conversion circuit 3. The format conversion circuit 3 converts the input digital video signal BS into a Common Intermediate Format (CIF) signal CS, as used in the video encoding circuits of television telephone devices and video conference devices, and the converted transmission CIF signal CS is input to a video adjustment circuit 4.

The video adjustment circuit 4 processes the input transmission CIF signal CS to adjust the subject's facial expression. The adjusted transmission CIF signal DS is video-encoded by a video encoding circuit 5 into a transmission video encoded signal ES and then input to a multiplexing/demultiplexing circuit 6. The multiplexing/demultiplexing circuit 6 multiplexes the input transmission video encoded signal ES with a transmission audio encoded signal FS encoded by an audio encoding circuit (not shown) to generate a transmission multiplexed signal GS in a predetermined transmission format, and transmits the generated transmission multiplexed signal GS to a transmission path (not shown).

Meanwhile, a multiplexed signal GS sent from the communication partner's device via the transmission path is separated by the multiplexing/demultiplexing circuit 6 into a received video encoded signal HS and a received audio encoded signal FS. The received video encoded signal HS is input to the video encoding circuit 5, and the received audio encoded signal FS is input to the audio encoding circuit (not shown). The video encoding circuit 5 decodes the received video encoded signal HS into a received CIF signal IS and inputs this received CIF signal IS to a format inverse conversion circuit 7.

The format inverse conversion circuit 7 converts the received CIF signal IS back into a digital video signal JS. The digital video signal JS is converted into an analog video signal by a digital/analog conversion circuit (D/A) 8 and then supplied to a monitor 9, where it is displayed as the received image. The received audio encoded signal FS is decoded by the audio encoding circuit, converted into an analog audio signal, and then output from a loudspeaker (not shown).

The video adjustment circuit 4 is configured as follows. FIG. 2 is a block diagram showing its configuration. The video adjustment circuit 4 includes a facial expression extraction unit 41, a facial expression synthesis unit 42, an adjustment ratio storage unit 43, and a video synthesis unit 44.
Of these, the adjustment ratio storage unit 43 stores information NS designating a facial expression adjustment ratio in association with identification information of a communication partner, for example a caller ID or recipient ID such as a subscriber telephone number or an IP address.

The facial expression extraction unit 41 extracts, from the transmission CIF signal CS, the partial video signal of the specific parts that represent the facial expression of the speaker, who is the subject. As shown in FIG. 3, it includes a face area extraction unit 411 and a specific part cutout unit 412.
Of these, the face area extraction unit 411 extracts from the transmission CIF signal CS a face area video signal PS corresponding to the face area and a background area video signal MS corresponding to the background area. One method of cutting out the face area is skin color extraction. In this method, a video signal defined by RGB values is expressed in the HSV color system, and skin-colored regions are extracted based on the hue (H) component. Because the hue (H) component and the value (V) component are independent in the HSV color system, skin-colored regions can be separated and extracted without being affected by the brightness of the image.
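The hue-based skin color test described above can be sketched with Python's standard `colorsys` module. The patent gives no numeric thresholds, so the hue band `SKIN_HUE_RANGE` and the saturation floor `MIN_SATURATION` below are hypothetical values chosen only for illustration, as are the function names.

```python
import colorsys

# Hypothetical thresholds; the source does not specify any numbers.
SKIN_HUE_RANGE = (0.0, 50.0)  # degrees on the hue circle
MIN_SATURATION = 0.15         # rejects near-grey pixels, whose hue is unstable


def is_skin(rgb, hue_range=SKIN_HUE_RANGE, min_saturation=MIN_SATURATION):
    """Return True if an (R, G, B) pixel with 0-255 channels lies in the hue band."""
    r, g, b = (channel / 255.0 for channel in rgb)
    h, s, _v = colorsys.rgb_to_hsv(r, g, b)  # V (brightness) is deliberately ignored
    hue_degrees = h * 360.0
    return s >= min_saturation and hue_range[0] <= hue_degrees <= hue_range[1]


def skin_mask(image):
    """Map a 2-D list of RGB pixels to a 2-D list of booleans."""
    return [[is_skin(pixel) for pixel in row] for row in image]
```

Because the decision uses only hue and saturation, a pixel and a darker copy of the same color are classified identically, which is the brightness independence the text attributes to the HSV representation.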

Because the transmission CIF signal CS may contain skin-colored regions other than the face, the face area is further selected from the extracted skin-colored regions based on region size and height-to-width aspect ratio. In general, in the images used for video communication the subject's face is the largest region in the image, and its aspect ratio lies between 0.5 and 2. The face area can therefore be extracted more accurately by identifying the skin-colored region that satisfies these conditions. The aspect ratio range can be set freely according to the camera specifications, the imaging conditions of the subject, and so on.
The face area extraction unit 411 inputs the video signal PS corresponding to the face area extracted in this way to the specific part cutout unit 412, and inputs the partial video signal of the area other than the extracted face area to the video synthesis unit 44 as the video signal MS corresponding to the background area.
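The size and aspect-ratio test used to pick the face from several skin-colored regions can be sketched as follows. The `Region` record and the function name `pick_face_region` are hypothetical; the text only states that the face is the largest region and has an aspect ratio between 0.5 and 2, with the range adjustable to the camera and imaging conditions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Region:
    """A connected skin-colored region (hypothetical representation)."""
    width: int   # bounding-box width in pixels
    height: int  # bounding-box height in pixels
    area: int    # number of skin-colored pixels in the region


def pick_face_region(
    regions: Sequence[Region],
    min_aspect: float = 0.5,
    max_aspect: float = 2.0,
) -> Optional[Region]:
    """Return the largest region whose height/width ratio is within range."""
    candidates = [
        r for r in regions
        if r.width > 0 and min_aspect <= r.height / r.width <= max_aspect
    ]
    # default=None covers the case where no region passes the aspect test.
    return max(candidates, key=lambda r: r.area, default=None)
```

For example, a long horizontal forearm region is rejected by the aspect test even if it covers more pixels than the face, and a hand is rejected because it is smaller.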

The specific part cutout unit 412 extracts, from the input partial video signal PS of the face area, the regions of the specific parts that represent the speaker's expression, for example the eyebrows, eyes, and mouth. These regions are extracted based on the characteristic features of the eyebrows, eyes, and mouth. For example, the eyebrows and eyes are regions darker than the skin-colored area, and the mouth is a region that is redder than the skin-colored area and is either elongated horizontally or close to an ellipse in shape. The region of each part can also be extracted by using the positional relationships among the eyebrows, eyes, and mouth, such as the vertical relationship between the eyebrows and the eyes, the left-right symmetry of the eyebrows and eyes, and the vertical relationship between the eyes and the mouth.

The specific part cutout unit 412 also identifies which part each extracted region corresponds to. A relaxation method, for example, is used for this identification. When it is ambiguous which part an extracted region corresponds to, the relaxation method examines the state of the neighboring parts to detect which part assignments for each extracted region would produce a contradiction, and reduces the ambiguity based on the detected values. This processing is repeated in parallel over the entire set of parts so that local contradictions are gradually eliminated, until the entire set of specific parts is labeled without ambiguity. The partial video signal LS of the specific parts extracted and identified in this way is input to the facial expression synthesis unit 42.

The facial expression synthesis unit 42 performs morphing between the partial video signal LS of the specific parts, which represents the expression of the speaker during conversation and was extracted by the specific part cutout unit 412, and the previously stored true face video signal QS of that speaker, according to the information NS designating the facial expression adjustment ratio stored in the adjustment ratio storage unit 43. The partial video signal OS of the face area with the adjusted expression, created by this morphing, is input to the video synthesis unit 44.

FIG. 4 is a block diagram showing the configuration of the facial expression synthesis unit 42. The facial expression synthesis unit 42 includes a normalization unit 421, a true face information storage unit 422, and a morphing unit 423. The true face information storage unit 422 stores the true face video signal QS. A face image of the speaker in a normal state is used as the true face video signal QS, for example the video signal of the subject captured before the start of conversation during the training period executed at the beginning of the communication.

As preprocessing for the morphing, the normalization unit 421 normalizes the partial video signal LS of the specific parts extracted by the specific part cutout unit 412. This normalization matches the position and size of each part in the video signal LS of the specific parts to the position and size of the corresponding part in the true face video signal QS.
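The source does not detail how position and size are matched. One possible reading, sketched here under that assumption, applies a single uniform scale and translation to the whole set of feature points so that their centroid and spread match those of the true face, which aligns the two faces without flattening expression-dependent shape differences. The point-list representation and the function name `normalize_points` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _centroid(points: List[Point]) -> Point:
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def _spread(points: List[Point], center: Point) -> float:
    """Root-mean-square distance of the points from their centroid."""
    cx, cy = center
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / len(points)
    return mean_sq ** 0.5


def normalize_points(points: List[Point], reference: List[Point]) -> List[Point]:
    """Scale and translate `points` so centroid and spread match `reference`."""
    if not points or not reference:
        raise ValueError("both point lists must be non-empty")
    c_src, c_ref = _centroid(points), _centroid(reference)
    s_src, s_ref = _spread(points, c_src), _spread(reference, c_ref)
    # Degenerate input (all points identical) keeps its scale unchanged.
    scale = s_ref / s_src if s_src > 0 else 1.0
    return [
        ((x - c_src[0]) * scale + c_ref[0], (y - c_src[1]) * scale + c_ref[1])
        for x, y in points
    ]
```

This handles a speaker who has moved closer to or farther from the camera since the true face was captured; rotation of the head is not handled by this sketch.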

The morphing unit 423 linearly interpolates between the shape coordinates of the specific parts in the partial video signal RS normalized by the normalization unit 421 and the shape coordinates of the same parts in the true face video signal QS, according to the facial expression adjustment ratio NS read from the adjustment ratio storage unit 43, and thereby calculates a new shape for each specific part. In the same way, it linearly interpolates between the shape coordinates of the skin-colored areas other than the specific parts, for example the cheeks, forehead, and nose, in the normalized partial video signal RS and in the true face video signal QS, according to the facial expression adjustment ratio NS, and thereby calculates a new shape for the skin-colored areas other than the specific parts. Through this processing, a partial video signal OS of the face area is created in which the opening of the mouth, the opening of the eyes, and the movement of the eyebrows are adjusted according to the adjustment ratio.
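The linear interpolation of shape coordinates can be written compactly. The sketch below assumes that a ratio of 0 reproduces the true face and a ratio of 1 reproduces the conversation expression, which matches the business-partner and family examples given later in the description; the restriction of the ratio to the interval [0, 1], the point-list representation, and the function name `morph_shape` are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def morph_shape(
    true_face_points: List[Point],
    conversation_points: List[Point],
    ratio: float,
) -> List[Point]:
    """Linearly interpolate corresponding shape coordinates.

    ratio = 0.0 returns the true face shape, ratio = 1.0 returns the
    (normalized) conversation shape, and values in between blend the two.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")  # assumed valid range
    if len(true_face_points) != len(conversation_points):
        raise ValueError("point lists must correspond one-to-one")
    return [
        ((1.0 - ratio) * tx + ratio * cx, (1.0 - ratio) * ty + ratio * cy)
        for (tx, ty), (cx, cy) in zip(true_face_points, conversation_points)
    ]
```

For instance, if the lower lip sits at y = 100 in the true face and at y = 120 in a wide smile, a ratio of 0.25 places it at y = 105, so the transmitted mouth opens only a quarter as far.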

The video synthesis unit 44 aligns the partial video signal OS of the face area created by the morphing unit 423 with the video signal MS of the background area extracted by the facial expression extraction unit 41 and combines them, thereby creating a video signal DS with an adjusted expression that consists of the background area and the face area.
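The alignment and combination step can be illustrated as pasting the adjusted face patch onto the background at a known offset. The representation below (2-D lists of pixel values, with `None` marking patch pixels that lie outside the face area) and the function name `composite` are assumptions for this sketch; the source does not describe the data layout.

```python
from typing import List, Optional, Sequence

Pixel = int  # placeholder pixel type; could equally be an (R, G, B) tuple


def composite(
    background: Sequence[Sequence[Pixel]],
    face_patch: Sequence[Sequence[Optional[Pixel]]],
    top: int,
    left: int,
) -> List[List[Pixel]]:
    """Overlay `face_patch` on a copy of `background` at row `top`, column `left`.

    Patch entries equal to None are treated as transparent, so the
    background shows through outside the face area.
    """
    output = [list(row) for row in background]  # leave the input untouched
    for dy, patch_row in enumerate(face_patch):
        for dx, pixel in enumerate(patch_row):
            if pixel is None:
                continue
            y, x = top + dy, left + dx
            if 0 <= y < len(output) and 0 <= x < len(output[y]):
                output[y][x] = pixel
    return output
```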

Next, the operation of the video communication apparatus configured as described above will be explained, focusing on the operation of the video adjustment circuit 4.
Before communication, the speaker or the administrator of the apparatus registers the information NS designating the facial expression adjustment ratio in the adjustment ratio storage unit 43. At this time, a value of the facial expression adjustment ratio is determined for each individual partner assumed as a communication destination, and the determined value is registered in association with that partner's subscriber telephone number or IP address.
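The registration and lookup of per-partner ratios can be pictured as a small keyed table. The class name `AdjustmentRatioStore`, its methods, and in particular the fallback ratio for unregistered partners are assumptions made for this sketch; the source does not say what happens when a partner has no registered value.

```python
from typing import Dict


class AdjustmentRatioStore:
    """Maps a partner's telephone number or IP address to an adjustment ratio."""

    def __init__(self, default_ratio: float = 0.0) -> None:
        # Assumed fallback: unknown partners see the true face (ratio 0),
        # the more privacy-preserving choice. Not specified in the source.
        self._default_ratio = self._validated(default_ratio)
        self._ratios: Dict[str, float] = {}

    @staticmethod
    def _validated(ratio: float) -> float:
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("ratio must lie in [0, 1]")
        return ratio

    def register(self, partner_id: str, ratio: float) -> None:
        """Register or overwrite the ratio for one communication partner."""
        self._ratios[partner_id] = self._validated(ratio)

    def lookup(self, partner_id: str) -> float:
        """Return the partner's ratio, or the default if none is registered."""
        return self._ratios.get(partner_id, self._default_ratio)
```

A speaker might register a value near 1 for a family member's number and a value near 0 for a business contact's address, then call `lookup` with the caller or recipient ID when the call is set up.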

Suppose that, in this state, television telephone communication is started with a desired communication partner. The video adjustment circuit 4 first takes in from the format conversion circuit 3 the video signal of the speaker captured before the start of conversation, during the training period for establishing video communication. The captured video signal (CIF signal CS) is stored in the true face information storage unit 422 as the video signal representing the speaker's true face. In other words, the speaker's true face image is obtained and stored automatically during the training period performed at the start of communication.

When the conversation then begins, the video signal captured during the conversation period is input from the format conversion circuit 3 to the video adjustment circuit 4. The input video signal (CIF signal CS) is first separated by the face area extraction unit 411 into the partial video signal PS of the face area and the partial video signal MS of the remaining background area, and the partial video signal PS of the face area is input to the specific part cutout unit 412.

The specific part cutout unit 412 extracts, from the input partial video signal PS of the face area, the video signal LS of the specific parts that represent the speaker's expression, such as the eyebrows, eyes, and mouth. As described above, this extraction is based on the color features of the eyebrows, eyes, and mouth and on their mutual positional relationships. If the extracted eyebrow, eye, and mouth video signals are ambiguous, the eyebrows, eyes, and mouth are identified by the relaxation method.

Once the video signal LS of the specific parts has been cut out, the facial expression synthesis unit 42 of the video adjustment circuit 4 morphs the video signal LS of the specific parts extracted by the specific part cutout unit 412 with the true face video signal QS previously stored in the true face information storage unit 422, as follows.

First, the normalization unit 421 performs normalization to match the position and size of the specific parts to the position and size of the same parts in the true face video signal QS. Next, the morphing unit 423 linearly interpolates between the shape coordinates of the specific parts in the normalized partial video signal RS and the shape coordinates of the specific parts in the true face video signal QS, according to the facial expression adjustment ratio NS read from the adjustment ratio storage unit 43. In the same way, it linearly interpolates between the shape coordinates of the normalized parts other than the specific parts, that is, other skin-colored areas such as the cheeks, forehead, and nose, and the shape coordinates of the corresponding skin-colored areas in the true face video signal QS, according to the facial expression adjustment ratio NS. This linear interpolation yields a new shape for the specific parts and a new shape for the skin-colored areas other than the specific parts, and thereby creates a partial video signal OS of the face area in which the opening of the mouth, the opening of the eyes, and the movement of the eyebrows are adjusted according to the facial expression adjustment ratio NS.

For example, suppose the communication partner is a business contact. The facial expression adjustment ratio associated with this partner is read from the adjustment ratio storage unit 43 on the basis of the partner's subscriber telephone number or IP address. The shape of each specific part is then morphed with the shape of the same part in the true-face video signal QS according to the value read out. If the ratio set for the business contact is, for example, "0" or close to it, the output is synthesized mainly from the true-face video signal QS, producing a partial video signal of the face region whose expression remains almost unchanged from the true face.

Suppose instead that the communication partner is a family member. In this case the corresponding facial expression adjustment ratio is read from the adjustment ratio storage unit 43 on the basis of the family member's subscriber telephone number or IP address, and the shape of each specific part is morphed with the shape of the same part in the true-face video signal QS according to the value read out. If the ratio set for the family member is, for example, "1" or close to it, the output is synthesized mainly from the video signal captured during the conversation, producing a partial video signal of the face region in which the expression during the conversation appears as it is.

Similarly, if the communication partner is, for example, a superior at work, morphing is performed according to the facial expression adjustment ratio stored in association with that superior. If this ratio is, for example, "0.5", the video signal captured during the conversation and the true-face video signal are combined in equal proportions, producing a partial video signal of the face region with an intermediate expression.

Next, the video adjustment circuit 4 combines the partial video signal OS of the face region created by the morphing process with the partial video signal MS of the background region previously separated by the facial expression detection unit 41, yielding a video signal DS whose facial expression has been adjusted. This expression-adjusted video signal (transmission CIF signal) DS is encoded by the video encoding circuit 5, multiplexed with the encoded audio signal FS in the multiplexing/demultiplexing circuit 6, and then sent to the communication partner apparatus over a transmission path (not shown).
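The recombination of the adjusted face region with the separated background can be illustrated with a small sketch. The specification does not say how the two partial signals are merged; the per-pixel boolean mask used here, the name `composite`, and the representation of a frame as nested lists of pixel values are assumptions made purely for illustration.

```python
from typing import List, Sequence

Frame = List[List[int]]


def composite(face: Frame, background: Frame, face_mask: Sequence[Sequence[bool]]) -> Frame:
    """Take pixels from `face` where the mask is True, otherwise from `background`.

    The mask marking the face region is a hypothetical stand-in for whatever
    region information the face/background separation step produces.
    """
    if not (len(face) == len(background) == len(face_mask)):
        raise ValueError("all inputs must have the same number of rows")
    out: Frame = []
    for f_row, b_row, m_row in zip(face, background, face_mask):
        if not (len(f_row) == len(b_row) == len(m_row)):
            raise ValueError("all rows must have the same width")
        out.append([f if m else b for f, b, m in zip(f_row, b_row, m_row)])
    return out
```

In this sketch `face` plays the role of the partial video signal OS, `background` that of MS, and the returned frame that of the adjusted video signal DS.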

As described above, in the first embodiment the video adjustment circuit 4 morphs the image of the speaker's face region captured during conversation with the face region of the true-face image captured and stored during the training period at the start of communication, according to the information NS designating a preset facial expression adjustment ratio, and transmits the video signal whose expression has been adjusted by this morphing to the communication partner apparatus. This reduces the problem of privacy-related information, such as the speaker's emotions, degree of fatigue, and tension, becoming known to the other party during conversation.
Moreover, because the speaker's true-face image is acquired and stored automatically during the training period performed at the start of communication, there is no need to capture and store the speaker's true-face image before communication, which makes it possible to support an unspecified number of speakers.

Further, in this embodiment a facial expression adjustment ratio is set for each communication partner and stored in the adjustment ratio storage unit 43 in association with the partner's subscriber telephone number or IP address. The ratio corresponding to the communication partner is then read from the adjustment ratio storage unit 43 on the basis of the partner's subscriber telephone number or IP address and used for the morphing process.
Therefore, when the communication partner is someone the speaker knows well, such as a family member, setting a high synthesis ratio for the expression during conversation and a zero or low synthesis ratio for the true face lets the speaker's expression reach the partner as unaltered as possible. Conversely, when the partner is a business contact or someone the speaker is meeting for the first time, setting a low synthesis ratio for the expression during conversation and a high synthesis ratio for the true face keeps the speaker's expression from reaching the partner as far as possible. In other words, a video signal with the expression best suited to the communication partner can be transmitted.
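The per-partner storage and lookup of the adjustment ratio can be sketched as a simple keyed table. The class name `AdjustmentRatioStore`, the fallback to ratio 0 for a partner with no stored entry, and the example identifiers are assumptions for illustration; the specification only states that the ratio is stored in association with the partner's subscriber telephone number or IP address.

```python
from typing import Dict


class AdjustmentRatioStore:
    """Maps a partner identifier (telephone number or IP address) to a ratio."""

    def __init__(self, default_ratio: float = 0.0) -> None:
        # Assumption: partners without an entry fall back to the true face (ratio 0).
        self._default_ratio = default_ratio
        self._ratios: Dict[str, float] = {}

    def set_ratio(self, partner_id: str, ratio: float) -> None:
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("ratio must lie in [0, 1]")
        self._ratios[partner_id] = ratio

    def get_ratio(self, partner_id: str) -> float:
        return self._ratios.get(partner_id, self._default_ratio)


# Hypothetical usage with documentation-range IP addresses.
store = AdjustmentRatioStore()
store.set_ratio("192.0.2.10", 1.0)    # e.g. a family member
store.set_ratio("198.51.100.7", 0.0)  # e.g. a business contact
```

Defaulting to the true face for unknown partners is the more conservative choice from a privacy standpoint, but it is a design assumption of this sketch rather than something the description specifies.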

(Second Embodiment)
In the second embodiment of the present invention, the video adjustment processing on the face image of the subject is performed by executing a program on a central processing unit (CPU).

FIG. 5 is a block diagram showing a second embodiment of the video communication apparatus according to the present invention. In the figure, parts identical to those in FIG. 2 are given the same reference numerals, and their detailed description is omitted.
The video communication apparatus according to this embodiment includes a CPU 10 and a memory 11, which together constitute the video adjustment circuit. The memory 11 consists of a non-volatile memory section, formed by a hard disk or flash memory, and a volatile memory section, formed by RAM. The video adjustment processing program is stored in the non-volatile memory section. The volatile memory section stores the information needed for the video adjustment processing: the information NS designating the facial expression adjustment ratio and the true-face video signal (true-face image).
The CPU 10 executes the video adjustment processing on the face image of the subject captured during the conversation period, following the video adjustment processing program stored in the memory 11 and using the information NS designating the facial expression adjustment ratio and the true-face video signal.

Next, the operation of the apparatus configured as described above is explained along the control procedure of the CPU 10. FIG. 6 is a flowchart showing this control procedure and its contents.
The CPU 10 monitors input of the facial expression adjustment ratio in step 6a and input of the video signal in step 6b. When, in this state, the speaker or the administrator of the apparatus enters the information NS designating the facial expression adjustment ratio on an input device (not shown), the CPU 10 proceeds from step 6a to step 6c and stores the entered information NS in the memory 11. The information NS is set for each communication partner and is stored in the memory 11 in association with identification information of that partner, for example a subscriber telephone number or an IP address.

When communication starts in this state and a video signal (CIF signal) CS is input from the format conversion circuit 3, the CPU 10 proceeds from step 6b to step 6d and determines whether a conversation is in progress. If the conversation has not yet started, for example during the training period for video communication, the CPU 10 proceeds to step 6e and stores the input video signal CS in the memory 11 as the true-face image QS. In this way the true-face image of the speaker before the start of conversation is stored automatically.

When, on the other hand, a video signal (CIF signal) CS is input during conversation, the CPU 10 proceeds from step 6d to step 6f, where it first separates and extracts the face region and the background region from the input video signal CS and saves the extracted face-region and background-region video signals in the memory 11. As described in the first embodiment, the face region is extracted using the skin-color extraction method together with a method based on the size and the vertical-to-horizontal aspect ratio of the face region.

Next, in step 6g, the CPU 10 extracts from the extracted face-region video signal the images of the specific parts that represent the speaker's facial expression, for example the regions of the eyebrows, eyes, and mouth. As described in the first embodiment, this extraction is based on the shape and color features of the eyebrows, eyes, and mouth and on their positional relationship to one another. Also as in the first embodiment, when the extraction result is ambiguous, the relaxation method is used to reduce the ambiguity.

Once the regions of the specific parts have been extracted, the CPU 10 performs normalization in step 6h to match the position and size of each extracted specific part to the position and size of the same part in the true-face image stored before the start of conversation. Then, in step 6i, it linearly interpolates between the shape coordinates of the normalized partial video signal of the specific part and the shape coordinates of the partial video signal of the same part in the true-face image, according to the facial expression adjustment ratio read from the memory 11. In the same way, the shape coordinates of the normalized regions other than the specific parts, that is, other skin-color regions such as the cheeks, forehead, and nose, are linearly interpolated with the shape coordinates of the corresponding skin-color regions in the true-face image, according to the facial expression adjustment ratio read from the memory 11. The face image captured during conversation is thus morphed with the true-face image captured and stored before the conversation, according to the preset facial expression adjustment ratio.
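The normalization of step 6h, which aligns the position and size of an extracted part with the same part in the true-face image, might look like the following sketch. The specification does not state how position and size are measured; using axis-aligned bounding boxes, and the names `bounding_box` and `normalize_to_reference`, are assumptions made here for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def bounding_box(points: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of a non-empty point list."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def normalize_to_reference(points: List[Point], reference: List[Point]) -> List[Point]:
    """Scale and translate `points` so their bounding box matches that of `reference`.

    Bounding-box alignment is an assumed, simplified notion of "position and size".
    """
    px0, py0, px1, py1 = bounding_box(points)
    rx0, ry0, rx1, ry1 = bounding_box(reference)
    # Avoid division by zero for degenerate (zero-width or zero-height) shapes.
    sx = (rx1 - rx0) / (px1 - px0) if px1 != px0 else 1.0
    sy = (ry1 - ry0) / (py1 - py0) if py1 != py0 else 1.0
    return [(rx0 + (x - px0) * sx, ry0 + (y - py0) * sy) for x, y in points]
```

After this step the live shape and the true-face shape occupy the same region, so the subsequent interpolation of step 6i changes only the shape and not the placement of the part.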

When the morphing process is finished, the CPU 10 combines, in step 6j, the face-region video signal created by the morphing process with the background-region video signal previously separated in step 6f, thereby obtaining the video signal DS whose facial expression has been adjusted. It then outputs this expression-adjusted video signal (transmission CIF signal) DS to the video encoding circuit 5 in step 6k.
Thereafter, each time a video signal is input during conversation, the series of processes for adjusting the expression of the face image is executed in steps 6b to 6k.
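The overall control flow of FIG. 6 can be summarized in a short sketch. Everything concrete here is an assumption for illustration: the class name `AdjustmentController`, the injected `adjust` callable standing in for steps 6f to 6j, the use of a plain number as a placeholder frame, returning `None` while the true face is being stored, and the pass-through when no true face exists yet.

```python
from typing import Callable, Dict, Optional

Frame = float  # placeholder type for this sketch; a real frame would be an image


class AdjustmentController:
    def __init__(self, adjust: Callable[[Frame, Frame, float], Frame]) -> None:
        self._adjust = adjust  # stands in for steps 6f to 6j (extract, normalize, morph, combine)
        self._ratios: Dict[str, float] = {}
        self._true_face: Optional[Frame] = None

    def on_ratio_input(self, partner_id: str, ratio: float) -> None:
        """Steps 6a -> 6c: store the ratio keyed by the partner identifier."""
        self._ratios[partner_id] = ratio

    def on_frame(self, frame: Frame, in_conversation: bool, partner_id: str) -> Optional[Frame]:
        """Steps 6b -> 6d, then either 6e or 6f to 6k."""
        if not in_conversation:
            self._true_face = frame  # step 6e: store as the true-face image
            return None              # assumption: nothing is output for this frame
        if self._true_face is None:
            return frame             # assumption: pass through if no true face is stored
        ratio = self._ratios.get(partner_id, 0.0)  # assumption: default to the true face
        return self._adjust(frame, self._true_face, ratio)  # result goes to the encoder (6k)
```

Injecting the adjustment step as a callable keeps the sketch focused on the branching of the flowchart rather than on image processing details.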

As described above, in the second embodiment, as in the first, the image of the speaker's face region captured during conversation is morphed with the face region of the true-face image captured and stored during the training period at the start of communication, according to the information NS designating a preset facial expression adjustment ratio, and the video signal whose expression has been adjusted by this morphing is transmitted to the communication partner apparatus. This reduces the problem of privacy-related information, such as the speaker's emotions, degree of fatigue, and tension, becoming known to the other party during conversation. Moreover, because the speaker's true-face image is acquired and stored automatically during the training period performed at the start of communication, there is no need to capture and store the speaker's true-face image before communication, which makes it possible to support an unspecified number of speakers.

Further, a facial expression adjustment ratio is set for each communication partner and stored in the memory 11 in association with the partner's subscriber telephone number or IP address, and the ratio corresponding to the communication partner is read from the memory 11 on the basis of that telephone number or IP address and used for the morphing process. A video signal with the most suitable expression can therefore be transmitted according to the attributes of the communication partner, for example whether the partner is someone the speaker knows well, such as a family member, or a business contact or someone the speaker is meeting for the first time.

(Other Embodiments)
In each of the embodiments above, the video signal of the subject before the start of conversation, captured during the training period executed at the start of communication, is stored as the true-face video signal. The invention is not limited to this, however: the video signal of the subject obtained in the first non-conversation period after the start of communication may be stored instead. In addition, each time a video signal of the subject is obtained in a subsequent non-conversation period, the stored true-face video signal may be successively learned and corrected on the basis of the newly obtained signal. This yields a true-face image of the subject that is closer to the subject's normal state. A non-conversation period can be identified by monitoring for the presence or absence of the transmitted and received voice signals.
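The successive correction of the stored true face during non-conversation periods could be sketched as below. The specification does not give the learning rule; the exponential moving average, the `weight` parameter, the representation of the true face as a flat list of values, and the function names are all assumptions introduced for illustration. Only the idea of treating a period with neither transmitted nor received voice as a non-conversation period comes from the description.

```python
from typing import List, Optional


def is_non_conversation(tx_voice_present: bool, rx_voice_present: bool) -> bool:
    """A period counts as non-conversation when neither voice signal is present."""
    return not (tx_voice_present or rx_voice_present)


def update_true_face(
    stored: Optional[List[float]], observed: List[float], weight: float = 0.1
) -> List[float]:
    """Blend a newly observed frame into the stored true face.

    Uses an exponential moving average (an assumed update rule); the first
    observation is stored unchanged.
    """
    if stored is None:
        return list(observed)
    if len(stored) != len(observed):
        raise ValueError("stored and observed data must have the same length")
    return [(1.0 - weight) * s + weight * o for s, o in zip(stored, observed)]
```

A small `weight` makes the stored true face change slowly, so a single atypical frame has little effect; the appropriate value would have to be chosen experimentally.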

In each of the embodiments above, a facial expression adjustment ratio is determined for each party expected as a communication destination, and the determined value is stored in the adjustment ratio storage unit 43 in association with that party's subscriber telephone number or IP address. The invention is not limited to this, however: the expected communication partners may be grouped, a facial expression adjustment ratio determined for each group, and the determined value stored in the adjustment ratio storage unit 43 in association with the identification number of the group. Communication partners can then be divided into, for example, a group of family members and relatives, a group of colleagues at work, and a group of business contacts, with one ratio set per group. This simplifies the setting and management of the facial expression adjustment ratio compared with setting a ratio for each individual partner.
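The group-based variant can be sketched as a two-level lookup, from partner to group and from group to ratio. The class name `GroupedRatioStore`, the method names, the string group identifiers, and the default ratio for unassigned partners are assumptions for illustration; the description only says that the ratio is stored in association with a group identification number.

```python
from typing import Dict


class GroupedRatioStore:
    """Resolves a partner's ratio through the group the partner belongs to."""

    def __init__(self, default_ratio: float = 0.0) -> None:
        # Assumption: partners without a group fall back to the true face (ratio 0).
        self._default_ratio = default_ratio
        self._group_of: Dict[str, str] = {}
        self._ratio_of_group: Dict[str, float] = {}

    def set_group_ratio(self, group_id: str, ratio: float) -> None:
        if not 0.0 <= ratio <= 1.0:
            raise ValueError("ratio must lie in [0, 1]")
        self._ratio_of_group[group_id] = ratio

    def assign(self, partner_id: str, group_id: str) -> None:
        self._group_of[partner_id] = group_id

    def ratio_for(self, partner_id: str) -> float:
        group_id = self._group_of.get(partner_id)
        if group_id is None:
            return self._default_ratio
        return self._ratio_of_group.get(group_id, self._default_ratio)
```

Changing the ratio of one group then affects every partner assigned to it, which is the management simplification the description points to.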

In addition, the procedure and content of the video adjustment processing, the method of setting the facial expression adjustment ratio, the method of generating and storing the true-face image, and so on can of course be modified in various ways without departing from the gist of the present invention.
In short, the present invention is not limited to the embodiments exactly as described above; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiments. For example, some constituent elements may be removed from the full set shown in an embodiment, and constituent elements from different embodiments may be combined as appropriate.

FIG. 1 is a block diagram showing the main part of a first embodiment of the video communication apparatus according to the present invention.
FIG. 2 is a block diagram showing the configuration of the video adjustment circuit in the video communication apparatus shown in FIG. 1.
FIG. 3 is a block diagram showing the configuration of the facial expression detection unit in the video adjustment circuit shown in FIG. 2.
FIG. 4 is a block diagram showing the configuration of the facial expression synthesis unit in the video adjustment circuit shown in FIG. 2.
FIG. 5 is a block diagram showing the main part of a second embodiment of the video communication apparatus according to the present invention.
FIG. 6 is a flowchart showing the procedure and content of the video creation control executed by the CPU of the video communication apparatus shown in FIG. 5.

Explanation of symbols

1 ... camera, 2 ... analog-to-digital conversion circuit (A/D), 3 ... format conversion circuit, 4 ... video adjustment circuit, 5 ... video encoding circuit, 6 ... multiplexing/demultiplexing circuit, 7 ... inverse format conversion circuit, 8 ... digital-to-analog conversion circuit (D/A), 9 ... monitor, 10 ... central processing unit (CPU), 11 ... memory, 41 ... facial expression extraction unit, 42 ... facial expression synthesis unit, 43 ... adjustment ratio storage unit, 44 ... video synthesis unit, 411 ... face region extraction unit, 412 ... specific part cut-out unit, 421 ... normalization unit, 422 ... true face information storage unit, 423 ... morphing unit.

Claims

A video communication method comprising:
a process of setting information representing a facial expression adjustment ratio of a subject;
a process of imaging a part including the face of the subject during a non-conversation period during communication;
a process of storing the captured video signal as a first video signal representing a true face;
a process of imaging a part including the face of the subject during a conversation period during communication to obtain a second video signal;
a process of extracting, from the second video signal obtained during the conversation period, a first partial video signal of a specific part representing the expression of the subject;
a process of synthesizing the extracted first partial video signal and a second partial video signal corresponding to the specific part in the stored first video signal, according to the set information representing the facial expression adjustment ratio, to generate a third partial video signal;
a process of synthesizing the generated third partial video signal and a partial video signal other than the specific part in the second video signal to generate a third video signal; and
a process of transmitting the generated third video signal.

The video communication method according to claim 1, wherein the process of setting the information representing the facial expression adjustment ratio sets information representing a facial expression adjustment ratio of the subject in association with each of a plurality of assumed communication partners, and
the process of generating the third partial video signal selects, from the plurality of set facial expression adjustment ratios, the information representing the facial expression adjustment ratio corresponding to the communication partner on the basis of identification information of the communication partner used in the communication, and synthesizes the first partial video signal and the second partial video signal according to the selected information.

The video communication method according to claim 1, wherein the process of storing the first video signal stores, as the first video signal representing a true face, a video signal obtained by imaging a part including the face of the subject during a training period for video communication.

2. The step of storing the first video signal includes generating and storing a first video signal representing a true face based on a plurality of video signals obtained during a non-conversation period. Video communication method.

A video communication apparatus comprising:
memory means for storing information representing a facial expression adjustment ratio of a subject;
means for storing, as a first video signal representing a true face, a video signal obtained by imaging a part including the face of the subject during a non-conversation period during communication;
means for extracting a first partial video signal of a specific part representing the expression of the subject from a second video signal obtained by imaging a part including the face of the subject during a conversation period during communication;
means for synthesizing the extracted first partial video signal and a second partial video signal corresponding to the specific part in the stored first video signal, according to the information representing the facial expression adjustment ratio stored in the memory means, to generate a third partial video signal;
means for synthesizing the generated third partial video signal and a partial video signal other than the specific part in the second video signal to generate a third video signal; and
means for transmitting the generated third video signal.

The video communication apparatus according to claim 5, wherein the memory means stores information representing a facial expression adjustment ratio of the subject in association with each of a plurality of assumed communication partners, and
the means for generating the third partial video signal selectively reads from the memory means the information representing the facial expression adjustment ratio corresponding to the communication partner, on the basis of identification information of the communication partner used in the communication, and synthesizes the first partial video signal and the second partial video signal according to the information read out.

The video communication apparatus according to claim 5, wherein the means for storing the first video signal stores, as the first video signal representing a true face, a video signal obtained by imaging a part including the face of the subject during a training period for video communication.

The video communication apparatus according to claim 5, wherein the means for storing the first video signal generates and stores the first video signal representing a true face based on a plurality of video signals obtained during non-conversation periods.

A video creation program used in a video communication apparatus that creates, by a computer, a transmission video signal based on a video signal obtained by imaging a subject and transmits the created transmission video signal, the program causing the computer to execute:
a process of storing, as a first video signal representing a true face, a video signal obtained by imaging a part including the face of the subject during a non-conversation period during communication;
a process of extracting a first partial video signal of a specific part representing the expression of the subject from a second video signal obtained by imaging a part including the face of the subject during a conversation period during communication;
a process of synthesizing the extracted first partial video signal and a second partial video signal corresponding to the specific part in the stored first video signal, according to information representing a preset facial expression adjustment ratio, to generate a third partial video signal; and
a process of synthesizing the generated third partial video signal and a partial video signal other than the specific part in the second video signal to generate a third video signal, and outputting the generated third video signal as the transmission video signal.

The video creation program according to claim 9, wherein the process of generating the third partial video signal selects, from a plurality of preset facial expression adjustment ratios, the information representing the facial expression adjustment ratio corresponding to the communication partner on the basis of identification information of the communication partner used in the communication, and synthesizes the first partial video signal and the second partial video signal according to the selected information.

The video creation program according to claim 9, wherein the process of storing the first video signal stores, as the first video signal representing a true face, a video signal obtained by imaging a part including the face of the subject during a training period for video communication.

10. The process of storing the first video signal generates and stores a first video signal representing a true face based on a plurality of video signals obtained during a non-conversation period. Video creation program.

A recording medium on which is recorded a video creation program used in a video communication apparatus that creates, by a computer, a transmission video signal based on a video signal obtained by imaging a subject and transmits the created transmission video signal, the program causing the computer to execute:
a process of storing, as a first video signal representing a true face, a video signal obtained by imaging a part including the face of the subject during a non-conversation period during communication;
a process of extracting a first partial video signal of a specific part representing the expression of the subject from a second video signal obtained by imaging a part including the face of the subject during a conversation period during communication;
a process of synthesizing the extracted first partial video signal and a second partial video signal corresponding to the specific part in the stored first video signal, according to information representing a preset facial expression adjustment ratio, to generate a third partial video signal; and
a process of synthesizing the generated third partial video signal and a partial video signal other than the specific part in the second video signal to generate a third video signal, and outputting the generated third video signal as the transmission video signal.

The recording medium on which the video creation program according to claim 13 is recorded, wherein the process of generating the third partial video signal selects, from a plurality of preset facial expression adjustment ratios, the information representing the facial expression adjustment ratio corresponding to the communication partner on the basis of identification information of the communication partner used in the communication, and synthesizes the first partial video signal and the second partial video signal according to the selected information.

The recording medium on which the video creation program according to claim 13 is recorded, wherein the process of storing the first video signal stores, as the first video signal representing a true face, a video signal obtained by imaging a part including the face of the subject during a training period for video communication.

14. The processing for storing the first video signal generates and stores a first video signal representing a true face based on a plurality of video signals obtained during a non-conversation period. A recording medium that records a video creation program.