JP2010066844A - Method and device for processing video content, and program for processing video content

Info

Publication number: JP2010066844A
Application number: JP2008230442A
Authority: JP
Inventors: Kentaro Miyamoto; 健太郎宮本
Original assignee: Fujifilm Corp
Current assignee: Fujifilm Corp
Priority date: 2008-09-09
Filing date: 2008-09-09
Publication date: 2010-03-25

Abstract

PROBLEM TO BE SOLVED: To accurately determine the emotion of a person appearing in video content and to process the video content based on the determined emotion.

SOLUTION: A face detection unit 36 detects a face in each still-image frame constituting the video content, and also detects time-series changes and movement. An expression recognition unit 37 recognizes the expression of the detected face. A movement detection unit 38 detects the movement of the person whose face is detected. A voice extraction unit 39 extracts the voice of the person whose face is detected. A voice emotion recognition unit 40 recognizes the voice emotion of the person whose face is detected. An emotion determination unit 41 determines the emotion of the person whose face is detected by referring to an emotion table 51 stored in an emotion DB 45. A decorative image acquisition unit 42 acquires, from a decorative image DB 46, a decorative image corresponding to the emotion determined by the emotion determination unit 41. A composition processing unit 43 combines the decorative image acquired by the decorative image acquisition unit 42 with the video content.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a moving image content processing method and apparatus, and to a moving image content processing program, for processing moving image content based on the emotion of a person appearing in it.

On the Internet there are distribution sites that provide moving image content distribution services, such as YouTube (registered trademark), Nico Nico Douga (registered trademark), and Yahoo! (registered trademark) Video. Recent moving image content distribution sites increase the added value of their services by providing moving image content combined with user comments and decorative images.

As a moving image content processing technique, there is one that determines an emotion from the facial appearance of a person in the image and composites a decorative image corresponding to the determined emotion (see, for example, Patent Documents 1 and 2).
Patent Documents: JP 2008-084213 A; JP 2006-185329 A

Adults, in general, sometimes hold back from showing their emotions on their faces in front of others. For example, a person may smile while suppressing anger, and it is difficult to determine the emotion from such a facial appearance (facial expression). That is, the inventions of Patent Documents 1 and 2, which determine the emotion from the facial appearance alone, may fail to determine the emotion accurately and may therefore be unable to composite a decorative image suited to the moving image content.

The present invention has been made in view of the above problem, and its object is to provide a moving image content processing method and apparatus, and a moving image content processing program, that accurately determine the emotion of a person in the image and process the moving image content based on the determined emotion.

In order to achieve the above object, the moving image content processing apparatus of the present invention includes: a face detection unit that detects a face from moving image content; a facial expression recognition unit that recognizes the expression of the face detected by the face detection unit; a motion detection unit that detects, from the moving image content, the movement of the person whose face was detected by the face detection unit; a voice extraction unit that extracts, from the moving image content, the voice of the person whose face was detected by the face detection unit; an emotion determination unit that determines the emotion of the person whose face was detected by the face detection unit from the expression recognized by the facial expression recognition unit, the movement of the person detected by the motion detection unit, and the voice extracted by the voice extraction unit; and a processing unit that processes the moving image content based on the emotion determined by the emotion determination unit.

The invention according to claim 2 includes a voice emotion recognition unit that recognizes a voice emotion from the voice extracted by the voice extraction unit, and an emotion database storing an emotion table in which combinations of a facial expression recognized from a face, a movement of a person, and a voice emotion are stored in association with emotions. Using the emotion table, the emotion determination unit determines that the emotion corresponding to the combination of the expression recognized by the facial expression recognition unit, the movement of the person detected by the motion detection unit, and the voice emotion recognized by the voice emotion recognition unit is the emotion of the person whose face was detected by the face detection unit.

The invention according to claim 3 includes a decoration content database in which decoration content for decorating moving image content is stored in association with emotions, and a decoration content acquisition unit that acquires, from the decoration content database, the decoration content corresponding to the emotion determined by the emotion determination unit. The processing unit is a composition processing unit that combines the decoration content acquired by the decoration content acquisition unit with the moving image content.

In the invention according to claim 4, a plurality of types of decoration content are stored in the decoration content database in association with each emotion, and the decoration content acquisition unit acquires the decoration content of the type that has been input.

The invention according to claim 5 includes a gender estimation unit that estimates the gender of the person whose face was detected by the face detection unit from the face detected by the face detection unit and the voice extracted by the voice extraction unit. The processing unit processes the moving image content based on the gender estimated by the gender estimation unit.

The invention according to claim 6 includes an age estimation unit that estimates the age of the person whose face was detected by the face detection unit from the face detected by the face detection unit and the voice extracted by the voice extraction unit. The processing unit processes the moving image content based on the age estimated by the age estimation unit.

The moving image content processing method of the present invention includes: a face detection step in which a face detection unit detects a face from moving image content; a facial expression recognition step in which a facial expression recognition unit recognizes an expression from the face detected in the face detection step; a motion detection step in which a motion detection unit detects, from the moving image content, the movement of the person whose face was detected in the face detection step; a voice extraction step in which a voice extraction unit extracts, from the moving image content, the voice of the person whose face was detected in the face detection step; an emotion determination step in which an emotion determination unit determines the emotion of the person whose face was detected in the face detection step from the expression recognized in the facial expression recognition step, the movement of the person detected in the motion detection step, and the voice extracted in the voice extraction step; and a processing step in which a processing unit processes the moving image content based on the emotion determined in the emotion determination step.
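The order of the steps in the method can be sketched as follows. This is only an illustrative Python sketch, not the disclosed implementation: the `Units` container, the `process_content` function, and the callable signatures are hypothetical names introduced here, and each unit is supplied by the caller as a plain function.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class Units:
    """Hypothetical bundle of the units named in the method (signatures are assumptions)."""
    detect_faces: Callable[[Any], Iterable[Any]]
    recognize_expression: Callable[[Any], str]
    detect_movement: Callable[[Any, Any], str]
    extract_voice: Callable[[Any, Any], Any]
    determine_emotion: Callable[[str, str, Any], str]
    process: Callable[[Any, Any, str], Any]


def process_content(content: Any, units: Units) -> Any:
    """Run the steps in the order given in the method, once per detected face."""
    for face in units.detect_faces(content):                 # face detection step
        expression = units.recognize_expression(face)        # facial expression recognition step
        movement = units.detect_movement(content, face)      # motion detection step
        voice = units.extract_voice(content, face)           # voice extraction step
        emotion = units.determine_emotion(expression, movement, voice)  # emotion determination step
        content = units.process(content, face, emotion)      # processing step
    return content
```

Passing the units in as callables keeps the sketch independent of any particular face detector or voice analyzer; stub functions are enough to exercise the control flow.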

The moving image content processing program of the present invention causes a computer to execute: a face detection step of detecting a face from moving image content; a facial expression recognition step of recognizing an expression from the face detected in the face detection step; a motion detection step of detecting, from the moving image content, the movement of the person whose face was detected in the face detection step; a voice extraction step of extracting, from the moving image content, the voice of the person whose face was detected in the face detection step; an emotion determination step of determining the emotion of the person whose face was detected in the face detection step from the expression recognized in the facial expression recognition step, the movement of the person detected in the motion detection step, and the voice extracted in the voice extraction step; and a processing step of processing the moving image content based on the emotion determined in the emotion determination step.

According to the moving image content processing method and apparatus and the moving image content processing program of the present invention, the emotion of a person in the image can be determined accurately, and the moving image content can be processed based on the determined emotion.

[First Embodiment]
In FIG. 1, the moving image content processing apparatus of the first embodiment is realized by installing a moving image content processing program 44 (see FIG. 3) on a server 11. The moving image content processing apparatus processes moving image content based on the emotion of a person appearing in it. In this specification, moving image content means a moving image accompanied by sound. In this embodiment, the case where moving image content is processed by compositing a decorative image onto the still-image frames constituting the moving image content is described as an example, but the moving image content may instead be processed by compositing audio or other content.

The server 11, together with a client terminal 13 connected via the Internet 12, constitutes a network system 14. The client terminal 13 is, for example, a well-known personal computer or workstation, and includes a monitor 15 that displays various operation screens and an operation unit 18 consisting of a mouse 16 and a keyboard 17 that output operation signals.

Moving image content captured with a digital camera 19, or recorded on a recording medium 20 such as a memory card or a CD-R, is sent to the client terminal 13; alternatively, moving image content is transferred to it via the Internet 12.

The digital camera 19 is connected to the client terminal 13 by, for example, a communication cable compliant with IEEE 1394 or USB (Universal Serial Bus), or by a wireless LAN, so that data can be exchanged with the client terminal 13. Similarly, the recording medium 20 can exchange data with the client terminal 13 via a dedicated driver.

As shown in FIG. 2, a CPU 21 of the client terminal 13 controls the entire client terminal 13 in accordance with operation signals input from the operation unit 18 and the like. In addition to the operation unit 18, a RAM 23, a hard disk drive (HDD) 24, a communication interface (communication I/F) 25, and the monitor 15 are connected to the CPU 21 via a data bus 22.

The RAM 23 is a working memory used by the CPU 21 to execute processing. The HDD 24 stores various programs and data for operating the client terminal 13, as well as moving image content taken in from the digital camera 19, the recording medium 20, or the Internet 12. The CPU 21 reads a program from the HDD 24, loads it into the RAM 23, and executes the loaded program sequentially.

The communication I/F 25 is, for example, a modem or a router; it controls a communication protocol suited to the Internet 12 and exchanges data via the Internet 12. The communication I/F 25 also performs data communication with external devices such as the digital camera 19 and the recording medium 20.

As shown in FIG. 3, a CPU 31 of the server 11 controls the entire server 11 in accordance with operation signals input from the client terminal 13 via the Internet 12. Connected to the CPU 31 via a data bus 32 are a RAM 33, a hard disk drive (HDD) 34, a communication interface (communication I/F) 35, a face detection unit 36, a facial expression recognition unit 37, a motion detection unit 38, a voice extraction unit 39, a voice emotion recognition unit 40, an emotion determination unit 41, a decorative image acquisition unit 42, and a composition processing unit (processing unit) 43.

The RAM 33 is a working memory used by the CPU 31 to execute processing. The HDD 34 stores various programs and data for operating the server 11, including the moving image content processing program 44. The CPU 31 reads a program from the HDD 34, loads it into the RAM 33, and executes the loaded program sequentially.

The HDD 34 is provided with an emotion database (emotion DB) 45 and a decorative image database (decorative image DB) 46. The emotion DB 45 stores the emotion table 51 shown in FIG. 4.

The emotion table 51 stores combinations of a facial expression, a movement of a person, and a voice emotion in association with the emotion inferred from each combination. A voice emotion means an emotion recognized from the voice; in the following, the case where it is an element indicating the strength of the emotion is described as an example. If there are four facial expressions (“expressionless”, “smile”, “angry face”, and “sad face”), four levels of movement (“none”, “small”, “medium”, and “large”), and five levels of voice emotion (“silent”, “none”, “small”, “medium”, and “large”), then an emotion is stored for each of the 80 (= 4 × 4 × 5) combinations. For example, as shown in FIG. 4, the emotion “forced smile” is stored for the combination in which the facial expression is “smile”, the movement of the person is “none”, and the voice emotion is “silent”.
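The structure of such a table can be sketched as a dictionary keyed by the three inputs. This is an illustrative Python sketch under stated assumptions: the identifiers `EMOTION_TABLE` and `determine_emotion` are hypothetical, the English labels are renderings of the Japanese ones (for example, "forced smile" for 愛想笑い), and only the three entries spelled out in this description are filled in, because the remaining rows of FIG. 4 are not reproduced in the text.

```python
from itertools import product

EXPRESSIONS = ("expressionless", "smile", "angry face", "sad face")
MOVEMENTS = ("none", "small", "medium", "large")
VOICE_EMOTIONS = ("silent", "none", "small", "medium", "large")

# Entries stated in the description; all other cells of FIG. 4 are unknown here.
KNOWN_ENTRIES = {
    ("smile", "none", "silent"): "forced smile",
    ("smile", "small", "small"): "slightly playful",
    ("smile", "large", "large"): "very fun",
}

# One cell per combination: 4 x 4 x 5 = 80 keys. Unknown cells hold None.
EMOTION_TABLE = {
    key: KNOWN_ENTRIES.get(key)
    for key in product(EXPRESSIONS, MOVEMENTS, VOICE_EMOTIONS)
}


def determine_emotion(expression, movement, voice_emotion):
    """Look up the emotion for one (expression, movement, voice emotion) combination."""
    key = (expression, movement, voice_emotion)
    if key not in EMOTION_TABLE:
        raise KeyError(f"not a valid combination: {key}")
    return EMOTION_TABLE[key]
```

Keying the table on the full triple mirrors the description: the same "smile" maps to different emotions depending on the movement and the voice.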

The decorative image DB 46 stores a plurality of types of decorative images in association with emotions. For example, as shown in FIG. 5, three types of decorative images are stored for the emotion “forced smile”. The first type is a thought balloon showing what the person is really thinking, “I want to go home soon...”; the second type is a sound effect and manpu, “Tara~ / sweat mark”; and the third type is a speech balloon, “...”. Manpu are symbols used in comics to visualize emotions and sensations. By default, the first type, the thought balloon, is selected. The selected decorative image is switched based on an operation instruction from the client terminal 13. Specifically, the selection can be switched to the second type (sound effect and manpu) or the third type (speech balloon), to a plurality of types at once, or to a random choice.
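The default and switched selections can be sketched as follows. This is an illustrative Python sketch only: the identifiers `DECORATION_DB`, `DEFAULT_KIND`, and `acquire_decorations`, the kind names, and the `"random"` keyword are assumptions introduced here, and plain strings stand in for the decorative images.

```python
import random

# Decorative images per emotion, one per type; only the FIG. 5 example is filled in.
DECORATION_DB = {
    "forced smile": {
        "thought_balloon": "I want to go home soon...",
        "sound_effect_manpu": "Tara~ / sweat mark",
        "speech_balloon": "...",
    },
}
DEFAULT_KIND = "thought_balloon"  # the first type is selected by default


def acquire_decorations(emotion, kinds=None, rng=random):
    """Return the decorative images for an emotion.

    kinds=None      -> the default (first) type
    kinds="random"  -> one type chosen at random
    kinds=[...]     -> the listed types (one or several)
    """
    entry = DECORATION_DB[emotion]
    if kinds is None:
        return [entry[DEFAULT_KIND]]
    if kinds == "random":
        return [rng.choice(list(entry.values()))]
    return [entry[kind] for kind in kinds]
```

For example, `acquire_decorations("forced smile")` returns the thought balloon, while passing `kinds=["sound_effect_manpu", "speech_balloon"]` returns both of the other types.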

The communication I/F 35 is, for example, a modem or a router; it controls a communication protocol suited to the Internet 12 and exchanges data via the Internet 12. Data input via the communication I/F 35 is temporarily stored in the RAM 33.

When moving image content is input to the server 11, the face detection unit 36 detects faces in each still-image frame constituting the moving image content, and also detects time-series changes and movement. For face detection, a method such as the one using red-eye detection disclosed in Japanese Patent Laid-Open No. 2005-267512 is used; see that publication for details. Techniques using pattern matching, skin color detection, or the like may also be used for face detection.

The facial expression recognition unit 37 recognizes the expression of the face detected by the face detection unit 36. For facial expression recognition, a technique such as the one using a hidden Markov model (HMM) disclosed in Japanese Patent Laid-Open No. 10-255043 is used; see that publication for details.

The motion detection unit 38 detects the movement of the person whose face was detected by the face detection unit 36. For detecting the movement of a person, a technique such as the one using statistical data disclosed in Japanese Patent Laid-Open No. 09-251542 is used; see that publication for details. The movement of a person may also be detected by comparing the position of the face detected by the face detection unit 36 between frames.
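The alternative of comparing face positions between frames can be sketched as follows. This is an illustrative Python sketch, not the disclosed method: the function name `classify_movement`, the normalization by the frame diagonal, and the threshold values are all assumptions chosen for the example.

```python
import math


def classify_movement(face_centers, frame_diagonal, thresholds=(0.01, 0.05, 0.15)):
    """Classify movement as none/small/medium/large from per-frame face centers.

    face_centers   -- list of (x, y) face-center positions, one per frame
    frame_diagonal -- frame diagonal in pixels, used to normalize displacement
    thresholds     -- assumed cut points (fraction of the diagonal per frame)
    """
    if len(face_centers) < 2:
        return "none"
    # Displacement of the face center between each pair of consecutive frames.
    steps = [math.dist(a, b) for a, b in zip(face_centers, face_centers[1:])]
    mean_step = sum(steps) / len(steps) / frame_diagonal
    small, medium, large = thresholds
    if mean_step < small:
        return "none"
    if mean_step < medium:
        return "small"
    if mean_step < large:
        return "medium"
    return "large"
```

The four output levels match the four movement levels used as input to the emotion table 51.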

The voice extraction unit 39 extracts human voices from the audio of the moving image content input to the server 11. It then determines whether or not each extracted voice belongs to a person whose face was detected by the face detection unit 36. A band-pass filter or the like is used for voice extraction, and the techniques disclosed in Japanese Patent Laid-Open No. 09-127975, Japanese Patent Laid-Open No. 11-119791, and the like are used for this determination; see those publications for details of the determination. A technique that makes the determination by detecting changes in the corners of the mouth, that is, their movement, may also be used.

The voice emotion recognition unit 40 recognizes the voice emotion of the person whose face was detected by the face detection unit 36, based on the voice extracted by the voice extraction unit 39. For voice emotion recognition, the techniques disclosed in Japanese Patent Laid-Open No. 09-127975, Japanese Patent Laid-Open No. 10-049188, Japanese Patent Laid-Open No. 11-119791, and the like are used; see those publications for details.

The emotion determination unit 41 accesses the emotion DB 45 and, using the emotion table 51 shown in FIG. 4, determines the emotion of each person whose face was detected by the face detection unit 36 at every predetermined span (for example, every 5 seconds). Specifically, for each person whose face was detected by the face detection unit 36, it determines that the emotion corresponding to the combination of the facial expression recognized by the facial expression recognition unit 37, the movement of the person detected by the motion detection unit 38, and the voice emotion recognized by the voice emotion recognition unit 40 is the emotion of that person.

For example, when, for a person whose face was detected by the face detection unit 36, the facial expression recognition unit 37 recognizes the facial expression as “smile”, the motion detection unit 38 detects the movement of the person as “none”, and the voice emotion recognition unit 40 recognizes the voice emotion as “silent”, the emotion determination unit 41 determines that the emotion of that person is “forced smile” (see the third row from the top in FIG. 4). If the facial expression recognized by the facial expression recognition unit 37 or the voice emotion recognized by the voice emotion recognition unit 40 changes within the predetermined span, the emotion determination unit 41 makes the determination using the facial expression or voice emotion from before the change.
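The rule for a value that changes within a span can be sketched as follows. This is an illustrative Python sketch; the function names `value_before_change` and `span_inputs` are hypothetical, and it assumes that the recognized labels for one span arrive as an ordered list, so that the value before the change is simply the first one.

```python
def value_before_change(observations):
    """Return (value, changed) for the ordered observations of one span.

    The value is the label seen before any change; changed reports whether
    the label differed at any later point in the span.
    """
    if not observations:
        raise ValueError("a span needs at least one observation")
    first = observations[0]
    changed = any(value != first for value in observations[1:])
    return first, changed


def span_inputs(expressions, movement, voice_emotions):
    """Build the (expression, movement, voice emotion) key for one span."""
    expression, _ = value_before_change(expressions)
    voice_emotion, _ = value_before_change(voice_emotions)
    return expression, movement, voice_emotion
```

For a span in which the expression goes from "smile" to "angry face" while the voice goes from "silent" to "small", the key is built from "smile" and "silent".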

The decorative image acquisition unit 42 accesses the decorative image DB 46 and acquires the decorative image corresponding to the emotion determined by the emotion determination unit 41. For example, when the emotion determination unit 41 determines the emotion to be “forced smile”, it acquires by default the thought-balloon decorative image “I want to go home soon...”. When there is an operation instruction from the client terminal 13, it acquires, based on that instruction, the sound effect and manpu decorative image “Tara~ / sweat mark” or the speech-balloon decorative image “...”.

The composition processing unit 43 composites the decorative image acquired by the decorative image acquisition unit 42 onto each still-image frame constituting the moving image content. The case where three faces are detected by the face detection unit 36 is described as an example with reference to FIG. 6. The person in the upper part of the frame, whose facial expression is “smile”, movement is “none”, and voice emotion is “silent”, is determined to have the emotion “forced smile”, and by default the decorative image acquisition unit 42 acquires the first type, the thought-balloon decorative image “I want to go home soon...”. The composition processing unit 43 composites this decorative image onto the still-image frame of the moving image content so that it is placed near the person in the upper part of the frame.

The person on the left of the frame, whose facial expression is “smile”, movement is “small”, and voice emotion is “small”, is determined to have the emotion “slightly playful”, and by default the decorative image acquisition unit 42 acquires the first type, the thought-balloon decorative image “Ufufufu. Am I in the picture?”. The composition processing unit 43 composites this decorative image onto the still-image frame of the moving image content so that it is placed near the person on the left of the frame.

The person on the right of the frame, whose facial expression is “smile”, movement is “large”, and voice emotion is “large”, is determined to have the emotion “very fun”, and by default the decorative image acquisition unit 42 acquires the first type, the thought-balloon decorative image “This is so much fun♪”. The composition processing unit 43 composites this decorative image onto the still-image frame of the moving image content so that it is placed near the person on the right of the frame.
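One way to place a decorative image near a person can be sketched as follows. The description does not specify a placement rule, so this Python sketch is purely an assumption: the function name `place_decoration`, the preference for the right side of the face, and the `margin` value are all hypothetical.

```python
def place_decoration(face_box, deco_size, frame_size, margin=8):
    """Return the top-left (x, y) at which to draw a decorative image.

    face_box   -- (x, y, width, height) of the detected face
    deco_size  -- (width, height) of the decorative image
    frame_size -- (width, height) of the frame
    """
    fx, fy, fw, fh = face_box
    dw, dh = deco_size
    frame_w, frame_h = frame_size

    # Prefer the right side of the face; fall back to the left if it would overflow.
    x = fx + fw + margin
    if x + dw > frame_w:
        x = fx - margin - dw
    # Start slightly above the top of the face.
    y = fy - dh // 2

    # Clamp so the decorative image stays entirely inside the frame.
    x = max(0, min(x, frame_w - dw))
    y = max(0, min(y, frame_h - dh))
    return x, y
```

With several faces in one frame, as in FIG. 6, a fuller version would also need to avoid overlapping other faces and balloons; that is left out of this sketch.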

Also, as shown in FIG. 7, when the second type, the sound effect and manpu decorative image “Tara~ / sweat mark”, has been acquired for the person in the upper part of the frame, the composition processing unit 43 composites the acquired decorative image “Tara~ / sweat mark” onto the still-image frame of the moving image content so that it is placed near the person in the upper part of the frame.

When the second type, the sound effect and manpu decorative image “Niya niya” (grinning), has been acquired for the person on the left of the frame, the composition processing unit 43 composites the acquired decorative image “Niya niya” onto the still-image frame of the moving image content so that it is placed near the person on the left of the frame.

When the second type, the sound effect and manpu decorative image “Ahaha (lol)”, has been acquired for the person on the right of the frame, the composition processing unit 43 composites the acquired decorative image “Ahaha (lol)” onto the still-image frame of the moving image content so that it is placed near the person on the right of the frame.

Similarly, as shown in FIG. 8, when the third type, the speech-balloon decorative image “...”, has been acquired for the person in the upper part of the frame, the composition processing unit 43 composites the acquired decorative image “...” onto the still-image frame of the moving image content so that it is placed near the person in the upper part of the frame.

When the third type, the speech-balloon decorative image “Is it cash?!”, has been acquired for the person on the left of the frame, the composition processing unit 43 composites the acquired decorative image “Is it cash?!” onto the still-image frame of the moving image content so that it is placed near the person on the left of the frame.

When "元気ですかぁ！！" ("How are you?!"), a decoration image of the third type (speech balloons), has been acquired for the person on the right side of the frame, the composition processing unit 43 combines the acquired decoration image with each still-image frame of the moving image content so that it is placed around the person on the right side of the frame.
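The text says only that the decoration image is placed "around" the person and does not give a placement rule. The following is a minimal sketch of one possible rule, assuming the face is available as a bounding box; the `Box` type, the `place_decoration` function, and the "use the side with more free space" heuristic are all hypothetical and not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical helper type)."""
    x: int
    y: int
    w: int
    h: int


def place_decoration(face: Box, deco_w: int, deco_h: int,
                     frame_w: int, frame_h: int) -> Box:
    """Return where to draw a decoration image next to a detected face.

    Assumed heuristic: put the decoration on whichever side of the face
    has more free horizontal space, slightly above the top of the face,
    then clamp it so that it stays inside the frame.
    """
    space_left = face.x
    space_right = frame_w - (face.x + face.w)
    if space_right >= space_left:
        x = face.x + face.w   # to the right of the face
    else:
        x = face.x - deco_w   # to the left of the face
    y = face.y - deco_h // 2  # raised by half the decoration height
    x = max(0, min(x, frame_w - deco_w))
    y = max(0, min(y, frame_h - deco_h))
    return Box(x, y, deco_w, deco_h)
```

With a 640x480 frame, a face near the left edge gets its decoration on the right, a face near the right edge gets it on the left, and a face near the top has the decoration clamped to the upper border.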

Next, the processing procedure performed when the server 11 configured as described above (see FIGS. 1 and 3) functions as the moving image content processing apparatus is described with reference to the flowchart of FIG. 9. The user operates the operation unit 18 of the client terminal 13 to input moving image content to the server 11. The moving image content input to the server 11 is stored in the RAM 33.

The moving image content input to the server 11 is processed in real time, one predetermined span at a time (for example, a 5-second span, i.e. 150 frames). The moving image content is first read from the RAM 33 into the face detection unit 36, the motion detection unit 38, and the voice extraction unit 39. The face detection unit 36 detects faces in each still-image frame of the moving image content, together with their time-series changes and movements. The detected faces, time-series changes, and movements are stored in the RAM 33.
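Span-by-span processing can be illustrated with a short sketch. The example of 150 frames per 5-second span implies 30 frames per second, which is used as the default below. The function name `iter_spans` and its parameters are hypothetical.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_spans(frames: Sequence[T], span_seconds: float = 5.0,
               fps: float = 30.0) -> Iterator[Sequence[T]]:
    """Yield consecutive groups of frames, one group per processing span.

    The last group may be shorter when the content length is not a
    multiple of the span length.
    """
    span_len = int(round(span_seconds * fps))
    if span_len <= 0:
        raise ValueError("span_seconds * fps must be at least one frame")
    for start in range(0, len(frames), span_len):
        yield frames[start:start + span_len]
```

Each yielded group would then be passed through the detection, recognition, determination, and composition stages before the next group is handled.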

The faces, time-series changes, and movements detected by the face detection unit 36 are read from the RAM 33 into the facial expression recognition unit 37. The facial expression recognition unit 37 recognizes the expression of each face detected by the face detection unit 36. The recognized facial expression is stored in the RAM 33.

The facial expression recognized by the facial expression recognition unit 37 is read from the RAM 33 into the motion detection unit 38. The motion detection unit 38 detects the movement of the person whose face was detected by the face detection unit 36. The detected movement of the person is stored in the RAM 33.

The voice extraction unit 39 extracts, from the audio of the moving image content, the voice of the person whose face was detected by the face detection unit 36. The extracted voice is stored in the RAM 33.

The voice extracted by the voice extraction unit 39 is read from the RAM 33 into the voice emotion recognition unit 40. The voice emotion recognition unit 40 recognizes the voice emotion of the person whose face was detected by the face detection unit 36. The recognized voice emotion is stored in the RAM 33.

The facial expression recognized by the facial expression recognition unit 37, the movement detected by the motion detection unit 38, and the voice emotion recognized by the voice emotion recognition unit 40 are read from the RAM 33 into the emotion determination unit 41. The emotion determination unit 41 determines the emotion of the person whose face was detected by the face detection unit 36 by referring to the emotion table 51 (see FIG. 4) stored in the emotion DB 45. The determined emotion is stored in the RAM 33.
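The table lookup performed by the emotion determination unit can be sketched as a dictionary keyed by the combination of facial expression, movement, and voice emotion. The contents of emotion table 51 (FIG. 4) are not reproduced in this section, so every entry below is a made-up placeholder, and the names `EMOTION_TABLE` and `determine_emotion` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Key: (facial expression, movement, voice emotion) -> determined emotion.
# All entries are illustrative placeholders, not the patent's table.
EMOTION_TABLE: Dict[Tuple[str, str, str], str] = {
    ("smile", "clapping", "joy"): "happy",
    ("frown", "still", "anger"): "angry",
    ("neutral", "still", "calm"): "calm",
}


def determine_emotion(expression: str, movement: str, voice_emotion: str,
                      table: Dict[Tuple[str, str, str], str] = EMOTION_TABLE
                      ) -> Optional[str]:
    """Look up the emotion for one combination of the three cues.

    Returns None when the combination is not in the table; how the
    patent handles such a case is not stated in this section.
    """
    return table.get((expression, movement, voice_emotion))
```

Keying on the whole tuple reflects the point of the method: the emotion is determined from the three cues together rather than from the facial expression alone.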

The emotion determined by the emotion determination unit 41 is read from the RAM 33 into the decoration image acquisition unit 42. The decoration image acquisition unit 42 acquires, from the decoration image DB 46, the decoration image (see FIG. 5) corresponding to the emotion determined by the emotion determination unit 41. The acquired decoration image is stored in the RAM 33.

The decoration image acquired by the decoration image acquisition unit 42 is read from the RAM 33 into the composition processing unit 43. The composition processing unit 43 combines the acquired decoration image with the still-image frames of the moving image content. The moving image content with the decoration image combined is stored in the RAM 33.

The moving image content with the decoration image combined by the composition processing unit 43 is read from the RAM 33 for each predetermined span on which the series of processes (face detection → facial expression recognition → motion detection → voice extraction → voice emotion recognition → emotion determination → decoration image acquisition → image composition) has been completed.

The moving image content read from the RAM 33 is transmitted to the client terminal 13 via the Internet 12. The moving image content transmitted to the client terminal 13 is stored in the RAM 23.

The moving image content transmitted to the client terminal 13 is read from the RAM 23 and displayed on the monitor 15 in real time.

As described above, the emotion is determined comprehensively from the person's movement and voice emotion as well as the facial expression, and a decoration image corresponding to the determined emotion is combined with the content, so a decoration image suited to the moving image content can be combined. In addition, by storing in the decoration image DB 46 a decoration image that is unrelated to an emotion in correspondence with that emotion, the moving image content can be processed into content with a new and surprising story (see FIG. 8).

[Second Embodiment]
The moving image content processing apparatus of the first embodiment combines decoration images that are unrelated to the gender and age of the person shown. For example, if a speech-balloon decoration image containing an old man's line is combined with moving image content showing a young girl, the girl appears to speak the old man's line, which strikes viewers as unnatural. Admittedly, when a speech-balloon decoration image containing a woman's line is combined with moving image content showing a man, the man appears to speak the woman's line, and the mismatch can amuse viewers with its unexpectedness, for example by suggesting that the man is transgender. However, combining decoration images unrelated to gender and age for every person merely creates a sense of incongruity and is not desirable. The moving image content processing apparatus of the second embodiment, described next, therefore combines decoration images that match the gender and age of the person shown, so that viewers do not feel a sense of incongruity.

In FIG. 10, the moving image content processing apparatus of the second embodiment is implemented by installing the moving image content processing program 44 on the server 61. A gender estimation unit 62, an age estimation unit 63, and other units are connected to the CPU 31 via the data bus 32.

The decoration image DB 46 stores a plurality of types of decoration images in correspondence with combinations of emotion, gender, and age.

The gender estimation unit 62 estimates the gender of the person whose face was detected by the face detection unit 36. Specifically, it estimates the gender from the face detected by the face detection unit 36, the facial expression recognized by the facial expression recognition unit 37, the voice extracted by the voice extraction unit 39, and the voice emotion recognized by the voice emotion recognition unit 40. The gender is estimated using a technique such as that disclosed in Japanese Patent Application Laid-Open No. 2007-080057; see that publication for details.

The age estimation unit 63 estimates the age of the person whose face was detected by the face detection unit 36. Specifically, it estimates the age from the face detected by the face detection unit 36, the facial expression recognized by the facial expression recognition unit 37, the voice extracted by the voice extraction unit 39, and the voice emotion recognized by the voice emotion recognition unit 40. The age is estimated using a technique such as that disclosed in Japanese Patent Application Laid-Open No. 2007-080057; see that publication for details.

The decoration image acquisition unit 42 accesses the decoration image DB 46 and acquires the decoration image corresponding to the combination of the emotion determined by the emotion determination unit 41, the gender estimated by the gender estimation unit 62, and the age estimated by the age estimation unit 63. Description of components that are the same as in the first embodiment is omitted.

Next, the processing procedure performed when the server 61 configured as described above (see FIG. 10) functions as the moving image content processing apparatus is described with reference to the flowchart of FIG. 11. Description of processing steps that are the same as in the first embodiment is omitted.

The face detected by the face detection unit 36, the facial expression recognized by the facial expression recognition unit 37, the voice extracted by the voice extraction unit 39, and the voice emotion recognized by the voice emotion recognition unit 40 are read from the RAM 33 into both the gender estimation unit 62 and the age estimation unit 63. The gender estimation unit 62 estimates the gender of the person whose face was detected by the face detection unit 36. The estimated gender is stored in the RAM 33. Once the gender of a person has been estimated, the processing by the gender estimation unit 62 may be omitted for that person thereafter.

The age estimation unit 63 estimates the age of the person whose face was detected by the face detection unit 36. The estimated age is stored in the RAM 33. Once the age of a person has been estimated, the processing by the age estimation unit 63 may be omitted for that person thereafter.
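The option of skipping gender and age estimation for a person who has already been processed amounts to caching the first estimate per person. The sketch below assumes that each person can be given a stable identifier across spans. The text does not say how persons are tracked, so `person_id`, the `AttributeCache` class, and the estimator callback are hypothetical.

```python
from typing import Callable, Dict, Hashable, Tuple

Attributes = Tuple[str, int]  # (gender, age)


class AttributeCache:
    """Remembers the first gender/age estimate made for each person."""

    def __init__(self, estimator: Callable[[Hashable], Attributes]) -> None:
        # estimator stands in for the gender and age estimation units.
        self._estimator = estimator
        self._known: Dict[Hashable, Attributes] = {}

    def get(self, person_id: Hashable) -> Attributes:
        """Return the stored estimate, running the estimator only once."""
        if person_id not in self._known:
            self._known[person_id] = self._estimator(person_id)
        return self._known[person_id]
```

Because the estimator runs at most once per identifier, later spans that show the same person reuse the stored result.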

The emotion determined by the emotion determination unit 41, the gender estimated by the gender estimation unit 62, and the age estimated by the age estimation unit 63 are read from the RAM 33 into the decoration image acquisition unit 42. The decoration image acquisition unit 42 acquires, from the decoration image DB 46, the decoration image corresponding to the combination of the determined emotion, the estimated gender, and the estimated age.
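Acquisition by combination can be sketched as a lookup keyed by emotion, gender, and age. The text does not say how ages are grouped in the decoration image DB 46, so the age brackets, the file names, and the names `age_group`, `DECORATION_DB`, and `acquire_decoration` below are all hypothetical.

```python
from typing import Dict, Optional, Tuple


def age_group(age: int) -> str:
    """Map an estimated age to a coarse bracket (assumed thresholds)."""
    if age < 13:
        return "child"
    if age < 60:
        return "adult"
    return "senior"


# Key: (emotion, gender, age bracket) -> decoration image file name.
# All entries are illustrative placeholders.
DECORATION_DB: Dict[Tuple[str, str, str], str] = {
    ("happy", "female", "child"): "balloon_girl_happy.png",
    ("happy", "male", "senior"): "balloon_old_man_happy.png",
}


def acquire_decoration(emotion: str, gender: str, age: int,
                       db: Dict[Tuple[str, str, str], str] = DECORATION_DB
                       ) -> Optional[str]:
    """Return the decoration registered for the combination, if any."""
    return db.get((emotion, gender, age_group(age)))
```

Under these assumptions a happy 8-year-old girl and a happy 70-year-old man receive different speech balloons, which is the mismatch the second embodiment is meant to avoid.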

The moving image content with the decoration image combined by the composition processing unit 43 is read from the RAM 33 for each predetermined span on which the series of processes (face detection → facial expression recognition → motion detection → voice extraction → voice emotion recognition → emotion determination → gender estimation → age estimation → decoration image acquisition → image composition) has been completed.

As described above, decoration images that match the gender and age of the person shown are combined. The result is therefore never moving image content in which, for example, a young girl speaks an old man's line, and viewers do not feel a sense of incongruity.

In each of the above embodiments, the moving image content processing apparatus is built on the server 11 connected to the Internet 12 and is accessible to anyone, but the invention is not limited to this. For example, the moving image content processing apparatus may be built on a personal computer used by an individual. A digital camera capable of shooting moving images may also be provided with the moving image content processing function.

In each of the above embodiments, every predetermined span is processed, but the processing may instead be performed only when the emotion changes. In that case, the decoration image acquisition unit 42 and the composition processing unit 43 execute their processing only when the emotion determined by the emotion determination unit 41 has changed.
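The variation in which processing runs only when the emotion changes can be sketched as a comparison of each span's emotion with that of the preceding span. The function name `spans_to_process` and the use of `None` for "no emotion determined yet" are hypothetical.

```python
from typing import List, Optional, Sequence


def spans_to_process(emotions: Sequence[Optional[str]]) -> List[int]:
    """Return the indices of spans whose emotion differs from the previous span.

    The first span counts as a change unless its emotion is None, so a
    decoration is acquired as soon as an emotion is first determined.
    """
    indices: List[int] = []
    previous: Optional[str] = None
    for index, emotion in enumerate(emotions):
        if emotion != previous:
            indices.append(index)
        previous = emotion
    return indices
```

Only the returned spans would trigger decoration image acquisition and composition; the other spans skip both steps.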

In each of the above embodiments, the decoration images are thought balloons showing a person's inner feelings, onomatopoeia and manga symbols, and speech balloons, but the invention is not limited to these. Decoration images suited to occasions such as trips, weddings, school entrance ceremonies, and birthdays may also be used.

In each of the above embodiments, a decoration image of the type selected by the user on the client terminal 13 is combined, but a decoration image of a type selected at random by the moving image content processing apparatus may be combined instead.

In each of the above embodiments, a decoration image that is a still image is combined, but moving images, audio, or other content may be combined instead.

The moving image content processing apparatus shown in each of the above embodiments is merely an example and may be modified as appropriate in any manner that does not depart from the gist of the present invention.

FIG. 1 is a schematic diagram showing the configuration of the network system.
FIG. 2 is a block diagram showing the internal configuration of the client terminal.
FIG. 3 is a block diagram showing the internal configuration of the server.
FIG. 4 is an explanatory diagram showing the structure of the emotion table.
FIG. 5 is a list of the decoration images stored in the decoration image DB.
FIG. 6 is a diagram explaining the process of combining a decoration image of the first type with a still-image frame of the moving image content.
FIG. 7 is a diagram explaining the process of combining a decoration image of the second type with a still-image frame of the moving image content.
FIG. 8 is a diagram explaining the process of combining a decoration image of the third type with a still-image frame of the moving image content.
FIG. 9 is a flowchart explaining the moving image content processing procedure.
FIG. 10 is a block diagram showing the internal configuration of a server of another embodiment.
FIG. 11 is a flowchart explaining the moving image content processing procedure of another embodiment.

Explanation of symbols

11, 61 Server (moving image content processing apparatus)
36 Face detection unit
37 Facial expression recognition unit
38 Motion detection unit
39 Voice extraction unit
40 Voice emotion recognition unit
41 Emotion determination unit
42 Decoration image acquisition unit
43 Composition processing unit
44 Moving image content processing program
45 Emotion database (emotion DB)
46 Decoration image database (decoration image DB)
51 Emotion table
62 Gender estimation unit
63 Age estimation unit

Claims

A moving image content processing apparatus comprising:
a face detection unit that detects a face from moving image content;
a facial expression recognition unit that recognizes the expression of the face detected by the face detection unit;
a motion detection unit that detects, from the moving image content, the movement of the person whose face is detected by the face detection unit;
a voice extraction unit that extracts, from the moving image content, the voice of the person whose face is detected by the face detection unit;
an emotion determination unit that determines the emotion of the person whose face is detected by the face detection unit from the facial expression recognized by the facial expression recognition unit, the movement of the person detected by the motion detection unit, and the voice extracted by the voice extraction unit; and
a processing unit that processes the moving image content based on the emotion determined by the emotion determination unit.

The moving image content processing apparatus according to claim 1, further comprising:
a voice emotion recognition unit that recognizes a voice emotion from the voice extracted by the voice extraction unit; and
an emotion database storing an emotion table in which combinations of facial expression, movement of a person, and voice emotion are stored in correspondence with emotions,
wherein the emotion determination unit uses the emotion table to determine that the emotion corresponding to the combination of the facial expression recognized by the facial expression recognition unit, the movement of the person detected by the motion detection unit, and the voice emotion recognized by the voice emotion recognition unit is the emotion of the person whose face is detected by the face detection unit.

The moving image content processing apparatus according to claim 1, further comprising:
a decoration content database that stores, in correspondence with emotions, decoration content for decorating moving image content; and
a decoration content acquisition unit that acquires, from the decoration content database, the decoration content corresponding to the emotion determined by the emotion determination unit,
wherein the processing unit is a composition processing unit that combines the decoration content acquired by the decoration content acquisition unit with the moving image content.

The moving image content processing apparatus according to claim 3, wherein a plurality of types of the decoration content are stored in the decoration content database in correspondence with emotions, and
the decoration content acquisition unit acquires the decoration content of the type that has been input.

The moving image content processing apparatus according to claim 1, further comprising a gender estimation unit that estimates the gender of the person whose face is detected by the face detection unit from the face detected by the face detection unit and the voice extracted by the voice extraction unit,
wherein the processing unit processes the moving image content based on the gender estimated by the gender estimation unit.

The moving image content processing apparatus according to claim 1, further comprising an age estimation unit that estimates the age of the person whose face is detected by the face detection unit from the face detected by the face detection unit and the voice extracted by the voice extraction unit,
wherein the processing unit processes the moving image content based on the age estimated by the age estimation unit.

A moving image content processing method comprising:
a face detection step in which a face detection unit detects a face from moving image content;
a facial expression recognition step in which a facial expression recognition unit recognizes the expression of the face detected in the face detection step;
a motion detection step in which a motion detection unit detects, from the moving image content, the movement of the person whose face was detected in the face detection step;
a voice extraction step in which a voice extraction unit extracts, from the moving image content, the voice of the person whose face was detected in the face detection step;
an emotion determination step in which an emotion determination unit determines the emotion of the person whose face was detected in the face detection step from the facial expression recognized in the facial expression recognition step, the movement of the person detected in the motion detection step, and the voice extracted in the voice extraction step; and
a processing step in which a processing unit processes the moving image content based on the emotion determined in the emotion determination step.

A moving image content processing program that causes a computer to execute:
a face detection step of detecting a face from moving image content;
a facial expression recognition step of recognizing the expression of the face detected in the face detection step;
a motion detection step of detecting, from the moving image content, the movement of the person whose face was detected in the face detection step;
a voice extraction step of extracting, from the moving image content, the voice of the person whose face was detected in the face detection step;
an emotion determination step of determining the emotion of the person whose face was detected in the face detection step from the facial expression recognized in the facial expression recognition step, the movement of the person detected in the motion detection step, and the voice extracted in the voice extraction step; and
a processing step of processing the moving image content based on the emotion determined in the emotion determination step.