JP2007097047A

JP2007097047A - Contents editing apparatus, contents editing method and contents editing program

Info

Publication number: JP2007097047A
Application number: JP2005286399A
Authority: JP
Inventors: Takashi Hiuga; 崇日向; Toshinori Nagahashi; 敏則長橋
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2005-09-30
Filing date: 2005-09-30
Publication date: 2007-04-12

Abstract

PROBLEM TO BE SOLVED: To produce a summary video that better reflects an individual's sensibility.

SOLUTION: The facial expressions of a viewer watching video content are captured and the viewer's uttered voice is collected, and the viewer's emotion is inferred from them. Similarly, the emotion of a performer is inferred from the performer's facial expressions and voice in the video content. Frames in which the expression-based emotions of the performer and the viewer match, and frames in which their voice-based emotions match, are defined as parts with which the viewer empathized. The frames of these parts are extracted, and a video composed of them is stored as a summary video in a predetermined storage area in association with the title of the source video content.

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a content editing apparatus, a content editing method, and a content editing program for creating a summary of content.

Methods proposed so far for editing video content include one that detects breaks for summarizing the content and edits with attention to scene changes, and one that automatically recognizes persons and the like in the video in order to manage the content and edits on that basis. Methods that use keywords (metadata) describing the video content for content search have also been proposed. As the amount of content grows, the need for various editing methods, such as efficient search and summary video generation, is increasing.

Although various editing methods have thus been proposed, users rarely add keywords (metadata) describing the video content to home videos they shoot themselves, so the information available for editing is limited. Adding various other kinds of information has therefore been devised.
For example, Patent Document 1 proposes a search method that searches information content such as video and audio on the basis of emotions analyzed from biological information. This method uses sensors and the like to analyze the viewer's psychological state, such as emotion and mood, and attempts to search content by emotion information.
Patent Document 1: JP 2005-124909 A

However, the above method merely treats biological information as emotion information and evaluates only the viewer's emotion in scenes that interest the viewer. A more effective editing method using subjective information such as emotion has therefore been desired.
The present invention addresses this unsolved problem of the conventional art, and its object is to provide a content editing apparatus, a content editing method, and a content editing program that make more effective use of an individual's sensibility.

To solve the above problem, a content editing apparatus according to the present invention creates a summary of content in which at least a person appears, and comprises: performer emotion estimation means for estimating the emotion of a performer in the content; viewer emotion estimation means for estimating the emotion of a viewer while the viewer is viewing the content; high-empathy scene estimation means for estimating, on the basis of the emotion information of the performer and the viewer estimated by the performer emotion estimation means and the viewer emotion estimation means, scenes of the content in which the viewer's degree of empathy with the performer is high; and summary content generation means for generating summary content of the content using the high-empathy scenes estimated by the high-empathy scene estimation means.

With this configuration, the emotions of the performer and of the viewer are estimated, and the summary content is generated from high-empathy scenes in which the viewer empathizes with the performer. Summary content composed of scenes that suit the viewer's sensibility can therefore be created, and by referring to it the outline of the original content can be grasped easily and accurately.

In the above content editing apparatus, the high-empathy scene estimation means estimates, as a high-empathy scene, a scene in which the performer's emotion estimated by the performer emotion estimation means matches the viewer's emotion estimated by the viewer emotion estimation means.
With this configuration, because a scene in which the performer's emotion matches the viewer's emotion is estimated as a high-empathy scene, the scenes with which the viewer strongly empathizes can be estimated easily and accurately.

In the above content editing apparatus, the content includes either a video signal or an audio signal of the person, and the performer emotion estimation means estimates the performer's emotion on the basis of either the video signal or the audio signal.
With this configuration, because the performer's emotion is estimated from either the video signal or the audio signal of the performer included in the content, the performer's emotion can be detected easily.

In the above content editing apparatus, the viewer emotion estimation means comprises at least one of imaging means for capturing the viewer's facial expression, sound collection means for collecting the voice uttered by the viewer, and biological information detection means for detecting the viewer's biological information, and estimates the viewer's emotion on the basis of any of the imaging information from the imaging means, the sound collection information from the sound collection means, and the detection information from the biological information detection means.

With this configuration, because the viewer's emotion is estimated from at least one of the viewer's facial expression captured by the imaging means, the voice uttered by the viewer and collected by the sound collection means, and the viewer's biological information detected by the biological information detection means, the viewer's emotion can be estimated easily.
In the above content editing apparatus, the content includes a video signal and an audio signal of the person, and the performer emotion estimation means estimates, for the performer, an emotion based on the video signal and an emotion based on the audio signal. The viewer emotion estimation means comprises imaging means for capturing the viewer's facial expression and sound collection means for collecting the voice uttered by the viewer, and estimates, for the viewer, an emotion based on the imaging information from the imaging means and an emotion based on the sound collection information from the sound collection means. The high-empathy scene estimation means estimates, as high-empathy scenes, scenes in which the performer's emotion based on the video signal matches the viewer's emotion based on the imaging information, and scenes in which the performer's emotion based on the audio signal matches the viewer's emotion based on the sound collection information.

With this configuration, two kinds of scenes are estimated as high-empathy scenes: scenes in which outwardly visible emotions match (the performer's emotion based on the video signal and the viewer's emotion based on the imaging information), and scenes in which emotions evident from the voice match (the performer's emotion based on the audio signal and the viewer's emotion based on the viewer's audio signal). Comparing expressions of the same kind of emotion allows high-empathy scenes to be estimated more accurately. Because both kinds of emotion are considered, high-empathy scenes can also be estimated accurately even when the emotion cannot be determined from appearance alone or from the audio signal alone.

In the above content editing apparatus, the performer emotion estimation means comprises facial expression recognition means for recognizing the performer's facial expression from the video signal, and estimates the emotion from the performer's facial expression recognized by the facial expression recognition means.
With this configuration, because the performer's facial expression is recognized from the video signal and the performer's emotion is estimated from it, the performer's emotion can be estimated easily and accurately.

In the above content editing apparatus, the performer emotion estimation means comprises analysis means for analyzing, from the audio signal, the vibration frequency that represents the vibration of the performer's laryngeal source sound, and estimates the performer's emotion on the basis of the analysis result of the analysis means.
With this configuration, because the vibration frequency of the laryngeal source sound produced by the performer is analyzed from the audio signal and the performer's emotion is estimated from it, the performer's emotion can be estimated easily and accurately.

In the above content editing apparatus, the viewer emotion estimation means comprises facial expression recognition means for recognizing the viewer's facial expression from the imaging information, and estimates the emotion from the viewer's facial expression recognized by the facial expression recognition means.
With this configuration, because the viewer's facial expression is recognized from the imaging information obtained by imaging the viewer and the viewer's emotion is estimated from it, the viewer's emotion can be estimated easily and accurately.

In the above content editing apparatus, the viewer emotion estimation means comprises analysis means for analyzing, from the sound collection information, the vibration frequency that represents the vibration of the viewer's laryngeal source sound, and estimates the viewer's emotion on the basis of the analysis result of the analysis means.
With this configuration, because the vibration frequency of the laryngeal source sound produced by the viewer is analyzed from the sound collection information collected by the sound collection means and the viewer's emotion is estimated from it, the viewer's emotion can be estimated easily and accurately.

A content editing method according to the present invention creates a summary of content in which at least a person appears. The method detects the emotion of a performer in the content, estimates the emotion of a viewer while the viewer is viewing the content, identifies, on the basis of the estimated emotions of the performer and the viewer, scenes of the content in which the viewer empathizes with the performer, and creates summary content of the content using the identified scenes.

With this configuration, scenes in which the viewer empathizes with the performer are identified and used to create the summary content, so summary content that suits the viewer's sensibility can be created, and by referring to it the outline of the original content can be grasped easily and accurately.
A content editing program according to the present invention causes a computer to execute: a step of estimating the emotion of a performer in content; a step of estimating the emotion of a viewer while the viewer is viewing the content; a step of identifying, on the basis of the emotions of the performer and the viewer, scenes of the content in which the viewer empathizes with the performer; and a step of creating summary content of the content using the identified scenes.

With this configuration, scenes in which the viewer empathizes with the performer are identified and used to create the summary content, so summary content that suits the viewer's sensibility can be created, and by referring to it the outline of the original content can be grasped easily and accurately.

An embodiment of the present invention is described below.
FIG. 1 is a functional block diagram showing an example of the content editing apparatus of the present invention.
As shown in FIG. 1, the content editing apparatus comprises: playback means 1 that reads the content information of a designated video content from among content files, such as home videos, stored in a predetermined storage area such as an auxiliary storage device, and plays it back; a content performer emotion estimation unit 2 that estimates the emotion of a performer in the designated video content; a viewer emotion estimation unit 3 that estimates the emotion of a viewer who views the designated video content; empathy calculation means 4 that calculates the degree of empathy from the performer's emotion estimated by the content performer emotion estimation unit 2 and the viewer's emotion estimated by the viewer emotion estimation unit 3; and summary video creation means 5 that extracts the parts with a high degree of empathy, as calculated by the empathy calculation means 4, and generates a summary video from them.

As shown in FIG. 1, the content performer emotion estimation unit 2 comprises: video/audio signal input means 11 for taking in the video signal and audio signal of the video content designated by the viewer from among files, such as home videos, stored in a predetermined storage area such as an auxiliary storage device; face detection means 12 for extracting the face image of a performer in the video content from the video signal input by the video/audio signal input means 11; facial expression recognition means 13 for analyzing the face image extracted by the face detection means 12 and recognizing its expression; and voice emotion estimation means 14 for estimating the emotion of the performer from the audio signal input by the video/audio signal input means 11.

The face detection means 12 extracts a face image from the video signal by a known procedure. To represent a face, for example, face image feature values generated from luminance or edges are used. One usable method is described in Japanese Unexamined Patent Application Publication No. H09-50528: for an input image, the presence or absence of a human skin-color region is first determined, a mosaic size is determined automatically for the skin-color region, and the candidate region is converted into a mosaic. The distance to a human-face dictionary is then calculated to determine whether a human face is present, and the face is cut out. This reduces erroneous extraction caused by the background and the like, and finds human faces in an image automatically and efficiently. The face detection means 12 can also be realized by known statistical techniques such as multilayer neural networks, support vector machines, and discriminant analysis. Alternatively, as described in the technical paper "Detection of faces independent of background and face direction, and estimation of face direction" (IEICE Technical Report, PRMU2001-217, pp. 87-94 (2002-01), Yuichi Araki, Nobutaka Shimada, Yoshiaki Shirai), face parts may be obtained with associated probabilities and the face detected by a relaxation method. In this embodiment, estimation of the face direction is unnecessary.
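The mosaic-and-dictionary check described above can be illustrated with a minimal Python sketch. It is not the procedure of the cited publication: the skin-color test and the automatic choice of mosaic size are omitted, and the function names, the fixed `block` size, the Euclidean distance, and the `threshold` are all assumptions made for illustration.

```python
import math

def mosaic(gray, block):
    """Average a 2-D grayscale image (list of rows) over block x block tiles."""
    rows, cols = len(gray) // block, len(gray[0]) // block
    out = []
    for r in range(rows):
        row = []
        for c in range(cols):
            tile = [gray[r * block + i][c * block + j]
                    for i in range(block) for j in range(block)]
            row.append(sum(tile) / len(tile))
        out.append(row)
    return out

def distance(a, b):
    """Euclidean distance between two equally sized mosaics."""
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b) for x, y in zip(ra, rb)))

def looks_like_face(candidate, face_dictionary, block, threshold):
    """True if the mosaicked candidate region is close to any dictionary entry.

    face_dictionary is a hypothetical list of reference mosaics; threshold is
    an assumed tuning parameter, not a value taken from the source.
    """
    m = mosaic(candidate, block)
    return min(distance(m, entry) for entry in face_dictionary) <= threshold
```

A candidate region would be accepted as a face only when its coarse mosaic lies within the threshold of at least one reference mosaic, which is one simple way to suppress background regions that merely share skin-like colors.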

The facial expression recognition means 13 classifies the expression of the face image extracted by the face detection means 12 using a known statistical technique such as discriminant analysis. For example, a large number of face images corresponding to Ekman's six basic expressions ("anger", "fear", "sadness", "joy", "disgust", and "surprise") are collected in advance. On the basis of these collected images, the expression of the face image extracted by the face detection means 12 is classified, yielding emotion information for the content performer.

Facial expression detection can also be realized by constructing a classifier that classifies expressions through learning with a neural network.
Alternatively, as described in "Facial expression recognition device" of Japanese Unexamined Patent Application Publication No. H08-249453, feature vectors of the face image may be extracted and quantized by category, such as joy, anger, sorrow, and pleasure, to recognize the expression of each category.
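As a rough illustration of classifying a face feature vector into expression categories from pre-collected examples, the sketch below uses nearest-centroid classification in Python. This is a stand-in chosen for brevity, not the discriminant analysis, neural network, or quantization scheme of the cited publications; the feature vectors, the `train` and `classify` names, and the two-dimensional toy data are assumptions.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(samples):
    """samples: dict mapping an expression label to its example feature vectors.

    Returns one representative vector (centroid) per expression label.
    """
    return {label: centroid(vecs) for label, vecs in samples.items()}

def classify(codebook, feature):
    """Return the label whose centroid is nearest to the feature vector."""
    return min(codebook, key=lambda label: math.dist(codebook[label], feature))

# Toy usage with hypothetical 2-D features for two of the six expressions.
codebook = train({
    "joy": [[0.9, 0.1], [0.8, 0.2]],
    "sadness": [[0.1, 0.9], [0.2, 0.8]],
})
label = classify(codebook, [0.7, 0.3])
```

In practice the feature vectors would come from the face image (for example from luminance or edge features), and all six expression categories would need training examples.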

The voice emotion estimation means 14 performs a known, general voice-based emotion analysis using the vibration frequency that represents the vibration of the laryngeal source sound. This vibration frequency generally tends to rise when the speaker is excited and to fall in expressions of sadness or affection, so emotions are classified by a statistical technique based on the mean and standard deviation of the vibration frequency. Alternatively, as described in "Emotion recognition device" of Japanese Unexamined Patent Application Publication No. H05-12023, the normalized deviation between a feature signal derived from the audio signal and reference data may be calculated, and the emotion judged from the duration of the deviation, its direction, the signal level, the speed of rise, and the like.
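The idea of classifying by the mean and standard deviation of the vibration frequency can be sketched in Python as follows. The source does not specify the statistical method, so this is only one possible reading: the segment's mean frequency is compared with the speaker's neutral baseline in units of the baseline standard deviation. The input format (a list of per-frame frequency values in Hz), the labels "excited", "subdued", and "neutral", and the cutoff `k` are assumptions.

```python
from statistics import mean, stdev

def pitch_features(f0_hz):
    """Mean and sample standard deviation of per-frame vibration frequencies."""
    return mean(f0_hz), stdev(f0_hz)

def classify_voice(segment_f0, neutral_f0, k=1.0):
    """Label a speech segment relative to the speaker's neutral baseline.

    A mean well above the baseline is labeled "excited"; a mean well below it
    is labeled "subdued" (standing in for sadness or affection).
    Assumes the baseline values are not all identical (non-zero deviation).
    """
    base_mean, base_std = pitch_features(neutral_f0)
    seg_mean, _ = pitch_features(segment_f0)
    z = (seg_mean - base_mean) / base_std
    if z > k:
        return "excited"
    if z < -k:
        return "subdued"
    return "neutral"
```

Extracting the vibration frequency from raw audio is a separate signal-processing step that is outside this sketch.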

The content performer's emotion based on the video signal, detected by the facial expression recognition means 13, and the content performer's emotion based on the audio signal, detected by the voice emotion estimation means 14, are each managed in association with timing information that identifies the point in the video content to which the emotion corresponds.
The viewer emotion estimation unit 3, for its part, comprises: video/audio signal input means 21 for capturing the viewer's facial expression while the viewer is viewing the designated video content and for inputting the voice uttered by the viewer; face detection means 22 for extracting the viewer's face image from the video signal input by the video/audio signal input means 21; facial expression recognition means 23 for analyzing the face image extracted by the face detection means 22 and recognizing its expression; and voice emotion estimation means 24 for recognizing the viewer's emotion from the audio signal input by the video/audio signal input means 21.

The video/audio signal input means 21 comprises, for example, imaging means such as a CCD (charge-coupled device) image sensor for recognizing the viewer's facial expression, and sound collection means such as a microphone for collecting the viewer's voice.
The face detection means 22 is configured in the same way as the face detection means 12 of the content performer emotion estimation unit 2, and extracts the viewer's face image from the video signal obtained by imaging the viewer with the imaging means such as a CCD. The facial expression recognition means 23 then recognizes the expression from the viewer's face image detected by the face detection means 22, by the same procedure as the facial expression recognition means 13 of the content performer emotion estimation unit 2, and obtains the viewer's emotion information.

The voice emotion estimation means 24 analyzes audio signals detected by the sound collection means such as a microphone, including the viewer's speech, laughter, and crying, by the same procedure as the voice emotion estimation means 14 of the content performer emotion estimation unit 2, and estimates the viewer's emotion based on the audio signal.
The emotion estimated from the viewer's facial expression recognized by the facial expression recognition means 23, and the viewer's emotion estimated from the audio signal by the voice emotion estimation means 24, are each managed in association with timing information that indicates the point in the video content at which the emotion occurred.

The empathy calculation means 4 compares the content performer's emotion information estimated by the facial expression recognition means 13 and the voice emotion estimation means 14 of the content performer emotion estimation unit 2 with the viewer's emotion information estimated by the facial expression recognition means 23 and the voice emotion estimation means 24 of the viewer emotion estimation unit 3, and calculates the degree of agreement between the performer's emotion and the viewer's emotion, that is, the degree of empathy. In this calculation, when the performer's emotion and the viewer's emotion at matching timing information agree, the emotions are regarded as shared and the degree of empathy is judged to be high. Conversely, when neither the emotions based on the face images nor the emotions based on the audio signals agree, the degree of empathy is judged to be low.
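The comparison performed by the empathy calculation means 4 can be sketched as a join on timing information followed by a per-modality comparison. In this Python sketch the dictionary layout, the modality keys "face" and "voice", the use of `None` for "no emotion estimated", and the function name are all assumptions; the source only states that empathy is high when emotions at matching timing agree and low when neither the face-based nor the voice-based emotions agree.

```python
def empathy_by_frame(performer, viewer):
    """Judge empathy for each timing key present in both records.

    performer, viewer: dict mapping a timing key (e.g. a frame number) to
    {"face": emotion or None, "voice": emotion or None}.
    Returns a dict mapping the timing key to True (high) or False (low).
    """
    result = {}
    for key in sorted(performer.keys() & viewer.keys()):
        p, v = performer[key], viewer[key]
        # Compare like with like: face against face, voice against voice.
        result[key] = any(
            p.get(modality) is not None and p.get(modality) == v.get(modality)
            for modality in ("face", "voice")
        )
    return result
```

Treating a missing estimate as "no match" means that a frame in which only one modality could be evaluated can still be judged high-empathy if that modality agrees.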

On the basis of the result from the empathy calculation means 4, the summary video creation means 5 extracts the video signal of the video content at the points where the degree of empathy is judged to be high, generates a file composed of these video signals as the summary video, and stores it in a predetermined storage area such as an auxiliary storage device.
Specifically, as shown in FIG. 2, the content editing apparatus can be realized as a computer 100 and a program that the computer executes.
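The extraction step of the summary video creation means 5 can be sketched as follows: collect the frames judged high-empathy, group them into contiguous segments, and concatenate those segments. This Python sketch works on frame numbers and an in-memory frame list rather than a real video file. The function names and the optional `max_gap` parameter, which bridges short interruptions and therefore pulls in a few frames not judged high-empathy, are assumptions that do not come from the source.

```python
def high_empathy_segments(empathy, max_gap=0):
    """Group high-empathy frame numbers into inclusive (start, end) segments.

    empathy: dict mapping frame number to True (high) or False (low).
    max_gap: hypothetical tolerance; gaps of up to this many frames are bridged.
    """
    frames = sorted(f for f, high in empathy.items() if high)
    segments = []
    for f in frames:
        if segments and f - segments[-1][1] <= max_gap + 1:
            segments[-1] = (segments[-1][0], f)   # extend the current segment
        else:
            segments.append((f, f))               # start a new segment
    return segments

def build_summary(frames, segments):
    """Concatenate the frames of each segment, in order, into the summary."""
    return [frames[i] for start, end in segments for i in range(start, end + 1)]
```

With `max_gap=0` the summary contains exactly the frames judged high-empathy, which is the behavior closest to the description above.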

FIG. 2 is a block diagram showing the configuration of the computer 100.
As shown in FIG. 2, the computer 100 comprises: an arithmetic processing unit (CPU) 101 that performs computation and controls the entire system according to a control program; a ROM 102 that stores the control program and the like executed by the arithmetic processing unit 101; a RAM 103 for storing data read from the ROM 102 and the like and results needed during the computation of the arithmetic processing unit 101; and an I/F 104 that mediates data input and output with external devices. These are interconnected by a bus 105, a signal line for transferring data, so that they can exchange data.

Connected to the I/F 104 as external devices are: an auxiliary storage device 106, such as an HDD, which stores video content such as home video files and the summary videos generated by the content editing apparatus; an input device 107, such as a keyboard and mouse, which serves as a human interface for data input; an output device 108, such as a display, which displays video content and its summary video according to the video signal; and a viewer sensor 110 comprising the imaging means for capturing the viewer's facial expression and the sound collection means for collecting the viewer's voice, both described above.

The arithmetic processing unit 101 consists of a microprocessing unit (MPU) or the like. It starts a predetermined processing program stored in a predetermined area of the ROM 102 and generates a summary video of the video content.
FIG. 3 is a flowchart showing an example of the procedure for extracting a face image from the content signal or from the captured image of the viewer and estimating the emotion. Because the face-image-based emotion estimation in the content performer emotion estimation unit 2 and that in the viewer emotion estimation unit 3 follow the same procedure, only the processing of the content signal is described here.

First, the video signal of the video content designated with the input device 107 is read from the auxiliary storage device 106, and the consecutive frames are taken out one at a time (step S1).
The face detection means 12 then performs face detection on the extracted frame. For each detected face, it determines the position in the frame (X and Y coordinates), the size including the contour, and so on (step S2).

Next, the process proceeds to step S3, where the face detection result from step S2 is used to determine whether a face image exists in the frame. If no face image exists, the process returns to step S1 and the next frame is processed in the same way.
If a face image exists in the frame, the process proceeds to step S4. The region of the frame that contains the face is estimated from the position and size of the face image detected in step S2, and the facial expression recognition means 13 performs expression recognition on the image in that region, classifying it into an expression such as anger, fear, sadness, joy, disgust, or surprise. The identification information that specifies the frame is used as the timing information; it is associated with the expression recognition result and stored in a predetermined storage area as the emotion information of the frame (step S5).

Next, the process proceeds to step S6. If emotion information has not yet been detected for all frames of the designated video content, the process returns to step S1 and emotion information is detected for the next frame by the same procedure. When detection has finished for all frames of the designated video content, the process ends.
This completes the expression recognition based on the video signal of the performers appearing in the designated video content. For example, as shown in FIG. 4, when frames f2 to f4, f6, and f7 in a series of frames (FIG. 4(a)) contain face images (FIG. 4(b)), emotion information is detected only for those frames. The emotion of the content performer is identified, for example, as “joy” in frames f2 to f4 and as “surprise” in frames f6 and f7 (FIG. 4(c)).
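The loop of steps S1 to S6 can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the names `detect_face_emotions`, `detect_face`, and `classify_expression` are hypothetical, and the detector and classifier are passed in as callables because the text does not specify how the face detection means 12 or the facial expression recognition means 13 work internally. The demo data mimic the FIG. 4 example under the assumption that the series has eight frames.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical representation of a detected face: (x, y, width, height).
FaceBox = Tuple[int, int, int, int]


def detect_face_emotions(
    frames: Iterable[object],
    detect_face: Callable[[object], Optional[FaceBox]],
    classify_expression: Callable[[object, FaceBox], str],
) -> Dict[int, str]:
    """Return {frame number: emotion} for every frame that contains a face."""
    emotions: Dict[int, str] = {}
    for number, frame in enumerate(frames, start=1):  # S1: one frame at a time
        box = detect_face(frame)                      # S2: face position and size
        if box is None:                               # S3: no face, go to next frame
            continue
        label = classify_expression(frame, box)       # S4: expression recognition
        emotions[number] = label                      # S5: frame number is the timing info
    return emotions                                   # S6: all frames processed


# Stand-in data (assumption): each "frame" is just its ground-truth label or None.
demo_frames = [None, "joy", "joy", "joy", None, "surprise", "surprise", None]
demo_result = detect_face_emotions(
    demo_frames,
    detect_face=lambda frame: (0, 0, 1, 1) if frame is not None else None,
    classify_expression=lambda frame, box: frame,
)
print(demo_result)
```

With these stand-ins, only frames 2 to 4, 6, and 7 receive an emotion label, mirroring FIG. 4(c).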

FIG. 4 schematically shows how the emotion changes across frames. In FIG. 4(b), the hatched frames are frames that contain a face image. In FIG. 4(c), the hatched frames are frames for which emotion estimation was performed, and the type of hatching indicates the type of emotion.
By the same procedure, emotion information is detected frame by frame for the video signal, captured by imaging means such as a CCD, that records the viewer's facial expression while viewing this video content.

FIG. 5 is a flowchart showing an example of the processing procedure for detecting emotion from the audio signal of the content or of the viewer. The content performer emotion estimation unit 2 and the viewer emotion estimation unit 3 estimate emotion from audio signals using the same procedure, so only the processing of the content signal is described here.
First, the audio signal of the video content designated with the input device 107 is read from the auxiliary storage device 106 for all frames at once (step S11). The whole audio signal is read at once because its frequency content is to be analyzed.

Next, the process proceeds to step S12, where the audio signal is treated as frequency components and the voice emotion estimation means 14 analyzes the emotion information. That is, the emotion is estimated from how high or low the vibration frequency of the laryngeal original sound is. The start time and end time of each emotion are also detected.
Next, the process proceeds to step S13, where the start and end times of each audio-based emotion detected in step S12 are used as timing information, and each emotion is stored in a predetermined storage area in association with its timing information.

Emotion information based on the audio signal of the content performer has thus been detected. For example, as shown in FIG. 6, when frequency components characteristic of an emotion appear in the vibration frequency between times t1 and t2 and between times t3 and t4 (FIG. 6(a)), the emotion is identified, for example as “joy” (FIG. 6(b)). In FIG. 6(a), the horizontal axis represents time and the vertical axis represents amplitude. In FIG. 6(b), the type of hatching indicates the type of emotion.
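The timing information of steps S12 and S13 (an emotion with a start time and an end time) can be illustrated with a short Python sketch. The text does not say how the frequency analysis is windowed, so the sketch assumes that some upstream analysis, not shown here, has already produced one emotion label (or `None`) per fixed-length analysis window; the function name `segments_from_windows` and the window length are hypothetical.

```python
from itertools import groupby
from typing import List, Optional, Sequence, Tuple

# (emotion, start time in seconds, end time in seconds)
Segment = Tuple[str, float, float]


def segments_from_windows(
    labels: Sequence[Optional[str]], window_sec: float
) -> List[Segment]:
    """Merge runs of identical per-window labels into timed emotion segments."""
    segments: List[Segment] = []
    position = 0  # index of the first window of the current run
    for label, run in groupby(labels):
        length = len(list(run))
        if label is not None:  # windows without a recognizable emotion are skipped
            start = position * window_sec
            end = (position + length) * window_sec
            segments.append((label, start, end))
        position += length
    return segments


# Assumed per-window labels: two separate "joy" intervals, as in FIG. 6(b).
demo_labels = [None, "joy", "joy", None, "joy", None]
demo_segments = segments_from_windows(demo_labels, window_sec=1.0)
print(demo_segments)
```

Each resulting tuple pairs an emotion with its start and end time, which is the form in which step S13 stores the audio-based emotion information.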

By the same procedure, the emotion of the viewer who viewed this content is estimated from the audio signal collected by sound collecting means such as a microphone, and is stored in a predetermined storage area in association with the start and end times of the emotion.
FIG. 7 is a flowchart showing an example of the summary video creation process, which creates a summary video from the emotion information estimated from the face images and the audio signals.

In this summary video creation process, the emotion information of the content performer and of the viewer, estimated from the face images and the audio signals, is first read from the predetermined storage area (step S21). Next, the process proceeds to step S22. Using the timing information associated with the face-based emotion information, that is, the frame identification information, the frames in which the emotion read from the facial expression of the content performer matches that of the viewer are identified. Similarly, using the timing information associated with the audio-based emotion information, that is, the start and end times of each emotion, the time points at which the emotion read from the audio signal of the content performer matches that of the viewer are identified.

Next, the process proceeds to step S23. The time points at which the face-based emotion information of the content performer and the viewer match, and the time points at which their audio-based emotion information match, are identified, and the co-sensitivity (the viewer's degree of empathy with the performer) at these time points is judged to be high.
When the timing specified by the identification information of a frame in which the face-based emotion information matches coincides with the timing specified by the start and end times of an emotion for which the audio-based emotion information matches, the co-sensitivity at that time point is judged to be the highest.

When neither the face-based emotion information nor the audio-based emotion information of the content performer and the viewer matches, the co-sensitivity at that time point is judged to be low.
Next, the process proceeds to step S24, where the frames at the time points judged in step S23 to have high co-sensitivity are extracted from the content signal in frame order. A video signal consisting of the extracted frames is generated and stored in the auxiliary storage device 106 as the summary video of the designated content, and the process ends.
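The three-level judgment of step S23 and the frame selection of step S24 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the function names are hypothetical, and the audio-based emotion segments are assumed to have already been converted to per-frame labels so that both modalities share one frame index.

```python
from typing import Dict, List


def co_sensitivity_level(face_match: bool, voice_match: bool) -> str:
    """Three-level judgment described for step S23."""
    if face_match and voice_match:
        return "highest"
    if face_match or voice_match:
        return "high"
    return "low"


def _matches(performer: Dict[int, str], viewer: Dict[int, str], frame: int) -> bool:
    """True when both sides have an emotion label for the frame and the labels agree."""
    return frame in performer and performer[frame] == viewer.get(frame)


def select_summary_frames(
    performer_face: Dict[int, str],
    viewer_face: Dict[int, str],
    performer_voice: Dict[int, str],
    viewer_voice: Dict[int, str],
    frame_count: int,
) -> List[int]:
    """Step S24: keep, in frame order, every frame judged 'high' or 'highest'."""
    selected: List[int] = []
    for frame in range(1, frame_count + 1):
        level = co_sensitivity_level(
            _matches(performer_face, viewer_face, frame),
            _matches(performer_voice, viewer_voice, frame),
        )
        if level != "low":
            selected.append(frame)
    return selected


# Small made-up example with five frames.
demo_selected = select_summary_frames(
    performer_face={1: "joy", 2: "joy"},
    viewer_face={2: "joy", 3: "joy"},
    performer_voice={2: "joy", 4: "surprise"},
    viewer_voice={2: "joy", 4: "surprise", 5: "joy"},
    frame_count=5,
)
print(demo_selected)
```

In this made-up example, frame 2 matches in both modalities and frame 4 matches in the voice only, so both are kept while frames 1, 3, and 5 are dropped.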

Next, the operation of the above embodiment will be described.
First, the video content to be summarized is designated with the input device 107 and played back by the playback means 1. During playback, the viewer sensor 110, which includes the imaging means and the sound collecting means, captures the facial expression of the viewer and collects the voice uttered by the viewer, and both are stored in a predetermined storage area.

Next, the emotion information of the content performer is acquired for the content to be summarized. Specifically, the target video content is designated with the input device 107, and the processes of FIG. 3 and FIG. 5 are executed to acquire emotion information from the video signal and the audio signal.
The arithmetic processing unit 101 reads the consecutive frames of the video content one at a time (step S1 in FIG. 3) and performs face detection (step S2). Expression recognition is then performed on the frames in which a face image was detected (step S4).

As a result, as shown schematically in FIG. 8 for example, each frame in which a face image was detected is classified into one of six expressions, namely anger, fear, sadness, joy, disgust, and surprise (FIG. 8(a)), and the result is stored in a predetermined storage area in association with the frame identification information. In the case of FIG. 8(a), frames f12 to f14, f16 to f17, f21 to f25, and f27 to f29 are detected as frames containing a face image; frames f12 to f14 and f21 to f25 are recognized as “joy”, and frames f16 to f17 and f27 to f29 as “surprise”.

From the audio signal, as shown in FIG. 8(b), times t12 to t16 and times t28 to t30 are recognized as “joy”, and times t20 to t23 as “surprise”. In FIG. 8, the frames and the time points are drawn with their timing aligned, so the timing is the same across FIG. 8(a) to FIG. 8(e).
Next, the emotion information of the viewer who viewed this video content is detected in the same way. In this case, the previously collected video and audio signals of the viewer, which are held in the predetermined storage area, are processed, and the resulting emotion information is stored in a predetermined storage area in association with its timing information.

As a result, for the video signal, as shown in FIG. 8(c) for example, frames f13 to f16 and f20 to f23 are recognized as “joy” and frames f26 to f30 as “surprise”. For the audio signal, as shown in FIG. 8(d), times t14 to t15 and time t30 are recognized as “joy”, and times t22 to t23 as “surprise”.
The summary video creation process shown in FIG. 7 is then executed on these detection results for the video-based and audio-based emotion information, and the locations where the emotion of the content performer matches the emotion of the viewer are identified.

In the case of FIG. 8, the emotion information based on the video signal matches as “joy” in frames f13 to f14 and frames f21 to f23 and as “surprise” in frames f27 to f29, as shown in FIG. 8(a) and FIG. 8(c). The emotion information based on the audio signal matches as “joy” at times t14 to t15 and time t30 and as “surprise” at times t22 to t23, as shown in FIG. 8(b) and FIG. 8(d). Therefore, for the video signal, frames f13 to f14 and f21 to f23, in which the emotion of the content performer and the emotion of the viewer match as “joy”, and frames f27 to f29, in which they match as “surprise”, are identified as frames with high co-sensitivity. Similarly, for the audio signal, times t14 to t15 and time t30, at which the emotions match as “joy”, and times t22 to t23, at which they match as “surprise”, are identified as locations with high co-sensitivity.

Accordingly, the frames in which the video-based emotions of the content performer and the viewer match and the time points at which their audio-based emotions match are extracted as locations with high co-sensitivity. That is, as shown in FIG. 8(e), frames f13 to f15, which carry the emotion “joy”, and frames f21 to f23 and f27 to f30, which carry the emotions “joy” and “surprise”, are extracted, and the summary video is generated from them. This summary video is stored in the auxiliary storage device 106 as the summary video of the designated video content. For example, as shown in FIG. 9, it is stored together with the program title that identifies the summarized video content and the recording date and time.
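The FIG. 8 example can be checked with a few lines of Python. The sketch below encodes the emotion labels given for FIG. 8(a) to FIG. 8(d) and reproduces the frame set of FIG. 8(e). It relies on the statement that frames and time points in FIG. 8 are aligned, so time point tN is treated as frame fN; the helper names `span` and `matching` are hypothetical.

```python
from typing import Dict


def span(first: int, last: int, emotion: str) -> Dict[int, str]:
    """Label every index from first to last (inclusive) with one emotion."""
    return {index: emotion for index in range(first, last + 1)}


def matching(performer: Dict[int, str], viewer: Dict[int, str]) -> Dict[int, str]:
    """Indices where performer and viewer carry the same emotion label."""
    return {i: e for i, e in performer.items() if viewer.get(i) == e}


# FIG. 8(a): performer, video signal.
performer_video = {**span(12, 14, "joy"), **span(21, 25, "joy"),
                   **span(16, 17, "surprise"), **span(27, 29, "surprise")}
# FIG. 8(b): performer, audio signal.
performer_audio = {**span(12, 16, "joy"), **span(28, 30, "joy"),
                   **span(20, 23, "surprise")}
# FIG. 8(c): viewer, video signal.
viewer_video = {**span(13, 16, "joy"), **span(20, 23, "joy"),
                **span(26, 30, "surprise")}
# FIG. 8(d): viewer, audio signal.
viewer_audio = {**span(14, 15, "joy"), **span(30, 30, "joy"),
                **span(22, 23, "surprise")}

video_match = matching(performer_video, viewer_video)
audio_match = matching(performer_audio, viewer_audio)

# FIG. 8(e): every index where at least one modality matches.
summary_frames = sorted(set(video_match) | set(audio_match))
print(summary_frames)
```

The union covers frames 13 to 15, 21 to 23, and 27 to 30, which agrees with the frames listed for FIG. 8(e).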

The user can grasp the outline of a video content by reading its summary video from the auxiliary storage device 106 and playing it back with the playback means 1.
As described above, the summary video is composed of the frames in which the emotions of the content performer and the viewer matched. It therefore consists of the portions in which the viewer empathized strongly with the video content, which yields a distinctive summary video that reflects the viewer's own sensibility. By referring to this summary video, the user can quickly recognize the content of the original video content and can quickly search for video content.

Because the emotion information of the content performer and the viewer is estimated from facial expressions and audio signals, locations with high co-sensitivity can be extracted easily.
In addition, extracting locations with high co-sensitivity makes it possible to create a summary video from distinctive scenes, such as scenes in which the emotion of the content performer changes greatly. The content of the video content can therefore be grasped efficiently, and the desired video content can be found easily.

Human emotion does not always appear in both the face and the voice; it may appear in only one of them. As described above, however, emotion is detected not only from face images but also from audio signals, and the locations where the viewer empathizes are extracted based on the emotions obtained from both. Therefore, even when an emotion does not appear in both the face and the voice, the locations where the viewer empathizes can be extracted accurately.

In particular, for home videos and the like shot privately at home, adding metadata such as keywords describing the video content for use in content search is tedious work. Because a summary video can be created automatically as described above, however, the desired home video can be found easily by referring to its summary video.

The above embodiment describes the case in which a location is judged to have high co-sensitivity, and is extracted to form the summary video, when the emotions of the content performer and the viewer match for either the emotion obtained from the face image or the emotion obtained from the audio signal. Alternatively, only the locations with the highest co-sensitivity, where the emotions of the content performer and the viewer match for both the emotion obtained from the face image and the emotion obtained from the audio signal, may be extracted as the summary video.

The above embodiment describes the case in which the viewer's emotion is estimated from the viewer's facial expression and the voice uttered by the viewer. However, any information may be used as long as the emotion estimation means can estimate the viewer's emotion from it. For example, biological information detection means may be provided to detect biological information of the viewer, such as heartbeat, pulse, blood pressure, brain waves, respiration, perspiration, and blinking, and the viewer's emotion may be estimated from the acquired biological information. In this case, emotion information such as the degree of comfort may be measured quantitatively from biological information such as brain waves by a neural network that takes the fluctuation signal of the brain waves as input, as described for example in Japanese Patent No. 3310498, “Biological information analysis apparatus and biological information analysis method”. The estimation results of these plural emotion estimation means may be used as the viewer's emotion, and either any one of the estimation results or a combination of the estimation results of plural emotion estimation means may be used.

The above embodiment also describes the case in which both the face-based emotion and the audio-based emotion are estimated and the locations where the emotion of the content performer matches that of the viewer are extracted from them. Alternatively, only one of the face-based emotion and the audio-based emotion may be estimated and used to extract the locations where the emotions of the content performer and the viewer match, or those locations may be extracted based only on the biological information described above.

The above embodiment also describes the case in which the summary video is generated only from the frames in which the emotions match (the high co-sensitivity frames). Alternatively, for example, a predetermined number of frames before and after each matching frame, including the matching frame itself, or a predetermined number of frames starting from each matching frame, may be extracted and used as the summary video.
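The variation that adds a predetermined number of frames before and after each matching frame can be sketched as follows. The function name `pad_frames` and the parameter `margin` are hypothetical, and the text does not state a value for the predetermined number of frames; the sketch simply clips the padded ranges to the valid frame numbers and removes duplicates.

```python
from typing import Iterable, List


def pad_frames(matched: Iterable[int], margin: int, frame_count: int) -> List[int]:
    """Add `margin` frames before and after every matched frame, clipped to 1..frame_count."""
    padded = set()
    for frame in matched:
        first = max(1, frame - margin)
        last = min(frame_count, frame + margin)
        padded.update(range(first, last + 1))
    return sorted(padded)


# Made-up example: matched frames 5, 6, and 20 in a 21-frame clip, margin of 2 frames.
demo_padded = pad_frames([5, 6, 20], margin=2, frame_count=21)
print(demo_padded)
```

Overlapping ranges (here around frames 5 and 6) merge naturally because the frame numbers are collected in a set before sorting.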
The above embodiment describes the case in which a summary video is generated for home video content, but the invention is not limited to this and can also be applied to other video content such as dramas. It can also be applied not only to video content but to audio content that contains no video. In that case, the locations where the emotion based on the audio signal of the performer in the audio content matches the emotion based on the voice uttered by the viewer who listened to the audio content are extracted as locations with high co-sensitivity and used as the summary audio.

In the above embodiment, the content performer emotion estimation unit 2 in FIG. 1 corresponds to the performer emotion estimation means, the viewer emotion estimation unit 3 corresponds to the viewer emotion estimation means, the co-sensitivity calculation means 4 corresponds to the high co-sensitivity scene estimation means, and the summary video creation means 5 corresponds to the summary content generation means.
The face detection means 12 and facial expression recognition means 13, and the face detection means 22 and facial expression recognition means 23, in FIG. 1 each correspond to the expression detection means, and the voice emotion estimation means 14 and 24 correspond to the vibration frequency analysis means.

FIG. 1 is a functional block diagram showing an example of a content editing apparatus to which the present invention is applied.
FIG. 2 is a diagram showing a hardware configuration that implements the content editing apparatus of the present invention.
FIG. 3 is a flowchart showing the processing flow of the method for detecting emotion information based on a face image.
FIG. 4 is an explanatory diagram of the method for detecting emotion information based on a face image.
FIG. 5 is a flowchart showing the processing flow of the method for detecting emotion information based on an audio signal.
FIG. 6 is an explanatory diagram of the method for detecting emotion information based on an audio signal.
FIG. 7 is a flowchart showing the processing flow of the summary video generation process.
FIG. 8 is an explanatory diagram of the method for generating a summary video.
FIG. 9 is an example of summary videos stored in the auxiliary storage device.

Explanation of symbols

2: content performer emotion estimation unit; 3: viewer emotion estimation unit; 4: co-sensitivity calculation means; 5: summary video creation means; 11, 21: video/audio signal input means; 12, 22: face detection means; 13, 23: facial expression recognition means; 14, 24: voice emotion estimation means; 101: arithmetic processing unit; 106: auxiliary storage device; 110: viewer sensor

Claims

A content editing apparatus that creates a summary of content in which at least a person appears, the apparatus comprising:
performer emotion estimation means for estimating the emotion of a performer in the content;
viewer emotion estimation means for estimating the emotion of a viewer when viewing the content;
high co-sensitivity scene estimation means for estimating, from the content, a scene in which the viewer's co-sensitivity with the performer is high, based on the emotion information of the performer and the viewer estimated by the performer emotion estimation means and the viewer emotion estimation means; and
summary content generation means for generating summary content of the content using the scene with high co-sensitivity estimated by the high co-sensitivity scene estimation means.

The content editing apparatus according to claim 1, wherein the high co-sensitivity scene estimation means estimates, as a scene with high co-sensitivity, a scene in which the performer's emotion estimated by the performer emotion estimation means matches the viewer's emotion estimated by the viewer emotion estimation means.

The content includes either one of the person's video signal and audio signal,
The content editing apparatus according to claim 1 or 2, wherein the performer emotion estimation means estimates the performer's emotion based on either the video signal or the audio signal.

The content editing apparatus according to any one of the above claims, wherein the viewer emotion estimation means includes at least one of imaging means for capturing the viewer's facial expression, sound collection means for collecting the voice uttered by the viewer, and biological information detection means for detecting biological information of the viewer, and
estimates the viewer's emotion based on any one of the imaging information from the imaging means, the sound collection information from the sound collection means, and the detection information from the biological information detection means.

The content includes a video signal and an audio signal of the person,
The performer emotion estimation means estimates the emotion based on the video signal and the emotion based on the audio signal for the performer,
The viewer emotion estimation unit includes an imaging unit that captures an image of the viewer's facial expression and a sound collection unit that collects a sound emitted by the viewer. Guess each emotion based on the sound collection information of the sound collection means,
The high co-sensitivity scene estimation means includes a scene in which an emotion based on the video signal of the performer and an emotion based on the imaging information of the viewer match, an emotion based on the audio signal of the performer, and a sound collection of the viewer The content editing apparatus according to claim 1, wherein a scene with matching emotions based on information is estimated as a scene with high co-sensitivity.

The performer emotion estimation means includes facial expression recognition means for recognizing the performer's facial expression based on the video signal,
6. The content editing apparatus according to claim 3, wherein the emotion is inferred from the facial expression of the performer recognized by the facial expression recognition means.

The performer emotion estimation means includes analysis means for analyzing a vibration frequency representing vibration of the laryngeal original sound of the performer based on the audio signal,
The content editing apparatus according to any one of claims 3, 5, and 6, wherein an emotion of the performer is estimated based on an analysis result of the analysis means.

The viewer emotion estimation means includes facial expression recognition means for recognizing the viewer's facial expression based on the imaging information,
6. The content editing apparatus according to claim 4, wherein the emotion is estimated from a facial expression of the viewer recognized by the facial expression recognition means.

The viewer emotion estimation means includes analysis means for analyzing a vibration frequency representing vibration of the viewer's laryngeal original sound based on the sound collection information,
The content editing apparatus according to any one of claims 4, 5, and 8, wherein the viewer's emotion is estimated based on an analysis result of the analysis means.

A content editing method for creating a summary of content in which at least a person appears, the method comprising:
estimating the emotion of a performer in the content and the emotion of a viewer when viewing the content;
identifying, based on the estimated emotions of the performer and the viewer, a scene in the content in which the viewer empathizes with the performer; and
creating summary content of the content using the identified scene.

A content editing program causing a computer to execute:
a step of estimating the emotion of a performer in content;
a step of estimating the emotion of a viewer when viewing the content;
a step of identifying, based on the emotions of the performer and the viewer, a scene in the content in which the viewer empathizes with the performer; and
a step of creating summary content of the content using the identified scene.