JP2009111938A

JP2009111938A - Device, method and program for editing information, and record medium recorded with the program thereon

Info

Publication number: JP2009111938A
Application number: JP2007284706A
Authority: JP
Inventors: Takeshi Irie; 豪入江; Kota Hidaka; 浩太日高; Takashi Sato; 隆佐藤; Yukinobu Taniguchi; 行信谷口
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2007-11-01
Filing date: 2007-11-01
Publication date: 2009-05-21
Anticipated expiration: 2027-11-01
Also published as: JP4812733B2

Abstract

PROBLEM TO BE SOLVED: To enable automatic editing and editing support by automatically detecting emotional expression in video or audio and, based on the information obtained by the detection, capturing the sections that are in an emotional state.

SOLUTION: An information editing device comprises: a section dividing means 13 that divides one or more pieces of video or audio into predetermined section units; an emotion detection means 14 that calculates, for each divided section, the likelihood of one or more types of emotional expression state by using video information and audio information; an image output means 15 that outputs, for each divided section, a representative image, the video or audio of the section, or both; an image processing means 16 that processes the representative images or attaches information to them based on the likelihood; a layout means 17 that lays out the processed representative images spatially and outputs them to a user; and a generation means 19 that generates video or audio by joining the sections corresponding to at least one representative image selected by the user.

COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to an information editing apparatus for editing video content or audio content, an information editing method, an information editing program, and a recording medium on which the program is recorded.

Beyond conventional broadcasting, the spread of content distribution and sharing sites means that content is increasingly viewed on websites and personal PCs, and the types of content have become highly diverse: movies, dramas, home videos, news, documentaries, music, and so on. As communication and broadcasting converge, the number of users who enjoy viewing content can readily be expected to grow further.

In particular, content sharing sites let individuals publish content they have shot or created and share it with other users. Such sites often limit the size of the content that can be published, and the number of items is enormous. To get more viewers to watch, it is therefore preferable for a user to publish content edited so that it centers on the sections the user wants viewed.

However, most accumulated content is unedited and has never gone through an editing process. This is because editing is difficult for users for two reasons: editing requires equipment and skills, and the user must grasp the content and identify which of its sections are the ones the user wants viewed.

Solving this problem requires an editing technique that achieves the following two points:
1) users without editing equipment or skills can easily perform editing;
2) users can efficiently identify the sections of the content that they want viewed.

As conventional techniques, techniques for automating or supporting content editing are described in, for example, Patent Documents 1 to 3.

Patent Document 1 describes an editing support technique that divides a video into several sections, selects a representative image for each section, and then lays out and narrows down the representative images so that the video content can be grasped at a glance.

Patent Document 2 describes an editing support technique that divides a video into several sections, detects motion such as camera shake and camera work in each section, and presents the sections with stable motion to the user as video material suitable for editing.

With these conventional techniques, dividing the video into several sections makes the sections easy to join, so they are editing support techniques that satisfy point 1) above: users without editing equipment or skills can easily perform editing.

On the other hand, Patent Document 3 describes an editing technique that can reflect what each section of the content contains. It is an automatic editing technique that divides the audio into predetermined unit sections based on the audio information in the content, determines whether each section is speech in an emphasized state, selects the sections in an emphasized state, and joins the selected sections.

With this technique, editing centered on the sections in an emphasized state can be performed automatically based on the audio information of the content.

References for techniques related to the present invention include those described in Non-Patent Documents 1 to 6 below.
Patent Document 1: Japanese Patent Laid-Open No. H11-308567
Patent Document 2: Japanese Patent Laid-Open No. 2000-261757
Patent Document 3: Japanese Patent Laid-Open No. 2005-333205
Non-Patent Document 1: Sadaoki Furui, "Digital Speech Processing", Chapter 4, 4.9 Pitch Extraction, Tokai University Press, pp. 57-59, September 1985
Non-Patent Document 2: Shigeki Sagayama, Fumitada Itakura, "Individuality information contained in dynamic measures of speech", Proceedings of the Acoustical Society of Japan Spring Meeting (Showa 54), 3-2-7, 1979, pp. 589-590
Non-Patent Document 3: Kenichiro Ishii, Naonori Ueda, Eisaku Maeda, Hiroshi Murase, "Easy-to-Understand Pattern Recognition", Ohmsha, pp. 52-54, 1998
Non-Patent Document 4: Naonori Ueda, "Computational Statistics I", Chapter III, 3 EM Method, 4 Variational Bayes Method, Iwanami Shoten, pp. 157-186, June 2003
Non-Patent Document 5: Yoshinobu Tonomura, "Research on a structured video handling mechanism and video usage interface based on video feature indexing", Chapter 3, Video indexing based on image processing, doctoral dissertation, Kyoto University, pp. 15-23, 2006
Non-Patent Document 6: Hideyuki Tamura (ed.), "Computer Image Processing", Ohmsha, pp. 242-247, December 2002

A typical example of a section that a user wants viewed is a section of the video or audio with strong emotional expression. In other words, many users want viewers to watch the joyful sections of a video, and many want viewers to watch the sad sections. A technique that automatically classifies such sections and edits them automatically, or that makes editing easy, is therefore desired.

However, with the conventional techniques above, the information about each section is limited to indicators such as whether the motion is stable or whether the section is in an emphasized state, and no detailed information about the emotional expression in each section is presented. Such users therefore cannot immediately confirm whether a section is one they want viewed, so these techniques are unlikely to be useful editing techniques for them.

The present invention has been made in view of the above problem. Its object is to provide an information editing apparatus, an information editing method, an information editing program, and a recording medium on which the program is recorded that automatically detect emotional expression in video or audio by analyzing one or more pieces of video or audio and that, based on the information obtained by this detection, can perform automatic editing and editing support that capture the sections in an emotional state.

In the present invention, "emotion" also includes affect, atmosphere, impression, and the like. "Audio" includes not only human speech but also singing voice, music, environmental sound, and the like.

To solve the above problem, the invention of claim 1 is an apparatus for editing one or more pieces of video or audio, comprising: section dividing means for dividing the one or more pieces of video or audio into predetermined section units; emotion detection means for obtaining, for each divided section, the likelihood of one or more types of emotional expression state by using at least one of video information and audio information; and automatic generation means for generating video or audio by joining one or more of the sections based on the likelihood.

The invention of claim 2 is an apparatus for editing one or more pieces of video or audio, comprising: section dividing means for dividing the one or more pieces of video or audio into predetermined section units; emotion detection means for obtaining, for each divided section, the likelihood of one or more types of emotional expression state by using at least one of video information and audio information; output means for outputting, for each divided section, a representative image, the video or audio of the section, or both the representative image and the video or audio of the section; image processing means for performing at least one of processing the representative image and attaching information to it, based on the likelihood; layout means for laying out the processed representative images spatially and outputting them to a user; and generation means for generating video or audio by joining the sections corresponding to at least one representative image selected by the user.

The invention of claim 3 is a method for editing one or more pieces of video or audio, comprising: a section dividing step in which section dividing means divides the one or more pieces of video or audio into predetermined section units; an emotion detection step in which emotion detection means obtains, for each divided section, the likelihood of one or more types of emotional expression state by using at least one of video information and audio information; and an automatic generation step in which automatic generation means generates video or audio by joining one or more of the sections based on the likelihood.

The invention of claim 4 is a method for editing one or more pieces of video or audio, comprising: a section dividing step in which section dividing means divides the one or more pieces of video or audio into predetermined section units; an emotion detection step in which emotion detection means obtains, for each divided section, the likelihood of one or more types of emotional expression state by using at least one of video information and audio information; an output step in which output means outputs, for each divided section, a representative image, the video or audio of the section, or both the representative image and the video or audio of the section; an image processing step in which image processing means performs at least one of processing the representative image and attaching information to it, based on the likelihood; a layout step in which layout means lays out the processed representative images spatially and presents them to a user; and a generation step in which generation means generates video or audio by joining the sections corresponding to at least one representative image selected by the user.

The invention of claim 5 is an information editing program that causes a computer to function as each of the means recited in claim 1 or 2.

The invention of claim 6 is a computer-readable recording medium on which the information editing program of claim 5 is recorded.

According to the inventions of claims 1 and 3, the likelihood of an emotional expression state is calculated automatically from video or audio, and edited content can be generated automatically based on it, which enables automatic editing.

According to the inventions of claims 2 and 4, the likelihood of an emotional expression state is calculated automatically from video or audio. Based on this likelihood, the user can judge which section is in which emotional state and generate edited content that includes the desired sections, which enables editing support.

According to the invention of claim 5, an information editing program capable of automatic editing or editing support is provided.

According to the invention of claim 6, the information editing program of claim 5 can be incorporated into a computer in the form of a recording medium on which it is recorded.

(1) According to the inventions of claims 1 and 3, content can be edited automatically based on the likelihood of an emotional expression state, so automatic editing that appropriately includes the sections the user wants viewed can be performed.
(2) According to the inventions of claims 2 and 4, the user can choose the sections to include in the edited content while referring to representative images that have been processed or annotated based on the likelihood of an emotional expression state. The user can therefore identify the sections in an emotional expression state without checking the content of every section one by one, and can efficiently perform editing that includes the sections the user wants viewed, which enables editing support.
(3) According to the invention of claim 5, an information editing program capable of automatic editing or editing support can be provided.
(4) According to the invention of claim 6, the invention of claim 1 or 2 can be realized on a computer.

Through these effects, the invention contributes to the field of information editing.

Embodiments of the present invention are described below with reference to the drawings, but the present invention is not limited to the following embodiments.
[First example of embodiment]
The first example of the embodiment implements the present invention as an apparatus that performs editing support, and corresponds to claims 2 and 4. FIG. 1 shows an example of the information editing apparatus 1 according to this embodiment.

The information editing apparatus 1 comprises: a content input unit 11 that receives the content to be edited; a storage device 12 that stores the video/audio data 10 input from the content input unit 11; a section dividing means 13 that divides the content data into predetermined section units by using at least one of video information and audio information; an emotion detection means 14 that calculates, for each divided section, an emotion level, which is the likelihood of an emotional expression state, by using at least one of video information and audio information; an image output means 15 that determines and outputs a representative image for each section by using at least one of video information and audio information; an image processing means 16 that processes the representative images based on the emotion level; a layout means 17 that lays out the representative images and presents them to the user; a playback means 18 that plays back the section corresponding to a representative image designated by the user; and a generation means 19 that generates edited content by joining the sections corresponding to the representative images selected by the user. Each of these functions is executed by, for example, a computer.

FIG. 2 shows an example of the processing flow of the information editing apparatus 1 according to this embodiment. The detailed operation of the information editing apparatus 1 shown in FIG. 1 is described below following this processing flow.

In step S21, the content input unit 11 accepts video/audio data at the user's request, and in step S22 the data is stored in the storage device 12.

Next, in step S23, the section dividing means 13 divides the video/audio data into predetermined section units based on at least one of video information and audio information.

One way to divide the data into sections is to use media processing techniques.

For example, when video information is used, the cut points of the video are detected, and each part bounded by cut points is defined as a section. In this case the section unit is a shot. The cut points can be detected with, for example, the method described in Non-Patent Document 5.
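The shot-based division above can be sketched as follows. This is a minimal illustration only: the cut criterion (a normalized difference between the histograms of consecutive frames exceeding a threshold) is an assumption chosen for brevity and is not the method of Non-Patent Document 5, and the names `detect_cuts`, `shots_from_cuts`, and `threshold` are hypothetical.

```python
def detect_cuts(histograms, threshold=0.5):
    """Return frame indices at which a new shot is assumed to start.

    histograms: one histogram (list of bin counts) per frame, all with
    the same number of bins and the same total count.
    The criterion is an illustrative assumption, not the cited method.
    """
    cuts = []
    for i in range(1, len(histograms)):
        prev, cur = histograms[i - 1], histograms[i]
        total = sum(prev) or 1
        # Normalized L1 distance in [0, 1] between consecutive frames.
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / (2 * total)
        if diff > threshold:
            cuts.append(i)
    return cuts


def shots_from_cuts(cuts, n_frames):
    """Convert cut positions into (start, end) frame ranges, end exclusive."""
    bounds = [0] + list(cuts) + [n_frames]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]
```

Each returned range is one shot, that is, one section in the sense of step S23.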

When audio information is used, for example, voiced/unvoiced determination is performed using the spectral information of the audio, unvoiced parts that last for a certain time or longer are detected, and each part bounded by such unvoiced parts is defined as a section. In this case the section unit is a continuous voiced section.

Besides spectral information, the voiced/unvoiced determination may also be based on, for example, whether a fundamental frequency is present.

When spectral information is used, one or more frequencies at which the power spectral density peaks are calculated. If these frequencies fall within a predetermined frequency band, the part is judged to be a voiced section; otherwise it is judged to be an unvoiced section. For example, a part may be judged voiced when a peak frequency is observed between 50 and 350 Hz.
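The spectral decision rule above can be illustrated with a short sketch. It assumes that only the single strongest peak of the power spectrum is examined and that the spectrum is computed with a plain discrete Fourier transform over one analysis frame; the function names and the `band` parameter are hypothetical, and only the 50 to 350 Hz band comes from the text.

```python
import cmath
import math


def power_spectrum(frame, sample_rate):
    """Return (frequency_hz, power) pairs for the positive DFT bins."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append((k * sample_rate / n, abs(acc) ** 2))
    return spectrum


def is_voiced(frame, sample_rate, band=(50.0, 350.0)):
    """Judge a frame voiced if its strongest spectral peak lies in `band`.

    Using only the strongest peak is a simplifying assumption.
    """
    spectrum = power_spectrum(frame, sample_rate)
    peak_freq, _ = max(spectrum, key=lambda fp: fp[1])
    return band[0] <= peak_freq <= band[1]
```

The direct DFT costs O(n^2) per frame, which is acceptable for a sketch; a practical implementation would normally use an FFT.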

When the fundamental frequency is used, it is obtained with, for example, the method described in Non-Patent Document 1. If its value falls within a predetermined frequency band, the part is judged to be a voiced section; otherwise it is judged to be an unvoiced section. The frequency band may be set to, for example, 50 to 350 Hz.

Alternatively, a threshold may be set on the absolute value of the amplitude of the audio waveform, and a point at which the threshold condition holds continuously for a predetermined time may be taken as the end or start of a section.
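The amplitude-threshold rule can be sketched as below. The text does not state the direction of the condition, so this sketch assumes that a boundary occurs where the absolute amplitude stays below the threshold for at least `min_silence` consecutive samples; the function and parameter names are hypothetical.

```python
def split_on_silence(samples, threshold, min_silence):
    """Return (start, end) sample ranges, end exclusive, separated by quiet runs.

    Assumption: a section ends where |amplitude| < threshold holds for
    min_silence consecutive samples, and the next section starts at the
    next sample at or above the threshold.
    """
    sections = []
    start = None    # start index of the current section, None while quiet
    quiet_run = 0   # length of the current run of below-threshold samples
    for i, x in enumerate(samples):
        if abs(x) < threshold:
            quiet_run += 1
            if start is not None and quiet_run == min_silence:
                # The quiet run is long enough: close the section where it began.
                sections.append((start, i - min_silence + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet_run = 0
    if start is not None:
        # Trim trailing quiet samples that were too short to form a boundary.
        sections.append((start, len(samples) - quiet_run))
    return sections
```

Quiet runs shorter than `min_silence` stay inside the surrounding section, which mirrors the requirement that the condition hold for a predetermined time.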

When the input content is audio, the division may be performed using audio information; when it is video, the division may be performed using video information. Where possible, both the shots and the continuous voiced sections described above may be adopted as section units.

When more than one piece of video or audio is input, the section division method above may be applied to each piece.

Next, in step S24, the emotion detection means 14 calculates the emotion level of each section based on at least one of video information and audio information.

In one example of this embodiment, the emotional expression states to be detected include "fun", "sad", "scary", "intense", "cool", "cute", "exciting", "passionate", "romantic", "violent", "calm", "soothing", "warm", "cold", "creepy", and "laughter". In step S24, an emotion level representing the likelihood is calculated for each of these emotional expression states.

The emotion detection means 14 according to the first example of the embodiment is configured, for example, as the emotion detection means 3 shown in FIG. 3.

The emotion detection means 3 consists of: a feature extraction means 31 that extracts features for each analysis frame from the video and audio signal data in the content; an emotion probability calculation means 32 that has a statistical model for obtaining, from the features, the probability of being in an emotional expression state; an emotion level calculation means 33 that calculates the emotion level of the section based on the emotion probability; and an emotion determination means 34 that determines the emotional expression state of the section based on the emotion level.

The statistical model is built by training in advance, before editing is actually performed with the present invention.

FIG. 4 shows an example of the processing flow of the emotion detection means 3 according to the first example of the embodiment. The detailed operation of the emotion detection means 3 shown in FIG. 3 is described below following this processing flow.

When the emotion level of each section is calculated, whether from video information, from audio information, or from both, the necessary video features and audio features are extracted and the emotion level is calculated from them.

In step S41, the feature extraction means 31 calculates and extracts the desired audio features for each analysis frame (hereinafter simply called a frame) from the audio signal data of the captured content.

An example of how each feature is extracted is described below.

First, an example of how the video features are extracted is described.

In one example of this embodiment, the video features extracted are the shot length, the color histogram, the temporal variation characteristics of the color histogram, the motion vector, and the temporal variation characteristics of each of these. One example of a temporal variation characteristic is the inter-frame difference.

The video features are extracted for each video frame. To shorten the processing time, they may instead be extracted at intervals of several frames, for example every fifth frame.

There are various basic methods for extracting the shot length, motion vectors, color histograms, and so on. These methods are well known, and the methods shown in Non-Patent Document 5, for example, can be used.

The shot length cannot in practice be extracted from a single video frame, so it may be extracted as, for example, the length of the shot that contains the video frame in question. The average, maximum, or minimum shot length over a span containing one or more shots may also be used.

The color histogram is extracted, for example, as follows.

The hue is extracted for each pixel in the video frame. By quantizing the hue into a predetermined number Q of levels, for example 11 or 256, it can be determined which of the Q levels each pixel falls into. Doing this for all pixels and counting the occurrences of each level yields the hue histogram of the video frame.
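The hue histogram described above can be sketched as follows, assuming the frame is given as a list of 8-bit RGB pixels and that hue is taken from the HSV model via Python's standard `colorsys` module. The function name `hue_histogram` and the input layout are assumptions; only the quantization into Q levels and the per-level counting come from the text.

```python
import colorsys


def hue_histogram(pixels, q=11):
    """Count pixels per quantized hue level.

    pixels: iterable of (r, g, b) tuples with components in 0..255.
    q: number of quantization levels (Q in the text).
    """
    hist = [0] * q
    for r, g, b in pixels:
        h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # h lies in [0, 1); map it to one of q levels.
        level = min(int(h * q), q - 1)
        hist[level] += 1
    return hist
```

Note that `colorsys` reports hue 0 for achromatic (gray) pixels, so in this sketch they are counted in the first level together with red; whether to treat them separately is a design choice the text leaves open.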

Here the whole video frame may be divided into one or more regions, and a histogram may be extracted for each region.

The motion vector can be extracted as a vector with an X component and a Y component by, for example, calculating optical flow vectors. The optical flow vectors can be calculated with, for example, the method described in Non-Patent Document 6. Alternatively, the norm of the vectors may be calculated for each video frame, or camera operations such as pan, tilt, and zoom may be detected with, for example, the method disclosed in Non-Patent Document 5 and each quantified individually as, for example, the amount of operation per unit time.
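The per-frame norm mentioned above can be illustrated with a small sketch. It assumes the optical flow vectors of one frame are already available as (dx, dy) pairs and summarizes them by their mean Euclidean norm; taking the mean, and the name `mean_flow_magnitude`, are assumptions for illustration.

```python
import math


def mean_flow_magnitude(flow_vectors):
    """Mean Euclidean norm of the (dx, dy) flow vectors of one frame."""
    if not flow_vectors:
        return 0.0
    total = sum(math.hypot(dx, dy) for dx, dy in flow_vectors)
    return total / len(flow_vectors)
```

A single scalar per frame discards direction, so it would complement rather than replace the X and Y components when direction matters.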

Next, an example of how the audio features are extracted is described.

The audio features consist of one or more of the following: the fundamental frequency, the temporal variation characteristics of the fundamental frequency, the rms value of the amplitude, the temporal variation characteristics of the rms value of the amplitude, the power, the temporal variation characteristics of the power, the speech rate, and the temporal variation characteristics of the speech rate.

An example of how each audio feature is extracted is described in detail below.

Because this example of the embodiment targets diverse content, the preferred features are ones that can be obtained stably even from audio in which various sound sources are mixed, in contrast to phonological information, which requires analysis of high-dimensional speech parameters.

For example, methods that convert audio into text by speech recognition or the like require such phonological information. They are effective for genres in which the speaker's voice can be heard clearly, such as news video.

しかし、映画、ドラマや、ホームビデオ等においては、発話以外にも、背景音楽、環境音等の様々な音源要因が存在するために、発話を鮮明に聴き取ることができず、音声認識が難しい。更に、必ずしも発話のみによってコンテンツの感情表出状態を決定できるとは限らないため、印象や雰囲気を含めた感情表出状態を検出するためには、音楽、効果音、環境音等も取り扱うことのできる音声特徴量が必要である。 However, in movies, dramas, home videos, and the like, various sound sources other than speech, such as background music and environmental sounds, are present, so the speech cannot be heard clearly and speech recognition is difficult. Furthermore, since the emotional expression state of content cannot always be determined from speech alone, detecting emotional expression states that include impressions and atmosphere requires audio feature amounts that can also handle music, sound effects, environmental sounds, and the like.

そこで、本実施形態の第１例では、韻律情報、特に、基本周波数、基本周波数の時間変動特性、振幅のｒｍｓ値（以下、単にｒｍｓと呼ぶ）、ｒｍｓの時間変動特性、パワー、パワーの時間変動特性、発話速度、発話速度の時間変動特性を抽出する。 Therefore, in the first example of this embodiment, prosodic information is extracted, in particular the fundamental frequency, the time variation characteristic of the fundamental frequency, the rms value of the amplitude (hereinafter simply referred to as rms), the time variation characteristic of rms, the power, the time variation characteristic of the power, the speech rate, and the time variation characteristic of the speech rate.

特に、時間変動特性として数種の短時間変化量を用いることによって、コンテンツに含まれる感情を抽出する場合においての感情的な音声における重要な挙動を検出することが可能となる。 In particular, by using several types of short-time change amounts as the time variation characteristics, it is possible to detect important behaviors in emotional speech when emotions included in content are extracted.

時間変動特性の例としては、例えば、フレーム間差分や、回帰係数がある。また、パワーは、パワースペクトル密度などを用いるのでもよい。基本周波数、パワーの抽出法は様々あるが、公知であり、その詳細については、例えば非特許文献１に記載の方法等を参照されたい。 Examples of time variation characteristics include inter-frame differences and regression coefficients. The power may be a power spectral density. There are various methods for extracting the fundamental frequency and power, but they are known. For details, see the method described in Non-Patent Document 1, for example.
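
As a non-limiting illustration of the two time variation characteristics named above, the following Python sketch computes the inter-frame difference and a regression coefficient (here taken to be the least-squares slope against the frame index) for a sequence of per-frame values. The function names and the sample fundamental-frequency values are hypothetical and do not come from the cited documents.

```python
def frame_differences(values):
    """Inter-frame difference: change from each frame to the next."""
    return [b - a for a, b in zip(values, values[1:])]


def regression_coefficient(values):
    """Least-squares slope of the values against their frame index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two frames are needed")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical per-frame fundamental frequency values in Hz.
f0 = [100.0, 110.0, 105.0, 120.0]
print(frame_differences(f0))
print(regression_coefficient(f0))
```

In practice the regression coefficient would typically be computed over a short window of neighbouring frames rather than over the whole sequence; the window length is a design choice not fixed by the text.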

また、発話速度、音楽リズム、テンポ等を含めた発話速度については、例えば非特許文献２に開示されている方法などによって、動的尺度として抽出することができる。例えば、動的尺度のピークを検出し、その数をカウントすることで発話速度を検出する方法をとってもよく、また、発話速度の時間変動特性に相当するピーク間隔の平均値、分散値を計算して発話速度の時間変動特性を検出する方法をとるのでもよい。以下、本発明の実施形態の第１例では、発話速度として動的尺度のピーク間隔平均値を用いるものとする。 Further, the speech rate, understood here to include speech rate, musical rhythm, tempo, and the like, can be extracted as a dynamic scale by, for example, the method disclosed in Non-Patent Document 2. For example, the speech rate may be detected by detecting the peaks of the dynamic scale and counting them, or the time variation characteristic of the speech rate may be detected by calculating the mean and variance of the peak intervals, which correspond to that characteristic. In the first example of the embodiment of the present invention described below, the mean peak interval of the dynamic scale is used as the speech rate.
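
The peak-based processing described above can be sketched as follows. This is only an illustration under stated assumptions: the dynamic scale is assumed to be already available as one value per analysis step, a peak is taken to be a strict local maximum, and the 20 ms step and all names are hypothetical. The extraction of the dynamic scale itself (Non-Patent Document 2) is not shown.

```python
from statistics import mean, pvariance


def peak_indices(measure):
    """Indices of strict local maxima of the dynamic scale."""
    return [
        i
        for i in range(1, len(measure) - 1)
        if measure[i - 1] < measure[i] > measure[i + 1]
    ]


def peak_interval_stats(measure, step_ms=20):
    """Return (peak count, mean peak interval, variance of peak interval) in ms."""
    peaks = peak_indices(measure)
    intervals = [(b - a) * step_ms for a, b in zip(peaks, peaks[1:])]
    if not intervals:
        # Fewer than two peaks: no interval statistics can be formed.
        return len(peaks), None, None
    return len(peaks), mean(intervals), pvariance(intervals)


# Hypothetical dynamic-scale values with peaks at indices 1, 3 and 6.
count, avg_interval, var_interval = peak_interval_stats([0, 1, 0, 2, 0, 0, 3, 0])
print(count, avg_interval, var_interval)
```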

これらの音声特徴量を、フレーム毎に抽出する方法の１例を説明する。１フレームの長さ（以下、フレーム長とよぶ）を、例えば５０ｍｓとし、次のフレームは現フレームに対して、例えば、２０ｍｓの時間シフトによって形成されるものとする。 An example of a method for extracting these audio feature amounts for each frame will be described. The length of one frame (hereinafter referred to as the frame length) is, for example, 50 ms, and the next frame is formed by a time shift of, for example, 20 ms with respect to the current frame.
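
For illustration, the frame layout just described (a frame length of 50 ms with each frame shifted 20 ms from the previous one) can be enumerated as below. The function name is hypothetical, and dropping a trailing partial frame is an assumption of this sketch rather than something the text specifies.

```python
def frame_bounds(total_ms, frame_ms=50, shift_ms=20):
    """Return (start, end) times in ms of every full frame that fits in total_ms."""
    bounds = []
    start = 0
    while start + frame_ms <= total_ms:
        bounds.append((start, start + frame_ms))
        start += shift_ms
    return bounds


# Consecutive frames overlap by 30 ms (50 ms length minus 20 ms shift).
print(frame_bounds(110))
```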

図５に示すように、これらのフレーム毎に、各フレーム内での各音声特徴量の平均値、つまり、平均基本周波数、基本周波数の平均時間変動特性、平均ｒｍｓ、ｒｍｓの平均時間変動特性、平均パワー、パワーの平均時間変動特性、動的尺度の平均ピーク間隔平均値などを計算するものとする。あるいは、これらの平均値のみではなく、フレーム内での各音声特徴量の最大値、最小値、または変動幅などを計算して用いてもよい。 As shown in FIG. 5, for each of these frames, the average value of each audio feature amount within the frame is calculated, that is, the average fundamental frequency, the average time variation characteristic of the fundamental frequency, the average rms, the average time variation characteristic of rms, the average power, the average time variation characteristic of the power, the average of the mean peak interval of the dynamic scale, and so on. Alternatively, in addition to these average values, the maximum value, the minimum value, the fluctuation range, or the like of each audio feature amount within the frame may be calculated and used.

ここで、コンテンツ中の感情表出状態にある部分に特徴的に現れる音声においては、基本周波数そのものの抽出が困難な場合が多く、しばしば欠損することがある。このため、そのような欠損を補完する効果を容易に得ることのできる、基本周波数の時間変動特性は含むことが好ましい。 Here, in audio that characteristically appears in portions of the content that are in an emotional expression state, the fundamental frequency itself is often difficult to extract and is frequently missing. For this reason, it is preferable to include the time variation characteristic of the fundamental frequency, which readily compensates for such missing values.

更には、話者依存性を低く抑えたまま、判定精度を高めるため、パワーの時間変動特性を更に含むことが好ましい。以上、フレーム毎に行った音声特徴量の抽出処理を、コンテンツ全てに渡るフレームに対して行うことで、全てのフレームにおいて音声特徴量を得ることが可能である。 Furthermore, it is preferable to further include a time variation characteristic of power in order to improve the determination accuracy while keeping speaker dependency low. As described above, the audio feature amount extraction processing performed for each frame is performed on the frames over the entire content, so that the audio feature amounts can be obtained in all the frames.

映像特徴量、音声特徴量の両方を用いて感情度の計算を実行する場合、映像特徴量抽出に用いたフレームと、音声特徴量抽出に用いたフレームを統一しておくことが好ましい。 When the emotion level is calculated using both the video feature value and the audio feature value, it is preferable to unify the frame used for the video feature value extraction and the frame used for the audio feature value extraction.

これを実施する方法の一例としては、例えば、映像特徴量を音声特徴量抽出に用いたフレームに合わせて換算し、５０ｍｓのフレーム幅、２０ｍｓの時間シフトで再計算する方法が挙げられる。 As an example of a method for implementing this, for example, there is a method in which a video feature value is converted according to a frame used for audio feature value extraction, and recalculated with a frame width of 50 ms and a time shift of 20 ms.

次に、ステップＳ４２において、ステップＳ４１でフレーム毎に抽出した特徴量に基づいて、感情確率計算手段３２が、検出する対象となる感情表出状態である確率を、感情確率として計算する。 Next, in step S42, based on the feature amount extracted for each frame in step S41, the emotion probability calculation means 32 calculates the probability of the emotion expression state to be detected as the emotion probability.

感情確率は、検出する対象となる感情表出状態のカテゴリ（感情カテゴリ）と、統計モデルを対応付けることで、統計モデルを用いて感情カテゴリ毎に感情確率を計算する。この統計モデルとしては、例えば、正規分布、混合正規分布、隠れマルコフモデル、一般化状態空間モデルなどを用いる。好ましくは、感情の時間遷移をモデル化できる、一般化状態空間モデルを採用する。 The emotion probability is calculated for each emotion category using the statistical model by associating the category of the emotion expression state (emotion category) to be detected with the statistical model. As this statistical model, for example, a normal distribution, a mixed normal distribution, a hidden Markov model, a generalized state space model, or the like is used. Preferably, a generalized state space model that can model the temporal transition of emotion is adopted.
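
As a rough sketch of associating each emotion category with a statistical model, the example below uses the simplest model in the list above, a single normal distribution per category over one scalar feature, and normalizes the likelihoods into per-category probabilities. The equal priors, the model parameters, and all names are assumptions made purely for illustration; the preferred generalized state space model, which also captures temporal transitions, is not shown.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def emotion_probabilities(feature, models):
    """models maps category -> (mean, std). Equal priors are assumed."""
    likelihoods = {k: gaussian_pdf(feature, m, s) for k, (m, s) in models.items()}
    total = sum(likelihoods.values())
    if total == 0.0:
        raise ValueError("feature is too far from every model")
    return {k: v / total for k, v in likelihoods.items()}


# Hypothetical models over the average fundamental frequency (Hz) of a frame.
models = {"fun": (200.0, 30.0), "sad": (120.0, 30.0)}
print(emotion_probabilities(190.0, models))
```

Normalizing over the categories keeps each probability between 0 and 1, which fits the later recommendation that the emotion level stay within a fixed range.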

統計モデルのパラメータの推定方法は、例えば、最尤推定法や、ＥＭアルゴリズム、変分ベイズ法などが公知のもとして知られており、用いることができる。詳しくは非特許文献３、非特許文献４などを参照されたい。 As a method for estimating a parameter of a statistical model, for example, a maximum likelihood estimation method, an EM algorithm, a variational Bayes method, and the like are known and can be used. For details, refer to Non-Patent Document 3, Non-Patent Document 4, and the like.

次に、ステップＳ４３において、ステップＳ４２で計算した感情確率に基づいて、感情度計算手段３３が区間毎の感情度を計算する。 Next, in step S43, based on the emotion probability calculated in step S42, the emotion level calculation means 33 calculates the emotion level for each section.

区間毎の感情度を計算する方法の１例について説明する。 An example of a method for calculating the emotion level for each section will be described.

各感情カテゴリを順にｅ¹、ｅ²、・・・と表記し、感情カテゴリの数を＃（Ｋ）と表す。コンテンツ中の区間Ｓの集合を時刻の早いものから順に{Ｓ₁，Ｓ₂，・・・，Ｓ_NS}とする。ここで、ＮＳは区間の総数である。 The emotion categories are denoted in order as e^1, e^2, ..., and the number of emotion categories is denoted #(K). The set of sections S in the content is written {S_1, S_2, ..., S_NS} in order from the earliest time, where NS is the total number of sections.

ある区間Ｓ_iに含まれるフレームを{ｆ₁，ｆ₂，・・・，ｆ_NFi}と置く。ここで、ＮＦｉは区間Ｓ_iに含まれるフレーム数である。各フレームｆ_tは、ステップＳ４２において、フレーム単位でのｋ番目の感情カテゴリｅ^kの感情確率ｐｆ_t（ｅ^k）が求められている。これを用いて、ｋ番目の感情カテゴリｅ^kの区間Ｓ_iの感情度ｐＳ_i（ｅ^k）は、例えば、区間Ｓ_iにおける平均感情確率を表す Let the frames included in a section S_i be {f_1, f_2, ..., f_NFi}, where NF_i is the number of frames included in section S_i. For each frame f_t, the frame-level emotion probability pf_t(e^k) of the k-th emotion category e^k has been obtained in step S42. Using this, the emotion level pS_i(e^k) of the k-th emotion category e^k in section S_i is calculated, for example, as the average emotion probability in section S_i,

$$ p_{S_i}(e^k) = \frac{1}{NF_i} \sum_{t=1}^{NF_i} p_{f_t}(e^k) \tag{1} $$

として計算することや、最大値を表す次式によって計算する。 or by the following equation, which takes the maximum value.

$$ p_{S_i}(e^k) = \max_{t = 1, \ldots, NF_i} p_{f_t}(e^k) \tag{2} $$

また、式（１）によって実施される平均化は、各区間Ｓ_iを更に分割して小区間を生成し、この小区間単位で実行するのでもよい。 The averaging performed by Equation (1) may also be carried out by further dividing each section S_i into sub-sections and averaging within each sub-section.

この小区間は、例えば１秒の小区間幅、小区間シフトを０．５秒として、この小区間毎に平均化を実行する。そして、区間Ｓ_i内に含まれる全ての小区間の平均値のうち、最大の値を取るものを区間Ｓ_iの感情度としてもよい。これら以外にも、例えば、区間内でフィルタ処理を行ってから感情度を計算するなど、方法は様々あるが、各区間の間で感情度を比較する場合があるため、感情度はある一定の値の範囲内、例えば０〜１の間に収まるようにすることが好ましい。 For these sub-sections, for example, the sub-section width is set to 1 second and the sub-section shift to 0.5 seconds, and averaging is performed for each sub-section. The largest of the average values of all the sub-sections included in section S_i may then be taken as the emotion level of section S_i. There are various other methods, such as calculating the emotion level after filtering within the section; however, because emotion levels may be compared across sections, the emotion level is preferably kept within a fixed range of values, for example between 0 and 1.
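
The three ways of deriving a section's emotion level from its frame-level emotion probabilities (the average, the maximum, and the largest sub-section average) can be sketched as follows. The function names are hypothetical, and the sub-section width and shift are expressed in frames; with a 20 ms frame shift, a 1 second width and 0.5 second shift would correspond to roughly 50 and 25 frames, which is an assumption of this sketch.

```python
def level_mean(probs):
    """Average emotion probability over the frames of one section."""
    return sum(probs) / len(probs)


def level_max(probs):
    """Maximum emotion probability over the frames of one section."""
    return max(probs)


def level_subsection_max(probs, width, shift):
    """Largest sub-section average, with width and shift given in frames."""
    if len(probs) <= width:
        return level_mean(probs)
    # Only full-width sub-sections are used; a shorter tail is ignored.
    means = [
        sum(probs[s:s + width]) / width
        for s in range(0, len(probs) - width + 1, shift)
    ]
    return max(means)


# Hypothetical frame-level probabilities of one category in one section.
probs = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1]
print(level_mean(probs), level_max(probs), level_subsection_max(probs, 2, 1))
```

Because every result is an average or a maximum of probabilities, all three stay between 0 and 1, so they remain comparable across sections.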

以上のような計算を、全ての区間に渡り行うことで、全ての区間に対して感情カテゴリ毎の感情度を計算することができる。 By performing the above calculation over all the sections, the emotion level for each emotion category can be calculated for all the sections.

次に、ステップＳ４４において、ステップＳ４３で計算した感情度に基づいて、感情判定手段３４が区間毎の感情表出状態を判定する。 Next, in step S44, the emotion determination means 34 determines the emotion expression state for each section based on the emotion level calculated in step S43.

これを実施する方法としては、例えば、感情度が最大値をとる感情カテゴリを、その区間の感情として求める。 As a method for implementing this, for example, an emotion category having the maximum emotion level is obtained as the emotion of the section.

これによって、各区間でどのような感情表出状態が優勢に表れているかを判定することができる。 This makes it possible to determine what emotional expression state prevails in each section.
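
A minimal sketch of this determination, assuming the emotion levels of one section are held in a mapping from category name to level (the names and values are hypothetical):

```python
def dominant_emotion(levels):
    """Return the emotion category whose emotion level is largest."""
    return max(levels, key=levels.get)


# Hypothetical emotion levels of one section.
section_levels = {"fun": 0.62, "sad": 0.21}
print(dominant_emotion(section_levels))
```

If two categories tie, `max` returns the one that appears first in the mapping; how ties should be resolved is not specified in the text.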

映像又は音声が複数入力された場合には、これらの映像又は音声を連続的に繋げ、１つの映像又は音声であるとみなして、上記感情度の算出を実行すればよい。 When a plurality of images or sounds are input, these images or sounds may be continuously connected and regarded as one image or sound, and the emotion level may be calculated.

次に、図２のステップＳ２５において、画像出力手段１５が区間毎に代表画像を選定し、出力する。 Next, in step S25 of FIG. 2, the image output means 15 selects and outputs a representative image for each section.

まず、映像に対する代表画像を選定する処理の一例について説明する。 First, an example of processing for selecting a representative image for a video will be described.

代表画像の選定は、例えば、ある区間Ｓ_iに含まれる全ての映像フレームを時間方向に並べたときの先頭フレーム、中央のフレーム、末尾のフレームなどを抽出し、これを代表画像として選定するのでもよい。 The representative image may be selected by, for example, extracting the first frame, the middle frame, the last frame, or the like when all the video frames included in a section S_i are arranged in time order, and using it as the representative image.

また、ステップＳ２４において求めた、区間における感情表出状態、感情度、感情確率に基づいて代表画像を抽出してもよい。 Further, the representative image may be extracted based on the emotion expression state, the emotion level, and the emotion probability obtained in step S24.

例えば、感情表出状態と感情確率に基づいた処理方法を、図６を用いて説明する。 For example, a processing method based on the emotion expression state and the emotion probability will be described with reference to FIG.

まず、代表画像を選定する区間Ｓ_iの感情表出状態を取得し、この感情表出状態に対応する感情確率が最大の値を取る時刻Ｔ_Frameを取得する。 First, the emotional expression state of the section S_i for which a representative image is to be selected is acquired, and the time T_Frame at which the emotion probability corresponding to this emotional expression state takes its maximum value is acquired.

次に、区間に含まれる全ての映像フレーム６１の中からこの時刻Ｔ_Frameに最も近い映像フレームを、代表画像６２として選定する。 Next, the video frame closest to the time T _Frame is selected as the representative image 62 from all the video frames 61 included in the section.

この方法によって、その区間の最も優勢な感情表出状態が、最も強く現れていると想定される映像フレームを代表画像として選定することができ、どのような内容の区間であるかを分かりやすく確認することができる。 By this method, it is possible to select the video frame that is assumed to have the strongest emotional expression state in that section as the representative image, and confirm the contents of the section in an easy-to-understand manner can do.
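
The selection described with reference to FIG. 6 can be sketched as below, assuming that the emotion probabilities of the section's dominant category and the times of the video frames are available as parallel lists. All names and sample values are hypothetical.

```python
def select_representative_frame(prob_times, probs, video_frame_times):
    """Return the index of the video frame closest to the time of maximum probability."""
    # Time T_Frame at which the emotion probability is largest.
    best = max(range(len(probs)), key=lambda i: probs[i])
    t_frame = prob_times[best]
    # Video frame whose timestamp is nearest to T_Frame.
    return min(
        range(len(video_frame_times)),
        key=lambda i: abs(video_frame_times[i] - t_frame),
    )


# Hypothetical data: probabilities every 20 ms, video frames at about 30 fps.
prob_times = [0.00, 0.02, 0.04, 0.06]
probs = [0.1, 0.7, 0.9, 0.3]
video_frame_times = [0.0, 0.0333, 0.0667]
print(select_representative_frame(prob_times, probs, video_frame_times))
```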

入力されたファイルが音声である場合には、代表画像として、図７に示すような、音声であることを示すアイコン７１を代表画像とするのでもよいし、当該区間の音声波形７２を代表画像とするのでもよい。 When the input file is audio, an icon 71 indicating audio, as shown in FIG. 7, may be used as the representative image, or the audio waveform 72 of the section may be used as the representative image.

次に、ステップＳ２６において、画像加工手段１６が区間毎の感情度に基づいて、代表画像を加工する。 Next, in step S26, the image processing means 16 processes the representative image based on the emotion level for each section.

代表画像を加工する処理の一例について、図８を用いて説明する。 An example of processing for processing a representative image will be described with reference to FIG.

まず事前に、感情カテゴリ毎に、感情度の高低に応じて、加工内容を決定しておく。例えば、“楽しい”感情カテゴリであれば、黄色の外枠を付与する、“哀しい”感情カテゴリであれば、同様に青色の外枠を付与するなどとしてもよい。また、感情度が高い場合には、代表画像のサイズを大きくし、低い場合には代表画像のサイズを小さくするなどとしてもよい。また、この他、画像の透過率、コントラスト、明るさ、色相、画素数等を加工項目に設定してもよい。 First, processing contents are determined in advance for each emotion category according to the level of emotion level. For example, a yellow outer frame may be assigned to the “fun” emotion category, and a blue outer frame may be similarly applied to the “sad” emotion category. Further, the size of the representative image may be increased when the emotion level is high, and the size of the representative image may be decreased when the emotion level is low. In addition to this, image transmissivity, contrast, brightness, hue, number of pixels, and the like may be set as processing items.

感情度の高低の判断は、例えば、感情カテゴリｅ^kの区間Ｓ_iの感情度ｐＳ_i（ｅ^k）が０〜１の間の値を取る場合には、０≦ｐＳ_i（ｅ^k）＜０．３であれば「低」、０．３≦ｐＳ_i（ｅ^k）＜０．７であれば「中」、０．７≦ｐＳ_i（ｅ^k）≦１．０であれば「高」などとすることができる。 The emotion level may be judged, for example, as follows: when the emotion level pS_i(e^k) of emotion category e^k in section S_i takes a value between 0 and 1, it is judged “low” if 0 ≦ pS_i(e^k) < 0.3, “medium” if 0.3 ≦ pS_i(e^k) < 0.7, and “high” if 0.7 ≦ pS_i(e^k) ≦ 1.0.
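
The example thresholds above translate directly into a small classification function. The function name and the English labels are illustrative only.

```python
def level_label(p):
    """Map an emotion level in [0, 1] to 'low', 'medium' or 'high'."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("emotion level must lie between 0 and 1")
    if p < 0.3:
        return "low"
    if p < 0.7:
        return "medium"
    return "high"


print([level_label(p) for p in (0.1, 0.3, 0.69, 0.7, 1.0)])
```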

実際に加工を行う段階においては、まず、最も優勢に表れている感情表出状態である感情カテゴリを求める。この方法は、感情度が最も高い値となった感情カテゴリを最も優勢な感情カテゴリであるとすればよい。 In the stage of actual processing, first, the emotion category that is the most prevalent emotion expression state is obtained. In this method, the emotion category having the highest emotion level may be the most dominant emotion category.

次に、この優勢な感情カテゴリと、その感情度に対応する加工を代表画像に実施すればよい。 Next, this dominant emotion category and processing corresponding to the emotion level may be performed on the representative image.

前記のような加工のほか、アイコンの付与を実施するのでもよい。 In addition to the processing as described above, an icon may be assigned.

この処理を行う場合にも、予め、感情カテゴリ毎に、感情度の高低に応じて、付与するアイコンを決定しておく。 Even when this processing is performed, an icon to be assigned is determined in advance for each emotion category according to the level of emotion.

代表画像にアイコンを付与した一例を図９に示す。例えば、代表画像に対応する区間の感情カテゴリが“楽しい”で、感情度が「高」の場合には、例えば、“笑顔”を示すアイコン９１を代表画像９２に付与する。また、例えば、感情カテゴリが“哀しい”で、感情度が「高」の場合には、例えば、“泣き顔”を示すアイコン９３を代表画像９４に付与する。 An example in which an icon is added to the representative image is shown in FIG. For example, when the emotion category of the section corresponding to the representative image is “fun” and the emotion level is “high”, for example, an icon 91 indicating “smile” is added to the representative image 92. For example, when the emotion category is “sad” and the emotion level is “high”, for example, an icon 93 indicating “crying face” is added to the representative image 94.
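
One way to hold the processing content decided in advance for each emotion category and level is a set of lookup tables, as in the sketch below. The frame colours and icons follow the examples given in the text; the scale factors, the key names, and the function name are hypothetical.

```python
# Outer-frame colour per emotion category (examples from the text).
FRAME_COLOR = {"fun": "yellow", "sad": "blue"}

# Icon per (category, level); only the "high" cases are given in the text.
ICON = {("fun", "high"): "smile", ("sad", "high"): "crying_face"}

# Hypothetical size factors: larger images for higher emotion levels.
SCALE = {"low": 0.75, "medium": 1.0, "high": 1.25}


def plan_decoration(category, level):
    """Return the processing to apply to a representative image."""
    return {
        "frame_color": FRAME_COLOR.get(category),
        "icon": ICON.get((category, level)),
        "scale": SCALE[level],
    }


print(plan_decoration("fun", "high"))
print(plan_decoration("sad", "low"))
```

Keeping the tables separate from the drawing code lets the processing content be changed per category without touching the code that actually edits the image.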

このようなアイコンの付与によって、各区間がどのような感情的状態にある区間であるかを、直観的に理解しやすくすることができ、確認の作業を効率化できる。 By giving such an icon, it is possible to intuitively understand the emotional state of each section, and the confirmation work can be made more efficient.

次に、ステップＳ２７において、レイアウト手段１７が各代表画像をレイアウトして出力し、利用者に提示する。 Next, in step S27, the layout means 17 lays out and outputs each representative image and presents it to the user.

この処理方法の一例を、図１０を用いて説明する。 An example of this processing method will be described with reference to FIG.

図の例では、代表画像には、感情度に合わせて、前記の方法によってアイコン、及びサイズの加工を施してある。例えば、提示平面１００の横（左右）軸１０１を時間軸に取り、区間の時刻情報に合わせて左から順にレイアウトしてもよい。 In the example shown in the figure, the icon and the size of the representative image are processed by the above-described method according to the emotion level. For example, the horizontal (left / right) axis 101 of the presentation plane 100 may be taken as the time axis, and the layout may be sequentially performed from the left in accordance with the time information of the section.

さらに図１１に示すように、縦（上下）軸１１１を感情度に取り、感情度を反映したレイアウトを行ってもよい。例えば、代表画像１１２に対応する区間は感情度が「高」、代表画像１１３に対応する区間などは「低」、代表画像１１４に対応する区間などは「中」を表すものとしてもよい。 Furthermore, as shown in FIG. 11, the vertical (up/down) axis 111 may represent the emotion level, and a layout reflecting the emotion level may be produced. For example, the section corresponding to the representative image 112 may have the emotion level “high”, sections such as the one corresponding to the representative image 113 “low”, and sections such as the one corresponding to the representative image 114 “medium”.

また、例えば、提示平面の縦（上下）軸を時間軸に、横（左右）軸をパンによるカメラワークの画角を表す軸に取り、加工した代表画像をレイアウトするのでもよい。また、それぞれの軸は、チルトなどのカメラワークの画角や、感情度の大きさを示すようにしてもよい。 Further, for example, the processed representative image may be laid out by taking the vertical (up / down) axis of the presentation plane as the time axis and the horizontal (left / right) axis as the axis representing the angle of view of the camera work by panning. In addition, each axis may indicate the angle of view of camerawork such as tilt or the degree of emotion.

さらに、例えば、図１２に示すように、提示平面１２０を、一連の区間をあるまとまりとしてまとめ、そのまとまりの単位で、例えば、１番目のまとまりに属する区間の代表画像を提示する提示領域１２１、２番目のまとまりに属する区間の代表画像を提示する提示領域１２２、などのように、レイアウト位置を分割してもよい。 Furthermore, as shown in FIG. 12, for example, a series of sections may be gathered into groups, and the presentation plane 120 may be divided into layout positions group by group, such as a presentation area 121 that presents the representative images of the sections belonging to the first group and a presentation area 122 that presents the representative images of the sections belonging to the second group.

この際用いるまとまりは、例えば、映像情報を利用して、ディゾルブのような漸次カットなどの特殊なカットで囲まれる一まとまりとしてもよいし、また、ある一定時間以上、例えば、１秒以上の無音区間で囲まれる一まとまりとしてもよい。 The group used at this time may be a group surrounded by a special cut such as a gradual cut such as a dissolve by using video information, or may be silence for a certain period of time, for example, one second or more. It is good also as a unit surrounded by a section.

まとまりの数は、図１２では２つの例を示しているが、必ずしも２つである必要はなく、入力された映像、音声ファイルと、区間分割、まとまりを構成する方法に合わせて変更されるものである。 FIG. 12 shows an example with two groups, but the number of groups need not be two; it varies with the input video and audio files, the section division, and the method used to form the groups.

これらのレイアウトによって、横一列に全て提示する場合に比べて、利用者がより一覧的に区間全体を把握することができる。また、例えば、レイアウトした代表画像が、提示平面の右端１２３に到達したところで区切ってひとまとまりとするのでもよい。 With these layouts, the user can grasp the entire section more in a list than in the case where all are presented in a horizontal row. Further, for example, the representative images that have been laid out may be divided into a group when they reach the right end 123 of the presentation plane.

この他、例えば、アイコンを付与する方法と同様に、前記感情度が「高」の区間が出現したところで区切り、そこから先は列を変えるものとしてもよい。つまり、一列の末尾は感情度が「高」、即ち、感情表出が大きく表れている区間となる。この方法によって、一列毎に、感情表出が大きく表れる一連のコンテキストを含めることができる。 In addition, for example, similarly to the method of assigning an icon, it may be divided at a point where the emotion level is “high”, and the column is changed after that. That is, the end of the line is a section where the emotion level is “high”, that is, the emotional expression is large. By this method, it is possible to include a series of contexts in which emotional expression greatly appears for each column.
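
The row-breaking rule described above, in which each row ends with a section whose emotion level is “high”, can be sketched as follows. The input is assumed to be the level label of each section in time order, and the function name is hypothetical.

```python
def rows_ending_at_high(levels):
    """Split section indices into rows, closing a row after each 'high' section."""
    rows, current = [], []
    for index, level in enumerate(levels):
        current.append(index)
        if level == "high":
            rows.append(current)
            current = []
    if current:
        # Trailing sections after the last 'high' section form a final row.
        rows.append(current)
    return rows


print(rows_ending_at_high(["low", "medium", "high", "low", "high", "medium"]))
```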

また、入力された元映像又は音声が１つ以上存在する場合には、これらの映像又は音声を連続的に繋げ、１つの映像又は音声であるとみなして、図１０の提示平面１００に表示してもよい。 In addition, when one or more input original video or audio exists, these video or audio are continuously connected and regarded as one video or audio and displayed on the presentation plane 100 of FIG. May be.

また、それぞれ提示領域を切り替えて提示するものとしてもよい。このレイアウト方法の一例について、図１３を用いて説明する。 Moreover, it is good also as what switches each presentation area and presents. An example of this layout method will be described with reference to FIG.

図１３は、映像又は音声であるコンテンツが３つ入力された場合である。このときは、例えば、各コンテンツ毎に提示平面１３０を３つ設置し、提示平面切換タブを押すことで、これら表示される提示平面を切り替えて提示する。 FIG. 13 shows a case where three contents, which are video or audio, are input. At this time, for example, three presentation planes 130 are installed for each content, and by pressing a presentation plane switching tab, the displayed presentation planes are switched and presented.

次に、利用者が区間の再生を要求した場合には、ステップＳ２８において、再生手段１８が、指定された区間の再生を実行する。すなわち、区間再生要求を受け付けたか否かの判定（ステップＳ２８ａ）と、区間の再生（ステップＳ２８ｂ）と、代表画像選択を終了するか否かの判定（ステップＳ２８ｃ）とを実行する。 Next, when the user requests playback of the section, the playback means 18 executes playback of the designated section in step S28. That is, it is determined whether or not a section playback request has been accepted (step S28a), section playback (step S28b), and whether or not to end representative image selection (step S28c).

区間の再生を要求する方法の一例としては、例えば、利用者がポインティングデバイスなどを用いて、代表画像を指定した場合に、それをもって要求とすればよい。 As an example of a method for requesting playback of a section, for example, when a user designates a representative image using a pointing device or the like, the request may be made with that.

利用者の再生要求に合わせて、任意の区間から再生を実行する方法の一例について説明する。 An example of a method for executing reproduction from an arbitrary section in accordance with a user's reproduction request will be described.

利用者が、代表画像の指定等によって区間の再生を要求する。この際、各代表画像と、対応する区間の再生開始点となる映像フレーム、又は、タイムライン時刻とを予め対応づけておく。この方法によって、代表画像を指定した際に、映像中のどのフレーム位置、もしくはタイムライン時刻から再生を開始すればよいかが判別できるため、再生位置のシークによって、任意の区間から再生を実行することができる。 The user requests playback of a section by, for example, designating a representative image. For this purpose, each representative image is associated in advance with the video frame, or the timeline time, that serves as the playback start point of the corresponding section. In this way, when a representative image is designated, the frame position or timeline time in the video from which playback should start can be determined, so playback can be executed from any section by seeking to that playback position.

その他の方法の一例としては、予め、各区間を全て個別に映像、動画ファイル化する方法がある。例えば、各区間の先頭の映像フレームから終端の映像フレームまでを切り出しておき、単一の映像ファイルとして記憶する。このようにして生成した区間に対応した映像ファイルと、同じく区間に対応した代表画像を対応付けておき、代表画像が指定された時点で対応する映像ファイルを再生すればよい。 As an example of another method, there is a method in which all sections are individually converted into video and moving image files in advance. For example, the first video frame to the last video frame of each section are cut out and stored as a single video file. The video file corresponding to the section generated in this way may be associated with the representative image corresponding to the section, and the corresponding video file may be reproduced when the representative image is designated.

最後に、ステップＳ２９において、利用者が代表画像を選択し、編集コンテンツに用いる区間の選択を終了する要求を受けた場合に、生成手段１９が、選択された区間を繋ぎ合わせて編集コンテンツを生成する。 Finally, in step S29, when the user has selected representative images and a request to end the selection of sections to be used for the edited content is received, the generation means 19 connects the selected sections to generate the edited content.

この繋ぎ合わせ方の例としては、例えば、元映像のタイムライン上の時刻が早い順に繋ぎ合わせるのでもよいし、例えば、感情度の高い区間又は低い区間から順に繋ぎ合わせるのでもよい。また、利用者が繋ぎ合わせ順を指定できるように設計してもよい。 As an example of this connection method, for example, connection may be performed in order from the earliest time on the timeline of the original video, or, for example, connection may be performed in order from a section having a higher emotion level or a section having a lower emotion level. Moreover, you may design so that a user can designate the connection order.
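
The connection orders mentioned above can be illustrated with a simple sort. In this sketch each selected section is assumed to be a dictionary with a hypothetical `start` time on the original timeline and a hypothetical `level` holding its emotion level; a user-specified order would simply bypass the sort.

```python
def order_sections(sections, mode="time"):
    """Return the sections in the order in which they are to be connected."""
    if mode == "time":
        return sorted(sections, key=lambda s: s["start"])
    if mode == "level_desc":
        return sorted(sections, key=lambda s: s["level"], reverse=True)
    if mode == "level_asc":
        return sorted(sections, key=lambda s: s["level"])
    raise ValueError("unknown mode: " + mode)


# Hypothetical selected sections.
selected = [
    {"id": "b", "start": 12.0, "level": 0.9},
    {"id": "a", "start": 3.0, "level": 0.4},
    {"id": "c", "start": 20.0, "level": 0.7},
]
print([s["id"] for s in order_sections(selected, "time")])
print([s["id"] for s in order_sections(selected, "level_desc")])
```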

以上が、本発明の実施形態の第１例に係る情報編集装置、情報編集方法の一例である。 The above is an example of the information editing apparatus and the information editing method according to the first example of the embodiment of the present invention.

[実施形態の第２例] [Second Example of Embodiment]

本発明の実施形態の第２例は、自動編集を実行する装置として実施した場合であり、請求項１，３に対応している。 The second example of the embodiment of the present invention is a case where it is implemented as an apparatus for executing automatic editing, and corresponds to claims 1 and 3.

図１４は本実施形態例に係る情報編集装置１４０の一例を示している。 FIG. 14 shows an example of the information editing apparatus 140 according to this embodiment.

情報編集装置１４０は、編集対象であるコンテンツ入力部１４１と、コンテンツ入力部１４１から入力された映像／音声データ１０を格納する記憶装置１４２と、映像情報、音声情報のうち何れか１つ以上を用いて、コンテンツデータを所定の区間単位に分割する区間分割手段１４３と、映像情報、音声情報のうち何れか１つ以上を用いて、分割されたそれぞれの区間毎に、感情表出状態の確からしさである感情度を算出する感情検出手段１４４と、感情度に基づいて区間を選択する選択手段１４５と、選択された区間を繋ぎ合わせて、編集コンテンツを生成する生成手段１４６と、を備え、前記各機能は例えばコンピュータにより実行される。 The information editing apparatus 140 includes a content input unit 141 through which the content to be edited is input; a storage device 142 that stores the video/audio data 10 input from the content input unit 141; a section division means 143 that divides the content data into predetermined section units using one or more of video information and audio information; an emotion detection means 144 that calculates, for each divided section, an emotion level representing the likelihood of an emotional expression state using one or more of video information and audio information; a selection means 145 that selects sections based on the emotion level; and a generation means 146 that connects the selected sections to generate edited content. Each of these functions is executed by, for example, a computer.

図１５に、本実施形態例に係る情報編集装置１４０の処理フローの一例を図示する。この処理フローに基づいて、図１４に示す情報編集装置１４０の詳細な動作を説明する。 FIG. 15 illustrates an example of a processing flow of the information editing apparatus 140 according to the present embodiment. Based on this processing flow, the detailed operation of the information editing apparatus 140 shown in FIG. 14 will be described.

ステップＳ１５１〜ステップＳ１５４の各処理は、本発明の実施形態の第１例の説明に示した図２のステップＳ２１〜ステップＳ２４と、それぞれ全く同一のものとしてよい。 Each process of step S151 to step S154 may be exactly the same as step S21 to step S24 of FIG. 2 shown in the description of the first example of the embodiment of the present invention.

ステップＳ１５４までの各処理を終えた後、ステップＳ１５５では、選択手段１４５が、それまでに計算された区間毎の感情度に基づいて、編集コンテンツに含める区間の選択を実行する。ステップＳ１５４において、各区間の感情カテゴリについて感情度が図１６に示すように得られている。 After completing each process up to step S154, in step S155, the selection unit 145 selects a section to be included in the edited content based on the emotion level calculated for each section. In step S154, the emotion level is obtained for the emotion category of each section as shown in FIG.

選択の処理の一例について説明する。 An example of the selection process will be described.

編集コンテンツに含まれる区間は、例えば、感情度の高い区間を採用する。この方法について説明する。まず、図１６に示す、各区間の感情カテゴリと、感情度を参照する。このうち、例えば、“楽しい”感情カテゴリの感情度についてランキングし、最も高い値を持つ区間から順に、上限として定めた区間数に達するまで採用する。 As the section included in the edited content, for example, a section having a high emotion level is adopted. This method will be described. First, reference is made to the emotion category and emotion level of each section shown in FIG. Among these, for example, ranking is performed for the emotion level of the “fun” emotion category, and the ranking is adopted in order from the section having the highest value until the number of sections determined as the upper limit is reached.

“楽しい”以外の感情カテゴリの場合においても、同様に処理を実施すればよい。 In the case of emotion categories other than “fun”, the same process may be performed.

この上限となる区間数の決定の仕方としては、利用するシステムの仕様等に合わせて予め決定しておいてもよいし、利用者の要求を受付、所望の数を設定するのでもよい。 As the method of determining the upper limit section number, it may be determined in advance according to the specifications of the system to be used, or a user request may be accepted and a desired number may be set.

また、区間数の上限でなくとも、編集コンテンツの時間を上限として定めても良い。この場合は、各区間の時間を別途算出しておけば、採用した区間全体の時間を計算することができる。 Further, the time of the edited content may be set as the upper limit, not the upper limit of the number of sections. In this case, if the time for each section is calculated separately, the time for the entire adopted section can be calculated.

編集コンテンツの時間を上限とする場合にも、区間数の上限の場合と同様、利用するシステムの仕様等に合わせて予め決定しておいてもよいし、利用者の要求を受付、所望の時間を上限として設定するのでもよい。 Even when the upper limit is set for the edited content time, as with the upper limit of the number of sections, it may be determined in advance according to the specifications of the system to be used, etc. May be set as the upper limit.
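
The selection in step S155 can be sketched as follows: the sections are ranked by the emotion level of one category and adopted from the top until an upper limit on the number of sections, or on the total duration, is reached. The data layout and names are hypothetical, and stopping at the first section that would exceed the duration limit (rather than skipping it and trying shorter ones) is an assumption of this sketch.

```python
def select_sections(sections, category, max_count=None, max_duration=None):
    """Adopt sections in descending order of emotion level until a limit is reached."""
    ranked = sorted(sections, key=lambda s: s["levels"][category], reverse=True)
    chosen, total = [], 0.0
    for section in ranked:
        if max_count is not None and len(chosen) >= max_count:
            break
        duration = section["end"] - section["start"]
        if max_duration is not None and total + duration > max_duration:
            break
        chosen.append(section)
        total += duration
    return chosen


# Hypothetical sections with start/end times in seconds.
sections = [
    {"id": 1, "start": 0.0, "end": 4.0, "levels": {"fun": 0.2}},
    {"id": 2, "start": 4.0, "end": 10.0, "levels": {"fun": 0.9}},
    {"id": 3, "start": 10.0, "end": 13.0, "levels": {"fun": 0.6}},
]
print([s["id"] for s in select_sections(sections, "fun", max_count=2)])
print([s["id"] for s in select_sections(sections, "fun", max_duration=8.0)])
```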

次に、ステップＳ１５６では、生成手段１４６が、選択された区間を繋ぎ合せ、編集コンテンツを生成する。 Next, in step S156, the generation unit 146 connects the selected sections and generates edited content.

この繋ぎ合わせ方の例としては、例えば、元映像のタイムライン上の時刻が早い順に繋ぎ合わせるのでもよいし、例えば、感情度の高い区間又は低い区間から順に繋ぎ合わせるのでもよい。 As an example of this connection method, for example, connection may be performed in order from the earliest time on the timeline of the original video, or, for example, connection may be performed in order from a section having a higher emotion level or a section having a lower emotion level.

以上、この発明によるユーザ支援方法の、実施形態における方法の１例について詳細に説明した。 Heretofore, an example of the method in the embodiment of the user support method according to the present invention has been described in detail.

また、上記示した情報編集方法では、区間毎にどのような感情的状態にあるかを判定できる。これに基づいて、さらに、編集コンテンツに加工を加えても良い。 Further, in the information editing method described above, it is possible to determine what emotional state is in each section. Based on this, editing content may be further processed.

映像が入力されていた場合に、編集コンテンツが楽しい区間を多く含むものであれば、楽しい背景音楽を付与する、などはその一例である。 For example, if the edited content includes a lot of fun sections when video is input, a pleasant background music is given.

この方法の一例としては、編集コンテンツの、全区間数に対する“楽しい”感情度が「高」である区間数の割合を求め、この割合が一定以上であれば、背景音楽を付与するとしてもよい。 As an example of this method, the ratio of the number of sections of the edited content in which the “fun” emotion level is “high” with respect to the total number of sections is obtained, and if this ratio is a certain level or more, background music may be added. .
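
As an illustration of this ratio test, the sketch below counts the sections of the edited content whose “fun” emotion level is “high” and compares their share with a threshold. The 0.7 boundary for “high” follows the earlier example thresholds; the 0.5 ratio threshold and the function name are hypothetical.

```python
def should_add_fun_music(fun_levels, min_ratio=0.5, high_threshold=0.7):
    """Decide whether to add cheerful background music to the edited content."""
    if not fun_levels:
        return False
    high_count = sum(1 for p in fun_levels if p >= high_threshold)
    return high_count / len(fun_levels) >= min_ratio


# Hypothetical "fun" emotion levels of the sections in the edited content.
print(should_add_fun_music([0.8, 0.9, 0.1, 0.75]))
print(should_add_fun_music([0.8, 0.1, 0.1]))
```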

その他、本発明の実施形態として示した１例以外のものであっても、本発明の原理に基づいて取りうる実施形態の範囲においては、適宜その実施形態に変化しうるものである。 Other than the example shown as the embodiment of the present invention, the embodiment can be appropriately changed within the scope of the embodiment that can be taken based on the principle of the present invention.

以下では、本発明の具体的な実施例を説明する。 Hereinafter, specific examples of the present invention will be described.

本実施例は、本発明の実施形態の第１例に説明した機能を、ローカルＰＣで動作するアプリケーションソフトとして利用した場合の実施例である。 In this example, the function described in the first example of the embodiment of the present invention is used as application software that operates on a local PC.

In Example 1, a storage medium storing a program that executes the information editing method according to the present invention is installed in the local PC. The procedure is as follows.
[Procedure 1] The user specifies an arbitrary video or audio file and has the program read it.
[Procedure 2] The information editing program runs and, following the processing described with reference to FIG. 2, performs section division, emotion detection, determination of representative images, and icon assignment to the representative images, then lays the images out as shown in FIG. 17 and presents them to the user. The screen of FIG. 17 lays out the sections of a video file in timeline order.
[Procedure 3] The user previews the sections, decides which sections to adopt for the edited content by selecting their representative images, and notifies the program that editing is finished by pressing the edit start button.
[Procedure 4] The information editing program runs, joins the sections adopted by the user, and outputs a video or audio file of the edited content.
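The final joining step above can be illustrated as building an edit list from the sections the user selected. The sketch below is an assumption-laden illustration, not the patented implementation: the function name `build_edit_list`, the representation of sections as `(start, end)` tuples in seconds, and the merging of back-to-back sections into a single cut are all hypothetical choices made for the example.

```python
from typing import List, Set, Tuple

TimeRange = Tuple[float, float]  # (start, end) in seconds; assumed representation


def build_edit_list(sections: List[TimeRange], selected: Set[int]) -> List[TimeRange]:
    """Turn the user's selection into the time ranges to be joined.

    `sections` is assumed to be in timeline order, and `selected` holds the
    indices of the sections whose representative images the user picked.
    """
    edit_list: List[TimeRange] = []
    for index in sorted(selected):              # keep timeline order
        start, end = sections[index]
        if edit_list and edit_list[-1][1] == start:
            # This section begins exactly where the previous cut ends,
            # so extend that cut instead of adding a new one.
            edit_list[-1] = (edit_list[-1][0], end)
        else:
            edit_list.append((start, end))
    return edit_list
```

The resulting list of time ranges could then be handed to whatever video or audio tool performs the actual cutting and concatenation; that step is outside the scope of this sketch.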

Example 2 applies the functions described in the first example of the embodiment of the present invention as application software running on a server connected via the Internet.

The specific configuration of the apparatus 180 in Example 2 is described with reference to FIG. 18.

A server device 181 and a plurality of user terminals 182a, 182b, ... are connected via the Internet so that they can communicate with one another. The server device 181 includes a storage unit 183, a control unit 184, and a database 185. The storage unit 183 is a storage medium that stores the information processing method according to the present invention as a computer-readable information editing program. The control unit 184 includes a CPU 186, a RAM 187, a ROM 188, and the like, and executes the various arithmetic operations of server processing. The database 185 stores user identification information, the video content uploaded by each user, edited content, and the like. The procedure is as follows.
[Procedure 1] The user specifies an arbitrary video or audio file through a user terminal 182a, 182b, ... and uploads it to the server device 181.
[Procedure 2] When the server device 181 confirms that the video or audio file has been uploaded, the information editing program in the storage unit 183 runs and, following the processing described with reference to FIG. 2, performs section division, emotion detection, determination of representative images, and icon assignment to the representative images, then lays the images out as shown in FIG. 17 and presents them to the user who uploaded the content.
[Procedure 3] The user previews the sections, decides which sections to adopt for the edited content by selecting their representative images, and notifies the program that editing is finished by pressing the edit start button.
[Procedure 4] The information editing program runs, joins the sections adopted by the user, outputs a video or audio file of the edited content, and stores it in the database 185.

Example 3 applies the functions described in the second example of the embodiment of the present invention as application software running on a server connected via the Internet.

The specific configuration of the apparatus in Example 3 is the same as that of Example 2 shown in FIG. 18. The procedure is as follows.
[Procedure 1] The user specifies an arbitrary video or audio file through a user terminal 182a, 182b, ... and uploads it to the server device 181. At this time, the user also specifies the upper limit on the duration of the edited content and the emotion category of the sections to be adopted for the edited content.
[Procedure 2] When the server device 181 confirms that the video or audio file has been uploaded, the information editing program in the storage unit 183 runs and performs section division and emotion detection following the processing described with reference to FIG. 15.
[Procedure 3] From all the sections, sections are adopted in descending order of the emotion level of the emotion category specified by the user until the total length reaches the upper limit time.
[Procedure 4] The adopted sections are joined, and a video or audio file of the edited content is output and stored in the database 185.
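The automatic selection in the third step above amounts to a greedy pick by emotion level under a duration budget. The following Python sketch shows one way this could look. Several points are assumptions rather than statements from the text: the emotion level is treated as a numeric score, selection stops at the first section that would exceed the limit (rather than skipping it and trying shorter ones), and the chosen sections are put back in timeline order before joining. The names `Section` and `select_sections` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Section:
    """One divided section with per-category emotion levels (hypothetical)."""
    start: float                      # seconds
    end: float                        # seconds
    emotion_level: Dict[str, float]   # assumed numeric score per category

    @property
    def duration(self) -> float:
        return self.end - self.start


def select_sections(
    sections: List[Section], category: str, max_seconds: float
) -> List[Section]:
    """Adopt sections in descending order of emotion level for `category`
    until adding the next one would exceed `max_seconds`."""
    ranked = sorted(
        sections,
        key=lambda s: s.emotion_level.get(category, 0.0),
        reverse=True,
    )
    chosen: List[Section] = []
    total = 0.0
    for section in ranked:
        if total + section.duration > max_seconds:
            break                     # assumed stop rule: budget would be exceeded
        chosen.append(section)
        total += section.duration
    # Assumption: the joined output follows the original timeline order.
    return sorted(chosen, key=lambda s: s.start)
```

A variant that skips an oversized section and keeps looking for shorter ones would fill the time budget more fully; the text does not say which behavior is intended.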

Example 4 applies the functions described in the first example of the embodiment of the present invention as application software running on a server connected via the Internet. It differs from Example 2 in that a plurality of video or audio files are input and a single piece of edited content is ultimately generated.

The configuration of the apparatus in Example 4 is the same as that of the first example of the embodiment and is described with reference to FIG. 18. The procedure is as follows.
[Procedure 1] The user specifies one or more arbitrary video or audio files and uploads them to the server device 181.
[Procedure 2] When the server device 181 confirms that the video or audio files have been uploaded, the information editing program in the storage unit 183 runs and, following the processing described with reference to FIG. 2, performs section division, emotion detection, determination of representative images, and icon assignment to the representative images, then lays out and displays each video or audio file as shown in FIG. 19. FIG. 19 illustrates the case where three video or audio files have been input. In this example, the file being presented can be switched using the presentation plane switching tabs placed at the top of the presentation plane.
[Procedure 3] The user previews the sections, decides which sections to adopt for the edited content by selecting their representative images, and notifies the program that editing is finished by pressing the edit start button.
[Procedure 4] The information editing program runs, joins the sections adopted by the user from each video or audio file, outputs a video or audio file of the edited content, and stores it in the database 185.
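When several files are involved, the selection has to record which file each adopted section came from. The sketch below illustrates one possible bookkeeping scheme. It rests on assumptions the text does not settle: clips are joined in file upload order and, within a file, in timeline order, and duplicate picks are ignored. The function name `collect_clips` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

TimeRange = Tuple[float, float]   # (start, end) in seconds; assumed representation
Clip = Tuple[str, float, float]   # (file name, start, end)


def collect_clips(
    files: Dict[str, List[TimeRange]],
    picks: List[Tuple[str, int]],
) -> List[Clip]:
    """Resolve the user's picks across several files into an ordered clip list.

    `files` maps each uploaded file name to its sections in timeline order;
    the dict's insertion order is taken to be the upload order.
    `picks` holds (file name, section index) pairs in the order clicked.
    """
    upload_order = {name: position for position, name in enumerate(files)}
    # Drop duplicate picks, then order by file and by position in the timeline.
    unique_picks = sorted(set(picks), key=lambda p: (upload_order[p[0]], p[1]))
    return [(name, *files[name][index]) for name, index in unique_picks]
```

A different ordering rule, for example the order in which the user clicked the representative images, would only require changing the sort key.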

Needless to say, some or all of the functions of the means in the information editing apparatus of this embodiment can be implemented as a computer program and the present invention realized by executing that program on a computer, and the procedures of the information editing method of this embodiment can likewise be implemented as a computer program and executed by a computer. The program for realizing these functions on a computer can be recorded on a computer-readable recording medium, for example an FD (Floppy (registered trademark) Disk), MO (Magneto-Optical disk), ROM (Read Only Memory), memory card, CD (Compact Disk)-ROM, DVD (Digital Versatile Disk)-ROM, CD-R, CD-RW, HDD, or removable disk, and stored or distributed in that form. The program can also be provided over a network, such as the Internet or electronic mail.

FIG. 1 is a block diagram illustrating the configuration of the information editing apparatus according to the first example of the embodiment of the present invention.
FIG. 2 is a flowchart illustrating the flow of processing according to the first example of the embodiment of the present invention.
FIG. 3 is a block diagram illustrating the emotion detection means in the embodiment of the present invention.
FIG. 4 is a flowchart illustrating the flow of processing of the emotion detection means in the embodiment of the present invention.
FIG. 5 is a flowchart illustrating the audio feature extraction method in the embodiment of the present invention.
FIG. 6 is an explanatory diagram illustrating the representative image selection method in the embodiment of the present invention.
FIG. 7 is an explanatory diagram showing an example of a representative image when audio content is input in the embodiment of the present invention.
FIG. 8 is an explanatory diagram illustrating the processing rules for representative images in the embodiment of the present invention.
FIG. 9 is an explanatory diagram illustrating a method of assigning icons to representative images in the embodiment of the present invention.
FIG. 10 is an explanatory diagram showing an example of the layout method in the embodiment of the present invention.
FIG. 11 is an explanatory diagram showing another example of the layout method in the embodiment of the present invention.
FIG. 12 is an explanatory diagram showing another example of the layout method in the embodiment of the present invention.
FIG. 13 is an explanatory diagram showing the layout method when a plurality of video or audio contents are input in the embodiment of the present invention.
FIG. 14 is a block diagram illustrating the configuration of the information editing apparatus according to the second example of the embodiment of the present invention.
FIG. 15 is a flowchart illustrating the flow of processing according to the second example of the embodiment of the present invention.
FIG. 16 is an explanatory diagram illustrating the association between emotion categories and emotion levels for each section in the embodiment of the present invention.
FIG. 17 is an explanatory diagram showing the layout screen according to Examples 1 to 3 of the present invention.
FIG. 18 is a block diagram illustrating the configuration of the apparatus according to Examples 2 to 4 of the present invention.
FIG. 19 is an explanatory diagram showing the layout screen according to Example 4 of the present invention.

Explanation of symbols

1, 140 ... information editing apparatus; 3, 14, 144 ... emotion detection means; 11, 141 ... content input unit; 12, 142 ... storage device; 13, 143 ... section dividing means; 15 ... image output means; 16 ... image processing means; 17 ... layout means; 18 ... playback means; 19, 146 ... generation means; 31 ... feature extraction means; 32 ... emotion probability calculation means; 33 ... emotion level calculation means; 34 ... emotion determination means; 145 ... selection means; 181 ... server device; 182a, 182b ... user terminals; 183 ... storage unit; 184 ... control unit; 185 ... database.

Claims

A device for editing one or more video or audio,
Section dividing means for dividing one or more video or audio into predetermined section units;
Emotion detection means for determining the probability of one or more emotion expression states using one or more of video information and audio information for each of the divided sections;
Automatic generation means for generating video or audio by connecting one or more of the sections based on the probability;
An information editing apparatus comprising:

A device for editing one or more video or audio,
Section dividing means for dividing one or more video or audio into predetermined section units;
Emotion detection means for determining the probability of one or more emotion expression states using one or more of video information and audio information for each of the divided sections;
Output means for outputting, for each of the divided sections, a representative image, video or audio, or both a representative image and video or audio;
Image processing means for executing one or more of processing and information provision on the representative image based on the probability;
Layout means for spatially laying out the processed representative image and outputting it to a user;
Generating means for connecting the sections corresponding to at least one representative image selected by the user to generate video or audio;
An information editing apparatus comprising:

A method of editing one or more video or audio comprising:
A section dividing step in which the section dividing means divides one or more video or audio into predetermined section units;
An emotion detecting step for determining the probability of one or more emotion expression states using one or more of video information and audio information for each of the divided sections;
An automatic generation step in which the automatic generation means connects one or more of the sections based on the probability to generate video or audio;
An information editing method comprising:

A method of editing one or more video or audio comprising:
A section dividing step in which the section dividing means divides one or more video or audio into predetermined section units;
An emotion detecting step for determining the probability of one or more emotion expression states using one or more of video information and audio information for each of the divided sections;
An output step in which the output means outputs, for each of the divided sections, a representative image, video or audio, or both a representative image and video or audio;
An image processing step in which the image processing means executes one or more of processing and information addition on the representative image based on the probability;
A layout step in which the layout means spatially lays out the processed representative image and presents it to the user;
A generating step of generating video or audio by connecting the sections corresponding to at least one representative image selected by a user;
An information editing method comprising:

An information editing program for causing a computer to function as each means according to claim 1 or 2.

A computer-readable recording medium on which the information editing program according to claim 5 is recorded.