JP2008217447A

JP2008217447A - Content generation device and content generation program

Info

Publication number: JP2008217447A
Application number: JP2007054295A
Authority: JP
Inventors: Narichika Hamaguchi; 斉周浜口; Hiroyuki Kaneko; 浩之金子; Seiki Inoue; 誠喜井上; Mamoru Doke; 守道家
Original assignee: Nippon Hoso Kyokai NHK; NHK Engineering Services Inc; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2007-03-05
Filing date: 2007-03-05
Publication date: 2008-09-18
Anticipated expiration: 2027-03-05
Also published as: JP4917920B2

Abstract

PROBLEM TO BE SOLVED: To generate content to which the staging best suited to a human voice is added.

SOLUTION: A content generation device generates content in which a staging target performs predetermined staging set on the basis of emotion information obtained from an input human voice. The device has reading means for reading the script of previously generated content unit by unit, recording means for recording the user's voice, emotion estimation means for estimating emotion from the voice recorded by the recording means, staging means for extracting preset staging information corresponding to the emotion information (the type and intensity of the emotion) obtained by the emotion estimation means, and script generation means for replacing the script read by the reading means with a script generated from the staging information obtained by the staging means.

COPYRIGHT: (C)2008, JPO & INPIT

Description

The present invention relates to a content generation device and a content generation program, and more particularly to a content generation device and a content generation program that generate content to which the staging best suited to a real voice is added, based on emotion information obtained from the real voice data.

Conventionally, when a video content creator produces video content such as an informational program on news or sports, there have been mechanisms in which a script is written in advance in a predetermined description format and input to dedicated software or the like, which then generates video showing the actions of virtual objects such as CG (Computer Graphics) characters in a virtual space, together with the camera work, according to the script (see, for example, Patent Document 1).

The technique disclosed in Patent Document 1 produces programs using TVML (TV program Marking Language). TVML is an object-based description language for producing television programs. It describes the video and audio of a program separately as materials and a script (staging content). Once a program script has been written, software running on a personal computer or the like reads it, and the result can immediately be viewed (presented) as a television program. With TVML, the actions of CG characters and the camera work can be specified each time and controlled in an ad-lib manner.

In program production using TVML, a program is produced from the program script prepared by the producer, the program production engine (APE: Automatic Production Engine) described in that script, the material data used in the program, and so on. The program production engine predefines events such as "title display", "zoom in", and "CG character action" for each CG character (program host, performers, and so on) appearing in the program and for each unit of action in the program. Using this engine, programs of a given genre such as news, variety, sports, or drama can be produced efficiently.

When video content for a television program, for example, is generated using CG, a real voice is often used instead of synthesized speech to ensure the quality of the speech of the CG characters who act as performers. In that case, the performers' lines are recorded in advance, and CG video content is generated by manually adding the CG characters' facial expressions and behavior to match those voices. Alternatively, CG video with the characters' expressions and behavior already added is generated first, and the real voice is then assigned while the CG video is played back, by so-called after-recording (dubbing).

As another approach, there is a method in which CG video content is first created with synthesized speech produced by a speech synthesis application, and the synthesized speech is then replaced with a real voice while the CG video content is played back (see, for example, Patent Document 2).

Cited patent documents (in order): JP 2005-318254 A; JP 2004-71013 A

However, in the prior art described above, when the real voice is recorded first and the CG character's expressions and gestures are added after listening to it, the lines cannot easily be changed. Moreover, behavior (staging) such as expressions and gestures is added by hand to the staging target, such as a character, in accordance with the lines, which takes time and effort.

When the CG character's expressions and gestures are decided first, the real voice must be assigned to fit them, but the CG video content with the actual lines cannot be viewed until the voice has been assigned. Here too, the lines cannot easily be changed, and the staging still has to be added to the staging target by hand.

With the prior-art technique of Patent Document 2, CG video content can be created in advance with synthesized speech, so the entire content can be checked beforehand even after the lines are changed. However, in the after-recording performed once the synthesized-speech content has been generated, giving the staging target expressions and gestures that suit the dubbed real voice still requires manual adjustment while listening to the dubbed audio.

The present invention has been made in view of the above problems, and its object is to provide a content generation device and a content generation program that generate content to which the staging best suited to a real voice is added, based on emotion information obtained from the real voice data.

To solve the above problems, the present invention adopts means having the following features.

The invention of claim 1 is a content generation device that generates content in which a staging target performs predetermined staging set on the basis of emotion information obtained from input real voice data. The device comprises: reading means for reading the script of previously generated content one predetermined unit at a time; recording means for recording the user's real voice; emotion estimation means for estimating emotion from the real voice data recorded by the recording means; staging means for extracting preset staging information corresponding to the emotion information, consisting of the type and strength of the emotion, obtained by the emotion estimation means; and script generation means for replacing the script read by the reading means with a script generated on the basis of the staging information obtained by the staging means.

According to the invention of claim 1, content to which the staging best suited to a real voice is added can be generated on the basis of emotion information obtained from the real voice data. This also allows a program with appropriate staging to be produced quickly.

The invention of claim 2 is characterized in that the staging means includes, as the staging information, information on the facial expression or action of the staging target.

According to the invention of claim 2, the staging target can be made to perform staging that accurately reflects the emotion information obtained from the real voice.

The invention of claim 3 is characterized in that, when setting the staging information, the staging means sets a lower limit and an upper limit for the facial expression or action of the staging target for each type and strength of emotion, and, when extracting the staging information, determines a value at random between the lower limit and the upper limit.

According to the invention of claim 3, the staging target is not limited to one specific action; its actions can vary within a certain range. This enables staging that does not bore users.
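The range-based selection of claim 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the table contents, the parameter names (`smile`, `nod`, `frown`), the 0.0 to 1.0 scale, and the function name `extract_staging` are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    lower: float  # lower limit set in advance for this parameter
    upper: float  # upper limit set in advance for this parameter


# Hypothetical settings: (emotion type, strength level) -> parameter ranges.
STAGING_TABLE = {
    ("joy", 1): {"smile": Range(0.2, 0.4), "nod": Range(0.0, 0.2)},
    ("joy", 2): {"smile": Range(0.5, 0.9), "nod": Range(0.3, 0.6)},
    ("anger", 1): {"frown": Range(0.3, 0.6)},
}


def extract_staging(emotion, strength, rng=random):
    """Pick each parameter at random between its lower and upper limit."""
    ranges = STAGING_TABLE[(emotion, strength)]
    return {name: rng.uniform(r.lower, r.upper) for name, r in ranges.items()}


if __name__ == "__main__":
    print(extract_staging("joy", 2))
```

Because every call draws fresh values inside the configured limits, two utterances with the same estimated emotion need not produce identical motion, which is the variation the claim describes.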

The invention of claim 4 is characterized in that the recording means has playback means that plays back the previously generated content so that the user's real voice can be recorded, and stops the content immediately before the real voice is to be recorded.

According to the invention of claim 4, the real voice can be recorded at each required point while the script is played back. Because the user records the voice while watching the played-back content, the real voice can be input with an emotion that suits the flow of the content.

The invention of claim 5 is a content generation program for causing a computer to execute content generation processing that generates content in which a staging target performs predetermined staging set on the basis of emotion information obtained from input real voice data. The program causes the computer to execute: reading processing for reading the script of previously generated content one predetermined unit at a time; recording processing for recording the user's real voice; emotion estimation processing for estimating emotion from the real voice data recorded by the recording processing; staging processing for extracting preset staging information corresponding to the emotion information, consisting of the type and strength of the emotion, obtained by the emotion estimation processing; and script generation processing for replacing the script read by the reading processing with a script generated on the basis of the staging information obtained by the staging processing.

According to the invention of claim 5, content to which the staging best suited to a real voice is added can be generated on the basis of emotion information obtained from the real voice data. This also allows a program with appropriate staging to be produced quickly. Furthermore, content generation can be realized easily by installing the executable program on a computer.

According to the present invention, content to which the staging best suited to a real voice is added can be generated on the basis of emotion information obtained from the real voice data.

<Overview of the present invention>
In the present invention, when a real voice (live voice) is used through after-recording in generating content that contains, for example, CG video for a television program, behavior (staging) such as facial expressions and gestures is automatically added to a staging target such as a CG character in accordance with the emotion of the real voice.

Specifically, the present invention relates to the field of video content production, mainly for television programs. When video content for a television program is produced using CG, a real voice is sometimes used to make the speech of a CG character, a performer in the content, more realistic. In such cases, the invention uses the emotion information carried by the real voice to automatically add CG character expressions and gestures that match the emotion of that voice.

Preferred embodiments of the content generation device and content generation program of the present invention having the above features are described in detail below with reference to the drawings.

In this embodiment, a program is used as an example of content, and a CG character appearing in the program is used as an example of a staging target. TVML is used as an example of the script used to generate and present the program. The present invention is not limited to TVML, and other representation formats may be used.

<Content generation device: functional configuration example>
FIG. 1 shows an example configuration of the content generation device. The content generation device 10 shown in FIG. 1 comprises input means 11, output means 12, storage means 13, script generation means 14, recording processing means 15, emotion estimation means 16, staging means 17, playback means 18, transmission/reception means 19, and control means 20.

The input means 11 accepts the various inputs needed to operate the device from users, producers, and others, such as content generation instructions, voice input instructions, voice (real voice), emotion estimation instructions, script generation instructions, and playback instructions. The input means 11 consists of, for example, a keyboard, a pointing device such as a mouse, and a voice input device such as a microphone.

The output means 12 displays the instructions entered through the input means 11, the recorded audio, the content generated from the instructions, edits to the video and audio of that content, and the editing results, and outputs audio data such as the recorded real voice. The output means 12 consists of a display, speakers, and the like.

The storage means 13 stores input audio; multiple items of material data made up of images, video, audio, text data, and other data used to generate content; generated content; the emotions estimated by the emotion estimation means 16; the staging settings of the staging means 17 (for example, CG character staging content); the script generation results of the script generation means 14; and so on.

The storage means 13 also stores the TVML script corresponding to the program script, APE scripts serving as program production engines, TVML scripts for program setup that can be used when generating programs and dialogue scenes, program generation data (templates), dialogue scene generation data (templates) for realizing dialogue scenes with characters appearing in the program, a dialogue response dictionary (response scripts), and so on.

The program setup TVML script specifies the program's initial settings, such as which studio set, props, lighting, performers (including the relationships among performers: gender, age, personality, and occupation such as singer, commentator, or comedian), and audio are used. The dialogue response dictionary stores, for example, answer data anticipated for inquiries from viewers. The storage means 13 may therefore be a collection of text, image, and other information, such as a database, organized systematically so that the stored information can be searched on the basis of an inquiry.

The storage means 13 can also acquire the various data described above, via the transmission/reception means 19, from an external device or the like connected to a communication network typified by the Internet or a communication line.

The script generation means 14 takes as input a previously stored source TVML script that uses synthesized speech. It inserts a staging TVML script, which describes staging such as the behavior of the CG characters in the original content, immediately before the TVML command in the source script that instructs a CG character to speak with synthesized speech. The script generation means 14 also rewrites the TVML script so that the CG character speaks with the pre-recorded real voice instead of the synthesized speech.

That is, for the staging script obtained through the emotion estimation means 16, the script generation means 14 proceeds as follows: if staging for the CG character has already been set, it replaces the TVML script of that staging with the newly generated script; if no staging has been set, it adds the generated script. In either case, the lines use the voice data of the real voice recorded by the user or the like.
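The replace-or-add rule can be sketched in Python as follows. This is an illustration only: the script is modelled as a list of strings, and the command strings (`expression(...)`, `gesture(...)`, `talk(...)`) are hypothetical placeholders rather than verified TVML syntax. The function name `apply_staging` and the `is_staging` predicate are likewise assumptions.

```python
def apply_staging(script_lines, talk_index, new_staging, is_staging):
    """Return a new script with staging placed directly before the talk command.

    Existing staging lines immediately preceding the talk command are replaced;
    if there are none, the new staging lines are simply inserted.
    """
    start = talk_index
    # Walk backwards over any staging lines already sitting before the talk.
    while start > 0 and is_staging(script_lines[start - 1]):
        start -= 1
    return script_lines[:start] + list(new_staging) + script_lines[talk_index:]


def is_staging(line):
    # Hypothetical test for a staging command.
    return line.startswith(("expression(", "gesture("))


if __name__ == "__main__":
    script = ["camera(closeup)", "expression(neutral)", "talk(text='fun')"]
    print(apply_staging(script, 2, ["expression(smile)", "gesture(wave)"], is_staging))
```

Handling both cases with one slice keeps the talk command and everything after it untouched, so the recorded voice can be attached to the same command afterwards.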

The recording processing means 15 records predetermined words or sentences spoken by the user or the like (real voice, etc.) and captured through input means such as a microphone. It generates file data in a predetermined file format (for example, WAVE, AIFF, MP3, au, WMA, ram, or AAC) and outputs the generated audio data to the storage means 13, the script generation means 14, and so on.

The emotion estimation means 16 estimates the emotion of the real voice (a live human voice, etc.) recorded by the recording processing means 15. The behavior (staging) of the CG character is set for preset conditions such as the estimation result (emotion type, strength, etc.) of the emotion estimation means 16.

The emotion estimation means 16 takes real voice data such as recorded lines as input and outputs, as the estimation result, the type of emotion in the voice (for example, neutral, anger, joy, or sadness) and its strength (degree or level). The emotion type and strength can be estimated by analysis with a preset emotion estimation engine (for example, ST). The specific estimation method used by the emotion estimation means 16 is described later.

Based on the estimation result obtained by the emotion estimation means 16, the staging means 17 selects, from the data stored in the storage means 13, a TVML script that expresses behavior such as the facial expressions and gestures of a CG character appearing in the content, and outputs it to the script generation means 14. The staging means 17 also accepts settings through an input form such as a GUI, so that the user can easily configure staging for each type of emotion obtained as an estimation result and for each strength. This allows appropriate staging to be entered easily and efficiently. The staging settings of the staging means 17 are described in detail later.
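The hand-off from emotion estimation to staging selection can be sketched in Python as follows. This is a sketch under assumptions: the `EmotionEstimate` record, the stored script fragments, and the fallback to an empty list for unconfigured emotions are all hypothetical, and the fragment strings are placeholders rather than real TVML.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmotionEstimate:
    kind: str      # emotion type, e.g. "neutral", "anger", "joy", "sadness"
    strength: int  # strength level reported by the estimation engine


# Hypothetical stored data: (type, strength) -> staging script fragment.
STORED_SCRIPTS = {
    ("joy", 1): ["expression(smile)"],
    ("joy", 2): ["expression(laugh)", "gesture(raise_arms)"],
    ("sadness", 1): ["expression(sad)"],
}


def select_staging_script(estimate, store=STORED_SCRIPTS):
    """Look up the staging fragment configured for the estimated emotion.

    Returns an empty list when nothing is configured (an assumption made here;
    the source does not say what happens in that case).
    """
    return list(store.get((estimate.kind, estimate.strength), []))


if __name__ == "__main__":
    print(select_staging_script(EmotionEstimate("joy", 2)))
```

Keying the table on both type and strength mirrors the per-emotion, per-strength settings that the user enters through the input form.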

The playback means 18 reads the previously generated source TVML script, plays back the video up to the point immediately before the user or the like is to record the real voice, and stops there. "Immediately before recording the real voice" means, for example, immediately before a part of the source TVML script where synthesized speech is output, but it is not limited to this. The user can thus record the real voice at each required point with the recording processing means 15 while the script is played back. Because the user records the voice while watching the played-back content, the real voice can be input with an emotion that suits the flow of the content.

The playback means 18 also plays back the program on the basis of the final TVML script of this embodiment, obtained from the script generation means 14 and including the CG character staging, together with the corresponding audio data of the real voice.

The playback means 18 has functions such as those of a TVML player. A TVML player is software that reads a script written in TVML and outputs the video and audio of the program in real time. A TVML player generates a studio set with real-time CG and shows CG characters, such as the program host and performers in the CG studio set, speaking the lines written in the TVML script and acting.

A TVML player can also generate, in real time, video playback, title display and superimposition using text fonts and images, BGM playback from audio data files, spoken narration, and so on, to create the video and audio of a program. By using these functions, the playback means 18 can play back programs and dialogue scenes efficiently and accurately.

The transmission/reception means 19 is a communication interface for receiving source TVML scripts, staging settings, pre-recorded real voice data, emotion estimation engines, material data, APEs, various scripts, and so on from external devices connected via a communication network typified by the Internet or a communication line. The transmission/reception means 19 can also transmit the TVML scripts and programs generated by this embodiment, recorded real voice data, newly set staging content, and so on to other external devices via the communication network.

The control means 20 controls all the functional components of the content generation device 10. Specifically, based on the user's input received through the input means 11, the control means 20 performs various control processes: recording the real voice, estimating emotion from the acquired real voice data, setting staging such as CG character behavior from the estimation result, generating a script from the set staging, and playing back the generated script.

As a result, content to which the staging best suited to a real voice is added can be generated on the basis of emotion information obtained from the real voice data. Specifically, during after-recording, for example, staging that expresses CG character expressions and gestures suited to the emotion of the real voice can be added automatically on the basis of the emotion information carried by the voice. In content generation, therefore, simply replacing the CG character's synthesized-speech lines with a real voice by after-recording yields CG video content to which optimal staging such as behavior has been added automatically, without manual work.

<Overview of content processing>
Next, an overview of the processing up to content generation with the content generation device 10 described above is given with reference to the drawings. FIG. 2 shows an example overview of the content generation processing.

In the example of FIG. 2, when content in which CG characters converse using synthesized speech is input as the source TVML script, the script generation means 14 performs script reading processing 31 and script replacement/addition processing 32.

Specifically, the script generation means 14 reads the source script, in which the performers speak with synthesized speech, as character strings one predetermined unit (for example, one line or one command) at a time. When the script read so far contains a TVML command instructing a CG character to speak with a given synthesized voice, the recording processing of the recording processing means 15 has the TVML script up to that point played back by the TVML player serving as the playback means 18. Playback stops immediately before the synthesized speech, and a person speaks the line that the synthesized voice would have spoken into the microphone serving as the input means 11, so that the real voice is recorded.
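The read, play, stop, and record cycle described here can be sketched in Python as follows. The sketch assumes one command per line and injects playback and recording as plain callables; the `talk(` marker, the function names, and the returned file names are hypothetical stand-ins for the TVML player and the recording processing means.

```python
def is_talk(line):
    # Hypothetical test for a "speak with synthesized voice" command.
    return line.startswith("talk(")


def dub_script(lines, play, record):
    """Walk the script unit by unit and record a voice at every talk command.

    `play(pending)` plays the lines read since the previous stop and halts
    just before the speech; `record(line)` captures the user's voice for that
    line and returns a reference to the recording (e.g. a file name).
    """
    recordings = []
    pending = []
    for index, line in enumerate(lines):
        if is_talk(line):
            play(pending)  # playback stops right before the speech
            pending = []
            recordings.append((index, record(line)))
        else:
            pending.append(line)
    return recordings


if __name__ == "__main__":
    played = []
    script = ["camera(closeup)", "talk(text='fun')", "gesture(wave)", "talk(text='bye')"]
    result = dub_script(script, played.append, lambda line: f"voice_{len(played)}.wav")
    print(result, played)
```

Returning the index of each talk command alongside its recording gives a later step what it needs to attach the voice file and the generated staging to the right place in the script.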

＜録音処理の具体例＞ <Specific example of recording process>
ここで、録音処理手段１５における録音処理の具体例について図を用いて説明する。図３は、録音処理の内容を説明するための一例を示す図である。なお、図３（ａ）は、録音処理を行うための音声入力フォームの一例であり、図３（ｂ）は、元となるＴＶＭＬスクリプトの再生画面の一例を示している。 Here, a specific example of the recording process in the recording processing means 15 will be described with reference to the drawings. FIG. 3 is a diagram illustrating an example for explaining the contents of the recording process. FIG. 3A shows an example of a voice input form for performing a recording process, and FIG. 3B shows an example of a playback screen of the original TVML script.

図３（ａ）に示す音声入力フォーム４０は、メニュー表示領域４１と、録音編集領域４２と、ボタン領域４３とを有するよう構成されている。また、図３（ｂ）には、再生画面５０にＣＧキャラクタ５１が表示され、合成音声のセリフを喋る直前の動作で停止状態となっている。 The voice input form 40 shown in FIG. 3A is configured to have a menu display area 41, a recording / editing area 42, and a button area 43. Also, in FIG. 3B, a CG character 51 is displayed on the playback screen 50, and is in a stopped state immediately before the speech of the synthesized speech is spoken.

メニュー表示領域４１は、元のＴＶＭＬファイルや、既に録音された肉声データファイルを読み出したり、読み出したＴＶＭＬファイルのファイル名を表示したり、肉声を録音した肉声データにファイル名を付けて蓄積手段１３等の所定の記憶領域に蓄積したり、所定のファイルに関する各種オプションを設定する。 The menu display area 41 is used to read the original TVML file or an already recorded real-voice data file, to display the file name of the TVML file that was read, to give a file name to recorded real-voice data and store it in a predetermined storage area such as the storage means 13, and to set various options for a given file.

また、録音編集領域４２は、図３（ｂ）示すような「楽しい」というセリフを合成音声で発音させているＣＧキャラクタに対して、図２に示すように使用者３３等は、マイク等の入力手段１１を用いて、録音編集領域４２にある録音ボタン４４を押して所定時間内に肉声によるセリフ（図３においては、「楽しい」）の入力を行う。このとき、録音ボタン４４を押すと、図３（ｂ）に示す画面のＣＧキャラクタも動作させることもできるため、口の動きやジェスチャー等を見ながら適切なタイミングで最適な肉声の入力を行うことができる。 In the recording/editing area 42, for a CG character that utters the line "fun" in synthesized speech as shown in FIG. 3B, the user 33 or the like shown in FIG. 2 presses the record button 44 in the recording/editing area 42 and, using the input means 11 such as a microphone, inputs the line (in FIG. 3, "fun") in a real voice within a predetermined time. Pressing the record button 44 can also set the CG character on the screen of FIG. 3B in motion, so the user can watch its mouth movements, gestures, and the like and input the real voice at an appropriate timing.

更に、予め設定された合成音声の音声出力時間は、ステータス表示領域４５に表示されるため、使用者等は録音する肉声の長さを容易に把握することができ、長さを調整することができる。 Furthermore, since the voice output time of the synthesized voice set in advance is displayed in the status display area 45, the user or the like can easily grasp the length of the real voice to be recorded and can adjust the length. it can.
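The length comparison described here can be illustrated with a short Python sketch. It reads the playing time of a recorded WAVE file with the standard `wave` module and compares it with the output time of the synthesized speech. The function names and the 0.5-second tolerance are assumptions made for this example; the description does not say how the length is measured or adjusted.

```python
import wave


def wave_duration_seconds(path):
    """Return the playing time of a WAVE file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def length_difference(recorded_path, synthesized_seconds):
    """Return recorded length minus synthesized-speech length, in seconds."""
    return wave_duration_seconds(recorded_path) - synthesized_seconds


def fits_synthesized_length(recorded_path, synthesized_seconds, tolerance=0.5):
    """True if the recording is within `tolerance` seconds (an assumed value)."""
    return abs(length_difference(recorded_path, synthesized_seconds)) <= tolerance
```

A form such as the one in FIG. 3A could use a check of this kind to tell the user whether the recorded line is noticeably longer or shorter than the synthesized one.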

また、録音編集領域４２では、録音した結果を音や映像によりプレビューしたり、録音をキャンセルしたり、元のＴＶＭＬスクリプトに含まれる次のセリフ（次の行）について録音をしたり、キャプション（タイトル、説明文等）等を付けるか否かの設定等を行うことができる。 In the recording/editing area 42, the user can also preview the recorded result as sound or video, cancel a recording, record the next line of dialogue (the next line) in the original TVML script, and set whether to attach captions (a title, a description, and so on).

ボタン領域４３は、録音処理を中止する「中止ボタン」、録音した音声を全てプレビューする「全てプレビューボタン」、録音した音声ファイルを例えばＷＡＶＥファイル形式等の所定のファイル形式で蓄積手段１３等の所定の領域に保存する「保存ボタン」、音声ファイルを圧縮して保存する「圧縮して保存ボタン」等を有している。なお、ボタン領域４３には、予備領域や各種データ表示領域等も含まれている。 The button area 43 has a "stop" button that aborts the recording process, a "preview all" button that previews all of the recorded audio, a "save" button that saves the recorded audio file in a predetermined file format such as the WAVE file format to a predetermined area such as the storage means 13, a "compress and save" button that compresses the audio file before saving it, and so on. The button area 43 also includes a spare area and areas for displaying various data.

なお、上述する録音処理においては、１行分の録音が終了した場合に、例えば「録音が完了しました」とダイアログメッセージを表示したり、ＴＶＭＬスクリプトファイル全体のアフレコが終了した場合には、例えば「収録作業が完了しました。」等とダイアログメッセージを表示したり、アフレコ後のファイルを保存した場合には、例えば「ファイルを保存しました」といったダイアログを表示するようにしてもよい。これにより、使用者３３等は、対応する画面を参照しながら、高精度に肉声を入力することができる。 In the recording process described above, when recording for one line is completed, for example, a dialog message “Recording is completed” is displayed, or after the dubbing of the entire TVML script file is completed, for example, When a dialog message such as “Recording has been completed.” Is displayed, or when the post-recording file is saved, a dialog such as “File saved” may be displayed. Thereby, the user 33 etc. can input the voice with high accuracy while referring to the corresponding screen.

次に、図２では、マイク等の入力手段１１により入力される録音したセリフ等の音声データのファイル（例えば、ＷＡＶＥファイル）を感情推定手段１６に入力して感情推定処理により解析を行い、肉声の感情の種類（平常、怒り、喜び、悲しみ）と、その強さ等の感情情報を出力する。 Next, in FIG. 2, a file of audio data (for example, a WAVE file) such as a recorded line input through the input means 11 such as a microphone is input to the emotion estimation means 16 and analyzed by the emotion estimation process, which outputs emotion information such as the type of emotion in the real voice (normal, anger, joy, sadness) and its strength.

＜感情推定処理について＞ <Emotion estimation process>
ここで、感情推定処理について説明する。感情推定処理では、例えば入力される肉声等の音声データから、その音声の強度やテンポ、抑揚等を検出し、検出された強度、テンポ、及び抑揚等の時間軸方向の変化量のパターンと、そのパターンに関連づけて予め蓄積されている感情状態とに基づいて対応する感情状態を出力する。 Here, the emotion estimation process is described. In the emotion estimation process, for example, the intensity, tempo, intonation, and so on of the voice are detected from input audio data such as a real voice, and the corresponding emotional state is output on the basis of the pattern of change along the time axis in the detected intensity, tempo, intonation, and so on, together with the emotional states stored in advance in association with such patterns.
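As a loose illustration of matching a detected change pattern against stored patterns, the Python sketch below picks the stored pattern closest to the detected one. Nearest-neighbour matching by Euclidean distance is a technique chosen for this example, not one stated in the description, and the pattern values, the three-element feature layout, and the function name are all invented.

```python
import math

# Hypothetical stored patterns: (intensity change, tempo change, intonation
# change), each associated with an emotional state. The numbers are invented.
STORED_PATTERNS = {
    "normal": (0.0, 0.0, 0.0),
    "anger": (0.8, 0.5, 0.3),
    "joy": (0.5, 0.6, 0.8),
    "sadness": (-0.6, -0.5, -0.4),
}


def estimate_emotion(change_pattern, stored=STORED_PATTERNS):
    """Return the emotion whose stored pattern is nearest to the detected one."""
    return min(stored, key=lambda emotion: math.dist(change_pattern, stored[emotion]))
```

A real engine would also have to derive the change pattern from the audio and report a strength level; both steps are outside this sketch.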

また、その他にも、例えば声の音量、声の波形、声のピッチ、又は音韻等の音声認識を行い、その結果と上述の声の条件に対応して予め設定した閾値とを比較することで、その人の感情を推定する手法や、発言内容に対して形態素解析を行い、その音声認識の結果から予め設定された感情辞書を用いて発話の感情を推定する手法等を用いることができる。 In addition, for example, voice recognition such as voice volume, voice waveform, voice pitch, or phoneme is performed, and the result is compared with a preset threshold value corresponding to the above-described voice conditions. It is possible to use a method for estimating the emotion of the person, a method for performing morphological analysis on the content of the utterance, and estimating the emotion of the utterance using a preset emotion dictionary from the result of the speech recognition.

なお、感情推定処理では、例えば予め設定された感情推定エンジンを用いて肉声に対する感情を推定することができる。ここで、感情推定エンジンとしては、例えばＳＴ（ＳｅｎｓｉｂｉｌｉｔｙＴｅｃｈｎｏｌｏｇｙ：感性制御技術）を用いることができる。ＳＴは、コンピュータに人の感性情報を理解させ、反応させるというソフトウェア技術であり、具体的には、人の発話から得られる話者の感情情報（例えば、怒り、喜び、悲しみ、平常、笑い、興奮等）を、音声認識等を通じて得られたワード情報に付加することで、例えば話者の感情推移に応じた連続応答シナリオシステム等の構築が可能である。 In the emotion estimation process, the emotion in a real voice can be estimated using, for example, a preset emotion estimation engine. ST (Sensibility Technology) can be used as such an engine, for example. ST is a software technology that lets a computer understand and respond to human sensibility information. Specifically, by attaching the speaker's emotion information obtained from an utterance (for example, anger, joy, sadness, normal, laughter, excitement) to the word information obtained through speech recognition or the like, it becomes possible to build, for example, a continuous-response scenario system that follows the speaker's emotional transitions.

また、上述した感情推定手法は、既に周知の技術であり、例えば特開平５−１２０２３号公報、特開平９−２２２９６号公報、特開平１１−１１９７９１号公報、特開２００２−９１４８２号公報等に示されている。 The emotion estimation methods described above are already well-known techniques and are shown in, for example, Japanese Patent Laid-Open Nos. 5-12023, 9-22296, 11-119791, and 2002-91482.

図２において、上述した感情推定手段１６による感情推定処理により推定された感情情報に基づいて、推定される感情情報（感情の種別、感情の強さ等）を出力する。出力された感情情報は、演出手段１７による設定処理により、蓄積手段１３に予め蓄積されたＣＧキャラクタの演出スクリプト群の中から、感情情報に対応した振る舞いを行うためのＴＶＭＬスクリプトを取得する。 In FIG. 2, the emotion information (type of emotion, strength of emotion, and so on) estimated by the emotion estimation process of the emotion estimation means 16 described above is output. Using the output emotion information, the setting process of the effect means 17 obtains, from the group of CG character effect scripts stored in advance in the storage means 13, a TVML script for performing the behavior that corresponds to the emotion information.

＜演出設定＞ <Effect setting>
ここで、ＣＧキャラクタ演出ＤＢ等の蓄積手段１３に蓄積されているＣＧキャラクタの演出の設定手法について図を用いて説明する。図４は、演出設定の内容を説明するための一例の図である。また、図５は、図４に対応する演出設定入力フォームにより設定された演出内容の一例を示す図である。 Here, the method of setting the CG character effects stored in the storage means 13, such as a CG character effect DB, is described with reference to the drawings. FIG. 4 is a diagram showing an example for explaining the contents of the effect setting. FIG. 5 is a diagram showing an example of the effect contents set through the effect setting input form corresponding to FIG. 4.

演出手段１７における演出設定処理は、感情推定手段１６における感情推定処理により得られる結果に基づいて、ＣＧキャラクタを対象にして、表情や顔の角度（上を向く、下を向く等）、うなずき、首を横に振る等の振る舞いを生成するＴＶＭＬスクリプトを推定結果（例えば、感情の種別や感情の強さ等）に合わせてそれぞれ設定する。 The effect setting process in the effect means 17 is based on the result obtained by the emotion estimation process in the emotion estimation means 16, with a facial expression and a face angle (facing up, facing down, etc.), nodding, A TVML script that generates a behavior such as shaking the head sideways is set according to the estimation result (for example, emotion type, emotion strength, etc.).

なお、本実施形態では、演出設定時に使用者に容易に演出の設定を行わせるための演出設定入力フォーム６０を有している。演出設定入力フォーム６０は、具体的には、タグ表示領域６１、各種設定領域６２と、ボタン領域６３とを有するように構成されている。 In the present embodiment, there is an effect setting input form 60 for allowing the user to easily set the effect when setting the effect. The effect setting input form 60 is specifically configured to have a tag display area 61, various setting areas 62, and a button area 63.

タグ表示領域６１は、予め設定される感情推定手段１６により得られる感情情報に含まれる感情の種別（平常、怒り、喜び、悲しみ等）、及び感情の強さ（レベル１、レベル２、レベル３等）毎に各種設定領域６２を表示できるようにタグが形成されている。また、各種設定領域６２には、それぞれの感情の種別及びレベル毎に、例えば「表情（種類、程度）」、「顔の向き（縦、横）」、「うなずき（する／しない、回数、スピード）」、「首ふり（する／しない、回数、スピード、振りの程度）」、「使用者設定（ポーズ名、スピード等）」等のうち、少なくとも１つを設定できるようになっている。 In the tag display area 61, tags are formed so that the various setting areas 62 can be displayed for each type of emotion (normal, anger, joy, sadness, and so on) and each strength of emotion (level 1, level 2, level 3, and so on) contained in the emotion information obtained by the emotion estimation means 16. In the various setting areas 62, at least one of the following can be set for each emotion type and level: "expression (type, degree)", "face orientation (vertical, horizontal)", "nodding (on/off, number of times, speed)", "head shaking (on/off, number of times, speed, extent of the shake)", "user setting (pose name, speed, and so on)", and the like.

また、設定後は、ボタン領域６３にある「設定ボタン」を押すことにより設定が行われ、例えば図５に示すような各感情の種別及びレベルによるＣＧキャラクタ演出ＤＢが蓄積される。また、ボタン領域６３において、「キャンセルボタン」を押すことによりそれまでの設定がキャンセルされる。 After the setting, the setting is performed by pressing the “setting button” in the button area 63, and for example, a CG character effect DB according to the type and level of each emotion as shown in FIG. 5 is accumulated. In the button area 63, pressing the “cancel button” cancels the setting so far.

また、図５には、設定内容の一例として、各感情情報に応じて付加する振る舞い（演出）の内容とその程度が示されている。具体的には、それぞれ感情の種別及び強さに応じて表情、表情程度、顔の位置、動作等が設定されている。つまり、この設定内容に基づいて演出用のスクリプトが生成される。 Further, FIG. 5 shows the contents and the degree of the behavior (effect) added according to each emotion information as an example of the setting contents. Specifically, expressions, expression levels, face positions, actions, etc. are set according to the type and strength of emotion. That is, an effect script is generated based on the set content.
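One way to picture settings of this kind is a table keyed by emotion type and level. The Python sketch below is only an illustration: the class, field, and function names are hypothetical, the face angle and nod in the "joy" level 2 entry echo the FIG. 6 example given later in the description, and every range, the second entry, and the fallback are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    expression: str        # expression type
    degree_range: tuple    # (lower, upper) limits of the expression degree
    face_angle_deg: float  # positive values turn the face upward
    motions: tuple         # candidate motions such as "nod"


# Hypothetical table keyed by (emotion type, level); values are illustrative.
EFFECT_DB = {
    ("joy", 2): Effect("happy", (0.5, 0.7), 7.0, ("nod",)),
    ("sadness", 1): Effect("sad", (0.1, 0.3), -5.0, ()),
}

# Assumed fallback when nothing has been set for an emotion and level.
NO_EFFECT = Effect("normal", (0.0, 0.0), 0.0, ())


def look_up_effect(emotion, level):
    """Return the effect settings stored for an emotion type and level."""
    return EFFECT_DB.get((emotion, level), NO_EFFECT)
```

Keying on the pair keeps one entry per tag of the input form, which is how the form in FIG. 4 is described as being organized.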

なお、図５において、例えば「うなずく」、「大きくうなずく」、「首を横に振る」については、その動作をする又はしないを乱数等によりランダムに決定させるようにすることができる。また、例えば、感情の種別が「喜び」である場合には、どのレベル（レベル１、レベル２、レベル３、・・・）においても「表情程度」に所定の範囲が設定されており、その範囲内で乱数等によりランダムに値が決定させるようにすることができる。 In FIG. 5, for motions such as "nod", "nod deeply", and "shake the head", whether or not to perform the motion can be decided at random, for example by a random number. Likewise, when the emotion type is "joy", for example, a predetermined range is set for the "expression degree" at every level (level 1, level 2, level 3, ...), and the value can be decided at random within that range, for example by a random number.

なお、上述したように、感情情報に応じて付加する振る舞いは、例えば表情とその程度、顔を上又は下に向ける角度、うなずき、首を横に振る動作であるが本発明においてはこれに限定されるものではない。また、表情の程度については、感情のレベル毎に下限値と上限値を設定し、その中から毎回ランダムに値を決定することができる。また、うなずき、及び首を横に振る動作についても、その動作を付加する／しないは毎回ランダムに決定することができる。これにより、ＣＧキャラクタに対して、ある特定の動作だけでなく、ある程度の範囲の中で汎用的に動作を行わせることができる。これにより、使用者等に飽きさせない演出を行うことができる。 As described above, the behavior added according to the emotion information is, for example, an expression and its degree, an angle for turning the face up or down, a nod, and an action of shaking the neck sideways, but the present invention is not limited to this. Is not to be done. As for the degree of facial expression, a lower limit value and an upper limit value can be set for each emotion level, and the value can be randomly determined each time. In addition, nodding and shaking motion of the head can be determined randomly each time whether or not to add the motion. As a result, the CG character can be caused to perform a general purpose motion within a certain range as well as a specific motion. Thereby, it is possible to perform an effect that does not make the user bored.
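The random selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a uniform distribution is assumed for the expression degree, and the default probability of 0.5 for adding a motion is an assumed value.

```python
import random


def choose_degree(lower, upper, rng=random):
    """Pick an expression degree at random between the configured limits."""
    return rng.uniform(lower, upper)


def choose_optional_motions(candidates, probability=0.5, rng=random):
    """Decide independently, for each candidate motion, whether to add it."""
    return [motion for motion in candidates if rng.random() < probability]
```

For example, `choose_degree(0.5, 0.7)` returns a different value in that range on each call, so the same emotion level does not always produce an identical expression. Passing a seeded `random.Random` instance as `rng` makes a run repeatable.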

また、図５に示す演出内容以外にも、例えば番組におけるスタジオの照明やカメラワーク、スイッチング等の仮想空間上の演出対象物について設定することができる。これにより、高演出のコンテンツを提供することができる。 In addition to the effects shown in FIG. 5, it is possible to set effects objects in a virtual space such as studio lighting, camera work, and switching in a program. As a result, it is possible to provide high performance content.

また、図２における演出処理では、感情推定処理の出力を用いて、蓄積手段１３に蓄積された図５に示すようなＣＧキャラクタ演出ＤＢから感情に適したＣＧキャラクタの表情やジェスチャー等の振る舞いを選択し、それに対応するＴＶＭＬスクリプトを生成して、スクリプト生成手段１４のスクリプト置換・付加処理３２に出力する。 In the effect process in FIG. 2, the output of the emotion estimation process is used to select behavior suited to the emotion, such as a CG character facial expression or gesture, from the CG character effect DB stored in the storage means 13 as shown in FIG. 5. The corresponding TVML script is then generated and output to the script replacement/addition processing 32 of the script generation means 14.

更に、スクリプト置換・付加処理３２では、演出手段１７により得られたＣＧキャラクタの演出情報に基づいてスクリプトの置換や付加を行う。また、スクリプト置換・付加処理３２は、上述した録音処理により得られた肉声を入力し、肉声を利用したＴＶＭＬスクリプトを生成し、再生手段１８としてのＴＶＭＬプレイヤーにより再生され、コンテンツ３４が出力される。 Further, in the script replacement/addition processing 32, scripts are replaced or added on the basis of the CG character effect information obtained by the effect means 17. The script replacement/addition processing 32 also takes the real voice obtained by the recording process described above and generates a TVML script that uses the real voice; this script is played by the TVML player serving as the playback means 18, and the content 34 is output.

＜スクリプト生成例＞ <Script generation example>
ここで、本実施形態により生成されるスクリプト例について、図を用いて説明する。図６は、本実施形態により生成されるスクリプトの一例を示す図である。なお、図６（ａ）は、元となるＴＶＭＬスクリプトの一例を示し、図６（ｂ）は、本実施形態により生成された後のＴＶＭＬスクリプトの一例を示す図である。なお、図６（ａ）、（ｂ）の左側には、便宜上行番号を付している。 Here, an example of a script generated according to the present embodiment will be described with reference to the drawings. FIG. 6 is a diagram illustrating an example of a script generated according to the present embodiment. FIG. 6A shows an example of the original TVML script, and FIG. 6B shows an example of the TVML script after generation according to the present embodiment. Line numbers are shown on the left side of FIGS. 6A and 6B for convenience.

本実施形態におけるスクリプト生成では、まず図６（ａ）に示すように、合成音声によりＣＧキャラクタに喋らせるためのスクリプトに含まれる所定のコマンド（例えば、ｔａｌｋコマンド（図６（ａ）における（０５）行目））を検出すると、そこまでのスクリプトがＴＶＭＬプレイヤー上に表示され、ＣＧキャラクタを停止させる。 In script generation according to the present embodiment, as shown in FIG. 6A, when a predetermined command contained in a script for making the CG character speak with synthesized speech is detected (for example, the talk command on line (05) of FIG. 6A), the script up to that point is displayed on the TVML player and the CG character is stopped.

そこで、使用者等は、合成音声と同じセリフを録音する。なお、録音された肉声は、ユニークなファイル名をつけて保存する（例えば、図６の例では、ｖｏｉｃｅ００１．ｗａｖ）。 Therefore, the user or the like records the same speech as the synthesized speech. The recorded real voice is stored with a unique file name (for example, voice001.wav in the example of FIG. 6).

次に、録音された肉声に基づく感情推定を行い、感情の種別とレベルを出力する（図６の例では、感情は「喜び（ｈａｐｐｙ）」で、レベルは「２」とする）。 Next, emotion estimation based on the recorded real voice is performed, and the type and level of emotion are output (in the example of FIG. 6, the emotion is “happy” and the level is “2”).

次に、感情情報に基づいて、ＣＧキャラクタの表情と程度を決定するＴＶＭＬスクリプトを生成する。（図６の例では、表情はｈａｐｐｙで、程度は０．６）。また、感情情報に基づいて、ＣＧキャラクタの振る舞いを決定し、そのＴＶＭＬスクリプトを生成する（図６の例では、顔の角度は７度上を向き、うなずく動作（する・しないをランダムに選択した場合で０．５）とする）。 Next, a TVML script that sets the CG character's expression and its degree is generated on the basis of the emotion information (in the example of FIG. 6, the expression is happy and the degree is 0.6). The behavior of the CG character is likewise determined from the emotion information and the corresponding TVML script is generated (in the example of FIG. 6, the face is turned 7 degrees upward and a nodding motion is added, with 0.5 given for the case where whether to nod is selected at random).

更に、図６（ａ）に示す元のスクリプトの（０５）行目の箇所を、図６（ｂ）に示すよう（０５）〜（０８）行目に示すように置換・付加する。そして、その結果をＴＶＭＬスクリプトとして出力する。 Further, line (05) of the original script shown in FIG. 6A is replaced and extended into lines (05) to (08) as shown in FIG. 6B. The result is then output as a TVML script.
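The replacement of one line by several can be sketched as a list operation in Python. The strings below are placeholders and are not TVML syntax, since the contents of FIG. 6 are not reproduced in the text. The function name and the split into three effect lines plus one real-voice line are assumptions that happen to match the four lines (05) to (08).

```python
def replace_talk_line(script_lines, talk_index, effect_lines, voice_line):
    """Return a new script in which the line at talk_index is replaced by the
    effect lines followed by the real-voice line."""
    return (
        list(script_lines[:talk_index])
        + list(effect_lines)
        + [voice_line]
        + list(script_lines[talk_index + 1:])
    )


# Placeholder strings only; they are not TVML syntax.
original = ["(01)", "(02)", "(03)", "(04)", "<talk with synthesized speech>", "(06)"]
effects = ["<expression happy, degree 0.6>", "<face 7 degrees upward>", "<nod>"]
generated = replace_talk_line(original, 4, effects, "<talk using voice001.wav>")
```

Here `generated` holds nine entries: the four unchanged leading lines, the four new lines in place of the old line (05), and the remaining original line. The input list is left unmodified.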

なお、上述したコンテンツ生成は、入力されるＴＶＭＬスクリプトの最終行まで行い、最終的にメモリ等に一時的に蓄積された肉声利用のＴＶＭＬスクリプトが出力される。 Note that the above-described content generation is performed up to the last line of the input TVML script, and finally the TVML script using real voice temporarily stored in a memory or the like is output.

上述したように、コンテンツ生成装置１０の機能構成により、肉声データから得られる感情情報に基づいて、その肉声に最適な演出を付加したコンテンツを生成することができる。具体的には、合成音声を利用して、セリフを変えながら、番組全体を一旦作成することができ、作成した番組を再生させながら、再生箇所毎に肉声を録音することができる。また、肉声を録音するだけで、録音した肉声の感情に合わせたＣＧキャラクタの振る舞いが自動的に付加され、適切な演出の番組が自動的に生成できる。 As described above, the functional configuration of the content generation device 10 makes it possible to generate content to which the effect best suited to a real voice has been added, based on emotion information obtained from the real voice data. Specifically, an entire program can first be created using synthesized speech while the lines are revised, and the real voice can then be recorded for each playback position while the created program is played. Moreover, simply recording the real voice automatically adds CG character behavior that matches the emotion of the recorded voice, so that a program with appropriate effects is generated automatically.

＜コンテンツ生成プログラム＞ <Content generation program>
ここで、上述したコンテンツ生成装置１０は、上述した専用の装置構成により本発明におけるコンテンツの生成を行うこともできるが、各構成における処理をコンピュータに実行させるための実行プログラムを生成し、例えば、汎用のパーソナルコンピュータやサーバ等にプログラムをインストールすることにより、コンテンツ生成処理を実現することができる。 The content generation device 10 described above can generate content according to the present invention with the dedicated device configuration described above. Alternatively, the content generation process can be realized by creating an execution program that causes a computer to execute the processing of each component and installing that program on, for example, a general-purpose personal computer or server.

＜ハードウェア構成＞ <Hardware configuration>
ここで、本発明における実行可能なコンピュータのハードウェア構成例について図を用いて説明する。図７は、本発明におけるコンテンツ生成処理が実現可能なハードウェア構成の一例を示す図である。 Here, an example of the hardware configuration of a computer capable of executing the present invention is described with reference to the drawings. FIG. 7 is a diagram illustrating an example of a hardware configuration capable of realizing the content generation processing according to the present invention.

図７におけるコンピュータ本体には、入力装置７１と、出力装置７２と、ドライブ装置７３と、補助記憶装置７４と、メモリ装置７５と、各種制御を行うＣＰＵ（ＣｅｎｔｒａｌＰｒｏｃｅｓｓｉｎｇＵｎｉｔ）７６と、ネットワーク接続装置７７とを有するよう構成されており、これらはシステムバスＢで相互に接続されている。 7 includes an input device 71, an output device 72, a drive device 73, an auxiliary storage device 74, a memory device 75, a CPU (Central Processing Unit) 76 for performing various controls, and a network connection device. 77, which are connected to each other by a system bus B.

入力装置７１は、使用者が操作するキーボード及びマウス等のポインティングデバイスやマイク等の音声入力デバイス等を有しており、使用者からのプログラムの実行等、各種操作信号を入力する。出力装置７２は、本発明における処理を行うためのコンピュータ本体を操作するのに必要な各種ウィンドウやデータ等を表示するディスプレイや音声を出力するスピーカ等を有し、ＣＰＵ７６が有する制御プログラムによりプログラムの実行経過や結果等を表示又は音声出力することができる。 The input device 71 has a keyboard and a pointing device such as a mouse operated by the user, an audio input device such as a microphone, and the like, and inputs various operation signals from the user, such as an instruction to execute a program. The output device 72 has a display that shows the various windows, data, and the like needed to operate the computer that performs the processing of the present invention, a speaker that outputs sound, and the like, and can display or audibly output the progress and results of program execution under the control program of the CPU 76.

ここで、本発明において、コンピュータ本体にインストールされる実行プログラムは、例えばＣＤ−ＲＯＭ等の記録媒体７８等により提供される。プログラムを記録した記録媒体７８は、ドライブ装置７３にセット可能であり、記録媒体７８に含まれる実行プログラムが、記録媒体７８からドライブ装置７３を介して補助記憶装置７４にインストールされる。 In the present invention, the execution program installed in the computer main body is provided by a recording medium 78 such as a CD-ROM. The recording medium 78 on which the program is recorded can be set in the drive device 73, and the execution program included in the recording medium 78 is installed from the recording medium 78 to the auxiliary storage device 74 via the drive device 73.

補助記憶装置７４は、ハードディスク等のストレージ手段であり、本発明における実行プログラムや、コンピュータに設けられた制御プログラム等を蓄積し必要に応じて入出力を行うことができる。 The auxiliary storage device 74 is a storage means such as a hard disk, and can store an execution program according to the present invention, a control program provided in a computer, etc., and perform input / output as necessary.

メモリ装置７５は、ＣＰＵ７６により補助記憶装置７４から読み出された実行プログラム等を格納する。なお、メモリ装置７５は、ＲＯＭ（ＲｅａｄＯｎｌｙＭｅｍｏｒｙ）やＲＡＭ（ＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）等からなる。 The memory device 75 stores an execution program or the like read from the auxiliary storage device 74 by the CPU 76. The memory device 75 includes a ROM (Read Only Memory), a RAM (Random Access Memory), and the like.

ＣＰＵ７６は、ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ）等の制御プログラム、メモリ装置７５に格納されている実行プログラムに基づいて、各種演算や各ハードウェア構成部とのデータの入出力等、コンピュータ全体の処理を制御して各処理を実現することができる。また、ＣＰＵ７６は、プログラムの実行中に必要な各種情報を補助記憶装置７４から取得することができ、またＣＰＵ７６は、処理結果等を格納することもできる。 The CPU 76 controls processing of the entire computer, such as various operations and input / output of data with each hardware component, based on a control program such as OS (Operating System) and an execution program stored in the memory device 75. Each processing can be realized. Further, the CPU 76 can acquire various types of information necessary during execution of the program from the auxiliary storage device 74, and the CPU 76 can also store processing results and the like.

ネットワーク接続装置７７は、通信ネットワーク等と接続することにより、実行プログラムを通信ネットワークに接続されている他の端末等から取得したり、プログラムを実行することで得られた実行結果又は本発明における実行プログラム自体を他の端末等に提供することができる。 By connecting to a communication network or the like, the network connection device 77 can obtain the execution program from another terminal or the like connected to the network, and can provide other terminals or the like with the results obtained by executing the program or with the execution program of the present invention itself.

上述したようなハードウェア構成により、特別な装置構成を必要とせず、低コストで効率的にコンテンツ生成処理を実現することができる。また、プログラムをインストールすることにより、コンテンツ生成処理を容易に実現することができる。 With the hardware configuration as described above, a content generation process can be realized efficiently at a low cost without requiring a special device configuration. Further, the content generation process can be easily realized by installing the program.

＜コンテンツ生成処理＞ <Content generation processing>
次に、本発明における実行プログラムによるコンテンツ生成処理手順についてフローチャートを用いて説明する。なお、以下の処理の説明では、コンテンツの一例として「番組」を用いているが本発明においては特に限定されるものではない。また、以下に示す予め生成されたＴＶＭＬスクリプトには、合成音声を出力するスクリプトを含むものとする。 Next, a content generation processing procedure by the execution program in the present invention will be described using a flowchart. In the following description of the processing, “program” is used as an example of content, but the present invention is not particularly limited. In addition, the TVML script generated in advance shown below includes a script for outputting synthesized speech.

＜コンテンツ生成処理手順＞ <Content generation processing procedure>
図８は、本実施形態におけるコンテンツ生成処理手順の一例を示すフローチャートである。図８において、まず、既存のＴＶＭＬスクリプトを文字列として１行読み込む（Ｓ０１）。 FIG. 8 is a flowchart showing an example of a content generation processing procedure in the present embodiment. In FIG. 8, first, one line of an existing TVML script is read as a character string (S01).

ここで、最終行までの処理が終了したか否かを判断し（Ｓ０２）、終了していない場合（Ｓ０２において、ＮＯ）、その読み込んだ行が所定のスクリプトであるか否かを判断する（Ｓ０３）。ここで、所定のスクリプトとは、例えば合成音声により生成された所定のＣＧキャラクタを喋らせるスクリプトのコマンド（例えば、“ｔａｌｋ”等）を含んでいるか否かにより判断する。 It is then determined whether processing has finished through the last line (S02). If not (NO in S02), it is determined whether the line that was read is a predetermined script (S03). Whether a line is a predetermined script is judged, for example, by whether it contains a script command (for example, "talk") that makes a predetermined CG character speak with synthesized speech.

Ｓ０３の処理において、所定のスクリプトでない場合（Ｓ０３において、ＮＯ）、バッファ（メモリ）等にその読み込んだＴＶＭＬスクリプトを追加し（Ｓ０４）、Ｓ０１の処理に戻り、次のスクリプトを読み込んで以降の処理を行う。 If the line is not a predetermined script in S03 (NO in S03), the TVML script that was read is added to a buffer (memory) or the like (S04), and the process returns to S01, where the next script line is read and the subsequent processing is performed.

また、Ｓ０３の処理において、読み込んだスクリプト行が所定のスクリプトである場合（Ｓ０３において、ＹＥＳ）、読み込んだＴＶＭＬスクリプトをバッファの最後に追加して、バッファに存在するＴＶＭＬスクリプトをＴＶＭＬプレイヤーで再生させる（Ｓ０５）。なお、上述のＴＶＭＬスクリプトによる再生処理では、ＣＧキャラクタを既存の合成音声で喋らせる直前で再生が停止させる。 If the script line that was read is a predetermined script in S03 (YES in S03), the TVML script that was read is appended to the end of the buffer, and the TVML script in the buffer is played by the TVML player (S05). In this playback, playback is stopped immediately before the CG character would speak with the existing synthesized speech.

次に、合成音声により喋るセリフと同一のセリフを使用者等が喋り、その肉声をマイク等の音声入力手段により録音する録音処理を行う（Ｓ０６）。なお、録音処理では肉声をＷＡＶＥファイル等のファイル形式で蓄積する。 Next, a recording process is performed in which a user or the like speaks the same speech spoken by synthesized speech, and the real voice is recorded by speech input means such as a microphone (S06). In the recording process, the real voice is stored in a file format such as a WAVE file.

次に、録音したセリフの肉声データを、例えば感情推定エンジン（ＳＴ）に入力し、ＳＴによりＷＡＶＥファイルを解析し、肉声の感情の種類（平常、怒り、喜び、悲しみ）と、その強さを出力する感情推定処理を行う（Ｓ０７）。 Next, the recorded voice data of the speech is input to, for example, the emotion estimation engine (ST), and the WAVE file is analyzed by ST, and the type of emotion (normal, angry, joyful, sadness) of the voice and the strength thereof are determined. The emotion estimation process to be output is performed (S07).

また、Ｓ０７の処理により得られる感情推定結果に基づいて、感情に適したＣＧキャラクタの表情やジェスチャー等の振る舞いを表現する演出内容を選択し、その選択した演出に内容に対応するＴＶＭＬスクリプトを生成する演出処理を行う（Ｓ０８）。 Also, based on the emotion estimation result obtained by the processing of S07, the production contents expressing the behavior of the CG character suitable for the emotion, such as facial expressions and gestures, are selected, and a TVML script corresponding to the contents of the selected production is generated. An effect process is performed (S08).

次に、上述したＳ０８の演出処理により得られるＴＶＭＬスクリプトを元のＴＶＭＬスクリプトの合成音声でＣＧキャラクタを喋らせる指示に対応するＴＶＭＬスクリプトの直前に組み込み、また、合成音声から録音した肉声でＣＧキャラクタが喋るように、対応するＴＶＭＬスクリプトの置換処理を行う（Ｓ０９）。 Next, the TVML script obtained by the effect process of S08 is inserted immediately before the TVML script, in the original TVML script, that corresponds to the instruction to make the CG character speak with synthesized speech, and that TVML script is replaced so that the CG character speaks with the recorded real voice instead of the synthesized speech (S09).

また、Ｓ０９の処理が終了後、再生用のバッファに蓄積されたデータをクリアし（Ｓ１０）、Ｓ０１に戻り、中断していた元のＴＶＭＬスクリプトを中断箇所の次の行から読み込み上述した各処理を行う。 After S09 is finished, the data accumulated in the playback buffer is cleared (S10), the process returns to S01, and the original TVML script, which had been suspended, is read from the line following the point of suspension and the processing described above is performed.

また、Ｓ０２の処理において、最終行の処理が完了した場合（Ｓ０２において、ＹＥＳ）、新たに生成されたＴＶＭＬスクリプトを出力して処理を終了する（Ｓ１１）。 In the process of S02, when the process on the last line is completed (YES in S02), the newly generated TVML script is output and the process is terminated (S11).
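The flow from S01 to S11 can be summarized as a loop. The Python sketch below is an illustration under stated assumptions rather than the actual implementation: playback, recording, emotion estimation, effect generation, and real-voice line generation are passed in as caller-supplied functions with hypothetical names, a substring test for "talk" stands in for command detection, and the generated script is kept in a list separate from the playback buffer.

```python
def generate_script(lines, play, record, estimate, make_effects, to_voice_line,
                    is_talk=lambda line: "talk" in line):
    """Walk the original script line by line and return the generated script."""
    output = []   # newly generated script (assumed to be kept separately)
    buffer = []   # lines waiting to be played back
    for line in lines:                            # S01, S02: read until the last line
        buffer.append(line)                       # S04, or first half of S05
        if not is_talk(line):                     # S03: not a speech command
            output.append(line)
            continue
        play(list(buffer))                        # S05: play up to the speech
        voice_file = record(line)                 # S06: record the real voice
        emotion, level = estimate(voice_file)     # S07: estimate emotion
        output.extend(make_effects(emotion, level))     # S08, S09: add effect lines
        output.append(to_voice_line(line, voice_file))  # S09: swap in the real voice
        buffer.clear()                            # S10: clear the playback buffer
    return output                                 # S11: output the new script
```

Passing the processing steps in as functions keeps the loop testable without a TVML player, a microphone, or an emotion estimation engine.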

上述したように、本発明における実行プログラムを用いて肉声データから得られる感情情報に基づいて、その肉声に最適な演出を付加したコンテンツを容易に生成することができる。 As described above, based on the emotion information obtained from the real voice data using the execution program of the present invention, it is possible to easily generate the content with the optimal performance added to the real voice.

上述したように本発明によれば、肉声データから得られる感情情報に基づいて、その肉声に最適な演出を付加したコンテンツを生成することができる。具体的には、合成音声を利用して、セリフを変えながら、番組全体を一旦作成することができる。また、作成した番組を再生させながら、再生箇所毎に肉声を録音することができる。また、肉声を録音するだけで、録音した肉声の感情に合わせたＣＧキャラクタの振る舞いが自動的に付加され、適切な演出の番組が自動的に生成できる。 As described above, according to the present invention, based on emotion information obtained from real voice data, it is possible to generate content in which an optimal presentation is added to the real voice. Specifically, the entire program can be created once while changing speech using synthesized speech. In addition, a real voice can be recorded for each playback location while the created program is played back. In addition, by simply recording the real voice, the behavior of the CG character in accordance with the recorded emotion of the real voice is automatically added, and a program with an appropriate performance can be automatically generated.

なお、本発明における演出内容は、上述したようにＣＧキャラクタに対する振る舞い等に限定されるものではなく、例えばスタジオのカメラワークや照明等、仮想空間上の演出対象物に対する演出であればよい。 It should be noted that the content of the effect in the present invention is not limited to the behavior for the CG character as described above, and may be an effect for the effect object in the virtual space such as studio camera work or illumination.

また、本発明におけるコンテンツ生成処理により生成されたコンテンツは、例えば使用者が自分のチャンネルに自分の作成した台本をアップロードして番組としてインターネット上に公開するといった、いわゆるブログテレビとして使用したり、ニュース番組を自動生成したり、コミュニティ番組の制作、教育ツール、広告ツールとして広く適用することができる。 Content generated by the content generation process of the present invention can be applied widely: for example, as so-called blog TV, in which a user uploads a script he or she wrote to his or her own channel and publishes it on the Internet as a program; for automatic generation of news programs; for production of community programs; and as an educational or advertising tool.

以上本発明の好ましい実施形態について詳述したが、本発明は係る特定の実施形態に限定されるものではなく、特許請求の範囲に記載された本発明の要旨の範囲内において、種々の変形、変更が可能である。 Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to those specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention set forth in the claims.

コンテンツ生成装置の一構成例を示す図である。 It is a diagram showing a configuration example of the content generation device.
コンテンツ生成処理の概要の一例を示す図である。 It is a diagram showing an example of an outline of the content generation process.
録音処理の内容を説明するための一例を示す図である。 It is a diagram showing an example for explaining the contents of the recording process.
演出設定の内容を説明するための一例の図である。 It is a diagram showing an example for explaining the contents of the effect setting.
図４に対応する演出設定入力フォームにより設定された演出内容の一例を示す図である。 It is a diagram showing an example of the effect contents set through the effect setting input form corresponding to FIG. 4.
本実施形態により生成されるスクリプトの一例を示す図である。 It is a diagram showing an example of a script generated according to the present embodiment.
本発明におけるコンテンツ生成処理が実現可能なハードウェア構成の一例を示す図である。 It is a diagram showing an example of a hardware configuration capable of realizing the content generation process of the present invention.
本実施形態におけるコンテンツ生成処理手順の一例を示すフローチャートである。 It is a flowchart showing an example of the content generation processing procedure in the present embodiment.

Explanation of symbols

１０コンテンツ生成装置 10 Content generation device
１１入力手段 11 Input means
１２出力手段 12 Output means
１３蓄積手段 13 Storage means
１４スクリプト生成手段 14 Script generation means
１５録音処理手段 15 Recording processing means
１６感情推定手段 16 Emotion estimation means
１７演出手段 17 Effect means
１８再生手段 18 Playback means
１９送受信手段 19 Transmission/reception means
２０制御手段 20 Control means
３１スクリプト読出処理 31 Script reading processing
３２スクリプト置換・付加処理 32 Script replacement/addition processing
３３使用者 33 User
３４コンテンツ 34 Content
４０音声入力フォーム 40 Voice input form
４１メニュー表示領域 41 Menu display area
４２録音編集領域 42 Recording/editing area
４３，６３ボタン領域 43, 63 Button area
４４録音ボタン 44 Record button
４５ステータス表示領域 45 Status display area
５０再生画面 50 Playback screen
５１ＣＧキャラクタ 51 CG character
６０演出設定入力フォーム 60 Effect setting input form
６１タグ表示領域 61 Tag display area
６２各種設定領域 62 Various setting areas
７１入力装置 71 Input device
７２出力装置 72 Output device
７３ドライブ装置 73 Drive device
７４補助記憶装置 74 Auxiliary storage device
７５メモリ装置 75 Memory device
７６ＣＰＵ 76 CPU
７７ネットワーク接続装置 77 Network connection device
７８記録媒体 78 Recording medium

Claims

A content generation device that generates content in which a production object performs a predetermined effect set on the basis of emotion information obtained from input real voice data, the device comprising:
reading means for reading a script of previously generated content in predetermined units;
recording means for recording a user's real voice;
emotion estimation means for estimating an emotion from the real voice data recorded by the recording means;
production means for extracting preset production information corresponding to emotion information, consisting of an emotion type and strength, obtained by the emotion estimation means; and
script generation means for substituting a script generated on the basis of the production information obtained by the production means for the script read by the reading means.

The content generation device according to claim 1, wherein the production means has, as the production information, information on an expression or a motion of the production object.

The content generation device according to claim 2, wherein the production means, when setting the production information, sets a lower limit value and an upper limit value for the expression or motion of the production object for each emotion type and strength, and, when extracting the production information, randomly determines a value between the lower limit value and the upper limit value.
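
The random determination described in this claim can be illustrated with a minimal Python sketch. It is not the patented implementation: the table `PRODUCTION_SETTINGS`, the emotion labels, the 0 to 100 scale, and the function `pick_expression_value` are all hypothetical names and values chosen for illustration, and a uniform draw is assumed because the claim says only "randomly".

```python
import random

# Hypothetical settings table (not from the patent):
# (emotion type, strength level) -> (lower limit, upper limit)
# for one expression parameter, e.g. smile intensity on a 0-100 scale.
PRODUCTION_SETTINGS = {
    ("joy", 1): (10, 30),
    ("joy", 2): (40, 70),
    ("anger", 1): (5, 20),
}


def pick_expression_value(emotion, strength, rng=random):
    """Draw a value between the limits configured for this emotion.

    A uniform distribution is an assumption made for this sketch.
    """
    lower, upper = PRODUCTION_SETTINGS[(emotion, strength)]
    return rng.uniform(lower, upper)
```

Because the value is drawn anew each time, the same emotion type and strength would yield slightly different expressions or motions on each extraction.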

The content generation device according to any one of claims 1 to 3, further comprising reproduction means for reproducing the previously generated content so that the recording means can record the user's real voice, and for stopping the content immediately before the real voice is recorded.

A content generation program for causing a computer to execute content generation processing that generates content in which a production object performs a predetermined effect set on the basis of emotion information obtained from input real voice data, the program causing the computer to execute:
a reading process for reading a script of previously generated content in predetermined units;
a recording process for recording a user's real voice;
an emotion estimation process for estimating an emotion from the real voice data recorded by the recording process;
a production process for extracting preset production information corresponding to emotion information, consisting of an emotion type and strength, obtained by the emotion estimation process; and
a script generation process for substituting a script generated on the basis of the production information obtained by the production process for the script read by the reading process.
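
The order of the five claimed processes can be sketched in Python as follows. This only illustrates the control flow under stated assumptions. The recorder and the emotion estimator are stand-in stubs. The `Emotion` class, the `SETTINGS` table, the dictionary output format, and all function names are hypothetical and do not come from the patent.

```python
import random
from dataclasses import dataclass


@dataclass
class Emotion:
    kind: str      # emotion type, e.g. "joy" (hypothetical label)
    strength: int  # strength level (hypothetical scale)


# Hypothetical preset production information keyed by (type, strength).
SETTINGS = {("joy", 2): {"expression": "smile", "range": (40, 70)}}


def record_voice(unit):
    """Stand-in for the recording process; returns placeholder bytes."""
    return b"fake-audio-for:" + unit.encode()


def estimate_emotion(audio):
    """Stand-in for emotion estimation; always reports the same result."""
    return Emotion("joy", 2)


def extract_production(emotion, rng):
    """Look up the preset production information for the estimated emotion."""
    setting = SETTINGS[(emotion.kind, emotion.strength)]
    low, high = setting["range"]
    return {"expression": setting["expression"], "level": rng.randint(low, high)}


def generate_content(script_units, rng=None):
    """Run the five processes in order for each script unit."""
    if rng is None:
        rng = random.Random()
    output = []
    for unit in script_units:                          # reading process
        audio = record_voice(unit)                     # recording process
        emotion = estimate_emotion(audio)              # emotion estimation process
        production = extract_production(emotion, rng)  # production process
        # Script generation process: the generated entry takes the place
        # of the unit that was read.
        output.append({"text": unit, "voice": audio, **production})
    return output
```

Calling `generate_content(["Hello.", "Goodbye."])` returns one generated entry per script unit, each carrying the recorded voice and the production information selected for it.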