JP6993034B1 - Content playback method and content playback system

Info

Publication number: JP6993034B1
Application number: JP2021082702A
Authority: JP
Inventors: 継河合
Original assignee: Ａｉインフルエンサー株式会社
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2022-01-13
Anticipated expiration: 2041-05-14
Also published as: JP2022175923A

Abstract

PROBLEM TO BE SOLVED: To provide a content reproduction method and a content reproduction system that make it possible to express a user's appearance with the appearance of a character the user prefers. SOLUTION: A content reproduction method that causes a computer to execute: an acquisition step of acquiring face image data including a character's face and emotion data indicating an emotion; and a first generation step of generating first processed face image data for the face image data and emotion data acquired in the acquisition step, by referring to a first processing database generated by machine learning using a plurality of sets of first processing training data, each set being a data set that pairs first input data, which includes reference face image data and reference emotion data acquired in advance, with first output data, which includes first processed face image data that contains the face of the same character as the character in the reference face image data and differs from the reference face image data. [Selected drawing] FIG. 2

Description

The present invention relates to a content reproduction method and a content reproduction system.

In recent years, video distribution has created a need for technology that can express a user's appearance and voice with the appearance and voice of a character the user prefers, either to protect the distributor's privacy or to let the distributor stream with a preferred face and voice. Consequently, techniques for expressing a character's voice so that a conversation conducted by the character feels as natural as a conversation conducted by the user have attracted attention. The information processing system of Patent Document 1 is one known example.

In the technique described in Patent Document 1, a processor receives, via a user's client terminal, a selection signal that selects a specific character, transmits an utterance phrase of the specific character through a communication unit, and generates, based on a message received from the user, a converted message rendered in the voice of the specific character. The technique further generates an utterance phrase of the specific character that corresponds to the user's message, and returns the generated converted message and utterance phrase to the client terminal. Patent Document 1 thus describes an information processing system that can further enhance entertainment value by letting users experience the character themselves.

Japanese Unexamined Patent Publication No. 2021-39370

Patent Document 1 generates, based on a message received from the user, a converted message rendered in the voice of a specific character. However, Patent Document 1 does not contemplate reflecting the appearance of the specific character in the character's converted message. As a result, Patent Document 1 cannot present the user's distribution with the appearance the user prefers. A technique that can express the user's appearance with the appearance of the user's preferred character is therefore desired.

The present invention has been devised in view of the problems described above, and its object is to provide a content reproduction method and a content reproduction system that make it possible to express a user's appearance with the appearance of the user's preferred character.

The content reproduction method according to the first invention causes a computer to execute: an acquisition step of acquiring face image data including a character's face, emotion data indicating an emotion, and text data; a first generation step of generating first processed face image data for the face image data and emotion data acquired in the acquisition step, by referring to a first processing database generated by machine learning using a plurality of sets of first processing training data, each set pairing first input data, which includes reference face image data and reference emotion data acquired in advance, with first output data, which includes first processed face image data that contains the face of the same character as the character in the reference face image data and differs from the reference face image data; and a second generation step of generating, based on the first processed face image data generated in the first generation step and the text data acquired in the acquisition step, second processed face image data in which part of the first processed face image data is changed. The second generation step generates the second processed face image data based on the first processed face image data generated in the first generation step and the text data acquired in the acquisition step, by referring to a second processing database generated by machine learning using a plurality of sets of second processing training data, each set pairing second input data, which includes reference first processed face image data and reference text data acquired in advance, with second output data, which includes reference second processed face image data.

The content reproduction method according to the second invention is the method of the first invention, in which the acquisition step acquires voice quality data relating to voice quality, and the computer is further caused to execute: a voice processing step of generating voice data indicating the character's voice based on the voice quality data, the text data, and the emotion data acquired in the acquisition step; and an expression generation step of generating expression data indicating the character's expression based on the processed face image data generated in the generation step and the voice data generated in the voice processing step.

The content reproduction method according to the third invention is the method of the second invention, in which the voice processing step generates the voice data for the voice quality data, the text data, and the emotion data acquired in the acquisition step, by referring to a voice processing database generated by machine learning using a plurality of sets of voice processing training data, each set pairing third input data, which includes reference voice quality data, reference text data, and reference emotion data acquired in advance, with third output data, which includes reference voice data.

The content reproduction method according to the fourth invention is the method of any one of the first to third inventions, in which a response model indicating the correspondence between reference conversation text data acquired in advance and response data for the reference conversation text data is referred to, response data for conversation text data input by the user is determined, and the text data based on the determined response data is acquired.

The content reproduction system according to the fifth invention comprises: acquisition means for acquiring face image data including a character's face, emotion data indicating an emotion, and text data; first generation means for generating first processed face image data for the face image data and emotion data acquired by the acquisition means, by referring to a first processing database generated by machine learning using a plurality of sets of first processing training data, each set pairing first input data, which includes reference face image data and reference emotion data acquired in advance, with first output data, which includes first processed face image data that contains the face of the same character as the character in the reference face image data and differs from the reference face image data; and second generation means for generating, based on the first processed face image data generated by the first generation means and the text data acquired by the acquisition means, second processed face image data in which part of the first processed face image data is changed. The second generation means generates the second processed face image data based on the first processed face image data generated by the first generation means and the text data acquired by the acquisition means, by referring to a second processing database generated by machine learning using a plurality of sets of second processing training data, each set pairing second input data, which includes reference first processed face image data and reference text data acquired in advance, with second output data, which includes reference second processed face image data.

According to the first to fourth inventions, first processed face image data for the face image data and emotion data is generated by referring to the first processing database. This makes it possible to generate first processed face image data that reflects the user's emotion, so the user's appearance can be expressed with the appearance of the user's preferred character while reflecting the user's emotion. In addition, second processed face image data is generated based on the first processed face image data and the text data by referring to the second processing database. This makes it possible to generate highly accurate second processed face image data suited to the input text data, enabling a highly accurate character expression that matches the user's conversation.

In particular, according to the second invention, voice data is generated based on the voice quality data, the text data, and the emotion data, and expression data is generated based on the processed face image data and the voice data. Because voice data reflecting the user's emotion can be generated, the character can be expressed in a way that reflects the user's emotion.

In particular, according to the third invention, voice data is generated based on the voice quality data, the text data, and the emotion data by referring to a voice processing database generated by machine learning. Because highly accurate voice data reflecting the user's emotion can be generated, the character can be expressed in a way that reflects the user's emotion.

According to the fourth invention, the response model is referred to, response data for the conversation text data input by the user is determined, and text data based on the determined response data is acquired. Because response data for the conversation text data input by the user can be acquired automatically, the character can be expressed in a way that matches the user's conversation.
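The response-model lookup described for the fourth invention can be illustrated with a minimal sketch. The source does not specify how the correspondence is represented or matched, so the dictionary `RESPONSE_MODEL`, its entries, and the use of string similarity from Python's standard `difflib` module are all assumptions made for illustration only.

```python
import difflib
from typing import Dict

# Hypothetical response model: reference conversation text -> response data.
# The entries are invented for illustration and do not come from the patent.
RESPONSE_MODEL: Dict[str, str] = {
    "hello": "Hi there!",
    "how are you": "I am fine.",
}


def decide_response(conversation_text: str, model: Dict[str, str]) -> str:
    """Return the response paired with the most similar reference text.

    String similarity is an assumed matching rule; the patent only says that
    a response model holding the correspondence is referred to.
    """
    # cutoff=0.0 means the closest reference text is always returned.
    best = difflib.get_close_matches(conversation_text, list(model), n=1, cutoff=0.0)
    return model[best[0]]


def acquire_text_data(conversation_text: str) -> str:
    """Acquire the text data based on the determined response data."""
    return decide_response(conversation_text, RESPONSE_MODEL)
```

In this sketch, the string returned by `acquire_text_data` plays the role of the text data that the acquisition step passes on to the later generation steps.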

FIG. 1 is a schematic diagram showing an example of the content reproduction system according to the first embodiment.
FIG. 2 is a schematic diagram showing an example of the operation of the content reproduction system according to the first embodiment.
FIG. 3 is a schematic diagram showing an example of a learning method for the first processing database.
FIG. 4 is a schematic diagram showing an example of the association of the first processing database.
FIG. 5(a) is a schematic diagram showing an example of the configuration of the content reproduction device according to the first embodiment, and FIG. 5(b) is a schematic diagram showing an example of the functions of the content reproduction device according to the first embodiment.
FIG. 6 is a diagram showing an example of the operation of the content reproduction device according to the first embodiment.
FIG. 7 is a schematic diagram showing an example of the operation of the content reproduction system according to the second embodiment.
FIG. 8 is a schematic diagram showing an example of a learning method for the second processing database.
FIG. 9 is a schematic diagram showing an example of the association of the second processing database.
FIG. 10 is a diagram showing an example of the operation of the content reproduction device according to the second embodiment.
FIG. 11 is a schematic diagram showing an example of the operation of the content reproduction system according to the third embodiment.
FIG. 12 is a schematic diagram showing an example of a learning method for the first processing database in the third embodiment.
FIG. 13 is a schematic diagram showing an example of the association of the third processing database.
FIG. 14 is a diagram showing an example of the operation of the content reproduction device according to the third embodiment.
FIG. 15 is a diagram showing an example of the operation of the content reproduction device according to the fourth embodiment.
FIG. 16 is a schematic diagram showing an example of a learning method for the voice processing database.
FIG. 17 is a schematic diagram showing an example of the association of the voice processing database.
FIG. 18 is a diagram showing an example of the operation of the content reproduction device according to the fifth embodiment.
FIG. 19 is a schematic diagram showing an example of the operation of the content reproduction system according to the sixth embodiment.
FIG. 20 is a schematic diagram showing an example of a learning method for the first processing database in the sixth embodiment.
FIG. 21 is a schematic diagram showing an example of the association of the sixth processing database.
FIG. 22 is a diagram showing an example of the operation of the content reproduction device according to the sixth embodiment.

An example of a content reproduction system according to embodiments to which the present invention is applied will now be described with reference to the drawings.

(First Embodiment)
An example of the content reproduction system 100, the content reproduction device 1, and the learning method according to the first embodiment will be described with reference to the drawings. FIG. 1 is a schematic diagram showing an example of the content reproduction system 100 in the present embodiment. FIG. 2 is a schematic diagram showing an example of the operation of the content reproduction system 100 in the present embodiment.

<Content reproduction system 100>
The content reproduction system 100 is used to generate first processed face image data for arbitrary input face image data and emotion data. The content reproduction system 100 refers, for example, to a first processing database generated by machine learning using training data, and generates the first processed face image data for the face image data and the emotion data.

As shown in FIG. 1, for example, the content reproduction system 100 includes a content reproduction device 1. The content reproduction system 100 may also include, for example, at least one of a terminal 2 and a server 3. The content reproduction device 1 is connected to the terminal 2 and the server 3 via, for example, a communication network 4.

In the content reproduction system 100, as shown in FIG. 2, for example, the content reproduction device 1 acquires input data. The device then refers to the first processing database and generates first processed face image data for the input data.
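The flow of acquiring input data and then generating first processed face image data can be sketched as follows. This is only an illustrative outline: the `InputData` type, the nested-list image representation, and `dummy_first_processing_model` (a stand-in that brightens pixels in proportion to a "joy" score) are hypothetical, and the patent's actual first processing database is a trained model whose internals are not given here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for image data: rows of grayscale pixel values (0-255).
Image = List[List[int]]


@dataclass
class InputData:
    """Input data acquired by the content reproduction device (assumed shape)."""
    face_image: Image
    emotion: Dict[str, float]  # e.g. {"joy": 0.5}; scale 0.0-1.0 is an assumption


FirstProcessingModel = Callable[[Image, Dict[str, float]], Image]


def generate_first_processed_face_image(
    data: InputData, model: FirstProcessingModel
) -> Image:
    """Refer to the (stand-in) first processing model and generate the output."""
    return model(data.face_image, data.emotion)


def dummy_first_processing_model(face: Image, emotion: Dict[str, float]) -> Image:
    """Toy placeholder, NOT the patent's model: brighten pixels by the joy score."""
    gain = 1.0 + emotion.get("joy", 0.0)
    return [[min(255, int(pixel * gain)) for pixel in row] for row in face]


result = generate_first_processed_face_image(
    InputData(face_image=[[100, 200]], emotion={"joy": 0.5}),
    dummy_first_processing_model,
)
```

Passing the model in as a callable keeps the acquisition and generation steps separate, so a trained model could be substituted for the placeholder without changing the surrounding flow.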

The face image data is image data that includes a character's face. The face image data is used, for example, to generate the first processed face image data output by the content reproduction system 100. Image data is data that includes a collection of pixels. The face image data may be, for example, data extracted from a video, or may itself be video data. The reference face image data is the face image data used in the first processing training data, and may be in the same format as the face image data.

The face image data may be acquired, for example, via the communication network 4. The face image data may represent a face image captured with a known imaging device or the like, or may represent a synthetic face image generated by a known technique. The face image data may also be input by a user or the like via, for example, the content reproduction device 1.

The emotion data is data indicating an emotion. The emotion data may be, for example, text data indicating an emotion such as anger, joy, or sadness. The emotion data may also be text data or numerical data in which anger, joy, sadness, and the like are rated on a scale of three or more levels, such as a percentage.
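The two forms of emotion data described above (a text label, or graded scores such as percentages) can be brought into one representation. The sketch below is an assumption-laden illustration: the label set `EMOTIONS`, the function `normalize_emotion`, and the choice to map a bare label to 100% are not taken from the patent.

```python
from typing import Dict, Union

# Emotion labels named in the text; a real system may use a different set.
EMOTIONS = ("anger", "joy", "sadness")


def normalize_emotion(emotion: Union[str, Dict[str, float]]) -> Dict[str, float]:
    """Convert emotion data into percentage scores for every known label.

    A text label such as "joy" becomes 100% for that label and 0% for the
    others (an assumed convention). A dict of scores is validated and padded
    with 0% for labels that are missing.
    """
    if isinstance(emotion, str):
        if emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {emotion!r}")
        return {name: (100.0 if name == emotion else 0.0) for name in EMOTIONS}

    scores = {name: float(emotion.get(name, 0.0)) for name in EMOTIONS}
    for name, value in scores.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} score out of range: {value}")
    return scores
```

For example, `normalize_emotion("joy")` and `normalize_emotion({"anger": 20, "joy": 70})` both yield a dict with one percentage per label, which is convenient as a fixed-length model input.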

The first processed face image data is face image data that includes the face of the same character as the character in the face image data and that differs from that face image data. For example, the first processed face image data may include the face of the same character as in the face image data while differing from that character's face in expression, gesture, orientation, or the like.

The "character" described above may be a person or animal synthetically generated to imitate the user, a person or animal synthetically generated to imitate a real person or animal, or a synthetically generated person or animal such as an animated figure.

The first processing database is generated by machine learning. As the first processing database, for example, a trained model for generating first output data from first input data is used. The model is generated by machine learning using a plurality of sets of first processing training data, each set pairing first input data, which includes reference face image data and reference emotion data, with first output data, which includes first processed face image data.

As shown in FIG. 3, for example, the first processing database is generated by machine learning using a plurality of sets of first processing training data, each set pairing first input data, which includes reference face image data and reference emotion data, with first output data, which includes first processed face image data.
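The structure of the first processing training data, as pairs of first input data and first output data, can be sketched as below. The class names, the use of `bytes` for image payloads, and the filtering rule are illustrative assumptions. The filter reflects the statement elsewhere in this description that the first processed face image data differs from the reference face image data.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FirstInputData:
    reference_face_image: bytes  # encoded image bytes (assumed representation)
    reference_emotion: str       # e.g. "joy"


@dataclass(frozen=True)
class FirstOutputData:
    first_processed_face_image: bytes


TrainingPair = Tuple[FirstInputData, FirstOutputData]


def build_first_processing_training_data(
    records: Iterable[Tuple[bytes, str, bytes]],
) -> List[TrainingPair]:
    """Assemble (input, output) pairs from raw (face, emotion, processed) records.

    Records whose processed image is identical to the reference image are
    skipped, since the output is described as differing from the reference.
    """
    pairs: List[TrainingPair] = []
    for face, emotion, processed in records:
        if processed == face:
            continue
        pairs.append((FirstInputData(face, emotion), FirstOutputData(processed)))
    return pairs
```

A list of such pairs is what a supervised training loop would iterate over. The training procedure itself is outside the scope of this sketch.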

The first processing database is generated, for example, by machine learning that uses a neural network as the model. The first processing database may be generated by machine learning that uses a neural network such as a CNN (Convolutional Neural Network) as the model, or any other model may be used.

The first processing database stores, for example, a first association that has association degrees between first input data, which includes reference face image data and reference emotion data, and first output data, which includes first processed face image data. An association degree indicates the degree of connection between the first input data and the first output data. For example, the higher the association degree, the stronger the connection between the data can be judged to be. The association degree may be expressed with three or more values (three or more levels), such as a percentage, or with two values (two levels).

For example, the association is constructed from the degrees of connection between many-to-many information (a plurality of first input data versus a plurality of first output data). The association is updated as appropriate during machine learning and represents, for example, a function (classifier) optimized based on a plurality of first input data and a plurality of first output data. The first association may have, for example, a plurality of association degrees, each indicating the degree of connection between data items. When the database is built as a neural network, for example, the association degrees can correspond to weight variables.

Accordingly, the content reproduction system 100 selects first output data suited to the first input data using, for example, a first association that takes all of the classifier's determination results into account. This allows first output data suited to the first input data to be selected quantitatively, not only when the first input data is identical or similar to the first output data but also when it is dissimilar.

As shown in FIG. 4, for example, the first association may indicate the degrees of connection between a plurality of first output data and a plurality of first input data. In this case, the first association makes it possible to store, for each of the plurality of first output data ("first processed face image data A" to "first processed face image data C" in FIG. 4), its degree of relationship with each of the plurality of first input data ("face image data A + emotion data A" to "face image data C + emotion data C" in FIG. 4). Thus, for example, a plurality of first input data can be linked to a single first output data via the first association. This enables first output data to be selected for the first input data from multiple perspectives.

The first association has, for example, a plurality of association degrees that link each first output data with each first input data. An association degree is expressed in three or more levels, such as a percentage, 10 levels, or 5 levels, and is represented, for example, by a feature of a line (such as its thickness). For example, "face image data A + emotion data A" in the first input data has an association degree AA of "73%" with "first processed face image data A" in the first output data, and an association degree AB of "12%" with "first processed face image data B" in the first output data. In other words, the "association degree" indicates the degree of connection between data items, and a higher association degree indicates a stronger connection.
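The selection of first output data by association degree can be illustrated with the 73% and 12% figures given above. The nested-dictionary layout and the function `select_first_output` are assumptions made for this sketch. The patent describes the association only abstractly, and in a neural network the degrees would correspond to learned weights rather than a lookup table.

```python
from typing import Dict

# First association as a lookup table: input -> {output: association degree}.
# The two degrees below are the example values from the description (FIG. 4).
FIRST_ASSOCIATION: Dict[str, Dict[str, float]] = {
    "face image data A + emotion data A": {
        "first processed face image data A": 0.73,  # association degree AA
        "first processed face image data B": 0.12,  # association degree AB
    },
}


def select_first_output(
    first_input: str, association: Dict[str, Dict[str, float]]
) -> str:
    """Return the first output data with the highest association degree."""
    candidates = association[first_input]
    return max(candidates, key=candidates.get)


selected = select_first_output(
    "face image data A + emotion data A", FIRST_ASSOCIATION
)
```

Here `selected` is "first processed face image data A", because 73% is the larger of the two association degrees.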

In the first internal representation database, at least one hidden layer may be provided between the first input data and the first output data, and machine learning may be performed with this structure. The association degrees described above are set for the first input data, the hidden layer data, or both, and serve as the weighting of each data item, and the output is selected based on them. An output may be selected when its association degree exceeds a certain threshold.
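The combination of a hidden layer, weights acting as association degrees, and threshold-based selection can be sketched with a tiny feed-forward computation. Everything specific here is assumed for illustration: the sigmoid activation, the hand-picked weights, the two-unit layers, and the 0.5 threshold are not specified in the patent.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """One fully connected layer: each row of weights feeds one unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(inputs, hidden_weights, output_weights) -> List[float]:
    """Input -> one hidden layer -> output scores (treated as association degrees)."""
    return layer(layer(inputs, hidden_weights), output_weights)


def select_above_threshold(
    scores: Sequence[float], labels: Sequence[str], threshold: float
) -> List[str]:
    """Select every output whose score exceeds the threshold."""
    return [label for label, score in zip(labels, scores) if score > threshold]


# Hand-picked illustrative weights (a trained model would learn these).
HIDDEN_WEIGHTS = [[10.0, 0.0], [-10.0, 0.0]]
OUTPUT_WEIGHTS = [[10.0, 0.0], [-10.0, 0.0]]
LABELS = ["first processed face image data A", "first processed face image data B"]

scores = forward([1.0, 0.0], HIDDEN_WEIGHTS, OUTPUT_WEIGHTS)
chosen = select_above_threshold(scores, LABELS, threshold=0.5)
```

With these weights the first output scores close to 1 and the second close to 0, so only "first processed face image data A" passes the threshold.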

<Content reproduction device 1>
Next, an example of the content reproduction device 1 in the present embodiment will be described with reference to FIGS. 5 and 6. FIG. 5(a) is a schematic diagram showing an example of the configuration of the content reproduction device 1 in the present embodiment, and FIG. 5(b) is a schematic diagram showing an example of the functions of the content reproduction device 1 in the present embodiment.

An electronic device such as a laptop (notebook) PC or a desktop PC is used as the content reproduction device 1, for example. As shown in FIG. 5(a), for example, the content reproduction device 1 includes a housing 10, a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a storage unit 104, and I/Fs 105 to 107. The components 101 to 107 are connected by an internal bus 110.

ＣＰＵ１０１は、コンテンツ再生装置１全体を制御する。ＲＯＭ１０２は、ＣＰＵ１０１の動作コードを格納する。ＲＡＭ１０３は、ＣＰＵ１０１の動作時に使用される作業領域である。保存部１０４は、データベースや学習対象データ等の各種情報が記憶される。保存部１０４として、例えばＨＤＤ（Hard Disk Drive）のほか、ＳＳＤ（Solid State Drive）等のデータ保存装置が用いられる。なお、例えばコンテンツ再生装置１は、図示しないＧＰＵ（Graphics Processing Unit）を有してもよい。 The CPU 101 controls the entire content reproduction device 1. The ROM 102 stores the operation code of the CPU 101. The RAM 103 is a work area used when the CPU 101 operates. The storage unit 104 stores various information such as a database and learning target data. As the storage unit 104, for example, in addition to an HDD (Hard Disk Drive), a data storage device such as an SSD (Solid State Drive) is used. For example, the content reproduction device 1 may have a GPU (Graphics Processing Unit) (not shown).

Ｉ／Ｆ１０５は、通信網４を介して、必要に応じて端末２、サーバ３、ウェブサイト等との各種情報の送受信を行うためのインターフェースである。Ｉ／Ｆ１０６は、入力部１０８との情報の送受信を行うためのインターフェースである。入力部１０８として、例えばキーボードが用いられ、コンテンツ再生装置１の使用者等は、入力部１０８を介して、各種情報、又はコンテンツ再生装置１の制御コマンド等を入力する。Ｉ／Ｆ１０７は、表示部１０９との各種情報の送受信を行うためのインターフェースである。表示部１０９は、保存部１０４に保存された各種情報、又はコンテンツ等を表示する。表示部１０９として、ディスプレイが用いられ、例えばタッチパネル式の場合、入力部１０８と一体に設けられる。また、表示部１０９は、スピーカが用いられてもよい。 The I / F 105 is an interface for transmitting and receiving various information to and from the terminal 2, the server 3, the website, etc., as needed, via the communication network 4. The I / F 106 is an interface for transmitting / receiving information to / from the input unit 108. For example, a keyboard is used as the input unit 108, and the user or the like of the content reproduction device 1 inputs various information, a control command of the content reproduction device 1, or the like via the input unit 108. The I / F 107 is an interface for transmitting and receiving various information to and from the display unit 109. The display unit 109 displays various information, contents, and the like stored in the storage unit 104. A display is used as the display unit 109, and for example, in the case of a touch panel type, it is provided integrally with the input unit 108. Further, a speaker may be used for the display unit 109.

図５（ｂ）は、コンテンツ再生装置１の機能の一例を示す模式図である。コンテンツ再生装置１は、取得部１１と、処理部１２と、生成部１３と、出力部１４と、記憶部１５とを備え、例えばＤＢ生成部１６を有してもよい。なお、図５（ｂ）に示した各機能は、ＣＰＵ１０１が、ＲＡＭ１０３を作業領域として、保存部１０４等に記憶されたプログラムを実行することにより実現され、例えば人工知能等により制御されてもよい。 FIG. 5B is a schematic diagram showing an example of the functions of the content reproduction device 1. The content reproduction device 1 includes an acquisition unit 11, a processing unit 12, a generation unit 13, an output unit 14, and a storage unit 15, and may include, for example, a DB generation unit 16. Each function shown in FIG. 5B is realized by the CPU 101 executing a program stored in the storage unit 104 or the like, using the RAM 103 as a work area, and may be controlled by, for example, artificial intelligence.

＜＜取得部１１＞＞
取得部１１は、顔画像データと、感情データとを取得する。取得したデータは、上述した第１処理後顔画像データを生成する際に用いられる。取得部１１は、例えば入力部１０８から入力された顔画像データと、感情データとを取得するほか、例えば通信網４を介して、端末２等から顔画像データと、感情データとを取得してもよい。また、取得部１１は、予め取得された複数の顔画像データ、及び感情データの中からユーザが選択したデータを取得してもよい。 << Acquisition unit 11 >>
The acquisition unit 11 acquires face image data and emotion data. The acquired data is used when generating the first processed face image data described above. The acquisition unit 11 acquires, for example, the face image data and the emotion data input from the input unit 108, and may also acquire the face image data and the emotion data from the terminal 2 or the like via, for example, the communication network 4. Further, the acquisition unit 11 may acquire data selected by the user from a plurality of face image data and emotion data acquired in advance.

取得部１１は、例えば上述したデータベースの生成に用いられる学習データを取得してもよい。取得部１１は、例えば入力部１０８から入力された学習データを取得するほか、例えば通信網４を介して、端末２等から学習データを取得してもよい。 The acquisition unit 11 may acquire, for example, the learning data used for generating the above-mentioned database. In addition to acquiring the learning data input from the input unit 108, for example, the acquisition unit 11 may acquire the learning data from the terminal 2 or the like via, for example, the communication network 4.

例えば、第１処理用データベースの生成に用いられる第１処理用学習データとして、過去の参照用顔画像データ及び参照用感情データが挙げられる。 For example, as the first processing learning data used for generating the first processing database, past reference face image data and reference emotion data can be mentioned.

＜＜処理部１２＞＞
処理部１２は、例えば第１処理用データベースを参照し、顔画像データと感情データとに対する第１処理後顔画像データを生成する。 << Processing unit 12 >>
The processing unit 12 refers to, for example, the first processing database, and generates the face image data after the first processing for the face image data and the emotion data.

＜＜生成部１３＞＞
生成部１３は、処理部１２で生成した顔画像データに基づき、少なくとも１つの擬似データを生成する。生成部１３は、例えば処理部１２で生成された第１処理後顔画像データに基づき、音声及び顔画像を含む擬似データを生成する。擬似データを生成することによって、記憶部１５に記憶されていないキャラクターの表現を出力することが可能となる。生成部１３は、擬似データを生成する際に、公知の技術を用いてもよい。 << Generation unit 13 >>
The generation unit 13 generates at least one pseudo data based on the face image data generated by the processing unit 12. The generation unit 13 generates pseudo data including voice and face image based on the first processed face image data generated by the processing unit 12, for example. By generating the pseudo data, it is possible to output the expression of the character that is not stored in the storage unit 15. The generation unit 13 may use a known technique when generating pseudo data.

＜＜出力部１４＞＞
出力部１４は、各種データを出力する。出力部１４は、例えば生成部１３で生成された擬似データを出力してもよい。出力部１４は、Ｉ／Ｆ１０７を介して表示部１０９に各種データを出力するほか、例えばＩ／Ｆ１０５を介して、複数の端末２等に各種データを出力する。 << Output unit 14 >>
The output unit 14 outputs various data. The output unit 14 may output, for example, the pseudo data generated by the generation unit 13. The output unit 14 outputs various data to the display unit 109 via the I / F 107, and also outputs various data to a plurality of terminals 2 and the like via, for example, the I / F 105.

＜＜記憶部１５＞＞
記憶部１５は、保存部１０４に保存されたデータベース等の各種データを必要に応じて取出す。記憶部１５は、各構成１１～１４、１６により取得又は生成された各種データを、必要に応じて保存部１０４に保存する。 << Memory unit 15 >>
The storage unit 15 retrieves various data such as a database stored in the storage unit 104 as needed. The storage unit 15 stores various data acquired or generated by the configurations 11 to 14 and 16 in the storage unit 104 as needed.

＜＜ＤＢ生成部１６＞＞
ＤＢ生成部１６は、複数の学習データを用いた機械学習によりデータベースを生成する。機械学習には、例えばニューラルネットワーク等が用いられる。 << DB generation unit 16 >>
The DB generation unit 16 generates a database by machine learning using a plurality of learning data. For machine learning, for example, a neural network or the like is used.
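The following is a minimal sketch of how a database of association degrees could be derived from a plurality of learning data. Note that it deliberately substitutes simple co-occurrence counting for the neural network mentioned above, so it is not the method of the specification; the function name `build_association_table` and the use of string labels in place of real image and emotion data are assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_association_table(training_pairs):
    """Build a table of association degrees from (input, output) pairs.

    Each degree is the fraction of training pairs with a given input that
    map to a given output. This counting scheme is an illustrative
    stand-in for machine learning with a neural network.
    """
    counts = defaultdict(Counter)
    for input_key, output_key in training_pairs:
        counts[input_key][output_key] += 1

    table = {}
    for input_key, outputs in counts.items():
        total = sum(outputs.values())
        table[input_key] = {out: n / total for out, n in outputs.items()}
    return table


if __name__ == "__main__":
    pairs = [
        ("face A + emotion A", "processed face A"),
        ("face A + emotion A", "processed face A"),
        ("face A + emotion A", "processed face B"),
    ]
    print(build_association_table(pairs))
```

A trained neural network would generalize to inputs not present in the learning data, which a counting table like this cannot do; the sketch only shows the shape of the input/output pairing.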

＜端末２＞
端末２は、例えばコンテンツ再生システム１００を用いたサービスのユーザ等が保有し、通信網４を介してコンテンツ再生装置１と接続される。端末２は、例えばデータベースを生成する電子機器を示してもよい。端末２は、例えばパーソナルコンピュータや、タブレット端末等の電子機器が用いられる。端末２は、例えばコンテンツ再生装置１の備える機能のうち、少なくとも一部の機能を備えてもよい。 <Terminal 2>
The terminal 2 is owned by, for example, a user of a service using the content reproduction system 100, and is connected to the content reproduction device 1 via a communication network 4. The terminal 2 may indicate, for example, an electronic device that generates a database. As the terminal 2, for example, an electronic device such as a personal computer or a tablet terminal is used. The terminal 2 may have at least some of the functions of the content reproduction device 1, for example.

＜サーバ３＞
サーバ３は、通信網４を介してコンテンツ再生装置１と接続される。サーバ３は、過去の各種データ等が記憶され、必要に応じてコンテンツ再生装置１から各種データが送信される。サーバ３は、例えばコンテンツ再生装置１の備える機能のうち、少なくとも一部の機能を備えてもよく、例えばコンテンツ再生装置１の代わりに少なくとも一部の処理を行ってもよい。サーバ３は、例えばコンテンツ再生装置１の保存部１０４に記憶された各種データのうち少なくとも一部が記憶され、例えば保存部１０４の代わりに用いられてもよい。 <Server 3>
The server 3 is connected to the content reproduction device 1 via the communication network 4. Various past data and the like are stored in the server 3, and various data are transmitted from the content reproduction device 1 as needed. The server 3 may have at least a part of the functions of the content reproduction device 1, for example, and may perform at least a part of the processing instead of the content reproduction device 1, for example. The server 3 stores, for example, at least a part of various data stored in the storage unit 104 of the content reproduction device 1, and may be used in place of the storage unit 104, for example.

＜通信網４＞
通信網４は、例えばコンテンツ再生装置１が通信回路を介して接続されるインターネット網等である。通信網４は、いわゆる光ファイバ通信網で構成されてもよい。また、通信網４は、有線通信網のほか、無線通信網等の公知の通信技術で実現してもよい。 <Communication network 4>
The communication network 4 is, for example, an Internet network or the like to which the content reproduction device 1 is connected via a communication circuit. The communication network 4 may be configured by a so-called optical fiber communication network. Further, the communication network 4 may be realized by a known communication technology such as a wireless communication network in addition to the wired communication network.

（第１実施形態：コンテンツ再生システムの動作）
次に、本実施形態におけるコンテンツ再生システム１００の動作の一例について説明する。図６は、第１実施形態におけるコンテンツ再生装置の動作の一例を示す図である。 (First Embodiment: Operation of content reproduction system)
Next, an example of the operation of the content reproduction system 100 in the present embodiment will be described. FIG. 6 is a diagram showing an example of the operation of the content reproduction device according to the first embodiment.

＜取得ステップＳ１１０＞
取得ステップＳ１１０は、ユーザ等により入力された顔画像データと感情データとを取得する。取得ステップＳ１１０では、例えば取得部１１が、顔画像データと感情データとを取得する。取得部１１は、例えば端末２等から顔画像データと感情データとを取得するほか、例えば記憶部１５を介して、保存部１０４から取得してもよい。また、取得ステップＳ１１０は、例えば顔画像データとして、動画のように顔画像データと、顔画像データに紐づいた音声データを取得してもよい。 <Acquisition step S110>
The acquisition step S110 acquires the face image data and the emotion data input by the user or the like. In the acquisition step S110, for example, the acquisition unit 11 acquires the face image data and the emotion data. The acquisition unit 11 may acquire the face image data and the emotion data from, for example, the terminal 2 or the like, or may acquire the face image data and the emotion data from the storage unit 104 via, for example, the storage unit 15. Further, in the acquisition step S110, for example, as face image data, face image data such as a moving image and voice data associated with the face image data may be acquired.

＜第１処理ステップＳ１２０＞
第１処理ステップＳ１２０は、例えば第１処理用データベースを参照し、取得ステップＳ１１０で取得した顔画像データと感情データとに対する第１処理後顔画像データを生成する。第１処理ステップＳ１２０では、例えば第１処理部１２１は、第１処理用データベースを参照し、顔画像データと感情データとに対する第１処理後顔画像データを生成する。第１処理ステップＳ１２０は、例えば記憶部１５を介して、生成した第１処理後顔画像データを保存部１０４に保存してもよい。なお、生成した第１処理後顔画像データは、例えばサーバ３や他のコンテンツ再生装置１、又は複数のユーザ端末２に送信されてもよい。生成するデータは、ひとつの入力データに対して複数のデータを生成してもよい。これにより、ユーザの感情を反映した第１処理後顔画像データを生成することが可能となる。これによって、ユーザの感情が反映できるキャラクターの表現が可能となる。また、生成部１３により、疑似的に第１処理後顔画像データを生成してもよい。また、第１処理ステップＳ１２０は、例えば処理部１２に含まれる第１処理部１２１により、処理してもよい。 <First processing step S120>
The first processing step S120 refers to, for example, the database for the first processing, and generates the face image data after the first processing for the face image data and the emotion data acquired in the acquisition step S110. In the first processing step S120, for example, the first processing unit 121 refers to the first processing database and generates the first processed facial image data for the facial image data and the emotional data. In the first processing step S120, the generated face image data after the first processing may be stored in the storage unit 104, for example, via the storage unit 15. The generated face image data after the first processing may be transmitted to, for example, the server 3, another content reproduction device 1, or a plurality of user terminals 2. As the data to be generated, a plurality of data may be generated for one input data. This makes it possible to generate face image data after the first processing that reflects the emotions of the user. This makes it possible to express a character that can reflect the user's emotions. Further, the generation unit 13 may generate the face image data after the first processing in a pseudo manner. Further, the first processing step S120 may be processed by, for example, the first processing unit 121 included in the processing unit 12.

＜出力ステップＳ１３０＞
出力ステップＳ１３０では、例えば出力部１４は、第１処理ステップＳ１２０により取得された第１処理後顔画像データを、表示部１０９や端末２等に出力する。 <Output step S130>
In the output step S130, for example, the output unit 14 outputs the face image data after the first processing acquired in the first processing step S120 to the display unit 109, the terminal 2, and the like.

上述した各ステップを行うことで、本実施形態におけるコンテンツ再生システム１００の動作が完了する。 By performing each of the steps described above, the operation of the content reproduction system 100 in the present embodiment is completed.
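The sequence of the acquisition step S110, the first processing step S120, and the output step S130 can be sketched as below. This is a hedged illustration: the `FaceImage` class, the three function names, and the rule that simply copies the emotion into an `expression` field are assumptions standing in for the lookup of the first processing database.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FaceImage:
    """Hypothetical stand-in for face image data of one character."""
    character: str
    expression: str = "neutral"


def acquire(face, emotion):
    """S110: acquire face image data and emotion data (passed through here)."""
    return face, emotion


def first_process(face, emotion):
    """S120: produce first processed face image data.

    The character stays the same while the image differs from the input.
    A real system would refer to the first processing database instead.
    """
    return replace(face, expression=emotion)


def output(face):
    """S130: output the processed data, here as a printable string."""
    return f"{face.character} [{face.expression}]"


def run(face, emotion):
    face, emotion = acquire(face, emotion)
    return output(first_process(face, emotion))


if __name__ == "__main__":
    print(run(FaceImage("character A"), "joy"))
```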

（第２実施形態）
以下、本発明の第２実施形態を適応したコンテンツ再生システム１００について説明する。本発明の第２実施形態は、第１処理後顔画像データとテキストデータに対する第２処理後顔画像データを生成する点で第１実施形態と異なる。また、第１実施形態と同様な構成の説明は省略する。 (Second Embodiment)
Hereinafter, the content reproduction system 100 to which the second embodiment of the present invention is applied will be described. The second embodiment of the present invention is different from the first embodiment in that the second post-processed face image data for the first post-processed face image data and the text data is generated. Further, the description of the configuration similar to that of the first embodiment will be omitted.

＜コンテンツ再生システム１００＞
コンテンツ再生システム１００は、図７に示すように、第１処理用データベースを参照して生成された顔画像データと感情データとに対する第１処理後顔画像データと、テキストデータとに対する第２処理後顔画像データを生成するために用いられる。コンテンツ再生システム１００は、例えば学習データを用いた機械学習により生成された第２処理用データベースを参照し、第１処理後顔画像データと、テキストデータとに対する第２処理後顔画像データを生成する。かかる場合、コンテンツ再生システム１００は、例えば顔画像データと感情データとに対する第１処理後顔画像データとして、顔画像データに含まれるキャラクターの目元を変化させた第１処理後顔画像データを生成し、さらに第２処理用データベースを参照し、生成した第１処理後顔画像データと、テキストデータに対する第１処理後顔画像データに含まれるキャラクターの口元を変化させた第２処理後顔画像データを生成してもよい。 <Content playback system 100>
As shown in FIG. 7, the content reproduction system 100 is used to generate second processed face image data for text data and first processed face image data, the latter being generated for face image data and emotion data by referring to the first processing database. The content reproduction system 100 refers to, for example, a second processing database generated by machine learning using learning data, and generates the second processed face image data for the first processed face image data and the text data. In such a case, the content reproduction system 100 may, for example, generate, as the first processed face image data for the face image data and the emotion data, data in which the area around the eyes of the character included in the face image data is changed, and may then refer to the second processing database to generate, for that first processed face image data and the text data, second processed face image data in which the mouth of the character included in the first processed face image data is changed.

テキストデータは、例えばコンテンツ再生システム１００によって生成される顔画像データを生成する際に用いられる。テキストデータは、例えばユーザが入力した会話文、又はキャラクターに話させたい会話文等であってもよい。テキストデータは、文字や文字コードによって表されるデータである。テキストデータは、例えば、モニタやプリンタなどの機器を制御するためのデータである制御文字を含む。制御文字は、例えば、改行を表す改行文字やタブ（水平タブ）などが含まれる。 The text data is used, for example, when the content reproduction system 100 generates face image data. The text data may be, for example, a conversational sentence input by the user or a conversational sentence that the user wants the character to speak. Text data is data represented by characters and character codes. The text data includes, for example, control characters, which are data for controlling devices such as monitors and printers. Control characters include, for example, a line feed character representing a line break and a tab (horizontal tab).
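To make the notion of control characters concrete, the sketch below separates them from the rest of a piece of text data using Python's standard `unicodedata` module, where the line feed and the horizontal tab both fall in the Unicode general category `Cc`. The two function names are hypothetical, and whether the system strips or keeps such characters is not stated in the specification.

```python
import unicodedata


def control_characters(text):
    """Return the control characters (Unicode category Cc) found in text."""
    return [ch for ch in text if unicodedata.category(ch) == "Cc"]


def strip_control_characters(text):
    """Return text with control characters removed (an assumed preprocessing)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


if __name__ == "__main__":
    sample = "ただいま\n\tおかえり"
    print(control_characters(sample))
    print(strip_control_characters(sample))
```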

テキストデータは、例えば通信網４を介して、また、テキストデータは、音声を音声認識することによって抽出したものであってもよい。テキストデータは、例えばコンテンツ再生装置１等を介して、ユーザ等により入力されてもよい。 The text data may be acquired, for example, via the communication network 4, or may be extracted from voice by speech recognition. The text data may also be input by a user or the like via, for example, the content reproduction device 1.

第２処理後顔画像データは、第１処理後顔画像データの一部を変化させた顔画像データである。第２処理後顔画像データは、例えば第１処理後顔画像データに含まれるキャラクターの口等の画像データの一部を変化させたものであってもよい。 The face image data after the second processing is face image data obtained by changing a part of the face image data after the first processing. The face image data after the second processing may be, for example, a part of the image data such as the mouth of the character included in the face image data after the first processing changed.

第２処理用データベースは、例えば機械学習により生成されることが好ましいがこの限りではない。第２処理用データベースとして、例えば参照用第１処理後顔画像データと参照用テキストデータとを含む第２入力データと、参照用第２処理後顔画像データを含む第２出力データとを一組のデータセットとする第２処理用学習データを複数用いて、機械学習により生成された、第２入力データから第２出力データを生成するための学習済みモデルが用いられる。かかる場合、第２処理用データベースの生成方法は、入力データを第２入力データ、出力を第２出力データとする点で第１処理用データベースと異なる。 The second processing database is preferably, but not necessarily, generated by machine learning. As the second processing database, for example, a trained model for generating second output data from second input data is used. The model is generated by machine learning using a plurality of sets of second processing training data, each set pairing second input data, which includes reference first processed face image data and reference text data, with second output data, which includes reference second processed face image data. In such a case, the method of generating the second processing database differs from that of the first processing database in that the input is the second input data and the output is the second output data.

第２処理用データベースは、例えば図８に示すように、参照用第１処理後顔画像データと参照用感情データとを含む第２入力データと、参照用第２処理後顔画像データを含む第２出力データとを一組のデータセットとする第２処理用学習データを複数用いて、機械学習により生成される。 As shown in FIG. 8, for example, the second processing database is generated by machine learning using a plurality of sets of second processing training data, each set pairing second input data, which includes reference first processed face image data and reference emotion data, with second output data, which includes reference second processed face image data.

第２処理用データベースは、例えばニューラルネットワークをモデルとした機械学習を用いて、生成される。第２処理用データベースは、例えばＣＮＮ（Convolution Neural Network）等のニューラルネットワークをモデルとした機械学習を用いて生成されるほか、任意のモデルが用いられてもよい。 The second processing database is generated using, for example, machine learning using a neural network as a model. The second processing database is generated by using machine learning using a neural network such as CNN (Convolution Neural Network) as a model, or any model may be used.

第２処理用データベースには、例えば図９に示すように第２入力データと、第２出力データとの間における連関度を有する第２連関性が記憶される。連関度は、第２入力データと第２出力データとの繋がりの度合いを示しており、例えば連関度が高いほど各データの繋がりが強いと判断することができる。連関度は、例えば百分率等の３値以上（３段階以上）で示されるほか、２値（２段階）で示されてもよい。 In the second processing database, for example, as shown in FIG. 9, a second association having a degree of association between the second input data and the second output data is stored. The degree of association indicates the degree of connection between the second input data and the second output data. For example, it can be determined that the higher the degree of association, the stronger the connection of each data. The degree of association may be indicated by three values or more (three stages or more) such as percentage, or may be indicated by two values (two stages).

（第２実施形態：コンテンツ再生システムの動作）
次に、第２実施形態におけるコンテンツ再生システム１００の動作の一例について説明する。図１０は、第２実施形態におけるコンテンツ再生装置の動作の一例を示す図である。 (Second Embodiment: Operation of content reproduction system)
Next, an example of the operation of the content reproduction system 100 in the second embodiment will be described. FIG. 10 is a diagram showing an example of the operation of the content reproduction device according to the second embodiment.

＜取得ステップＳ２１０＞
取得ステップＳ２１０は、ユーザ等により入力された顔画像データと感情データとテキストデータとを取得する。取得ステップＳ２１０では、例えば取得部１１が、顔画像データと感情データとテキストデータとを取得する。また、取得ステップＳ２１０は、例えばテキストデータとして、動画のように顔画像データと、顔画像データに紐づいた音声データを取得して、取得した音声データを音声認識することにより取得してもよい。 <Acquisition step S210>
The acquisition step S210 acquires the face image data, the emotion data, and the text data input by the user or the like. In the acquisition step S210, for example, the acquisition unit 11 acquires the face image data, the emotion data, and the text data. Further, in the acquisition step S210, the text data may be obtained, for example, by acquiring face image data together with voice data associated with the face image data, as in a moving image, and applying speech recognition to the acquired voice data.

＜第２処理ステップＳ２４０＞
第２処理ステップＳ２４０は、例えば第２処理用データベースを参照し、第１処理ステップＳ１２０により生成された第１処理後顔画像データと、取得ステップＳ２１０で取得したテキストデータとに対する第２処理後顔画像データを生成する。第２処理ステップＳ２４０では、例えば第２処理部１２２は、第２処理用データベースを参照し、第１処理後顔画像データと、テキストデータとに対する第２処理後顔画像データを生成する。かかる場合、第２処理部１２２は、入力された第１処理後顔画像データを公知の画像解析技術により画像解析し、第１処理後顔画像データに含まれるキャラクターの顔の一部、例えば口を判定し、判定した部分をテキストデータに合わせて変化させた第２処理後顔画像データを生成してもよい。また、第２処理ステップＳ２４０は、第２処理用データベースを用いることなく、第２処理後顔画像データを生成してもよい。これにより、入力されたテキストデータに適した第２処理後顔画像データを生成することが可能となり、ユーザの会話に合わせた精度の高いキャラクターの表現が可能となる。また、生成部１３により、疑似的に第２処理後顔画像データを生成してもよい。また、第２処理ステップＳ２４０は、例えば処理部１２に含まれ、第１処理部１２１に接続される第２処理部１２２により、処理してもよい。 <Second processing step S240>
The second processing step S240 refers to, for example, the second processing database, and generates second processed face image data for the first processed face image data generated in the first processing step S120 and the text data acquired in the acquisition step S210. In the second processing step S240, for example, the second processing unit 122 refers to the second processing database and generates the second processed face image data for the first processed face image data and the text data. In such a case, the second processing unit 122 may analyze the input first processed face image data using a known image analysis technique, identify a part of the character's face included in it, for example the mouth, and generate second processed face image data in which the identified part is changed in accordance with the text data. The second processing step S240 may also generate the second processed face image data without using the second processing database. This makes it possible to generate second processed face image data suited to the input text data, and to express the character with high accuracy in accordance with the user's conversation. Further, the generation unit 13 may generate the second processed face image data in a pseudo manner. The second processing step S240 may be performed, for example, by the second processing unit 122, which is included in the processing unit 12 and connected to the first processing unit 121.
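The two-stage processing of the second embodiment, in which the first processing changes the area around the eyes according to the emotion data and the second processing changes the mouth according to the text data, can be sketched as follows. The `Face` class, the function names, and the rule that derives mouth shapes from the vowels of romanized text are all assumptions for illustration; the specification leaves the actual mapping to the second processing database or to known image analysis techniques.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Face:
    """Hypothetical stand-in for face image data, split into facial parts."""
    character: str
    eyes: str = "neutral"
    mouth: tuple = ()


def first_process(face, emotion):
    """First processing: change only the eyes according to the emotion data."""
    return replace(face, eyes=emotion)


def second_process(face, text):
    """Second processing: change only the mouth according to the text data.

    Assumed rule: one mouth shape per vowel in romanized text.
    """
    shapes = tuple(ch for ch in text.lower() if ch in "aiueo")
    return replace(face, mouth=shapes)


if __name__ == "__main__":
    face = first_process(Face("character A"), "joy")
    print(second_process(face, "tadaima"))
```

Keeping the two stages separate means the eye region produced by the first processing is left untouched when the mouth is updated for each new sentence.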

（第３実施形態）
以下、本発明の第３実施形態を適応したコンテンツ再生システム１００について説明する。本発明の第３実施形態は、顔画像データと感情データとテキストデータとに対する第１処理後顔画像データを生成する点で第１実施形態と異なる。また、第１実施形態と同様な構成の説明は省略する。 (Third Embodiment)
Hereinafter, the content reproduction system 100 to which the third embodiment of the present invention is applied will be described. The third embodiment of the present invention is different from the first embodiment in that the face image data after the first processing for the face image data, the emotion data, and the text data is generated. Further, the description of the configuration similar to that of the first embodiment will be omitted.

＜コンテンツ再生システム１００＞
コンテンツ再生システム１００は、図１１に示すように入力された任意の顔画像データと、感情データと、テキストデータとに対し、第１処理後顔画像データを生成するために用いられる。コンテンツ再生システム１００は、例えば学習データを用いた機械学習により生成された第１処理用データベースを参照し、顔画像データと、感情データと、テキストデータに対する第１処理後顔画像データを生成する。 <Content playback system 100>
As shown in FIG. 11, the content reproduction system 100 is used to generate first processed face image data for arbitrary input face image data, emotion data, and text data. The content reproduction system 100 refers to, for example, a first processing database generated by machine learning using learning data, and generates the first processed face image data for the face image data, the emotion data, and the text data.

第１処理用データベースは、機械学習により生成される。第１処理用データベースとして、例えば参照用顔画像データと参照用感情データと参照用テキストデータとを含む第１入力データと、参照用第１処理後顔画像データを含む第１出力データとを一組のデータセットとする第１処理用学習データを複数用いて、機械学習により生成された、第１入力データから第１出力データを生成するための学習済みモデルが用いられる。かかる場合、第１処理用データベースの生成方法は、第１入力データに参照用テキストデータが含まれている点で第１実施形態と異なる。 The first processing database is generated by machine learning. As the first processing database, for example, the first input data including the reference face image data, the reference emotion data, and the reference text data, and the first output data including the reference first processed face image data are combined. A trained model for generating the first output data from the first input data generated by machine learning is used by using a plurality of first processing training data as a set of data sets. In such a case, the method of generating the first processing database is different from the first embodiment in that the reference text data is included in the first input data.

第１処理用データベースは、例えば図１２に示すように、参照用第１処理後顔画像データと参照用感情データと参照用テキストデータとを含む第１入力データと、参照用第１処理後顔画像データを含む第１出力データとを一組のデータセットとする第１処理用学習データを複数用いて、機械学習により生成される。 As shown in FIG. 12, for example, the first processing database includes first input data including reference first post-processing face image data, reference emotion data, and reference text data, and reference first post-processing face. It is generated by machine learning using a plurality of first processing training data having the first output data including image data as a set of data sets.

第２処理用データベースは、例えば図１３に示すように、ニューラルネットワークをモデルとした機械学習を用いて、生成される。第２処理用データベースは、例えばＣＮＮ（Convolution Neural Network）等のニューラルネットワークをモデルとした機械学習を用いて生成されるほか、任意のモデルが用いられてもよい。 The second processing database is generated by using machine learning using a neural network as a model, for example, as shown in FIG. The second processing database is generated by using machine learning using a neural network such as CNN (Convolution Neural Network) as a model, or any model may be used.

第１処理用データベースには、例えば第１入力データと、第１出力データとの間における連関度を有する第１連関性が記憶される。連関度は、第１入力データと第１出力データとの繋がりの度合いを示しており、例えば連関度が高いほど各データの繋がりが強いと判断することができる。連関度は、例えば百分率等の３値以上（３段階以上）で示されるほか、２値（２段階）で示されてもよい。 In the first processing database, for example, the first association having a degree of association between the first input data and the first output data is stored. The degree of association indicates the degree of connection between the first input data and the first output data. For example, it can be determined that the higher the degree of association, the stronger the connection of each data. The degree of association may be indicated by three values or more (three stages or more) such as percentage, or may be indicated by two values (two stages).

（第３実施形態：コンテンツ再生システムの動作）
次に、第３実施形態におけるコンテンツ再生システム１００の動作の一例について説明する。図１４は、第３実施形態におけるコンテンツ再生装置の動作の一例を示す図である。 (Third Embodiment: Operation of content reproduction system)
Next, an example of the operation of the content reproduction system 100 in the third embodiment will be described. FIG. 14 is a diagram showing an example of the operation of the content reproduction device according to the third embodiment.

＜取得ステップＳ３１０＞
取得ステップＳ３１０は、ユーザ等により入力された顔画像データと感情データとテキストデータとを取得する。 <Acquisition step S310>
The acquisition step S310 acquires the face image data, the emotion data, and the text data input by the user or the like.

＜第１処理ステップＳ３２０＞
第１処理ステップＳ３２０は、例えば第１処理用データベースを参照し、取得ステップＳ３１０で取得した顔画像データと、感情データと、テキストデータとに対する第１処理後顔画像データを生成する。第１処理ステップＳ３２０では、例えば第１処理部１２１は、第１処理用データベースを参照し、顔画像データと、感情データと、テキストデータとに対する第１処理後顔画像データを生成する。これにより、入力されたテキストデータに適した第１処理後顔画像データを生成することが可能となり、ユーザの会話に合わせた精度の高いキャラクターの表現が可能となる。また、生成部１３により、擬似的に第１処理後顔画像データを生成してもよい。また、第１処理ステップＳ３２０は、例えば処理部１２に含まれる第１処理部１２１により、処理してもよい。 <First processing step S320>
The first processing step S320 refers to, for example, the first processing database, and generates first processed face image data for the face image data, the emotion data, and the text data acquired in the acquisition step S310. In the first processing step S320, for example, the first processing unit 121 refers to the first processing database and generates the first processed face image data for the face image data, the emotion data, and the text data. This makes it possible to generate first processed face image data suited to the input text data, and to express the character with high accuracy in accordance with the user's conversation. Further, the generation unit 13 may generate the first processed face image data in a pseudo manner. The first processing step S320 may be performed, for example, by the first processing unit 121 included in the processing unit 12.

（第４実施形態）
以下、本発明の第４実施形態を適応したコンテンツ再生システム１００について説明する。本発明の第４実施形態は、返答モデルを参照し、ユーザが入力した会話文に対する返答を決定し、決定された返答に基づくテキストデータを取得する点で第３実施形態と異なる。また、第３実施形態と同様な構成の説明は省略する。 (Fourth Embodiment)
Hereinafter, the content reproduction system 100 to which the fourth embodiment of the present invention is applied will be described. The fourth embodiment of the present invention is different from the third embodiment in that it refers to a response model, determines a response to a conversational sentence input by the user, and acquires text data based on the determined response. Further, the description of the configuration similar to that of the third embodiment will be omitted.

＜コンテンツ再生システム１００＞
コンテンツ再生システム１００は、返答モデルを参照し、ユーザが入力した会話文に対する返答を決定し、決定された返答に基づくテキストデータを取得する。その後、コンテンツ再生システム１００は、入力された任意の顔画像データと、感情データと、テキストデータとに対し、第１処理後顔画像データを生成する。コンテンツ再生システム１００は、予め取得された参照用会話文と前記参照用会話文に対する返答との対応関係を示す返答モデルを参照し、ユーザが入力した会話文に対する返答を決定し、決定された返答に基づくテキストデータを取得する。 <Content playback system 100>
The content reproduction system 100 refers to the response model, determines a response to the conversational sentence input by the user, and acquires text data based on the determined response. After that, the content reproduction system 100 generates the face image data after the first processing for the input arbitrary face image data, the emotion data, and the text data. The content reproduction system 100 refers to a response model showing a correspondence relationship between a reference conversation sentence acquired in advance and a response to the reference conversation sentence, determines a response to the conversation sentence input by the user, and determines the determined response. Get text data based on.

返答モデルは、例えば表１のようにユーザが入力した会話文に対する返答が一義的に決定されるテーブルであってもよい。かかる場合、例えばユーザが「ただいま」という会話文を入力した場合、「おかえり」という会話文が返答として決定される。また、返答モデルは、ユーザが入力した日本語の会話文を英語に翻訳した会話文を返答として決定してもよい。また、返答モデルは機械学習により生成されてもよい。かかる場合、返答モデルは、参照用会話文を入力、参照用会話文に対する返答を出力とした複数の学習データを用いて機械学習により生成される。

The response model may be a table in which the response to the conversational sentence input by the user is uniquely determined, for example, as shown in Table 1. In such a case, for example, when the user inputs the conversational sentence "I'm home", the conversational sentence "Welcome back" is determined as a reply. Further, the response model may determine a conversational sentence obtained by translating a Japanese conversational sentence input by the user into English as a response. The response model may also be generated by machine learning. In such a case, the response model is generated by machine learning using a plurality of learning data in which the reference conversation sentence is input and the response to the reference conversation sentence is output.
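A table-type response model of the kind described above can be sketched as a simple dictionary lookup. Only the pair 「ただいま」 → 「おかえり」 comes from the text; the names `RESPONSE_TABLE` and `decide_response`, the whitespace trimming, and the behavior for sentences not in the table are assumptions, since the contents of Table 1 are not reproduced here.

```python
# Hypothetical response table; only this one pair appears in the text.
RESPONSE_TABLE = {
    "ただいま": "おかえり",
}


def decide_response(sentence, table=RESPONSE_TABLE, default=None):
    """Return the response uniquely determined for the input sentence.

    Sentences that are not in the table yield the default value (an assumed
    behavior; the specification does not say how misses are handled).
    """
    return table.get(sentence.strip(), default)


if __name__ == "__main__":
    print(decide_response("ただいま"))
```

A response model generated by machine learning would replace the dictionary with a trained model while keeping the same interface of one conversational sentence in and one response out.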

（第４実施形態：コンテンツ再生システムの動作）
次に、第４実施形態におけるコンテンツ再生システム１００の動作の一例について説明する。図１５は、第４実施形態におけるコンテンツ再生装置の動作の一例を示す図である。 (Fourth Embodiment: Operation of content reproduction system)
Next, an example of the operation of the content reproduction system 100 in the fourth embodiment will be described. FIG. 15 is a diagram showing an example of the operation of the content reproduction device according to the fourth embodiment.

＜取得ステップＳ４１０＞
取得ステップＳ４１０は、ユーザ等により入力された顔画像データと感情データとテキストデータとを取得する。 <Acquisition step S410>
The acquisition step S410 acquires the face image data, the emotion data, and the text data input by the user or the like.

＜返答処理ステップＳ４４０＞
返答処理ステップＳ４４０は、例えば返答モデルを参照し、取得ステップＳ４１０で取得したテキストデータに対する返答テキストデータを生成する。これにより、入力されたテキストデータに適した返答に基づくテキストデータを生成することが可能となり、ユーザの会話に合わせた精度の高いキャラクターの表現が可能となる。また、返答処理ステップＳ４４０は、例えば処理部１２に含まれる返答処理部１２３により、処理してもよい。 <Response processing step S440>
The response processing step S440 refers to, for example, the response model, and generates the response text data for the text data acquired in the acquisition step S410. This makes it possible to generate text data based on a response suitable for the input text data, and it is possible to express a character with high accuracy according to the conversation of the user. Further, the response processing step S440 may be processed by, for example, the response processing unit 123 included in the processing unit 12.

(Fifth Embodiment)
Hereinafter, the content reproduction system 100 to which the fifth embodiment of the present invention is applied will be described. The fifth embodiment differs from the third embodiment in that it generates voice data from sound quality data, emotion data, and text data, and generates expression data from the processed face image data and the voice data. Description of configurations that are the same as in the third embodiment is omitted.

<Content reproduction system 100>
The content reproduction system 100 acquires face image data, emotion data, text data, and voice quality data relating to voice quality (also called sound quality data in this description), and generates voice data representing the character's voice based on the text data, the emotion data, and the sound quality data. The content reproduction system 100 then generates expression data representing the character's expression based on the generated processed face image data and the generated voice data.

The content reproduction system 100 may also generate the voice data based on the acquired voice quality data, text data, and emotion data by referring to a voice processing database generated by machine learning. The machine learning uses a plurality of voice processing training data, each of which is a data set pairing third input data, which includes reference voice quality data, reference text data, and reference emotion data acquired in advance, with third output data, which includes reference voice data.

Sound quality data is data indicating sound quality. The sound quality data is, for example, data indicating acoustic features that characterize how a sound resonates. The acoustic features indicate, for example, the fundamental frequency, spectral envelope, aperiodicity index, spectrogram, loudness of the voice, cepstrum, pronunciation of words, intonation, time delay of sound waves, and changes in the rise and fall of the voice over time. The sound quality data may be input by a user or the like via, for example, the content reproduction device 1.

Voice data is encoded audio. Audio encoding includes, for example, encoding based on the pulse code modulation (PCM) method, in which the audio is represented as a bit string whose length is determined by the number of quantization bits, the sampling frequency, and the duration, and encoding based on the pulse density modulation (PDM) method, in which the density of the sound wave is represented with 1 bit and sampled at regular intervals.
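The dependence of the PCM bit-string length on the number of quantization bits, the sampling frequency, and the duration can be illustrated with a short calculation. This is a sketch for uncompressed linear PCM only; the function name `pcm_bit_length` and the `channels` parameter are assumptions added for illustration and do not appear in the description.

```python
def pcm_bit_length(bits_per_sample: int, sample_rate_hz: int,
                   duration_s: float, channels: int = 1) -> int:
    """Length in bits of uncompressed linear PCM audio.

    length = quantization bits x sampling frequency x duration (x channels)
    """
    if bits_per_sample <= 0 or sample_rate_hz <= 0 or channels <= 0:
        raise ValueError("bits_per_sample, sample_rate_hz and channels must be positive")
    if duration_s < 0:
        raise ValueError("duration_s must not be negative")
    num_samples = round(sample_rate_hz * duration_s)  # samples per channel
    return bits_per_sample * num_samples * channels


# One second of 16-bit mono audio sampled at 44.1 kHz.
bits = pcm_bit_length(16, 44_100, 1.0)
print(bits, bits // 8)  # 705600 bits, 88200 bytes
```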

The voice data may be based on, for example, audio extracted from video data. The voice data may represent audio picked up with a known sound collecting device or the like, or may represent synthetic audio generated by a known technique.

Expression data is data representing an expression of the character, composed of an image including the character and the character's voice. Expressions include, for example, visual, vocal, and physical expressions. A visual expression appeals to the sense of sight and includes gestures and facial expressions. A vocal expression appeals to the sense of hearing and includes words, utterances, and songs. A physical expression appeals to the sense of touch and includes physical contact. The expression data may include synthetically generated pseudo data. The expression data may also be a video including the character.

The voice processing database is generated by machine learning. As the voice processing database, a trained model for generating third output data from third input data is used, for example. This model is generated by machine learning using a plurality of third processing training data, each of which is a data set pairing third input data, which includes reference sound quality data, reference emotion data, and reference text data, with third output data, which includes reference voice data. In such a case, the method of generating the voice processing database differs from that of the first processing database in that the input data is the third input data and the output data is the third output data.

As shown in FIG. 16, for example, the voice processing database is generated by machine learning using a plurality of voice processing training data, each of which is a data set pairing third input data, which includes reference sound quality data, reference emotion data, and reference text data, with third output data, which includes reference voice data.

The voice processing database is generated using, for example, machine learning with a neural network as the model. A neural network such as a CNN (Convolutional Neural Network) may be used, or any other model may be used.

As shown in FIG. 17, for example, the voice processing database stores a third association, which holds a degree of association between the third input data and the third output data. The degree of association indicates how strongly the third input data and the third output data are connected; for example, the higher the degree of association, the stronger the connection between the data can be judged to be. The degree of association may be expressed in three or more levels, such as a percentage, or in two levels.
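One way to picture the degree of association is as a table that assigns each pair of third input data and candidate third output data a score, with the highest-scoring candidate being selected. The sketch below is only an illustration under that assumption: the table `ASSOCIATIONS`, its labels and scores, the function `select_output`, and the optional `threshold` are all hypothetical, and the description does not state that the database is stored or queried in this form.

```python
from typing import Dict, Optional, Tuple

# (sound quality label, emotion label, text) used as a simplified third input.
ThirdInput = Tuple[str, str, str]

# Hypothetical degrees of association (0.0 to 1.0) between inputs and candidate outputs.
ASSOCIATIONS: Dict[ThirdInput, Dict[str, float]] = {
    ("bright", "joy", "Welcome back"): {"voice_A": 0.80, "voice_B": 0.15},
    ("bright", "anger", "Welcome back"): {"voice_A": 0.10, "voice_B": 0.70},
}


def select_output(third_input: ThirdInput, threshold: float = 0.0) -> Optional[str]:
    """Pick the candidate output with the highest degree of association.

    Returns None when the input is unknown or no candidate reaches the threshold.
    """
    candidates = ASSOCIATIONS.get(third_input, {})
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= threshold else None


print(select_output(("bright", "joy", "Welcome back")))
```

A two-level degree of association corresponds to restricting the scores to 0 and 1.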

(Fifth Embodiment: Operation of the content reproduction system)
Next, an example of the operation of the content reproduction system 100 in the fifth embodiment will be described. FIG. 18 is a diagram showing an example of the operation of the content reproduction device according to the fifth embodiment.

<Acquisition step S510>
The acquisition step S510 acquires the face image data, the emotion data, the text data, and the sound quality data input by the user or the like.

<Voice processing step S550>
The voice processing step S550 refers to, for example, the voice processing database and generates voice data for the sound quality data, the emotion data, and the text data acquired in the acquisition step S510. In the voice processing step S550, for example, the voice processing unit 124 refers to the voice processing database and generates the voice data for the sound quality data, the emotion data, and the text data. This makes it possible to generate voice data suited to the input sound quality data, emotion data, and text data, so that the character can be expressed accurately in line with the user's conversation. The voice processing step S550 may be performed by, for example, the voice processing unit 124 included in the processing unit 12.

<Expression generation step S560>
The expression generation step S560 generates expression data based on the generated voice data and the processed face image data. The processed face image data includes, for example, the first processed face image data or the second processed face image data. The expression generation step S560 may be performed by, for example, the generation unit 13.
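The data flow of steps S510, S550, and S560 can be sketched as follows. This is a simplified illustration, not the implementation: the class names `Acquired` and `ExpressionData`, the function names, and `dummy_voice_model` (a placeholder standing in for the voice processing database) are all assumptions, and the processed face image is passed in as an opaque byte string because the first processing step is not modeled here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Acquired:
    """Data gathered in acquisition step S510 (field types are assumptions)."""
    face_image: bytes
    emotion: str
    text: str
    sound_quality: str


@dataclass(frozen=True)
class ExpressionData:
    """Result of expression generation step S560: a face image paired with a voice."""
    face_image: bytes
    voice: bytes


# A voice model maps (sound quality, emotion, text) to encoded voice data.
VoiceModel = Callable[[str, str, str], bytes]


def voice_processing_step(acq: Acquired, voice_model: VoiceModel) -> bytes:
    """S550: generate voice data from sound quality, emotion, and text."""
    return voice_model(acq.sound_quality, acq.emotion, acq.text)


def expression_generation_step(processed_face: bytes, voice: bytes) -> ExpressionData:
    """S560: combine the processed face image and the voice into expression data."""
    return ExpressionData(face_image=processed_face, voice=voice)


def dummy_voice_model(sound_quality: str, emotion: str, text: str) -> bytes:
    """Placeholder for a trained model; it only tags the inputs together."""
    return f"{sound_quality}|{emotion}|{text}".encode("utf-8")


acquired = Acquired(b"face", "joy", "Welcome back", "bright")
voice = voice_processing_step(acquired, dummy_voice_model)
expression = expression_generation_step(b"processed-face", voice)
print(expression)
```

Passing the voice model in as a callable keeps the step functions independent of how the voice processing database is actually realized.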

(Sixth Embodiment)
Hereinafter, the content reproduction system 100 to which the sixth embodiment of the present invention is applied will be described. The sixth embodiment differs from the first embodiment in that the first processed face image data is generated for face image data and text data. Description of configurations that are the same as in the first embodiment is omitted.

<Content reproduction system 100>
As shown in FIG. 19, the content reproduction system 100 is used to generate, with reference to the first processing database, first processed face image data for face image data and text data. The content reproduction system 100 refers to, for example, a first processing database generated by machine learning using training data, and generates the first processed face image data for the face image data and the text data.

The first processing database is generated by, for example, machine learning. As the first processing database, a trained model for generating first output data from first input data is used, for example. This model is generated by machine learning using a plurality of first processing training data, each of which is a data set pairing first input data, which includes reference face image data and reference text data, with first output data, which includes reference first processed face image data. In such a case, the method of generating the first processing database differs from that of the first processing database in the first embodiment in that the input data includes face image data and text data.

As shown in FIG. 20, for example, the first processing database is generated by machine learning using a plurality of first processing training data, each of which is a data set pairing first input data, which includes reference face image data and reference text data, with first output data, which includes reference first processed face image data.

As shown in FIG. 21, for example, the first processing database stores a first association, which holds a degree of association between the first input data and the first output data. The degree of association indicates how strongly the first input data and the first output data are connected; for example, the higher the degree of association, the stronger the connection between the data can be judged to be. The degree of association may be expressed in three or more levels, such as a percentage, or in two levels.

(Sixth Embodiment: Operation of the content reproduction system)
Next, an example of the operation of the content reproduction system 100 in the sixth embodiment will be described. FIG. 22 is a diagram showing an example of the operation of the content reproduction device according to the sixth embodiment.

<Acquisition step S610>
The acquisition step S610 acquires the face image data and the text data input by the user or the like. In the acquisition step S610, for example, the acquisition unit 11 acquires the face image data and the text data. For example, as with a video, the acquisition step S610 may acquire face image data together with voice data linked to the face image data, and obtain the text data by applying speech recognition to the acquired voice data.

<First processing step S620>
The first processing step S620 refers to, for example, the first processing database and generates first processed face image data for the face image data and the text data acquired in the acquisition step S610. In the first processing step S620, for example, the first processing unit 121 refers to the first processing database and generates the first processed face image data for the face image data and the text data. This makes it possible to generate first processed face image data suited to the input text data, so that the character can be expressed in line with the user's conversation. The generation unit 13 may also generate the first processed face image data synthetically. The first processing step S620 may be performed by, for example, the first processing unit 121 included in the processing unit 12.

Although embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their variations are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

1: Content reproduction device
2: Terminal
3: Server
4: Communication network
10: Housing
11: Acquisition unit
12: Processing unit
13: Generation unit
14: Output unit
15: Storage unit
16: DB generation unit
100: Content reproduction system
101: CPU
102: ROM
103: RAM
104: Storage unit
105: I/F
106: I/F
107: I/F
108: Input unit
109: Display unit
110: Internal bus
121: First processing unit
122: Second processing unit
123: Response processing unit
124: Voice processing unit
S110: Acquisition step
S120: First processing step
S130: Output step
S210: Acquisition step
S220: First processing step
S230: Output step
S240: Second processing step
S310: Acquisition step
S320: First processing step
S330: Output step
S410: Acquisition step
S420: First processing step
S430: Output step
S440: Response processing step
S510: Acquisition step
S520: First processing step
S530: Output step
S550: Voice processing step
S560: Expression generation step
S610: Acquisition step
S620: First processing step
S630: Output step

Claims

A content reproduction method characterized by causing a computer to execute:
an acquisition step of acquiring face image data including a character's face, emotion data indicating an emotion, and text data;
a first generation step of generating first processed face image data for the face image data and the emotion data acquired in the acquisition step, with reference to a first processing database generated by machine learning using a plurality of first processing training data, each of which is a data set pairing first input data, which includes reference face image data and reference emotion data acquired in advance, with first output data, which includes first processed face image data that includes the face of the same character as the character included in the reference face image data and differs from the reference face image data; and
a second generation step of generating second processed face image data in which a part of the first processed face image data is changed, based on the first processed face image data generated in the first generation step and the text data acquired in the acquisition step,
wherein the second generation step generates the second processed face image data based on the first processed face image data generated in the first generation step and the text data acquired in the acquisition step, with reference to a second processing database generated by machine learning using a plurality of second processing training data, each of which is a data set pairing second input data, which includes reference first processed face image data and reference text data acquired in advance, with second output data, which includes reference second processed face image data.

The content reproduction method according to claim 1, wherein the acquisition step acquires voice quality data relating to voice quality, and the method further causes the computer to execute:
a voice processing step of generating voice data indicating the voice of the character based on the voice quality data, the text data, and the emotion data acquired in the acquisition step; and
an expression generation step of generating expression data indicating an expression of the character based on the second processed face image data generated in the second generation step and the voice data generated in the voice processing step.

The content reproduction method according to claim 2, wherein the voice processing step generates the voice data for the voice quality data, the text data, and the emotion data acquired in the acquisition step, with reference to a voice processing database generated by machine learning using a plurality of voice processing training data, each of which is a data set pairing third input data, which includes reference voice quality data, reference text data, and reference emotion data acquired in advance, with third output data, which includes reference voice data.

The content reproduction method according to any one of claims 1 to 3, wherein the acquisition step refers to a response model indicating a correspondence between reference conversation data acquired in advance and response data for the reference conversation data, determines response data for conversation data input by a user, and acquires the text data based on the determined response data.

A content reproduction system characterized by comprising:
acquisition means for acquiring face image data including a character's face, emotion data indicating an emotion, and text data;
first generation means for generating first processed face image data for the face image data and the emotion data acquired by the acquisition means, with reference to a first processing database generated by machine learning using a plurality of first processing training data, each of which is a data set pairing first input data, which includes reference face image data and reference emotion data acquired in advance, with first output data, which includes first processed face image data that includes the face of the same character as the character included in the reference face image data and differs from the reference face image data; and
second generation means for generating second processed face image data in which a part of the first processed face image data is changed, based on the first processed face image data generated by the first generation means and the text data acquired by the acquisition means,
wherein the second generation means generates the second processed face image data based on the first processed face image data generated by the first generation means and the text data acquired by the acquisition means, with reference to a second processing database generated by machine learning using a plurality of second processing training data, each of which is a data set pairing second input data, which includes reference first processed face image data and reference text data acquired in advance, with second output data, which includes reference second processed face image data.