JP7369884B1

JP7369884B1 - Information processing device, information processing method, and information processing program

Info

Publication number: JP7369884B1
Application number: JP2023061972A
Authority: JP
Inventors: 輝明村上
Original assignee: Individual
Current assignee: Individual
Priority date: 2023-04-06
Filing date: 2023-04-06
Publication date: 2023-10-26
Anticipated expiration: 2043-04-06
Also published as: JP7371299B1

Abstract

[Problem] To provide a technology that can improve a user's comfort while the user views content. [Solution] An information processing device of the present disclosure automatically adjusts the volume of content viewed by a user. The device includes a control unit that executes: acquiring first captured image data, which is image data captured by a predetermined imaging device and represents a facial expression image of the user while the user views the content; acquiring a comfort state, which is the state of pleasure or displeasure the user feels when viewing the content, by inputting the first captured image data into a pre-learning model constructed by learning from predetermined image data; and automatically adjusting the volume of the content, based on the comfort state, so that the user's comfort improves while the user views the content. [Selected drawing] FIG. 3

Description

The present invention relates to an information processing device, an information processing method, and an information processing program that automatically adjust the volume of content viewed by a user.

Input interfaces that use devices such as a mouse or a touch panel have conventionally been used as user interfaces for operating mobile terminals, tablet terminals, smartphones, wearable terminals, personal computers, and the like. However, users have sometimes found it bothersome to use such manually operated input interfaces.

In addition, a user who is driving a vehicle, for example, cannot operate an input interface such as a mouse or a touch panel. It is therefore preferable that the above terminals be operated automatically, without relying on user operation through an input interface.

Patent Document 1 discloses a content providing device that, when content provided according to an occupant's emotion makes the occupant uncomfortable, alleviates that discomfort. In this technology, a change from first content to second content is commanded according to the occupant emotion estimated after the content output unit outputs the first content, for example when the output of the first content has worsened the occupant's emotion.

Patent Document 1: Japanese Patent Application Publication No. 2018-101341

When operating mobile terminals, tablet terminals, smartphones, wearable terminals, personal computers, and the like, users may find operation through input interfaces such as a mouse or a touch panel bothersome, so it is preferable that these terminals be operated automatically.

According to the technology described in Patent Document 1, when the output of the first content worsens the occupant's emotion, for example, the content is automatically changed from the first content to the second content, which would seem to reduce the burden of user operation. However, the comfort a user feels toward content depends not only on the genre of the content but also on its volume. There is thus still room for improvement in technology for improving a user's comfort while the user views content.

An object of the present disclosure is to provide a technology that can improve a user's comfort while the user views content.

An information processing device of the present disclosure is an information processing device that automatically adjusts the volume of content viewed by a user. The device includes a control unit that executes: acquiring first captured image data, which is image data captured by a predetermined imaging device and represents a facial expression image of the user while the user views the content; acquiring a comfort state, which is the state of pleasure or displeasure the user feels when viewing the content, by inputting the first captured image data into a pre-learning model constructed by learning from predetermined image data; and automatically adjusting the volume of the content, based on the comfort state, so that the user's comfort improves while the user views the content.

In the above information processing device, the user's comfort state is acquired by inputting, into the pre-learning model, the first captured image data representing the user's facial expression image while the user views the content. Based on this comfort state, the volume of the content is automatically adjusted so that the user's comfort improves during viewing. For example, when the user feels uncomfortable because of the volume of the content, the volume is adjusted automatically without any user operation. This improves the user's comfort while reducing the burden of operation.
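The acquire, classify, and adjust cycle described above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the labels `PLEASANT` and `UNCOMFORTABLE`, the `VolumeController` class, the step size, and the choice to lower the volume when the user is uncomfortable are not specified in the source, and `dummy_classifier` is only a placeholder for the pre-learning model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical labels; the source only speaks of a "comfort state".
PLEASANT = "pleasant"
UNCOMFORTABLE = "uncomfortable"


@dataclass
class VolumeController:
    """Maps one captured frame to a new volume (illustrative only)."""
    classify: Callable[[Sequence[float]], str]  # stand-in for the pre-learning model
    step: int = 5         # assumed adjustment step, in percent
    direction: int = -1   # assumed: lower the volume when the user is uncomfortable
    min_volume: int = 0
    max_volume: int = 100

    def update(self, image: Sequence[float], volume: int) -> int:
        state = self.classify(image)
        if state == PLEASANT:
            return volume  # comfortable: leave the volume unchanged
        new_volume = volume + self.direction * self.step
        # Clamp to the valid range so repeated adjustments cannot overshoot.
        return max(self.min_volume, min(self.max_volume, new_volume))


def dummy_classifier(image: Sequence[float]) -> str:
    # Placeholder rule, not a real facial-expression model.
    return UNCOMFORTABLE if sum(image) / len(image) > 0.5 else PLEASANT


controller = VolumeController(classify=dummy_classifier)
new_volume = controller.update([0.9, 0.8], 50)
```

Keeping the classifier behind a plain callable means the adjustment rule can be exercised without a trained model, and the direction of adjustment stays a parameter rather than a built-in assumption.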

The information processing device of the present disclosure may include a control unit that executes: acquiring first captured image data, which is image data captured by a camera and represents a facial expression image of the user while the user views content; acquiring a comfort state, which is the state of pleasure or displeasure the user feels when viewing the content, by inputting the first captured image data into a pre-learning model constructed by learning from image data captured in advance; and automatically adjusting the volume of the content, based on the comfort state, so that the user's comfort improves while the user views the content. The control unit may further execute: before the user views the content, playing initial content for initial setup, automatically varying the volume of the initial content during playback, and acquiring the timing at which the user feels uncomfortable because of the change in volume; and acquiring second captured image data, which is image data captured by the camera and represents a facial expression image of the user at that timing. The control unit may then cause the pre-learning model to learn using the second captured image data as training data. In this case, the control unit may further execute automatically generating third captured image data, which is captured image data generated by processing the second captured image data and represents an image of the user in which the position of the person in the second captured image data has been arbitrarily changed, and/or the background color in the second captured image data has been arbitrarily changed, and/or the clothing of the person in the second captured image data has been arbitrarily changed, and may add the third captured image data to the training data and cause the pre-learning model to learn. By automatically generating multiple pieces of third captured image data from a single piece of second captured image data, multiple pieces of captured image data that give different impressions can be generated, and the amount of training data for the pre-learning model can be increased efficiently.
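As a rough illustration of generating several variants from one captured image, the following Python sketch shifts the person horizontally and repaints the background. It assumes a person mask is already available (the source does not say how the person is separated from the background), it uses plain nested lists instead of an image library, and it does not attempt the clothing change. The names `augment` and `make_variants` are hypothetical.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[bool]]  # True where the person is (assumed to be given)


def augment(image: Image, person: Mask, dx: int, background: Pixel) -> Image:
    """Shift the person horizontally by dx pixels and repaint the background."""
    height, width = len(image), len(image[0])
    out = [[background for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Copy only person pixels; pixels shifted out of frame are dropped.
            if person[y][x] and 0 <= x + dx < width:
                out[y][x + dx] = image[y][x]
    return out


def make_variants(image: Image, person: Mask, count: int, seed: int = 0) -> List[Image]:
    """Generate `count` variants with random shifts and background colors."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    width = len(image[0])
    variants = []
    for _ in range(count):
        dx = rng.randint(-(width // 4), width // 4)
        background = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        variants.append(augment(image, person, dx, background))
    return variants
```

Each variant would keep the label of the image it was derived from, which is what lets one labeled capture stand in for many training examples.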

In the information processing device of the present disclosure, the control unit may further execute periodically acquiring, during playback of other content different from the content, fourth captured image data, which is image data captured by the imaging device and represents a facial expression image of the user while the user is viewing the other content, and may cause the pre-learning model to learn using the fourth captured image data as training data.

In this case, the control unit may, for the fourth captured image data, label the user's facial expression image at the time the user adjusted the volume of the other content as an uncomfortable state, label the user's facial expression image after a predetermined time has elapsed since the user adjusted the volume of the other content as a pleasant state, and cause the pre-learning model to learn. Furthermore, the control unit may acquire the comfort state based on the rate of match with the uncomfortable state and the rate of match with the pleasant state. This suppresses misrecognition of the user's comfort state as far as possible.
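The labeling rule and the use of two matching rates can be sketched as follows. This is a minimal sketch under stated assumptions: the `Frame` type, the tolerance window `tol`, the settle time of 10 seconds, and the `margin` rule that returns "unknown" when the two rates are close are all hypothetical, since the source only says that the comfort state is acquired based on both rates.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Frame:
    t: float        # capture time in seconds
    image_id: str   # stand-in for the image payload


def label_frames(frames: Sequence[Frame], adjust_times: Sequence[float],
                 settle: float = 10.0, tol: float = 0.5) -> List[Tuple[Frame, str]]:
    """Label periodically captured frames around each manual volume adjustment."""
    labeled = []
    for frame in frames:
        for t_adj in adjust_times:
            if abs(frame.t - t_adj) <= tol:
                # Captured when the user reached for the volume: uncomfortable.
                labeled.append((frame, "uncomfortable"))
                break
            if abs(frame.t - (t_adj + settle)) <= tol:
                # Captured after the assumed settle time: pleasant.
                labeled.append((frame, "pleasant"))
                break
    return labeled


def comfort_state(p_uncomfortable: float, p_pleasant: float, margin: float = 0.1) -> str:
    """Decide from the two matching rates; 'unknown' when they are too close."""
    if p_uncomfortable - p_pleasant > margin:
        return "uncomfortable"
    if p_pleasant - p_uncomfortable > margin:
        return "pleasant"
    return "unknown"
```

Comparing the two rates instead of thresholding a single one is one plausible way to avoid acting on an ambiguous expression; frames that match neither window are simply left unlabeled.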

The present disclosure can also be understood as an information processing method performed by a computer. That is, the information processing method of the present disclosure is a method for automatically adjusting the volume of content viewed by a user, in which a computer executes: a first acquisition step of acquiring first captured image data, which is image data captured by a camera and represents a facial expression image of the user while the user views the content; a second acquisition step of acquiring a comfort state, which is the state of pleasure or displeasure the user feels when viewing the content, by inputting the first captured image data into a pre-learning model constructed by learning from image data captured in advance; and an automatic adjustment step of automatically adjusting the volume of the content, based on the comfort state, so that the user's comfort improves while the user views the content. The computer further executes: before the user views the content, playing initial content for initial setup, automatically varying the volume of the initial content during playback, and acquiring the timing at which the user feels uncomfortable because of the change in volume; and acquiring second captured image data, which is image data captured by the camera and represents a facial expression image of the user at that timing. The computer then causes the pre-learning model to learn using the second captured image data as training data.

The present disclosure can also be understood as an information processing program. That is, the information processing program of the present disclosure is a program for automatically adjusting the volume of content viewed by a user, and causes a computer to execute: a first acquisition step of acquiring first captured image data, which is image data captured by a camera and represents a facial expression image of the user while the user views the content; a second acquisition step of acquiring a comfort state, which is the state of pleasure or displeasure the user feels when viewing the content, by inputting the first captured image data into a pre-learning model constructed by learning from image data captured in advance; and an automatic adjustment step of automatically adjusting the volume of the content, based on the comfort state, so that the user's comfort improves while the user views the content. The program further causes the computer to execute: before the user views the content, playing initial content for initial setup, automatically varying the volume of the initial content during playback, and acquiring the timing at which the user feels uncomfortable because of the change in volume; and acquiring second captured image data, which is image data captured by the camera and represents a facial expression image of the user at that timing. The program then causes the computer to make the pre-learning model learn using the second captured image data as training data.

According to the present disclosure, a user's comfort can be improved while the user views content.

FIG. 1 is a diagram showing a schematic configuration of the information processing system in the first embodiment.
FIG. 2 is a diagram showing in more detail the components of the server included in the information processing system in the first embodiment, together with the components of a user terminal that communicates with the server.
FIG. 3 is a diagram illustrating the flow of operation of the information processing system in the first embodiment.
FIG. 4 is a diagram illustrating an initial setting screen for using the information processing system.
FIG. 5 is a diagram for explaining the identification results obtained from input to the pre-learning model in the first embodiment and the neural network that constitutes the pre-learning model.
FIG. 6 is a diagram illustrating the flow of operation of the information processing system in the second embodiment.

Embodiments of the present disclosure are described below with reference to the drawings. The configurations of the following embodiments are examples, and the present disclosure is not limited to them.

<First Embodiment>
An overview of the information processing system in the first embodiment is described with reference to FIG. 1. FIG. 1 is a diagram showing a schematic configuration of the information processing system in this embodiment. The information processing system 100 according to this embodiment includes a network 200, a server 300, and a user terminal 400. The information processing system of the present disclosure automatically adjusts the volume of content viewed by a user: the volume of content being played on the user terminal 400 is adjusted according to commands from the server 300.

The network 200 is, for example, an IP network. As long as it is an IP network, the network 200 may be wireless, wired, or a combination of the two. For wireless communication, for example, the user terminal 400 may access a wireless LAN access point (not shown) and communicate with the server 300 via a LAN or WAN. The network 200 is not limited to these examples and may be, for example, a public switched telephone network, an optical line, an ADSL line, or a satellite communication network.

The server 300 is connected to the user terminal 400 via the network 200. For simplicity, FIG. 1 shows one server 300 and four user terminals 400, but the numbers are of course not limited to these.

The server 300 may be any electronic device that is a computer with the processing capability for computation and data processing such as acquiring, generating, and updating data; for example, a personal computer, a server, a mainframe, or another electronic device. That is, the server 300 can be configured as a computer having a processor such as a CPU or GPU, a main storage device such as RAM or ROM, and an auxiliary storage device such as an EPROM, a hard disk drive, or removable media. The removable media may be, for example, a USB memory or a disc recording medium such as a CD or DVD. The auxiliary storage device stores an operating system (OS), various programs, various tables, and the like.

Instead of providing software, hardware, an OS, and the like dedicated to the information processing system 100 according to this embodiment, the server 300 may use, as appropriate, cloud-server-based SaaS (Software as a Service), PaaS (Platform as a Service), or IaaS (Infrastructure as a Service).

The user terminal 400 may be any electronic device, such as a mobile terminal, held by a user of the information processing system 100; for example, a mobile terminal, a tablet terminal, a smartphone, a wearable terminal, a personal computer, or another terminal device.

Next, mainly the components of the server 300 are described in detail with reference to FIG. 2. FIG. 2 is a diagram showing in more detail the components of the server 300 included in the information processing system 100 in the first embodiment, together with the components of the user terminal 400 that communicates with the server 300.

The server 300 has a communication unit 301, a storage unit 302, and a control unit 303 as functional units. It loads a program stored in the auxiliary storage device into a work area of the main storage device and executes it, and by controlling each functional unit through execution of the program, it can realize functions that meet the predetermined purpose of each functional unit. However, some or all of the functions may be realized by a hardware circuit such as an ASIC or FPGA.

The communication unit 301 is a communication interface for connecting the server 300 to the network 200. The communication unit 301 includes, for example, a network interface board and a wireless communication circuit for wireless communication. The server 300 is communicably connected to the user terminal 400 and other external devices via the communication unit 301.

The storage unit 302 includes a main storage device and an auxiliary storage device. The main storage device is a memory into which the program executed by the control unit 303 and the data used by that control program are loaded. The auxiliary storage device stores the program executed by the control unit 303 and the data used by that control program. The server 300 acquires data transmitted from the user terminal 400 and the like via the communication unit 301, and the storage unit 302 stores the captured image data described later. The storage unit 302 also stores the training data and the pre-learning model used to acquire the comfort state described later.

The control unit 303 is a functional unit that governs the control performed by the server 300. The control unit 303 can be realized by an arithmetic processing device such as a CPU. The control unit 303 further includes four functional units: a first acquisition unit 3031, a second acquisition unit 3032, a volume adjustment unit 3033, and a learning unit 3034. Each functional unit may be realized by having the CPU execute a stored program. Because machine learning involves a large amount of computation, the learning unit 3034 may instead be realized by having a GPU execute a stored program. Using a GPU for the computation involved in machine learning in this way enables high-speed processing. For even faster processing, a computer cluster may be built from multiple computers equipped with such GPUs, with the computers in the cluster performing the processing in parallel.

The first acquisition unit 3031 acquires first captured image data representing a facial expression image of a user of the information processing system 100 while the user views content. Here, the content is, for example, a video or a piece of music. The first captured image data is captured by an imaging device of the user terminal 400 while the user views the content on the user terminal 400. A predetermined application for using the information processing system 100 is installed on the user terminal 400 in advance, and during content playback the application captures the above image in the background. The captured data is then uploaded to the server 300, whereupon the first acquisition unit 3031 acquires the first captured image data and stores it in the storage unit 302.

The user terminal 400 in this embodiment has a communication unit 401, an input/output unit 402, and a storage unit 403 as functional units. The communication unit 401 is a communication interface for connecting the user terminal 400 to the network 200 and includes, for example, a network interface board and a wireless communication circuit for wireless communication. The input/output unit 402 is a functional unit for displaying information received from outside via the communication unit 401 and for entering information to be transmitted to the outside via the communication unit 401. Like the storage unit 302 of the server 300, the storage unit 403 includes a main storage device and an auxiliary storage device.

The input/output unit 402 further includes a display unit 4021, an operation input unit 4022, and an image/audio input/output unit 4023. The display unit 4021 has a function of displaying various kinds of information and is realized by, for example, an LCD (Liquid Crystal Display), an LED (Light Emitting Diode) display, or an OLED (Organic Light Emitting Diode) display. The operation input unit 4022 has a function of accepting operation input from the user and is realized, specifically, by soft keys such as a touch panel or by hard keys. The image/audio input/output unit 4023 has a function of accepting input of images such as still images and videos and is realized, specifically, by a camera using an image sensor such as a CCD (Charge-Coupled Device), MOS (Metal-Oxide-Semiconductor), or CMOS (Complementary Metal-Oxide-Semiconductor) sensor. The image/audio input/output unit 4023 also has a function of accepting audio input and output and is realized, specifically, by a microphone and a speaker.

Accordingly, for the above content (for example, a video), the images can be displayed by the display unit 4021 and the audio can be output by the speaker. The user's facial expression image described above can then be captured by the camera.

The second acquisition unit 3032 acquires the comfort state, which is the state of pleasure or displeasure that a user of the information processing system 100 feels when viewing content. The second acquisition unit 3032 acquires the comfort state by inputting the first captured image data into the pre-learning model described later.

The volume adjustment unit 3033 automatically adjusts the volume of the content, based on the comfort state, so that the user's comfort improves while the user views the content.

The learning unit 3034 is a functional unit that constructs the pre-learning model used in the processing by the second acquisition unit 3032. Its details are described later.

The control unit 303 functions as the control unit according to the present disclosure by executing the processing of the first acquisition unit 3031, the second acquisition unit 3032, the volume adjustment unit 3033, and the learning unit 3034.

The flow of operation of the information processing system 100 in this embodiment is described next. FIG. 3 is a diagram illustrating the flow of operation of the information processing system 100 in this embodiment. FIG. 3 shows the flow of operation between the server 300 and the user terminal 400 in the information processing system 100, and the processing executed by the server 300 and the user terminal 400.

In this embodiment, initial setup for using the information processing system 100 is performed first. The server 300 transmits initial content for initial setup to the user terminal 400 of a user of the information processing system 100 so that the initial content is played on that terminal (S101). The initial content is then played on the user terminal 400 (S102). During this playback, the volume of the initial content changes automatically. The user inputs the discomfort timing on the user terminal 400 (S103), and this input is transmitted to the server 300, which thereby acquires the timing at which the user feels uncomfortable because of the change in volume (S104).

FIG. 4 illustrates an initial setting screen for using the information processing system 100. The screen SC1 illustrated in FIG. 4 is displayed on the display unit 4021 of the user terminal 400. The screen SC1 in FIG. 4(a) shows an initial setting start button SC11. When the button SC11 is pressed, the display transitions to the screen SC1 in FIG. 4(b), and the initial content is played in the initial content playback field. The audio is played so that the volume of the initial content gradually increases, and when the user feels that the audio is too loud, the user presses the volume minus button SC12, thereby entering the timing at which the change in volume causes discomfort. On the screen SC1 in FIG. 4(c), the audio is played so that the volume gradually decreases, and when the user feels that the audio is too quiet, the user presses the volume plus button SC13 to enter the timing in the same way.
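
The calibration of FIG. 4 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the disclosure: the function name `find_discomfort_tick`, the integer volume scale of 0 to 100, the linear ramp, and the callback `user_pressed` (a stand-in for the volume minus button SC12 or the volume plus button SC13) are all hypothetical.

```python
from typing import Callable, Optional, Tuple

def find_discomfort_tick(
    start: int,
    step_per_tick: int,
    ticks: int,
    user_pressed: Callable[[int, int], bool],
) -> Optional[Tuple[int, int]]:
    """Ramp the volume and return (tick, volume) at the first button press.

    start         -- initial volume on an assumed 0..100 scale
    step_per_tick -- volume change per tick (positive: rising, negative: falling)
    user_pressed  -- hypothetical stand-in for button SC12 / SC13
    Returns None if the user never presses the button during the ramp.
    """
    for tick in range(ticks + 1):
        # Clamp the ramp to the valid volume range.
        volume = max(0, min(100, start + step_per_tick * tick))
        if user_pressed(tick, volume):
            return tick, volume
    return None

# Simulated user who finds anything above 70 too loud (rising ramp, FIG. 4(b)).
too_loud = find_discomfort_tick(20, 5, 20, lambda t, v: v > 70)

# Simulated user who finds anything below 15 too quiet (falling ramp, FIG. 4(c)).
too_quiet = find_discomfort_tick(60, -5, 20, lambda t, v: v < 15)

print(too_loud, too_quiet)
```

The tick returned for each ramp corresponds to the discomfort timing that the terminal would report to the server in S103 and S104.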

Returning to FIG. 3, the server 300 transmits to the user terminal 400 a capture command for capturing second captured image data, which represents the user's facial expression image at the above timing (S105). The user terminal 400 receives the command (S106) and captures the second captured image data (S107). The second captured image data may be captured by a camera of the user terminal 400. The second captured image data is then transmitted from the user terminal 400 to the server 300, and the server 300 acquires it (S108).

The server 300 then trains the pre-learning model using the second captured image data as training data (S109). As described above, the second captured image data represents the user's facial expression at the timing when the user feels uncomfortable because of a change in content volume. Training the pre-learning model on this data therefore makes it possible to use the model to identify the user's comfort state while viewing content.

Note that the server 300 may automatically generate third captured image data based on the second captured image data, add the third captured image data to the training data, and train the pre-learning model on the result. Here, the third captured image data is captured image data generated by processing the second captured image data such that the position of the person in the image is arbitrarily changed, and/or the background color is arbitrarily changed, and/or the clothing of the person is arbitrarily changed.

Even when the user's facial expression is the same, the impression given by the captured image data may differ depending on the user's surroundings (background color, clothing, differences in brightness depending on position, and so on). Automatically generating multiple items of third captured image data from a single item of second captured image data therefore yields multiple images with different impressions, which efficiently increases the amount of training data for the pre-learning model.
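
A minimal sketch of such augmentation is given below, assuming NumPy arrays of shape H x W x 3 for images. The helper names (`shift_image`, `recolor_background`, `make_variants`) and the boolean `person_mask` marking the pixels that belong to the person are assumptions introduced for illustration; the disclosure does not state how the person is separated from the background, and the clothing change is not covered here.

```python
import numpy as np

def shift_image(img: np.ndarray, dx: int, dy: int, fill=0) -> np.ndarray:
    """Translate an image (H x W x 3) or mask (H x W) by (dx, dy) pixels.

    Vacated pixels are set to `fill`. Assumes |dx| < W and |dy| < H.
    """
    h, w = img.shape[:2]
    out = np.full_like(img, fill)
    src_x0, src_x1 = max(0, -dx), min(w, w - dx)
    src_y0, src_y1 = max(0, -dy), min(h, h - dy)
    dst_x0, dst_y0 = max(0, dx), max(0, dy)
    out[dst_y0:dst_y0 + (src_y1 - src_y0),
        dst_x0:dst_x0 + (src_x1 - src_x0)] = img[src_y0:src_y1, src_x0:src_x1]
    return out

def recolor_background(img: np.ndarray, person_mask: np.ndarray, color) -> np.ndarray:
    """Paint every pixel outside the (hypothetical) person mask with `color`."""
    out = img.copy()
    out[~person_mask] = color
    return out

def make_variants(img, person_mask, shifts, colors):
    """Generate one variant per (shift, background color) combination."""
    variants = []
    for dx, dy in shifts:
        moved = shift_image(img, dx, dy)
        moved_mask = shift_image(person_mask, dx, dy, fill=False)
        for color in colors:
            variants.append(recolor_background(moved, moved_mask, color))
    return variants

# Toy 4 x 4 image with a single white "person" pixel at row 1, column 1.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[1, 1] = 255
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True

variants = make_variants(img, mask, shifts=[(1, 0), (0, 1)], colors=[(10, 20, 30)])
print(len(variants))
```

Each variant keeps the facial region unchanged while altering its position and surroundings, which is the property the paragraph above relies on.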

The initial setting and learning processes of S101 to S109 may be executed each time the user uses the information processing system 100, or only the first time.

Once the initial settings described above are complete, arbitrary content is played on the user terminal 400 of the user of the information processing system 100 (S110). At this time, a predetermined application installed in advance on the user terminal 400 captures facial expression images of the user in the background while the content is played (S111). The first captured image data captured in this manner is transmitted from the user terminal 400 to the server 300.

The server 300 then acquires the first captured image data transmitted from the user terminal 400 (S112) and stores it in the storage unit 302.

The server 300 then acquires the comfort state based on the first captured image data, as described below.

The server 300 executes a process of calling the pre-learning model (S113). Here, the pre-learning model is a machine learning model used to obtain the comfort state from the first captured image data, and it is constructed in advance by the learning unit 3034, which trains it using the second captured image data as training data.

FIG. 5 is a diagram for explaining the identification result obtained from an input to the pre-learning model in this embodiment and the neural network that constitutes the model. In this embodiment, a neural network model generated by deep learning is used as the pre-learning model. The pre-learning model 30 has an input layer 31 that accepts input image data, an intermediate layer (hidden layer) 32 that extracts, from the image data supplied to the input layer 31, feature quantities representing a person's unpleasant facial expression, and an output layer 33 that outputs an identification result based on the feature quantities. In the example of FIG. 5, the pre-learning model 30 has one intermediate layer 32: the output of the input layer 31 is input to the intermediate layer 32, and the output of the intermediate layer 32 is input to the output layer 33. The number of intermediate layers 32 is not limited to one, however, and the pre-learning model 30 may have two or more.

As FIG. 5 shows, each of the layers 31 to 33 has one or more neurons. For example, the number of neurons in the input layer 31 can be set according to the input image data, and the number of neurons in the output layer 33 can be set according to the comfort state that is the identification result.

Neurons in adjacent layers are connected as appropriate, and a weight (connection weight) is set for each connection based on the result of machine learning. In the example of FIG. 5, each neuron is connected to all neurons in the adjacent layers, but the connections are not limited to this example and may be set as appropriate.

The pre-learning model 30 is constructed, for example, by supervised learning using training data consisting of pairs of image data containing an image of a person's facial expression and a label indicating an image of an unpleasant facial expression. Specifically, pairs of feature quantities and labels are given to the neural network, and the connection weights between neurons are tuned so that the output of the network matches the label. In this way, the features of the training data are learned, and a pre-learning model for estimating a result from an input is acquired inductively.
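
The structure of FIG. 5 (input layer, one fully connected intermediate layer, output layer) and the weight tuning described above can be illustrated with a small NumPy sketch. The disclosure does not specify the activation function, the loss, or the optimizer; the tanh activation, softmax output, cross-entropy loss, plain gradient descent, layer sizes, and the four-pixel toy "images" below are all assumptions chosen only to keep the example short.

```python
import numpy as np

def init_model(n_in: int, n_hidden: int, n_out: int, seed: int = 0) -> dict:
    """Create weights for input -> hidden -> output, all fully connected."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(model: dict, x: np.ndarray):
    """Return hidden activations and softmax class probabilities."""
    h = np.tanh(x @ model["W1"] + model["b1"])
    logits = h @ model["W2"] + model["b2"]
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    return h, p

def train_step(model: dict, x: np.ndarray, y_onehot: np.ndarray, lr: float = 0.5) -> float:
    """One gradient-descent step on the mean cross-entropy loss."""
    h, p = forward(model, x)
    n = x.shape[0]
    loss = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    d_logits = (p - y_onehot) / n              # gradient w.r.t. the logits
    d_h = d_logits @ model["W2"].T
    d_pre = d_h * (1.0 - h ** 2)               # derivative of tanh
    model["W2"] -= lr * (h.T @ d_logits)
    model["b2"] -= lr * d_logits.sum(axis=0)
    model["W1"] -= lr * (x.T @ d_pre)
    model["b1"] -= lr * d_pre.sum(axis=0)
    return float(loss)

# Toy data: flattened 4-pixel "images"; label 0 = unpleasant, 1 = pleasant.
x = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.9, 0.8, 0.1, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.1, 0.0, 0.9, 1.0]])
labels = np.array([0, 0, 1, 1])
y = np.eye(2)[labels]

# Input size follows the image size; output size follows the number of states.
model = init_model(n_in=x.shape[1], n_hidden=3, n_out=2)
losses = [train_step(model, x, y) for _ in range(500)]
_, probs = forward(model, x)
print(probs.argmax(axis=1))
```

A practical model for facial images would normally be much larger, for example a convolutional network, but the relationship between the layers, the weights, and the labels is the same.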

Returning to FIG. 3, the server 300 acquires the comfort state by inputting the first captured image data into the pre-learning model (S114). The server 300 then determines whether the user's comfort state while viewing the content is an uncomfortable state (S115). If the determination in S115 is affirmative, the flow proceeds to S116; if it is negative, the flow returns to S111.

If the determination in S115 is affirmative, volume adjustment processing is executed in S116. In S116, the server 300 automatically adjusts the volume of the content so as to improve the user's comfort while viewing it. For example, when it is determined that the user is in an uncomfortable state because the audio of the content is too loud, the server 300 executes processing to lower the volume. Likewise, when it is determined that the user is in an uncomfortable state because the audio is too quiet, the server 300 executes processing to raise the volume. A command for this volume adjustment is transmitted from the server 300 to the user terminal 400, and the volume is adjusted automatically on the user terminal 400 (S117). After S117, the flow returns to S111.

While the content is being played, the processes of S111 to S117 are repeated at a predetermined cycle, and the flow ends when playback of the content ends. With the processing described above, when the user feels uncomfortable because of the volume of the content, the volume is adjusted automatically without any operation by the user. This improves the user's comfort while reducing the burden of manual operation.
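
The repeated cycle of S111 to S117 can be sketched as a simple loop. The disclosure does not say how the cause of discomfort (too loud or too quiet) is distinguished, so the three state names returned by the hypothetical `classify` callback, the step size of 5, and the 0 to 100 volume scale are assumptions made for this illustration.

```python
from typing import Callable, Iterable, List

def adjust_volume(volume: int, state: str, step: int = 5) -> int:
    """S116: lower or raise the volume depending on the assumed state label."""
    if state == "too_loud":
        return max(0, volume - step)
    if state == "too_quiet":
        return min(100, volume + step)
    return volume  # comfortable: leave the volume unchanged (S115 negative)

def playback_loop(frames: Iterable, classify: Callable[[object], str], volume: int) -> List[int]:
    """Run one classify-and-adjust cycle per captured frame (S111 to S117)."""
    history = []
    for frame in frames:               # S111/S112: a frame is captured and received
        state = classify(frame)        # S113/S114: the model yields a comfort state
        volume = adjust_volume(volume, state)  # S115 to S117
        history.append(volume)
    return history

# Stand-in classifier: a lookup table instead of a trained model.
fake_states = {"f1": "too_loud", "f2": "too_loud", "f3": "comfortable", "f4": "too_quiet"}
history = playback_loop(["f1", "f2", "f3", "f4"], fake_states.get, volume=50)
print(history)
```

In the system described here, the classification would run on the server 300 and the resulting volume command would be sent to the user terminal 400; the loop above collapses both sides into one process for brevity.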

The information processing system 100 described above can improve the user's comfort when the user views content.

<Second embodiment>
An information processing system according to the second embodiment will be described with reference to FIG. 6. In this embodiment, the server 300 additionally acquires fourth captured image data. The fourth captured image data is captured image data representing the user's facial expression image while the user is viewing arbitrary other content, different from the content whose volume is automatically adjusted, and it is captured periodically by the camera of the user terminal 400 while that other content is being played. The server 300 then trains the pre-learning model using the fourth captured image data as training data.

FIG. 6 illustrates the flow of operation of the information processing system 100 in this embodiment: the exchange between the server 300 and the user terminal 400, and the processing that each of them executes. Processes in FIG. 6 that are substantially the same as those shown in FIG. 3 are denoted by the same reference signs, and their detailed description is omitted.

In the example shown in FIG. 6, when arbitrary other content, different from the content whose volume is automatically adjusted, is played on the user terminal 400 (S201), information to that effect is transmitted to the server 300 (S202). On the user terminal 400, a predetermined application for the information processing system 100, installed in advance, runs in the background, and when any content is played, the application transmits that information to the server 300.

On acquiring this information (S202), the server 300 transmits to the user terminal 400 a capture command for capturing fourth captured image data representing the user's facial expression image (S203). The command instructs the terminal to capture the fourth captured image data periodically while the other content is being played. The user terminal 400 receives the command (S204) and periodically captures the fourth captured image data (S205). The fourth captured image data is then transmitted from the user terminal 400 to the server 300, and the server 300 acquires it (S206).

The server 300 then trains the pre-learning model using the fourth captured image data as training data (S207).

As described above, the pre-learning model 30 can be constructed, for example, by supervised learning using training data consisting of pairs of image data containing an image of a person's facial expression and a label indicating an image of an unpleasant facial expression. In this embodiment, within the fourth captured image data, the user's facial expression image at the time the user adjusts the volume of the other content is labeled as an unpleasant state, and the user's facial expression image after a predetermined time has elapsed since the adjustment is labeled as a pleasant state. The pre-learning model is trained on the data labeled in this way.

The fourth captured image data is captured periodically, and the application described above can also monitor the timing at which the user adjusts the volume of the other content. Extracting, from the periodically captured fourth captured image data, the image that matches this timing therefore yields the user's facial expression image at the moment of adjustment. In this embodiment, the user is presumed to feel uncomfortable at the timing of a volume adjustment, and the facial expression image at that timing is labeled as an unpleasant state.

A user viewing content tends, once they begin adjusting the volume, to keep adjusting until it reaches a comfortable level. In other words, the user stops adjusting once the volume is comfortable. In this embodiment, it is therefore presumed that, once a predetermined time has elapsed after the user adjusts the volume of the other content, the adjustment has produced a comfortable volume and the user feels comfortable, and the facial expression image at that time is labeled as a pleasant state. The predetermined time is, for example, 30 seconds to 1 minute.
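
The labeling rule of this embodiment can be sketched as follows. The function names, the matching tolerance `tol`, and the choice to skip the pleasant label when another adjustment occurs within the settle period are assumptions added for illustration; the disclosure only states that the frame at the adjustment timing is labeled unpleasant and the frame after the predetermined time is labeled pleasant.

```python
from bisect import bisect_left
from typing import Dict, List, Optional

def nearest_frame(frame_times: List[int], t: int, tol: int) -> Optional[int]:
    """Index of the frame closest to time t within tol, or None (times sorted)."""
    i = bisect_left(frame_times, t)
    best = None
    for j in (i - 1, i):
        if 0 <= j < len(frame_times) and abs(frame_times[j] - t) <= tol:
            if best is None or abs(frame_times[j] - t) < abs(frame_times[best] - t):
                best = j
    return best

def label_frames(frame_times: List[int], adjust_times: List[int],
                 settle: int = 30, tol: int = 1) -> Dict[int, str]:
    """Map frame index -> "unpleasant" / "pleasant" from volume-adjustment times.

    settle is the predetermined time in seconds (30 s to 1 min in the text).
    """
    labels: Dict[int, str] = {}
    adjust_times = sorted(adjust_times)
    for k, a in enumerate(adjust_times):
        i = nearest_frame(frame_times, a, tol)
        if i is not None:
            labels[i] = "unpleasant"          # the user adjusted the volume here
        nxt = adjust_times[k + 1] if k + 1 < len(adjust_times) else None
        # Assumption: only the last adjustment of a run leads to a pleasant label.
        if nxt is None or nxt > a + settle:
            j = nearest_frame(frame_times, a + settle, tol)
            if j is not None and j not in labels:
                labels[j] = "pleasant"
    return labels

# Frames captured every 10 s; the user adjusts the volume at 10 s and 20 s.
frame_times = list(range(0, 100, 10))
labels = label_frames(frame_times, adjust_times=[10, 20])
print(labels)
```

Frames that receive no label are simply left out of the training data in this sketch.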

In the example shown in FIG. 6, the comfort state is acquired in S114 by inputting the first captured image data into the pre-learning model trained on the training data labeled as described above. In this embodiment, the identification result that the pre-learning model outputs for the first captured image data consists of a match ratio with the unpleasant state and a match ratio with the pleasant state, that is, the match ratios with the two labels described above. Based on these two ratios, the server 300 acquires, for example, the unpleasant state as the comfort state when the match ratio with the unpleasant state is higher than the match ratio with the pleasant state.
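
The comparison of the two match ratios can be illustrated as below. Treating the match ratios as softmax probabilities over two raw model scores is an assumption; the disclosure only states that a ratio is output for each label and that the two are compared. The function names are hypothetical.

```python
import math
from typing import Dict

def match_ratios(scores: Dict[str, float]) -> Dict[str, float]:
    """Convert raw scores into ratios that sum to 1 (softmax, an assumption)."""
    m = max(scores.values())
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def comfort_state(ratios: Dict[str, float]) -> str:
    """Return "unpleasant" only when its ratio is strictly higher."""
    if ratios["unpleasant"] > ratios["pleasant"]:
        return "unpleasant"
    return "pleasant"

ratios = match_ratios({"unpleasant": 2.0, "pleasant": 0.5})
state = comfort_state(ratios)
print(state)
```

With this rule, a tie between the two ratios is treated as the pleasant state, so the volume is left unchanged unless the evidence for discomfort is stronger.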

Such processing suppresses misrecognition of the user's comfort state as far as possible.

The information processing system 100 described above can likewise improve the user's comfort when the user views content.

<Other variations>
The embodiments described above are merely examples, and the present disclosure may be implemented with appropriate modifications without departing from its gist. For example, the processes and means described in this disclosure may be freely combined as long as no technical contradiction arises.

Processing described as being performed by one device may be shared among and executed by multiple devices. For example, the learning unit 3034 may be provided in an arithmetic processing device separate from the server 300, in which case that device is configured to cooperate with the server 300. Conversely, processing described as being performed by different devices may be executed by a single device. In a computer system, the hardware configuration (server configuration) that implements each function can be changed flexibly.

The present disclosure can also be realized by supplying a computer with a computer program that implements the functions described in the above embodiments and having one or more processors of the computer read and execute the program. Such a computer program may be provided to the computer on a non-transitory computer-readable storage medium connectable to the computer's system bus, or via a network. Non-transitory computer-readable storage media include, for example, any type of disk such as a magnetic disk (floppy (registered trademark) disk, hard disk drive (HDD), etc.) or an optical disk (CD-ROM, DVD, Blu-ray disc, etc.), read-only memory (ROM), random access memory (RAM), EPROM, EEPROM, magnetic cards, flash memory, optical cards, and any other type of medium suitable for storing electronic instructions.

100... Information processing system
200... Network
300... Server
301... Communication unit
302... Storage unit
303... Control unit
400... User terminal

Claims

An information processing device comprising a control unit that executes:
acquiring first captured image data, which is captured image data captured by a camera and represents a facial expression image of a user when the user views content;
acquiring a comfort state, which is a state of pleasure or displeasure felt by the user when the user views the content, by inputting the first captured image data into a pre-learning model constructed by performing learning using image data captured in advance; and
automatically adjusting the volume of the content based on the comfort state so as to improve the comfort of the user when the user views the content,
wherein the control unit further executes:
playing back initial content for initial settings before the user views the content, automatically changing the volume of the initial content during playback, and acquiring the timing at which the user feels uncomfortable due to the change in volume; and
acquiring second captured image data, which is captured image data captured by the camera and represents a facial expression image of the user at the timing,
and causes the pre-learning model to perform learning using the second captured image data as training data.

The information processing device according to claim 1, wherein the control unit:
further executes automatically generating third captured image data, which is captured image data generated by processing the second captured image data and represents an image of the user in which the position of the person included in the second captured image data has been arbitrarily changed, and/or the color of the background included in the second captured image data has been arbitrarily changed, and/or the clothing of the person included in the second captured image data has been arbitrarily changed; and
adds the third captured image data to the training data and causes the pre-learning model to perform learning.

An information processing method in which a computer executes:
a first acquisition step of acquiring first captured image data, which is captured image data captured by a camera and represents a facial expression image of a user when the user views content;
a second acquisition step of acquiring a comfort state, which is a state of pleasure or displeasure felt by the user when the user views the content, by inputting the first captured image data into a pre-learning model constructed by performing learning using image data captured in advance; and
an automatic adjustment step of automatically adjusting the volume of the content based on the comfort state so as to improve the comfort of the user when the user views the content,
wherein the computer further executes:
playing back initial content for initial settings before the user views the content, automatically changing the volume of the initial content during playback, and acquiring the timing at which the user feels uncomfortable due to the change in volume; and
acquiring second captured image data, which is captured image data captured by the camera and represents a facial expression image of the user at the timing,
and causes the pre-learning model to perform learning using the second captured image data as training data.

An information processing program that causes a computer to execute:
a first acquisition step of acquiring first captured image data, which is captured image data captured by a camera and represents a facial expression image of a user when the user views content;
a second acquisition step of acquiring a comfort state, which is a state of pleasure or displeasure felt by the user when the user views the content, by inputting the first captured image data into a pre-learning model constructed by performing learning using image data captured in advance; and
an automatic adjustment step of automatically adjusting the volume of the content based on the comfort state so as to improve the comfort of the user when the user views the content,
wherein the program further causes the computer to execute:
playing back initial content for initial settings before the user views the content, automatically changing the volume of the initial content during playback, and acquiring the timing at which the user feels uncomfortable due to the change in volume; and
acquiring second captured image data, which is captured image data captured by the camera and represents a facial expression image of the user at the timing,
and to cause the pre-learning model to perform learning using the second captured image data as training data.