JP7123856B2

JP7123856B2 - Presentation evaluation system, method, trained model and program, information processing device and terminal device

Info

Publication number: JP7123856B2
Application number: JP2019094014A
Authority: JP
Inventors: 翔太塩原; 聡太杉村; 将吾岡田; 悠太朗八木
Original assignee: Japan Advanced Institute of Science and Technology; SoftBank Corp
Current assignee: Japan Advanced Institute of Science and Technology; SoftBank Corp
Priority date: 2019-05-17
Filing date: 2019-05-17
Publication date: 2022-08-23
Anticipated expiration: 2039-05-17
Also published as: JP2020190579A

Description

The present invention relates to a system, method, trained model, and program for evaluating presentations, and to an information processing device and a terminal device used for presentation evaluation.

Evaluation devices that evaluate a presentation given by a presenter are known. For example, Patent Document 1 discloses a presentation evaluation device for presentations in which a screen on a personal computer is shown to the audience as slides. The device uses a gaze direction detection device to detect and determine what proportion of the time the presenter looked toward the audience, and it evaluates the presentation based on that gaze determination and on a comparison between the time allotted to each agenda page of the presentation material and the time the presenter actually spent on each page. According to Patent Document 1, the device records how much time was spent on the presentation material, quantifies what rating the recorded values merit, and obtains, as an objective evaluation value, the extent to which the presenter looked at and made contact with the audience while presenting.

Patent Document 1: JP 2007-219161 A

The evaluation performed by the conventional device is limited to aspects such as how much time was spent on the presentation and how much eye contact was made with the audience, so the viewpoints from which the presentation is evaluated are narrow. A presentation is inherently delivered by integrating the presenter's voice, utterance content, facial expressions, gestures, and so on, yet the conventional device cannot evaluate a presentation in that integrated sense. In addition, because the conventional device requires special hardware (a gaze direction detection device) to detect the presenter's line of sight, it is difficult to implement on a terminal device or the like used by the user.

A system according to one aspect of the present invention is a system for evaluating presentations. The system includes: a data acquisition unit that acquires audio data and video data of a subject giving a presentation; a feature extraction unit that extracts linguistic features and prosodic features of the presentation from the audio data and extracts, from the video data, motion features of the subject during the presentation; an inference unit that analyzes the linguistic, prosodic, and motion features to estimate an analysis result including evaluation values that quantitatively rate predetermined evaluation items of the presentation; and an analysis result output unit that outputs the analysis result.

The system may further include a data confirmation unit that checks whether the audio data and video data acquired by the data acquisition unit have a predetermined quality. Audio and video data that have the predetermined quality are used to extract the linguistic, prosodic, and motion features, and audio and video data that lack it are not.
The quality checked may be at least one of the following: the sound pressure of the audio data is within a predetermined range; the noise level in the audio data is at or below a threshold; the images in the video data include the body parts of the subject whose coordinates are used to extract the motion features; and the angle of the video capture direction relative to the subject's frontal direction is within a predetermined angle range.
The system may further include a data acquisition advice output unit that outputs an advice message about acquiring the audio data and video data when the data acquired by the data acquisition unit do not have the predetermined quality.
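The quality check and advice output described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the disclosed implementation: the function names, the thresholds, the body-part names, and the advice texts are all hypothetical, and the audio is assumed to arrive as lists of samples in the range [-1, 1] with a separately captured noise segment.

```python
import math

# All values below are illustrative assumptions; the source only speaks of a
# "predetermined range", a "threshold" and a "predetermined angle range".
LEVEL_RANGE_DB = (-40.0, -3.0)   # acceptable speech level, dB relative to full scale
MAX_NOISE_DB = -50.0             # acceptable noise level
MAX_ANGLE_DEG = 30.0             # acceptable camera angle off the subject's front
REQUIRED_PARTS = {"head", "left_hand", "right_hand"}  # hypothetical body parts


def rms_db(samples):
    """RMS level of samples in [-1, 1], in dB relative to full scale."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))


def check_quality(speech, noise, detected_parts, camera_angle_deg):
    """Return advice messages; an empty list means the data may be used."""
    advice = []
    level = rms_db(speech)
    if level < LEVEL_RANGE_DB[0]:
        advice.append("Speak louder or move closer to the microphone.")
    elif level > LEVEL_RANGE_DB[1]:
        advice.append("Lower the input volume.")
    if rms_db(noise) > MAX_NOISE_DB:
        advice.append("Record in a quieter place.")
    if not REQUIRED_PARTS <= set(detected_parts):
        advice.append("Keep the head and both hands inside the frame.")
    if abs(camera_angle_deg) > MAX_ANGLE_DEG:
        advice.append("Place the camera in front of the presenter.")
    return advice
```

Returning the advice messages directly lets one function serve both roles: an empty list means the recording goes on to feature extraction, and a non-empty list is what the advice output unit would show.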

In the system, the inference unit may use an analysis model that processes an input including the linguistic, prosodic, and motion features with a predetermined algorithm and outputs an analysis result including the quantitative evaluation values.
The linguistic, prosodic, and motion features may each include multiple types of features, and the analysis model used by the inference unit may be one that does not take as input those feature types whose patterning and extraction from the audio and video data takes a predetermined time or longer.
The inference unit may use multiple types of analysis models with mutually different algorithms for the analysis.
The inference unit may hold multiple types of analysis models with mutually different algorithms and use an analysis model selected from among them for the analysis.
The inference unit may select the analysis model to use from the multiple types based on at least one of the linguistic, prosodic, and motion features.
The inference unit may select the analysis model to use from the multiple types based on at least one of an attribute of the subject of the presentation and the type of the presentation.
The analysis model may be a trained model created by machine learning on supervised learning data that includes the linguistic, prosodic, and motion features acquired for multiple presentations together with ground-truth evaluation values.
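The two ways of using multiple analysis models (all of them together, or one selected by subject attribute and presentation type) might look like the sketch below. The registry keys, the toy linear scorer, and the choice to average when no specific model is selected are assumptions made for illustration; the source does not say how the outputs of several models are combined.

```python
from statistics import mean
from typing import Callable, Dict, Optional, Sequence, Tuple

Model = Callable[[Sequence[float]], float]
Key = Tuple[str, str]  # (subject attribute, presentation type); hypothetical keys


def linear_model(weights: Sequence[float], bias: float) -> Model:
    """Build a toy linear scorer that stands in for any trained analysis model."""
    def predict(features: Sequence[float]) -> float:
        return bias + sum(w * x for w, x in zip(weights, features))
    return predict


def evaluate(models: Dict[Key, Model], features: Sequence[float],
             key: Optional[Key] = None) -> float:
    """Use the model registered for `key` if present; otherwise average all models."""
    if key is not None and key in models:
        return models[key](features)
    return mean(m(features) for m in models.values())


models = {
    ("novice", "sales pitch"): linear_model([1.0, 0.5], 0.0),
    ("expert", "lecture"): linear_model([0.5, 1.0], 1.0),
}
```

With this registry, `evaluate(models, features, ("novice", "sales pitch"))` uses the single matching model, while `evaluate(models, features)` falls back to using every model.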

In the system, the linguistic features may include features relating to at least one of the number of fillers, the number of nouns, the number of verbs, interjections, the number of verb repetitions, and the number of noun repetitions in the full text of the presentation. The prosodic features may include features relating to at least one of pitch, intensity, sound pressure, intonation, speech rate, utterance length, silence length, and speech ratio in the audio of the presentation. The motion features may include features relating to at least one of the amount of motion of each part of the subject's body over the entire presentation and the amount of motion of each part of the subject's body while speaking.
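To make the three feature families concrete, the sketch below computes one simple feature from each: a filler count from a token list, a speech ratio from detected speech segments, and a motion amount from the frame-by-frame positions of one body keypoint. The definitions are assumptions chosen for illustration (the source names these features without defining them): the filler list is a hypothetical English one, the speech ratio is taken as speech time divided by total time, and the motion amount as total path length.

```python
import math

FILLERS = {"um", "uh", "er"}  # hypothetical filler list; the source gives none


def count_fillers(tokens):
    """Linguistic feature: number of filler words in the transcript tokens."""
    return sum(1 for t in tokens if t.lower() in FILLERS)


def speech_ratio(segments, total_duration):
    """Prosodic feature: fraction of the recording that contains speech.

    `segments` is a list of (start, end) times in seconds of detected speech.
    """
    speech_time = sum(end - start for start, end in segments)
    return speech_time / total_duration


def motion_amount(points):
    """Motion feature: path length of one keypoint over consecutive frames.

    `points` is a list of (x, y) positions, one per video frame.
    """
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))
```

The silence length follows from the same segment list as the total duration minus the summed speech time.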

In the system, the evaluation items may include the presentation's coverage of its purpose, the logic of its content, how it looks and sounds, and effective staging elements.

In the system, the inference unit may store multiple items of comment data as candidate comments on the presentation, select one or more of them based on an evaluation value and the coefficient of determination obtained when that evaluation value was estimated, and include in the analysis result a comment generated from the selected comment data.
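One possible reading of this comment selection is sketched below: the coefficient of determination acts as a reliability gate, and the evaluation value picks a score band among the stored candidates. The source does not specify the selection rule, so the cutoff, the score bands, the item names, and the comment texts are all hypothetical.

```python
# Candidate comments per evaluation item as (minimum score, text), highest band first.
# Items, bands and texts are hypothetical examples.
COMMENTS = {
    "intonation": [(4.0, "Your intonation is varied and engaging."),
                   (0.0, "Try varying your pitch to stress key points.")],
    "gesture": [(4.0, "Your gestures support your message well."),
                (0.0, "Use more hand gestures when making key points.")],
}
MIN_R2 = 0.5  # hypothetical reliability cutoff on the coefficient of determination


def select_comments(scores, r2):
    """Pick one comment per item whose estimate is reliable enough.

    `scores` maps item name to evaluation value; `r2` maps item name to the
    coefficient of determination of the model that produced that value.
    """
    selected = []
    for item, score in scores.items():
        if r2.get(item, 0.0) < MIN_R2:
            continue  # estimate too unreliable to comment on
        for min_score, text in COMMENTS.get(item, []):
            if score >= min_score:
                selected.append(text)
                break
    return selected
```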

The system may include a terminal device and an information processing device that can communicate with each other over a communication network. The terminal device has the data acquisition unit, a data transmission unit that transmits the audio data and video data to the information processing device, an analysis result reception unit that receives the analysis result from the information processing device, and the analysis result output unit. The information processing device has a data reception unit that receives the audio data and video data from the terminal device, the feature extraction unit, the inference unit, and an analysis result transmission unit that transmits the analysis result to the terminal device.

A trained model according to another aspect of the present invention causes a computer or processor to function so as to evaluate a presentation. The trained model is created by machine learning on supervised learning data that includes linguistic, prosodic, and motion features extracted from audio data and video data of subjects acquired for multiple presentations, together with ground-truth evaluation values that quantitatively rate predetermined evaluation items of those presentations. Given an input including linguistic, prosodic, and motion features extracted from audio data and video data of the subject acquired for a presentation under evaluation, the model outputs evaluation values that quantitatively rate the predetermined evaluation items of that presentation.

A method for evaluating a presentation according to still another aspect of the present invention includes: acquiring audio data and video data of a subject giving a presentation; extracting linguistic features and prosodic features of the presentation from the audio data and extracting, from the video data, motion features of the subject during the presentation; analyzing the linguistic, prosodic, and motion features to estimate an analysis result including evaluation values that quantitatively rate predetermined evaluation items of the presentation; and outputting the analysis result.

A terminal device according to still another aspect of the present invention is a terminal device capable of communicating with an information processing device over a communication network. The terminal device includes: a data acquisition unit that acquires audio data and video data of a subject giving a presentation; a data transmission unit that transmits the audio data and video data to the information processing device; an analysis result reception unit that receives, from the information processing device, an analysis result including evaluation values that quantitatively rate predetermined evaluation items of the presentation based on the audio data and video data; and an analysis result output unit that outputs the analysis result.

An information processing device according to still another aspect of the present invention is an information processing device capable of communicating with a terminal device over a communication network. The information processing device includes: a data reception unit that receives, from the terminal device, audio data and video data of a subject giving a presentation; a feature extraction unit that extracts linguistic features and prosodic features of the presentation from the audio data and extracts, from the video data, motion features of the subject during the presentation; an inference unit that analyzes the linguistic, prosodic, and motion features to estimate an analysis result including evaluation values that quantitatively rate predetermined evaluation items of the presentation; and an analysis result transmission unit that transmits the analysis result to the terminal device.

A program according to still another aspect of the present invention is executed by a computer or processor provided in a terminal device capable of communicating with an information processing device over a communication network. The program includes: program code for acquiring audio data and video data of a subject giving a presentation; program code for transmitting the audio data and video data to the information processing device; program code for receiving, from the information processing device, an analysis result including evaluation values that quantitatively rate predetermined evaluation items of the presentation based on the audio data and video data; and program code for outputting the analysis result.

A program according to still another aspect of the present invention is executed by a computer or processor provided in an information processing device capable of communicating with a terminal device over a communication network. The program includes: program code for receiving, from the terminal device, audio data and video data of a subject giving a presentation; program code for extracting linguistic features and prosodic features of the presentation from the audio data and extracting, from the video data, motion features of the subject during the presentation; program code for analyzing the linguistic, prosodic, and motion features to estimate an analysis result including evaluation values that quantitatively rate predetermined evaluation items of the presentation; and program code for transmitting the analysis result to the terminal device.

In the system, trained model, method, terminal device, information processing device, and programs described above, the prosodic features may include features of the voice of the subject giving the presentation, the linguistic features may include features of the content of the subject's utterances, and the motion features may include features of the subject's gestures.
The acquired data may also include detection data from various sensors, such as an infrared sensor or a heart rate sensor, measured on the subject giving the presentation, and the features used for the analysis may include body temperature, heart rate, and the like extracted from that detection data. Furthermore, what is output or transmitted may include both an analysis result containing past evaluation values and an analysis result containing the latest evaluation values, or may include the difference between an analysis result containing past evaluation values and the latest evaluation values.
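The option of reporting the difference between past and latest evaluation values could be as small as the sketch below. The dictionary layout (evaluation item name mapped to value) and the decision to skip items missing from the past result are assumptions.

```python
def score_deltas(previous, latest):
    """Per-item change from the previous evaluation to the latest one.

    Both arguments map evaluation item names to values; items that were not
    rated in the previous evaluation are skipped.
    """
    return {item: latest[item] - previous[item]
            for item in latest if item in previous}
```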

According to the present invention, a presentation can be evaluated quantitatively from multiple viewpoints based on its multimodal information, and the barrier to implementation is low.

FIG. 1 is an explanatory diagram showing an example of an overview of presentation evaluation in the system according to the embodiment.
FIG. 2 is an explanatory diagram showing an example of the presentation evaluation procedure according to the embodiment.
FIG. 3 is a block diagram showing an example of the schematic configuration of the terminal device and the information processing device in the system according to the embodiment.
FIG. 4 is a block diagram showing an example of the main configuration of the feature extraction unit of the information processing device according to the embodiment.
FIG. 5(a) is an explanatory diagram showing an example of a skeleton detection image used by the feature extraction unit according to the embodiment to extract the amount of motion of each part of the presenter's body. FIG. 5(b) is an explanatory diagram showing an example of the change over time (trajectory) of the position of the detection point on the presenter's head.
FIG. 6 is an explanatory diagram showing an example of the screen displaying the analysis result on the terminal device according to the embodiment.

Embodiments of the present invention are described below with reference to the drawings.
FIG. 1 is an explanatory diagram showing an example of an overview of presentation evaluation in the system according to this embodiment. Presentation evaluation in this embodiment calculates, among other things, quantitative evaluation values of a presentation from multiple viewpoints, using multimodal features that include prosodic, linguistic, and motion features extracted from audio and video data recorded while the person being evaluated gives the presentation. The system of this embodiment may be used as a presentation evaluation system or as a presentation ability estimation system.

A presentation to be evaluated is one in which the presenter, who is the person being evaluated, uses visual and auditory means to announce or present a plan, proposal, estimate, lecture content, self-promotion, or the like to other attendees at a conference, meeting, lecture, briefing, interview, or similar occasion.

In the example of FIG. 1, multimodal features including the prosodic, linguistic, and motion features extracted from the audio and video data recorded while the person being evaluated gives the presentation are input, as values of explanatory variables, to a machine-learned model serving as the analysis model. The machine-learned model analyzes the input multimodal features according to a predetermined algorithm and outputs, as values of the objective variable, presentation evaluation values that quantitatively rate predetermined evaluation items of the presentation.

The presentation evaluation values are values for multiple evaluation items under each of four major categories: purpose coverage, content logic (the logic of the content), visual and vocal (how the presentation looks and sounds), and effective staging elements. The purpose coverage items evaluate, for example, whether three elements of the purpose are covered: "to whom", "what", and "what the audience is asked to do". The content logic items evaluate, for example, whether three elements are included: the "conclusion", its "grounds", and the "benefit to the other party". The visual and vocal items concern how the presentation looks and sounds, for example intonation, voice volume, eye contact, and gestures. The effective staging items concern effective staging, for example emphasis, repetition, concrete expression, and interactivity.

The analysis model used for presentation evaluation is, for example, a machine-learned model whose parameter values, such as the weight for each feature, are determined by machine learning in advance on multiple items of supervised learning data. Each item associates the prosodic, linguistic, and motion features extracted from audio and video data recording one of multiple presentations with the evaluation values that a human evaluator, not a machine, assigned to that presentation for each of the above evaluation items as ground-truth data.
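The idea of determining a weight for each feature from human-rated examples can be illustrated with ordinary linear regression trained by batch gradient descent. The source does not name the learning algorithm, so linear regression is swapped in here purely as the simplest model with per-feature weights; the function name, learning rate, and epoch count are assumptions.

```python
def fit_linear(X, y, lr=0.1, epochs=5000):
    """Fit weights and a bias by batch gradient descent on mean squared error.

    X is a list of feature vectors (one per presentation) and y the list of
    ground-truth evaluation values assigned by human evaluators.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        # Prediction error for every training example with the current parameters.
        errors = [b + sum(wj * xj for wj, xj in zip(w, x)) - t
                  for x, t in zip(X, y)]
        # Gradient step for the bias, then for each feature weight.
        b -= lr * 2.0 / n * sum(errors)
        for j in range(d):
            w[j] -= lr * 2.0 / n * sum(e * x[j] for e, x in zip(errors, X))
    return w, b


def predict(w, b, x):
    """Evaluation value estimated for one feature vector."""
    return b + sum(wj * xj for wj, xj in zip(w, x))
```

In practice one such model could be fitted per evaluation item, matching the option of providing a separate model for each item.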

Multiple types of machine-learned models may be created according to the individual characteristics of the presenter 10, who is the person being evaluated, or of the user 40. A machine-learned model may output the multiple evaluation items together, or a separate model may be provided for each evaluation item to improve the accuracy of the evaluation value output for that item. Multiple types of machine-learned models may also be created in advance, one for each of multiple subject groups that differ in the attributes of the presenter 10 (for example, age group, gender, or presentation experience); all of these models may be used, or one may be selected from among them. Likewise, multiple types of machine-learned models may be created in advance, one for each of multiple presentation types; all of these models may be used, or one may be selected from among them. The machine-learned model may be selected manually by the user, or based on at least one of the presenter's attributes, the presentation type, the linguistic features, the prosodic features, and the motion features. Each of the multiple types of machine-learned models may also be calibrated, and the selection made based on the calibration information. Here, "calibration" is processing that brings the predicted probabilities calculated by a trained model closer to the true probabilities. Alternatively, the analysis may be run with all of the machine-learned models, and the model with the highest accuracy selected and used for subsequent analyses.
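The last option above, running every model and keeping the most accurate one, might be sketched as follows. The accuracy measure is not specified in the source; mean absolute error against human-assigned values on a held-out sample set is assumed here, and the function names are hypothetical.

```python
def mean_abs_error(model, samples):
    """Average absolute gap between model output and the human-assigned value.

    `samples` is a list of (feature vector, ground-truth evaluation value) pairs.
    """
    return sum(abs(model(x) - t) for x, t in samples) / len(samples)


def pick_best(models, samples):
    """Return the name of the model with the lowest error on `samples`.

    `models` maps a model name to a callable taking a feature vector.
    """
    return min(models, key=lambda name: mean_abs_error(models[name], samples))
```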

The analysis model (machine-learned model) used to evaluate a presentation may also be one that does not take as input, among the multiple feature types included in the prosodic, linguistic, and motion features, those whose patterning and extraction from the audio and video data takes a predetermined time or longer, or those that are difficult to pattern (for example, the number of fillers described later). Such features include, for example, features whose patterns are so varied that extraction with existing services does not reach a level sufficient for analysis, and features for which reaching that level would require manual work and thus rule out real-time processing.

FIG. 2 is an explanatory diagram showing an example of the presentation evaluation procedure according to this embodiment. To use the presentation evaluation system, the user operates the terminal device 20 to launch the presentation evaluation application program installed on it in advance. When the user performs the operation to start data acquisition, the microphone and camera of the terminal device 20 capture the audio and video of the presentation being given by the presenter 10, who is the subject, and audio data and video data of the presentation are acquired (step S1). The acquired data may include detection data from various sensors, such as an infrared sensor or a heart rate sensor, measured on the presenter 10 during the presentation.

Next, when the presentation ends and the user performs the operation to end data acquisition, the audio data and video data of the presentation are transmitted from the terminal device 20 via a mobile communication network to the information processing device (cloud service platform) 30 built on the communication network (step S2).

The information processing device (cloud service platform) 30 transcribes the audio data received from the terminal device 20 into text data and extracts linguistic features from it, extracts prosodic features of the presentation from the audio data, and extracts, from the video data received from the terminal device 20, motion features of the presenter 10 during the presentation (step S3). The information processing device 30 then analyzes the extracted linguistic, prosodic, and motion features to estimate an analysis result including evaluation values that quantitatively rate each of the evaluation items described above, namely purpose coverage, content logic (the logic of the content), visual and vocal (how the presentation looks and sounds), and effective staging elements (step S4), and transmits the analysis result to the terminal device 20 via the mobile communication network (step S5). The prosodic features may include features of the voice of the presenter 10, the linguistic features may include features of the content of the presenter 10's utterances, and the motion features may include features of the presenter 10's gestures. The features used for the analysis may also include body temperature, heart rate, and the like extracted from the detection data of the various sensors.

Upon receiving the analysis result of the presentation from the information processing device 30, the terminal device 20 displays the analysis result on its own screen (step S6).

FIG. 3 is a block diagram showing an example of the schematic configuration of the terminal device 20 and the information processing device 30 in the system according to this embodiment. In the examples of FIG. 3 and of FIG. 2 described above, the information processing device 30 is a cloud platform built on a communication network, but it may instead be a server consisting of one or more computer devices. FIG. 3 also shows the case where the user of the terminal device 20 is the presenter, but the user of the terminal device 20 may be someone other than the presenter.

In FIG. 3, the terminal device 20 includes a data acquisition unit 201, a data confirmation unit 202, a data acquisition advice display unit (data acquisition advice output unit) 203, a data transmission unit 204, an analysis result reception unit 205, and an analysis result display unit (analysis result output unit) 206. The data processing and signal processing functions of each unit of the terminal device 20 are realized, for example, by executing a predetermined application program on a computer or processor provided in the terminal device 20.

The data acquisition unit 201 captures the presenter 10 giving the presentation with a camera to produce video data, and captures the voice of the presenter 10 with a microphone to produce audio data. The video data and audio data may be temporarily stored in memory.

The data confirmation unit 202 checks whether the quality of the video data and audio data acquired by the data acquisition unit 201 is adequate for the subsequent analysis. For the audio data, for example, the data confirmation unit 202 checks whether the sound pressure of the voice is within a specified range and whether the level of ambient noise is within a predetermined threshold. For the video data, it checks whether all body parts of the presenter 10 whose coordinates are needed to extract the motion features are included in the image, and whether the angle of the video capture direction relative to the front direction of the presenter 10 is within a predetermined angle range. Here, the predetermined angle range is a range within which the motion features can be extracted in the subsequent processing (for example, ±30 degrees). The data confirmation unit 202 may also have the presenter 10 read a specific sentence aloud and, if the speech is recognized correctly when transcribed, determine that the audio data has "analyzable quality", that is, quality adequate for the subsequent analysis. Similarly, the data confirmation unit 202 may have the presenter 10 perform a specific motion and, if specific skeleton information is recognized, determine that the video data has "analyzable quality". For example, the presenter may be asked to straighten the elbows and raise both hands directly overhead; if the raised hands are recognized, the video data may be determined to have "analyzable quality".
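The checks performed by the data confirmation unit 202 can be sketched as follows. This is only a minimal illustration, not the patented implementation: the `CaptureStats` structure, the joint names, the decibel thresholds, and the advice wording are all hypothetical. Only the ±30 degree angle range comes from the text.

```python
from dataclasses import dataclass


@dataclass
class CaptureStats:
    """Hypothetical summary of one captured clip (all field names are assumptions)."""
    sound_pressure_db: float   # measured level of the presenter's voice
    noise_db: float            # measured level of ambient noise
    visible_joints: set        # names of the joints detected in the frame
    camera_angle_deg: float    # capture direction relative to the presenter's front


# Illustrative thresholds; only the +/-30 degree range appears in the text.
REQUIRED_JOINTS = {"head", "left_wrist", "right_wrist"}
PRESSURE_RANGE_DB = (50.0, 80.0)
MAX_NOISE_DB = 40.0
MAX_ANGLE_DEG = 30.0


def check_quality(stats):
    """Return a list of advice messages; an empty list means the data is OK."""
    advice = []
    low, high = PRESSURE_RANGE_DB
    if stats.sound_pressure_db < low:
        advice.append("Please speak louder.")
    elif stats.sound_pressure_db > high:
        advice.append("Please speak more quietly.")
    if stats.noise_db > MAX_NOISE_DB:
        advice.append("Please move to a quieter place.")
    if not REQUIRED_JOINTS <= stats.visible_joints:
        advice.append("Please change position so your whole body is in the frame.")
    if abs(stats.camera_angle_deg) > MAX_ANGLE_DEG:
        advice.append("Please face the camera more directly.")
    return advice
```

An empty result corresponds to the OK branch (data is passed on for transmission), while a non-empty result corresponds to the NG branch (advice is shown to the user).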

If there is a problem with the quality of the video data or audio data (the NG, or negative, result in the figure), the data confirmation unit 202 sends that information to the data acquisition advice display unit 203. Based on the information received from the data confirmation unit 202, the data acquisition advice display unit 203 displays an advice message on the display of the terminal device 20, such as advice to speak louder or to change the position from which the presentation is given. The advice message may be output as audio in addition to or instead of being displayed.

If, on the other hand, there is no problem with the quality (the OK, or positive, result in the figure), the data confirmation unit 202 sends the video data and audio data to the data transmission unit 204. The data transmission unit 204 transmits the video data and audio data to the information processing device (cloud service platform) 30 via a communication network such as a mobile communication network, for example by means of a wireless communication device.

The data transmission unit 204 may transmit, together with the video data and audio data, information on the type of the corresponding presentation and information on the attributes of the presenter to the information processing device 30. The data transmission unit 204 may also transmit, together with the video data and audio data, a data group ID that identifies those data or a presentation ID that identifies the presentation to which those data correspond.

The analysis result reception unit 205 receives from the information processing device 30, via a communication network such as a mobile communication network and for example by means of a wireless communication device, the analysis result including the evaluation values that quantitatively evaluate the predetermined evaluation items of the presentation.

The analysis result display unit 206 displays the analysis result of the presentation received from the information processing device 30 on the display of the terminal device 20. The analysis result is presented, for example, as illustrated in FIG. 6. The analysis result may be output as audio in addition to or instead of being displayed.

The information processing device (cloud service platform) 30 includes a data reception unit 301, a feature extraction unit 302, an analysis model determination unit 303, an inference unit 304, an analysis model database (DB) 305, and an analysis result transmission unit 306. The data processing and signal processing functions of each unit of the information processing device 30 are realized, for example, by executing a predetermined program on one or more computers or processors provided in the information processing device 30.

The data reception unit 301 receives the video data and audio data from the terminal device 20 via a communication network such as a mobile communication network, for example by means of a wireless communication device. The data reception unit 301 may also receive from the terminal device 20, together with the video data and audio data, information on the type of the corresponding presentation and information on the attributes of the presenter.

The feature extraction unit 302 extracts, from the video data and audio data received from the terminal device 20, the various features that are input as explanatory variables to the analysis model (machine-learned model) described later. For example, the feature extraction unit 302 illustrated in FIG. 4 includes a speech analysis unit 321, a language analysis unit 322, and a motion analysis unit 323. The transcription unit 3211 of the speech analysis unit 321 performs transcription processing that converts the audio data received from the terminal device 20 into text data by speech recognition.

The language analysis unit 322 extracts the linguistic features of the presentation listed in Table 1 from the text data obtained by the transcription unit 3211. The fillers in Table 1 are words or sounds used to fill the gaps between words, such as "えー" ("uh"), "あのー" ("um"), "はいっ" ("yes"), and "えーっと" ("er"). The verb repetition count is the maximum number of times the same verb is repeated in the presentation, and the noun repetition count is the maximum number of times the same noun is repeated in the presentation.
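As a rough illustration of the counts described above, the following sketch computes filler, noun, and verb counts and the maximum repetition counts from a tokenized transcript. It assumes that an upstream morphological analyzer (not shown, and not specified in the text) has already produced `(surface, part_of_speech)` pairs; the tag names and the `language_features` function are hypothetical.

```python
from collections import Counter

# Filler examples taken from the description; a real list would be longer.
FILLERS = {"えー", "あのー", "はいっ", "えーっと"}


def language_features(tokens):
    """Compute simple linguistic features.

    tokens: list of (surface, pos) pairs, where pos is an assumed tag such as
    "noun", "verb", or "filler" produced by some morphological analyzer.
    """
    filler_count = sum(1 for surface, _ in tokens if surface in FILLERS)
    nouns = Counter(surface for surface, pos in tokens if pos == "noun")
    verbs = Counter(surface for surface, pos in tokens if pos == "verb")
    return {
        "filler_count": filler_count,
        "noun_count": sum(nouns.values()),
        "verb_count": sum(verbs.values()),
        # Maximum number of occurrences of any single noun / verb.
        "max_noun_repetition": max(nouns.values(), default=0),
        "max_verb_repetition": max(verbs.values(), default=0),
    }
```

Here the repetition count is taken to be the occurrence count of the most frequent word; the text does not say whether the first occurrence is included, so that choice is an assumption.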

The prosody analysis unit 3212 of the speech analysis unit 321 illustrated in FIG. 4 extracts the prosodic features of the presentation listed in Table 2 from the audio data. Pitch in Table 2 is the height (frequency) of the voice. Intensity is the physical strength of the voice, for example the acoustic power transmitted through a unit area [W/m^2]. Total speech length is the sum of the speech time over the entire presentation, and total speech length (1 second or longer) is the sum of the speech periods that last 1 second or longer. Total silence length is the sum of the silent time over the entire presentation, and total silence length (1 second or longer) is the sum of the silent periods that last 1 second or longer. Speech ratio is the ratio of the total speech time to the duration of the entire presentation, and speech ratio (1 second or longer) is the ratio of the total of the speech periods lasting 1 second or longer to the duration of the entire presentation.
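The length and ratio features defined above can be sketched as follows, assuming that a voice activity detector (not described in the text) has already produced a sorted list of non-overlapping speech segments. The function name `timing_features` and the dictionary keys are hypothetical.

```python
def timing_features(speech_segments, total_duration, min_len=1.0):
    """Compute speech/silence length features.

    speech_segments: sorted, non-overlapping (start, end) pairs in seconds.
    total_duration:  duration of the whole presentation in seconds.
    min_len:         threshold for the "1 second or longer" variants.
    """
    speech = [end - start for start, end in speech_segments]

    # Silence is whatever lies before, between, and after the speech segments.
    edges = [0.0] + [t for seg in speech_segments for t in seg] + [total_duration]
    silence = [edges[i + 1] - edges[i] for i in range(0, len(edges), 2)]
    silence = [d for d in silence if d > 0]

    total_speech = sum(speech)
    long_speech = sum(d for d in speech if d >= min_len)
    return {
        "total_speech": total_speech,
        "total_speech_long": long_speech,
        "total_silence": sum(silence),
        "total_silence_long": sum(d for d in silence if d >= min_len),
        "speech_ratio": total_speech / total_duration,
        "speech_ratio_long": long_speech / total_duration,
    }
```

For instance, with segments `[(0.5, 2.5), (3.0, 3.5), (5.0, 9.0)]` in a 10-second recording, the 0.5-second utterance counts toward the total speech length but not toward the "1 second or longer" variant.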

The motion analysis unit 323 illustrated in FIG. 4 extracts the motion features listed in Table 3 by analyzing the video data. The motion amount of each body part in Table 3 is calculated, for example, as follows. For the detection points (joint points) 101 to 119 of the skeleton detection image 100 in the video of the presenter 10, as shown in FIG. 5(a), the change over time (trajectory) of the two-dimensional position coordinates (X, Z) in each video frame is calculated, for example as shown in FIG. 5(b). The motion amount is the frame-to-frame change in position coordinates, and the mean and standard deviation of the motion amount over all detection points 101 to 119 are the mean and standard deviation of the motion amount of each body part in Table 3. The mean and standard deviation of the motion amount of each body part during speech in Table 3 are the mean and standard deviation of the motion amount at the detection points 101 to 119 of the presenter's skeleton detection image 100, calculated over the periods in which the presenter is speaking.
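A minimal sketch of the motion-amount calculation described above is shown below. It assumes that a pose estimator (not specified in the text) has already produced per-frame (X, Z) coordinates for each detection point, and it takes the "change in position coordinates" to be the Euclidean displacement between consecutive frames, which is an assumption. The `motion_features` function and the optional `speaking` mask are hypothetical.

```python
import math
from statistics import mean, pstdev


def motion_features(frames, speaking=None):
    """Return (mean, standard deviation) of the per-frame motion amount.

    frames:   list of frames; each frame is a list of (x, z) coordinates,
              one per detection point, in the same order in every frame.
    speaking: optional list of booleans, one per frame; when given, only
              transitions into frames marked True are counted (during speech).
    """
    amounts = []
    for i, (prev, cur) in enumerate(zip(frames, frames[1:])):
        if speaking is not None and not speaking[i + 1]:
            continue
        for (x0, z0), (x1, z1) in zip(prev, cur):
            # Euclidean displacement of one detection point between two frames.
            amounts.append(math.hypot(x1 - x0, z1 - z0))
    if not amounts:
        return 0.0, 0.0
    # Population standard deviation; the text does not say which variant is used.
    return mean(amounts), pstdev(amounts)
```

Passing a `speaking` mask derived from the speech segments yields the "during speech" variants of the same statistics.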

As the motion features, in addition to or instead of the mean and standard deviation of the motion amount described above, the mean and standard deviation of the velocity, the acceleration, or both, of the detection points 101 to 119 of the presenter's skeleton detection image 100 may be used.

Based on the prosodic, linguistic, and motion features extracted by the feature extraction unit 302, the analysis model determination unit 303 determines the analysis model to be used to analyze those features and quantitatively evaluate the evaluation items of the presentation. For example, the analysis model determination unit 303 determines the type of the presentation to be evaluated and the type of the presenter based on the prosodic, linguistic, and motion features, selects from a plurality of pre-registered analysis models the one suited to the quantitative evaluation of the evaluation items of that presentation, and outputs an analysis model ID identifying the selected analysis model to the feature extraction unit 302. Alternatively, the analysis model determination unit 303 may select all of the pre-registered analysis models and output a plurality of analysis model IDs, one identifying each of the selected analysis models, to the feature extraction unit 302.

The selection of the analysis model may use at least one of the following, received from the terminal device 20: information on the attributes of the presenter (for example, age group, gender, and presentation experience) and information on the type of the presentation.

Based on the one or more analysis model IDs received from the feature extraction unit 302, the inference unit 304 selects, from the plurality of analysis models stored in the analysis model DB 305, the analysis model (an estimation program and the learned parameter values it uses) to be used for the quantitative evaluation of the evaluation items of the presentation.

As described above, the analysis model is a machine-learned model created by performing machine learning in advance on a plurality of supervised training data items, thereby determining the values of parameters such as the weight for each feature in the model. The machine-learned model used by the inference unit 304 may output the plurality of evaluation items together, or a separate model may be provided for each evaluation item in order to improve the accuracy of the evaluation value output for each item.

The algorithm used for the machine-learned model of this embodiment is not limited to any specific algorithm. For example, as the algorithm of a machine-learned model trained on supervised training data, SVR (support vector regression) can be used; it belongs to the class of regression methods, which learn from numerical data and predict numerical values. Instead of SVR, linear (ordinary) regression, Bayesian linear regression, random (decision) forest, boosted decision tree, fast forest quantile regression, neural network, Poisson regression, ordinal regression, ridge regression, lasso regression, or the like may be used.

The inference unit 304 inputs the linguistic, prosodic, and motion features received from the feature extraction unit 302 into the one or more analysis models selected based on the analysis model ID (machine-learned models comprising an estimation program and the learned parameter values it uses), and thereby outputs evaluation values that quantitatively evaluate the predetermined evaluation items. For example, the inference unit 304 outputs a quantitative evaluation value on a three-level scale (1 to 3) for each of the 14 evaluation items listed in Table 4.

The analysis result output by the inference unit 304 may include, as a qualitative evaluation to be fed back (FB) to the presenter, a brief comment (feedback comment) on the presentation as a whole, such as examples (1) to (4) below.
(1) In particular, "emphasis" was fully demonstrated.
(2) In particular, please be mindful of "emphasis".
(3) There is no eye contact because the gaze wanders in the air. Gestures occasionally come naturally, but there is a habit of swaying from side to side. A smile is present throughout.
(4) The voice is loud enough and easy to hear. There is occasional intonation, but there are no pauses. Gestures come naturally, but some occasionally do not match the content of the talk.

The brief comment (feedback comment) can be generated, for example as shown in Table 5, based on the evaluation value of an evaluation item estimated by the analysis model such as the SVM described above and on the coefficient of determination. Here, the coefficient of determination is a value indicating the accuracy with which the analysis model such as the SVM estimates the evaluation value, and it takes a value from -1 to +1. For example, when the absolute value of the coefficient of determination is less than 0.2, the estimation accuracy of the evaluation value can be judged to be low, and when the absolute value is between 0.2 and 1 inclusive, the estimation accuracy can be judged to be sufficiently high.

Table 5 is an example of generating a brief comment (feedback comment) based on the evaluation value and coefficient of determination for "emphasis", one of the effective presentation elements described above. For example, if the evaluation value of the evaluation item "emphasis" in Table 5 is 1 and the absolute value of the coefficient of determination is 0.2 or more, so that the estimation accuracy is judged to be sufficiently high, the brief comment 'In particular, "emphasis" was fully demonstrated.' is generated. If the evaluation value of the evaluation item "emphasis" is 0 and the absolute value of the coefficient of determination is 0.2 or more, so that the estimation accuracy is judged to be sufficiently high, the brief comment 'In particular, please be mindful of "emphasis".' is generated. If the absolute value of the coefficient of determination is less than 0.2 and the estimation accuracy is judged to be low, no brief comment is generated.

Here, the inference unit 304 may store 'In particular, "(evaluation item name)" was fully demonstrated.' and 'In particular, please be mindful of "(evaluation item name)".' as candidate brief comments, and generate a brief comment by inserting an evaluation item name such as "emphasis" or "repetition" into the quoted part.
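The comment-generation rule described above can be sketched as follows. The function name `feedback_comment` and the English template wording are illustrative assumptions. The text only specifies the behavior for evaluation values 1 and 0, so returning no comment for any other value is also an assumption.

```python
# Stored comment templates; the quoted part is filled with the item name.
POSITIVE = 'In particular, "{item}" was fully demonstrated.'
NEGATIVE = 'In particular, please be mindful of "{item}".'


def feedback_comment(item, value, r2, threshold=0.2):
    """Return a brief comment, or None when no comment should be generated.

    item:      evaluation item name, e.g. "emphasis"
    value:     estimated evaluation value (1 or 0 in the Table 5 example)
    r2:        coefficient of determination of the model that produced value
    threshold: minimum |r2| for the estimate to be considered reliable
    """
    if abs(r2) < threshold:
        return None  # estimation accuracy judged too low
    if value == 1:
        return POSITIVE.format(item=item)
    if value == 0:
        return NEGATIVE.format(item=item)
    return None  # behavior for other values is not specified in the text
```

For example, `feedback_comment("repetition", 0, 0.35)` would produce the "please be mindful" comment for the item "repetition", while the same call with a coefficient of 0.1 would produce no comment.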

As described above, the analysis model database (DB) 305 stores a plurality of types of analysis models, each assigned a different analysis model ID. Each analysis model is a machine-learned model consisting of an estimation program and the learned parameter values it uses. For example, the machine-learned models share a common input and output format, and each is an analysis model suited to the quantitative evaluation of the evaluation items of a presentation, created in advance for one of a plurality of combinations of presentation type and presenter type. The machine-learned models may also be created according to the individual characteristics of the presenter 10 or of the user 40. As described above, a machine-learned model may output the plurality of evaluation items together, or a separate model may be provided for each evaluation item in order to improve the accuracy of the evaluation value output for each item.

The analysis result transmission unit 306 transmits the analysis result including the evaluation values output from the inference unit 304 to the terminal device 20 via a communication network such as a mobile communication network. The analysis result transmission unit 306 may transmit, together with the analysis result, the data group ID or presentation ID described above that corresponds to that analysis result.

FIG. 6 is an explanatory diagram showing an example of the analysis result display screen 21 on the terminal device 20 according to this embodiment. The display screen 21 of the terminal device 20 illustrated in FIG. 6 has a quantitative evaluation display section 211 and a qualitative evaluation display section 212. The quantitative evaluation display section 211 shows the quantitative evaluation value of each of the 14 evaluation items in Table 4 as numbers and graphs. The qualitative evaluation display section 212 shows the brief comment fed back (FB) from the information processing device 30. The display screen 21 may also show the prosodic, linguistic, and motion feature data on which the analysis result including the evaluation values is based, and information on the analysis model (machine-learned model).

As described above, according to this embodiment, multimodal features of a presentation, namely its linguistic, prosodic, and motion features, are analyzed to estimate and output an analysis result including evaluation values that quantitatively evaluate predetermined evaluation items, so the presentation can be evaluated quantitatively from multiple perspectives. Moreover, the linguistic, prosodic, and motion features used for the evaluation can be extracted from audio data and video data and require no special hardware such as a gaze direction detection device, which lowers the barrier to implementation on terminal devices used by users.

According to this embodiment, using only audio data and video data of a predetermined quality to extract the linguistic, prosodic, and motion features improves the extraction accuracy of each feature. In particular, using audio data with a predetermined sound pressure and with noise at or below a predetermined level improves the extraction accuracy of the linguistic and prosodic features, and using video data that includes the body parts of the subject whose coordinates are needed and that is captured at an angle within a predetermined angle range improves the extraction accuracy of the motion features. Displaying (outputting) an advice message about acquiring the audio and video data when those data do not have the predetermined quality prompts the presenter to give a presentation that the system can evaluate, or one that is suited to such evaluation.

According to this embodiment, using an analysis model that processes input including the linguistic, prosodic, and motion features with a predetermined algorithm and outputs an analysis result including quantitative evaluation values makes it possible to estimate an analysis result whose evaluation values integrate these multimodal features.

According to this embodiment, the linguistic, prosodic, and motion features each include a plurality of types of features, which enables a more multifaceted evaluation of the presentation. In addition, using an analysis model that does not take as input those features whose pattern extraction from the audio and video data requires a predetermined time or longer enables real-time feedback of the analysis result, including the quantitative evaluation values of the presentation, from the information processing device 30 to the terminal device 20.

According to this embodiment, using an analysis model selected from a plurality of types of analysis models with mutually different algorithms to analyze the prosodic, linguistic, and motion features enables accurate evaluation of the presentation under various conditions. In particular, using an analysis model selected based on at least one of the linguistic, prosodic, and motion features enables highly accurate evaluation suited to those features. Using an analysis model selected based on at least one of the attributes of the presenter (subject) and the type of the presentation enables highly accurate evaluation suited to the presenter's attributes and the presentation type. Furthermore, using a trained model created by machine learning on supervised training data, which include linguistic, prosodic, and motion features acquired in advance for a plurality of presentations together with the correct evaluation values, makes it possible to input the linguistic, prosodic, and motion features of the presentation to be evaluated and output a quantitative evaluation of the presentation from multiple perspectives.

In particular, according to this embodiment, it is possible to output evaluation values assessed in terms of linguistic features relating to at least one of the number of fillers, number of nouns, number of verbs, interjections, verb repetition count, and noun repetition count in the full text of the presentation. It is also possible to output evaluation values assessed in terms of prosodic features relating to at least one of pitch, intensity, sound pressure, intonation, speech rate, speech length, silence length, and speech ratio in the speech of the presentation. It is further possible to output evaluation values assessed in terms of motion features relating to at least one of the motion amount of each part of the presenter's body over the entire presentation and the motion amount of each part of the presenter's body during speech.

Further, according to the present embodiment, it is possible to output a multifaceted evaluation covering the presentation's coverage of its purpose, the logicality of its content, how it looks and sounds, and effective production elements.

Further, according to the present embodiment, including in the analysis result a brief comment generated based on an evaluation value and the coefficient of determination obtained when that evaluation value was estimated makes it possible to convey an intuitive, easy-to-understand evaluation to the presenter or to the user of the terminal device 20.
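One way to picture comment selection from evaluation values and coefficients of determination is the sketch below. The selection rule shown (skip an evaluation item whose coefficient of determination falls below a cutoff, then pick the candidate whose score band contains the evaluation value) is an assumption for illustration only; the item names, score bands, cutoff, and comment texts are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CommentCandidate:
    item: str         # evaluation item the comment applies to
    min_score: float  # inclusive lower bound of the score band
    max_score: float  # exclusive upper bound of the score band
    text: str


# Hypothetical stored comment candidates.
CANDIDATES = [
    CommentCandidate("logic", 4.0, 5.01, "The argument is easy to follow."),
    CommentCandidate("logic", 0.0, 4.0, "Try stating the conclusion first."),
    CommentCandidate("delivery", 0.0, 3.0, "Try varying your pace."),
]

MIN_R2 = 0.5  # hypothetical cutoff below which an estimate is treated as unreliable


def select_comments(scores: Dict[str, float], r2: Dict[str, float]) -> List[str]:
    """Pick comments for items whose estimate is reliable enough to comment on."""
    selected = []
    for candidate in CANDIDATES:
        score = scores.get(candidate.item)
        if score is None or r2.get(candidate.item, 0.0) < MIN_R2:
            continue  # no comment when the item is missing or poorly estimated
        if candidate.min_score <= score < candidate.max_score:
            selected.append(candidate.text)
    return selected


comments = select_comments(
    scores={"logic": 4.2, "delivery": 2.0},
    r2={"logic": 0.8, "delivery": 0.3},
)
```

Here only the comment for "logic" is returned: the "delivery" score falls in a band with a stored comment, but its coefficient of determination is below the assumed cutoff.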

Further, according to the present embodiment, the simple operation of acquiring, with the terminal device 20, audio data and video data of the presentation being given by the presenter 10, who is the target person, allows an analysis result including quantitative evaluation values of the presentation from multiple viewpoints to be output to the terminal device 20.

Note that the processing steps described in this specification and the components of the system, terminal device, and information processing device that perform presentation evaluation can be implemented by various means. For example, these steps and components may be implemented in hardware, firmware, software, or a combination thereof.

For a hardware implementation, the means, such as processing units, used to realize the above steps and components in an entity (for example, a computer device, a server, a cloud service platform (cloud computer system), various wireless communication devices, a Node B, a terminal, a hard disk drive device, or an optical disk drive device) may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in this specification, computers, or a combination thereof.

For a firmware and/or software implementation, the means, such as processing units, used to realize the above components may be implemented with programs (for example, code such as procedures, functions, modules, and instructions) that perform the functions described in this specification. In general, any computer- or processor-readable medium that tangibly embodies firmware and/or software code may be used to implement the means, such as processing units, used to realize the steps and components described in this specification. For example, the firmware and/or software code may be stored in a memory, for example in a control device, and executed by a computer or processor. The memory may be implemented inside the computer or processor, or outside the processor. The firmware and/or software code may also be stored on a computer- or processor-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), electrically erasable PROM (EEPROM), flash memory, a floppy (registered trademark) disk, a compact disc (CD), a digital versatile disc (DVD), or a magnetic or optical data storage device. The code may be executed by one or more computers or processors, and may cause the computers or processors to perform certain aspects of the functionality described in this specification.

The medium may be a non-transitory recording medium. The code of the program need only be readable and executable by a computer, a processor, or another device or machine, and its format is not limited to any particular format. For example, the code of the program may be source code, object code, or binary code, or may be a mixture of two or more of these.

The description of the embodiments disclosed in this specification is provided to enable a person skilled in the art to make or use the present disclosure. Various modifications to the present disclosure will be readily apparent to those skilled in the art, and the general principles defined in this specification may be applied to other variations without departing from the spirit or scope of the present disclosure. The present disclosure is therefore not limited to the examples and designs described in this specification, and should be accorded the widest scope consistent with the principles and novel features disclosed in this specification.

10: Presenter
20: Terminal device
21: Display screen
30: Information processing device
40: User
100: Skeleton detection image
101 to 119: Detection points
201: Data acquisition unit
202: Data confirmation unit
203: Data acquisition advice display unit
204: Data transmission unit
205: Analysis result reception unit
206: Analysis result display unit
211: Quantitative evaluation display unit
212: Qualitative evaluation display unit
301: Data reception unit
302: Feature amount extraction unit
303: Analysis model determination unit
304: Inference unit
305: Analysis model DB
306: Analysis result transmission unit
321: Audio analysis unit
322: Language analysis unit
323: Motion analysis unit
3211: Transcription unit
3212: Prosody analysis unit

Claims

A system for evaluating a presentation, comprising:
a data acquisition unit that acquires audio data and video data of a target person who is giving a presentation;
a feature amount extraction unit that extracts a linguistic feature amount and a prosodic feature amount of the presentation from the audio data, and extracts a motion feature amount of the target person during the presentation from the video data;
an inference unit that analyzes the linguistic feature amount, the prosodic feature amount, and the motion feature amount to estimate an analysis result including an evaluation value quantitatively evaluated for each evaluation item relating to a plurality of evaluation contents relating to major items for evaluating the presentation; and
an analysis result output unit that outputs the analysis result,
wherein the inference unit stores a plurality of pieces of comment data as comment candidates for the presentation, selects one or more pieces of comment data from the plurality of pieces of comment data based on the evaluation values of the plurality of evaluation items and a coefficient of determination indicating the estimation accuracy of the evaluation values, and includes a comment generated using the selected comment data in the analysis result.

The system according to claim 1,
further comprising a data confirmation unit that confirms whether the audio data and the video data acquired by the data acquisition unit have a predetermined quality,
wherein audio data and video data having the predetermined quality are used to extract the linguistic feature amount, the prosodic feature amount, and the motion feature amount, and audio data and video data not having the predetermined quality are not used to extract the linguistic feature amount, the prosodic feature amount, or the motion feature amount.

The system according to claim 2,
wherein the quality to be confirmed is that the sound pressure of the audio data is within a predetermined range, that the magnitude of noise in the audio data is equal to or less than a threshold, that the images in the video data can be used for extracting the motion feature amount, and that the angle of the video capturing direction relative to the front direction of the target person is within a predetermined angle range.

The system according to claim 2 or 3,
further comprising a data acquisition advice output unit that outputs an advice message regarding acquisition of the audio data and the video data when the audio data and the video data acquired by the data acquisition unit do not have the predetermined quality.

The system according to any one of claims 1 to 4,
wherein the inference unit uses an analysis model that outputs an analysis result including the quantitative evaluation value by processing, with a predetermined algorithm, an input including the linguistic feature amount, the prosodic feature amount, and the motion feature amount.

The system according to claim 5,
wherein the linguistic feature amount, the prosodic feature amount, and the motion feature amount each include a plurality of types of feature amounts, and
the analysis model used in the inference unit does not use as the input, among the plurality of types of feature amounts, any feature amount that requires a predetermined time or more to be patterned and extracted from the audio data and the video data.

The system according to claim 5 or 6,
wherein the inference unit uses a plurality of types of analysis models with mutually different algorithms for the analysis.

The system according to claim 5 or 6,
wherein the inference unit has a plurality of types of analysis models with mutually different algorithms, and uses an analysis model selected from the plurality of types of analysis models for the analysis.

The system according to claim 8,
wherein the inference unit selects the analysis model to be used for the analysis from the plurality of types of analysis models based on at least one of the linguistic feature amount, the prosodic feature amount, and the motion feature amount.

The system according to claim 8 or 9,
wherein the inference unit selects the analysis model to be used for the analysis from the plurality of types of analysis models based on at least one of an attribute of the target person of the presentation and a type of the presentation.

The system according to any one of claims 5 to 10,
wherein the analysis model is a trained model created by machine learning using supervised learning data including the linguistic feature amount, the prosodic feature amount, and the motion feature amount acquired for a plurality of presentations, and correct-answer data of the evaluation values.

The system according to any one of claims 1 to 11,
wherein the linguistic feature amount includes a feature amount relating to at least one of the number of fillers, the number of nouns, the number of verbs, interjections, the number of verb repetitions, and the number of noun repetitions in the full text of the presentation,
the prosodic feature amount includes a feature amount relating to at least one of pitch, intensity, sound pressure, intonation, speech rate, speech length, silence length, and speech ratio in the audio of the presentation, and
the motion feature amount includes a feature amount relating to at least one of the amount of motion of each part of the target person's body over the entire presentation and the amount of motion of each part of the target person's body during speech.

The system according to any one of claims 1 to 12,
wherein the major items for evaluating the presentation include coverage of the purpose of the presentation, logicality of content, how it looks and sounds, and effective production elements.

The system according to any one of claims 1 to 13,
comprising a terminal device and an information processing device that can communicate with each other via a communication network,
wherein the terminal device includes the data acquisition unit, a data transmission unit that transmits the audio data and the video data to the information processing device, an analysis result reception unit that receives the analysis result from the information processing device, and the analysis result output unit, and
the information processing device includes a data reception unit that receives the audio data and the video data from the terminal device, the feature amount extraction unit, the inference unit, and an analysis result transmission unit that transmits the analysis result to the terminal device.

A trained model program for causing a computer or processor to function so as to evaluate a presentation,
the program being created by machine learning using supervised learning data including linguistic feature amounts, prosodic feature amounts, and motion feature amounts extracted from audio data and video data of target persons acquired for a plurality of presentations, and correct-answer data of evaluation values quantitatively evaluated for predetermined evaluation items of the presentations, and comprising:
program code for outputting, when given an input including a linguistic feature amount, a prosodic feature amount, and a motion feature amount extracted from audio data and video data of a target person acquired for a presentation to be evaluated, an analysis result including an evaluation value quantitatively evaluated for each evaluation item relating to a plurality of evaluation contents relating to major items for evaluating the presentation; and
program code for storing a plurality of pieces of comment data as comment candidates for the presentation, selecting one or more pieces of comment data from the plurality of pieces of comment data based on the evaluation values of the plurality of evaluation items and a coefficient of determination indicating the estimation accuracy of the evaluation values, and including a comment generated using the selected comment data in the analysis result.

A method of evaluating a presentation, comprising:
acquiring audio data and video data of a target person who is giving a presentation;
extracting a linguistic feature amount and a prosodic feature amount of the presentation from the audio data, and extracting a motion feature amount of the target person during the presentation from the video data;
analyzing the linguistic feature amount, the prosodic feature amount, and the motion feature amount to estimate an analysis result including an evaluation value quantitatively evaluated for each evaluation item relating to a plurality of evaluation contents relating to major items for evaluating the presentation;
outputting the analysis result; and
storing a plurality of pieces of comment data as comment candidates for the presentation, selecting one or more pieces of comment data from the plurality of pieces of comment data based on the evaluation values of the plurality of evaluation items and a coefficient of determination indicating the estimation accuracy of the evaluation values, and including a comment generated using the selected comment data in the analysis result.

A terminal device capable of communicating with an information processing device via a communication network, comprising:
a data acquisition unit that acquires audio data and video data of a target person who is giving a presentation;
a data transmission unit that transmits the audio data and the video data to the information processing device;
an analysis result reception unit that receives, from the information processing device, an analysis result including an evaluation value quantitatively evaluated, based on the audio data and the video data, for each evaluation item relating to a plurality of evaluation contents relating to major items for evaluating the presentation; and
an analysis result output unit that outputs the analysis result,
wherein the analysis result received from the information processing device includes a comment generated by the information processing device, which stores a plurality of pieces of comment data as comment candidates for the presentation, using one or more pieces of comment data that the information processing device selected from the plurality of pieces of comment data based on the evaluation values of the plurality of evaluation items and a coefficient of determination indicating the estimation accuracy of the evaluation values.

An information processing device capable of communicating with a terminal device via a communication network, comprising:
a data reception unit that receives, from the terminal device, audio data and video data of a target person who is giving a presentation;
a feature amount extraction unit that extracts a linguistic feature amount and a prosodic feature amount of the presentation from the audio data, and extracts a motion feature amount of the target person during the presentation from the video data;
an inference unit that analyzes the linguistic feature amount, the prosodic feature amount, and the motion feature amount to estimate an analysis result including an evaluation value quantitatively evaluated for each evaluation item relating to a plurality of evaluation contents relating to major items for evaluating the presentation; and
an analysis result transmission unit that transmits the analysis result to the terminal device,
wherein the inference unit stores a plurality of pieces of comment data as comment candidates for the presentation, selects one or more pieces of comment data from the plurality of pieces of comment data based on the evaluation values of the plurality of evaluation items and a coefficient of determination indicating the estimation accuracy of the evaluation values, and includes a comment generated using the selected comment data in the analysis result.

A program executed in a computer or processor provided in a terminal device capable of communicating with an information processing device via a communication network, comprising:
program code for acquiring audio data and video data of a target person who is giving a presentation;
program code for transmitting the audio data and the video data to the information processing device;
program code for receiving, from the information processing device, an analysis result including an evaluation value quantitatively evaluated, based on the audio data and the video data, for each evaluation item relating to a plurality of evaluation contents relating to major items for evaluating the presentation; and
program code for outputting the analysis result,
wherein the analysis result received from the information processing device includes a comment generated by the information processing device, which stores a plurality of pieces of comment data as comment candidates for the presentation, using one or more pieces of comment data that the information processing device selected from the plurality of pieces of comment data based on the evaluation values of the plurality of evaluation items and a coefficient of determination indicating the estimation accuracy of the evaluation values.

A program executed in a computer or processor provided in an information processing device capable of communicating with a terminal device via a communication network, comprising:
program code for receiving, from the terminal device, audio data and video data of a target person who is giving a presentation;
program code for extracting a linguistic feature amount and a prosodic feature amount of the presentation from the audio data, and extracting a motion feature amount of the target person during the presentation from the video data;
program code for analyzing the linguistic feature amount, the prosodic feature amount, and the motion feature amount to estimate an analysis result including an evaluation value quantitatively evaluated for each evaluation item relating to a plurality of evaluation contents relating to major items for evaluating the presentation;
program code for transmitting the analysis result to the terminal device; and
program code for storing a plurality of pieces of comment data as comment candidates for the presentation, selecting one or more pieces of comment data from the plurality of pieces of comment data based on the evaluation values of the plurality of evaluation items and a coefficient of determination indicating the estimation accuracy of the evaluation values, and including a comment generated using the selected comment data in the analysis result.