WO2022180855A1

WO2022180855A1 - Video session evaluation terminal, video session evaluation system, and video session evaluation program

Info

Publication number: WO2022180855A1
Application number: PCT/JP2021/007572
Authority: WO
Inventors: 渉三神谷
Original assignee: 株式会社I’mbesideyou
Priority date: 2021-02-26
Filing date: 2021-02-26
Publication date: 2022-09-01
Also published as: JPWO2022180855A1

Abstract

[Problem] To evaluate an online session by evaluating video acquired in the online session. [Solution] A system of the present disclosure comprises: an acquisition means that acquires video pertaining to an online session between a first user and a second user; a face recognition means that recognizes at least a face image of the first user included in the video for each prescribed frame; a speech recognition means that recognizes at least the speech of the second user included in the video; an evaluation means that calculates an evaluation value from multiple perspectives on the basis of at least the recognized face image; a keyword detection means that detects a prescribed keyword within at least the recognized speech; and an alert transmission means that sends a prescribed alert on the basis of the evaluation value when the keyword is detected.

Description

VIDEO SESSION EVALUATION TERMINAL, VIDEO SESSION EVALUATION SYSTEM AND VIDEO SESSION EVALUATION PROGRAM

The present disclosure relates to a video session evaluation terminal, a video session evaluation system, and a video session evaluation program.

Conventionally, there is a known technique for analyzing the emotions that others feel in response to a speaker's remarks (see, for example, Patent Document 1). There is also a known technique for analyzing changes in a subject's facial expressions in chronological order over a long period and estimating the emotions held during that period (see, for example, Patent Document 2). Techniques are also known for identifying the factors that most strongly influence changes in emotion (see, for example, Patent Documents 3 to 5). There is also a known technique that compares a subject's usual facial expression with the current facial expression and issues an alert when the expression is gloomy (see, for example, Patent Document 6), as well as a technique for determining the degree of a subject's emotion by comparing the subject's normal (expressionless) face with the current facial expression (see, for example, Patent Documents 7 to 9). Furthermore, there is a known technique for analyzing the feelings of an organization and the atmosphere within a group as felt by an individual (see, for example, Patent Documents 10 and 11).

JP 2019-58625 A
JP 2016-149063 A
JP 2020-86559 A
JP-A-2000-76421
JP 2017-201499 A
JP 2018-112831 A
JP 2011-154665 A
JP-A-2012-8949
Japanese Unexamined Patent Application Publication No. 2013-300
JP 2011-186521 A
WO15/174426

All of the technologies mentioned above are merely auxiliary functions for situations in which real-world communication is primary. In other words, they do not address the situation, brought about by the recent DX (Digital Transformation) of work and the global epidemic of infectious diseases, in which communication such as work and classes is conducted mainly online.

The purpose of the present invention is to objectively evaluate exchanged communication in order to conduct more efficient communication in situations where online communication is the main focus.

According to the invention,
Acquisition means for acquiring a moving image relating to an online session between the first user and the second user;
face recognition means for recognizing at least face images of the first user and the second user included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice;
determining means for determining the degree of matching of the second user to the first user based on the evaluation value;
A moving image analysis system is obtained.

According to the invention,
a moving image acquisition unit that acquires a plurality of moving images obtained by photographing a target person;
a biological reaction analysis unit that analyzes changes in the biological reaction of the subject based on the moving image acquired by the moving image acquisition unit;
Based on the change in the biological reaction of the subject analyzed by the biological reaction analysis unit, the emotional level of the subject is evaluated according to a standardized evaluation standard for the subject among the plurality of moving images. A moving image analysis system comprising an emotion evaluation unit is obtained.

According to the invention,
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for classifying into predetermined emotional information based on both the recognized face image and the recognized voice;
an annotation receiving means for receiving an annotation operation on the classified emotional information;
A moving image analysis system is obtained.

According to the invention,
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice;
evaluation value providing means for providing the subject with an average value for each of the evaluation values from the plurality of viewpoints over a predetermined period;
A moving image analysis system is obtained.

According to the invention,
Acquisition means for acquiring a moving image relating to an online session between the first user and the second user;
face recognition means for recognizing at least a face image of the first user included in the moving image for each predetermined frame;
speech recognition means for recognizing at least the speech of the second user included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image;
keyword detection means for detecting at least a predetermined keyword in the recognized speech;
an alert transmission means for transmitting a predetermined alert based on the evaluation value when the keyword is detected;
A moving image analysis system is obtained.

According to the invention,
Acquisition means for acquiring a plurality of moving images showing at least a target person;
voice recognition means for recognizing at least the voice of the target person included in the evaluation target moving image among the moving images;
a unique word extracting means for extracting, from among the words included in the recognized speech, words not included in the moving image other than the evaluation target moving image;
a text display means for converting the extracted word into a size according to the frequency of its utterance and displaying it as text;
A moving image analysis system is obtained.

According to the invention,
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
text conversion means for converting words included in the recognized speech into text and displaying the text;
size setting means for setting the size of the converted text to a size corresponding to the frequency of speech;
a color setting means for setting the color of the converted text to a color corresponding to the evaluation value;
comprising
A moving image analysis system is obtained.

According to the invention,
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for evaluating the degree of fatigue based on both the recognized face image and the voice,
A moving image analysis system is obtained.

According to the invention,
an acquisition means for acquiring at least a moving image;
speech recognition means for recognizing utterances for each speaker included in the moving image;
and an object generation unit that generates an object that associates the utterance with the target person and displays them in chronological order.

According to the invention,
an acquisition means for acquiring at least a moving image;
speech recognition means for recognizing utterances for each target person included in the moving image;
A moving image analysis system is obtained that includes an utterance object generation unit that plots an utterance object corresponding to the utterance in association with the target person.

According to the invention,
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
intonation acquisition means for extracting intonation information of the recognized speech;
and evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information.

According to the invention,
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
context acquisition means for acquiring context information of the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
a correction means for correcting the evaluation value using the context information;
Video image analysis system.

According to the present disclosure, analyzing and evaluating moving images of a video session makes it possible to objectively evaluate the session, in particular its content.

In particular, according to the present invention, exchanged communication can be objectively evaluated in order to conduct more efficient communication in situations where online communication is the main activity.

The drawings show, in order:
an overall system diagram according to an embodiment of the present invention;
an example of a functional block diagram of the evaluation terminal according to the embodiment;
a diagram showing functional configuration example 1 of the evaluation terminal;
a diagram showing functional configuration example 2 of the evaluation terminal;
a diagram showing functional configuration example 3 of the evaluation terminal;
a screen display example according to functional configuration example 3 of FIG. 6;
another screen display example according to functional configuration example 3 of FIG. 6;
two diagrams showing other configurations of functional configuration example 3 of the evaluation terminal;
two diagrams showing a system according to a first embodiment of the invention;
two diagrams showing a system according to a second embodiment of the invention;
a diagram showing a system according to a third embodiment of the invention;
a diagram showing a system according to a fourth embodiment of the invention;
a diagram showing a system according to a fifth embodiment of the invention;
two diagrams showing a system according to a sixth embodiment of the invention;
two diagrams showing a system according to a seventh embodiment of the invention;
a diagram showing a system according to an eighth embodiment of the invention;
a diagram showing a system according to a ninth embodiment of the invention;
two diagrams showing a system according to a tenth embodiment of the invention;
a diagram showing a system according to an eleventh embodiment of the invention.

The contents of the embodiments of the present disclosure are listed and described. The present disclosure has the following configurations.
[Item 1]
Acquisition means for acquiring a moving image relating to an online session between the first user and the second user;
face recognition means for recognizing at least face images of the first user and the second user included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice;
determining means for determining the degree of matching of the second user to the first user based on the evaluation value;
Video image analysis system.
[Item 2]
a moving image acquisition unit that acquires a plurality of moving images obtained by photographing a target person;
a biological reaction analysis unit that analyzes changes in the biological reaction of the subject based on the moving image acquired by the moving image acquisition unit;
Based on the change in the biological reaction of the subject analyzed by the biological reaction analysis unit, the emotional level of the subject is evaluated according to a standardized evaluation standard for the subject among the plurality of moving images. A moving image analysis system comprising an emotion evaluation unit.
[Item 3]
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for classifying into predetermined emotional information based on both the recognized face image and the recognized voice;
an annotation receiving means for receiving an annotation operation on the classified emotional information;
Video image analysis system.
[Item 4]
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice;
evaluation value providing means for providing the subject with an average value for each of the evaluation values from the plurality of viewpoints over a predetermined period;
Video image analysis system.
[Item 5]
Acquisition means for acquiring a moving image relating to an online session between the first user and the second user;
face recognition means for recognizing at least a face image of the first user included in the moving image for each predetermined frame;
speech recognition means for recognizing at least the speech of the second user included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image;
keyword detection means for detecting at least a predetermined keyword in the recognized speech;
an alert transmission means for transmitting a predetermined alert based on the evaluation value when the keyword is detected;
Video image analysis system.
[Item 6]
Acquisition means for acquiring a plurality of moving images showing at least a target person;
voice recognition means for recognizing at least the voice of the target person included in the evaluation target moving image among the moving images;
a unique word extracting means for extracting, from among the words included in the recognized speech, words not included in the moving image other than the evaluation target moving image;
a text display means for converting the extracted word into a size according to the frequency of its utterance and displaying it as text;
Video image analysis system.
[Item 7]
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
text conversion means for converting words included in the recognized speech into text and displaying the text;
size setting means for setting the size of the converted text to a size corresponding to the frequency of speech;
a color setting means for setting the color of the converted text to a color corresponding to the evaluation value;
comprising
Video image analysis system.
[Item 8]
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for evaluating the degree of fatigue based on both the recognized face image and the voice,
Video image analysis system.
[Item 9]
an acquisition means for acquiring at least a moving image;
speech recognition means for recognizing utterances for each speaker included in the moving image;
and an object generation unit that generates an object that associates the utterance with the target person and displays them in chronological order.
[Item 10]
an acquisition means for acquiring at least a moving image;
speech recognition means for recognizing utterances for each target person included in the moving image;
A moving image analysis system comprising: an utterance object generation unit that plots an utterance object corresponding to the utterance in association with the target person.
[Item 11]
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
intonation acquisition means for extracting intonation information of the recognized speech;
and evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information.
[Item 12]
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
context acquisition means for acquiring context information of the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
a correction means for correcting the evaluation value using the context information;
Video image analysis system.
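
As an illustration of the configuration of Item 5 above, the following is a minimal sketch of how keyword detection and alert transmission could be combined. The names `Alert` and `check_for_alert`, the substring-based keyword matching, and the rule "alert when the evaluation value is below a threshold" are all assumptions made for this example; the disclosure states only that an alert is transmitted based on the evaluation value when the keyword is detected.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Alert:
    keyword: str             # keyword found in the second user's speech
    evaluation_value: float  # first user's evaluation value at that time
    message: str


def check_for_alert(
    transcript: str,
    evaluation_value: float,
    keywords: Sequence[str],
    threshold: float = 0.4,  # hypothetical cut-off, not from the disclosure
) -> Optional[Alert]:
    """Return an Alert when a watched keyword occurs in the recognized
    speech and the evaluation value is below the (assumed) threshold."""
    text = transcript.lower()
    for keyword in keywords:
        if keyword.lower() in text and evaluation_value < threshold:
            message = (
                f"'{keyword}' detected while evaluation value "
                f"is {evaluation_value:.2f}"
            )
            return Alert(keyword, evaluation_value, message)
    return None
```

In this sketch the evaluation value is assumed to be normalized so that lower values indicate a less favorable reaction of the first user; a different scale would only change the comparison.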

Preferred embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. In the present specification and drawings, constituent elements having substantially the same functional configuration are denoted by the same reference numerals, thereby omitting redundant description.

<Basic functions>
The video session evaluation system of the present embodiment operates in an environment where a video session (hereinafter referred to as an online session, including both one-way and two-way sessions) is held by a plurality of people. It analyzes and evaluates emotions of the person to be analyzed, among the plurality of people, that are specific to that person and differ from those of the others (emotions being feelings that arise in response to one's own or others' words and actions, such as pleasantness or unpleasantness and their degree). Online sessions are, for example, online meetings, online classes, and online chats. Terminals installed at multiple locations are connected to a server via a communication network such as the Internet, and moving images are exchanged between the terminals through the server.

Moving images handled in online sessions include the face images and voices of the users of the terminals. They also include images such as materials that are shared and viewed by a plurality of users. The screen of each terminal can switch between the face image and the material image to display only one of them, or divide the display area to display both at the same time. It can also display the image of one of the users on the full screen, or display the images of some or all of the users on divided small screens.

One or more of the users participating in an online session through the terminals can be designated as analysis subjects. For example, the leader, moderator, or administrator of the online session (hereinafter collectively referred to as the organizer) designates any user as an analysis subject. Organizers of online sessions are, for example, instructors of online classes, chairpersons and facilitators of online meetings, and coaches of sessions held for coaching purposes. The organizer is typically one of the users participating in the online session, but may be another person who does not participate in it. Alternatively, all participants may be analyzed without designating a particular analysis subject.

The video session evaluation system according to the present embodiment displays at least moving images obtained from a video session established between a plurality of terminals. The displayed moving image is acquired by the terminal, and at least a face image included in the moving image is identified for each predetermined frame unit. An evaluation value for the identified face image is then calculated. The evaluation value is shared as necessary. In particular, in this embodiment, the acquired moving image is stored in the terminal, analyzed and evaluated on the terminal, and the result is provided to the user of the terminal. Therefore, for example, even a video session containing personal information or a video session containing confidential information can be analyzed and evaluated without providing the moving image itself to an external evaluation agency or the like. In addition, by providing only the evaluation result (evaluation value) to the external terminal as necessary, the result can be visualized and cross-analysis can be performed.

As shown in FIG. 1, the video session evaluation system according to the present embodiment includes user terminals 10 and 20, each having at least an input unit such as a camera unit and a microphone unit, a display unit such as a display, and an output unit such as a speaker; a video session service terminal 30 for providing an interactive video session to the user terminals 10 and 20; and an evaluation terminal 40 for performing part of the evaluation of the video session.

<Hardware configuration example>
Each functional block, functional unit, and functional module described below can be configured by any of hardware, DSP (Digital Signal Processor), and software provided in a computer, for example. For example, when configured by software, it is actually configured with a computer CPU, RAM, ROM, etc., and is realized by running a program stored in a recording medium such as RAM, ROM, hard disk, or semiconductor memory. A series of processes by the systems and terminals described herein may be implemented using software, hardware, or a combination of software and hardware. It is possible to create a computer program for realizing each function of the information sharing support device 10 according to the present embodiment and implement it in a PC or the like. It is also possible to provide a computer-readable recording medium storing such a computer program. The recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, a flash memory, or the like. Also, the above computer program may be distributed, for example, via a network without using a recording medium.

The evaluation terminal according to the present embodiment acquires a moving image from the video session service terminal, identifies at least a face image included in the moving image for each predetermined frame unit, and calculates an evaluation value for the face image (described in detail later).
<How to get videos>
As shown in FIG. 3, the video session service provided by the video session service terminal (hereinafter sometimes simply referred to as "this service") enables two-way image and voice communication between the user terminals 10 and 20. In this service, a moving image captured by the camera of the other party's user terminal is displayed on the display of the user's own terminal, and audio captured by the microphone of the other party's user terminal is output from the speaker. This service is also configured so that both or either of the user terminals can record moving images and sounds (collectively referred to as "moving images, etc.") in the storage unit of at least one of the user terminals. The recorded moving image information Vs (hereinafter referred to as “recorded information”) is cached in the user terminal that started the recording and is recorded locally on only one of the user terminals. If necessary, the user can view the recorded information alone or share it with others within the scope of using this service.

<Functional configuration example 1>
FIG. 4 is a block diagram showing a configuration example according to this embodiment. As shown in FIG. 4, the video session evaluation system of this embodiment is implemented as a functional configuration of the user terminal 10. That is, the user terminal 10 has, as its functions, a moving image acquisition unit 11, a biological reaction analysis unit 12, a peculiar determination unit 13, a related event identification unit 14, a clustering unit 15, and an analysis result notification unit 16.

The moving image acquisition unit 11 acquires from each terminal a moving image obtained by photographing a plurality of people (a plurality of users) with a camera provided in each terminal during an online session. It does not matter whether the moving image acquired from each terminal is set to be displayed on the screen of each terminal. That is, the moving image acquisition unit 11 acquires moving images from each terminal, including moving images being displayed and moving images not being displayed on each terminal.

The biological reaction analysis unit 12 analyzes changes in the biological reaction of each of a plurality of people based on the moving images (whether or not they are being displayed on the screen) acquired by the moving image acquiring unit 11. In the present embodiment, the biological reaction analysis unit 12 separates the moving image acquired by the moving image acquisition unit 11 into a set of images (collection of frame images) and voice, and analyzes changes in the biological reaction from each.

For example, the biological reaction analysis unit 12 analyzes the user's face image using frame images separated from the moving image acquired by the moving image acquisition unit 11, thereby analyzing changes in biological reactions related to at least one of facial expression, gaze, pulse, and facial movement. Further, the biological reaction analysis unit 12 analyzes the voice separated from the moving image acquired by the moving image acquisition unit 11, thereby analyzing changes in biological reactions related to at least one of the user's utterance content and voice quality.

When a person's emotions change, it manifests as a change in biological reactions such as facial expressions, eye gaze, pulse, facial movements, content of remarks, and voice quality. In this embodiment, changes in the user's emotions are analyzed through analysis of changes in the user's biological reactions. The emotion analyzed in this embodiment is, for example, the degree of comfort/discomfort. In the present embodiment, the biological reaction analysis unit 12 calculates a biological reaction index value reflecting the change in biological reaction by quantifying the change in biological reaction according to a predetermined standard.

For example, changes in facial expression are analyzed as follows. For each frame image, a face region is identified in the frame image, and the facial expression in that region is classified into one of a plurality of types according to an image analysis model machine-learned in advance. Then, based on the classification results, it is analyzed whether a positive or a negative change in facial expression occurs between consecutive frame images and how large the change is, and a facial expression change index value corresponding to the analysis result is output.
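
The following is a minimal sketch of how a facial expression change index value could be derived from per-frame classification results. It assumes the classifier has already produced one label per frame; the label set and the valence scores in `VALENCE`, as well as the function name, are hypothetical and not taken from the disclosure.

```python
from typing import Dict, List

# Hypothetical mapping from expression class to a valence score in [-1, 1].
VALENCE: Dict[str, float] = {
    "happy": 1.0,
    "surprised": 0.3,
    "neutral": 0.0,
    "sad": -0.7,
    "angry": -1.0,
}


def expression_change_indices(labels: List[str]) -> List[float]:
    """For each pair of consecutive frames, return the valence difference.

    A positive value stands for a positive expression change, a negative
    value for a negative change, and the magnitude for its extent.
    """
    scores = [VALENCE[label] for label in labels]
    return [round(after - before, 3) for before, after in zip(scores, scores[1:])]


# One label per analyzed frame, as output by the (assumed) classifier.
print(expression_change_indices(["neutral", "happy", "happy", "angry"]))
# prints [1.0, 0.0, -2.0]
```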

For example, the analysis of changes in line of sight is performed as follows. That is, for each frame image, the eye region is specified in the frame image, and the orientation of both eyes is analyzed to analyze where the user is looking. For example, it analyzes whether the user is looking at the face of the speaker being displayed, whether the user is looking at the shared material being displayed, or whether the user is looking outside the screen. Also, it may be analyzed whether the eye movement is large or small, or whether the movement is frequent or infrequent. A change in line of sight is also related to the user's degree of concentration. The biological reaction analysis unit 12 outputs a line-of-sight change index value according to the analysis result of the line-of-sight change.
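
As a minimal sketch of the gaze analysis described above, the code below classifies where the user is looking and derives a simple line-of-sight change index. It assumes that a gaze point in normalized screen coordinates has already been estimated for each frame; the `Region` type, the region names, and the definition of the index as the fraction of frame transitions in which the gaze target changes are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Region:
    """Axis-aligned screen area in normalized coordinates (0..1)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def classify_gaze(x: float, y: float, regions: Sequence[Region]) -> str:
    """Return the name of the first region containing the gaze point,
    or "off_screen" when the point lies in none of them."""
    for region in regions:
        if region.contains(x, y):
            return region.name
    return "off_screen"


def gaze_change_index(
    points: List[Tuple[float, float]], regions: Sequence[Region]
) -> float:
    """Fraction of consecutive frame pairs in which the gaze target changes."""
    targets = [classify_gaze(x, y, regions) for x, y in points]
    if len(targets) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(targets, targets[1:]) if a != b)
    return changes / (len(targets) - 1)
```

A higher index here corresponds to more frequent gaze shifts, which the text above relates to the user's degree of concentration.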

The analysis of pulse changes is performed, for example, as follows. That is, for each frame image, the face area is specified in the frame image. Then, using a trained image analysis model that captures numerical values of face color information (G of RGB), changes in the G color of the face surface are analyzed. By arranging the results along the time axis, a waveform representing changes in color information is formed, and the pulse is identified from this waveform. When a person is tense, the pulse speeds up, and when the person is calm, the pulse slows down. The biological reaction analysis unit 12 outputs a pulse change index value according to the analysis result of the pulse change.
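The waveform-based pulse identification described above can be illustrated with a minimal sketch. It assumes that the mean G value of the face area has already been extracted for each frame (the trained image analysis model itself is not reproduced here), and it picks the dominant frequency with a plain discrete Fourier transform. The function name `estimate_pulse_bpm`, the 0.7 to 3.0 Hz search band, and the DFT approach are illustrative assumptions, not details taken from this disclosure.

```python
import cmath
import math


def estimate_pulse_bpm(green_means, fps, min_hz=0.7, max_hz=3.0):
    """Estimate a pulse rate (beats per minute) from per-frame mean G values.

    green_means: mean green-channel value of the face area, one per frame.
    fps: frame rate of the moving image.
    min_hz / max_hz: assumed plausible pulse band (42 to 180 bpm).
    """
    n = len(green_means)
    if n < 2:
        raise ValueError("at least two frames are required")
    # Remove the constant offset so only the variation over time remains.
    mean = sum(green_means) / n
    centered = [g - mean for g in green_means]

    best_hz, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        hz = k * fps / n
        if hz < min_hz or hz > max_hz:
            continue
        # One DFT coefficient; its squared magnitude is the power at this frequency.
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        power = abs(coeff) ** 2
        if power > best_power:
            best_hz, best_power = hz, power
    if best_hz is None:
        raise ValueError("signal is too short to resolve the pulse band")
    return best_hz * 60.0


# Synthetic example: a 1.2 Hz (72 bpm) variation sampled at 30 fps for 10 seconds.
signal = [120.0 + 2.0 * math.sin(2 * math.pi * 1.2 * t / 30) for t in range(300)]
bpm = estimate_pulse_bpm(signal, fps=30)
```

A faster or slower value of `bpm` over successive windows could then feed the pulse change index value; how that mapping is done is left open here.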

For example, analysis of changes in facial movement is performed as follows. That is, for each frame image, the face area is specified in the frame image, and the direction of the face is analyzed to analyze where the user is looking. For example, it analyzes whether the user is looking at the face of the speaker being displayed, whether the user is looking at the shared material being displayed, or whether the user is looking outside the screen. Further, it may be analyzed whether the movement of the face is large or small, or whether the movement is frequent or infrequent. The movement of the face and the movement of the line of sight may be analyzed together. For example, it may be analyzed whether the face of the speaker being displayed is viewed straight, whether the face is viewed with upward or downward gaze, or whether the face is viewed obliquely. The biological reaction analysis unit 12 outputs a face orientation change index value according to the analysis result of the face orientation change.

Analysis of the contents of the statement is performed, for example, as follows. That is, the biological reaction analysis unit 12 converts the voice into a character string by performing known voice recognition processing on the voice for a specified time (for example, about 30 to 150 seconds), and morphologically analyzes the character string, thereby removing words such as particles and articles that are unnecessary for expressing the conversation. Then, it vectorizes the remaining words, analyzes whether a positive emotional change has occurred, whether a negative emotional change has occurred, and to what extent the emotional change has occurred, and outputs a statement content index value according to the analysis result.

Voice quality analysis is performed, for example, as follows. That is, the biological reaction analysis unit 12 identifies the acoustic features of the voice by performing known voice analysis processing on the voice for a specified time (for example, about 30 to 150 seconds). Then, based on the acoustic features, it analyzes whether a positive change in voice quality has occurred, whether a negative change in voice quality has occurred, and to what extent the change in voice quality has occurred, and outputs a voice quality change index value according to the analysis result.

The biological reaction analysis unit 12 calculates the biological reaction index value using at least one of the facial expression change index value, eye line change index value, pulse change index value, face direction change index value, statement content index value, and voice quality change index value calculated as described above. For example, the biological reaction index value is calculated by weighting the facial expression change index value, eye line change index value, pulse change index value, face direction change index value, statement content index value, and voice quality change index value.
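As a minimal sketch of the weighting described above, the individual index values can be combined into one biological reaction index value as a normalized weighted average. The disclosure only states that the values are weighted; the normalization, the dictionary keys, and the function name `biological_reaction_index` are assumptions made for illustration.

```python
def biological_reaction_index(index_values, weights):
    """Combine per-modality index values into one biological reaction index value.

    index_values: mapping from a modality name to its index value.
    weights: mapping from a modality name to its (hypothetical) weight.
    Modalities without a weight are ignored, so "at least one of" the
    index values can be used.
    """
    used = {name: value for name, value in index_values.items() if name in weights}
    total_weight = sum(weights[name] for name in used)
    if total_weight == 0:
        raise ValueError("no weighted index value is available")
    # Normalized weighted average (the normalization is an assumption).
    return sum(weights[name] * value for name, value in used.items()) / total_weight


values = {"expression": 0.5, "gaze": -0.5, "pulse": 1.0}
weights = {"expression": 2.0, "gaze": 1.0, "pulse": 1.0}
index = biological_reaction_index(values, weights)  # (1.0 - 0.5 + 1.0) / 4.0
```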

The peculiarity determination unit 13 determines whether or not the change in the biological reaction analyzed for the person to be analyzed is specific compared with the change in the biological reaction analyzed for persons other than the person to be analyzed. In the present embodiment, the peculiarity determination unit 13 determines, based on the biological reaction index values calculated for each of the plurality of users by the biological reaction analysis unit 12, whether the change in the biological reaction of the person to be analyzed is specific compared with that of others.

For example, the peculiarity determination unit 13 calculates the variance of the biological reaction index values calculated for each of the plurality of persons by the biological reaction analysis unit 12, and compares the biological reaction index value calculated for the person to be analyzed with the variance, thereby determining whether or not the change in the biological reaction analyzed for the person to be analyzed is specific compared with others.
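One way to read the variance comparison described above is sketched below: the index value of the person to be analyzed is treated as specific when it deviates from the mean of the other persons by more than a multiple of their standard deviation. The factor `k=2.0`, the use of the other persons only (excluding the person to be analyzed) as the reference group, and the function name `is_peculiar` are assumptions, not values stated in this disclosure.

```python
import math
import statistics


def is_peculiar(subject_value, other_values, k=2.0):
    """Return True if the subject's index value stands out from the others.

    subject_value: biological reaction index value of the person to be analyzed.
    other_values: index values of the other persons in the same session.
    k: assumed threshold, in standard deviations of the other persons' values.
    """
    if len(other_values) < 2:
        raise ValueError("at least two other persons are required")
    mean = statistics.fmean(other_values)
    std = math.sqrt(statistics.pvariance(other_values))
    if std == 0:
        # Everyone else reacted identically; any difference is treated as specific.
        return subject_value != mean
    return abs(subject_value - mean) > k * std


others = [0.10, 0.20, 0.15, 0.05, 0.10]
large_reaction = is_peculiar(0.90, others)
typical_reaction = is_peculiar(0.12, others)
```

This scalar comparison covers the first two patterns (only one side shows a large change); the third pattern, where both sides change but the content differs, would need a comparison of the kind of change rather than its magnitude.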

The following three patterns are conceivable as cases where the changes in biological reactions analyzed for the subject of analysis are more specific than those of others. The first is a case where a relatively large change in biological reaction occurs in the subject of analysis, although no particularly large change in biological reaction has occurred in the other person. The second is a case where a particularly large change in biological reaction has not occurred in the subject of analysis, but a relatively large change in biological reaction has occurred in the other person. The third is a case where a relatively large change in biological reaction occurs in both the subject of analysis and the other person, but the content of the change differs between the subject of analysis and the other person.

The related event identification unit 14 identifies an event occurring in relation to at least one of the person to be analyzed, the other person, and the environment when the change in the biological reaction determined to be specific by the peculiarity determination unit 13 occurs. For example, the related event identification unit 14 identifies from the moving image the speech and behavior of the person to be analyzed when a specific change in biological reaction occurs in the person to be analyzed. In addition, the related event identification unit 14 identifies, from the moving image, the speech and behavior of the other person when a specific change in the biological reaction of the person to be analyzed occurs. In addition, the related event identification unit 14 identifies from the moving image the environment in which a specific change in the biological reaction of the person to be analyzed occurs. The environment is, for example, the shared material being displayed on the screen, the background image of the person to be analyzed, and the like.

The clustering unit 15 analyzes the degree of correlation between the change in the biological reaction determined to be specific by the peculiarity determination unit 13 (for example, one or a combination of eye gaze, pulse, facial movement, statement content, and voice quality) and the event that occurs when the specific change in biological reaction occurs (the event identified by the related event identification unit 14), and, if the correlation is determined to be at or above a certain level, clusters the person to be analyzed or the event based on the correlation analysis results.

For example, if a specific change in biological reaction corresponds to a negative emotional change, and the event occurring when the specific change in biological reaction occurs is also a negative event, a correlation at or above a certain level is detected. The clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented classifications according to the content of the event, the degree of negativity, the magnitude of the correlation, and the like.

Similarly, if a specific change in biological reaction corresponds to a positive emotional change and the event occurring when the specific change in biological reaction occurs is also a positive event, a correlation at or above a certain level is detected. The clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented classifications according to the content of the event, the degree of positivity, the magnitude of the correlation, and the like.

The analysis result notification unit 16 notifies the designator of the person to be analyzed (the person to be analyzed or the organizer of the online session) of at least one of the change in the biological reaction determined to be specific by the peculiarity determination unit 13, the event identified by the related event identification unit 14, and the classification clustered by the clustering unit 15.

For example, when a specific change in biological reaction that differs from that of others (one of the three patterns described above; the same applies hereinafter) occurs in the person to be analyzed, the analysis result notification unit 16 notifies the person to be analyzed of his or her own speech and behavior at that time. This allows the person to be analyzed to understand that he or she has a different feeling from others when performing a certain behavior. At this time, the person to be analyzed may also be notified of the specific change in biological reaction identified for the person to be analyzed. Furthermore, the person to be analyzed may be further notified of the change in the biological reaction of the other person used for comparison.

For example, when the emotion that others received from the words and deeds of the person to be analyzed (whether performed without particular awareness of his or her usual emotions, or performed consciously with a certain emotion) differs from the emotion that the person to be analyzed held at that time, the person to be analyzed is notified of his or her speech and behavior at that time. As a result, it is possible to discover behaviors that are well received by others or behaviors that are not well received by others, contrary to one's own awareness.

In addition, the analysis result notification unit 16 notifies the organizer of the online session of the event occurring when the person to be analyzed undergoes a specific change in biological reaction that is different from that of others, together with that specific change in biological reaction. As a result, the organizer of the online session can know what kind of event affects what kind of emotional change as a phenomenon specific to the designated person to be analyzed. It then becomes possible to treat the person to be analyzed appropriately according to what has been grasped.

In addition, the analysis result notification unit 16 notifies the organizer of the online session of the event occurring when a specific change in biological reaction that is different from that of others occurs in the person to be analyzed, or of the clustering result of the person to be analyzed. As a result, the organizer of the online session can grasp behavioral tendencies peculiar to the person to be analyzed and predict possible future behaviors and situations, depending on which classification the designated person to be analyzed has been clustered into. It then becomes possible to take appropriate measures for the person to be analyzed.

In the above embodiment, an example has been described in which the biological reaction index value is calculated by quantifying the change in biological reaction according to a predetermined standard, and whether the change in the biological reaction analyzed for the person to be analyzed is specific compared with others is determined based on the biological reaction index values calculated for each of the plurality of people. However, the present invention is not limited to this example. For example, it may be as follows.

That is, the biological reaction analysis unit 12 analyzes the movement of the line of sight for each of a plurality of people and generates a heat map indicating the direction of the line of sight. The peculiarity determination unit 13 compares the heat map generated for the person to be analyzed by the biological reaction analysis unit 12 with the heat map generated for the other person, thereby determining whether the change in the biological reaction analyzed for the person to be analyzed is specific compared with the change in the biological reaction analyzed for the other person.

Thus, in the present embodiment, moving images of a video session are stored in the local storage of the user terminal 10, and the above analysis is performed on the user terminal 10. Although it may depend on the machine specs of the user terminal 10, it is possible to analyze the moving image information without providing it to the outside.

<Functional configuration example 2>
As shown in FIG. 5, the video session evaluation system of this embodiment may include a moving image acquisition unit 11, a biological reaction analysis unit 12, and a reaction information presentation unit 13a as functional configurations.

The reaction information presentation unit 13a presents information indicating changes in biological reactions analyzed by the biological reaction analysis unit 12a, including participants not displayed on the screen. For example, the reaction information presenting unit 13a presents information indicating changes in biological reactions to an online session leader, moderator, or administrator (hereinafter collectively referred to as the organizer). Hosts of online sessions are, for example, instructors of online classes, chairpersons and facilitators of online meetings, coaches of sessions for coaching purposes, and the like. An online session host is typically one of the users participating in the online session, but may be another person who does not participate in the online session.

By doing so, the organizer of the online session can also grasp the state of the participants who are not displayed on the screen in an environment where the online session is held by multiple people.

<Functional configuration example 3>
FIG. 6 is a block diagram showing a configuration example according to this embodiment. As shown in FIG. 6, in the video session evaluation system of the present embodiment, functions similar to those of the above-described first embodiment are given the same reference numerals, and explanations thereof may be omitted.

The system according to this embodiment includes a camera unit that acquires images of a video session, a microphone unit that acquires audio, an analysis unit that analyzes and evaluates moving images, an object generation unit that generates a display object (described below) based on information obtained by evaluating the acquired moving images, and a display unit that displays both the moving image of the video session and the display object during execution of the video session.

The analysis unit includes the moving image acquisition unit 11, the biological reaction analysis unit 12, the peculiar determination unit 13, the related event identification unit 14, the clustering unit 15, and the analysis result notification unit 16, as described above. The function of each element is as described above.

As shown in FIG. 7, the object generation unit generates an object 50 representing the recognized face part and information 100 indicating the content of the analysis/evaluation described above, and superimposes them on the moving image for display. When the faces of a plurality of persons appear in the moving image, the object 50 may identify and display all of those faces.

In addition, for example, when the camera function of the video session is stopped at the other party's terminal (that is, stopped by software within the video session application rather than by physically covering the camera), if the other party's face is recognized by the other party's camera, the object 50 or the object 100 may be displayed at the position of the other party's face. This makes it possible for both parties to confirm that the other party is in front of the terminal even if the camera function is turned off. In this case, for example, in a video session application, the information obtained from the camera may be hidden while only the object 50 or object 100 corresponding to the face recognized by the analysis unit is displayed. Also, the video information acquired from the video session and the information recognized by the analysis unit may be divided into different display layers, and the layer relating to the former information may be hidden.

When there are multiple moving image display areas, the objects 50 and 100 may be displayed in all areas or only in some areas. For example, as shown in FIG. 8, they may be displayed only on the moving image on the guest side.

Although the preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, the technical scope of the present disclosure is not limited to such examples. It is obvious that a person having ordinary knowledge in the technical field of the present disclosure can conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these also naturally fall within the technical scope of the present disclosure.

The device described in this specification may be realized as a single device, or may be realized by a plurality of devices (for example, cloud servers) or the like, all or part of which are connected via a network. For example, the control unit 110 and the storage 130 of each terminal 10 may be realized by different servers connected to each other via a network.

That is, the system includes user terminals 10, 20, a video session service terminal 30 for providing an interactive video session to the user terminals 10, 20, and an evaluation terminal 40 for evaluating the video session, and the following combinations of configurations are conceivable.
(1) Processing everything only on the user terminal. As shown in FIG. 9, by performing the processing by the analysis unit on the terminal that is performing the video session (although a certain processing capacity is required), analysis/evaluation results can be obtained at the same time as the video session is being performed (in real time).
(2) Processing by the user terminal and the evaluation terminal. As shown in FIG. 10, an analysis unit may be provided in an evaluation terminal connected via a network or the like. In this case, the moving images acquired by the user terminal are shared with the evaluation terminal at the same time as or after the video session, and are analyzed and evaluated by the analysis unit in the evaluation terminal. The result is then shared with the user terminal together with or separately from the moving image data (that is, information including at least the analysis data is shared) and displayed on the display unit.

The systems according to the first to eleventh embodiments shown in FIGS. 10 to 25 are realized by using the functional configuration examples 1 to 3 described above and their combinations.

<First embodiment>
A first embodiment of the present invention will be described with reference to FIGS. 10 and 11. In outline, the system according to this embodiment evaluates the degree of matching between people. For example, it analyzes the reaction of the other party and evaluates it in terms of peculiarity relative to each other (expressions that do not usually appear), peculiarity relative to one's own past (expressions that did not appear in the past even with the same partner), and comparison with the neutral, normal state. In particular, this matching is effective for online sessions conducted by lecturers and students. Matching various types of lecturers with various types of students is also important for continuing the course.

As shown in FIG. 10, the system according to the present embodiment includes a type determination unit that determines each person's type based on the evaluation result by the analysis unit described above, and a matching determination unit that determines the degree of matching.

The type determination unit determines (estimates) the type of each user by referring to a type database (type DB) in which evaluation results and types are associated in advance. The matching determination unit refers to a matching database (matching DB) in which the degree of matching for each type is defined in advance, and quantifies the degree of matching using the previously defined degree of compatibility between the types determined above. As an example of constructing the matching DB, the degree of matching between a lecturer who is good at drawing out conversation and a student who is not good at speaking may be defined in advance.

After obtaining the evaluation result, the degree of matching may be determined using a learning model that has learned teacher data including the conversation between the two parties and the evaluation of the conversation between the two parties. In this case, it is possible to give feedback (whether or not the lecturer is suitable for the person, etc.) on the results of the lectures between the persons who are actually matched.

Furthermore, as shown in FIG. 11, an online session for type determination may be conducted before the course starts to determine the type of the student in advance. The system acquires a moving image of an online session performed for type determination (step S1101), and determines the type (step S1102). Subsequently, the determined student type and instructor type are provisionally matched (step S1103). As a request from the student side, the conditions required of the lecturer, such as "I like a gentle teacher" or "I like a teacher with a good tempo", may be obtained in advance through a questionnaire, and the desired type may be specified from the results of the questionnaire. The system acquires such information as condition information (step S1104). The system takes the condition information into account, corrects the primary matching degree (step S1105), and provides the corrected matching degree. As a result, even if the system determines that a strict instructor is suitable as a matching partner, if the student has the condition "prefers a gentle personality", the matching degree of each instructor calculated as the primary matching degree is corrected, and a "strict but gentle" instructor is selected as the optimum instructor.
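The flow of steps S1103 to S1105 can be sketched as follows, under stated assumptions. The matching DB is modeled as a dictionary keyed by (student type, instructor type), and the correction is a fixed bonus for each requested condition that an instructor satisfies. The bonus value, the trait labels, and the function name `corrected_match_degrees` are hypothetical; the disclosure does not specify how the correction is computed.

```python
def corrected_match_degrees(student_type, instructors, matching_db, conditions, bonus=0.2):
    """Return the corrected matching degree for each instructor.

    student_type: type determined for the student (step S1102).
    instructors: mapping name -> {"type": str, "traits": set of str}.
    matching_db: mapping (student_type, instructor_type) -> primary matching degree.
    conditions: set of traits requested by the student (condition information, S1104).
    bonus: assumed amount added per satisfied condition (correction, S1105).
    """
    result = {}
    for name, info in instructors.items():
        # Primary (provisional) matching degree looked up from the matching DB (S1103).
        primary = matching_db.get((student_type, info["type"]), 0.0)
        satisfied = len(conditions & info["traits"])
        result[name] = primary + bonus * satisfied
    return result


matching_db = {("quiet", "strict"): 0.8, ("quiet", "gentle"): 0.5}
instructors = {
    "instructor_1": {"type": "strict", "traits": set()},
    "instructor_2": {"type": "strict", "traits": {"gentle"}},
    "instructor_3": {"type": "gentle", "traits": {"gentle"}},
}
degrees = corrected_match_degrees("quiet", instructors, matching_db, {"gentle"})
best = max(degrees, key=degrees.get)
```

With these made-up numbers, the strict type still scores highest in the primary matching, and the condition information tips the choice toward the strict instructor who is also gentle.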

<Second Embodiment>
A second embodiment of the present invention will be described with reference to FIGS. 12 and 13. In outline, the system according to the present embodiment evaluates the emotion (evaluation value) obtained from the user accurately by taking into account a comparison with the user's base emotion, that is, whether the person tends to express that emotion in the first place (for example, a person who normally smiles a lot tends to have a high Happy score), and by evaluating the magnitude of displacement when the emotion is expressed (the degree of emotional expression differs between people with small reactions and people with large reactions).

Graphs (a) to (c) of FIG. 12 show (a) raw data (time-series emotion score) of a certain user, (b) the data after standard deviation processing, and (c) the data after standardization processing (z-score conversion to a mean of 0 and a variance of 1).

Based on the evaluation value of (b), the system performs an evaluation that takes into account how each user's facial expression differs from that user's normal facial expression. For example, it is possible to solve the problem that a user who normally smiles often inevitably has a high smile score (Happy score). In addition, based on the evaluation of (c), the system performs an evaluation that considers the richness of emotional expression for each user. For example, it is possible to mitigate the problem caused by the difference between a person who laughs softly and a person who laughs loudly.

A more detailed description will be given based on schematic diagrams. (a) and (b) of FIG. 13 are graphs of the Happy scores (expressing happiness levels) of user A and user B, respectively. Comparing the average scores of user A and user B, it can be seen that user A has a higher average score. In other words, user A smiles more than user B, and user A's Happy score inevitably tends to be high. In addition, comparing the range of each user's emotion (ST_A and ST_B), it can be seen that user A has a greater range of emotional expression (that is, magnitude of reaction) than user B does.

By performing the above-mentioned standard deviation processing and standardization processing, it is possible to evaluate numerical values from which individual differences such as the frequency of expression of emotions and the degree of expression of emotions have been eliminated. In addition, by causing the above-described analysis unit (for example, FIG. 3) to perform machine learning using teacher data that has been subjected to standard deviation processing and standardization processing, appropriate learning can be performed. That is, the present system can function as a system (apparatus) that generates training data by performing standard deviation processing and standardization processing on evaluation results obtained from various moving images.
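The effect of the processing described above can be illustrated with a small sketch. The translation does not make the exact definition of "standard deviation processing" clear, so this sketch assumes that step (b) removes each user's own baseline (mean) and step (c) additionally divides by that user's standard deviation (z-score, mean 0 and variance 1). The function names `center` and `standardize` and the sample scores are illustrative only.

```python
import statistics


def center(scores):
    """Assumed step (b): subtract the user's own mean so a habitual smiler is not favored."""
    mean = statistics.fmean(scores)
    return [s - mean for s in scores]


def standardize(scores):
    """Step (c): z-score conversion of one user's time series (mean 0, variance 1)."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        # A completely flat series carries no displacement information.
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]


# User A smiles more and reacts more strongly than user B, but the shape is the same.
happy_a = [2.0, 4.0, 6.0, 4.0, 2.0]
happy_b = [1.0, 2.0, 3.0, 2.0, 1.0]
z_a = standardize(happy_a)
z_b = standardize(happy_b)
```

After standardization, `z_a` and `z_b` coincide up to rounding, which is the sense in which differences in baseline and in range of expression are removed before comparison or before use as training data.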

<Third Embodiment>
A third embodiment of the present invention will be described with reference to FIG. In outline, the system according to the present embodiment allows the subject to annotate (label) the evaluation results. An online session between a lecturer (first user) and a student (second user) will be described below as an example.

The system analyzes and evaluates the moving images of the online sessions described above, and visualizes them by outputting the evaluation results of the students (second users) as graphs. The instructor can add to the graph whether the emotion was actually present and supplementary information about that time (situation, speech and behavior, action, partner's action information, etc.). As shown in the figure, for example, the user can associate the situation at that time (information about the situation such as “the class was lively” and “the other party responded well”) with the point (Lab_1) where the Happy score was high. In addition, it is possible to associate the point (Lab_2) where the Happy score was low with the situation at that time (information on the situation, such as “some assignment was given” or “something harsh was said”). This can be used to determine whether the score was low because the student was made to concentrate on some task, or whether the score was low because the student was told something harsh. In this way, by accepting the annotations of oneself (the first user) on the reaction of the other party (the second user), it becomes possible to determine whether the communication should be improved or not.

It should be noted that, as shown in the figure, annotations may be made for sections (Lab_3 and Lab_4). In this case, it is possible to make annotations such as "time spent teaching a difficult unit" and "time to summarize at the end of class".

Alternatively, a teaching data set may be generated from a set of graphs (plotted evaluation values) and annotations, and machine learning may be performed by the analysis unit.

<Fourth Embodiment>
A fourth embodiment of the present invention will be described with reference to FIG. In outline, the system according to the present embodiment generates chart information that can serve as a student's emotion chart in an online session of a lecture between a lecturer and a student, and shares the chart information with the lecturer. The contents of the chart include, for example, the average value of each emotion of the student, characteristics, habits, frequent facial expressions, the balance and degree of emotional expression shown by a radar chart, a ranking of words spoken when depressed, a ranking of words used when a smile appeared, and other information necessary to engage more closely with the psychological state of the student.

As shown in the figure, the system displays, as a list on the dashboard, the results evaluated by the analysis unit described above from multiple perspectives (neutral, happiness, surprise, discomfort, anger, sadness, fear, etc.). Displaying them together with facial expression icons that symbolize each emotion makes them easier to understand intuitively. The display may be, for example, the evaluation results of online sessions for one day, or may be on a weekly or monthly basis.

In addition to the above, the dashboard may also display a digest of the video when each emotion was expressed most strongly, or textual information on the remarks made at that time. Also, the frequency of words used (preferred phrase) may be calculated and the ranking may be displayed.

<Fifth Embodiment>
A fifth embodiment of the present invention will be described with reference to FIG. In outline, the system according to the present embodiment detects whether there is inappropriate expression in a conversation between parties (for example, remarks that exploit one's position, such as power harassment). Detection methods include rule-based approaches (identification of prohibited keywords and expression of negative emotions on the other side) and machine learning approaches.

As shown in the figure, in an online session between a boss and a subordinate, this system detects whether or not NG keywords (words inappropriate for a boss to say) are included in the words uttered by the boss user, and obtains the subordinate's evaluation value at that time. If the expression of negative emotions such as fear, anxiety, sadness, and anger rises beyond a predetermined range compared with before the word was said, the system records the boss's remark as an inappropriate remark. Inappropriate remarks may be notified, for example, to the company's human resources department, together with the digest video and the text of the remarks.
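The rule-based detection described above can be sketched as a two-stage check: first look for an NG keyword in the boss's utterance, then compare the subordinate's negative emotion scores before and after it. The substring matching, the rise threshold of 0.2, the score dictionaries, and the function name `is_inappropriate_remark` are assumptions made for illustration; the disclosure only refers to "a predetermined range".

```python
NEGATIVE_EMOTIONS = ("fear", "anxiety", "sadness", "anger")


def is_inappropriate_remark(utterance, ng_keywords, before, after, threshold=0.2):
    """Return True if the utterance should be recorded as an inappropriate remark.

    utterance: recognized text of the boss's remark.
    ng_keywords: list of prohibited keywords (hypothetical list).
    before / after: the subordinate's emotion scores before and after the remark,
        as mappings from emotion name to score.
    threshold: assumed minimum rise in any negative emotion score.
    """
    text = utterance.lower()
    if not any(keyword.lower() in text for keyword in ng_keywords):
        return False
    # The keyword alone is not enough; a negative reaction must also be observed.
    return any(
        after.get(emotion, 0.0) - before.get(emotion, 0.0) > threshold
        for emotion in NEGATIVE_EMOTIONS
    )


ng_keywords = ["useless"]
before = {"fear": 0.1, "sadness": 0.1}
after = {"fear": 0.5, "sadness": 0.2}
flagged = is_inappropriate_remark("You are useless at this.", ng_keywords, before, after)
```

Requiring both conditions keeps a keyword that appears in a harmless context (no negative reaction from the subordinate) from being recorded.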

<Sixth Embodiment>
A sixth embodiment of the present invention will be described with reference to FIGS. 17 and 18. In outline, the system according to the present embodiment generates a so-called word cloud using the utterances in a moving image. The system performs voice recognition on the acquired moving image and displays the words included in the recognized voice as text at a size corresponding to their utterance frequency.

As the words to be displayed, words that are frequently used in the evaluation target video may be extracted, or words that are not included in videos other than the evaluation target video (words unique to this video) may be extracted.

Also, the system may change the color of the text according to the user's evaluation result when the word is uttered. For example, a word spoken when the HAPPY score is high may be displayed in red, and a word spoken when the SAD score is high may be displayed in blue.
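A minimal sketch of the size and color assignment described above follows. It assumes that voice recognition has already produced a list of words, each paired with the HAPPY and SAD scores observed when it was uttered. The linear size scaling, the font size range, the tie color, and the function name `word_cloud_entries` are hypothetical choices, not details from this disclosure.

```python
from collections import defaultdict


def word_cloud_entries(utterances, min_size=12, max_size=48):
    """Map each word to a (font size, color) pair.

    utterances: list of (word, happy_score, sad_score), one per utterance of the word.
    min_size / max_size: assumed font size range; size grows linearly with frequency.
    """
    stats = defaultdict(lambda: [0, 0.0, 0.0])  # count, summed happy, summed sad
    for word, happy, sad in utterances:
        entry = stats[word]
        entry[0] += 1
        entry[1] += happy
        entry[2] += sad
    if not stats:
        return {}

    counts = [entry[0] for entry in stats.values()]
    low, high = min(counts), max(counts)
    result = {}
    for word, (count, happy_sum, sad_sum) in stats.items():
        if high == low:
            size = max_size
        else:
            size = min_size + (max_size - min_size) * (count - low) / (high - low)
        if happy_sum > sad_sum:
            color = "red"
        elif sad_sum > happy_sum:
            color = "blue"
        else:
            color = "black"
        result[word] = (round(size), color)
    return result


entries = word_cloud_entries(
    [("study", 0.8, 0.1), ("study", 0.7, 0.2), ("study", 0.9, 0.0), ("exam", 0.1, 0.7)]
)
```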

The word cloud shown in Fig. 17 displays the word "study" in the center. When the word is selected, as shown in FIG. 18, the text of the conversation when the word was uttered is displayed. In addition, a play button P is displayed together with the conversation, and when the play button P is selected, a digest of the moving image corresponding to the text is played.

<Seventh embodiment>
A seventh embodiment of the present invention will be described with reference to FIGS. 19 and 20. In outline, the system according to the present embodiment evaluates the degree of fatigue based on both the facial images and the voices included in the moving images.

The system includes a fatigue evaluation condition reading unit and a fatigue evaluation unit. In evaluating the degree of fatigue according to the present embodiment, the steady state of the user is stored, and the degree of fatigue is evaluated based on the range of emotional fluctuations in the steady state and the current range of emotional fluctuations. Note that the evaluation of the degree of fatigue is not limited to this, and may be evaluated based on the amount of change in the pitch of conversation in the steady state and the amount of change in the pitch of the current conversation.

As shown in FIG. 20, when a moving image is acquired (step S2000), a pre-learned fatigue level evaluation model is read (step S2001), the fatigue level is evaluated (step S2002), and the fatigue level is notified (step S2010). Alternatively, after the moving image is acquired, the evaluation information for the normal state is read (step S2101), each element is cross-analyzed by comparison with the evaluation result of each emotion, and the fatigue level is notified (step S2010).

Evaluating the degree of fatigue in this way makes it possible, for example, to send staged alerts to employees and others according to their degree of fatigue. For example, when the degree of fatigue exceeds a certain threshold, messages may be sequentially sent to encourage pitching.
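Staged alerts of this kind could be selected with a simple threshold table, as in the following sketch. The three thresholds, the stage labels, and the assumption that the degree of fatigue is a value between 0 and 1 are all hypothetical.

```python
# Hypothetical stages, checked from the most severe downward.
ALERT_STAGES = [
    (0.9, "stage 3 alert"),
    (0.7, "stage 2 alert"),
    (0.5, "stage 1 alert"),
]


def alert_for(fatigue):
    """Return the alert for the highest stage whose threshold is reached,
    or None when the degree of fatigue is below every threshold."""
    for threshold, message in ALERT_STAGES:
        if fatigue >= threshold:
            return message
    return None


for value in (0.3, 0.55, 0.75, 0.95):
    print(value, alert_for(value))
```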

<Eighth embodiment>
An eighth embodiment of the present invention will be described with reference to FIG. 21. In outline, the system according to the present embodiment visualizes user utterances in chronological order, taking the order of the utterances into account. This makes it possible to analyze who is likely to speak after whom.

The system acquires a moving image and recognizes the utterances of each user included in the moving image. The system includes an object generation unit that generates objects associating each utterance with its user and displays them in chronological order.

FIG. 21 is a diagram showing the objects of a conversation rally among users A to C. When an utterance is made, an utterance object P is plotted, and utterance objects P that are adjacent in time are connected by a connector C.

From such a conversation rally, for example, it can be evaluated that user C tends to speak after user B. In this case, for example, it is conceivable to increase user B's conversation in order to encourage user C to speak.
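The tendency of one user to speak after another can be quantified by counting consecutive speaker pairs in the chronological utterance sequence. The following is a minimal sketch under that assumption; the function names and the sample rally are hypothetical.

```python
from collections import Counter


def follow_counts(speakers):
    """Count (previous speaker, next speaker) pairs in a chronological
    list of speakers, ignoring consecutive utterances by the same user."""
    pairs = Counter()
    for prev, nxt in zip(speakers, speakers[1:]):
        if prev != nxt:
            pairs[(prev, nxt)] += 1
    return pairs


def most_likely_predecessor(speakers, user):
    """Return the user after whom `user` most often speaks, or None."""
    candidates = {
        prev: n
        for (prev, nxt), n in follow_counts(speakers).items()
        if nxt == user
    }
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical conversation rally among users A to C.
rally = ["A", "B", "C", "A", "B", "C", "B", "C", "A"]
print(follow_counts(rally))
print(most_likely_predecessor(rally, "C"))
```

In this sample, user C speaks after user B three times, so B is returned as the user whose utterances most often precede C's.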

The system further comprises evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the utterance. A color corresponding to the evaluation value may be given to each utterance object P of the conversation rally. For example, the appropriate improvement method differs depending on whether another user followed naturally after a user spoke, or followed after the previous user's overbearing remarks.

<Ninth Embodiment>
A ninth embodiment of the present invention will be described with reference to FIG. 22. In outline, the system according to the present embodiment includes an utterance object generation unit that recognizes the utterances of each user included in a moving image and plots an utterance object corresponding to each utterance in association with the user.

As shown in the figure, utterance objects are plotted when the six members SUZUKI, SATO, KAMIYA, NOSE, ANDO, and TADA speak. The graph makes it possible to see at a glance whether each member has spoken.

In addition, the system further comprises evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the utterance. A color corresponding to the evaluation value may be assigned to each utterance object P.

According to this configuration, it is possible to grasp at a glance who spoke the most and what the overall atmosphere of the session was.

<Tenth Embodiment>
A tenth embodiment of the present invention will be described with reference to FIGS. 23 and 24. In outline, the system according to the present embodiment includes intonation acquisition means for extracting intonation information from the speech acquired from a moving image, and evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information. The intonation acquisition means may extract changes in the pitch of speech per unit time.

FIG. 23 is a graph of the standard deviation of pitch. FIG. 24 is a graph of the standard deviation of volume. In this embodiment, the standard deviation is calculated over each predetermined number of frames of audio data at a predetermined sampling rate.
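A per-window standard deviation of this kind could be computed as in the following sketch. It assumes that a pitch (or volume) value has already been estimated for each frame; the pitch estimation itself is outside the sketch. The use of the population standard deviation, non-overlapping windows, and the name `windowed_std` are assumptions for illustration.

```python
from statistics import pstdev


def windowed_std(values, frames_per_window):
    """Population standard deviation over consecutive, non-overlapping
    windows of frames_per_window values. A trailing partial window is
    dropped."""
    last_start = len(values) - frames_per_window
    return [
        pstdev(values[i:i + frames_per_window])
        for i in range(0, last_start + 1, frames_per_window)
    ]


# Hypothetical per-frame pitch estimates in Hz: a flat stretch followed
# by a stretch with large swings.
pitch_per_frame = [100, 100, 100, 100, 100, 140, 100, 140]
stds = windowed_std(pitch_per_frame, frames_per_window=4)
print(stds)
```

The first window is completely flat and yields a standard deviation of zero, while the second window alternates between two pitch values and yields a clearly larger one.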

In general, it is said that the better the communication, the larger the standard deviation of the pitch and sound pressure of each other's utterances. Accordingly, the following tendencies are said to hold: a conversation with low standard deviation values tends to be a dark conversation; a conversation with little change in standard deviation tends to be a plain conversation; when the standard deviation decreases monotonically over time, the conversation tends to swell over time; when the standard deviation decreases over time but turns to increase at the end, the conversation tends to become lively toward its end; and when the standard deviation remains high, the speaker tends to be impatient or confused. Such voice analysis is particularly effective for moving images that do not show face images (such as when no information is acquired by a camera), or in scenes where the user looks down and the face is not shown.

In FIG. 23, it can be seen that t2 and t3 are relatively monotonous, whereas t1 and t4 show large changes. Therefore, it can be estimated that the conversation is lively at t1 and t4. Likewise in FIG. 24, which shares the time axis of FIG. 23, t2 and t3 change little and are relatively monotonous, whereas t1 and t4 show relatively pronounced variation between loud and soft. From this, it can be estimated that, in terms of both pitch and volume, the conversation is not very lively during the time periods t2 and t3, while it is lively during the time periods t1 and t4.
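The estimation described for FIGS. 23 and 24 could be expressed as a check that both the pitch and the volume standard deviations of a time period reach some level. In the sketch below, the numeric values are invented for illustration and are not read from the figures, and the thresholds and the name `lively_periods` are hypothetical.

```python
def lively_periods(pitch_std, volume_std, pitch_threshold, volume_threshold):
    """Return the time periods in which both the pitch and the volume
    standard deviations reach their thresholds."""
    return [
        period
        for period in pitch_std
        if pitch_std[period] >= pitch_threshold
        and volume_std[period] >= volume_threshold
    ]


# Invented values per time period, for illustration only.
pitch_std = {"t1": 30.0, "t2": 8.0, "t3": 6.0, "t4": 28.0}
volume_std = {"t1": 12.0, "t2": 3.0, "t3": 2.5, "t4": 11.0}

lively = lively_periods(pitch_std, volume_std,
                        pitch_threshold=20.0, volume_threshold=10.0)
print(lively)
```

With these invented values, the periods t1 and t4 are reported as lively and t2 and t3 are not.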

<Eleventh Embodiment>
An eleventh embodiment of the present invention will be described with reference to FIG. In outline, the system according to the present embodiment understands context, such as the situation or scene, and performs the evaluation accordingly. For example, even when the smile score is low, the evaluation should differ depending on whether the score is low because it was a first meeting or because the other person is a close friend.

This system includes a context reading unit that reads context information and a correction unit that corrects evaluation results according to the context information. The context information includes, for example, the situation, the number of conversations, the degree of acquaintance with the other party, and whether the conversation style is one-way or two-way. Category information classified by context, the items to be corrected, and the correction parameters may be prepared in advance.

The system may accept the context category information from the user in advance, or may determine it automatically from the titles and metadata of video files and the like. By specifying the context information of the moving image in this way and performing the correction associated with the corresponding category, it is possible to provide an appropriate evaluation result.
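A correction table keyed by context category, holding the items to be corrected and their parameters, could look like the following sketch. The category names, the multiplicative form of the correction, the factor values, and the 0 to 100 score scale are all assumptions for illustration; the disclosure does not specify how the correction parameters are applied.

```python
# Hypothetical table: category -> {item to correct: correction factor}.
CORRECTIONS = {
    "first_meeting": {"smile": 1.25},
    "close_friend": {"smile": 0.8},
}


def correct_evaluation(scores, category):
    """Apply the correction factors of the given category to the matching
    items, leaving other items unchanged and capping results at 100."""
    params = CORRECTIONS.get(category, {})
    return {
        item: min(100.0, value * params.get(item, 1.0))
        for item, value in scores.items()
    }


# Hypothetical evaluation result on a 0 to 100 scale.
raw = {"smile": 40.0, "anger": 10.0}
corrected = correct_evaluation(raw, "first_meeting")
print(corrected)
```

In this sample, a low smile score recorded at a first meeting is raised, while items without a correction parameter and unknown categories pass through unchanged.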

The processes described using the flowcharts in this specification do not necessarily have to be executed in the illustrated order. Some processing steps may be performed in parallel. Also, additional processing steps may be employed, and some processing steps may be omitted.

The embodiments described above may be combined as appropriate and implemented. Also, the effects described herein are merely illustrative or exemplary, and are not limiting. In other words, the technology according to the present disclosure can produce other effects that are obvious to those skilled in the art from the description of this specification, in addition to or instead of the above effects.

10, 20: user terminal; 30: video session service terminal; 40: evaluation terminal

Claims

A video image analysis system comprising:
acquisition means for acquiring a moving image relating to an online session between a first user and a second user;
face recognition means for recognizing at least a face image of the first user included in the moving image for each predetermined frame;
speech recognition means for recognizing at least the speech of the second user included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image;
keyword detection means for detecting at least a predetermined keyword in the recognized speech; and
alert transmission means for transmitting a predetermined alert based on the evaluation value when the keyword is detected.