WO2022180855A1 - Video session evaluation terminal, video session evaluation system, and video session evaluation program - Google Patents


Publication number
WO2022180855A1
Authority
WO
WIPO (PCT)
Prior art keywords
moving image
user
evaluation
analysis
unit
Application number
PCT/JP2021/007572
Other languages
French (fr)
Japanese (ja)
Inventor
渉三 神谷
Original Assignee
株式会社I’mbesideyou
Application filed by 株式会社I’mbesideyou
Priority to JP2023502015A (JPWO2022180855A1)
Priority to PCT/JP2021/007572 (WO2022180855A1)
Publication of WO2022180855A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems

Definitions

  • The present disclosure relates to a video session evaluation terminal, a video session evaluation system, and a video session evaluation program.
  • Conventionally, there is a known technique for analyzing the emotions that others feel in response to a speaker's remarks (see, for example, Patent Document 1). There is also a known technique for analyzing changes in a subject's facial expressions in chronological order over a long period and estimating the emotions held during that period (see, for example, Patent Document 2). Furthermore, there are known techniques for identifying the factors that most influence changes in emotion (see, for example, Patent Documents 3 to 5). There is also a known technique that compares a subject's usual facial expression with the current facial expression and issues an alert when the expression is gloomy (see, for example, Patent Document 6).
  • There is also a known technique for determining the degree of a subject's emotion by comparing the subject's normal (expressionless) facial expression with the current facial expression (see, for example, Patent Documents 7 to 9). Furthermore, there is a known technique for analyzing the feeling of an organization and the atmosphere within a group as felt by an individual (see, for example, Patent Documents 10 and 11).
  • The purpose of the present invention is to objectively evaluate exchanged communication so that communication can be conducted more efficiently in situations where online communication is the main focus.
  • Acquisition means for acquiring a moving image relating to an online session between the first user and the second user; face recognition means for recognizing at least face images of the first user and the second user included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice; determining means for determining the degree of matching of the second user to the first user based on the evaluation value; A moving image analysis system is obtained.
  • a moving image acquisition unit that acquires a plurality of moving images obtained by photographing a target person; a biological reaction analysis unit that analyzes changes in the biological reaction of the subject based on the moving images acquired by the moving image acquisition unit; and an emotion evaluation unit that, based on the changes in the biological reaction of the subject analyzed by the biological reaction analysis unit, evaluates the emotional level of the subject according to an evaluation standard standardized for the subject across the plurality of moving images; A moving image analysis system is obtained.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for classifying into predetermined emotional information based on both the recognized face image and the recognized voice; an annotation receiving means for receiving an annotation operation on the classified emotional information; A moving image analysis system is obtained.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice; evaluation value providing means for providing the subject with an average value for each of the evaluation values from the plurality of viewpoints over a predetermined period; A moving image analysis system is obtained.
  • Acquisition means for acquiring a moving image relating to an online session between the first user and the second user; face recognition means for recognizing at least a face image of the first user included in the moving image for each predetermined frame; speech recognition means for recognizing at least the speech of the second user included in the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image; keyword detection means for detecting at least a predetermined keyword in the recognized speech; an alert transmission means for transmitting a predetermined alert based on the evaluation value when the keyword is detected; A moving image analysis system is obtained.
  • Acquisition means for acquiring a plurality of moving images showing at least a target person; voice recognition means for recognizing at least the voice of the target person included in the evaluation target moving image among the moving images; a unique word extracting means for extracting, from among the words included in the recognized speech, words not included in the moving images other than the evaluation target moving image; a text display means for converting each extracted word into a size according to the frequency of its utterance and displaying it as text; A moving image analysis system is obtained.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; text conversion means for converting words included in the recognized speech into text and displaying the text; size setting means for setting the size of the converted text to a size corresponding to the frequency of speech; a color setting means for setting the color of the converted text to a color corresponding to the evaluation value; A moving image analysis system is obtained.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for evaluating the degree of fatigue based on both the recognized face image and the voice, A moving image analysis system is obtained.
  • an acquisition means for acquiring at least a moving image; speech recognition means for recognizing utterances for each speaker included in the moving image; and an object generation unit that generates an object that associates the utterance with the target person and displays them in chronological order.
  • an acquisition means for acquiring at least a moving image; speech recognition means for recognizing utterances for each target person included in the moving image; and an utterance object generation unit that plots an utterance object corresponding to the utterance in association with the target person; A moving image analysis system is obtained.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; intonation acquisition means for extracting intonation information of the recognized speech; and evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; context acquisition means for acquiring context information of the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; a correction means for correcting the evaluation value using the context information; Video image analysis system.
  • exchanged communication can be objectively evaluated in order to conduct more efficient communication in situations where online communication is the main activity.
  • FIG. 1 is an example of a functional block diagram of an evaluation terminal according to an embodiment of the present invention
  • FIG. 3 is a diagram showing functional configuration example 1 of the evaluation terminal according to the embodiment of the present invention
  • FIG. 8 is a diagram showing functional configuration example 2 of the evaluation terminal according to the embodiment of the present invention
  • FIG. 10 is a diagram showing a functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention
  • FIG. 7 is a screen display example according to the functional configuration example 3 of FIG. 6.
  • FIG. 7 is another screen display example according to the functional configuration example 3 of FIG. 6.
  • FIG. 12 is a diagram showing another configuration of functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention
  • FIG. 12 is a diagram showing another configuration of functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention
  • FIG. 1 shows a system according to a first embodiment of the invention
  • FIG. 1 shows a system according to a first embodiment of the invention
  • Fig. 3 shows a system according to a second embodiment of the invention
  • Fig. 3 shows a system according to a second embodiment of the invention
  • Fig. 3 shows a system according to a third embodiment of the invention
  • Fig. 4 shows a system according to a fourth embodiment of the invention
  • Fig. 5 shows a system according to a fifth embodiment of the invention
  • Fig. 6 shows a system according to a sixth embodiment of the invention
  • Fig. 6 shows a system according to a sixth embodiment of the invention
  • Fig. 6 shows a system according to a sixth embodiment of the invention
  • Fig. 12 shows a diagram showing another configuration of functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention
  • 1 shows a system according to
  • FIG. 7 shows a system according to a seventh embodiment of the invention
  • Fig. 7 shows a system according to a seventh embodiment of the invention
  • FIG. 11 illustrates a system according to an eighth embodiment of the invention
  • Fig. 10 shows a system according to a ninth embodiment of the invention
  • Fig. 10 shows a system according to a tenth embodiment of the invention
  • Fig. 10 shows a system according to a tenth embodiment of the invention
  • FIG. 11 illustrates a system according to an eleventh embodiment of the invention
  • the present disclosure has the following configurations.
  • Acquisition means for acquiring a moving image relating to an online session between the first user and the second user; face recognition means for recognizing at least face images of the first user and the second user included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice; determining means for determining the degree of matching of the second user to the first user based on the evaluation value; Video image analysis system.
  • a moving image acquisition unit that acquires a plurality of moving images obtained by photographing a target person; a biological reaction analysis unit that analyzes changes in the biological reaction of the subject based on the moving images acquired by the moving image acquisition unit; and an emotion evaluation unit that, based on the changes in the biological reaction of the subject analyzed by the biological reaction analysis unit, evaluates the emotional level of the subject according to an evaluation standard standardized for the subject across the plurality of moving images; Video image analysis system.
  • An acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for classifying into predetermined emotional information based on both the recognized face image and the recognized voice; an annotation receiving means for receiving an annotation operation on the classified emotional information; Video image analysis system.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice; evaluation value providing means for providing the subject with an average value for each of the evaluation values from the plurality of viewpoints over a predetermined period; Video image analysis system.
  • Acquisition means for acquiring a moving image relating to an online session between the first user and the second user; face recognition means for recognizing at least a face image of the first user included in the moving image for each predetermined frame; speech recognition means for recognizing at least the speech of the second user included in the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image; keyword detection means for detecting at least a predetermined keyword in the recognized speech; an alert transmission means for transmitting a predetermined alert based on the evaluation value when the keyword is detected; Video image analysis system.
  • Acquisition means for acquiring a plurality of moving images showing at least a target person; voice recognition means for recognizing at least the voice of the target person included in the evaluation target moving image among the moving images; a unique word extracting means for extracting, from among the words included in the recognized speech, words not included in the moving image other than the evaluation target moving image; a text display means for converting the extracted word into a size according to the frequency of its utterance and displaying it as text; Video image analysis system.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; text conversion means for converting words included in the recognized speech into text and displaying the text; size setting means for setting the size of the converted text to a size corresponding to the frequency of speech; a color setting means for setting the color of the converted text to a color corresponding to the evaluation value; Video image analysis system.
  • an acquisition means for acquiring at least a moving image; face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame; voice recognition means for recognizing at least the voice of the subject included in the moving image; context acquisition means for acquiring context information of the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; a correction means for correcting the evaluation value using the context information; Video image analysis system.
  • The video session evaluation system operates in an environment where a video session (hereinafter referred to as an online session, including both one-way and two-way sessions) is held by a plurality of people. It analyzes and evaluates emotions of the person to be analyzed that are specific compared with those of the others (emotions being feelings that arise in response to one's own or others' words and actions, such as pleasure or displeasure and their degree).
  • Online sessions are, for example, online meetings, online classes, and online chats. Terminals installed at multiple locations are connected to a server via a communication network such as the Internet, so that moving images can be exchanged among the terminals through the server.
  • Moving images handled in online sessions include facial images and voices of users using terminals.
  • Moving images also include images such as materials that are shared and viewed by a plurality of users. It is possible to switch between the face image and the document image on the screen of each terminal to display only one of them, or to divide the display area and display the face image and the document image at the same time. In addition, it is possible to display the image of one user out of a plurality of users on the full screen, or divide the images of some or all of the users into small screens and display them. It is possible to designate one or a plurality of users among a plurality of users participating in an online session using terminals as analysis subjects.
  • For example, an online session leader, moderator, or administrator (hereinafter collectively referred to as the organizer) designates any user as an analysis subject.
  • Organizers of online sessions are, for example, instructors of online classes, chairpersons and facilitators of online meetings, coaches of sessions for coaching purposes, and the like.
  • An organizer is typically one of the users participating in the online session, but may be another person who does not participate in the online session. It should be noted that all participants may be subject to analysis without specifying the person to be analyzed.
  • the video session evaluation system displays at least moving images obtained from a video session established between a plurality of terminals.
  • the displayed moving image is acquired by the terminal, and at least a face image included in the moving image is identified for each predetermined frame unit. An evaluation value for the identified face image is then calculated.
  • the evaluation value is shared as necessary.
  • the acquired moving image is stored in the terminal, analyzed and evaluated on the terminal, and the result is provided to the user of the terminal. Therefore, for example, even a video session containing personal information or a video session containing confidential information can be analyzed and evaluated without providing the moving image itself to an external evaluation agency or the like.
  • The video session evaluation system includes user terminals 10 and 20, each having at least an input unit such as a camera unit and a microphone unit, a display unit such as a display, and an output unit such as a speaker; a video session service terminal 30 for providing an interactive video session to the user terminals 10 and 20; and an evaluation terminal 40 for performing part of the evaluation of the video session.
  • Each functional block, functional unit, and functional module described below can be configured by any of hardware, DSP (Digital Signal Processor), and software provided in a computer, for example.
  • When configured by software, for example, the blocks run on a computer having a CPU, random access memory (RAM), read-only memory (ROM), and the like.
  • a series of processes by the systems and terminals described herein may be implemented using software, hardware, or a combination of software and hardware. It is possible to create a computer program for realizing each function of the information sharing support device 10 according to the present embodiment and implement it in a PC or the like. It is also possible to provide a computer-readable recording medium storing such a computer program.
  • the recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, a flash memory, or the like.
  • the above computer program may be distributed, for example, via a network without using a recording medium.
  • The evaluation terminal acquires a moving image from a video session service terminal, identifies at least a face image included in the moving image for each predetermined frame unit, and calculates an evaluation value for the face image (described in detail later).
  • This service enables two-way image and voice communication between the user terminals 10 and 20.
  • A moving image captured by the camera of the other user's terminal is displayed on the display of the user's terminal, and audio captured by the microphone of the other user's terminal can be output from the speaker.
  • This service is configured so that both or either of the user terminals can record moving images and sounds (collectively referred to as "moving images, etc.") in the storage unit of at least one of the user terminals.
  • the recorded moving image information Vs (hereinafter referred to as “recorded information”) is cached in the user terminal that started recording and is locally recorded only in one of the user terminals. If necessary, the user can view the recorded information by himself or share it with others within the scope of using this service.
  • FIG. 4 is a block diagram showing a configuration example according to this embodiment.
  • the video session evaluation system of this embodiment is implemented as a functional configuration of the user terminal 10.
  • the user terminal 10 has, as its functions, a moving image acquisition unit 11, a biological reaction analysis unit 12, a peculiar determination unit 13, a related event identification unit 14, a clustering unit 15, and an analysis result notification unit 16.
  • the moving image acquisition unit 11 acquires from each terminal a moving image obtained by photographing a plurality of people (a plurality of users) with a camera provided in each terminal during an online session. It does not matter whether the moving image acquired from each terminal is set to be displayed on the screen of each terminal. That is, the moving image acquisition unit 11 acquires moving images from each terminal, including moving images being displayed and moving images not being displayed on each terminal.
  • the biological reaction analysis unit 12 analyzes changes in the biological reaction of each of a plurality of people based on the moving images (whether or not they are being displayed on the screen) acquired by the moving image acquiring unit 11.
  • the biological reaction analysis unit 12 separates the moving image acquired by the moving image acquisition unit 11 into a set of images (collection of frame images) and voice, and analyzes changes in the biological reaction from each.
  • The biological reaction analysis unit 12 analyzes the user's face image using the frame images separated from the moving image acquired by the moving image acquisition unit 11, and thereby analyzes changes in biological reactions related to at least one of facial expression, line of sight, pulse, and facial movement. Further, the biological reaction analysis unit 12 analyzes the voice separated from the moving image acquired by the moving image acquisition unit 11 to analyze changes in biological reactions related to at least one of the user's utterance content and voice quality.
  • the biological reaction analysis unit 12 calculates a biological reaction index value reflecting the change in biological reaction by quantifying the change in biological reaction according to a predetermined standard.
  • The analysis of changes in facial expressions is performed as follows. For each frame image, a facial region is identified in the frame image, and the identified facial expression is classified into one of a plurality of types according to an image analysis model machine-learned in advance. Then, based on the classification results, the unit analyzes whether a positive facial expression change occurs between consecutive frame images, whether a negative facial expression change occurs, and how large the change is, and outputs a facial expression change index value corresponding to the analysis result.
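  • The comparison between consecutive frames can be sketched as follows. This is a minimal illustration, not the method of the disclosure: the expression classes, the `VALENCE` table, and the function name `expression_change_index` are all assumptions, and the per-frame labels are taken as already produced by some classifier.

```python
from typing import List

# Hypothetical valence assigned to each expression class. The source only says
# that expressions are classified into "a plurality of types".
VALENCE = {"happy": 1.0, "surprised": 0.5, "neutral": 0.0, "sad": -0.5, "angry": -1.0}


def expression_change_index(labels: List[str]) -> List[float]:
    """Return the signed valence change between each pair of consecutive frames.

    A positive value stands for a positive expression change, a negative value
    for a negative one, and the magnitude for the extent of the change.
    """
    scores = [VALENCE[label] for label in labels]
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


# Per-frame classifier output for four sampled frames (made-up data).
changes = expression_change_index(["neutral", "happy", "happy", "angry"])
print(changes)  # [1.0, 0.0, -2.0]
```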
  • the analysis of changes in line of sight is performed as follows. That is, for each frame image, the eye region is specified in the frame image, and the orientation of both eyes is analyzed to analyze where the user is looking. For example, it analyzes whether the user is looking at the face of the speaker being displayed, whether the user is looking at the shared material being displayed, or whether the user is looking outside the screen. Also, it may be analyzed whether the eye movement is large or small, or whether the movement is frequent or infrequent. A change in line of sight is also related to the user's degree of concentration.
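  • Deciding what the user is looking at can be sketched as a point-in-region test. The sketch assumes that an upstream step has already converted the eye orientation into a gaze point in normalized screen coordinates (0 to 1 on each axis); the `Region` type, the screen layout, and the label strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Axis-aligned rectangle in normalized screen coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def classify_gaze(x: float, y: float, speaker: Region, material: Region) -> str:
    """Label one gaze point as speaker face, shared material, or off screen."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        return "off_screen"
    if speaker.contains(x, y):
        return "speaker_face"
    if material.contains(x, y):
        return "shared_material"
    return "elsewhere_on_screen"


# Assumed layout: speaker tile on the left, shared material on the right.
speaker_tile = Region(0.0, 0.0, 0.3, 0.4)
material_pane = Region(0.35, 0.0, 1.0, 1.0)
print(classify_gaze(0.1, 0.2, speaker_tile, material_pane))
```

  • Counting how often the label changes over a window would give a rough measure of whether gaze movement is frequent or infrequent.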
  • the biological reaction analysis unit 12 outputs a line-of-sight change index value according to the analysis result of the line-of-sight change.
  • the analysis of pulse changes is performed, for example, as follows. That is, for each frame image, the face area is specified in the frame image. Then, using a trained image analysis model that captures numerical values of face color information (G of RGB), changes in the G color of the face surface are analyzed. By arranging the results along the time axis, a waveform representing changes in color information is formed, and the pulse is identified from this waveform. When a person is tense, the pulse speeds up, and when the person is calm, the pulse slows down. The biological reaction analysis unit 12 outputs a pulse change index value according to the analysis result of the pulse change.
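  • Identifying the pulse from the color waveform can be sketched as finding the dominant frequency of the per-frame green values. This is an illustration only: the source uses a trained image analysis model to obtain the G values, whereas the sketch assumes they are already available as one mean value per frame, and it swaps in a plain spectral peak search. The function name, the 42 to 180 beats-per-minute search range, and the synthetic signal are assumptions.

```python
import cmath
import math
from typing import Sequence


def estimate_pulse_bpm(green_means: Sequence[float], fps: float,
                       min_bpm: int = 42, max_bpm: int = 180) -> int:
    """Estimate the pulse as the strongest periodic component of the G waveform."""
    if not green_means:
        raise ValueError("green_means must not be empty")
    mean = sum(green_means) / len(green_means)
    centered = [g - mean for g in green_means]  # remove the constant skin tone
    best_bpm, best_power = min_bpm, -1.0
    for bpm in range(min_bpm, max_bpm + 1):
        freq_hz = bpm / 60.0
        # Project the waveform onto a complex sinusoid at the candidate frequency.
        acc = sum(v * cmath.exp(-2j * math.pi * freq_hz * n / fps)
                  for n, v in enumerate(centered))
        power = abs(acc) ** 2
        if power > best_power:
            best_bpm, best_power = bpm, power
    return best_bpm


# Synthetic 10-second waveform at 30 frames per second with a 1.2 Hz component.
signal = [128 + 2 * math.sin(2 * math.pi * 1.2 * n / 30) for n in range(300)]
print(estimate_pulse_bpm(signal, fps=30))
```

  • Comparing successive estimates over time would then show whether the pulse is speeding up or slowing down.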
  • analysis of changes in facial movement is performed as follows. That is, for each frame image, the face area is specified in the frame image, and the direction of the face is analyzed to analyze where the user is looking. For example, it analyzes whether the user is looking at the face of the speaker being displayed, whether the user is looking at the shared material being displayed, or whether the user is looking outside the screen. Further, it may be analyzed whether the movement of the face is large or small, or whether the movement is frequent or infrequent. The movement of the face and the movement of the line of sight may be analyzed together. For example, it may be analyzed whether the face of the speaker being displayed is viewed straight, whether the face is viewed with upward or downward gaze, or whether the face is viewed obliquely.
  • the biological reaction analysis unit 12 outputs a face orientation change index value according to the analysis result of the face orientation change.
  • Analysis of the contents of the statement is performed, for example, as follows. The biological reaction analysis unit 12 converts the voice into a character string by performing known voice recognition processing on the voice for a specified time (for example, about 30 to 150 seconds), and morphologically analyzes the character string to remove words such as particles and articles that are unnecessary for expressing the conversation. Then, the remaining words are vectorized and analyzed to determine whether a positive emotional change has occurred, whether a negative emotional change has occurred, and to what extent, and a statement content index value corresponding to the analysis result is output.
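  • The text-side steps can be sketched as follows, assuming the speech has already been transcribed. The sketch simplifies heavily: a regular-expression tokenizer stands in for morphological analysis, and a small hand-written polarity lexicon stands in for the vectorization and emotion analysis. `STOP_WORDS`, `POLARITY`, and `statement_content_index` are hypothetical names with made-up contents.

```python
import re

# Hypothetical function words to drop (stand-in for removing particles and articles).
STOP_WORDS = {"a", "an", "the", "of", "to", "is", "was", "and", "it", "this"}

# Hypothetical polarity lexicon (stand-in for a learned word-vector model).
POLARITY = {"great": 1.0, "helpful": 1.0, "clear": 0.5,
            "confusing": -1.0, "boring": -1.0}


def statement_content_index(transcript: str) -> float:
    """Return the mean polarity of the content words in one transcript window."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    content = [t for t in tokens if t not in STOP_WORDS]
    if not content:
        return 0.0
    return sum(POLARITY.get(t, 0.0) for t in content) / len(content)


index = statement_content_index("The explanation was great and clear")
print(index)  # 0.5
```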
  • Voice quality analysis is performed, for example, as follows. The biological reaction analysis unit 12 identifies the acoustic features of the voice by performing known voice analysis processing on the voice for a specified time (for example, about 30 to 150 seconds). Then, based on the acoustic features, it analyzes whether a positive change in voice quality has occurred, whether a negative change in voice quality has occurred, and to what extent the change has occurred, and outputs a voice quality change index value according to the analysis results.
  • The biological reaction analysis unit 12 calculates the biological reaction index value using at least one of the facial expression change index value, eye line change index value, pulse change index value, face direction change index value, statement content index value, and voice quality change index value calculated as described above.
  • the biological reaction index value is calculated by weighting the facial expression change index value, eye line change index value, pulse change index value, face direction change index value, statement content index value, and voice quality change index value.
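The weighting described above can be sketched as follows. This is only an illustration: the weight values, the dictionary keys, and the choice to normalize by the weights of the components actually supplied are assumptions, since the text states only that the component index values are weighted and that at least one of them is used.

```python
# Hypothetical weights; the text does not specify any values.
WEIGHTS = {
    "facial_expression": 0.3,
    "eye_line": 0.2,
    "pulse": 0.1,
    "face_direction": 0.1,
    "statement_content": 0.2,
    "voice_quality": 0.1,
}


def biological_reaction_index(index_values, weights=WEIGHTS):
    """Weighted average over whichever component index values are supplied."""
    used = {k: v for k, v in index_values.items() if k in weights}
    if not used:
        raise ValueError("no known index values supplied")
    # Normalize by the weights of the components present (an assumption),
    # so that a partial set of components still yields a comparable value.
    total_weight = sum(weights[k] for k in used)
    return sum(weights[k] * v for k, v in used.items()) / total_weight
```

Normalizing by the weights actually used keeps the result on the same scale whether one component or all six are available.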
  • The peculiarity determination unit 13 determines whether or not the change in the biological reaction analyzed for the person to be analyzed is specific compared with the change in the biological reaction analyzed for persons other than the person to be analyzed. In the present embodiment, the peculiarity determination unit 13 makes this determination based on the biological reaction index values calculated for each of the plurality of users by the biological reaction analysis unit 12.
  • the peculiar determination unit 13 calculates the variance of the biological reaction index values calculated for each of the plurality of persons by the biological reaction analysis unit 12, and compares the biological reaction index values calculated for the analysis subject with the variance, It is determined whether or not the change in the analyzed biological reaction of the person to be analyzed is specific compared to others.
  • the following three patterns are conceivable as cases where the changes in biological reactions analyzed for the subject of analysis are more specific than those of others.
  • the first is a case where a relatively large change in biological reaction occurs in the subject of analysis, although no particularly large change in biological reaction has occurred in the other person.
  • the second is a case where a particularly large change in biological reaction has not occurred in the subject of analysis, but a relatively large change in biological reaction has occurred in the other person.
  • the third is a case where a relatively large change in biological reaction occurs in both the subject of analysis and the other person, but the content of the change differs between the subject of analysis and the other person.
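A minimal sketch of such a determination is given below. The text says only that the subject's index value is compared with the variance of the values calculated for the plurality of persons; the use of the others' mean and standard deviation, the threshold of 2.0, and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev


def is_peculiar(subject_value, other_values, threshold=2.0):
    """Flag the subject when its index value deviates strongly from the others.

    The deviation is measured in units of the others' standard deviation;
    the threshold is a hypothetical tuning parameter.
    """
    if len(other_values) < 2:
        return False  # not enough data to estimate the spread
    mu = mean(other_values)
    sigma = pstdev(other_values)
    if sigma == 0:
        return subject_value != mu
    return abs(subject_value - mu) / sigma > threshold
```

A scalar comparison like this covers the first two patterns (a large change in only the subject, or in only the others). The third pattern, where both change but the content of the change differs, would require comparing the individual component index values rather than a single number.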
  • the related event identification unit 14 identifies an event occurring in relation to at least one of the person to be analyzed, the other person, and the environment when the change in the biological reaction determined to be peculiar by the peculiarity determination unit 13 occurs.
  • the related event identification unit 14 identifies from the moving image the speech and behavior of the person to be analyzed when a specific change in biological reaction occurs in the person to be analyzed.
  • the related event identifying unit 14 identifies, from the moving image, the speech and behavior of the other person when a specific change in the biological reaction of the person to be analyzed occurs.
  • the related event identification unit 14 identifies from the moving image the environment in which a specific change in the biological reaction of the person to be analyzed occurs.
  • the environment is, for example, the shared material being displayed on the screen, the background image of the person to be analyzed, and the like.
  • The clustering unit 15 analyzes the degree of correlation between the change in the biological reaction determined to be specific by the peculiarity determination unit 13 (for example, one or a combination of eye gaze, pulse, facial movement, statement content, and voice quality) and the event that occurred when that change occurred (the event identified by the related event identification unit 14). If the correlation is determined to be at or above a certain level, it clusters the persons to be analyzed or the events based on the correlation analysis results.
  • the clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented categories according to the content of the event, the degree of negativity, the magnitude of the correlation, and the like.
  • the clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented classifications according to the content of the event, the degree of positivity, the degree of correlation, and the like.
  • The analysis result notification unit 16 notifies the designator of the person to be analyzed (the person to be analyzed or the organizer of the online session) of at least one of the change in the biological reaction determined to be specific by the peculiarity determination unit 13, the event identified by the related event identification unit 14, and the classification produced by the clustering unit 15.
  • When a specific change in biological reaction that differs from that of others occurs in the person to be analyzed (one of the three patterns described above; the same applies hereinafter), the analysis result notification unit 16 notifies the person to be analyzed of his or her own behavior at that time. This allows the person to be analyzed to understand that he or she has a different feeling from others when performing a certain behavior. At this time, the person to be analyzed may also be notified of the specific change in biological reaction identified for him or her. Furthermore, the person to be analyzed may also be notified of the change in the biological reaction of the others used for comparison.
  • That is, when the emotion that others received from the words and deeds of the person to be analyzed, whether performed without particular awareness of his or her usual emotions or performed consciously with certain emotions, differs from the emotion held by the person to be analyzed at the time, the person to be analyzed is notified of his or her speech and behavior at that time.
  • The analysis result notification unit 16 notifies the organizer of the online session of the event that occurred when a specific change in biological reaction, different from that of others, occurred in the person to be analyzed, together with that change.
  • the organizer of the online session can know what kind of event affects what kind of emotional change as a phenomenon specific to the specified analysis subject. Then, it becomes possible to perform appropriate treatment on the person to be analyzed according to the grasped contents.
  • The analysis result notification unit 16 notifies the organizer of the online session of the event that occurred when a specific change in biological reaction, different from that of others, occurred in the person to be analyzed, or of the clustering result for the person to be analyzed.
  • Depending on which classification the designated person to be analyzed has been clustered into, the organizer of the online session can grasp behavioral tendencies peculiar to that person and predict possible future behaviors and situations. It then becomes possible to take appropriate measures for the person to be analyzed.
  • the biological reaction index value is calculated by quantifying the change in biological reaction according to a predetermined standard, and the analysis subject is analyzed based on the biological reaction index value calculated for each of the plurality of people.
  • the biological reaction analysis unit 12 analyzes the movement of the line of sight for each of a plurality of people and generates a heat map indicating the direction of the line of sight.
  • The peculiarity determination unit 13 compares the heat map generated for the person to be analyzed by the biological reaction analysis unit 12 with the heat maps generated for others, and thereby determines whether the change in biological reaction analyzed for the person to be analyzed is specific compared with the change in biological reaction analyzed for the others.
  • moving images of a video session are stored in the local storage of the user terminal 10, and the above analysis is performed on the user terminal 10.
  • Depending on the machine specs of the user terminal 10, it is possible to analyze the moving image information without providing it to the outside.
  • the video session evaluation system of this embodiment may include a moving image acquisition unit 11, a biological reaction analysis unit 12, and a reaction information presentation unit 13a as functional configurations.
  • the reaction information presentation unit 13a presents information indicating changes in biological reactions analyzed by the biological reaction analysis unit 12a, including participants not displayed on the screen.
  • the reaction information presenting unit 13a presents information indicating changes in biological reactions to an online session leader, moderator, or administrator (hereinafter collectively referred to as the organizer).
  • Hosts of online sessions are, for example, instructors of online classes, chairpersons and facilitators of online meetings, coaches of sessions for coaching purposes, and the like.
  • An online session host is typically one of the users participating in the online session, but may be another person who does not participate in the online session.
  • the organizer of the online session can also grasp the state of the participants who are not displayed on the screen in an environment where the online session is held by multiple people.
  • FIG. 6 is a block diagram showing a configuration example according to this embodiment. As shown in FIG. 6, in the video session evaluation system of the present embodiment, functions similar to those of the above-described first embodiment are given the same reference numerals, and explanations thereof may be omitted.
  • The system includes a camera unit that acquires images of a video session, a microphone unit that acquires audio, an analysis unit that analyzes and evaluates moving images, an object generator that generates a display object (described below) based on information obtained by evaluating the acquired moving images, and a display that displays both the moving image of the video session and the display object during execution of the video session.
  • the analysis unit includes the moving image acquisition unit 11, the biological reaction analysis unit 12, the peculiar determination unit 13, the related event identification unit 14, the clustering unit 15, and the analysis result notification unit 16, as described above.
  • the function of each element is as described above.
  • The object generation unit generates an object 50 representing the recognized face part and information 100 indicating the content of the analysis/evaluation described above, and displays them superimposed on the moving image.
  • When the faces of a plurality of persons appear in the moving image, the object 50 may identify and display all of those faces.
  • Even when the camera function of the video session is stopped at the other party's terminal (that is, stopped by software within the video session application rather than by physically covering the camera), the object 50 or the object 100 may be displayed where the other party's face is located, as long as the face is recognized by the other party's camera. This makes it possible for both parties to confirm that the other party is in front of the terminal even if the camera function is turned off. In this case, for example, in a video session application, the information obtained from the camera may be hidden while only the object 50 or object 100 corresponding to the face recognized by the analysis unit is displayed. Also, the video information acquired from the video session and the information recognized by the analysis unit may be divided into different display layers, and the layer relating to the former information may be hidden.
  • the objects 50 and 100 may be displayed in all areas or only in some areas. For example, as shown in FIG. 8, it may be displayed only on the moving image on the guest side.
  • the device described in this specification may be realized as a single device, or may be realized by a plurality of devices (for example, cloud servers) or the like, all or part of which are connected via a network.
  • the control unit 110 and the storage 130 of each terminal 10 may be realized by different servers connected to each other via a network.
  • the system includes user terminals 10, 20, a video session service terminal 30 for providing an interactive video session to the user terminals 10, 20, and an evaluation terminal 40 for evaluating the video session
  • Variation combinations of the following configurations are conceivable.
  • (1) Processing everything only on the user terminal. As shown in FIG. 9, when the processing by the analysis unit is performed on the terminal that is conducting the video session (which requires a certain processing capacity), analysis/evaluation results can be obtained at the same time as the video session is conducted (in real time).
  • an analysis unit may be provided in an evaluation terminal connected via a network or the like.
  • The moving images acquired by the user terminal are shared with the evaluation terminal at the same time as or after the video session and are analyzed and evaluated by the analysis unit in the evaluation terminal. The result (that is, information including at least analysis data) is shared with the user terminal, together with or separately from the moving image data, and displayed on the display unit.
  • FIGS. 10 to 25 are realized by using the functional configuration examples 1 to 3 described above and their combinations.
  • A first embodiment of the present invention will be described with reference to FIGS. 10 and 11.
  • In outline, the system according to this embodiment evaluates the degree of matching between people. For example, it analyzes the reaction of the other party and evaluates it by comparison with each person's peculiarity (expressions that do not usually appear), peculiarity relative to the past (expressions that did not appear in the past even with the same partner), and the neutral, normal state.
  • this matching is effective for online sessions conducted by lecturers and students. Matching various types of lecturers with various types of students is also important for continuing the course.
  • the system includes a type determination unit that determines each person's type based on the evaluation result by the analysis unit described above, and a matching determination unit that determines the degree of matching.
  • the type determination unit determines (estimates) the type of each user by referring to a type database (type DB) in which evaluation results and types are associated in advance.
  • the matching determination unit refers to a matching database (matching DB) in which the degree of matching for each type is defined in advance, and quantifies the degree of matching using the previously defined degree of compatibility between the above-determined types.
  • the construction of the matching DB can be exemplified by defining in advance the degree of matching between a lecturer who is good at bringing out conversations and a student who is not good at speaking.
  • the degree of matching may be determined using a learning model that has learned teacher data including the conversation between the two parties and the evaluation of the conversation between the two parties. In this case, it is possible to give feedback (whether or not the lecturer is suitable for the person, etc.) on the results of the lectures between the persons who are actually matched.
  • an online session for type determination may be conducted before the course starts to determine the type of the student in advance.
  • the system acquires a moving image of an online session performed for type determination (step S1101), and determines the type (step S1102). Subsequently, the determined student type and instructor type are temporarily matched (step S1103).
  • The conditions required of the lecturer may be obtained in advance through questionnaires (for example, "I like a gentle teacher" or "I like a teacher with a good tempo"), and the desired type may be identified from the questionnaire results.
  • the system acquires such information as condition information (step S1104).
  • The system considers the condition information, corrects the primary match degree (step S1105), and provides the corrected match degree. As a result, even if the system determines that a strict instructor is suitable as a matching partner, if the student has the condition of "preferring a gentle personality", the primary match degree of each instructor is corrected, and a "strict but gentle instructor" is selected as the optimum instructor.
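The flow from primary matching (step S1103) to correction by condition information (step S1105) might look like the sketch below. The contents of the matching DB, the type and trait names, and the additive bonus are all hypothetical; the text does not define how the correction is computed.

```python
# Hypothetical matching DB: (student type, instructor type) -> primary match degree.
MATCHING_DB = {
    ("quiet", "draws_out_conversation"): 0.9,
    ("quiet", "strict"): 0.7,
    ("quiet", "fast_paced"): 0.4,
}


def corrected_match(student_type, instructors, preferred_traits, bonus=0.3):
    """Rank instructors by match degree after applying condition information.

    instructors maps a name to (instructor type, set of personality traits).
    Each trait the student asked for adds a fixed bonus (an assumption).
    """
    scored = []
    for name, (instructor_type, traits) in instructors.items():
        primary = MATCHING_DB.get((student_type, instructor_type), 0.0)
        corrected = primary + bonus * len(traits & preferred_traits)
        scored.append((name, round(corrected, 3)))
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With a student who prefers a gentle personality, an instructor typed as strict but carrying the trait "gentle" can overtake an instructor with a higher primary match degree, which mirrors the "strict but gentle instructor" example in the text.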
  • A second embodiment of the present invention will be described with reference to FIGS. 12 and 13.
  • In outline, the system according to the present embodiment evaluates the emotion (evaluation value) obtained from a user more accurately by taking into account whether the person tends to express that emotion in the first place (for example, a person who habitually smiles a lot tends to have a high Happy score), that is, by comparing against the person's base emotion, and by evaluating the magnitude of displacement when emotions are expressed (the degree of emotional expression differs between people with small reactions and those with large reactions).
  • Graphs (a) to (c) of FIG. 12 show (a) the raw data (time-series emotion score) of a certain user, (b) the data after standard deviation processing, and (c) the data after standardization processing (z-score conversion to a mean of 0 and a variance of 1).
  • Based on the evaluation value of (b), the system performs an evaluation that takes into account the difference between each user's current expression and his or her normal expression. For example, it is possible to solve the problem that a user who often smiles in normal times inevitably has a high smile score (Happy score).
  • the system Based on the evaluation of (c), the system performs an evaluation that considers the richness of emotional expression for each user. For example, it is possible to improve the problem caused by the difference between a person who laughs softly and a person who laughs loudly.
  • (a) and (b) of FIG. 13 are graphs of the Happy scores (expressing happiness levels) of user A and user B, respectively. Comparing the average scores of user A and user B shows that user A has the higher average score. In other words, user A smiles more than user B, and user A's Happy score inevitably tends to be high. In addition, comparing the range of each user's emotion (ST_A and ST_B) shows that user A has a greater range of emotional expression (that is, a greater magnitude of reaction) than user B.
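The two corrections discussed for FIGS. 12 and 13 can be illustrated per user as follows. Reading the first step as subtracting the user's own average (the base emotion) is an assumption, since the text does not define the processing for graph (b) precisely; the second step is the z-score conversion to mean 0 and variance 1 that the text names.

```python
from statistics import mean, pstdev


def baseline_corrected(scores):
    """Remove the user's habitual level so that a frequent smiler is not favored."""
    mu = mean(scores)
    return [s - mu for s in scores]


def standardized(scores):
    """Z-score conversion: mean 0 and variance 1 for each user's time series."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # a flat series carries no displacement
    return [(s - mu) / sigma for s in scores]
```

After standardization, user A's large swings and user B's small ones are expressed on the same scale, so a given value means the same relative displacement for both users.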
  • the present system can exhibit its function as a system (apparatus) that generates training data by performing standard deviation processing and standardization processing on evaluation results obtained from various moving images.
  • a third embodiment of the present invention will be described with reference to FIG.
  • In outline, the system according to the present embodiment allows the subject to annotate (label) the evaluation results.
  • An online session between a lecturer (first user) and a student (second user) will be described below as an example.
  • the system analyzes and evaluates the moving images of the online sessions described above, and visualizes them by outputting the evaluation results of the students (second users) as graphs.
  • the instructor can add the presence/absence of the emotion to the graph and supplementary information at that time (situation, speech and behavior, action, partner's action information, etc.).
  • The user can associate the point (Lab_1) where the Happy score was high with the situation at that time (information about the situation such as "the class was lively" or "the other party responded well").
  • Similarly, the point (Lab_2) where the Happy score was low can be associated with the situation at that time (information about the situation such as "some assignment was given" or "severe things were said").
  • annotations may be made for sections (Lab_3 and Lab_4). In this case, it is possible to make annotations such as "time spent teaching a difficult unit” and "time to summarize at the end of class”.
  • a teaching data set may be generated from a set of graphs (plotted evaluation values) and annotations, and machine learning may be performed by the analysis unit.
  • a fourth embodiment of the present invention will be described with reference to FIG.
  • the system according to the present embodiment in outline, generates chart information that can serve as a student's emotion chart in an online session of a lecture between a lecturer and a student, and shares the chart information with the lecturer.
  • The contents of the chart include, for example, the average value of each emotion of the student, characteristics, habits, frequent facial expressions, the balance and degree of emotional expression shown by a radar chart, a ranking of words spoken when depressed, a ranking of words used when a smile appeared, and other information needed to better address the student's psychological state.
  • The system displays, as a list on the dashboard, the evaluation results produced by the analysis unit described above from multiple perspectives (neutral, happiness, surprise, discomfort, anger, sadness, fear, etc.). Displaying them together with facial expression icons that symbolize each emotion makes them easier to understand intuitively.
  • the display may be, for example, the evaluation results of online sessions for one day, or may be displayed on a weekly or monthly basis.
  • the dashboard may also display a digest of the video when each emotion was expressed most strongly, or textual information on the remarks made at that time. Also, the frequency of words used (preferred phrase) may be calculated and the ranking may be displayed.
  • In outline, the system according to the present embodiment detects whether there is inappropriate expression in a conversation between parties (for example, remarks that abuse a position, such as power harassment). Detection methods include rule-based approaches (identification of prohibited keywords and the expression of negative emotions on the other side) and machine learning approaches.
  • For example, in an online session between a boss and a subordinate, this system detects whether NG keywords (words inappropriate for a boss's remarks) are included in the words uttered by the boss user, and obtains the subordinate's evaluation value at that time. If the expression of negative emotions such as fear, anxiety, sadness, and anger rises beyond a predetermined range compared with before the word was said, the system records the boss's remark as an inappropriate remark.
  • Inappropriate remarks may be notified, for example, to the company's human resources department, etc., together with the digest video and the text of the remarks.
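The rule-based variant described above can be sketched as follows. The keyword list, the emotion keys, the threshold `max_rise`, and the simple English tokenization are hypothetical; the text states only that a prohibited keyword must be detected and that the listener's negative emotions must rise beyond a predetermined range.

```python
import re

# Hypothetical NG keyword list; a real deployment would define its own.
NG_KEYWORDS = {"useless", "incompetent"}
NEGATIVE_EMOTIONS = ("fear", "anxiety", "sadness", "anger")


def is_inappropriate(utterance, before, after, max_rise=0.2):
    """Return True when an NG keyword coincides with a rise in negative emotion.

    before and after map emotion names to the listener's evaluation values
    just before and just after the utterance.
    """
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    if not words & NG_KEYWORDS:
        return False
    return any(
        after.get(e, 0.0) - before.get(e, 0.0) > max_rise
        for e in NEGATIVE_EMOTIONS
    )
```

Requiring both conditions keeps a keyword used in a harmless context from being recorded, at the cost of missing remarks to which the listener shows no visible reaction.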
  • A sixth embodiment of the present invention will be described with reference to FIGS. 17 and 18.
  • In outline, the system according to the present embodiment generates a so-called word cloud from the utterances in moving images.
  • The system recognizes the voice of the acquired moving image and displays the words included in the recognized voice as text at a size corresponding to their utterance frequency.
  • As the words to be displayed, words that are frequently used in the evaluation target video may be extracted, or words that are not included in videos other than the evaluation target video (words unique to this video) may be extracted.
  • The system may change the color of the text according to the user's evaluation result when the word was uttered. For example, a word spoken when the HAPPY score is high may be shown in red, and a word spoken when the SAD score is high may be shown in blue.
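A minimal sketch of how word size and color could be derived is shown below. The font size range, the rule of coloring by the accumulated difference between the Happy and Sad scores, and the English-only tokenization are assumptions; the text says only that size follows utterance frequency and that color may follow the evaluation result.

```python
import re
from collections import Counter


def word_cloud_entries(utterances, min_size=10, max_size=48):
    """Build word -> (font size, color) from (text, happy, sad) tuples."""
    counts = Counter()
    emotion_sum = {}  # word -> accumulated (happy - sad) over its utterances
    for text, happy, sad in utterances:
        for word in re.findall(r"[a-z']+", text.lower()):
            counts[word] += 1
            emotion_sum[word] = emotion_sum.get(word, 0.0) + (happy - sad)
    top = max(counts.values(), default=1)
    entries = {}
    for word, count in counts.items():
        # Scale linearly so the most frequent word gets max_size.
        size = min_size + (max_size - min_size) * count / top
        if emotion_sum[word] > 0:
            color = "red"
        elif emotion_sum[word] < 0:
            color = "blue"
        else:
            color = "gray"
        entries[word] = (round(size), color)
    return entries
```

For Japanese speech, the tokenization step would be replaced by morphological analysis, as the statement content analysis earlier in the text already assumes.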
  • The word cloud shown in FIG. 17 displays the word "study" in the center.
  • When the word is selected, as shown in FIG. 18, the text of the conversation when the word was uttered is displayed.
  • a play button P is displayed together with the conversation, and when the play button P is selected, a digest of the moving image corresponding to the text is played.
  • A seventh embodiment of the present invention will be described with reference to FIGS. 19 and 20.
  • In outline, the system according to the present embodiment evaluates the degree of fatigue based on both the facial images and the voices included in moving images.
  • the system includes a fatigue evaluation condition reading unit and a fatigue evaluation unit.
  • In evaluating the degree of fatigue according to the present embodiment, the steady state of the user is stored, and the degree of fatigue is evaluated based on the range of emotional fluctuations in the steady state and the current range of emotional fluctuations. Note that the evaluation of the degree of fatigue is not limited to this, and may be based on the amount of change in the pitch of conversation in the steady state and the amount of change in the pitch of the current conversation.
  • When a moving image is acquired (step S2000), a pre-learned fatigue level evaluation model is read (step S2001), the fatigue level is evaluated (step S2002), and the fatigue level is notified (step S2010).
  • the evaluation information for the normal time is read (S2101), each element is cross-analyzed by comparing with the evaluation result of each emotion, and the fatigue level is notified (step S2010).
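The comparison between the steady-state range and the current range of emotional fluctuation can be sketched as below. Treating a narrower current range as a sign of greater fatigue, and mapping the result to a value between 0 and 1, are assumptions; the text states only that the two ranges are used.

```python
def fatigue_level(steady_scores, current_scores):
    """Return a fatigue estimate in [0, 1] from two emotion score series.

    Assumption: the more the current fluctuation range has shrunk relative
    to the user's steady-state range, the higher the estimated fatigue.
    """
    steady_range = max(steady_scores) - min(steady_scores)
    current_range = max(current_scores) - min(current_scores)
    if steady_range == 0:
        return 0.0  # no steady-state fluctuation to compare against
    shrink = 1.0 - current_range / steady_range
    return min(1.0, max(0.0, shrink))
```

The same function could be applied to pitch values instead of emotion scores, which corresponds to the alternative the text mentions.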
  • the system acquires moving images and recognizes the utterances of each user included in the moving images.
  • the system includes an object generation unit that generates an object that associates written statements with users and displays them in chronological order.
  • FIG. 21 is a diagram showing the objects of the conversation rally of users A to C.
  • When an utterance is made, an utterance object P is plotted, and utterance objects P adjacent in time are connected by a connector C.
  • the system further comprises evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the utterance.
  • A color corresponding to the evaluation value may be given to the utterance object P of the conversation rally. For example, the appropriate way to improve the conversation differs depending on whether another user followed after a user spoke in a natural manner or followed after the previous user's overbearing remarks.
  • the system according to this embodiment generally includes a statement object generation unit that recognizes a statement for each user included in a moving image and plots a statement object corresponding to the statement in association with the user.
  • The system further comprises evaluation means for calculating an evaluation value from a predetermined point of view based on both the recognized face image and the utterance.
  • a color corresponding to the evaluation value may be assigned to the utterance object P.
  • A tenth embodiment of the present invention will be described with reference to FIGS. 23 and 24.
  • In outline, the system according to the present embodiment includes intonation acquisition means for extracting intonation information of speech acquired from a moving image, and evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information.
  • the intonation acquisition means may extract changes in pitch of speech per unit time.
  • FIG. 23 is a graph showing the standard deviation of intervals (pitch).
  • FIG. 24 is a graph of the standard deviation of volume. In this embodiment, the standard deviation is acquired for a predetermined number of frames from audio data of a predetermined sampling rate.
  • In FIG. 23, t2 and t3 are relatively monotonous, whereas t1 and t4 show large changes. Therefore, it can be estimated that the conversation is not lively at t2 and t3.
  • In FIG. 24, t2 and t3, which correspond to the same time axis as in FIG. 23, change little and are relatively monotonous.
  • In contrast, t1 and t4 show relatively large variation between loud and soft. From this, it can be estimated that the conversation is not so lively during the time periods t2 and t3, while the conversation is lively during the time periods t1 and t4, both in terms of pitch and volume.
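The estimate drawn from FIGS. 23 and 24 can be sketched as a windowed standard deviation over pitch and volume series. The window length, the two thresholds, and the rule that both pitch and volume must vary for a window to count as lively are assumptions; extracting the pitch and volume values from the audio data is outside the scope of this sketch.

```python
from statistics import pstdev


def windowed_std(values, window):
    """Standard deviation over consecutive, non-overlapping windows of frames."""
    return [
        pstdev(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


def lively_windows(pitch, volume, window, pitch_thr, volume_thr):
    """Mark a window as lively when both pitch and volume vary strongly."""
    pitch_std = windowed_std(pitch, window)
    volume_std = windowed_std(volume, window)
    return [
        p > pitch_thr and v > volume_thr
        for p, v in zip(pitch_std, volume_std)
    ]
```

In terms of the figures, windows such as t1 and t4 would come out as lively and windows such as t2 and t3 as not lively.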
  • This system includes a context reading unit that reads context information and a correction unit that corrects evaluation results according to the context information.
  • The context information includes, for example, the situation, the number of conversations, the degree of acquaintance with the other party, and whether the conversation style is one-way or two-way. Category information classified by context, the items to be corrected, and the correction parameters may be prepared in advance.
  • the system may accept contextual category information from the user in advance, or may automatically determine from the titles and metadata of video files and the like. Accordingly, by specifying the context information of the moving image and performing the correction associated with the corresponding category, it is possible to provide an appropriate evaluation result.
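One way to hold the category information, the items to be corrected, and the correction parameters is a lookup table keyed by context category, as in the sketch below. The category names, the corrected items, and the multiplicative form of the correction are hypothetical; the text does not specify how the correction is applied.

```python
# Hypothetical table: context category -> {item to correct: correction parameter}.
CONTEXT_CORRECTIONS = {
    "first_meeting": {"happy": 1.2},
    "one_way_lecture": {"happy": 1.1, "surprise": 1.3},
}


def apply_context(evaluation, category):
    """Scale each evaluation value by the parameter defined for the category.

    Items without a parameter, and unknown categories, are left unchanged.
    """
    params = CONTEXT_CORRECTIONS.get(category, {})
    return {item: value * params.get(item, 1.0) for item, value in evaluation.items()}
```

The category passed in could come from the user in advance or from the title and metadata of the video file, as the text describes.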

Abstract

[Problem] To evaluate an online session by evaluating video acquired in the online session. [Solution] A system of the present disclosure comprises: an acquisition means that acquires video pertaining to an online session between a first user and a second user; a face recognition means that recognizes at least a face image of the first user included in the video for each prescribed frame; a speech recognition means that recognizes at least the speech of the second user included in the video; an evaluation means that calculates an evaluation value from multiple perspectives on the basis of both the recognized face image and the recognized speech; a keyword detection means that detects a prescribed keyword within at least the recognized speech; and an alert transmission means that sends a prescribed alert on the basis of the evaluation value when the keyword is detected.

Description

VIDEO SESSION EVALUATION TERMINAL, VIDEO SESSION EVALUATION SYSTEM AND VIDEO SESSION EVALUATION PROGRAM
 The present disclosure relates to a video session evaluation terminal, a video session evaluation system, and a video session evaluation program.
 Conventionally, there is known a technique for analyzing the emotions others receive in response to a speaker's remarks (see, for example, Patent Document 1). There is also known a technique for analyzing changes in the facial expressions of a subject in chronological order over a long period of time and estimating the emotions held during that period (see, for example, Patent Document 2). Furthermore, there are known techniques for identifying the factors that have the greatest influence on changes in emotions (see, for example, Patent Documents 3 to 5). There is also known a technique that compares the subject's usual facial expression with the current facial expression and issues an alert when the facial expression is dark (see, for example, Patent Document 6). There is also known a technique for determining the degree of emotion of a subject by comparing the subject's normal (expressionless) facial expression with the current facial expression (see, for example, Patent Documents 7 to 9). In addition, there is known a technique for analyzing the feeling of an organization and the atmosphere within a group that an individual feels (see, for example, Patent Documents 10 and 11).
Patent Document 1: JP 2019-58625 A
Patent Document 2: JP 2016-149063 A
Patent Document 3: JP 2020-86559 A
Patent Document 4: JP 2000-76421 A
Patent Document 5: JP 2017-201499 A
Patent Document 6: JP 2018-112831 A
Patent Document 7: JP 2011-154665 A
Patent Document 8: JP 2012-8949 A
Patent Document 9: JP 2013-300 A
Patent Document 10: JP 2011-186521 A
Patent Document 11: WO15/174426
All of the techniques described above are merely auxiliary functions for situations in which communication in real space is the norm. That is, they were not created for a situation in which, following the recent digital transformation (DX) of work and the global spread of infectious disease, communication for work, classes, and the like is conducted mainly online.
An object of the present invention is to objectively evaluate the communication exchanged, so that communication can be carried out more efficiently in situations where online communication is the norm.
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring a moving image relating to an online session between a first user and a second user;
face recognition means for recognizing, for each predetermined frame, at least face images of the first user and the second user included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face images and the recognized voice; and
determination means for determining, based on the evaluation values, the degree to which the second user matches the first user.
According to the present invention, a moving image analysis system is obtained that comprises:
a moving image acquisition unit that acquires a plurality of moving images obtained by photographing a subject;
a biological reaction analysis unit that analyzes changes in the biological reaction of the subject based on the moving images acquired by the moving image acquisition unit; and
an emotion evaluation unit that, based on the changes in the biological reaction analyzed for the subject by the biological reaction analysis unit, evaluates the degree of emotion of the subject according to an evaluation criterion leveled for the subject across the plurality of moving images.
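One possible reading of an evaluation criterion "leveled for the subject across the plurality of moving images" is to express each video's reaction values relative to the subject's own baseline over all videos. The Python sketch below illustrates that reading only. The function name `leveled_scores`, the input layout, and the use of a z-score are assumptions for illustration; the disclosure does not specify them.

```python
from statistics import mean, pstdev


def leveled_scores(videos):
    """Return one baseline-relative score per video for a single subject.

    videos: dict mapping a video id to a list of raw reaction values
    (hypothetical layout, e.g. per-frame smile intensity).
    """
    # Baseline statistics over every value the subject produced in any video.
    all_values = [v for values in videos.values() for v in values]
    baseline = mean(all_values)
    spread = pstdev(all_values)
    if spread == 0:
        # The subject never deviates from the baseline.
        return {video_id: 0.0 for video_id in videos}
    # Assumed criterion: z-score of each video's mean against the baseline.
    return {
        video_id: (mean(values) - baseline) / spread
        for video_id, values in videos.items()
    }
```

With such a criterion, a habitually expressive subject and a habitually reserved one are each judged against their own norm rather than a shared absolute scale.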
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for classifying into predetermined emotion information based on both the recognized face image and the recognized voice; and
annotation receiving means for receiving an annotation operation on the classified emotion information.
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice; and
evaluation value providing means for providing the subject with the average, over a predetermined period, of the evaluation value for each of the plurality of viewpoints.
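The per-viewpoint averaging over a predetermined period can be sketched as follows in Python. This is a minimal illustration under assumptions: the sample layout `(time_sec, viewpoint, value)`, the half-open period `[start, end)`, and the name `average_scores` are hypothetical and not taken from the disclosure.

```python
from collections import defaultdict
from statistics import mean


def average_scores(samples, start, end):
    """Average evaluation values per viewpoint within [start, end).

    samples: iterable of (time_sec, viewpoint, value) tuples (assumed layout).
    """
    by_viewpoint = defaultdict(list)
    for time_sec, viewpoint, value in samples:
        # Keep only samples that fall inside the requested period.
        if start <= time_sec < end:
            by_viewpoint[viewpoint].append(value)
    return {viewpoint: mean(values) for viewpoint, values in by_viewpoint.items()}
```

A viewpoint with no samples in the period is simply absent from the result, which leaves the choice of how to present missing data to the caller.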
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring a moving image relating to an online session between a first user and a second user;
face recognition means for recognizing, for each predetermined frame, at least a face image of the first user included in the moving image;
voice recognition means for recognizing at least the voice of the second user included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image;
keyword detection means for detecting a predetermined keyword in at least the recognized voice; and
alert transmission means for transmitting a predetermined alert based on the evaluation value at the time the keyword is detected.
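The combination of keyword detection and an alert based on the evaluation value at that moment can be sketched as follows in Python. The data layouts, the names `score_at` and `find_alerts`, and the rule "alert when the value is below a threshold" are assumptions for illustration; the disclosure only states that the alert is based on the evaluation value when the keyword is detected.

```python
from bisect import bisect_right


def score_at(scores, time_sec):
    """Return the latest evaluation value at or before time_sec, or None.

    scores: list of (time_sec, value) for the first user, sorted by time.
    """
    times = [t for t, _ in scores]
    index = bisect_right(times, time_sec)
    return scores[index - 1][1] if index else None


def find_alerts(utterances, scores, keywords, threshold):
    """List (time_sec, keyword, value) for keyword hits with a low score.

    utterances: list of (time_sec, text) recognized from the second user.
    """
    alerts = []
    for time_sec, text in utterances:
        # First predetermined keyword found in the recognized speech, if any.
        hit = next((k for k in keywords if k in text), None)
        if hit is None:
            continue
        value = score_at(scores, time_sec)
        # Assumed rule: alert only when the first user's score is low.
        if value is not None and value < threshold:
            alerts.append((time_sec, hit, value))
    return alerts
```

Looking up the most recent score rather than requiring an exact timestamp match reflects the fact that faces are evaluated only every predetermined frame, so speech and face samples rarely align exactly.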
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring a plurality of moving images in which at least a subject appears;
voice recognition means for recognizing at least the voice of the subject included in an evaluation target moving image among the moving images;
unique word extraction means for extracting, from the words included in the recognized voice, words that were not included in the moving images other than the evaluation target moving image; and
text display means for displaying each extracted word as text in a size corresponding to the frequency with which it was uttered.
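Extracting words unique to the evaluation target video and sizing them by utterance frequency can be sketched as follows in Python. The linear scaling between `min_pt` and `max_pt`, the default point sizes, and the name `unique_word_sizes` are assumptions for illustration; the disclosure only states that the size corresponds to the utterance frequency.

```python
from collections import Counter


def unique_word_sizes(target_words, other_videos_words, min_pt=10, max_pt=40):
    """Map each word unique to the target video to a display size.

    target_words: list of words recognized in the evaluation target video.
    other_videos_words: list of word lists, one per other video.
    """
    # Every word that appears in any video other than the target.
    seen_elsewhere = set().union(*other_videos_words)
    counts = Counter(w for w in target_words if w not in seen_elsewhere)
    if not counts:
        return {}
    top = max(counts.values())
    # Assumed scaling: the most frequent unique word gets max_pt.
    return {
        word: min_pt + (max_pt - min_pt) * count / top
        for word, count in counts.items()
    }
```

Filtering against the other videos removes a speaker's habitual vocabulary, so what remains tends to be the topic words specific to this session.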
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
text conversion means for converting words included in the recognized voice into text and displaying the text;
size setting means for setting the size of the converted text according to the frequency with which the word was uttered; and
color setting means for setting the color of the converted text according to the evaluation value.
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image; and
evaluation means for evaluating a degree of fatigue based on both the recognized face image and the recognized voice.
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
voice recognition means for recognizing utterances for each speaker included in the moving image; and
an object generation unit that generates an object in which the utterances are associated with the subject and displayed in chronological order.
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
voice recognition means for recognizing utterances for each subject included in the moving image; and
an utterance object generation unit that plots an utterance object corresponding to each utterance in association with the subject.
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
intonation acquisition means for extracting intonation information from the recognized voice; and
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information.
According to the present invention, a moving image analysis system is obtained that comprises:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
context acquisition means for acquiring context information of the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; and
correction means for correcting the evaluation value using the context information.
According to the present disclosure, analyzing and evaluating the moving images of a video session makes it possible to evaluate, in particular, the content of the session objectively.
In particular, according to the present invention, the communication exchanged can be evaluated objectively so that communication can be carried out more efficiently in situations where online communication is the norm.
A diagram showing the overall system according to an embodiment of the present invention.
An example of a functional block diagram of the evaluation terminal according to the embodiment of the present invention.
A diagram showing functional configuration example 1 of the evaluation terminal according to the embodiment of the present invention.
A diagram showing functional configuration example 2 of the evaluation terminal according to the embodiment of the present invention.
A diagram showing functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention.
A screen display example according to functional configuration example 3 of FIG. 6.
Another screen display example according to functional configuration example 3 of FIG. 6.
A diagram showing another configuration of functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention.
A diagram showing another configuration of functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention.
A diagram showing a system according to a first embodiment of the present invention.
A diagram showing the system according to the first embodiment of the present invention.
A diagram showing a system according to a second embodiment of the present invention.
A diagram showing the system according to the second embodiment of the present invention.
A diagram showing a system according to a third embodiment of the present invention.
A diagram showing a system according to a fourth embodiment of the present invention.
A diagram showing a system according to a fifth embodiment of the present invention.
A diagram showing a system according to a sixth embodiment of the present invention.
A diagram showing the system according to the sixth embodiment of the present invention.
A diagram showing a system according to a seventh embodiment of the present invention.
A diagram showing the system according to the seventh embodiment of the present invention.
A diagram showing a system according to an eighth embodiment of the present invention.
A diagram showing a system according to a ninth embodiment of the present invention.
A diagram showing a system according to a tenth embodiment of the present invention.
A diagram showing the system according to the tenth embodiment of the present invention.
A diagram showing a system according to an eleventh embodiment of the present invention.
The contents of the embodiments of the present disclosure are listed and described below. The present disclosure has the following configurations.
[Item 1]
A moving image analysis system comprising:
acquisition means for acquiring a moving image relating to an online session between a first user and a second user;
face recognition means for recognizing, for each predetermined frame, at least face images of the first user and the second user included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face images and the recognized voice; and
determination means for determining, based on the evaluation values, the degree to which the second user matches the first user.
[Item 2]
A moving image analysis system comprising:
a moving image acquisition unit that acquires a plurality of moving images obtained by photographing a subject;
a biological reaction analysis unit that analyzes changes in the biological reaction of the subject based on the moving images acquired by the moving image acquisition unit; and
an emotion evaluation unit that, based on the changes in the biological reaction analyzed for the subject by the biological reaction analysis unit, evaluates the degree of emotion of the subject according to an evaluation criterion leveled for the subject across the plurality of moving images.
[Item 3]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for classifying into predetermined emotion information based on both the recognized face image and the recognized voice; and
annotation receiving means for receiving an annotation operation on the classified emotion information.
[Item 4]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice; and
evaluation value providing means for providing the subject with the average, over a predetermined period, of the evaluation value for each of the plurality of viewpoints.
[Item 5]
A moving image analysis system comprising:
acquisition means for acquiring a moving image relating to an online session between a first user and a second user;
face recognition means for recognizing, for each predetermined frame, at least a face image of the first user included in the moving image;
voice recognition means for recognizing at least the voice of the second user included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image;
keyword detection means for detecting a predetermined keyword in at least the recognized voice; and
alert transmission means for transmitting a predetermined alert based on the evaluation value at the time the keyword is detected.
[Item 6]
A moving image analysis system comprising:
acquisition means for acquiring a plurality of moving images in which at least a subject appears;
voice recognition means for recognizing at least the voice of the subject included in an evaluation target moving image among the moving images;
unique word extraction means for extracting, from the words included in the recognized voice, words that were not included in the moving images other than the evaluation target moving image; and
text display means for displaying each extracted word as text in a size corresponding to the frequency with which it was uttered.
[Item 7]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
text conversion means for converting words included in the recognized voice into text and displaying the text;
size setting means for setting the size of the converted text according to the frequency with which the word was uttered; and
color setting means for setting the color of the converted text according to the evaluation value.
[Item 8]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image; and
evaluation means for evaluating a degree of fatigue based on both the recognized face image and the recognized voice.
[Item 9]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
voice recognition means for recognizing utterances for each speaker included in the moving image; and
an object generation unit that generates an object in which the utterances are associated with the subject and displayed in chronological order.
[Item 10]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
voice recognition means for recognizing utterances for each subject included in the moving image; and
an utterance object generation unit that plots an utterance object corresponding to each utterance in association with the subject.
[Item 11]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
intonation acquisition means for extracting intonation information from the recognized voice; and
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information.
[Item 12]
A moving image analysis system comprising:
acquisition means for acquiring at least a moving image;
face recognition means for recognizing, for each predetermined frame, at least a face image of a subject included in the moving image;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
context acquisition means for acquiring context information of the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; and
correction means for correcting the evaluation value using the context information.
Preferred embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. In this specification and the drawings, constituent elements having substantially the same functional configuration are given the same reference numerals, and redundant description is omitted.
<Basic functions>
The video session evaluation system of the present embodiment is a system that, in an environment where a video session (hereinafter referred to as an online session, whether one-way or two-way) is held among a plurality of people, analyzes and evaluates the emotions of an analysis subject among those people that are specific to that subject, that is, different from those of the others. Here, an emotion is a feeling that arises in response to one's own or another person's words and actions, such as pleasure or displeasure and its degree.
An online session is, for example, an online meeting, an online class, or an online chat, in which terminals installed at multiple locations are connected to a server via a communication network such as the Internet, and moving images can be exchanged among the terminals through the server. The moving images handled in an online session include the face images and voices of the users of the terminals. They also include images such as materials that a plurality of users share and view. On the screen of each terminal, it is possible to switch between the face image and the material image so that only one of them is displayed, or to divide the display area so that both are displayed at the same time. It is also possible to display the image of one of the users full screen, or to display the images of some or all of the users in divided small screens.
One or more of the users who participate in the online session using the terminals can be designated as analysis subjects. For example, the leader, moderator, or administrator of the online session (hereinafter collectively referred to as the organizer) designates one of the users as an analysis subject. The organizer is, for example, the instructor of an online class, the chairperson or facilitator of an online meeting, or the coach of a session held for coaching purposes. The organizer is usually one of the users participating in the online session, but may be another person who does not participate in it. Alternatively, all participants may be analyzed without designating an analysis subject.
In the video session evaluation system according to the present embodiment, when a video session is established between a plurality of terminals, at least the moving image obtained from the video session is displayed. The displayed moving image is acquired by the terminal, and at least the face image included in the moving image is identified for each predetermined frame unit. An evaluation value for the identified face image is then calculated and shared as necessary. In particular, in the present embodiment, the acquired moving image is stored on the terminal, analyzed and evaluated on the terminal, and the result is provided to the user of that terminal. Therefore, even a video session that contains personal or confidential information can be analyzed and evaluated without providing the moving image itself to an external evaluation organization or the like. In addition, by providing only the evaluation result (evaluation value) to an external terminal as necessary, the results can be visualized and cross-analysis and the like can be performed.
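The on-terminal flow described above, in which frames are sampled at a predetermined interval and only numeric evaluation values leave the terminal, can be sketched as follows in Python. The callable `score_face` is a hypothetical stand-in for whatever face evaluation the terminal runs, and `every_n` is an assumed sampling interval; neither name comes from the disclosure.

```python
def evaluate_locally(frames, score_face, every_n=30):
    """Score every n-th frame on the terminal and return only the values.

    frames: iterable of decoded frames that stay on the terminal.
    score_face: hypothetical callable mapping a frame to a number.
    Returns a list of (frame_index, evaluation_value) pairs; the frames
    themselves are never included in the result.
    """
    return [
        (index, score_face(frame))
        for index, frame in enumerate(frames)
        if index % every_n == 0
    ]
```

Because the return value holds indices and numbers only, it can be sent to an external terminal for visualization or cross-analysis without exposing the recorded video.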
As shown in FIG. 1, the video session evaluation system according to the present embodiment includes: user terminals 10 and 20, each having at least an input unit such as a camera unit and a microphone unit, a display unit such as a display, and an output unit such as a speaker; a video session service terminal 30 that provides a two-way video session to the user terminals 10 and 20; and an evaluation terminal 40 that performs part of the evaluation of the video session.
<Hardware configuration example>
Each of the functional blocks, functional units, and functional modules described below can be implemented by any of hardware provided in a computer, a DSP (Digital Signal Processor), or software. When implemented by software, for example, it actually comprises the CPU, RAM, ROM, and the like of a computer, and is realized by running a program stored in a recording medium such as RAM, ROM, a hard disk, or semiconductor memory. The series of processes performed by the systems and terminals described in this specification may be realized using software, hardware, or a combination of the two. A computer program for realizing each function of the information sharing support device 10 according to the present embodiment can be created and installed on a PC or the like. A computer-readable recording medium storing such a computer program can also be provided. The recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, or flash memory. The computer program may also be distributed, for example via a network, without using a recording medium.
The evaluation terminal according to the present embodiment acquires a moving image from the video session service terminal, identifies at least a face image included in the moving image for each predetermined frame unit, and calculates an evaluation value for the face image (described in detail later).
<How to acquire moving images>
As shown in FIG. 3, the video session service provided by the video session service terminal (hereinafter sometimes simply "this service") enables two-way communication by image and voice between the user terminals 10 and 20. With this service, a moving image captured by the camera unit of the other party's user terminal is displayed on the display of the user terminal, and voice captured by the microphone unit of the other party's user terminal is output from the speaker. This service is also configured so that both or either of the user terminals can record moving images and voice (collectively, "moving images, etc.") in a storage unit on at least one of the user terminals. The recorded moving image information Vs (hereinafter, "recorded information") is cached on the user terminal that started the recording and is recorded only locally on one of the user terminals. If necessary, the user can view the recorded information or share it with others within the scope of use of this service.
<Functional configuration example 1>
FIG. 4 is a block diagram showing a configuration example according to this embodiment. As shown in FIG. 4, the video session evaluation system of this embodiment is implemented as a functional configuration of the user terminal 10. That is, the user terminal 10 includes, as its functions, a moving image acquisition unit 11, a biological reaction analysis unit 12, a peculiarity determination unit 13, a related event identification unit 14, a clustering unit 15, and an analysis result notification unit 16.
The moving image acquisition unit 11 acquires from each terminal the moving images obtained by photographing a plurality of people (a plurality of users) with the camera of each terminal during an online session. It does not matter whether a moving image acquired from a terminal is set to be displayed on the screen of each terminal. That is, the moving image acquisition unit 11 acquires moving images from each terminal, including both those being displayed and those not being displayed on each terminal.
The biological reaction analysis unit 12 analyzes changes in the biological reactions of each of the plurality of people based on the moving images acquired by the moving image acquisition unit 11 (whether or not they are being displayed on the screen). In this embodiment, the biological reaction analysis unit 12 separates each acquired moving image into a set of images (a collection of frame images) and voice, and analyzes changes in biological reactions from each.
For example, the biological reaction analysis unit 12 analyzes the user's face image using the frame images separated from the moving image, and thereby analyzes changes in biological reactions relating to at least one of facial expression, line of sight, pulse, and facial movement. The biological reaction analysis unit 12 also analyzes the voice separated from the moving image, and thereby analyzes changes in biological reactions relating to at least one of the content of the user's statements and voice quality.
When a person's emotions change, the change appears as a change in biological reactions such as facial expression, line of sight, pulse, facial movement, statement content, and voice quality. In this embodiment, changes in the user's emotions are analyzed by analyzing changes in the user's biological reactions. The emotion analyzed in this embodiment is, as one example, the degree of pleasure/displeasure. In this embodiment, the biological reaction analysis unit 12 quantifies changes in biological reactions according to a predetermined standard and thereby calculates a biological reaction index value that reflects the content of those changes.
Changes in facial expression are analyzed, for example, as follows. For each frame image, the face region is identified in the frame image, and the facial expression is classified into one of a plurality of categories according to an image analysis model trained in advance by machine learning. Based on the classification results, the unit analyzes whether a positive or a negative change in expression has occurred between consecutive frame images and how large the change is, and outputs a facial expression change index value corresponding to the analysis result.
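The step from per-frame classification results to a signed change between consecutive frames can be illustrated with a minimal sketch. The expression labels, the valence scores in `VALENCE`, and the function name are hypothetical and are not taken from this disclosure; the classifier itself is assumed to have already produced one label per frame.

```python
# Hypothetical valence score per expression class (illustrative values only).
VALENCE = {"happy": 1.0, "neutral": 0.0, "sad": -0.6, "angry": -1.0}


def expression_change_indices(labels):
    """Return the signed valence change between each pair of consecutive frames.

    A positive value stands for a positive expression change, a negative value
    for a negative one, and the magnitude for the size of the change.
    """
    scores = [VALENCE[label] for label in labels]
    return [round(after - before, 3) for before, after in zip(scores, scores[1:])]


# Per-frame labels as an (assumed) pre-trained classifier might output them.
frames = ["neutral", "happy", "happy", "angry"]
changes = expression_change_indices(frames)
```

In this sketch the sign of each element carries the positive/negative direction and the absolute value carries the size of the change; how the index is actually scaled is left open by the description.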
Changes in line of sight are analyzed, for example, as follows. For each frame image, the eye region is identified in the frame image and the orientation of both eyes is analyzed to determine where the user is looking: for example, at the face of the speaker being displayed, at the shared material being displayed, or outside the screen. Whether the eye movement is large or small, and whether it is frequent or infrequent, may also be analyzed. Changes in line of sight are also related to the user's degree of concentration. The biological reaction analysis unit 12 outputs a line-of-sight change index value corresponding to the analysis result.
Changes in pulse are analyzed, for example, as follows. For each frame image, the face region is identified in the frame image. Then, using a trained image analysis model that captures the numerical value of the face color information (the G of RGB), changes in the G color of the face surface are analyzed. Arranging the results along the time axis forms a waveform representing the change in color information, and the pulse is identified from this waveform. A person's pulse quickens when the person is tense and slows when the person is calm. The biological reaction analysis unit 12 outputs a pulse change index value corresponding to the analysis result.
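How a pulse rate might be read off the green-channel waveform can be sketched as follows. This is only an illustration under stated assumptions: the per-frame mean G values of the face region are assumed to be already available, and the frequency search range (0.7 to 3.0 Hz), the plain discrete Fourier scan, and the name `estimate_bpm` are choices made for this sketch, not details given in the disclosure.

```python
import math


def estimate_bpm(green_means, fps, lo_hz=0.7, hi_hz=3.0, step_hz=0.01):
    """Estimate pulse (beats per minute) from per-frame mean green values.

    Scans candidate frequencies in an assumed plausible heart-rate band and
    returns the one with the largest spectral power, converted to bpm.
    """
    n = len(green_means)
    mean = sum(green_means) / n
    centered = [g - mean for g in green_means]  # remove the constant skin tone
    best_freq, best_power = lo_hz, -1.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        freq = lo_hz + k * step_hz
        re = sum(c * math.cos(2 * math.pi * freq * i / fps) for i, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * freq * i / fps) for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq * 60.0


# Synthetic waveform: 10 s at 30 fps with a 1.2 Hz (72 bpm) oscillation.
fps = 30
signal = [120 + 0.5 * math.sin(2 * math.pi * 1.2 * i / fps) for i in range(300)]
bpm = estimate_bpm(signal, fps)
```

A pulse change index value could then be derived, for instance, from the difference between successive estimates over sliding windows; the description does not fix that step.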
Changes in facial movement are analyzed, for example, as follows. For each frame image, the face region is identified in the frame image and the orientation of the face is analyzed to determine where the user is looking: for example, at the face of the speaker being displayed, at the shared material being displayed, or outside the screen. Whether the facial movement is large or small, and whether it is frequent or infrequent, may also be analyzed. Facial movement and eye movement may be analyzed together. For example, it may be analyzed whether the user is looking straight at the face of the speaker being displayed, looking up or down at it, or looking at it obliquely. The biological reaction analysis unit 12 outputs a face orientation change index value corresponding to the analysis result.
Statement content is analyzed, for example, as follows. The biological reaction analysis unit 12 converts voice of a specified duration (for example, about 30 to 150 seconds) into a character string by known speech recognition processing, and performs morphological analysis on the character string to remove words that are unnecessary for expressing the conversation, such as particles and articles. The remaining words are then vectorized, the unit analyzes whether a positive or a negative emotional change has occurred and how large the change is, and a statement content index value corresponding to the analysis result is output.
Voice quality is analyzed, for example, as follows. The biological reaction analysis unit 12 identifies the acoustic features of the voice by performing known voice analysis processing on voice of a specified duration (for example, about 30 to 150 seconds). Based on those acoustic features, it analyzes whether a positive or a negative change in voice quality has occurred and how large the change is, and outputs a voice quality change index value corresponding to the analysis result.
The biological reaction analysis unit 12 calculates the biological reaction index value using at least one of the facial expression change index value, line-of-sight change index value, pulse change index value, face orientation change index value, statement content index value, and voice quality change index value calculated as described above. For example, the biological reaction index value is calculated by a weighted calculation over these six index values.
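The weighted calculation can be pictured as a normalized weighted average. The sketch below is one possible reading; the key names, the sample index values, and the weights are hypothetical, since the description does not specify them.

```python
def biological_reaction_index(indices, weights):
    """Combine per-modality index values into one weighted average.

    Only the modalities present in `indices` are used, so the same function
    works when "at least one" of the six index values is available.
    """
    total_weight = sum(weights[name] for name in indices)
    return sum(indices[name] * weights[name] for name in indices) / total_weight


# Hypothetical index values for one user at one point in time.
indices = {
    "expression": 0.8,
    "gaze": 0.2,
    "pulse": -0.4,
    "face_direction": 0.0,
    "speech_content": 0.6,
    "voice_quality": 0.1,
}
# Hypothetical weights (assumed, not taken from the source).
weights = {
    "expression": 3,
    "gaze": 1,
    "pulse": 1,
    "face_direction": 1,
    "speech_content": 2,
    "voice_quality": 2,
}
value = biological_reaction_index(indices, weights)
```

Normalizing by the sum of the weights keeps the result on the same scale as the inputs even when some modalities are missing.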
The peculiarity determination unit 13 determines whether the change in biological reaction analyzed for the person to be analyzed is peculiar compared with the changes in biological reaction analyzed for persons other than the person to be analyzed. In this embodiment, the peculiarity determination unit 13 makes this determination based on the biological reaction index values calculated for each of the plurality of users by the biological reaction analysis unit 12.
For example, the peculiarity determination unit 13 calculates the variance of the biological reaction index values calculated for each of the plurality of people by the biological reaction analysis unit 12, and determines whether the change in biological reaction analyzed for the person to be analyzed is peculiar compared with others by comparing the index value calculated for that person against the variance.
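One way to compare a person's index value against the spread of the group is a standard-deviation test, sketched below. The threshold of 2.0, the choice to compute the spread over the other participants only, and the function name are assumptions made for this sketch; the description only states that the index value is compared with the variance.

```python
import statistics


def is_peculiar(values, subject, threshold=2.0):
    """Return True if the subject's index value deviates strongly from the others.

    `values` maps a participant id to that participant's biological reaction
    index value. The deviation is measured in units of the standard deviation
    (square root of the variance) of the other participants' values.
    """
    others = [v for name, v in values.items() if name != subject]
    mean = statistics.mean(others)
    spread = statistics.pstdev(others)
    if spread == 0:
        # All others are identical: any difference counts as peculiar.
        return values[subject] != mean
    return abs(values[subject] - mean) / spread > threshold


# Hypothetical index values for five participants.
values = {"A": 0.60, "B": 0.10, "C": 0.12, "D": 0.08, "E": 0.11}
a_is_peculiar = is_peculiar(values, "A")
e_is_peculiar = is_peculiar(values, "E")
```

Excluding the subject from the statistics avoids a single outlier inflating the variance it is being tested against, which matters when only a few people take part in the session.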
The following three patterns are conceivable as cases in which the change in biological reaction analyzed for the person to be analyzed is peculiar compared with others. In the first, no particularly large change in biological reaction occurs in the others, but a relatively large change occurs in the person to be analyzed. In the second, no particularly large change occurs in the person to be analyzed, but a relatively large change occurs in the others. In the third, a relatively large change occurs in both the person to be analyzed and the others, but the content of the change differs between them.
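The three patterns can be expressed as a small decision function. In this sketch a change is treated as "relatively large" when its absolute value reaches an assumed threshold, and "different content" is approximated as opposite sign; both simplifications, and the name `peculiarity_pattern`, are hypothetical.

```python
def peculiarity_pattern(subject_change, others_change, large=0.5):
    """Return 1, 2, or 3 for the three patterns, or None if none applies."""
    subject_large = abs(subject_change) >= large
    others_large = abs(others_change) >= large
    if subject_large and not others_large:
        return 1  # only the person to be analyzed reacts
    if others_large and not subject_large:
        return 2  # only the others react
    if subject_large and others_large and (subject_change > 0) != (others_change > 0):
        return 3  # both react, but in different directions
    return None


patterns = [
    peculiarity_pattern(0.8, 0.1),
    peculiarity_pattern(0.1, -0.7),
    peculiarity_pattern(0.8, -0.9),
    peculiarity_pattern(0.8, 0.9),
]
```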
The related event identification unit 14 identifies an event occurring with respect to at least one of the person to be analyzed, the others, and the environment at the time when a change in biological reaction determined to be peculiar by the peculiarity determination unit 13 occurs. For example, the related event identification unit 14 identifies from the moving image the speech and behavior of the person to be analyzed at the time the peculiar change occurs in that person. It likewise identifies from the moving image the speech and behavior of the others at that time, and the environment at that time. The environment is, for example, the shared material being displayed on the screen or whatever appears in the background of the person to be analyzed.
The clustering unit 15 analyzes the degree of correlation between the change in biological reaction determined to be peculiar by the peculiarity determination unit 13 (for example, one or a combination of line of sight, pulse, facial movement, statement content, and voice quality) and the event occurring when that peculiar change occurs (the event identified by the related event identification unit 14). When the correlation is determined to be at or above a certain level, it clusters the person to be analyzed or the event based on the result of the correlation analysis.
For example, when the peculiar change in biological reaction corresponds to a negative emotional change and the event occurring at that time is also a negative event, a correlation at or above the certain level is detected. The clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented categories according to the content of the event, its degree of negativity, the magnitude of the correlation, and so on.
Similarly, when the peculiar change in biological reaction corresponds to a positive emotional change and the event occurring at that time is also a positive event, a correlation at or above the certain level is detected. The clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented categories according to the content of the event, its degree of positivity, the magnitude of the correlation, and so on.
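The correlation check followed by assignment to a pre-segmented category might look like the sketch below. It assumes that both the peculiar reactions and the co-occurring events have already been scored on a signed valence scale; the Pearson coefficient, the 0.7 threshold, and the four segment names are illustrative assumptions rather than details given in the description.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def assign_segment(reactions, events, min_corr=0.7):
    """Assign a segment label when reactions and events correlate strongly enough."""
    if pearson(reactions, events) < min_corr:
        return None  # correlation below the required level: no clustering
    mean_event = sum(events) / len(events)
    polarity = "negative" if mean_event < 0 else "positive"
    strength = "strong" if abs(mean_event) >= 0.5 else "mild"
    return f"{polarity}-{strength}"


# Hypothetical valence scores for four peculiar reactions and their events.
reactions = [-0.9, -0.7, -0.8, -0.2]
events = [-1.0, -0.8, -0.9, -0.1]
segment = assign_segment(reactions, events)
unrelated = assign_segment([0.9, -0.9, 0.9, -0.9], events)
```

A production system would more likely use a learned clustering over richer features; the fixed segments here simply mirror the "pre-segmented categories" wording.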
The analysis result notification unit 16 notifies the designator of the person to be analyzed (the person to be analyzed or the organizer of the online session) of at least one of the change in biological reaction determined to be peculiar by the peculiarity determination unit 13, the event identified by the related event identification unit 14, and the category assigned by the clustering unit 15.
For example, the analysis result notification unit 16 notifies the person to be analyzed of that person's own speech and behavior as the event occurring when a peculiar change in biological reaction different from that of others occurs in that person (any of the three patterns described above; the same applies hereinafter). This allows the person to be analyzed to understand that he or she has feelings different from those of others when engaging in certain speech or behavior. At this time, the peculiar change in biological reaction identified for the person to be analyzed may also be notified to that person. The changes in biological reaction of the others being compared may additionally be notified.
For example, when the emotion others receive from speech or behavior of the person to be analyzed differs from the emotion that person held at the time, the person is notified of that speech or behavior. This applies whether the speech or behavior was performed unconsciously in the person's usual emotional state or consciously with a particular emotion. This makes it possible to discover speech and behavior that, contrary to one's own awareness, is well received or poorly received by others.
The analysis result notification unit 16 also notifies the organizer of the online session of the event occurring when a peculiar change in biological reaction different from that of others occurs in the person to be analyzed, together with the peculiar change itself. This allows the organizer to learn, as a phenomenon specific to the designated person to be analyzed, what kind of event affects what kind of emotional change, and to take appropriate measures for the person to be analyzed accordingly.
The analysis result notification unit 16 also notifies the organizer of the online session of the clustering result for the event occurring when the peculiar change occurs in the person to be analyzed, or for the person to be analyzed. Depending on which category the designated person has been clustered into, the organizer can grasp behavioral tendencies specific to that person and predict behaviors and states that may occur in the future, and can then take appropriate measures for that person.
In the above embodiment, an example was described in which the biological reaction index value is calculated by quantifying changes in biological reaction according to a predetermined standard, and whether the change analyzed for the person to be analyzed is peculiar compared with others is determined based on the index values calculated for each of the plurality of people. However, the embodiment is not limited to this example. For example, the following may be done.
The biological reaction analysis unit 12 analyzes the eye movement of each of the plurality of people and generates a heat map indicating the direction of the line of sight. The peculiarity determination unit 13 compares the heat map generated for the person to be analyzed with the heat maps generated for the others, and thereby determines whether the change in biological reaction analyzed for the person to be analyzed is peculiar compared with the changes analyzed for the others.
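The heat map comparison can be sketched with a coarse grid histogram of gaze points. The grid size, the normalized screen coordinates, the histogram-intersection measure, and the function names are assumptions for illustration; the description does not say how the heat maps are built or compared.

```python
def gaze_heatmap(points, grid=4):
    """Build a normalized grid histogram from gaze points in [0, 1] x [0, 1].

    `points` must be non-empty; each cell holds the fraction of samples
    whose gaze fell inside it.
    """
    cells = [[0.0] * grid for _ in range(grid)]
    for x, y in points:
        col = min(int(x * grid), grid - 1)
        row = min(int(y * grid), grid - 1)
        cells[row][col] += 1.0
    total = len(points)
    return [[count / total for count in row] for row in cells]


def heatmap_overlap(a, b):
    """Histogram intersection: 1.0 for identical maps, 0.0 for disjoint ones."""
    return sum(min(ca, cb) for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


# Hypothetical gaze samples: the subject stares at the bottom-right corner,
# while another participant mostly looks at the center of the screen.
subject_map = gaze_heatmap([(0.9, 0.9)] * 4)
other_map = gaze_heatmap([(0.5, 0.5)] * 3 + [(0.9, 0.9)])
overlap = heatmap_overlap(subject_map, other_map)
looks_peculiar = overlap < 0.5  # assumed cut-off for "peculiar"
```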
Thus, in this embodiment, the moving images of the video session are saved in the local storage of the user terminal 10, and the analysis described above is performed on the user terminal 10. Although this may depend on the machine specifications of the user terminal 10, the moving image information can be analyzed without being provided to the outside.
<Functional configuration example 2>
As shown in FIG. 5, the video session evaluation system of this embodiment may include, as its functional configuration, a moving image acquisition unit 11, a biological reaction analysis unit 12, and a reaction information presentation unit 13a.
The reaction information presentation unit 13a presents information indicating the changes in biological reaction analyzed by the biological reaction analysis unit 12a, including those of participants not displayed on the screen. For example, the reaction information presentation unit 13a presents this information to the leader, moderator, or administrator of the online session (hereinafter collectively, the organizer). The organizer of an online session is, for example, the instructor of an online class, the chairperson or facilitator of an online meeting, or the coach of a coaching session. The organizer is usually one of the users participating in the online session, but may be another person who does not participate in it.
In this way, in an environment where an online session is held among multiple people, the organizer of the online session can also grasp the state of participants who are not displayed on the screen.
<Functional configuration example 3>
FIG. 6 is a block diagram showing a configuration example according to this embodiment. In the video session evaluation system of this embodiment shown in FIG. 6, functions similar to those of the above-described first embodiment are given the same reference numerals, and their description may be omitted.
The system according to this embodiment includes a camera unit that acquires the video of a video session, a microphone unit that acquires the voice, an analysis unit that analyzes and evaluates the moving image, an object generation unit that generates display objects (described later) based on information obtained by evaluating the acquired moving image, and a display unit that displays both the moving image of the video session and the display objects while the video session is in progress.
As described above, the analysis unit includes the moving image acquisition unit 11, the biological reaction analysis unit 12, the peculiarity determination unit 13, the related event identification unit 14, the clustering unit 15, and the analysis result notification unit 16. The function of each element is as described above.
As shown in FIG. 7, based on the result of the analysis unit analyzing the moving image acquired from the video session, the object generation unit superimposes on the moving image, as necessary, an object 50 indicating the recognized face portion and information 100 indicating the content of the analysis and evaluation described above. When the faces of multiple people appear in the moving image, the object 50 may be displayed for each of the identified faces.
Even when the camera function of the video session is stopped on the other party's terminal (that is, stopped in software within the video session application rather than by physically covering the camera), the object 50 and the object 100 may be displayed where the other party's face is located, provided that the other party's camera recognizes the face. This allows both parties to confirm that the other is in front of the terminal even when the camera function is turned off. In this case, for example, the video session application may hide the information obtained from the camera while displaying only the object 50 and the object 100 corresponding to the face recognized by the analysis unit. Alternatively, the video information acquired from the video session and the information obtained through recognition by the analysis unit may be placed on different display layers, and the layer for the former may be hidden.
When there are multiple areas displaying moving images, the object 50 and the object 100 may be displayed in all of the areas or in only some of them. For example, as shown in FIG. 8, they may be displayed only on the moving image of the guest side.
Preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the technical scope of the present disclosure is not limited to these examples. It is clear that a person having ordinary knowledge in the technical field of the present disclosure could conceive of various changes or modifications within the scope of the technical idea described in the claims, and these are naturally understood to belong to the technical scope of the present disclosure.
The device described in this specification may be realized as a single device, or may be realized by a plurality of devices (for example, cloud servers) some or all of which are connected via a network. For example, the control unit 110 and the storage 130 of each terminal 10 may be realized by different servers connected to each other via a network.
That is, this system includes the user terminals 10 and 20, the video session service terminal 30 that provides an interactive video session to the user terminals 10 and 20, and the evaluation terminal 40 that evaluates the video session, and the following variations in how these are combined are conceivable.
(1) Processing everything on the user terminal only
As shown in FIG. 9, when the processing by the analysis unit is performed on the terminal conducting the video session, analysis and evaluation results can be obtained at the same time as the video session takes place (in real time), although a certain level of processing capacity is required.
(2) Processing by the user terminal and the evaluation terminal
As shown in FIG. 10, the analysis unit may be provided in an evaluation terminal connected via a network or the like. In this case, the moving image acquired by the user terminal is shared with the evaluation terminal either during the video session or afterward, and is analyzed and evaluated by the analysis unit of the evaluation terminal. The information of the object 50 and the object 100 is then shared with the user terminal, together with or separately from the moving image data (that is, information including at least the analysis data is shared), and displayed on the display unit.
 The systems according to the first to eleventh embodiments shown in FIGS. 10 to 25 are realized using the configurations of functional configuration examples 1 to 3 described above and combinations of them.
<First Embodiment>
 A first embodiment of the present invention will be described with reference to FIGS. 10 and 11. In outline, the system according to this embodiment evaluates how well two people match. The evaluation is made, for example, by analyzing the other party's reactions, assessing what is distinctive about a person's behavior with a particular partner (such as expressions that do not usually appear), assessing what is distinctive relative to the person's own past (such as expressions that never appeared before, even with the same partner), and comparing against a neutral state. Such matching is particularly useful for online sessions between lecturers and students. Matching the various types of lecturers with the various types of students is also important for keeping a course going.
 As shown in FIG. 10, the system according to this embodiment includes a type determination unit that determines each person's type from the evaluation results produced by the analysis unit described above, and a matching determination unit that determines the degree of matching.
 The type determination unit determines (estimates) each user's type by referring to a type database (type DB) in which evaluation results are associated with types in advance. The matching determination unit refers to a matching database (matching DB) in which the degree of matching between types is defined in advance, and quantifies the degree of matching between the determined types using that predefined compatibility. The matching DB may be built, for example, by defining in advance the degree of matching between a lecturer of a type that is good at drawing out conversation and a student who is reluctant to speak.
 Alternatively, after the evaluation results are obtained, the degree of matching may be determined using a learning model trained on teacher data that includes conversations between two parties and evaluations of those conversations. In this case, the outcome of lectures between people who were actually matched (for example, whether the lecturer suited the student or not) may be fed back.
 Furthermore, as shown in FIG. 11, an online session for type determination may be held before the course begins so that the student's type can be determined in advance. The system acquires the moving image of the online session held for type determination (step S1101) and determines the type (step S1102). It then performs primary matching between the determined student type and the lecturer types (step S1103). The student's wishes regarding the lecturer, such as "I would like a kind teacher" or "I would like a teacher who keeps a good tempo", may be collected in advance through a questionnaire or the like, and the desired type may be identified from the questionnaire results. The system acquires such information as condition information (step S1104). The system then corrects the primary match degree in light of the condition information (step S1105) and provides the corrected match degree. As a result, even when the system judges a strict lecturer to be a suitable match, if the student has stated the condition "I would like someone with a kind personality", the primary match degrees calculated for the candidate lecturers are each corrected, and a lecturer who is "strict but kind" is selected as the best match.
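 The flow of steps S1103 to S1105 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the type names, the 0 to 100 score scale, the contents of `MATCHING_DB`, and the additive `bonus` used as the correction are not specified in the disclosure, which only states that the primary match degree is corrected according to the condition information.

```python
from dataclasses import dataclass

# Hypothetical matching DB: (student type, lecturer type) -> compatibility, 0 to 100.
MATCHING_DB = {
    ("reluctant_speaker", "strict"): 80,
    ("reluctant_speaker", "easygoing"): 50,
}


@dataclass(frozen=True)
class Lecturer:
    name: str
    type: str          # type determined by the type determination unit
    traits: frozenset  # hypothetical trait labels, e.g. frozenset({"strict", "kind"})


def primary_match(student_type, lecturer):
    """Step S1103: look up the predefined compatibility of the two types."""
    return MATCHING_DB.get((student_type, lecturer.type), 0)


def corrected_match(student_type, lecturer, desired_traits, bonus=30):
    """Step S1105: raise the primary score for each desired trait the lecturer has.

    The additive bonus is an assumed correction rule, capped at 100.
    """
    score = primary_match(student_type, lecturer)
    hits = len(set(desired_traits) & lecturer.traits)
    return min(100, score + bonus * hits)


def best_lecturer(student_type, lecturers, desired_traits=()):
    """Return the lecturer with the highest corrected match degree."""
    return max(lecturers,
               key=lambda lec: corrected_match(student_type, lec, desired_traits))
```

 With three candidates (a strict lecturer, a strict and kind lecturer, and an easygoing and kind lecturer), calling `best_lecturer` without conditions returns the first strict lecturer, while passing `{"kind"}` as the condition information selects the "strict but kind" one, mirroring the example in the text.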
<Second Embodiment>
 A second embodiment of the present invention will be described with reference to FIGS. 12 and 13. In outline, the system according to this embodiment evaluates the emotion (evaluation value) obtained from a user more accurately by comparing it with the user's baseline, that is, whether the person tends to show that emotion in the first place (for example, a person who smiles a lot to begin with tends to have a high happy score), and by assessing the magnitude of the displacement when the emotion appears (the degree of emotional expression differs between people whose reactions are small and people whose reactions are large).
 Graphs (a) to (c) of FIG. 12 show, for a certain user, (a) the raw data (a time series of emotion scores), (b) the standard deviation of the raw data taken over a frame width of one minute (standard deviation processing), and (c) the standard deviation after standardization, that is, conversion to z-scores with mean 0 and variance 1 (standardization processing).
 Based on the evaluation values in (b), the system performs an evaluation that takes into account differences between users in facial features and in their usual expression. This mitigates the problem that, for example, a user who is often smiling in the normal state inevitably receives a high smile score (happy score). Based on the values in (c), the system performs an evaluation that takes into account differences between users in how richly they express emotion. This mitigates the problem caused by, for example, the difference between a person who laughs slightly and a person who laughs heartily.
 A more detailed explanation follows, based on the schematic diagram in FIG. 13. Graphs (a) and (b) of FIG. 13 show the Happy score (representing the degree of happiness) of user A and user B, respectively. Comparing the averages of the two users' scores shows that user A's average is higher. In other words, user A smiles more than user B, so user A's Happy score inevitably tends to be higher. Comparing the emotional ranges (ST_A and ST_B) shows that user A's range of emotional expression (that is, the size of the reaction) is larger than user B's.
 Performing the standard deviation processing and standardization processing described above makes it possible to evaluate numerical values from which individual differences, such as how often an emotion is expressed and how strongly it is expressed, have been removed. In addition, having the analysis unit described above (see, for example, FIG. 3) perform machine learning with teacher data that has undergone standard deviation processing and standardization processing allows it to learn appropriately. In other words, the system can also function as a system (device) that generates teacher data by applying standard deviation processing and standardization processing to evaluation results obtained from various moving images.
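 The two processing steps can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the use of consecutive, non-overlapping windows and of the population standard deviation are assumptions, since the disclosure states only that the standard deviation is taken over a one-minute frame width and then converted to z-scores.

```python
from statistics import mean, pstdev


def windowed_std(scores, window):
    """Standard deviation processing: std of each consecutive window of samples.

    `window` is the number of samples per window (for example, the number of
    frames in one minute). Non-overlapping windows are an assumption.
    """
    return [pstdev(scores[i:i + window])
            for i in range(0, len(scores) - window + 1, window)]


def standardize(values):
    """Standardization processing: convert to z-scores (mean 0, variance 1)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no variation to express.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]
```

 Because the windowed standard deviation ignores a constant offset, a user whose happy score is always high is not favored, and because the z-score divides by the overall spread, a user with large reactions and a user with small reactions end up on the same scale.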
<Third Embodiment>
 A third embodiment of the present invention will be described with reference to FIG. 14. In outline, the system according to this embodiment has the person concerned annotate (label) the evaluation results. An online session between a lecturer (first user) and a student (second user) is used as an example below.
 The system analyzes and evaluates the moving image of the online session described above and visualizes the evaluation results for the student (second user) by outputting them as a graph. The lecturer can add to the graph whether the emotion was actually present, along with supplementary information about that moment (the situation, what was said, what was done, the other party's behavior, and so on). As illustrated, the lecturer can, for example, associate a point where the Happy score was high (Lab_1) with the situation at that time (situational information such as "the class was lively" or "the other party responded well"). Likewise, a point where the Happy score was low (Lab_2) can be associated with the situation at that time (such as "the student was working on an assignment" or "I said something harsh"). This shows whether the score was low because the student was concentrating on an assignment or because something harsh was said, which helps in deciding what to improve. By accepting the first user's own annotations of the second user's reactions in this way, it becomes possible to determine whether or not the communication is something that should be improved.
 As illustrated, annotations may also be applied to intervals (Lab_3 and Lab_4). This allows annotations such as "time spent teaching a difficult unit" or "summary time at the end of the class".
 A training dataset may also be generated from pairs of the graph (the plotted evaluation values) and the annotations, and used for machine learning by the analysis unit.
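 One way to pair plotted evaluation values with point and interval annotations is sketched below. The `Annotation` record and the `build_training_pairs` function are hypothetical names introduced for illustration; the disclosure does not specify the format of the training dataset.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start: float  # seconds from the start of the session
    end: float    # equal to `start` for a point annotation such as Lab_1
    label: str    # free-text situation entered by the lecturer


def build_training_pairs(scores, annotations):
    """Pair the evaluation values inside each annotated span with its label.

    scores: list of (time_sec, value) tuples in chronological order.
    Returns a list of (values_in_span, label) tuples; annotations that
    cover no sample are skipped.
    """
    pairs = []
    for ann in annotations:
        values = [v for t, v in scores if ann.start <= t <= ann.end]
        if values:
            pairs.append((values, ann.label))
    return pairs
```

 A point annotation yields a single value with its label, while an interval annotation such as "time spent teaching a difficult unit" yields the whole run of values inside the interval.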
<Fourth Embodiment>
 A fourth embodiment of the present invention will be described with reference to FIG. 15. In outline, for online lecture sessions between a lecturer and a student, the system according to this embodiment generates chart information that can serve as an emotional record of the student and shares it with the lecturer. The contents of the chart may include, for example, the student's average value for each emotion, characteristics, verbal habits, frequently appearing facial expressions, the balance and degree of emotional expression shown in a radar chart or the like, a ranking of the words that were being said to the student when the student became discouraged, and a ranking of the words that were being said when the student smiled; in short, the information needed to engage more closely with the student's psychological state.
 As illustrated, the system displays on a dashboard a list of the evaluation results produced by the analysis unit described above from multiple viewpoints (neutral, happiness, surprise, discomfort, anger, sadness, fear, and so on). Displaying each result together with a facial-expression icon that symbolizes the emotion makes it easier to understand intuitively. The display may show, for example, the evaluation results of one day's online sessions, or results by week or by month.
 In addition to the above, the dashboard may display, for each emotion, a digest of the moving image at the moment the emotion was expressed most strongly, and a text transcription of what was said at that moment. The frequency of the words used (verbal habits) may also be calculated and displayed as a ranking.
<Fifth Embodiment>
 A fifth embodiment of the present invention will be described with reference to FIG. 16. In outline, the system according to this embodiment detects whether a conversation between the parties contained inappropriate expressions (for example, remarks that exploit one's position, such as power harassment), taking into account the reaction of the person to whom the remark was made. Detection methods include a rule-based approach (identifying that a prohibited keyword was uttered and that the other party showed negative emotion) and a machine-learning approach.
 As illustrated, in an online session between a supervisor and a subordinate, for example, the system detects whether the words uttered by the supervisor include an NG keyword (a word that is inappropriate for a supervisor to say) and obtains the subordinate's evaluation values at that moment. If the expression of negative emotions such as fear, anxiety, sadness, or anger has risen by more than a predetermined range compared with before the word was said, the system records the supervisor's remark as an inappropriate remark. Inappropriate remarks may be reported, for example, to the company's human resources department, together with a digest of the moving image, the text of the remark, and the like.
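 The rule-based variant can be sketched as follows. The keyword list, the five-second comparison window, the 0.2 threshold, and the use of a single combined negative-emotion score are all assumptions made for illustration; the disclosure only requires that the negative emotion rise beyond a predetermined range relative to before the keyword was said.

```python
import re
from statistics import mean

# Hypothetical NG keyword list; a real deployment would define its own.
NG_KEYWORDS = {"useless", "incompetent"}


def detect_inappropriate(utterances, negative_scores, window=5.0, threshold=0.2):
    """Flag supervisor utterances that contain an NG keyword and are followed
    by a rise in the subordinate's negative-emotion score.

    utterances: list of (time_sec, text) spoken by the supervisor.
    negative_scores: list of (time_sec, score) for the subordinate, score in [0, 1].
    window and threshold are assumed parameters.
    """
    flagged = []
    for t, text in utterances:
        words = set(re.findall(r"[a-z']+", text.lower()))
        if not words & NG_KEYWORDS:
            continue
        before = [s for ts, s in negative_scores if t - window <= ts < t]
        after = [s for ts, s in negative_scores if t <= ts <= t + window]
        if before and after and mean(after) - mean(before) > threshold:
            flagged.append((t, text))
    return flagged
```

 An utterance is flagged only when both conditions hold, so a keyword that provokes no measurable reaction, or a negative reaction with no keyword, is not recorded by this sketch.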
<Sixth Embodiment>
 A sixth embodiment of the present invention will be described with reference to FIGS. 17 and 18. In outline, the system according to this embodiment is a so-called word cloud built from what is said in the moving image. The system recognizes the speech in the acquired moving image and displays the words contained in the recognized speech as text, sized according to how often each word was uttered.
 The words to be displayed may be the words used most frequently in the moving image under evaluation, or may be words that did not appear in other moving images (words specific to this moving image).
 The system may also change the color of the text according to the user's evaluation results at the time the word was uttered. For example, a word uttered when the HAPPY score was high may be shown in red, and a word uttered when the SAD score was high may be shown in blue.
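 A minimal sketch of the sizing and coloring rules follows. The font-size range, the linear scaling by frequency, and the choice of the emotion with the largest summed score as the "dominant" one are assumptions for illustration; only the red and blue colors for HAPPY and SAD come from the text.

```python
from collections import Counter, defaultdict

# Colors named in the text; other emotions fall back to an assumed default.
EMOTION_COLORS = {"happy": "red", "sad": "blue"}


def build_word_cloud(utterances, min_size=12, max_size=48):
    """Return {word: (font_size, color)} for a word cloud.

    utterances: list of (words, scores) where `words` is a list of recognized
    words and `scores` maps emotion names to the speaker's evaluation values
    at the time of the utterance.
    """
    counts = Counter()
    totals = defaultdict(lambda: defaultdict(float))
    for words, scores in utterances:
        for word in words:
            counts[word] += 1
            for emotion, value in scores.items():
                totals[word][emotion] += value
    if not counts:
        return {}
    top = max(counts.values())
    cloud = {}
    for word, n in counts.items():
        # Scale linearly so the most frequent word gets max_size.
        size = min_size + (max_size - min_size) * n // top
        dominant = max(totals[word], key=totals[word].get, default=None)
        cloud[word] = (size, EMOTION_COLORS.get(dominant, "gray"))
    return cloud
```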
 The word cloud shown in FIG. 17 displays the word "study" at its center. When that word is selected, the conversation in which the word was uttered is displayed as text, as shown in FIG. 18. A play button P is displayed together with the conversation text, and selecting the play button P plays a digest of the moving image corresponding to that text.
<Seventh Embodiment>
 A seventh embodiment of the present invention will be described with reference to FIGS. 19 and 20. In outline, the system according to this embodiment evaluates the degree of fatigue based on both the face images and the speech contained in the moving image.
 The system includes a fatigue evaluation condition reading unit and a fatigue determination unit. In this embodiment, the user's steady state is stored, and the degree of fatigue is evaluated based on the range of emotional fluctuation in that steady state and the current range of emotional fluctuation. The evaluation of fatigue is not limited to this; it may instead be based on the amount of change in speech pitch in the steady state and the amount of change in the pitch of the current conversation.
 As shown in FIG. 20, when a moving image is acquired (step S2000), the system reads a fatigue evaluation model trained in advance (step S2001), evaluates the degree of fatigue (step S2002), and reports the degree of fatigue (step S2010). Alternatively, after acquiring the moving image, the system reads the evaluation information for the normal state (step S2101), cross-analyzes each element by comparing it with the evaluation result for each emotion, and reports the degree of fatigue (step S2010).
 With the degree of fatigue evaluated in this way, employees and others may, for example, be sent graded alerts based on their degree of fatigue. For example, each time the degree of fatigue exceeds a given threshold, a message prompting the user to take a break may be sent.
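 The comparison against the steady state and the graded alerts can be sketched as below. The formula is an assumption, not the disclosed method: it treats a narrower range of emotional fluctuation than in the steady state as a sign of greater fatigue, which the disclosure does not state explicitly. The thresholds and message texts are likewise hypothetical.

```python
def fatigue_level(steady_scores, current_scores):
    """Return an assumed fatigue level in [0, 1].

    The level grows as the current range of emotional fluctuation shrinks
    relative to the range observed in the user's stored steady state.
    """
    steady_range = max(steady_scores) - min(steady_scores)
    current_range = max(current_scores) - min(current_scores)
    if steady_range == 0:
        return 0.0
    return min(1.0, max(0.0, 1.0 - current_range / steady_range))


# Hypothetical graded alerts, checked from the most severe threshold down.
ALERT_LEVELS = [
    (0.8, "Please stop and rest now."),
    (0.5, "Consider taking a short break."),
]


def fatigue_alert(level):
    """Return the message for the highest threshold exceeded, or None."""
    for threshold, message in ALERT_LEVELS:
        if level > threshold:
            return message
    return None
```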
<Eighth Embodiment>
 An eighth embodiment of the present invention will be described with reference to FIG. 21. In outline, the system according to this embodiment visualizes users' utterances in chronological order, taking the order of the utterances into account. This makes it possible to analyze, for example, who tends to speak after whom.
 The system acquires a moving image and recognizes the utterances of each user included in it. The system includes an object generation unit that generates an object in which the utterances are associated with the users and displayed in chronological order.
 FIG. 21 shows the conversation-rally object for users A to C. Whenever an utterance occurs, an utterance object P is plotted, and utterance objects P that are adjacent in time are connected by a connector C.
 From such a conversation rally it can be seen, for example, that user C tends to speak after user B. In that case one could, for example, increase user B's share of the conversation in order to encourage user C to speak.
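 The "who speaks after whom" analysis amounts to counting transitions between consecutive speakers. The sketch below is illustrative; the function names are hypothetical, and ignoring consecutive utterances by the same speaker is an assumption.

```python
from collections import Counter


def follow_counts(speakers):
    """Count (previous speaker, next speaker) pairs in chronological order.

    speakers: list of speaker names, one per utterance object P.
    Consecutive utterances by the same speaker are ignored (an assumption).
    """
    return Counter((a, b) for a, b in zip(speakers, speakers[1:]) if a != b)


def most_common_predecessor(speakers, target):
    """Return the speaker after whom `target` most often speaks, or None."""
    counts = Counter()
    for (prev, nxt), n in follow_counts(speakers).items():
        if nxt == target:
            counts[prev] += n
    return counts.most_common(1)[0][0] if counts else None
```

 For the sequence A, B, C, A, B, C, A, C, user C follows user B twice and user A once, so `most_common_predecessor` returns "B", matching the tendency described in the text.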
 The system further includes evaluation means that calculates an evaluation value from a predetermined viewpoint based on both the recognized face image and the utterance. Each utterance object P in the conversation rally may be given a color corresponding to that evaluation value. This is because the appropriate improvement differs depending on, for example, whether another user followed after a user had spoken naturally, or followed after the preceding user had made an overbearing remark.
<Ninth Embodiment>
 A ninth embodiment of the present invention will be described with reference to FIG. 22. In outline, the system according to this embodiment includes an utterance object generation unit that recognizes the utterances of each user included in a moving image and plots an utterance object corresponding to each utterance in association with the user.
 As illustrated, an utterance object is plotted whenever any of the six participants (SUZUKI, SATO, KAMIYA, NOSE, ANDO, and TADA) speaks. Looking at the graph makes it possible to tell at a glance whether and when each participant spoke.
 The system further includes evaluation means that calculates an evaluation value from a predetermined viewpoint based on both the recognized face image and the utterance. Each utterance object P may be given a color corresponding to that evaluation value.
 With this configuration, it is easy to grasp at a glance who spoke the most and what the session was like as a whole.
<Tenth Embodiment>
 A tenth embodiment of the present invention will be described with reference to FIGS. 23 and 24. In outline, the system according to this embodiment includes intonation acquisition means that extracts intonation information from the speech obtained from the moving image, and evaluation means that calculates an evaluation value from a predetermined viewpoint based on both the recognized face image and the intonation information. The intonation acquisition means may extract the change in the pitch of the speech per unit time.
 FIG. 23 is a graph of the standard deviation of pitch. FIG. 24 is a graph of the standard deviation of volume. In this embodiment, the standard deviation is computed over a predetermined number of frames of audio data sampled at a predetermined sampling rate.
 It is generally said that the better the communication in a conversation, the larger the standard deviation of the pitch and sound pressure of the participants' utterances. Accordingly, the following tendencies can be stated: a conversation whose standard deviation is low throughout tends to be a gloomy conversation; a conversation whose standard deviation changes little tends to be a flat, matter-of-fact conversation; a standard deviation that decreases monotonically over time suggests a conversation that lost energy as it went on; a standard deviation that decreases over time but turns upward at the end suggests a conversation that became lively at the end; and a standard deviation that stays high suggests that the speaker is flustered or confused. This kind of speech analysis is especially useful for moving images that contain no face image (for example, when there is no information from a camera) and for scenes in which the face is not visible because the user is looking down or the like.
 In FIG. 23, t2 and t3 are relatively monotonous, whereas t1 and t4 show large changes. It can therefore be inferred that the conversation was not lively at t2 and t3. In FIG. 24 as well, t2 and t3, which lie on the same time axis as in FIG. 23, show little change and are relatively monotonous, while t1 and t4 show relatively pronounced variation in loudness. For both pitch and volume, this indicates that the conversation was not very lively during t2 and t3 and was lively during t1 and t4.
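 The reading of FIGS. 23 and 24 can be sketched as a per-window check on both signals. The cut-off rule used here (a window counts as lively when its pitch and volume standard deviations both exceed the median over all windows) is an assumption for illustration; the disclosure does not give a numerical criterion.

```python
from statistics import median, pstdev


def window_std(samples, frames):
    """Standard deviation of each consecutive block of `frames` samples."""
    return [pstdev(samples[i:i + frames])
            for i in range(0, len(samples) - frames + 1, frames)]


def lively_windows(pitch, volume, frames):
    """Return one boolean per window: True when both pitch and volume vary
    more than the median window (assumed criterion)."""
    pitch_std = window_std(pitch, frames)
    volume_std = window_std(volume, frames)
    pitch_cut = median(pitch_std)
    volume_cut = median(volume_std)
    return [p > pitch_cut and v > volume_cut
            for p, v in zip(pitch_std, volume_std)]
```

 Applied to four windows in which the first and last vary strongly and the middle two are nearly flat, the function marks only the first and last as lively, which corresponds to t1 and t4 in the figures.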
<Eleventh Embodiment>
 An eleventh embodiment of the present invention will be described with reference to FIG. 25. In outline, the system according to this embodiment performs its evaluation with an understanding of context, such as the situation and the setting. For example, even when a smile score is low, the evaluation should differ depending on whether the score was low because the parties were meeting for the first time or was low even though they are close friends.
 The system includes a context reading unit that reads context information and a correction unit that corrects the evaluation results according to that context information. The context information may include category information that classifies the context (for example, the situation, the number of conversations held so far, whether the parties are acquainted, and whether the conversation is one-way or two-way), together with the items to be corrected and the correction parameters.
 The system may accept the context category information from the user in advance by manual input, or may determine it automatically from the title or metadata of the moving image file or the like. The context information of the moving image is thereby identified, and applying the correction associated with the corresponding category makes it possible to provide an appropriate evaluation result.
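 The relationship between a context category, the items to be corrected, and the correction parameters can be sketched as a lookup table. The category names, the multiplicative form of the correction, and the parameter values in `CONTEXT_DB` are all hypothetical; the disclosure states only that each category carries items to correct and correction parameters.

```python
# Hypothetical context table: category -> {item to correct: multiplier}.
CONTEXT_DB = {
    # Smiles are assumed to be rarer at a first meeting, so weight them up.
    "first_meeting": {"happy": 1.5},
    "close_friends": {"happy": 1.0},
}


def apply_context(scores, category):
    """Correct evaluation results according to the context category.

    scores: mapping of item name to evaluation value in [0, 1].
    Items without a parameter, and unknown categories, are left unchanged.
    """
    params = CONTEXT_DB.get(category, {})
    return {item: min(1.0, value * params.get(item, 1.0))
            for item, value in scores.items()}
```

 The same raw smile score of 0.4 is thus raised for a first meeting and left as it is for close friends, so the two situations are evaluated differently.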
 The processes described in this specification with reference to flowcharts do not necessarily have to be executed in the order illustrated. Some processing steps may be executed in parallel. Additional processing steps may be adopted, and some processing steps may be omitted.
 The embodiments described above may be combined as appropriate. The effects described in this specification are merely explanatory or illustrative and are not limiting. In other words, the technology according to the present disclosure may produce other effects that are apparent to those skilled in the art from the description in this specification, in addition to or instead of the effects described above.
 10, 20   User terminal
 30   Video session service terminal
 40   Evaluation terminal

Claims (1)

  1.  A moving image analysis system comprising:
     acquisition means for acquiring a moving image relating to an online session between a first user and a second user;
     face recognition means for recognizing, for each predetermined frame, at least a face image of the first user included in the moving image;
     speech recognition means for recognizing at least speech of the second user included in the moving image;
     evaluation means for calculating an evaluation value from a predetermined viewpoint based on at least the recognized face image;
     keyword detection means for detecting at least a predetermined keyword in the recognized speech; and
     alert transmission means for transmitting a predetermined alert based on the evaluation value at the time the keyword is detected.

PCT/JP2021/007572 2021-02-26 2021-02-26 Video session evaluation terminal, video session evaluation system, and video session evaluation program WO2022180855A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023502015A JPWO2022180855A1 (en) 2021-02-26 2021-02-26
PCT/JP2021/007572 WO2022180855A1 (en) 2021-02-26 2021-02-26 Video session evaluation terminal, video session evaluation system, and video session evaluation program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/007572 WO2022180855A1 (en) 2021-02-26 2021-02-26 Video session evaluation terminal, video session evaluation system, and video session evaluation program

Publications (1)

Publication Number Publication Date
WO2022180855A1 true WO2022180855A1 (en) 2022-09-01

Family

ID=83048747

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/007572 WO2022180855A1 (en) 2021-02-26 2021-02-26 Video session evaluation terminal, video session evaluation system, and video session evaluation program

Country Status (2)

Country Link
JP (1) JPWO2022180855A1 (en)
WO (1) WO2022180855A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2018068618A (en) * 2016-10-28 2018-05-10 株式会社東芝 Emotion estimating device, emotion estimating method, emotion estimating program, and emotion counting system
JP2020123204A (en) * 2019-01-31 2020-08-13 株式会社日立システムズ Harmful act detection system and method

Also Published As

Publication number Publication date
JPWO2022180855A1 (en) 2022-09-01

Similar Documents

Publication Publication Date Title
WO2022180860A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168180A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168185A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
JP7152825B1 (en) VIDEO SESSION EVALUATION TERMINAL, VIDEO SESSION EVALUATION SYSTEM AND VIDEO SESSION EVALUATION PROGRAM
WO2022180855A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180858A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180854A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180861A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180859A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180853A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180856A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180857A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180862A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180852A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022249462A1 (en) Video analysis system
JP7197955B1 (en) Video meeting evaluation terminal
WO2022201272A1 (en) Video analysis program
JP7152819B1 (en) Video analysis program
WO2022168174A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022230155A1 (en) Video analysis system
WO2022230136A1 (en) Video analysis system
WO2022168184A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022264220A1 (en) Video analysis system
WO2022269801A1 (en) Video analysis system
WO2022201267A1 (en) Video analysis program

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 21927954
    Country of ref document: EP
    Kind code of ref document: A1

ENP Entry into the national phase
    Ref document number: 2023502015
    Country of ref document: JP
    Kind code of ref document: A

NENP Non-entry into the national phase
    Ref country code: DE

122 EP: PCT application non-entry in European phase
    Ref document number: 21927954
    Country of ref document: EP
    Kind code of ref document: A1