WO2022168176A1 - Video session evaluation terminal, video session evaluation system, and video session evaluation program - Google Patents

Video session evaluation terminal, video session evaluation system, and video session evaluation program

Info

Publication number
WO2022168176A1
Authority
WO
WIPO (PCT)
Prior art keywords
evaluation
video
moving image
evaluation value
video session
Application number
PCT/JP2021/003793
Other languages
French (fr)
Japanese (ja)
Inventor
渉三 神谷
Original Assignee
株式会社I’mbesideyou
Application filed by 株式会社I’mbesideyou
Priority to JP2022518708A (JPWO2022168176A1/ja)
Priority to PCT/JP2021/003793 (WO2022168176A1/en)
Publication of WO2022168176A1 (WO2022168176A1/en)

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00: Television systems
    • H04N 7/14: Systems for two-way working
    • H04N 7/15: Conference systems

Definitions

  • The present disclosure relates to a video session evaluation terminal, a video session evaluation system, and a video session evaluation program.
  • Conventionally, there is known a technique for analyzing the emotions that others receive in response to a speaker's remarks (see, for example, Patent Document 1). There is also known a technique for analyzing changes in a subject's facial expression over a long period in time series and estimating the emotions held during that period (see, for example, Patent Document 2). Furthermore, there are known techniques for identifying the factors that most affect changes in emotion (see, for example, Patent Documents 3 to 5), and a technique for comparing a subject's usual facial expression with the current facial expression and issuing an alert when the expression is dark (see, for example, Patent Document 6).
  • There is also known a technique for determining the degree of a subject's emotion by comparing the subject's normal (expressionless) facial expression with the current facial expression (see, for example, Patent Documents 7 to 9), as well as techniques for analyzing the feeling of an organization and the atmosphere within a group as perceived by an individual (see, for example, Patent Documents 10 and 11).
  • The purpose of the present invention is to objectively evaluate exchanged communication so that communication can be conducted more efficiently in situations where online communication is the primary mode.
  • According to one aspect of the present disclosure, there is obtained a line-of-sight evaluation system comprising: a camera unit that acquires a moving image obtained by photographing a target person; a line-of-sight acquisition unit that acquires movement of the target person's line of sight based on the acquired moving image; a display unit that sequentially displays a plurality of images to the target person; a position acquisition unit that acquires the positional relationship between the camera unit and the display unit; and an output unit that associates and outputs the line-of-sight movement for each of the plurality of displayed images.
  • According to another aspect, there is obtained a moving image analysis system comprising: acquisition means for acquiring at least a moving image; face recognition means for recognizing, for each predetermined frame, at least a face image of a target person included in the moving image; voice recognition means for recognizing at least the voice of the target person included in the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; output means for outputting the evaluation value as change information along a time series; and identifying means for referring to other change information relating to other moving images and identifying other moving images that contain the same pattern as a pattern extracted from the change information.
  • According to another aspect, there is obtained a moving image analysis system comprising: acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals; face recognition means for recognizing, for each predetermined frame, a user's face image included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice; storage means for storing the evaluation values as change information along a time series; detection means for detecting that the evaluation value from only one of the plurality of viewpoints has changed beyond a predetermined range; and peculiar frame acquisition means for acquiring a peculiar frame including the detected range.
  • According to another aspect, there is obtained a moving image analysis system comprising: acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals; face recognition means for recognizing, for each predetermined frame, a user's face image included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; facial expression evaluation means for calculating facial expression evaluation values from a plurality of viewpoints based on the recognized face image; voice evaluation means for calculating voice evaluation values from a plurality of viewpoints based on the recognized voice; facial expression/voice correlation evaluation means for evaluating the correlation between the facial expression evaluation value and the voice evaluation value of at least one of the users; detection means for detecting, based on the correlation, that the facial expression evaluation value and the voice evaluation value have changed beyond a predetermined range; and peculiar frame acquisition means for acquiring a peculiar frame including the detected range.
  • According to another aspect, there is obtained a moving image analysis system comprising: acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals; face recognition means for recognizing, for each predetermined frame, a user's face image included in the moving image; emotion evaluation means for analyzing the user's standard facial expression from the recognized face image and evaluating the degree of deviation from that standard facial expression; concentration evaluation means for evaluating at least the amount of eye movement or face movement of the user from the recognized face image; safety evaluation means for evaluating the user's feeling of anxiety from the recognized face image; and score generation means for generating a score based on two or more of the evaluations by the emotion evaluation means, the concentration evaluation means, and the safety evaluation means.
  • According to another aspect, there is obtained a moving image analysis system comprising: acquisition means for acquiring moving images of video sessions conducted with other terminals; face recognition means for recognizing, for each predetermined frame, at least a face image of a target person included in the moving images; target person identification means for identifying, from each of the moving images relating to the plurality of video sessions, target person frames in which the target person is recognized; and digest video generation means for generating a digest video by connecting a plurality of the target person frames.
  • According to another aspect, there is obtained a moving image analysis system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of a target person included in the moving image; evaluation means for evaluating at least the amount of eye movement and face movement of the user from the recognized face image; and score calculation means for calculating a score relating to the degree of concentration based on the evaluation.
  • According to another aspect, there is obtained a moving image analysis system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of the user included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the voice; and state evaluation means for evaluating, for the same user, the state of the user based on a first evaluation analysis value obtained by analyzing the evaluation values in a first period and a second evaluation analysis value obtained by analyzing the evaluation values in a second period longer than the first period.
  • According to another aspect, there is obtained a state evaluation system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of the target person included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the voice; and answer acquisition means for acquiring the user's answer information to question information created based on the plurality of viewpoints, the system evaluating the state of the user by comparing the evaluation values and the answer information.
  • According to another aspect, there is obtained a video session evaluation system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of the user included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the voice; annotation reception means for receiving an annotation on the evaluation values from the user; and display means for simultaneously displaying the evaluation values and the received annotation.
  • According to another aspect, there is obtained a video session evaluation system comprising: moving image acquisition means for acquiring a moving image of a sales video session conducted between a terminal of a person in charge on the sales side and a terminal of the person in charge at the sales destination; contract information acquisition means for acquiring contract conclusion information of the sales video session; face recognition means for recognizing, for each predetermined frame, the face image of at least one of the person in charge on the sales side and the person in charge at the sales destination included in the moving image; voice recognition means for recognizing the voice of at least one of the person in charge on the sales side and the person in charge at the sales destination included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice; model generation means for generating, using the evaluation values and the contract conclusion information as training data, a contract conclusion estimation model that estimates the contract conclusion rates of moving images of other sales video sessions as a plurality of ranks; and determination means for associating one of the plurality of ranks with a new sales video session using the model.
  • According to another aspect, there is obtained a video session evaluation system comprising: acquisition means for acquiring a moving image of a video session held between a lecturer terminal having a lecturer-side camera for capturing at least the face of a lecturer user, and a student terminal communicably connected to the lecturer terminal via a network, the student terminal having a face camera for capturing at least the face of a student and a hand camera for capturing the student's hands; hand action recognition means for recognizing, for each predetermined frame, at least the action of the student's hands in the moving image acquired from the hand camera; and estimation means for estimating the student's degree of comprehension based on the recognized hand action.
  • According to the present disclosure, exchanged communication can be objectively evaluated, enabling more efficient communication in situations where online communication is the primary mode of activity.
  • FIG. 1 is an example of a functional block diagram of an evaluation terminal according to an embodiment of the present invention.
  • FIG. 2 is a diagram showing functional configuration example 1 of the evaluation terminal according to the embodiment of the present invention.
  • FIG. 8 is a diagram showing functional configuration example 2 of the evaluation terminal according to the embodiment of the present invention.
  • FIG. 10 is a diagram showing functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention.
  • FIG. 7 is a screen display example according to functional configuration example 3 of FIG. 6.
  • FIG. 7 is another screen display example according to functional configuration example 3 of FIG. 6.
  • FIG. 12 is a diagram showing another configuration of functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention.
  • FIG. 12 is a diagram showing yet another configuration of functional configuration example 3 of the evaluation terminal according to the embodiment of the present invention.
  • FIG. 2 shows a heat map of the system according to the first embodiment of the present invention.
  • A diagram showing an image of the calibration of the system according to the first embodiment of the present invention.
  • FIG. 3 is a comparison of graphs of the system according to the second embodiment of the present invention.
  • FIG. 10 is a comparison diagram of another graph of the system according to the second embodiment of the present invention.
  • FIG. 3 shows a graph of the system according to the third embodiment of the present invention.
  • FIG. 10 shows another graph of the system according to the third embodiment of the present invention.
  • FIG. 11 shows another graph of the system according to the fourth embodiment of the present invention.
  • FIG. 22 is a diagram showing an image of system evaluation according to the twelfth embodiment of the present invention.
  • First, the contents of the embodiments of the present disclosure are listed and described.
  • The present disclosure has the following configurations.
  • [Item 1] A line-of-sight evaluation system comprising: a camera unit that acquires a moving image obtained by photographing a target person; a line-of-sight acquisition unit that acquires movement of the target person's line of sight based on the acquired moving image; a display unit that sequentially displays a plurality of images to the target person; a position acquisition unit that acquires the positional relationship between the camera unit and the display unit; and an output unit that associates and outputs the line-of-sight movement for each of the plurality of displayed images.
  • [Item 2] The line-of-sight evaluation system according to Item 1, wherein the output unit superimposes on the image a heat map indicating fixation time, generated based on the line-of-sight movement, and outputs the image.
  • [Item 3] The line-of-sight evaluation system according to Item 1, wherein the output unit further associates and outputs the line-of-sight movement of another target person to whom the same image was displayed.
  • [Item 4] The line-of-sight evaluation system according to Item 3, further comprising a peculiarity determination unit that determines whether the line-of-sight movement associated with the target person is peculiar compared with the line-of-sight movement associated with the other target person.
  • [Item 5] The line-of-sight evaluation system according to any one of Items 1 to 4, wherein the output unit associates and outputs, for each image, a leveled heat map obtained by leveling the line-of-sight movements of a plurality of target persons.
  • [Item 6] A moving image analysis system comprising: acquisition means for acquiring at least a moving image; face recognition means for recognizing, for each predetermined frame, at least a face image of a target person included in the moving image; voice recognition means for recognizing at least the voice of the target person included in the moving image; evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice; output means for outputting the evaluation value as change information along a time series; and identifying means for referring to other change information relating to other moving images and identifying other moving images that contain the same pattern as a pattern extracted from the change information.
  • [Item 7] The moving image analysis system according to Item 6, wherein the output means outputs the evaluation values as chronological graph information, and the identifying means receives from an analysis user a selection operation on a part of the graph information and identifies the corresponding frames of other moving images that contain the same graph pattern as the graph pattern of the selected part.
  • [Item 8] The moving image analysis system according to Item 6 or 7, wherein the identifying means identifies other moving images that include, in the same time period, the same pattern as the pattern extracted from the change information.
  • [Item 9] A moving image analysis system comprising: acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals; face recognition means for recognizing, for each predetermined frame, a user's face image included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice; storage means for storing the evaluation values as change information along a time series; detection means for detecting that the evaluation value from only one of the plurality of viewpoints has changed beyond a predetermined range; and peculiar frame acquisition means for acquiring a peculiar frame including the detected range.
  • [Item 10] The moving image analysis system according to Item 9, wherein the plurality of viewpoints include a first viewpoint and a second viewpoint associated with mutually contradictory attributes, and the detection means detects that the evaluation value from the first viewpoint and the evaluation value from the second viewpoint have deviated from each other beyond the predetermined range.
  • The moving image analysis system, wherein the detection means detects that the evaluation value of the one viewpoint changed beyond a predetermined range within a predetermined time after a first time point and, immediately thereafter, returned to substantially the same value as the evaluation value at the first time point.
  • [Item 15] A moving image analysis system comprising: acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals; face recognition means for recognizing, for each predetermined frame, a user's face image included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; facial expression evaluation means for calculating facial expression evaluation values from a plurality of viewpoints based on the recognized face image; voice evaluation means for calculating voice evaluation values from a plurality of viewpoints based on the recognized voice; facial expression/voice correlation evaluation means for evaluating the correlation between the facial expression evaluation value and the voice evaluation value of at least one of the users; detection means for detecting, based on the correlation, that the facial expression evaluation value and the voice evaluation value have changed beyond a predetermined range; and peculiar frame acquisition means for acquiring a peculiar frame including the detected range.
  • [Item 16] The moving image analysis system according to Item 15, further comprising attribute evaluation means for associating attributes corresponding to the facial expression evaluation value and the voice evaluation value, wherein the detection means detects that the attribute of the facial expression evaluation value and the attribute of the voice evaluation value are mutually contradictory.
  • [Item 17] The moving image analysis system according to Item 15 or 16, further comprising digest video generation means for generating a digest video by linking the plurality of peculiar frames acquired from the moving image.
  • The moving image analysis system according to any one of Items 15 to 18, wherein the video session is capable of sharing screen information displayed on the screen of one of the user terminals, the system further comprising shared screen output means for outputting at least the screen information corresponding to the shared peculiar frame.
  • A moving image analysis system comprising: acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals; face recognition means for recognizing, for each predetermined frame, a user's face image included in the moving image; emotion evaluation means for analyzing the user's standard facial expression from the recognized face image and evaluating the degree of deviation from that standard facial expression; concentration evaluation means for evaluating at least the amount of eye movement or face movement of the user from the recognized face image; safety evaluation means for evaluating the user's feeling of anxiety from the recognized face image; and score generation means for generating a score based on two or more of the evaluations by the emotion evaluation means, the concentration evaluation means, and the safety evaluation means.
  • A moving image analysis system comprising: acquisition means for acquiring moving images of video sessions conducted with other terminals; face recognition means for recognizing, for each predetermined frame, at least a face image of a target person included in the moving images; target person identification means for identifying, from each of the moving images relating to the plurality of video sessions, target person frames in which the target person is recognized; and digest video generation means for generating a digest video by connecting a plurality of the target person frames.
  • A moving image analysis system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of a target person included in the moving image; evaluation means for evaluating at least the amount of eye movement and face movement of the user from the recognized face image; and score calculation means for calculating a score relating to the degree of concentration based on the evaluation.
  • [Item 23] A moving image analysis system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of the user included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the voice; and state evaluation means for evaluating, for the same user, the state of the user based on a first evaluation analysis value obtained by analyzing the evaluation values in a first period and a second evaluation analysis value obtained by analyzing the evaluation values in a second period longer than the first period.
  • [Item 24] The moving image analysis system according to Item 23, further comprising: trend detection means for detecting a predetermined trend with respect to the second evaluation analysis value; and correction means for correcting the first evaluation analysis value according to the detected trend.
  • A state evaluation system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of the target person included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the voice; and answer acquisition means for acquiring the user's answer information to question information created based on the plurality of viewpoints, the system evaluating the state of the user by comparing the evaluation values and the answer information.
  • [Item 26] A video session evaluation system comprising: acquisition means for acquiring a moving image of a video session conducted with another terminal; face recognition means for recognizing, for each predetermined frame, at least a face image of the user included in the moving image; voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the voice; annotation reception means for receiving an annotation on the evaluation values from the user; and display means for simultaneously displaying the evaluation values and the received annotation.
  • [Item 29] A video session evaluation system comprising: moving image acquisition means for acquiring a moving image of a sales video session conducted between a terminal of a person in charge on the sales side and a terminal of the person in charge at the sales destination; contract information acquisition means for acquiring contract conclusion information of the sales video session; face recognition means for recognizing, for each predetermined frame, the face image of at least one of the person in charge on the sales side and the person in charge at the sales destination included in the moving image; voice recognition means for recognizing the voice of at least one of the person in charge on the sales side and the person in charge at the sales destination included in the moving image; evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice; model generation means for generating, using the evaluation values and the contract conclusion information as training data, a contract conclusion estimation model that estimates the contract conclusion rates of moving images of other sales video sessions as a plurality of ranks; and determination means for associating one of the plurality of ranks with a new sales video session using the model.
  • [Item 31] A video session evaluation system comprising: acquisition means for acquiring a moving image of a video session held between a lecturer terminal having a lecturer-side camera for capturing at least the face of a lecturer user, and a student terminal communicably connected to the lecturer terminal via a network, the student terminal having a face camera for capturing at least the face of a student and a hand camera for capturing the student's hands; hand action recognition means for recognizing, for each predetermined frame, at least the action of the student's hands in the moving image acquired from the hand camera; and estimation means for estimating the student's degree of comprehension based on the recognized hand action.
  • [Item 32] The video session evaluation system according to Item 31, further comprising: voice recognition means for recognizing at least the voice of the student included in the moving image; and evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the voice, wherein the estimation means estimates the student's degree of comprehension based on the hand action and the evaluation values.
  • [Item 33] The video session evaluation system according to Item 31 or 32, wherein the estimation means estimates the student's degree of comprehension according to the amount of the recognized hand movement.
  • [Item 34] The video session evaluation system according to any one of Items 31 to 33, further comprising alert means for issuing an alert when the student's face is not captured by the face camera and the movement of the student's hands is not recognized by the hand camera.
  • In an environment where a video session (hereinafter referred to as an online session, including both one-way and two-way sessions) is held by a plurality of people, this system analyzes and evaluates peculiar emotions of a person to be analyzed that differ from those of the others (feelings that arise in response to one's own or others' behavior, such as pleasure or displeasure, or the degree thereof).
  • Online sessions are, for example, online meetings, online classes, online chats, etc.
  • Terminals installed at multiple locations are connected to a server via a communication network such as the Internet, and moving images can be exchanged interactively between the terminals through the server.
  • Moving images also include images such as materials that are shared and viewed by a plurality of users. On the screen of each terminal, it is possible to switch between the face image and the material image to display only one of them, or to divide the display area and display the face image and the material image at the same time. It is also possible to display the image of one user among the plurality of users on the full screen, or to divide the images of some or all of the users into small screens and display them.
  • The leader, moderator, or manager of an online session (hereinafter collectively referred to as the organizer) designates any user as a person to be analyzed.
  • Hosts of online sessions are, for example, instructors of online classes, chairpersons and facilitators of online meetings, coaches of sessions for coaching purposes, and the like.
  • An online session host is typically one of the users participating in the online session, but may be another person who does not participate in the online session. It should be noted that all participants may be subject to analysis without specifying the person to be analyzed.
  • The video session evaluation system displays at least moving images acquired from a video session established between a plurality of terminals.
  • The displayed moving image is acquired by the terminal, and at least the face images included in the moving image are identified for each predetermined frame unit. An evaluation value is then calculated for each identified face image, and the evaluation value is shared as necessary.
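  • As a minimal sketch of this per-frame face identification step (OpenCV and its bundled Haar cascade detector are assumptions; the disclosure does not name any specific library or detection method):

```python
# Hedged sketch: detect faces once every FRAME_STEP frames of a session video.
# OpenCV and the Haar cascade are assumptions; any face detector would do.
import cv2

FRAME_STEP = 30  # "predetermined frame" interval, e.g. once per second at 30 fps

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def identify_faces(video_path):
    """Yield (frame_index, face_bounding_boxes) for every FRAME_STEP-th frame."""
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % FRAME_STEP == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
            yield index, faces
        index += 1
    cap.release()
```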
  • The acquired moving images are stored in the terminal, analyzed and evaluated on the terminal itself, and the results are provided to the user of that terminal. Therefore, even a video session containing personal information or confidential information can be analyzed and evaluated without providing the moving image itself to an external evaluation agency or the like, and only the evaluation result (evaluation value) need be shared as necessary.
  • The video session evaluation system includes the user terminals 10 and 20, each having at least an input unit such as a camera unit and a microphone unit, a display unit such as a display, and an output unit such as a speaker; a video session service terminal 30 for providing an interactive video session to the user terminals 10 and 20; and an evaluation terminal 40 for performing part of the evaluation of the video session.
  • FIG. 2 is a diagram showing a hardware configuration example of a computer that implements each of the terminals 10 to 40 according to this embodiment.
  • The computer includes at least a control unit 110, a memory 120, a storage 130, a communication unit 140, an input/output unit 150, and the like, which are electrically connected to one another through a bus 160.
  • The control unit 110 is an arithmetic device that controls the overall operation of each terminal, controls the transmission and reception of data between elements, executes applications, and performs the information processing necessary for authentication processing.
  • The control unit 110 is a processor such as a CPU, and executes each information process by running programs stored in the storage 130 and loaded into the memory 120.
  • The memory 120 includes a main memory made up of a volatile storage device such as DRAM, and an auxiliary memory made up of a non-volatile storage device such as flash memory or an HDD.
  • The memory 120 is used as a work area and the like for the control unit 110, and stores the BIOS executed when each terminal starts, various setting information, and the like.
  • The storage 130 stores various programs such as application programs.
  • A database storing the data used by each process may be constructed in the storage 130.
  • Moving images of the online session are not recorded in the storage 130 of the video session service terminal 30, but are stored in the storage 130 of the user terminal 10.
  • The evaluation terminal 40 stores the application and other programs necessary for evaluating the moving images acquired on the user terminal 10, and provides them as appropriate so that the user terminal 10 can use them.
  • The storage 130 managed by the evaluation terminal 40 may share, for example, only the results of the analysis and evaluation performed by the user terminal 10.
  • The communication unit 140 connects the terminal to the network.
  • The communication unit 140 communicates with external devices directly or via a network access point by means of, for example, a wired LAN, a wireless LAN, Wi-Fi (registered trademark), infrared communication, Bluetooth (registered trademark), or short-range/non-contact communication.
  • The input/output unit 150 includes, for example, information input devices such as a keyboard, a mouse, and a touch panel, and output devices such as a display.
  • The bus 160 is commonly connected to each of the above elements and transmits, for example, address signals, data signals, and various control signals.
  • The evaluation terminal acquires a moving image from the video session service terminal, identifies at least the face images included in the moving image for each predetermined frame unit, and calculates an evaluation value for each face image.
  • The video session service provided by the video session service terminal (hereinafter sometimes simply referred to as "this service") enables two-way image and voice communication between the user terminals 10 and 20.
  • With this service, a moving image captured by the camera of the other user's terminal is displayed on the display of a user's terminal, and the audio captured by the microphone of the other user's terminal can be output from its speaker.
  • This service is configured so that either or both of the user terminals can record the moving images and sounds (collectively referred to as "moving images, etc.") in the storage unit of at least one of the user terminals.
  • The recorded moving image information Vs (hereinafter referred to as "recorded information") is cached in the user terminal that started the recording and is recorded locally only in one of the user terminals. If necessary, the user can view the recorded information, or share it with others within the scope of use of this service.
  • The user terminal 10 acquires the recorded information and performs analysis and evaluation as described later.
  • The user terminal 10 evaluates the moving image acquired as described above by the following analysis.
  • FIG. 4 is a block diagram showing a configuration example according to this embodiment.
  • The video session evaluation system of this embodiment is realized as a functional configuration of the user terminal 10.
  • The user terminal 10 has, as its functions, a moving image acquisition unit 11, a biological reaction analysis unit 12, a peculiarity determination unit 13, a related event identification unit 14, a clustering unit 15, and an analysis result notification unit 16.
  • Each of the functional blocks 11 to 16 can be configured by any of hardware, a DSP (Digital Signal Processor), and software provided in the user terminal 10, for example.
  • In practice, each of the functional blocks 11 to 16 is realized by the operation of a computer CPU, RAM, ROM, and the like, together with a program stored in a recording medium such as RAM, ROM, a hard disk, or a semiconductor memory.
  • The moving image acquisition unit 11 acquires, from each terminal, the moving images obtained by photographing the plurality of people (the plurality of users) with the camera provided in each terminal during the online session. It does not matter whether a moving image acquired from a terminal is set to be displayed on the screens of the terminals; that is, the moving image acquisition unit 11 acquires moving images from each terminal, including both moving images being displayed and moving images not being displayed on each terminal.
  • The biological reaction analysis unit 12 analyzes changes in the biological reaction of each of the plurality of people based on the moving images acquired by the moving image acquisition unit 11 (whether or not they are being displayed on the screen).
  • The biological reaction analysis unit 12 separates each moving image acquired by the moving image acquisition unit 11 into a set of images (a collection of frame images) and a voice track, and analyzes changes in the biological reaction from each.
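  • As a sketch of this separation step (OpenCV for the frame images and an installed ffmpeg binary for the audio track are assumptions; the disclosure does not specify any tools):

```python
# Hedged sketch: split a recorded session into frame images and a WAV track.
# ffmpeg and OpenCV are assumptions; the paths and parameters are illustrative.
import subprocess
import cv2

def split_video(video_path, audio_out="voice.wav"):
    # Extract the audio track; 16 kHz mono WAV is convenient for speech analysis.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_out],
        check=True)
    # Collect the frame images.
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames, audio_out
```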
  • The biological reaction analysis unit 12 analyzes the user's face image in the frame images separated from the acquired moving image, and thereby analyzes changes in biological reactions related to at least one of facial expression, line of sight, pulse, and facial movement. It also analyzes the voice separated from the acquired moving image to analyze changes in biological reactions related to at least one of the user's utterance content and voice quality.
  • The biological reaction analysis unit 12 calculates a biological reaction index value reflecting a change in biological reaction by quantifying the change in biological reaction according to a predetermined standard.
  • Analysis of changes in facial expression is performed, for example, as follows. For each frame image, a facial region is identified, and the identified facial expression is classified into one of a plurality of types by an image analysis model trained in advance by machine learning. Based on the classification results, it is analyzed whether a positive or a negative facial expression change has occurred between consecutive frame images, and to what extent, and a facial expression change index value corresponding to the analysis result is output.
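  • As a sketch of how such a change could be quantified (the per-class probabilities are assumed to come from some pretrained expression classifier, and the valence grouping below is a placeholder; the disclosure only requires that expressions be classified by a machine-learned model and that the change be quantified):

```python
# Hedged sketch: derive a facial-expression-change index from per-class
# probabilities (e.g. {"happy": 0.7, "sad": 0.1, ...}) of consecutive frames.
POSITIVE = {"happy"}           # illustrative class groupings
NEGATIVE = {"sad", "angry"}

def expression_change_index(prev_probs, cur_probs):
    """Positive result: change toward positive expressions; negative: the
    reverse. The magnitude reflects how large the frame-to-frame change is."""
    def valence(probs):
        return (sum(probs.get(k, 0.0) for k in POSITIVE)
                - sum(probs.get(k, 0.0) for k in NEGATIVE))
    return valence(cur_probs) - valence(prev_probs)
```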
  • Analysis of changes in the line of sight is performed as follows. For each frame image, the eye regions are identified and the orientation of both eyes is analyzed to determine where the user is looking: for example, whether the user is looking at the face of the speaker being displayed, at the shared material being displayed, or outside the screen. It may also be analyzed whether the eye movements are large or small, and whether they are frequent or infrequent. Changes in the line of sight also relate to the user's degree of concentration.
  • The biological reaction analysis unit 12 outputs a line-of-sight change index value according to the analysis result of the line-of-sight change.
  • Analysis of pulse changes is performed, for example, as follows. For each frame image, the face area is identified. Then, using a trained image analysis model that captures numerical face color information (the G channel of RGB), changes in the G color of the face surface are analyzed. Arranging the results along the time axis forms a waveform representing the changes in color information, and the pulse is identified from this waveform. When a person is tense the pulse quickens, and when the person is calm it slows. The biological reaction analysis unit 12 outputs a pulse change index value according to the analysis result of the pulse change.
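  • As a sketch of this pulse estimation (a fixed 30 fps recording, NumPy, the frequency band, and the plain FFT peak are illustrative assumptions; the disclosure itself describes a trained model over the G-channel waveform):

```python
# Hedged sketch: estimate pulse (bpm) from the mean green-channel value of
# the detected face region across frames (OpenCV images are BGR, so index 1).
import numpy as np

def estimate_pulse_bpm(face_rois, fps=30.0):
    """face_rois: sequence of BGR face-region arrays, one per frame."""
    g_trace = np.array([roi[:, :, 1].mean() for roi in face_rois])
    g_trace -= g_trace.mean()                      # remove the DC component
    spectrum = np.abs(np.fft.rfft(g_trace))
    freqs = np.fft.rfftfreq(len(g_trace), d=1.0 / fps)
    band = (freqs >= 0.75) & (freqs <= 3.0)        # roughly 45-180 bpm
    peak = freqs[band][np.argmax(spectrum[band])]
    return peak * 60.0
```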
  • Analysis of changes in facial movement is performed as follows. For each frame image, the face area is identified and the orientation of the face is analyzed to determine where the user is looking: for example, at the face of the speaker being displayed, at the shared material being displayed, or outside the screen. It may also be analyzed whether the facial movements are large or small, and whether they are frequent or infrequent. The movement of the face and the movement of the line of sight may also be analyzed together; for example, it may be analyzed whether the user looks straight at the face of the displayed speaker, looks at it with an upward or downward glance, or looks at it obliquely.
  • The biological reaction analysis unit 12 outputs a face orientation change index value according to the analysis result of the change in face orientation.
  • For the utterance content, the biological reaction analysis unit 12 converts the voice into a character string by performing known speech recognition processing on the voice over a specified time (for example, about 30 to 150 seconds), and morphologically analyzes the character string to remove words, such as particles and articles, that are unnecessary for representing the conversation. The remaining words are vectorized, and it is analyzed whether a positive or a negative emotional change has occurred, and to what extent; an utterance content index value corresponding to the analysis result is output.
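  • As a toy sketch of the utterance-content index (the transcript is assumed to have already been produced by a speech recognizer, and the polarity lexicon and stopword list are stand-ins for real morphological analysis and sentiment models):

```python
# Hedged sketch: score a transcript's emotional polarity per word.
POLARITY = {"good": 1.0, "great": 1.0, "interesting": 0.5,
            "bad": -1.0, "boring": -0.5, "difficult": -0.5}   # illustrative only
STOPWORDS = {"the", "a", "an", "of", "to", "is"}  # stands in for particle removal

def utterance_content_index(transcript):
    words = [w for w in transcript.lower().split() if w not in STOPWORDS]
    if not words:
        return 0.0
    # Positive value: positive emotional change; magnitude: its extent.
    return sum(POLARITY.get(w, 0.0) for w in words) / len(words)
```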
  • Voice quality analysis is performed, for example, as follows. The biological reaction analysis unit 12 identifies acoustic features of the voice by performing known voice analysis processing on the voice over a specified time (for example, about 30 to 150 seconds). Based on the acoustic features, it analyzes whether a positive or a negative change in voice quality has occurred, and to what extent, and outputs a voice quality change index value according to the analysis result.
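  • As a sketch of this voice-quality step (librosa is an assumption, as are the choice of fundamental frequency and short-term energy as acoustic features and the change measure below):

```python
# Hedged sketch: extract simple acoustic features and a toy change index.
import librosa
import numpy as np

def voice_quality_features(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)   # pitch track
    rms = librosa.feature.rms(y=y)[0]               # short-term energy
    return {"f0_std": float(np.nanstd(f0)), "rms_mean": float(rms.mean())}

def voice_quality_change_index(prev, cur):
    # Relative shift in pitch variability and in energy between two windows.
    return ((cur["f0_std"] - prev["f0_std"]) / (prev["f0_std"] + 1e-9)
            + (cur["rms_mean"] - prev["rms_mean"]) / (prev["rms_mean"] + 1e-9))
```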
  • The biological reaction analysis unit 12 calculates the biological reaction index value using at least one of the facial expression change index value, line-of-sight change index value, pulse change index value, face orientation change index value, utterance content index value, and voice quality change index value calculated as described above.
  • For example, the biological reaction index value is calculated by weighting the facial expression change index value, line-of-sight change index value, pulse change index value, face orientation change index value, utterance content index value, and voice quality change index value.
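  • As a minimal sketch of that weighting (the weight values themselves are assumptions; the disclosure states only that the component index values are weighted):

```python
# Hedged sketch: combine the component change indices into one
# biological reaction index value. The weights are illustrative only.
WEIGHTS = {"expression": 0.3, "gaze": 0.15, "pulse": 0.15,
           "face_direction": 0.1, "utterance": 0.15, "voice_quality": 0.15}

def biological_reaction_index(index_values):
    """index_values: dict mapping component name -> its change index value."""
    return sum(WEIGHTS[k] * index_values.get(k, 0.0) for k in WEIGHTS)
```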
  • The peculiarity determination unit 13 determines whether the analyzed change in the biological reaction of the person to be analyzed is peculiar compared with the analyzed changes in the biological reactions of persons other than the person to be analyzed. In the present embodiment, the peculiarity determination unit 13 makes this determination based on the biological reaction index values calculated for each of the plurality of users by the biological reaction analysis unit 12.
  • For example, the peculiarity determination unit 13 calculates the variance of the biological reaction index values calculated for each of the plurality of persons by the biological reaction analysis unit 12, and determines whether the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with the others by comparing the biological reaction index value calculated for the person to be analyzed with the variance.
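  • As a sketch of one such variance-based test (the two-standard-deviation threshold is an assumption; the disclosure says only that the subject's index value is compared with the variance):

```python
# Hedged sketch: flag the person to be analyzed as peculiar when their index
# value lies far outside the spread of all participants' values.
import statistics

def is_peculiar(all_values, subject_value, n_sigma=2.0):
    mean = statistics.fmean(all_values)
    std = statistics.pstdev(all_values)
    return std > 0 and abs(subject_value - mean) > n_sigma * std
```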
  • The following three patterns are conceivable as cases where the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with the others.
  • The first is a case where a relatively large change in biological reaction occurs in the person to be analyzed even though no particularly large change has occurred in the others.
  • The second is a case where no particularly large change in biological reaction has occurred in the person to be analyzed, but a relatively large change has occurred in the others.
  • The third is a case where a relatively large change in biological reaction occurs in both the person to be analyzed and the others, but the content of the change differs between them.
  • The related event identification unit 14 identifies an event occurring in relation to at least one of the person to be analyzed, the others, and the environment when the change in biological reaction determined to be peculiar by the peculiarity determination unit 13 occurred.
  • For example, the related event identification unit 14 identifies from the moving image the speech and behavior of the person to be analyzed when the peculiar change in biological reaction occurred in that person.
  • The related event identification unit 14 also identifies from the moving image the speech and behavior of the others when the peculiar change in the biological reaction of the person to be analyzed occurred.
  • The related event identification unit 14 also identifies from the moving image the environment in which the peculiar change in the biological reaction of the person to be analyzed occurred.
  • The environment is, for example, the shared material being displayed on the screen, the background image of the person to be analyzed, and the like.
  • The clustering unit 15 analyzes the degree of correlation between the change in biological reaction determined to be peculiar by the peculiarity determination unit 13 (for example, one or a combination of line of sight, pulse, facial movement, utterance content, and voice quality) and the event identified by the related event identification unit 14 as occurring when that change occurred. If the correlation is determined to be at or above a certain level, the unit clusters the person to be analyzed or the event based on the correlation analysis results.
  • Specifically, the clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-defined categories according to the content of the event, the degree of negativity, the magnitude of the correlation, and the like.
  • Likewise, the clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-defined classifications according to the content of the event, the degree of positivity, the degree of correlation, and the like.
  • The analysis result notification unit 16 notifies the designator of the person to be analyzed (the person to be analyzed himself or herself, or the organizer of the online session) of at least one of the change in biological reaction determined to be peculiar by the peculiarity determination unit 13, the event identified by the related event identification unit 14, and the classification produced by the clustering unit 15.
  • For example, when a peculiar change in biological reaction that differs from that of the others occurs in the person to be analyzed (any of the three patterns described above; the same applies hereinafter), the analysis result notification unit 16 notifies the person to be analyzed of his or her own speech and behavior at that time. This allows the person to be analyzed to understand that he or she has feelings different from those of others when performing a certain behavior. At this time, the person may also be notified of the peculiar change in biological reaction identified for him or her, and may further be notified of the changes in the biological reactions of the others used for comparison.
  • For example, when the emotions that others received in response to speech or behavior that the person to be analyzed performed without particular awareness of his or her usual feelings, or performed consciously with a certain feeling, differ from the feelings the person to be analyzed held at the time, the person to be analyzed is notified of his or her own speech and behavior at that time.
  • For example, the analysis result notification unit 16 notifies the organizer of the online session of the event that occurred when the peculiar change in biological reaction, different from that of the others, occurred in the person to be analyzed, together with the peculiar change itself.
  • This allows the organizer of the online session to know what kind of event affects what kind of emotional change as a phenomenon peculiar to the designated person to be analyzed, and makes it possible to treat that person appropriately according to what has been grasped.
  • For example, the analysis result notification unit 16 notifies the organizer of the online session of the peculiar change in biological reaction that occurred in the person to be analyzed and differs from that of the others, or of the clustering result for the person to be analyzed.
  • Depending on the classification into which the designated person to be analyzed has been clustered, the organizer of the online session can grasp behavioral tendencies peculiar to that person and predict possible future behaviors and situations, and can then take appropriate measures for that person.
  • As described above, in this embodiment the biological reaction index value is calculated by quantifying the change in biological reaction according to a predetermined standard, and whether the change of the person to be analyzed is peculiar is determined based on the biological reaction index values calculated for each of the plurality of people.
  • The biological reaction analysis unit 12 may also analyze the movement of the line of sight for each of the plurality of people and generate a heat map indicating the direction of the line of sight.
  • The peculiarity determination unit 13 then determines whether the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with that of the others by comparing the heat map generated for the person to be analyzed by the biological reaction analysis unit 12 with the heat maps generated for the others.
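  • As a sketch of such a comparison (cosine similarity and the threshold are assumptions; the disclosure says only that the heat maps are compared):

```python
# Hedged sketch: low similarity between the subject's gaze heat map and the
# others' (e.g. averaged) heat map is treated as a peculiar gaze pattern.
import numpy as np

def heatmap_similarity(subject_map, others_map):
    a = subject_map.ravel().astype(float)
    b = others_map.ravel().astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def gaze_is_peculiar(subject_map, others_map, threshold=0.5):
    return heatmap_similarity(subject_map, others_map) < threshold
```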
  • In this embodiment, the moving images of a video session are stored in the local storage of the user terminal 10, and the above analysis is performed on the user terminal 10.
  • Although this depends on the machine specifications of the user terminal 10, it thereby becomes possible to analyze the moving image information without providing it to the outside.
  • The video session evaluation system of this embodiment may include, as functional configurations, a moving image acquisition unit 11, a biological reaction analysis unit 12, and a reaction information presentation unit 13a.
  • The reaction information presentation unit 13a presents information indicating the changes in biological reactions analyzed by the biological reaction analysis unit 12a, including those of participants not displayed on the screen.
  • The reaction information presentation unit 13a presents the information indicating changes in biological reactions to the organizer of the online session.
  • Hosts of online sessions are, for example, instructors of online classes, chairpersons and facilitators of online meetings, coaches of sessions for coaching purposes, and the like.
  • An online session host is typically one of the users participating in the online session, but may be another person who does not participate in the online session.
  • This allows the organizer of the online session to grasp the states even of participants who are not displayed on the screen in an environment where the online session is held by multiple people.
  • FIG. 6 is a block diagram showing a configuration example according to this embodiment. As shown in FIG. 6, in the video session evaluation system of the present embodiment, functions similar to those of the above-described first embodiment are given the same reference numerals, and explanations thereof may be omitted.
  • The system includes a camera unit that acquires the images of a video session, a microphone unit that acquires the audio, an analysis unit that analyzes and evaluates the moving images, an object generation unit that generates a display object (described below) based on the information obtained by evaluating the acquired moving images, and a display unit that displays both the moving image of the video session and the display object during execution of the video session.
  • The analysis unit includes the moving image acquisition unit 11, the biological reaction analysis unit 12, the peculiarity determination unit 13, the related event identification unit 14, the clustering unit 15, and the analysis result notification unit 16, as described above.
  • The function of each element is as described above.
  • the object generation unit generates an object 50 representing the recognized face part and the above-mentioned Information 100 indicating the content of the analysis/evaluation performed is superimposed on the moving image and displayed.
  • the object 50 may identify and display all faces of a plurality of persons when the faces of the plurality of persons are moved in the moving image.
  • the object 50 is, for example, when the camera function of the video session is stopped at the other party's terminal (that is, it is stopped by software within the application of the video session instead of physically covering the camera). If the other party's face is recognized by the other party's camera, the object 50 or the object 100 may be displayed in the part where the other party's face is located. This makes it possible for both parties to confirm that the other party is in front of the terminal even if the camera function is turned off. In this case, for example, in a video session application, the information obtained from the camera may be hidden while only the object 50 or object 100 corresponding to the face recognized by the analysis unit is displayed. Also, the video information acquired from the video session and the information recognized by the analysis unit may be divided into different display layers, and the layer relating to the former information may be hidden.
• The objects 50 and 100 may be displayed in all areas or only in some areas. For example, as shown in FIG. 8, they may be displayed only on the moving image on the guest side.
• The device described in this specification may be realized as a single device, or may be realized by a plurality of devices (for example, cloud servers), all or part of which are connected via a network.
  • the control unit 110 and the storage 130 of each terminal 10 may be realized by different servers connected to each other via a network.
• The system includes user terminals 10 and 20, a video session service terminal 30 for providing an interactive video session to the user terminals 10 and 20, and an evaluation terminal 40 for evaluating the video session.
• Various combinations of the following configurations are conceivable.
• (1) Processing everything on the user terminal alone: As shown in FIG. 9, by performing the analysis unit's processing on the terminal that is conducting the video session (although a certain processing capacity is required), analysis/evaluation results can be obtained at the same time as (in real time with) the video session.
  • an analysis unit may be provided in an evaluation terminal connected via a network or the like.
• In this case, the moving image acquired by the user terminal is shared with the evaluation terminal at the same time as, or after, the video session; after it is analyzed and evaluated by the analysis unit in the evaluation terminal, the information of the object 50 and the information 100 (that is, information including at least the analysis data) is shared with the user terminal together with, or separately from, the moving image data, and is displayed on the display unit.
• A first embodiment of the present invention will be described with reference to FIGS. 11 and 12.
• The system according to the present embodiment analyzes and evaluates which parts of the displayed material were watched, and for how long, based on information about where on the screen the person to be evaluated is gazing and on the material displayed at that time.
  • the system according to the present embodiment includes camera means for acquiring a moving image obtained by photographing the person to be evaluated, line-of-sight acquiring means for acquiring the eye movement of the subject based on the acquired moving image, and display means for sequentially displaying a plurality of images to the subject.
  • this system has position acquisition means for acquiring the positional relationship between the camera means and the display means.
  • This makes it possible to calibrate the subject's eye movement and gaze point.
• First, the state of the subject's eyes is acquired by the camera unit of the display, and the subject is then made to gaze at predetermined places on the screen (calibration points: the center, the four corners of the screen, etc.) so that the corresponding eye movements can be acquired.
• For example, it is possible to have the user intentionally look at the calibration points by playing an announcement on the screen.
• Alternatively, a conspicuous mark may be displayed in an eye-catching manner only at the center, and the eye movement at that moment may be treated as the state of gazing at the center.
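As a rough sketch of this calibration step (not the patent's specified method), one could fit a least-squares affine map from already-extracted pupil coordinates to the known screen positions of the calibration points. All function names and sample values below are hypothetical:

```python
import numpy as np

def fit_gaze_mapping(pupil_points, screen_points):
    """Fit an affine map (pupil coords -> screen coords) from calibration samples.

    pupil_points: (N, 2) pupil positions recorded while the subject gazed at
    known targets; screen_points: (N, 2) screen coordinates of those targets
    (center, four corners, etc.). Solved by least squares.
    """
    pupil = np.asarray(pupil_points, dtype=float)
    screen = np.asarray(screen_points, dtype=float)
    X = np.hstack([pupil, np.ones((len(pupil), 1))])  # add a bias column
    A, *_ = np.linalg.lstsq(X, screen, rcond=None)
    return A  # (3, 2) affine parameters

def pupil_to_screen(A, pupil_xy):
    x, y = pupil_xy
    return np.array([x, y, 1.0]) @ A

# Five calibration targets: center and four corners of a 1920x1080 screen,
# paired with hypothetical pupil positions captured at each target.
targets = [(960, 540), (0, 0), (1920, 0), (0, 1080), (1920, 1080)]
pupils = [(0.51, 0.50), (0.32, 0.35), (0.70, 0.34), (0.33, 0.66), (0.71, 0.67)]
A = fit_gaze_mapping(pupils, targets)
print(pupil_to_screen(A, (0.51, 0.50)))  # approximately (960, 540)
```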
• The acquired gaze points are associated with their gaze times on the (shared) material displayed on the screen and output in the form of a heat map. As a result, it is possible to grasp on which parts of the material the gaze stopped and for how long, so the subject's areas of interest can be understood.
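A minimal sketch of how such a heat map might be accumulated, assuming gaze samples have already been calibrated to screen coordinates and arrive at a fixed sampling rate (the cell size and rate are illustrative assumptions):

```python
import numpy as np

def gaze_heatmap(gaze_xy, width, height, cell=40, dt=1 / 30):
    """Accumulate per-cell gaze time (seconds) from screen-coordinate samples.

    gaze_xy: iterable of (x, y) gaze points sampled every dt seconds.
    Returns a (rows, cols) array where heat[r, c] is seconds spent in that cell.
    """
    rows, cols = height // cell + 1, width // cell + 1
    heat = np.zeros((rows, cols))
    for x, y in gaze_xy:
        if 0 <= x < width and 0 <= y < height:
            heat[int(y) // cell, int(x) // cell] += dt
    return heat

# 3 s of gaze on one spot and 1 s on another, at 30 samples per second.
samples = [(100, 120)] * 90 + [(800, 400)] * 30
heat = gaze_heatmap(samples, width=1920, height=1080)
print(heat.max(), heat.sum())  # approximately 3.0 s peak, 4.0 s total
```

Dividing the accumulated map by the total viewing time would give a normalized heat map of the kind mentioned below for comparing materials.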
• The system according to the present embodiment may also generate, for the same material, a heat map that takes into account the gaze movements of other target persons (other clients, other students, etc.).
  • the points of gaze of other subjects may also be displayed on the material.
• It is also possible to output features peculiar to the subject, such as a part that the subject does not gaze at even though other subjects are gazing at it, or a part that the subject gazes at even though other subjects are not.
• Furthermore, each material may be associated with, and output together with, a normalized heat map obtained by normalizing the subject's eye movements. For example, the necessity of each material can be grasped from the viewpoint of which materials were looked at closely; conversely, it can be seen that materials with short fixation times are not needed as much.
• A second embodiment of the present invention will be described with reference to FIGS. 13 and 14.
• The system according to the present embodiment visualizes, as a graph, the evaluation values analyzed and evaluated based on the facial expressions and voices described above, and extracts other subjects who show the same pattern as the pattern read from the graph.
• The system according to the present embodiment recognizes the facial image and voice of the subject included in the acquired moving image, calculates evaluation values, and outputs them as chronological change information (for example, as the graph shown in FIG. 13).
• Here, the value "safety", which indicates a sense of security, will be described as an example.
• The illustrated graph plots time on the horizontal axis and the evaluation value indicating the degree of security on the vertical axis. It can be seen that graph (A), representing subject A, shows a large drop in value from time t1 to t2. Such a graph appears, for example, when A's feeling of unease or fear is detected by judging facial expressions and voice in combination.
• The graph (B), representing subject B, also shows a large decrease in values between times t1 and t2. Such a graph appears, for example, when B's feeling of unease or fear is detected by judging facial expressions and voice in combination.
  • the system extracts information (B) that includes the same pattern by referring to changes from time t1 to t2 in (A).
• Alternatively, a selection operation on part of the original graph information may be received from the analyst (for example, selecting time t1 to t2 in (A) of FIG. 17), and corresponding frames of other moving images containing the same graph pattern as that of the selected portion may be specified.
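One plausible realization of this same-pattern search (a sketch, not the patent's defined algorithm) is to slide a z-normalized window over other subjects' evaluation-value series so that only the shape of the change is compared; the distance threshold is an assumption:

```python
import numpy as np

def znorm(x):
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def find_same_pattern(query, series, threshold=0.5):
    """Return start indices in `series` whose window matches `query`'s shape.

    Both windows are z-normalized, so only the pattern (shape), not the
    absolute level, is compared; smaller mean squared distance = closer match.
    """
    q = znorm(query)
    m, hits = len(q), []
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        if np.mean((w - q) ** 2) < threshold:
            hits.append(i)
    return hits

# Subject A's "safety" drop from t1 to t2, searched for in subject B's series.
a_segment = [0.8, 0.7, 0.4, 0.2, 0.3]
b_series = [0.6, 0.9, 0.8, 0.5, 0.3, 0.4, 0.7]
print(find_same_pattern(a_segment, b_series))  # [1]
```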
• A third embodiment of the present invention will be described with reference to FIGS. 15 and 16.
• The system according to the present embodiment detects momentary, significant changes in the above-described facial expression and voice evaluation values that exceed a predetermined threshold.
  • it is possible to analyze the subject's deep psychology by detecting a significant change in only one of the evaluations from a plurality of viewpoints. Efficient analysis can also be performed by extracting and evaluating a moving image in which such a change has occurred.
• The system cuts out a portion L1 of predetermined length before and after the change at t1, and a portion L2 of predetermined length before and after the change at t2, and joins them to generate a digest movie. This makes it possible to extract a moving image that includes the moments when deep psychology appears.
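As a sketch of this cutting and joining step, the following uses the moviepy library, assuming the change times have already been detected in seconds; the padding length and file names are illustrative:

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def make_digest(video_path, change_times, pad=5.0, out_path="digest.mp4"):
    """Cut a window of `pad` seconds before/after each detected change
    and join the pieces into a single digest movie."""
    clip = VideoFileClip(video_path)
    parts = []
    for t in change_times:
        start = max(0.0, t - pad)
        end = min(clip.duration, t + pad)
        parts.append(clip.subclip(start, end))  # e.g. L1 around t1, L2 around t2
    concatenate_videoclips(parts).write_videofile(out_path, audio=True)

# Hypothetical usage: peculiar changes detected at 62 s and 341 s.
# make_digest("session.mp4", [62.0, 341.0])
```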
• The system according to the present embodiment may also detect that two mutually different graphs change greatly at the same instant. For example, as shown in FIG. 16, for evaluation values associated with mutually contradictory characteristics such as happy and sad, happy momentarily decreases and sad momentarily increases at time t1. When one evaluation value momentarily decreases and the opposite evaluation value momentarily increases in this way, the emotion that has increased is often the true emotion.
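Detecting such a simultaneous reversal might look like the following sketch, which flags frames where one value drops sharply while its contradictory counterpart rises; the threshold is an assumed parameter:

```python
def opposite_spikes(happy, sad, threshold=0.3):
    """Return indices where `happy` falls and `sad` rises sharply at once.

    happy, sad: per-frame evaluation values in [0, 1]. A simultaneous
    drop/rise larger than `threshold` is flagged as a candidate moment
    where the rising emotion may be the true one.
    """
    hits = []
    for i in range(1, len(happy)):
        if happy[i] - happy[i - 1] < -threshold and sad[i] - sad[i - 1] > threshold:
            hits.append(i)
    return hits

happy = [0.7, 0.7, 0.2, 0.6, 0.7]
sad = [0.2, 0.2, 0.8, 0.3, 0.2]
print(opposite_spikes(happy, sad))  # [2] (a momentary reversal at t1)
```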
• In addition, a digest video may be generated by concatenating multiple frames (peculiar frames) acquired from within the video.
• The speech corresponding to the peculiar frame may be converted into text and output. If the screen information displayed on the screen of a user terminal can be shared, the screen information corresponding to the peculiar frame may be output when the momentary change described above occurs.
  • the system associates facial expression evaluation values and voice evaluation values with attributes in advance, and detects changes beyond a predetermined range based on the correlation between the attributes. For example, a positive label is associated with the words “thank you” and “well understood”, and a correlation with facial expression evaluation (happy, sad, safety) is defined in advance.
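A minimal sketch of checking that pre-defined correlation, assuming recognized utterances are already time-aligned with per-frame expression scores; the word list, expected-score mapping, and tolerance are all illustrative:

```python
# Pre-defined attribute labels for recognized words (illustrative subset).
WORD_LABELS = {"thank you": +1, "well understood": +1, "difficult": -1}

def label_mismatches(utterances, happy_scores, tolerance=0.4):
    """Flag moments where spoken words and facial expression disagree.

    utterances: list of (frame_index, text); happy_scores: per-frame values
    in [0, 1]. A positive word paired with a low happy score (or vice
    versa) beyond `tolerance` suggests the expression contradicts the words.
    """
    flags = []
    for frame, text in utterances:
        label = WORD_LABELS.get(text)
        if label is None:
            continue
        expected = 0.5 + 0.5 * label  # positive word -> expected happy near 1.0
        if abs(happy_scores[frame] - expected) > tolerance:
            flags.append((frame, text, happy_scores[frame]))
    return flags

happy = [0.9, 0.2, 0.8]
print(label_mismatches([(1, "thank you")], happy))  # [(1, 'thank you', 0.2)]
```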
  • the system shown in FIG. 18 can share screen information displayed on the screen.
  • the content of the screen, the text information, and the graph information indicating the emotion may be associated with each other.
• Evaluation is performed from three perspectives: emotion, which evaluates the degree of divergence from the user's standard facial expression; concentration, which evaluates at least the amount of eye movement or facial movement of the user from the recognized face image; and safety, which evaluates the user's feelings related to anxiety from the recognized face image.
• The evaluation may be performed using a learner that has learned each of these points of view, or by other methods.
  • a score is generated based on two or more evaluations from each evaluated aspect.
• A system according to a sixth embodiment of the present invention will now be described with reference to FIGS. 19 and 20.
  • the system according to the present embodiment identifies, for example, a video in which a specific target person is shown from a plurality of business videos, lecture videos, and the like. This makes it possible to focus on and evaluate a specific person from various online sessions.
• In this case, a digest video may be generated by cutting out only the portions in which the target person appears in each video. For example, among the moving images of lectures 001 to 004 shown in the figure, if the moving images in which the subject appears are lectures 001, 002, and 004, the system extracts the corresponding portions t1, t2, and t3 from the respective moving images. The extracted portions can then be reproduced as a digest moving image.
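A sketch of how videos containing a specific person might be identified, here using the open-source face_recognition library against a reference photo of the subject; the library choice and sampling interval are assumptions, not the patent's prescription:

```python
import cv2
import face_recognition

def videos_with_subject(reference_photo, video_paths, step_s=5.0):
    """Return the subset of videos in which the reference face appears,
    sampling one frame every `step_s` seconds to keep the scan cheap."""
    ref = face_recognition.face_encodings(
        face_recognition.load_image_file(reference_photo))[0]
    matched = []
    for path in video_paths:
        cap = cv2.VideoCapture(path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        step, frame_no, found = int(fps * step_s), 0, False
        while not found:
            cap.set(cv2.CAP_PROP_POS_FRAMES, frame_no)
            ok, frame = cap.read()
            if not ok:
                break
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            for enc in face_recognition.face_encodings(rgb):
                if face_recognition.compare_faces([ref], enc)[0]:
                    found = True
                    break
            frame_no += step
        cap.release()
        if found:
            matched.append(path)
    return matched

# Hypothetical usage over the lecture recordings mentioned above:
# print(videos_with_subject("subject.jpg", ["lecture001.mp4", "lecture002.mp4"]))
```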
• A system according to a seventh embodiment of the present invention will now be described.
• The system according to this embodiment calculates a so-called concentration score (degree of concentration) for a subject participating in an online session. Online, particularly in webinar format, the audience's cameras are often turned off. According to this system, it becomes possible in such cases to quantitatively determine how much each attendee is concentrating on the lecture.
• The system recognizes the face image captured by the camera during the session (whether or not the camera image is shared with the other party) and evaluates the amounts of eye movement and face movement of the subject, respectively.
• The amounts by which the face and the eyes have moved from their initial positions are evaluated as absolute values.
• For example, if the face is not moving but the eyes are moving in various directions, it can be inferred that the subject is reading the material.
• The degree of concentration is grasped as these two patterns: in the former case, it can be inferred that the subject is gazing at the speaker's face and listening closely to the talk, and in the latter case, that the subject is concentrating on reading the displayed material.
• A score related to the degree of concentration may be calculated based on the degree of movement (the Value axis in the graph shown).
  • the degree of concentration may be 100 when both face and eyes are 0, and may be 0 when both are the maximum values.
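That mapping can be written down directly. The sketch below normalizes each movement amount by its observed maximum and weights face and eye movement equally; the equal weighting is an assumption, while the endpoints (100 at no movement, 0 at maximal movement) follow the description above:

```python
def concentration_score(face_movement, eye_movement, face_max, eye_max):
    """Map movement amounts to a 0-100 concentration score: zero movement
    of face and eyes gives 100, maximal movement of both gives 0."""
    f = min(face_movement / face_max, 1.0) if face_max else 0.0
    e = min(eye_movement / eye_max, 1.0) if eye_max else 0.0
    return 100.0 * (1.0 - (f + e) / 2.0)

print(concentration_score(0.0, 0.0, 10.0, 5.0))   # 100.0
print(concentration_score(10.0, 5.0, 10.0, 5.0))  # 0.0
print(concentration_score(0.0, 4.0, 10.0, 5.0))   # approx. 60 (still face, busy eyes)
```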
• A system according to an eighth embodiment of the present invention will now be described with reference to FIG. 21.
• The system according to this embodiment aims to obtain a true evaluation by performing the evaluations of the systems described in the first to seventh embodiments above over different time spans, thereby correcting for seasonal and temporal factors.
• Specifically, for the same user, the present system evaluates the person's condition based on a short-term evaluation value (analysis value) obtained by analyzing the evaluation values over a short period and a long-term evaluation value (analysis value) obtained by analyzing the evaluation values over a long period. For example, a student's evaluation values for one year (long-term evaluation value) and for each month (short-term evaluation values) may be analyzed.
• From the long-term evaluation value, the subject's long-term characteristics can be analyzed.
• From the short-term evaluation value, it is possible to analyze the subject's short-term characteristics, such as feeling depressed at the end of the month or having a bright expression on Fridays.
• The above-mentioned long term is, for example, a cycle of three months, six months, or one year, and the short term is, for example, a cycle of one day, one week, or one month; however, they are not limited to these.
  • the evaluation in this case may employ the average value or the median value of the evaluation values, or may calculate an appropriate value using various statistical techniques.
• The trend of the evaluation values described above is analyzed and, for example, when it is judged that the subject tends to be depressed at the end of the month, the happiness score observed at the end of the month may be corrected by multiplying it by a predetermined coefficient. That is, when an evaluation deviating from the trend occurs, the true emotion can be analyzed by weighting that evaluation more heavily than evaluations on the trend.
• For example, suppose that, for a subject whose happiness trend is indicated by the solid line, the evaluation value at the end of February is P1. Since the happy score is normally low at the end of the month, it can be seen that P1 deviates from the trend.
• The correction may be made by adding or subtracting the positive or negative deviation from the trend, or by any other method.
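As a sketch of the additive variant just mentioned, the habitual trend deviation is simply added back to the short-term observation; the numbers are hypothetical:

```python
def correct_by_trend(observed, expected_from_trend, baseline):
    """Correct a short-term score using the subject's long-term trend.

    If the trend says scores at this point in the cycle normally sit at
    `expected_from_trend` rather than `baseline`, the habitual dip (or
    bump) is removed by adding back the trend deviation.
    """
    return observed + (baseline - expected_from_trend)

# The solid-line trend says this subject's happy score drops to 0.4 at the
# end of each month against a baseline of 0.7; the observed end-of-February
# value P1 = 0.65 is therefore unusually high once the habitual dip is removed.
p1 = 0.65
print(correct_by_trend(p1, expected_from_trend=0.4, baseline=0.7))  # approx. 0.95
```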
• The system according to the present embodiment acquires subjective responses (questionnaires, interviews, etc.) from subjects in advance regarding a certain theme, and compares them with the acquired evaluation values. As a result, for example, even if a subjective answer such as "I am not dissatisfied with the lecture" is obtained in a questionnaire, if the evaluation of the degree of happiness is low when the actual facial expressions and voice are analyzed, it can be understood that the subject is holding back to some degree in the answer.
  • the system according to this embodiment is particularly suitable in the field of employee awareness surveys from companies.
• Questionnaires given to the subjects may include, for example, questions about happiness (e.g., job satisfaction at the company, openness, etc.), questions about anxiety (e.g., troubles, fears, etc.), and questions about future safety (e.g., career path, job security, promotion, salary increase, etc.), and the answers may be compared with the evaluation values for happiness, anxiety, and safety obtained from the subject's facial expressions and voice.
• A tenth embodiment of the present invention will be described with reference to FIG. 22.
• The system according to the present embodiment accepts, after the fact, labeling (annotation) by the subject of the situation at the time, with respect to the evaluation values obtained from the subject's facial expressions and voice.
  • the evaluation value can be subjectively evaluated after the fact, and the algorithm can be updated by feeding back the evaluation result.
• Labeling may be performed for each time period, and the content of the received labels may be displayed superimposed on the evaluation values.
• Examples of such annotations include "the evaluation value is correct/not correct", "the situation at that time", and "values based on the subject's own standards".
• The system according to this embodiment relates to a sales video session conducted between the sales-side terminal of a sales-side representative and the client-side terminal of the person in charge at the sales client.
• Conventionally, sales representatives have made predictions about the closing rate based on their own sales interviews and the like, and on their experience.
• With this system, by analyzing sales video sessions, the closing rate can be estimated by machine learning and statistical processing based on data from past sales video sessions and their sales results.
• This system includes: closing information acquisition means for acquiring the closing information of past sales video sessions; face recognition means for recognizing, for each predetermined frame, the face image of at least one of the sales-side representative and the client-side representative included in the moving image of the sales video session; voice recognition means for recognizing the voice of at least one of the sales-side representative and the client-side representative included in the moving image; and evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face images and voices.
• Further, the present system includes model generation means for generating a closing estimation model that estimates the closing rate of the moving images of other sales video sessions, using the evaluation values and the closing information as training data, and uses the model to determine closing rates for new sales video sessions.
  • the closing rate may be, for example, numerical values such as 50% and 70%, or ranks (zones) such as A, B, and C.
• The closing rate may also be calculated based on the degree of similarity with moving images of deals that were closed in the past. For example, if the similarity between the video of a new sales video session and the video of a deal closed with the same client (or a similar client) in the past is 70%, the closing rate of the new deal may also be determined to be 70%.
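A sketch of the closing estimation model using scikit-learn, assuming each past session has been summarized into a fixed-length vector of evaluation values and labeled with its closing rank; the feature layout and the choice of a random forest are illustrative, not the patent's specification:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: [mean_happy, mean_sad, mean_safety] for the sales side followed
# by the client side, summarized from one past session's per-frame values.
X_train = np.array([
    [0.8, 0.1, 0.7, 0.7, 0.2, 0.8],
    [0.3, 0.5, 0.2, 0.4, 0.4, 0.3],
    [0.6, 0.2, 0.6, 0.5, 0.3, 0.6],
    [0.2, 0.6, 0.3, 0.3, 0.5, 0.2],
])
y_train = ["A", "C", "B", "C"]  # closing ranks taken from past deal outcomes

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Rank a new sales video session from its summarized evaluation values.
new_session = np.array([[0.7, 0.2, 0.6, 0.6, 0.2, 0.7]])
print(model.predict(new_session))        # e.g. ['A']
print(model.predict_proba(new_session))  # per-rank probabilities
```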
• Within an organization, this system can also calculate expected sales figures for a given period of time, such as the current month or the current quarter.
• A system according to a twelfth embodiment of the present invention will now be described with reference to FIG. 23.
  • the system according to this embodiment is suitable mainly for online learning guidance.
  • This system includes a lecturer terminal and student terminals that are communicably connected to each other via a network.
  • the instructor terminal has an instructor-side camera for showing at least the face of the instructor user.
• The student terminal has a face camera for capturing at least the face of the student, and a hand camera for capturing the student's hands (the state of writing in a notebook or on a handout, or the state of the desk).
• This system includes hand movement recognition means for recognizing, for each predetermined frame, the movement of the student's hands in the moving image acquired from the hand camera, and estimation means for estimating the student's degree of comprehension based on the recognized hand movements.
• The estimation means estimates the student's degree of comprehension according to the amount of recognized hand movement. For example, it is possible to evaluate whether the amount the student writes down matches the amount the instructor writes on the board (that is, whether the student is taking proper notes), or, by analyzing the color of the pen being used, whether the student is devising ways such as color-coding important points.
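Quantifying the amount of hand movement could be sketched with OpenCV frame differencing on the hand-camera stream; the mean-absolute-difference measure and the idle threshold are illustrative assumptions:

```python
import cv2
import numpy as np

def hand_motion_per_frame(video_path):
    """Yield a per-frame motion amount from the hand camera.

    Motion = mean absolute grayscale difference between consecutive frames;
    near-zero values over a long stretch suggest the student stopped writing.
    """
    cap = cv2.VideoCapture(video_path)
    prev = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            yield float(np.mean(cv2.absdiff(gray, prev)))
        prev = gray
    cap.release()

# Hypothetical usage: flag a lull in note-taking.
# motions = list(hand_motion_per_frame("hand_camera.mp4"))
# idle_frames = [i for i, m in enumerate(motions) if m < 1.0]
```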
  • an alert may be issued to the instructor terminal or the student terminal.
• The degree of comprehension may also be estimated based on both the evaluation values derived from the student's facial expressions and voice and the hand movements.
• For example, if the hand camera cannot detect the student's hand movements while emotions such as irritation or anxiety are detected, it can be assumed that the lecture is not being received effectively.
• <Supplementary note: hardware configuration> The sequence of operations performed by the apparatus described herein may be implemented using software, hardware, or a combination of software and hardware. A computer program for realizing each function of the information sharing support device 10 according to the present embodiment can be created and implemented in a PC or the like. It is also possible to provide a computer-readable recording medium storing such a computer program.
  • the recording medium is, for example, a magnetic disk, an optical disk, a magneto-optical disk, a flash memory, or the like. Also, the above computer program may be distributed, for example, via a network without using a recording medium.

Abstract

[Problem] To evaluate a video session by evaluating a video acquired in the video session. [Solution] A video session evaluation system according to the present disclosure comprises: an acquisition means for acquiring at least a video; a face recognition means for recognizing at least a face image of a subject included in the video for each prescribed frame; a speech recognition means for recognizing at least the speech of the subject included in the video; an evaluation means for calculating evaluation values for a prescribed aspect on the basis of both the recognized face images and the recognized speech; an output means for outputting the evaluation values as change information in a time series; and a specification means for referencing other change information associated with other videos and specifying other videos that include the same pattern as the pattern extracted from the change information.

Description

VIDEO SESSION EVALUATION TERMINAL, VIDEO SESSION EVALUATION SYSTEM AND VIDEO SESSION EVALUATION PROGRAM
The present disclosure relates to a video session evaluation terminal, a video session evaluation system, and a video session evaluation program.
Conventionally, there is known a technique for analyzing the emotions others receive in response to a speaker's remarks (see Patent Document 1, for example). There is also known a technique for analyzing changes in facial expressions of a subject over a long period of time in time series and estimating the emotions held during that period (see, for example, Patent Document 2). Furthermore, there are known techniques for identifying factors that have the greatest effect on changes in emotions (see Patent Documents 3 to 5, for example). Furthermore, there is also known a technique of comparing a subject's usual facial expression with a current facial expression and issuing an alert when the facial expression is dark (see, for example, Patent Document 6). There is also known a technique for determining the degree of emotion of a subject by comparing the subject's normal (expressionless) facial expression with the current facial expression (see, for example, Patent Documents 7 to 9). Furthermore, there is also known a technique for analyzing the feeling of an organization and the atmosphere within a group that an individual feels (see Patent Documents 10 and 11, for example).
Patent Document 1: JP 2019-58625 A
Patent Document 2: JP 2016-149063 A
Patent Document 3: JP 2020-86559 A
Patent Document 4: JP 2000-76421 A
Patent Document 5: JP 2017-201499 A
Patent Document 6: JP 2018-112831 A
Patent Document 7: JP 2011-154665 A
Patent Document 8: JP 2012-8949 A
Patent Document 9: JP 2013-300 A
Patent Document 10: JP 2011-186521 A
Patent Document 11: WO 15/174426 A
All of the techniques described above are merely subsidiary functions for situations in which communication in real space is the main mode. That is, they were not created for a situation in which communication for work, classes, and the like is mainly conducted online, as brought about by the recent DX (Digital Transformation) of business, the global epidemic of infectious diseases, and so on.
The purpose of the present invention is to objectively evaluate exchanged communication in order to conduct more efficient communication in situations where online communication is the main mode.
According to the invention,
a camera unit that acquires a moving image obtained by photographing a target person;
a line-of-sight acquisition unit that acquires a movement of the subject's line of sight based on the acquired moving image;
a display unit that continuously displays a plurality of images to the subject;
a position acquisition unit that acquires a positional relationship between the camera unit and the display unit;
an output unit that associates and outputs the eye movement for each of the plurality of displayed images;
is obtained.
Moreover, according to the present invention,
an acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
output means for outputting the evaluation value as change information along a time series;
A moving image analysis system, comprising: identifying means for referring to other change information relating to other moving images and identifying other moving images containing the same pattern as the pattern extracted from the change information.
is obtained.
Moreover, according to the present invention,
acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals;
face recognition means for recognizing a user's face image included in the moving image for each predetermined frame;
voice recognition means for recognizing at least voice of the user included in the moving image;
evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice;
a storage means for storing the evaluation value as change information along a time series;
detection means for detecting that only the evaluation value in one of the plurality of viewpoints has changed beyond a predetermined range;
a peculiar frame acquiring means for acquiring a peculiar frame including the detected range;
Video image analysis system.
is obtained.
Moreover, according to the present invention,
acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals;
face recognition means for recognizing a user's face image included in the moving image for each predetermined frame;
voice recognition means for recognizing at least voice of the user included in the moving image;
facial expression evaluation means for calculating facial expression evaluation values from a plurality of viewpoints based on the recognized face image;
speech evaluation means for calculating speech evaluation values from a plurality of viewpoints based on the recognized speech;
facial expression/speech correlation evaluation means for evaluating the correlation between the facial expression evaluation value and the speech evaluation value of at least one of the users;
detection means for detecting that the facial expression evaluation value and the voice evaluation value have changed beyond a predetermined range based on the correlation;
a peculiar frame acquiring means for acquiring a peculiar frame including the detected range;
Video image analysis system.
is obtained.
Moreover, according to the present invention,
acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals;
face recognition means for recognizing a user's face image included in the moving image for each predetermined frame;
Emotion evaluation means for analyzing the user's standard facial expression from the recognized face image and evaluating the degree of deviation from the user's standard facial expression;
concentration evaluation means for evaluating at least the amount of eye movement or face movement of the user from the recognized face image;
Safety evaluation means for evaluating the user's feeling of anxiety from the recognized face image;
score generation means for generating a score based on two or more evaluations of the emotion evaluation means, the concentration evaluation means, and the safety evaluation means;
Video image analysis system.
is obtained.
Moreover, according to the present invention,
Acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
a target person specifying means for specifying a target person frame in which the target person is recognized from each of the moving images relating to the plurality of the video sessions;
further comprising digest video generation means for generating a digest video by connecting a plurality of the target person frames;
Video analysis system.
is obtained.
Moreover, according to the present invention,
Acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
evaluation means for evaluating at least the amount of eye movement and face movement of the user from the recognized face image;
A moving image analysis system, comprising score calculation means for calculating a score relating to the degree of concentration based on the evaluation.
is obtained.
Moreover, according to the present invention,
Acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a user included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the voice;
state evaluation means for evaluating the state of the user based on a first evaluation analysis value obtained by analyzing the evaluation values for the same user in a first period, and a second evaluation analysis value obtained by analyzing the evaluation values in a second period longer than the first period;
Video image analysis system.
is obtained.
Moreover, according to the present invention,
Acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the user included in the moving image; evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the voice;
answer acquisition means for acquiring the user's answer information to the question information created based on the plurality of viewpoints;
evaluating the state of the user by comparing the evaluation value and the answer information;
status rating system.
is obtained.
Moreover, according to the present invention,
Acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a user included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image; evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the voice;
Annotation receiving means for receiving an annotation from the user for the evaluation value;
A video session evaluation system comprising display means for simultaneously displaying the evaluation value and the received annotation.
is obtained.
Moreover, according to the present invention,
a moving image acquisition means for acquiring a moving image of a sales video session conducted between a sales terminal of a person in charge of sales and a terminal of a sales destination of a person in charge of sales;
a contract information acquisition means for acquiring contract information of the sales video session;
face recognition means for recognizing the face image of at least one of the person in charge of the sales side or the person in charge of the sales partner included in the moving image for each predetermined frame;
voice recognition means for recognizing the voice of at least one of the sales-side person in charge and the client-side person in charge included in the moving image; and evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice;
model generation means for generating a contract conclusion estimation model for estimating the contract conclusion rates of moving images of other sales video sessions as a plurality of ranks, using the evaluation value and the contract conclusion information as teacher data;
determining means for associating one of the plurality of ranks with a new sales video session using the model;
Video session rating system.
is obtained.
Moreover, according to the present invention,
acquisition means for acquiring a moving image of a video session held between a lecturer terminal having a lecturer-side camera for capturing at least the face of a lecturer user, and a student terminal communicably connected to the lecturer terminal via a network and having a student-side face camera for capturing at least the face of a student and a hand camera for capturing the student's hands;
Hand action recognition means for recognizing at least the action of the student's hand in the moving image acquired from the hand camera for each predetermined frame;
estimating means for estimating the degree of comprehension of the student based on the recognized hand motion;
A video session evaluation system.
is obtained.
According to the present disclosure, by analyzing and evaluating moving images of a video session, it is possible to objectively evaluate, in particular, the content.
In particular, according to the present invention, exchanged communication can be objectively evaluated in order to conduct more efficient communication in situations where online communication is the main activity.
FIG. 1 shows an overall diagram of a system according to an embodiment of the present invention.
FIG. 2 shows a configuration example of a terminal according to an embodiment of the present invention.
FIG. 3 is an example of a functional block diagram of an evaluation terminal according to an embodiment of the present invention.
FIG. 4 shows functional configuration example 1 of an evaluation terminal according to an embodiment of the present invention.
FIG. 5 shows functional configuration example 2 of the evaluation terminal according to an embodiment of the present invention.
FIG. 6 shows functional configuration example 3 of the evaluation terminal according to an embodiment of the present invention.
FIG. 7 is a screen display example according to functional configuration example 3 of FIG. 6.
FIG. 8 is another screen display example according to functional configuration example 3 of FIG. 6.
FIG. 9 shows another configuration of functional configuration example 3 of the evaluation terminal according to an embodiment of the present invention.
FIG. 10 shows another configuration of functional configuration example 3 of the evaluation terminal according to an embodiment of the present invention.
FIG. 11 shows a heat map of the system according to the first embodiment of the present invention.
FIG. 12 shows an image of calibration of the system according to the first embodiment of the present invention.
FIG. 13 is a comparison of graphs of the system according to the second embodiment of the present invention.
FIG. 14 is a comparison of other graphs of the system according to the second embodiment of the present invention.
FIG. 15 shows a graph of the system according to the third embodiment of the present invention.
FIG. 16 shows another graph of the system according to the third embodiment of the present invention.
FIG. 17 shows another graph of the system according to the fourth embodiment of the present invention.
FIG. 18 shows an image of evaluation by the system according to the fourth embodiment of the present invention.
FIG. 19 shows an image of evaluation by the system according to the sixth embodiment of the present invention.
FIG. 20 shows an image of evaluation by the system according to the sixth embodiment of the present invention.
FIG. 21 shows an image of evaluation by the system according to the eighth embodiment of the present invention.
FIG. 22 shows an image of evaluation by the system according to the tenth embodiment of the present invention.
FIG. 23 shows an image of evaluation by the system according to the twelfth embodiment of the present invention.
The contents of the embodiments of the present disclosure are listed and described. The present disclosure has the following configurations.
[Item 1]
a camera unit that acquires a moving image obtained by photographing a target person;
a line-of-sight acquisition unit that acquires a movement of the subject's line of sight based on the acquired moving image;
a display unit that continuously displays a plurality of images to the subject;
a position acquisition unit that acquires a positional relationship between the camera unit and the display unit;
an output unit that associates and outputs the eye movement for each of the plurality of displayed images;
A line-of-sight evaluation system.
[Item 2]
The line-of-sight evaluation system according to item 1,
The output unit superimposes on the image a heat map indicating a fixation time generated based on the movement of the eye line and outputs the image.
A line-of-sight evaluation system.
[Item 3]
The line-of-sight evaluation system according to item 1,
The output unit further associates and outputs the movement of the line of sight of another subject who displayed the same image.
A line-of-sight evaluation system.
[Item 4]
The line-of-sight evaluation system according to item 3,
further comprising a peculiarity determination unit that determines whether or not the movement of the line of sight associated with the subject is peculiar compared with the movement of the line of sight associated with the other subjects,
Gaze evaluation system.
[Item 5]
The line-of-sight evaluation system according to any one of items 1 to 4,
The output unit associates and outputs a leveled heat map obtained by leveling the eye movements of the plurality of subjects for each image.
Gaze evaluation system.
[Item 6]
acquisition means for acquiring at least a moving image;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the subject included in the moving image;
evaluation means for calculating an evaluation value from a predetermined viewpoint based on both the recognized face image and the recognized voice;
output means for outputting the evaluation value as change information along a time series;
A moving image analysis system, comprising: identifying means for referring to other change information relating to other moving images and identifying other moving images containing the same pattern as the pattern extracted from the change information.
[Item 7]
The moving image analysis system according to item 6,
The output means outputs the evaluation values as chronological graph information,
The specifying means receives a selection operation of a part of the graph information from the analysis user, and specifies a corresponding frame of another moving image containing the same graph pattern as the graph pattern of the part where the selection operation was performed.
A moving image analysis system.
[Item 8]
The moving image analysis system according to item 6 or item 7,
The moving image analysis system, wherein the identifying means identifies other moving images including the same pattern as the pattern extracted from the change information in the same time period.
[Item 9]
acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals;
face recognition means for recognizing a user's face image included in the moving image for each predetermined frame;
voice recognition means for recognizing at least voice of the user included in the moving image;
evaluation means for calculating an evaluation value from a plurality of viewpoints based on both the recognized face image and the recognized voice;
a storage means for storing the evaluation value as change information along a time series;
detection means for detecting that only the evaluation value in one of the plurality of viewpoints has changed beyond a predetermined range;
a peculiar frame acquiring means for acquiring a peculiar frame including the detected range;
Video image analysis system.
[Item 10]
The moving image analysis system according to item 9,
The plurality of viewpoints includes a first viewpoint and a second viewpoint associated with mutually contradictory attributes,
The detection means detects that the evaluation value from the first viewpoint and the evaluation value from the second viewpoint have deviated beyond the predetermined range.
Video image analysis system.
[Item 11]
The moving image analysis system according to item 9,
The detection means detects that, within a predetermined time after a first time point, the evaluation value in the one viewpoint changes beyond a predetermined range and immediately thereafter returns to substantially the same value as the evaluation value at the first time point.
Video image analysis system.
[Item 12]
The moving image analysis system according to any one of items 9 to 11,
further comprising digest video generation means for generating a digest video by linking the plurality of peculiar frames acquired from the moving image,
Video analysis system.
[Item 13]
The moving image analysis system according to any one of items 9 to 12,
Further comprising a peculiar frame corresponding text output means for converting the speech corresponding to the peculiar frame into text and outputting it,
Video image analysis system.
[Item 14]
The moving image analysis system according to any one of items 9 to 13,
The video session is capable of sharing screen information displayed on the screen of one user terminal,
further comprising shared screen output means for outputting at least the screen information corresponding to the shared specific frame;
Video image analysis system.
[Item 15]
acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals;
face recognition means for recognizing a user's face image included in the moving image for each predetermined frame;
voice recognition means for recognizing at least voice of the user included in the moving image;
facial expression evaluation means for calculating facial expression evaluation values from a plurality of viewpoints based on the recognized face image;
speech evaluation means for calculating speech evaluation values from a plurality of viewpoints based on the recognized speech;
facial expression/speech correlation evaluation means for evaluating the correlation between the facial expression evaluation value and the speech evaluation value of at least one of the users;
detection means for detecting that the facial expression evaluation value and the voice evaluation value have changed beyond a predetermined range based on the correlation;
a peculiar frame acquiring means for acquiring a peculiar frame including the detected range;
Video image analysis system.
[Item 16]
The moving image analysis system according to item 15,
further comprising attribute evaluation means for associating attributes corresponding to the facial expression evaluation value and the voice evaluation value,
The detection means detects that the attribute of the facial expression evaluation value and the attribute of the voice evaluation value are mutually exclusive.
Video image analysis system.
[Item 17]
The moving image analysis system according to either item 15 or item 16,
further comprising digest video generation means for generating a digest video by linking the plurality of peculiar frames acquired from the moving image,
Video analysis system.
[Item 18]
The moving image analysis system according to any one of items 15 to 17,
further comprising peculiar frame corresponding text output means for converting the speech corresponding to the peculiar frame into text and outputting the text,
A moving image analysis system.
[Item 19]
The moving image analysis system according to any one of items 15 to 18,
The video session is capable of sharing screen information displayed on the screen of one of the user terminals,
further comprising shared screen output means for outputting at least the screen information corresponding to the shared peculiar frame;
A moving image analysis system.
[Item 20]
acquisition means for acquiring moving images relating to a video session conducted between at least two user terminals;
face recognition means for recognizing a user's face image included in the moving image for each predetermined frame;
emotion evaluation means for analyzing the user's standard facial expression from the recognized face image and evaluating the degree of deviation from that standard facial expression;
concentration evaluation means for evaluating at least the amount of eye movement or face movement of the user from the recognized face image;
safety evaluation means for evaluating the user's feeling of anxiety from the recognized face image;
score generation means for generating a score based on two or more evaluations of the emotion evaluation means, the concentration evaluation means, and the safety evaluation means;
A moving image analysis system.
[Item 21]
acquisition means for acquiring moving images of video sessions conducted with other terminals;
face recognition means for recognizing at least a face image of a target person included in each moving image for each predetermined frame;
target person specifying means for specifying, from each of the moving images relating to the plurality of video sessions, target person frames in which the target person is recognized; and
digest video generating means for generating a digest video by connecting the plurality of target person frames;
A moving image analysis system.
[Item 22]
acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a target person included in the moving image for each predetermined frame;
evaluation means for evaluating at least the amount of eye movement and face movement of the target person from the recognized face image; and
score calculation means for calculating a score relating to the degree of concentration based on the evaluation;
A moving image analysis system.
[Item 23]
acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a user included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the user included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice; and
state evaluation means for evaluating, for the same user, the state of the user based on a first evaluation analysis value obtained by analyzing the evaluation values in a first period and a second evaluation analysis value obtained by analyzing the evaluation values in a second period longer than the first period;
A moving image analysis system.
[Item 24]
The moving image analysis system according to item 23, further comprising:
trend detection means for detecting a predetermined trend with respect to the second evaluation analysis value; and
correction means for correcting the first evaluation analysis value according to the detected trend;
A moving image analysis system.
[Item 25]
acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a user included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the user included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice;
answer acquisition means for acquiring the user's answer information in response to question information created based on the plurality of viewpoints; and
state evaluation means for evaluating the state of the user by comparing the evaluation values and the answer information;
A state evaluation system.
[Item 26]
The state evaluation system according to item 25,
further comprising organization scoring means for scoring the state of an organization based on the states of all the users belonging to the same organization,
A state evaluation system.
[Item 27]
acquisition means for acquiring a moving image of a video session conducted with another terminal;
face recognition means for recognizing at least a face image of a user included in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the user included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice;
annotation receiving means for receiving an annotation from the user with respect to the evaluation values; and
display means for simultaneously displaying the evaluation values and the received annotation;
A video session evaluation system.
[Item 28]
The video session evaluation system according to item 27,
further comprising graph output means for outputting the evaluation values as graph information in chronological order,
The display means superimposes and displays the annotation on the graph information.
A video session evaluation system.
[Item 29]
moving image acquisition means for acquiring a moving image of a sales video session conducted between a terminal of a person in charge on the sales side and a terminal of a person in charge at the sales destination;
contract conclusion information acquisition means for acquiring contract conclusion information on the sales video session;
face recognition means for recognizing, for each predetermined frame, the face image of at least one of the person in charge on the sales side and the person in charge at the sales destination included in the moving image;
voice recognition means for recognizing the voice of at least one of the person in charge on the sales side and the person in charge at the sales destination included in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice;
model generation means for generating, using the evaluation values and the contract conclusion information as training data, a contract conclusion estimation model that estimates the contract conclusion rates of moving images of other sales video sessions as a plurality of ranks; and
determination means for associating one of the plurality of ranks with a new sales video session using the model;
A video session evaluation system.
[Item 30]
The video session evaluation system according to item 29,
further comprising expected value calculation means for calculating an expected sales figure for a predetermined period based on the estimated transaction amount for the sales video session and its rank within the organization;
A video session evaluation system.
[Item 31]
acquisition means for acquiring a moving image of a video session held between a lecturer terminal having a lecturer-side camera for capturing at least the face of a lecturer user, and a student terminal communicably connected to the lecturer terminal via a network and having a face camera for capturing at least the face of a student and a hand camera for capturing the student's hands;
hand action recognition means for recognizing, for each predetermined frame, at least the action of the student's hands in the moving image acquired from the hand camera; and
estimation means for estimating the degree of comprehension of the student based on the recognized hand action;
A video session evaluation system.
[Item 32]
The video session evaluation system according to item 31, further comprising:
voice recognition means for recognizing at least the voice of the student included in the moving image; and
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice,
The estimation means estimates the degree of comprehension of the student based on the hand action and the evaluation values.
A video session evaluation system.
[Item 33]
The video session evaluation system according to item 31 or item 32,
The estimation means estimates the degree of comprehension of the student according to the amount of the recognized hand movement.
A video session evaluation system.
[Item 34]
The video session evaluation system according to any one of items 31 to 33,
further comprising alert means for issuing an alert when the student's face is not captured by the face camera and the movement of the student's hands is not recognized from the hand camera.
A video session evaluation system.
Preferred embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. In the present specification and drawings, constituent elements having substantially the same functional configuration are denoted by the same reference numerals, and redundant description thereof is omitted.
<Basic functions>
The video session evaluation system of the present embodiment analyzes and evaluates, in an environment where a video session (hereinafter referred to as an online session, including both one-way and two-way sessions) is held by a plurality of people, specific emotions of a person to be analyzed that differ from those of the others (feelings that arise in response to one's own or others' words and actions, such as pleasure or displeasure and their degree).
An online session is, for example, an online meeting, an online class, or an online chat. Terminals installed at a plurality of locations are connected to a server via a communication network such as the Internet, and moving images can be exchanged among the terminals through the server.
The moving images handled in an online session include the face images and voices of the users operating the terminals. The moving images also include images such as documents that are shared and viewed by a plurality of users. On the screen of each terminal, it is possible to switch between face images and document images so that only one of them is displayed, or to divide the display area and display face images and document images at the same time. It is also possible to display the image of one of the users on the full screen, or to divide the images of some or all of the users into small screens.
Any one or more of the users participating in an online session via their terminals can be designated as a person to be analyzed. For example, the leader, moderator, or manager of the online session (hereinafter collectively referred to as the organizer) designates one of the users as the person to be analyzed. The organizer of an online session is, for example, the instructor of an online class, the chairperson or facilitator of an online meeting, or the coach of a coaching session. The organizer is usually one of the users participating in the online session, but may be another person who does not participate in it. Alternatively, all participants may be analyzed without designating a specific person.
When a video session is established between a plurality of terminals, the video session evaluation system according to the present embodiment displays at least the moving images acquired from that video session. The displayed moving image is acquired by the terminal, and at least the face images contained in it are identified for each predetermined frame unit. Evaluation values for the identified face images are then calculated. The evaluation values are shared as necessary.
In particular, in the present embodiment, the acquired moving image is stored on the terminal, analyzed and evaluated on the terminal, and the result is provided to the user of the terminal. Therefore, even a video session containing personal or confidential information can be analyzed and evaluated without providing the moving image itself to an external evaluation agency or the like. In addition, by providing only the evaluation results (evaluation values) to an external terminal as necessary, the results can be visualized and subjected to cross-analysis and the like.
As shown in FIG. 1, the video session evaluation system according to the present embodiment comprises user terminals 10 and 20, each having at least an input unit such as a camera unit and a microphone unit, a display unit such as a display, and an output unit such as a speaker; a video session service terminal 30 that provides a two-way video session to the user terminals 10 and 20; and an evaluation terminal 40 that performs part of the evaluation of the video session.
<Hardware configuration example>
FIG. 2 is a diagram showing a hardware configuration example of a computer that implements each of the terminals 10 to 40 according to the present embodiment. The computer includes at least a control unit 110, a memory 120, a storage 130, a communication unit 140, and an input/output unit 150. These are electrically connected to one another through a bus 160.
The control unit 110 is an arithmetic device that controls the overall operation of each terminal, controls the transmission and reception of data between the elements, and performs the information processing necessary for executing applications and for authentication. For example, the control unit 110 is a processor such as a CPU, and carries out each information process by executing programs stored in the storage 130 and loaded into the memory 120.
The memory 120 includes a main memory composed of a volatile storage device such as a DRAM, and an auxiliary memory composed of a non-volatile storage device such as a flash memory or an HDD. The memory 120 is used as a work area for the control unit 110, and stores the BIOS executed when each terminal starts up, various setting information, and the like.
The storage 130 stores various programs such as application programs. A database storing the data used in each process may be constructed in the storage 130. In particular, in the present embodiment, the moving images of an online session are not recorded in the storage 130 of the video session service terminal 30 but are stored in the storage 130 of the user terminal 10. The evaluation terminal 40 stores the applications and other programs necessary for evaluating the moving images acquired on the user terminal 10, and provides them as appropriate so that the user terminal 10 can use them. Note that, for example, only the analysis and evaluation results produced by the user terminal 10 may be shared with the storage 130 managed by the evaluation terminal 40.
The communication unit 140 connects the terminal to the network. The communication unit 140 communicates with external devices, directly or via a network access point, using, for example, a wired LAN, a wireless LAN, Wi-Fi (registered trademark), infrared communication, Bluetooth (registered trademark), or short-range or contactless communication.
The input/output unit 150 includes, for example, information input devices such as a keyboard, a mouse, and a touch panel, and output devices such as a display.
The bus 160 is commonly connected to the above elements and transmits, for example, address signals, data signals, and various control signals.
In particular, the evaluation terminal according to the present embodiment acquires a moving image from the video session service terminal, identifies at least the face images contained in the moving image for each predetermined frame unit, and calculates evaluation values for the face images (described in detail later).
<How to acquire videos>
As shown in FIG. 3, the video session service provided by the video session service terminal (hereinafter sometimes simply referred to as "this service") enables two-way communication between the user terminals 10 and 20 by means of images and audio. In this service, the moving image captured by the camera unit of the other party's user terminal is displayed on the display of a user terminal, and the audio captured by the microphone unit of the other party's user terminal can be output from the speaker.
This service is also configured so that either or both user terminals can record moving images and audio (collectively referred to as "moving images, etc.") in the storage unit of at least one of the user terminals. The recorded moving image information Vs (hereinafter referred to as "recorded information") is cached on the user terminal that started the recording and is recorded only locally on one of the user terminals. If necessary, the user can view the recorded information or share it with others within the scope of use of this service.
The user terminal 10 acquires the recorded information and performs the analysis and evaluation described below.
The user terminal 10 evaluates the moving image acquired as described above through the following analysis.
<Functional configuration example 1>
FIG. 4 is a block diagram showing a configuration example according to the present embodiment. As shown in FIG. 4, the video session evaluation system of the present embodiment is realized as the functional configuration of the user terminal 10. That is, the user terminal 10 includes, as its functions, a moving image acquisition unit 11, a biological reaction analysis unit 12, a peculiarity determination unit 13, a related event identification unit 14, a clustering unit 15, and an analysis result notification unit 16.
Each of the functional blocks 11 to 16 can be configured by any of hardware, a DSP (Digital Signal Processor), or software provided in the user terminal 10, for example. When configured by software, each of the functional blocks 11 to 16 actually comprises the CPU, RAM, ROM, and the like of a computer, and is realized by the operation of a program stored in a recording medium such as the RAM, the ROM, a hard disk, or a semiconductor memory.
The moving image acquisition unit 11 acquires from each terminal the moving images obtained by photographing a plurality of people (a plurality of users) with the camera provided in each terminal during an online session. It does not matter whether a moving image acquired from a terminal is set to be displayed on the screen of that terminal. That is, the moving image acquisition unit 11 acquires moving images from each terminal, including both moving images being displayed and moving images not being displayed.
The biological reaction analysis unit 12 analyzes changes in the biological reactions of each of the plurality of people based on the moving images acquired by the moving image acquisition unit 11 (whether or not they are being displayed on the screen). In the present embodiment, the biological reaction analysis unit 12 separates each moving image acquired by the moving image acquisition unit 11 into a set of images (a collection of frame images) and audio, and analyzes changes in biological reactions from each.
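For illustration only, this separation step might be sketched as follows in Python (the use of OpenCV and ffmpeg, the output file name, and all function names are assumptions made for this sketch, not part of the disclosure):

import subprocess

import cv2  # OpenCV, assumed available for frame extraction


def split_video(path: str, audio_out: str = "audio.wav"):
    """Separate a recorded session video into frame images and an audio track."""
    # Read every frame of the recorded moving image.
    frames = []
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    # Extract the audio track with ffmpeg (assumed to be installed): -vn drops
    # the video stream, pcm_s16le writes an uncompressed WAV for later analysis.
    subprocess.run(
        ["ffmpeg", "-y", "-i", path, "-vn", "-acodec", "pcm_s16le", audio_out],
        check=True,
    )
    return frames, audio_out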
For example, the biological reaction analysis unit 12 analyzes each user's face image using the frame images separated from the moving image acquired by the moving image acquisition unit 11, and thereby analyzes changes in biological reactions relating to at least one of facial expression, gaze, pulse, and facial movement. The biological reaction analysis unit 12 also analyzes the audio separated from the moving image, and thereby analyzes changes in biological reactions relating to at least one of the content of the user's speech and the user's voice quality.
When a person's emotions change, the change manifests itself as changes in biological reactions such as facial expression, gaze, pulse, facial movement, speech content, and voice quality. In the present embodiment, changes in a user's emotions are analyzed through the analysis of changes in the user's biological reactions. The emotion analyzed in the present embodiment is, for example, the degree of pleasure or displeasure. The biological reaction analysis unit 12 quantifies the changes in biological reactions according to predetermined criteria, thereby calculating a biological reaction index value that reflects the content of those changes.
The analysis of changes in facial expression is performed, for example, as follows. For each frame image, a facial region is identified within the frame image, and the identified facial expression is classified into one of a plurality of categories according to an image analysis model trained in advance by machine learning. Based on the classification results, it is then analyzed whether a positive or negative facial expression change has occurred between consecutive frame images and how large that change is, and a facial expression change index value corresponding to the analysis result is output.
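A minimal sketch of this step, assuming hypothetical pre-trained detect_face and classify_expression components and an invented valence table (none of which are specified in the disclosure), might look like this:

import numpy as np

# Hypothetical valence per expression class; the disclosure does not define one.
VALENCE = {"happy": 1.0, "neutral": 0.0, "sad": -0.8, "angry": -1.0}


def expression_change_index(frames, detect_face, classify_expression):
    """Score the direction and size of expression changes between frames."""
    scores = []
    for frame in frames:
        face = detect_face(frame)               # assumed face-region detector
        label = classify_expression(face)       # assumed pre-trained classifier
        scores.append(VALENCE.get(label, 0.0))
    deltas = np.diff(scores)                    # frame-to-frame valence change
    if len(deltas) == 0:
        return 0.0, 0.0
    # Mean delta gives the positive/negative direction; max |delta| its size.
    return float(np.mean(deltas)), float(np.max(np.abs(deltas)))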
The analysis of changes in gaze is performed, for example, as follows. For each frame image, the eye regions are identified within the frame image, and the orientation of both eyes is analyzed to determine where the user is looking: for example, whether the user is looking at the face of the speaker being displayed, at the shared document being displayed, or outside the screen. It may also be analyzed whether the eye movements are large or small, and whether they are frequent or infrequent. Changes in gaze are also related to the user's degree of concentration. The biological reaction analysis unit 12 outputs a gaze change index value corresponding to the analysis result.
The analysis of changes in pulse is performed, for example, as follows. For each frame image, the facial region is identified within the frame image. Then, using a trained image analysis model that captures the numerical values of the face color information (the G of RGB), changes in the green component of the facial surface are analyzed. By arranging the results along the time axis, a waveform representing the change in color information is formed, and the pulse is identified from this waveform. When a person is tense the pulse quickens, and when the person calms down the pulse slows. The biological reaction analysis unit 12 outputs a pulse change index value corresponding to the analysis result.
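The green-channel approach described here resembles remote photoplethysmography; a simplified sketch under that reading (the band-pass range, the 30 fps frame rate, and the minimum peak spacing are illustrative assumptions) could be:

import numpy as np
from scipy.signal import butter, filtfilt, find_peaks


def estimate_pulse_bpm(face_rois, fps=30.0):
    """Estimate pulse (beats per minute) from mean green values of face crops."""
    # One sample per frame: average green channel over the face region.
    g = np.array([roi[:, :, 1].mean() for roi in face_rois])
    g = g - g.mean()
    # Band-pass 0.7-4.0 Hz (roughly 42-240 bpm) to isolate the pulse component.
    b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype="band")
    filtered = filtfilt(b, a, g)
    # Count peaks, enforcing a minimum spacing of 0.25 s between beats.
    peaks, _ = find_peaks(filtered, distance=int(fps * 0.25))
    return 60.0 * len(peaks) / (len(g) / fps)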
The analysis of changes in facial movement is performed, for example, as follows. For each frame image, the facial region is identified within the frame image, and the orientation of the face is analyzed to determine where the user is looking: for example, whether the user is looking at the face of the speaker being displayed, at the shared document being displayed, or outside the screen. It may also be analyzed whether the facial movements are large or small, and whether they are frequent or infrequent. Facial movement and gaze movement may also be analyzed together; for example, it may be analyzed whether the user is looking straight at the face of the displayed speaker, looking at it with an upward or downward glance, or looking at it from an angle. The biological reaction analysis unit 12 outputs a face orientation change index value corresponding to the analysis result.
The analysis of speech content is performed, for example, as follows. The biological reaction analysis unit 12 converts the audio of a specified length (for example, about 30 to 150 seconds) into a character string by performing known speech recognition processing, and morphologically analyzes the character string to remove words, such as particles and articles, that are unnecessary for representing the conversation. The remaining words are then vectorized, and it is analyzed whether a positive or negative emotional change has occurred and how large that change is, after which a speech content index value corresponding to the analysis result is output.
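As a rough illustration of the remove-function-words-then-score idea (the disclosure assumes morphological analysis, e.g. of Japanese; the whitespace tokenizer, stop-word set, and tiny sentiment lexicon below are stand-ins invented for this sketch):

import numpy as np

# Invented polarity lexicon: word -> score in [-1, 1].
LEXICON = {"good": 0.8, "great": 1.0, "bad": -0.8, "problem": -0.6}
# Stands in for the particles/articles removed by morphological analysis.
STOPWORDS = {"a", "an", "the", "is", "of", "to"}


def speech_content_index(transcript: str) -> float:
    """Score an ASR transcript by averaging the polarity of content words."""
    words = [w for w in transcript.lower().split() if w not in STOPWORDS]
    polarities = [LEXICON.get(w, 0.0) for w in words]
    return float(np.mean(polarities)) if polarities else 0.0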
The analysis of voice quality is performed, for example, as follows. The biological reaction analysis unit 12 identifies the acoustic features of the audio of a specified length (for example, about 30 to 150 seconds) by performing known speech analysis processing. Based on those acoustic features, it analyzes whether a positive or negative change in voice quality has occurred and how large that change is, and outputs a voice quality change index value corresponding to the analysis result.
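One conceivable way to obtain such acoustic features is with librosa; the particular feature set below, and how it would map to a positive or negative voice quality change, are illustrative assumptions rather than the disclosed method:

import numpy as np
import librosa


def acoustic_features(path: str) -> dict:
    """Extract a few features commonly used to characterize voice quality."""
    y, sr = librosa.load(path, sr=None)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)       # pitch track
    rms = librosa.feature.rms(y=y)[0]                   # loudness proxy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # timbre summary
    return {
        "f0_mean": float(np.nanmean(f0)),
        "f0_var": float(np.nanvar(f0)),
        "rms_mean": float(rms.mean()),
        "mfcc_mean": mfcc.mean(axis=1),
    }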
The biological reaction analysis unit 12 calculates the biological reaction index value using at least one of the facial expression change index value, gaze change index value, pulse change index value, face orientation change index value, speech content index value, and voice quality change index value calculated as described above. For example, the biological reaction index value is calculated by a weighted computation over these index values.
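Such a weighted computation can be sketched in a few lines; the weights below are invented for illustration, since the disclosure does not give concrete values:

import numpy as np

# Order: expression, gaze, pulse, face orientation, speech content, voice quality.
WEIGHTS = np.array([0.25, 0.15, 0.15, 0.15, 0.15, 0.15])


def biological_reaction_index(expression, gaze, pulse, face, speech, voice):
    """Weighted combination of the six per-modality change index values."""
    return float(WEIGHTS @ np.array([expression, gaze, pulse, face, speech, voice]))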
The peculiarity determination unit 13 determines whether the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with the changes in the biological reactions analyzed for the other people. In the present embodiment, the peculiarity determination unit 13 makes this determination based on the biological reaction index values calculated for each of the plurality of users by the biological reaction analysis unit 12.
For example, the peculiarity determination unit 13 calculates the variance of the biological reaction index values calculated for each of the plurality of people by the biological reaction analysis unit 12, and determines whether the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with the others by comparing the biological reaction index value calculated for that person with the variance.
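One way to realize this comparison is a deviation test against the group statistics; the threshold k below is an assumed parameter, not a value from the disclosure:

import numpy as np


def is_peculiar(index_values: dict, target: str, k: float = 2.0) -> bool:
    """Flag the target when their index value deviates from the group mean
    by more than k standard deviations."""
    values = np.array(list(index_values.values()), dtype=float)
    mean, std = values.mean(), values.std()
    if std == 0:
        return False
    return abs(index_values[target] - mean) > k * std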
Three patterns are conceivable as cases where the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with the others. The first is the case where no particularly large change in biological reaction has occurred in the others, but a comparatively large change has occurred in the person to be analyzed. The second is the case where no particularly large change has occurred in the person to be analyzed, but a comparatively large change has occurred in the others. The third is the case where comparatively large changes have occurred in both the person to be analyzed and the others, but the content of the change differs between them.
The related event identification unit 14 identifies an event occurring in relation to at least one of the person to be analyzed, the others, and the environment when the change in biological reaction determined to be peculiar by the peculiarity determination unit 13 occurred. For example, the related event identification unit 14 identifies from the moving image the words and actions of the person to be analyzed when the peculiar change occurred. It likewise identifies from the moving image the words and actions of the others at that time, and the environment at that time. The environment is, for example, the shared document being displayed on the screen or what appears in the background of the person to be analyzed.
The clustering unit 15 analyzes the degree of correlation between the change in biological reaction determined to be peculiar by the peculiarity determination unit 13 (for example, one or a combination of gaze, pulse, facial movement, speech content, and voice quality) and the event occurring when that peculiar change took place (the event identified by the related event identification unit 14). When the correlation is determined to be at or above a certain level, the clustering unit 15 clusters the person to be analyzed or the event based on the result of that correlation analysis.
For example, when the peculiar change in biological reaction corresponds to a negative emotional change and the event occurring at that time is also a negative event, a correlation at or above the certain level is detected. The clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented categories according to the content of the event, the degree of negativity, the magnitude of the correlation, and the like.
Similarly, when the peculiar change in biological reaction corresponds to a positive emotional change and the event occurring at that time is also a positive event, a correlation at or above the certain level is detected. The clustering unit 15 clusters the person to be analyzed or the event into one of a plurality of pre-segmented categories according to the content of the event, the degree of positivity, the magnitude of the correlation, and the like.
The analysis result notification unit 16 notifies the designator of the person to be analyzed (the person to be analyzed himself or herself, or the organizer of the online session) of at least one of the change in biological reaction determined to be peculiar by the peculiarity determination unit 13, the event identified by the related event identification unit 14, and the category assigned by the clustering unit 15.
For example, the analysis result notification unit 16 notifies the person to be analyzed of his or her own words and actions as the event occurring when a peculiar change in biological reaction, different from that of the others, occurred (any of the three patterns described above; the same applies hereinafter). This allows the person to be analyzed to understand that he or she feels differently from the others when performing certain words and actions. At this time, the person to be analyzed may also be notified of the peculiar change in biological reaction identified for him or her, and may further be notified of the changes in the biological reactions of the others being compared.
For example, when the emotion that others felt in response to words and actions that the person to be analyzed performed with his or her usual feelings and without particular awareness, or performed consciously with a certain emotion, differs from the emotion the person to be analyzed held at the time of those words and actions, the person to be analyzed is notified of his or her own words and actions at that time. This makes it possible to discover, contrary to one's own awareness, words and actions that are well received by others and words and actions that are not.
The analysis result notification unit 16 also notifies the organizer of the online session of the event occurring when a peculiar change in biological reaction, different from that of the others, occurred in the person to be analyzed, together with that peculiar change. This allows the organizer of the online session to know what kind of event influences what kind of emotional change as a phenomenon specific to the designated person to be analyzed, and to take appropriate measures for that person according to what has been understood.
The analysis result notification unit 16 also notifies the organizer of the online session of the event occurring when a peculiar change in biological reaction occurred in the person to be analyzed, or of the clustering result for that person. This allows the organizer, depending on the category into which the designated person has been clustered, to grasp behavioral tendencies specific to that person and to predict possible future behaviors and states, and then to take appropriate measures for the person to be analyzed.
In the above embodiment, an example was described in which the biological reaction index value is calculated by quantifying changes in biological reactions according to predetermined criteria, and whether the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with the others is determined based on the biological reaction index values calculated for each of the plurality of people. However, the present invention is not limited to this example; for instance, the following approach may be used.
That is, the biological reaction analysis unit 12 analyzes the gaze movements of each of the plurality of people and generates a heat map indicating the direction of each person's gaze. The peculiarity determination unit 13 determines whether the change in the biological reaction analyzed for the person to be analyzed is peculiar compared with the others by comparing the heat map generated for the person to be analyzed with the heat maps generated for the others.
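A rough sketch of such a heat map comparison (the grid size and the L1 distance metric are illustrative choices; the disclosure does not specify how heat maps are compared):

import numpy as np


def gaze_heatmap(points, grid=(20, 20)):
    """Accumulate normalized gaze points (x, y in [0, 1]) into a 2-D histogram."""
    heat = np.zeros(grid)
    for x, y in points:
        i = min(int(y * grid[0]), grid[0] - 1)
        j = min(int(x * grid[1]), grid[1] - 1)
        heat[i, j] += 1.0
    total = heat.sum()
    return heat / total if total else heat


def heatmap_distance(target_points, other_points_list):
    """Mean L1 distance between the target's heat map and each other heat map."""
    target = gaze_heatmap(target_points)
    return float(np.mean(
        [np.abs(target - gaze_heatmap(p)).sum() for p in other_points_list]
    ))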
As described above, in the present embodiment, the moving image of the video session is stored in the local storage of the user terminal 10, and the analysis described above is performed on the user terminal 10. Although the result may depend on the machine specifications of the user terminal 10, this makes it possible to analyze the moving image without providing its information to the outside.
<Functional configuration example 2>
As shown in FIG. 5, the video session evaluation system of the present embodiment may include, as its functional configuration, a moving image acquisition unit 11, a biological reaction analysis unit 12, and a reaction information presentation unit 13a.
The reaction information presentation unit 13a presents information indicating the changes in biological reactions analyzed by the biological reaction analysis unit 12, including those of participants who are not displayed on the screen. For example, the reaction information presentation unit 13a presents this information to the leader, moderator, or manager of the online session (hereinafter collectively referred to as the organizer). The organizer of an online session is, for example, the instructor of an online class, the chairperson or facilitator of an online meeting, or the coach of a coaching session. The organizer is usually one of the users participating in the online session, but may be another person who does not participate in it.
By doing so, the organizer of the online session can grasp, in an environment where the online session is held by a plurality of people, the state of participants who are not displayed on the screen.
<Functional configuration example 3>
FIG. 6 is a block diagram showing a configuration example according to the present embodiment. As shown in FIG. 6, in the video session evaluation system of the present embodiment, functions similar to those of the configurations described above are given the same reference numerals, and their description may be omitted.
The system according to the present embodiment comprises a camera unit that acquires the video of the video session and a microphone unit that acquires its audio, an analysis unit that analyzes and evaluates the moving image, an object generation unit that generates display objects (described later) based on the information obtained by evaluating the acquired moving image, and a display unit that displays both the moving image of the video session and the display objects while the video session is in progress.
The analysis unit includes, as described above, the moving image acquisition unit 11, the biological reaction analysis unit 12, the peculiarity determination unit 13, the related event identification unit 14, the clustering unit 15, and the analysis result notification unit 16. The function of each element is as described above.
As shown in FIG. 7, based on the result of analyzing the moving image acquired from the video session, the object generation unit superimposes on the moving image, as necessary, an object 50 indicating the recognized face portion and information 100 indicating the content of the analysis and evaluation described above. When the faces of a plurality of people appear in the moving image, the object 50 may be generated so that the faces of all of them are identified and displayed.
Furthermore, even when the camera function of the video session is stopped on the other party's terminal (that is, stopped in software within the video session application rather than by physically covering the camera), if the other party's camera recognizes the other party's face, the object 50 and the object 100 may be displayed at the position where the other party's face is located. This makes it possible for both parties to confirm that the other is in front of the terminal even when the camera function is turned off. In this case, for example, the video session application may hide the information acquired from the camera while displaying only the object 50 and the object 100 corresponding to the face recognized by the analysis unit. Alternatively, the video information acquired from the video session and the information recognized by the analysis unit may be placed on different display layers, and the layer relating to the former may be hidden.
When there are a plurality of areas in which moving images are displayed, the object 50 and the object 100 may be displayed in all of the areas or in only some of them. For example, as shown in FIG. 8, they may be displayed only on the moving image on the guest side.
Although the preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, the technical scope of the present disclosure is not limited to these examples. It is obvious that a person having ordinary knowledge in the technical field of the present disclosure can conceive of various changes or modifications within the scope of the technical ideas described in the claims, and these are naturally understood to belong to the technical scope of the present disclosure.
The devices described in this specification may be realized as a single device, or may be realized in whole or in part by a plurality of devices (for example, cloud servers) connected via a network. For example, the control unit 110 and the storage 130 of each terminal 10 may be realized by different servers connected to each other via a network.
That is, this system includes the user terminals 10 and 20, the video session service terminal 30 that provides a two-way video session to the user terminals 10 and 20, and the evaluation terminal 40 that performs evaluation of the video session, and the following combinations of configurations are conceivable.
(1) Processing everything on the user terminal alone
As shown in FIG. 9, by performing the processing of the analysis unit on the terminal conducting the video session (although a certain processing capacity is required), the analysis and evaluation results can be obtained at the same time as (in real time with) the video session.
(2) Processing shared between the user terminal and the evaluation terminal
As shown in FIG. 10, the analysis unit may be provided in an evaluation terminal connected via a network or the like. In this case, the moving image acquired by the user terminal is shared with the evaluation terminal at the same time as or after the video session; after the moving image has been analyzed and evaluated by the analysis unit of the evaluation terminal, the information on the object 50 and the object 100 (that is, information including at least the analysis data) is shared with the user terminal, together with or separately from the moving image data, and displayed on the display unit.
<First embodiment>
A first embodiment of the present invention will be described with reference to FIGS. 11 and 12. In outline, the system according to the present embodiment analyzes and evaluates which part of a displayed document the person being evaluated gazed at, and for how long, from information on where on the screen the person's eyes are directed and information on the document displayed at that time.
That is, the system according to the present embodiment includes camera means for acquiring a moving image obtained by photographing the person being evaluated, gaze acquisition means for acquiring the movement of the person's gaze based on the acquired moving image, and display means for sequentially displaying a plurality of images to the person.
In particular, this system includes position acquisition means for acquiring the positional relationship between the camera means and the display means. This makes it possible to calibrate the subject's eye movements against the gaze point. In the calibration process, for example, as shown in FIGS. 11 and 12, the state of the subject's eyes is captured by the camera unit of the display, and the eye movements are then acquired while the subject looks at predetermined places on the screen (calibration points: the center, the four corners of the screen, and so on). The eye movements may be acquired, for example, by playing an announcement on the screen so that the subject intentionally looks at the calibration points, or by displaying an eye-catching, conspicuous sign only at the center and treating the eye movement at that moment as the state of gazing at the center.
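A minimal sketch of the mapping such a calibration could produce, assuming pupil positions have already been extracted from the camera images (the affine model and least-squares fit are illustrative choices):

import numpy as np


def fit_gaze_mapping(eye_points, screen_points):
    """Fit an affine map (pupil coords -> screen coords) by least squares.

    eye_points: Nx2 pupil positions recorded while the subject looked at the
    calibration points; screen_points: the corresponding Nx2 screen coordinates.
    """
    eyes = np.asarray(eye_points, dtype=float)
    screen = np.asarray(screen_points, dtype=float)
    A = np.hstack([eyes, np.ones((len(eyes), 1))])  # bias column adds translation
    M, *_ = np.linalg.lstsq(A, screen, rcond=None)
    return M  # 3x2 matrix


def gaze_to_screen(M, eye_point):
    """Map a new pupil position to an estimated on-screen gaze point."""
    x, y = eye_point
    return np.array([x, y, 1.0]) @ M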
As shown in FIG. 11, in the present embodiment, the gaze points are associated with their gaze durations on the (shared) document displayed on the screen, and the result is output in the form of a heat map. This makes it possible to grasp which part of the document was gazed at and for how long, revealing the parts the subject is interested in.
The system according to the present embodiment may also generate a heat map for the same document taking into account the movements of other subjects (other sales prospects, other students, and so on). In this case, the gaze points of the other subjects may also be displayed on the document. At this time, subject-specific peculiarities may be output, such as parts that the other subjects gazed at but the subject in question did not, or parts that the subject gazed at but the others did not.
As a method of evaluating the materials themselves, each material may be output in association with a normalized heat map obtained by normalizing the subjects' eye movements. For example, the necessity of a material can be grasped from how closely it was looked at, while a material with a short total gaze time can be judged to be less necessary.
<Second Embodiment>
A second embodiment of the present invention will be described with reference to FIGS. 13 and 14. The system according to the present embodiment visualizes, as a graph, the evaluation values analyzed and evaluated based on the facial expressions and voices described above, and extracts other subjects exhibiting the same pattern as a pattern read from the graph.
That is, the system according to the present embodiment outputs the evaluation values, calculated by recognizing the face image and voice of the subject contained in the acquired moving image, as change information along a time series (for example, as the graph shown in FIG. 13).
As shown in FIG. 14, the value "safety", which indicates a sense of security, will be used as an example in the graphs of the evaluated subjects A to C. In the illustrated graphs, the horizontal axis is time and the vertical axis is the evaluation value indicating the degree of security. The graph (A) for subject A shows a large drop in value between times t1 and t2. Such a graph appears, for example, when it is detected, from a combined judgment of facial expression and voice, that A felt uneasy or afraid.
Similarly, the graph (B) for subject B also shows a large drop in value between times t1 and t2, which likewise appears when unease or fear is detected from a combined judgment of facial expression and voice.
While the degree of security drops significantly for both A and B between times t1 and t2, there is no significant change in the degree of security for subject C, shown in graph (C), over the same interval.
When such graphs are obtained, the system refers to the change between times t1 and t2 in (A) and extracts the information in (B), which contains the same pattern.
By extracting similar patterns in this way, it becomes possible to identify subjects who have similar emotions during the same time period. It also becomes possible to identify subjects who have opposite emotions during the same time period.
When extracting a similar pattern, a selection operation on part of the source graph information may be received from the analyst (for example, selecting times t1 to t2 in (A) of FIG. 14), and the corresponding frames of other moving images containing the same graph pattern as the selected portion may be identified.
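One plausible implementation of this pattern extraction is a sliding-window comparison of the analyst-selected segment against the time series of other subjects, for example by normalized correlation, as in the sketch below. The window handling and the similarity threshold are assumptions.

```python
import numpy as np

def best_match(query, series, threshold=0.9):
    """Slide `query` over `series`; return (start index, correlation)
    of the best normalized-correlation match, or None below threshold."""
    q = (query - query.mean()) / (query.std() + 1e-9)
    best = (None, -1.0)
    for s in range(len(series) - len(query) + 1):
        w = series[s:s + len(query)]
        w = (w - w.mean()) / (w.std() + 1e-9)
        r = float(np.mean(q * w))
        if r > best[1]:
            best = (s, r)
    return best if best[1] >= threshold else None

# Subject A's "safety" curve dips near t=3; the analyst selects the dip.
t = np.linspace(0, 2 * np.pi, 200)
a = 0.8 - 0.5 * np.exp(-((t - 3) ** 2))
query = a[80:120]
# Subject B's curve dips at the same time and is matched.
b = 0.7 - 0.5 * np.exp(-((t - 3) ** 2)) + 0.01 * np.sin(5 * t)
print(best_match(query, b))
```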
<Third Embodiment>
A third embodiment of the present invention will be described with reference to FIGS. 15 and 16. The system according to the present embodiment detects that the facial expression and voice evaluation described above has changed momentarily and significantly, beyond a predetermined threshold. In particular, by detecting that only one of the evaluations from a plurality of viewpoints has changed significantly, it becomes possible to analyze the subject's deep psychology. Efficient analysis is also possible by cutting out and evaluating the portion of the moving image in which such a change occurred.
As shown in FIG. 15, for the graphs of "safety", indicating the subject's degree of security, and "happy", indicating the degree of happiness, the safety graph shows no large change, while only the happy graph drops sharply at times t1 and t2 and returns to its original level immediately afterward.
The system then cuts out a portion L1 of predetermined length before and after the change at t1 and a portion L2 of predetermined length before and after the change at t2, and joins them to generate a digest video. This makes it possible to extract a moving image containing the moments at which deep psychology appears.
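As a rough sketch of this step, the code below flags momentary dips that recover immediately and converts them into frame ranges for a digest; the thresholds, margins, and sample values are assumptions.

```python
import numpy as np

def find_spikes(values, threshold=0.3, recover=0.05):
    """Indices where a value drops sharply from the previous sample
    and returns to near its prior level shortly afterwards."""
    spikes = []
    for i in range(1, len(values) - 2):
        drop = values[i - 1] - values[i]
        recovered = abs(values[i + 2] - values[i - 1]) < recover
        if drop > threshold and recovered:
            spikes.append(i)
    return spikes

def digest_segments(spikes, margin, fps, total_frames):
    """Frame ranges of `margin` seconds around each spike, to be
    cut out and concatenated into a digest video."""
    half = int(margin * fps)
    return [(max(0, s - half), min(total_frames, s + half)) for s in spikes]

happy = np.array([0.8, 0.8, 0.2, 0.3, 0.8, 0.8, 0.8, 0.1, 0.8, 0.8])
spikes = find_spikes(happy)
print(spikes, digest_segments(spikes, margin=1.0, fps=2, total_frames=10))
```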
As shown in FIG. 16, the system according to the present embodiment may also detect that two different graphs change greatly at the same instant. For example, as shown in FIG. 16, for evaluation values associated with mutually opposing characteristics, such as happy and sad, happy drops momentarily and sad rises momentarily at time t1. When one evaluation value changes momentarily and the opposing evaluation value rises momentarily in this way, the emotion that rose is often the true emotion.
Even when the characteristics of the evaluation values are not mutually opposing, deep psychology can be analyzed easily by identifying an evaluation value that rose sharply when some other evaluation value dropped sharply.
A digest video may also be generated by concatenating a plurality of frames (peculiar frames) acquired from within the moving image. The voice corresponding to the peculiar frames may be converted into text and output. When it is possible to share the screen information displayed on the screen of a user terminal, the screen information corresponding to the peculiar frames may be output when the momentary change described above occurs.
<Fourth Embodiment>
A system according to a fourth embodiment of the present invention will be described with reference to FIG. 17. The system according to the present embodiment can detect false emotions and deferential communication, as in the case where the "happy" value drops sharply even though the subject is uttering a positive phrase such as "thank you".
That is, the system according to the present embodiment associates attributes with the facial expression evaluation values and the voice evaluation values in advance, and detects, from the correlation between the attributes, that they have diverged beyond a predetermined range. For example, a positive label is associated with words such as "thank you" and "I understand well", and their correlation with the facial expression evaluations (happy, sad, safety) is defined in advance.
As shown in FIG. 17, in graph (A), the words "Thank you very much." and "I understand well." are extracted, and the value of the happy graph is high. In graph (B), on the other hand, happy drops sharply after time t1 even though the subject says "Thank you for the explanation." and "I understood it very well.". Here, although "I understood it very well" is defined as a positive phrase, the corresponding happy emotion has fallen significantly.
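A minimal sketch of this word/emotion cross-check appears below. The phrase lexicon, the score floor, and the data are illustrative assumptions; the disclosure only states that positive labels and correlations are defined in advance.

```python
# Hypothetical lexicon of phrases pre-labeled as positive.
POSITIVE_PHRASES = {"thank you", "i understand well", "i understood it very well"}

def detect_mismatch(utterances, happy_series, floor=0.3):
    """Flag moments where a positive phrase co-occurs with a low
    'happy' value, suggesting a false or deferential emotion."""
    flags = []
    for t, text in utterances:               # (frame index, transcript)
        positive = any(p in text.lower() for p in POSITIVE_PHRASES)
        if positive and happy_series[t] < floor:
            flags.append((t, text, happy_series[t]))
    return flags

happy = [0.8, 0.8, 0.7, 0.2, 0.15]
utterances = [(1, "Thank you very much."), (4, "I understood it very well.")]
print(detect_mismatch(utterances, happy))    # flags the utterance at t=4
```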
The system shown in FIG. 18 can share the screen information displayed on the screen. In this case, the content of the screen, the text information, and the graph information indicating the emotion may be associated with one another.
<Fifth Embodiment>
A system according to a fifth embodiment of the present invention will now be described. The system according to the present embodiment analyzes the user's standard facial expression from the recognized face image, and analyzes and evaluates the user from the three viewpoints of emotion, concentration, and safety.
That is, the evaluation is performed from three viewpoints: emotion, which evaluates the degree of divergence from the user's standard facial expression; concentration, which evaluates at least the amount of movement of the user's pupils or face from the recognized face image; and safety, which evaluates the user's feelings of anxiety from the recognized face image.
The evaluation may use a learner trained for each viewpoint, or may be performed by other methods. A score is generated based on two or more of the evaluated viewpoints.
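As one way to combine the viewpoints, a weighted sum such as the sketch below could be used; the disclosure does not fix a formula, so the weights here are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class ViewpointScores:
    emotion: float        # divergence from the standard expression
    concentration: float  # amount of pupil/face movement
    safety: float         # anxiety-related evaluation

def combined_score(v: ViewpointScores, weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted combination of two or more viewpoint evaluations."""
    w_e, w_c, w_s = weights
    return w_e * v.emotion + w_c * v.concentration + w_s * v.safety

print(combined_score(ViewpointScores(emotion=0.7, concentration=0.9, safety=0.6)))
```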
<Sixth Embodiment>
A system according to a sixth embodiment of the present invention will be described with reference to FIG. 19. The system according to the present embodiment identifies, from a plurality of sales videos, lecture videos, and the like, the videos in which a specific target person appears. This makes it possible, for example, to focus the evaluation on a specific person across various sessions held online.
As illustrated, when there is video data of an online lecture by a certain lecturer, it is normally impossible, unless some labeling or title is associated with it, to know whether a specific person appears in it (whether it is that lecturer's lecture, and so on) without playing back the content. According to the present embodiment, each video file can be analyzed to detect whether a person whose face image has been registered in advance appears in the video.
A digest video may also be generated by cutting out only the portions of the moving images in which the target person appears. For example, if, among the illustrated moving images of lectures 001 to 004, the target person appears in lectures 001, 002, and 004, the system extracts the portions t1, t2, and t3 from the respective moving images. The extracted portions can then be played back as a digest moving image.
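A sketch of such detection using the open-source OpenCV and face_recognition libraries is shown below; the file names are hypothetical and the sampling interval is an assumption, since the disclosure does not name an implementation.

```python
import cv2                  # OpenCV, for reading video frames
import face_recognition    # open-source face matching library

def appearances(video_path, known_encoding, step=30):
    """Return sampled frame indices where the registered face appears."""
    hits, idx = [], 0
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            for enc in face_recognition.face_encodings(rgb):
                if face_recognition.compare_faces([known_encoding], enc)[0]:
                    hits.append(idx)
        idx += 1
    cap.release()
    return hits

# Registration: encode one reference photo of the target person.
ref = face_recognition.load_image_file("lecturer.jpg")   # hypothetical file
known = face_recognition.face_encodings(ref)[0]
print(appearances("lecture001.mp4", known))              # hypothetical file
```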
<Seventh Embodiment>
A system according to a seventh embodiment of the present invention will be described with reference to FIG. 20. The system according to the present embodiment calculates a so-called concentration score (degree of concentration) of a subject participating in an online session. In online settings, especially in the webinar format, the attendees' cameras are often turned off. According to the present system, it becomes possible in such cases to quantitatively determine how much each attendee is concentrating on the lecture.
The system recognizes the face image acquired by the camera during the session (regardless of whether the subject shares the camera image with the other party), and evaluates the amounts of movement of the subject's pupils and face, respectively. For example, as shown in FIG. 20, the amounts by which the face and the eyes have moved from their initial positions are evaluated as absolute values. In the period L1 of the illustrated graph, for example, both face movement and eye movement are small, and it can be inferred that the subject is staring at the screen or the like. In the period L2, on the other hand, the face is not moving but the eyes are moving in various directions, and it can be inferred that the subject is reading through the material. The degree of concentration is grasped as these two patterns: in the former case the subject is presumed to be gazing at the speaker's face and listening intently, and in the latter case the subject is presumed to be reading the displayed material with concentration.
A score related to the degree of concentration may also be calculated from the degree of movement (the Value in the illustrated graph). Various formulations and statistical methods can be selected for the calculation. For example, the degree of concentration may be set to 100 when both face and eye movements are 0, and to 0 when both are at their maximum values.
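The example given above (100 when both movements are 0, 0 when both are maximal) can be written directly as a linear mapping; the sketch below does so, with the normalization ranges as assumptions.

```python
def concentration_score(face_move, eye_move, face_max=1.0, eye_max=1.0):
    """Map face/eye movement (0 = still, *_max = maximal) to a 0-100
    concentration score: 100 when both are 0, 0 when both are maximal."""
    f = min(face_move / face_max, 1.0)
    e = min(eye_move / eye_max, 1.0)
    return 100.0 * (1.0 - (f + e) / 2.0)

print(concentration_score(0.0, 0.0))    # 100.0: gazing at the speaker
print(concentration_score(0.05, 0.6))   # mid-range: eyes scanning material
print(concentration_score(1.0, 1.0))    # 0.0: both at their maximum
```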
<Eighth Embodiment>
A system according to an eighth embodiment of the present invention will be described with reference to FIG. 21. The system according to the present embodiment performs the evaluations made by the systems of the first to seventh embodiments over different spans, thereby correcting for seasonal and temporal factors to obtain a true evaluation.
That is, for the same user, the system evaluates the subject's state based on a short-term evaluation value (analysis value) obtained by analyzing evaluation values over a short period, and a long-term evaluation value (analysis value) obtained by analyzing evaluation values over a long period. For example, a student's evaluation values over one year (long-term evaluation values) and evaluation values per month (short-term evaluation values) may be analyzed.
From the long-term evaluation values, long-term characteristics of the subject can be analyzed, for example: communication becomes gloomy in winter, the subject is lively in summer, the amount of communication on winter nights is extremely small, and so on. From the short-term evaluation values, short-term characteristics of the subject can be analyzed, such as feeling depressed at the end of the month or having a brighter expression on Fridays.
The long term mentioned above is, for example, a cycle of three months, six months, or one year, and the short term is, for example, a cycle of one day, one week, or one month; however, the spans are not limited to these, and the evaluation may be performed over any two spans of different lengths. The evaluation in this case may use the mean or median of the evaluation values, or an appropriate value may be calculated by various statistical methods.
In the present invention, the trend of the evaluation values described above is analyzed; for example, when it has been determined that the subject's mood drops at the end of the month, the happiness score occurring at that month's end may be corrected by multiplying it by a predetermined coefficient. That is, when an evaluation diverging from the trend occurs, the true emotion can be analyzed by weighting that evaluation more heavily.
For example, as shown in FIG. 21, suppose that for a subject whose happiness (happy) trend is indicated by the solid line, the value in late February (Feb.) is P1. Although the happy score should normally be low at the end of the month, the value diverges from the trend. In this case, the evaluation value may be corrected by the amount of divergence from the trend (by adding the divergence), so that it is treated as a value that should be specially evaluated. The correction method may be to add a positive divergence from the trend or subtract a negative one, or any other method.
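As a rough sketch of this trend correction, the code below uses a moving average as the trend and emphasizes values that diverge from it by adding the deviation back; the trend model and window size are assumptions.

```python
import numpy as np

def trend_corrected(values, window=5):
    """Emphasize evaluation values that diverge from a moving-average
    trend by adding their deviation from the trend back onto them."""
    v = np.asarray(values, dtype=float)
    pad = window // 2
    padded = np.pad(v, pad, mode="edge")
    trend = np.convolve(padded, np.ones(window) / window, mode="valid")
    return v + (v - trend), trend

# A happiness spike where the monthly trend predicted a dip.
happy = [0.50, 0.45, 0.40, 0.42, 0.75, 0.50, 0.48]
corrected, trend = trend_corrected(happy)
print(np.round(corrected, 2))
```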
<Ninth Embodiment>
A system according to a ninth embodiment of the present invention will now be described. The system according to the present embodiment acquires subjective answers from the subject in advance (questionnaires, interviews, and so on) on a certain theme, and compares them with the acquired evaluation values. As a result, for example, even if a subjective answer such as "I have no complaints about the lecture" was obtained in a questionnaire, it can be recognized that some deference is at work if the happiness evaluation is low when the actual facial expressions and voice are analyzed. The system according to the present embodiment is particularly suitable in the field of company awareness surveys of employees.
For example, the questionnaire for the subject may include questions about happiness (job satisfaction at the company, openness of the workplace, and so on), questions about anxiety (things the subject is troubled by or afraid of, and so on), and questions about future security (securing a career path, promotion, salary increases, and so on); the answers may then be compared with the evaluation values for happiness, anxiety, and security obtained from the subject's facial expressions and voice.
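One simple way to surface such gaps is a per-theme comparison of self-reported and measured scores, as sketched below; the 0-1 scales and the gap threshold are illustrative assumptions.

```python
def honesty_gaps(survey, measured, gap=0.4):
    """Return themes where the self-reported score and the measured
    evaluation value diverge strongly, suggesting possible deference."""
    return {
        theme: (survey[theme], measured[theme])
        for theme in survey
        if abs(survey[theme] - measured[theme]) >= gap
    }

survey = {"happiness": 0.9, "anxiety": 0.2, "security": 0.8}
measured = {"happiness": 0.3, "anxiety": 0.7, "security": 0.75}
print(honesty_gaps(survey, measured))
# {'happiness': (0.9, 0.3), 'anxiety': (0.2, 0.7)} -> possible deference
```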
By performing such a comparison for all employees of the organization to which the subject belongs, the employees' true feelings can be analyzed, making it possible to evaluate the organization as a whole.
<Tenth Embodiment>
A tenth embodiment of the present invention will be described with reference to FIG. 22. The system according to the present embodiment accepts, after the fact, labeling (annotation) from the subject about the situation at the time, applied to the evaluation values obtained from the subject's facial expressions and voice. This allows the evaluation values to be assessed subjectively after the fact, and the algorithm can also be updated by feeding back the results of the assessment.
As illustrated, the happiness graph may be labeled by time period, as with labels Lab. 1 to Lab. 4, and the evaluation values and the contents of the received labels may be displayed superimposed.
Various information can be added as labeling content, for example "the evaluation value is right/wrong for me", "the situation at that time", and "a numerical value by my own standard".
<Eleventh Embodiment>
A system according to an eleventh embodiment of the present invention will now be described. The system according to the present embodiment relates to a sales video session conducted between the sales-side terminal of a sales representative and the client-side terminal of a client contact. Conventionally, sales representatives have drawn up sales plans by predicting the closing rate from their own impressions and experience of sales meetings, and calculating an expected value by multiplying the transaction amount of the deal by the predicted closing rate. According to the present system, by analyzing sales video sessions, the closing rate can be estimated by machine-learning statistical processing based on data from past sales video sessions and their closing results.
The system includes closing information acquisition means for acquiring closing information of past sales video sessions; face recognition means for recognizing, for each predetermined frame, the face image of at least one of the sales representative and the client contact contained in the moving image of the sales video session; voice recognition means for recognizing the voice of at least one of the sales representative and the client contact contained in the moving image; and evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face images and voices. In particular, the system includes model generation means for generating, using the evaluation values and the closing information as training data, a closing estimation model that estimates the closing rate of the moving images of other sales video sessions, and uses that model to determine the closing rate of a new sales video session.
The closing rate may be a numerical value such as 50% or 70%, or a rank (zone) such as A, B, or C. The closing rate may also be calculated based on similarity to past moving images of closed deals. For example, if the similarity between the moving image of a new sales video session and a past moving image of a closed deal with the same client (or a similar client) is 70%, the closing rate of the new deal may likewise be determined to be 70%.
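As one sketch of the closing estimation model, a simple classifier trained on per-session evaluation features and closing results could output a closing probability, as below using scikit-learn's logistic regression; the features and data are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-session features (e.g., mean happy, mean safety,
# mean concentration) and their closing results (1 = deal closed).
X = np.array([[0.8, 0.7, 0.9],
              [0.3, 0.4, 0.5],
              [0.7, 0.8, 0.8],
              [0.2, 0.3, 0.6],
              [0.6, 0.6, 0.7]])
y = np.array([1, 0, 1, 0, 1])

model = LogisticRegression().fit(X, y)

new_session = np.array([[0.65, 0.70, 0.75]])
rate = model.predict_proba(new_session)[0, 1]
print(f"estimated closing rate: {rate:.0%}")
```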
Within the organization, the system calculates expected sales figures for a predetermined period, such as the current month or each quarter, based on the estimated transaction amount of each sales video session and the estimated closing rate (rank).
<Twelfth Embodiment>
A system according to a twelfth embodiment of the present invention will be described with reference to FIG. 23. The system according to the present embodiment is suitable mainly for use in online learning guidance. The system includes a lecturer terminal and student terminals communicably connected to one another via a network.
The lecturer terminal has a lecturer-side camera for capturing at least the face of the lecturer. As illustrated, each student terminal has a face camera for capturing at least the face of the student, and a hand camera for capturing the student's hands (the state of writing in a notebook or on a handout, and the state of the desk).
In particular, the system includes hand motion recognition means for recognizing, for each predetermined frame, the motion of the student's hands in the moving image acquired from the hand camera, and estimation means for estimating the student's degree of understanding based on the recognized hand motion.
The estimation means estimates the student's degree of understanding according to the amount of recognized hand motion. For example, it can evaluate whether the amount of note-taking matches the amount written on the board by the lecturer (whether proper notes are being taken), or, by analyzing the colors of the pens being used, whether the student is taking measures such as color-coding important points.
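A toy sketch of this estimation is given below; the stroke counts, thresholds, and color bonus are assumptions chosen only to illustrate the two cues named above (note volume relative to the board, and color-coding).

```python
def understanding_score(student_strokes, board_strokes, pen_colors):
    """Estimate understanding from hand-camera cues: note volume
    relative to the board, plus a bonus for color-coding."""
    if board_strokes == 0:
        return None                       # nothing to compare against
    coverage = min(student_strokes / board_strokes, 1.0)
    color_bonus = 0.2 if len(set(pen_colors)) >= 2 else 0.0
    return min(1.0, 0.8 * coverage + color_bonus)

# A student who copied most of the board and color-codes key points.
print(understanding_score(45, 50, ["black", "red"]))    # high score
# A student who barely wrote anything.
print(understanding_score(5, 50, ["black"]))            # low score
```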
It is also possible to detect situations where no face image appears on the face camera (that is, the student is facing down) yet no hand motion is detected (the student is not taking notes, or is dozing), and to issue an alert to the lecturer terminal or the student terminal.
As described above, the degree of understanding may also be estimated based on both the evaluation values derived from the student's facial expressions and voice and the hand motion. In this case, if the evaluated emotions indicate that the student is irritated or anxious while no hand motion is detected by the hand camera, it can be inferred that the lecture is not being delivered effectively.
<Supplementary Hardware Configuration>
The series of processes performed by the apparatus described in this specification may be realized using software, hardware, or a combination of software and hardware. A computer program for realizing each function of the information sharing support device 10 according to the present embodiment can be created and implemented on a PC or the like. A computer-readable recording medium storing such a computer program can also be provided. The recording medium is, for example, a magnetic disk, an optical disc, a magneto-optical disc, or a flash memory. The computer program may also be distributed, for example, via a network without using a recording medium.
The processes described in this specification using flowcharts need not necessarily be executed in the illustrated order. Some processing steps may be executed in parallel, additional processing steps may be adopted, and some processing steps may be omitted.
The embodiments described above may be combined as appropriate. The effects described in this specification are merely explanatory or illustrative, and are not limiting. That is, the technology according to the present disclosure may exhibit other effects that are apparent to those skilled in the art from the description in this specification, in addition to or instead of the above effects.
10, 20 User terminal
30 Video session service terminal
40 Evaluation terminal

Claims (6)

1. A moving image analysis system comprising:
acquisition means for acquiring a moving image relating to a video session conducted between at least two user terminals;
face recognition means for recognizing a user's face image contained in the moving image for each predetermined frame;
voice recognition means for recognizing at least the voice of the user contained in the moving image;
evaluation means for calculating evaluation values from a plurality of viewpoints based on both the recognized face image and the recognized voice;
storage means for storing the evaluation values as change information along a time series;
detection means for detecting that only the evaluation value for one of the plurality of viewpoints has changed beyond a predetermined range; and
peculiar frame acquisition means for acquiring a peculiar frame containing the detected range.
2. The moving image analysis system according to claim 1, wherein
the plurality of viewpoints include a first viewpoint and a second viewpoint associated with mutually opposing attributes, and
the detection means detects that the evaluation value for the first viewpoint and the evaluation value for the second viewpoint have diverged beyond the predetermined range.
3. The moving image analysis system according to claim 1, wherein
the detection means detects that, within a predetermined time after a first time point, the evaluation value for one of the viewpoints changes beyond a predetermined range and then immediately returns to substantially the same value as the evaluation value at the first time point.
4. The moving image analysis system according to any one of claims 1 to 3, further comprising
digest video generation means for generating a digest video by concatenating a plurality of the peculiar frames acquired from within the moving image.
5. The moving image analysis system according to any one of claims 1 to 4, further comprising
peculiar frame text output means for converting the voice corresponding to the peculiar frame into text and outputting the text.
6. The moving image analysis system according to any one of claims 1 to 5, wherein
the video session is capable of sharing screen information displayed on the screen of one user terminal, the system further comprising
shared screen output means for outputting at least the screen information corresponding to the shared peculiar frame.

PCT/JP2021/003793 2021-02-02 2021-02-02 Video session evaluation terminal, video session evaluation system, and video session evaluation program WO2022168176A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2022518708A JPWO2022168176A1 (en) 2021-02-02 2021-02-02
PCT/JP2021/003793 WO2022168176A1 (en) 2021-02-02 2021-02-02 Video session evaluation terminal, video session evaluation system, and video session evaluation program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/003793 WO2022168176A1 (en) 2021-02-02 2021-02-02 Video session evaluation terminal, video session evaluation system, and video session evaluation program

Publications (1)

Publication Number Publication Date
WO2022168176A1 (en) 2022-08-11

Family

ID=82741244

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/003793 WO2022168176A1 (en) 2021-02-02 2021-02-02 Video session evaluation terminal, video session evaluation system, and video session evaluation program

Country Status (2)

Country Link
JP (1) JPWO2022168176A1 (en)
WO (1) WO2022168176A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008204193A (en) * 2007-02-20 2008-09-04 Nippon Telegr & Teleph Corp <Ntt> Content retrieval/recommendation method, content retrieval/recommendation device, and content retrieval/recommendation program
JP2016149063A (en) * 2015-02-13 2016-08-18 オムロン株式会社 Emotion estimation system and emotion estimation method
JP2018068618A (en) * 2016-10-28 2018-05-10 株式会社東芝 Emotion estimating device, emotion estimating method, emotion estimating program, and emotion counting system
JP2019061594A (en) * 2017-09-28 2019-04-18 株式会社野村総合研究所 Conference support system and conference support program


Also Published As

Publication number Publication date
JPWO2022168176A1 (en) 2022-08-11

Similar Documents

Publication Publication Date Title
WO2022168180A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168185A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180860A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168176A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168183A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168178A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168175A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168182A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168177A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168179A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168184A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168174A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022168181A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
JP7152825B1 (en) VIDEO SESSION EVALUATION TERMINAL, VIDEO SESSION EVALUATION SYSTEM AND VIDEO SESSION EVALUATION PROGRAM
WO2023032058A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180852A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180858A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180855A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180861A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180859A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022180857A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
WO2022230155A1 (en) Video analysis system
WO2022180856A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program
JP7138998B1 (en) VIDEO SESSION EVALUATION TERMINAL, VIDEO SESSION EVALUATION SYSTEM AND VIDEO SESSION EVALUATION PROGRAM
WO2022180862A1 (en) Video session evaluation terminal, video session evaluation system, and video session evaluation program

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022518708

Country of ref document: JP

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21924573

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21924573

Country of ref document: EP

Kind code of ref document: A1