WO2024009623A1 - Evaluation system, evaluation device, and evaluation method - Google Patents

Evaluation system, evaluation device, and evaluation method

Info

Publication number
WO2024009623A1
Authority
WO
WIPO (PCT)
Prior art keywords
person
satisfaction
feature amount
unit
satisfaction level
Prior art date
Application number
PCT/JP2023/018500
Other languages
French (fr)
Japanese (ja)
Inventor
孝治 堀内
武志 安慶
裕人 冨田
純子 上田
義照 田中
毅 吉原
康 岡田
Original Assignee
パナソニックIpマネジメント株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by パナソニックIpマネジメント株式会社
Publication of WO2024009623A1 publication Critical patent/WO2024009623A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present disclosure relates to an evaluation system, an evaluation device, and an evaluation method.
  • Patent Document 1 discloses a store management system that calculates the employee satisfaction level of a store clerk based on a conversation between the store clerk and a conversation partner.
  • the store management system stores calculation algorithms for calculating employee satisfaction for each type of person who can be a conversation partner.
  • the store management system acquires a conversation between a store employee and a conversation partner, recognizes the store employee's emotion based on the store employee's voice included in the conversation, and determines the type of conversation partner.
  • the store management system calculates the employee's satisfaction level based on the recognition result of the employee's emotion and a calculation algorithm corresponding to the determined type of conversation partner.
  • emotional information is calculated from voice data based on a conversation between people (for example, a store clerk and a customer), and the store clerk's satisfaction level (that is, employee satisfaction level) is calculated from the calculated emotional information.
  • evaluating a person's satisfaction level based only on voice data included in a conversation may not be accurate enough, and a more accurate satisfaction evaluation is required.
  • the present disclosure was devised in view of the conventional situation described above, and aims to perform highly accurate satisfaction evaluation using multiple pieces of information included in conversations between people.
  • the present disclosure provides an evaluation system comprising: an acquisition unit that acquires audio data related to a conversation between a first person and a second person; an imaging unit that images the first person and the second person; an extraction unit that extracts, based on the imaging data, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data; and a satisfaction calculation unit that calculates the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
  • the present disclosure also provides an evaluation device comprising: an extraction unit that acquires audio data related to a conversation between a first person and a second person and imaging data obtained by imaging the first person and the second person, and that extracts, based on the imaging data, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data; and a satisfaction calculation unit that calculates the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
  • the present disclosure further provides an evaluation method that acquires audio data related to a conversation between a first person and a second person, images the first person and the second person, extracts, based on the imaging data, a first feature amount related to the line of sight or face direction of each of the first person and the second person and a second feature amount of the audio data, and calculates the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
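  • A minimal Python sketch of the evaluation method described above, stringing together the acquisition, extraction, and calculation steps; the function names and the stub extractors are assumptions made for illustration and are not part of the disclosure.

```python
# Illustrative sketch: acquire data, extract the first and second feature amounts,
# then calculate the first person's satisfaction level.
def evaluate_satisfaction(imaging_data, audio_data,
                          extract_gaze_features,   # -> first feature amount (line of sight / face direction)
                          extract_audio_features,  # -> second feature amount (speech features)
                          calculation_algorithm):  # -> satisfaction level
    first_feature = extract_gaze_features(imaging_data)
    second_feature = extract_audio_features(audio_data)
    return calculation_algorithm(first_feature, second_feature)

# Toy usage with stub extractors and a stub calculation algorithm.
print(evaluate_satisfaction(
    imaging_data=None, audio_data=None,
    extract_gaze_features=lambda img: {"gaze_time_sec": 25.0},
    extract_audio_features=lambda aud: {"speech_rate": 0.55},
    calculation_algorithm=lambda f1, f2: 3.0 + 0.5 * (f2["speech_rate"] >= 0.5)))
```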
  • FIG. 1 is a diagram showing an overview of this embodiment.
  • FIG. 2 is a diagram showing an example of feature amounts.
  • FIG. 3 is a block diagram showing an example of the internal configuration of a terminal device and a server according to Embodiment 1.
  • FIG. 4 is a diagram showing a method for calculating the satisfaction level with a logic-based algorithm.
  • FIG. 5 is a diagram showing an example of calculating the satisfaction level at predetermined time intervals.
  • FIG. 6 is a sequence diagram of satisfaction evaluation processing according to Embodiment 1.
  • FIG. 7 is a diagram showing an example of the internal configuration of a terminal device and a server according to Embodiment 2.
  • FIG. 8 is a sequence diagram of satisfaction evaluation processing according to Embodiment 2.
  • FIG. 9 is a diagram showing an example of the internal configuration of a terminal device according to Embodiment 3.
  • FIG. 10 is a flowchart showing the process of calculating the satisfaction level on a terminal device.
  • FIG. 11 is a sequence diagram showing the process in which the server calculates the satisfaction level from previously captured image data and audio data.
  • FIG. 12 is a diagram showing an example of a screen displayed on a terminal device.
  • FIG. 13 is a diagram showing an example of a screen on which a message is displayed depending on the satisfaction level result.
  • FIG. 1 is a diagram showing an overview of this embodiment.
  • FIG. 1 shows a case where a person A is having a conversation with a person B using a terminal device 1 that is connected via a network to the terminal device used by the person B.
  • Person A and Person B have an interpersonal relationship; for example, Person A is Person B's subordinate, and Person B is Person A's superior.
  • the relationship between Person A and Person B is not limited to that between a boss and a subordinate, but may be between employees and customers, between colleagues, between an interviewer and an interviewee, or in any other relationship (for example, between a teacher and a student).
  • person B, who is a boss, interviews person A, who is a subordinate, online.
  • person A may be read as a first person, and person B may be read as a second person.
  • the audio acquisition device 10 is, for example, a microphone, and picks up person A's utterance CO.
  • the audio acquisition device 10 may be installed in the terminal device 1 or may be an external device communicably connected to the terminal device 1.
  • the data collected by the audio acquisition device 10 will be referred to as audio data.
  • the imaging device 11 is, for example, a camera, and images the person A.
  • the imaging device 11 may be installed in the terminal device 1 or may be an external device communicably connected to the terminal device 1.
  • the data of person A captured by the imaging device 11 will be referred to as imaging data.
  • the terminal device 1 transmits the audio data acquired by the audio acquisition device 10 and the imaging data acquired by the imaging device 11 to a device that extracts feature amounts.
  • the device that extracts the feature amount is, for example, a server. Note that the terminal device 1 may extract the feature amount without transmitting the audio data and the imaging data to the server.
  • the feature amounts extracted from the imaging data and audio data are, for example, facial expressions, line of sight, speech, or actions. Note that the feature amounts extracted from the imaging data and audio data are not limited to these.
  • Information on facial expression or line of sight is extracted from image FR1 representing the face of person A in the captured image data.
  • Information regarding the behavior is extracted from the image FR2 representing the upper body of the person A in the image data.
  • Information related to the utterance is extracted from the audio data.
  • Information related to facial expressions, line of sight, speech, or actions is extracted by the terminal device 1 or the server 2 (see FIG. 7), and will be described in detail later.
  • the degree of satisfaction is calculated using the extracted feature data (hereinafter referred to as feature amount data) and an algorithm for estimating the degree of satisfaction (hereinafter referred to as satisfaction degree estimation algorithm).
  • Satisfaction is an index representing the degree of satisfaction of person A with the conversation with person B, which is estimated by a satisfaction estimation algorithm based on feature data.
  • Satisfaction estimation algorithms include algorithms based on predetermined logic (hereinafter referred to as logic-based algorithms) and algorithms based on machine learning (hereinafter referred to as machine learning-based algorithms).
  • a logic-based algorithm is an algorithm that defines a procedure for calculating satisfaction by repeatedly adding and subtracting points based on predetermined logic.
  • Machine learning-based algorithms are, for example, algorithms that use deep learning based on multilayer perceptrons, random forests, or convolutional neural networks as a configuration and directly output satisfaction levels from feature data.
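  • A minimal sketch of what a machine-learning-based algorithm of this kind could look like, assuming a small NumPy multilayer perceptron with untrained weights; the architecture, feature dimensionality, and the 0-to-5 output range are illustrative assumptions.

```python
import numpy as np

def mlp_satisfaction(feature_vec, w1, b1, w2, b2):
    """Toy multilayer perceptron mapping a feature-amount vector to a satisfaction score.
    In an actual system the weights would be learned from labelled conversation data."""
    hidden = np.tanh(feature_vec @ w1 + b1)     # hidden layer
    score = hidden @ w2 + b2                    # scalar output
    return float(np.clip(score, 0.0, 5.0))      # keep within an assumed 0-to-5 range

# Example with untrained (random) parameters and a 6-dimensional feature vector.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(6, 8)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0
print(mlp_satisfaction(rng.normal(size=6), w1, b1, w2, b2))
```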
  • by checking the satisfaction level calculated from the audio data and imaging data, person B can confirm whether person A is highly satisfied and can communicate with person A smoothly. Note that the conversation between person A and person B is not limited to an online interview or the like, and may be a face-to-face conversation.
  • FIG. 2 is a diagram showing an example of feature amounts.
  • facial expressions include a smiling face, a straight face, or a crying face.
  • Actions include nodding, standing still, or tilting one's head.
  • Embodiment 1: In the evaluation system 100 according to Embodiment 1, when users (for example, person A and person B) are having an online conversation, a terminal device extracts the users' feature amount data, the extracted feature amount data is sent to a server, and the satisfaction level is calculated by the server.
  • FIG. 3 is a block diagram showing an example of the internal configuration of the terminal device and the server according to the first embodiment.
  • the evaluation system 100A includes at least a terminal device 1A and a server 2A.
  • the number of terminal devices is not limited to one and may be two or more.
  • the terminal device 1A is an example of a terminal used by a user. Note that when individual evaluation systems, terminal devices, and servers are distinguished, a letter is appended after the reference number; when they are not distinguished, only the number is used in the description.
  • the terminal device 1A and the server 2A are communicably connected via the network NW.
  • the terminal device 1A and the server 2A may be communicably connected via a wired LAN (Local Area Network).
  • the terminal device 1A and the server 2A may perform wireless communication (for example, wireless LAN such as Wi-Fi (registered trademark)) without going through the network NW.
  • the terminal device 1A includes at least a communication I/F 13, a memory 14, an input device 15, a display device 16, an I/F 17, an audio acquisition device 10, an imaging device 11, and a processor 12.
  • the terminal device 1A is a PC (Personal Computer), a tablet, a mobile terminal, a housing including the audio acquisition device 10 and the imaging device 11, or the like.
  • the communication I/F 13 is a network interface circuit that performs wireless or wired communication with the network NW.
  • I/F represents an interface.
  • the terminal device 1A is communicably connected to the server 2A via the communication I/F 13 and the network NW.
  • the communication I/F 13 transmits the feature amount data extracted by the feature amount extraction unit 12A (see below) to the server 2A.
  • Communication methods used by the communication I/F 13 include, for example, WAN (Wide Area Network), LAN (Local Area Network), LTE (Long Term Evolution), mobile communication such as 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), communication for mobile phones, and the like.
  • the memory 14 includes, for example, a RAM (Random Access Memory) used as a work memory when each process of the processor 12 is executed, and a ROM (Read Only Memory) that stores programs and data defining the operations of the processor 12. Data or information generated or acquired by the processor 12 is temporarily stored in the RAM. A program that defines the operation of the processor 12 is written in the ROM.
  • the input device 15 receives input from a user (for example, person A or person B).
  • the input device 15 is, for example, a touch panel display or a keyboard.
  • the input device 15 accepts operations in response to instructions displayed on the display device 16.
  • the display device 16 displays a screen (see below) created by the drawing screen creation unit 24B of the server 2.
  • the display device 16 is, for example, a display or a notebook PC monitor.
  • the I/F 17 is a software interface.
  • the I/F 17 is communicably connected to the communication I/F 13, the memory 14, the input device 15, the display device 16, the audio acquisition device 10, the imaging device 11, and the processor 12, and exchanges data with each device.
  • the I/F 17 may be omitted from the terminal device 1A, and data may be exchanged between the devices of the terminal device 1A.
  • the audio acquisition device 10 picks up the utterances of a user (for example, person A or person B).
  • the audio acquisition device 10 is configured with a microphone device that can collect audio generated based on a user's utterance (that is, detect an audio signal).
  • the audio acquisition device 10 collects audio generated based on a user's utterance, converts it into an electrical signal as audio data, and outputs the electrical signal to the I/F 17.
  • the imaging device 11 is a camera that images a user (for example, person A or person B).
  • the imaging device 11 includes at least a lens (not shown) as an optical element and an image sensor (not shown).
  • the lens receives light reflected by the object from within the angle of view of the imaged area of the imaging device 11 and forms an optical image of the object on the light receiving surface (in other words, the imaging surface) of the image sensor.
  • the image sensor is, for example, a solid-state imaging device such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor).
  • the image sensor converts an optical image formed on an imaging surface through a lens into an electrical signal and sends it to the I/F 17 at predetermined time intervals (for example, 1/30 seconds).
  • the audio acquisition device 10 and the imaging device 11 may be external devices that are communicably connected to the terminal device 1A.
  • the processor 12 is a semiconductor chip on which at least one of electronic devices such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), a GPU (Graphics Processing Unit), and an FPGA (Field Programmable Gate Array) is mounted.
  • the processor 12 functions as a controller that controls the overall operation of the terminal device 1A, and performs control processing for supervising the operation of each part of the terminal device 1A, data input/output processing with the I/F 17, data arithmetic processing, and data storage processing.
  • the processor 12 realizes the function of the feature extraction unit 12A.
  • the processor 12 uses the RAM of the memory 14 during operation, and temporarily stores data generated or acquired by the processor 12 in the RAM of the memory 14.
  • the feature amount extraction unit 12A extracts feature amounts (see FIG. 2) based on the audio data acquired from the audio acquisition device 10 and the image data acquired from the imaging device 11.
  • the feature amount extraction unit 12A may extract each feature amount from the audio data and the imaging data using, for example, trained model data for AI (Artificial Intelligence) processing stored in the memory 14 (in other words, based on AI).
  • the feature amount extraction unit 12A detects the face part of the person A from the image data, and also detects the direction (in other words, the line of sight) of both eyes (that is, the left eye and the right eye) of the detected face part.
  • the feature extracting unit 12A detects the line of sight of the person A who is viewing the screen displayed on the display device 16 (for example, the captured video of the person B).
  • the line-of-sight detection method can be realized using publicly known techniques; for example, the line of sight may be detected based on the difference in the orientation of both eyes reflected in each of a plurality of captured images (frames), or other detection methods may be used.
  • the feature extracting unit 12A detects the facial part of the person A from the image data and also detects the direction of the face.
  • the direction of the face is the angle of the face with respect to a specific location on the display device 16 (for example, the center position of the panel of the display device 16).
  • the angle of the face is a vector representation composed of an azimuth angle and an elevation angle indicating the three-dimensional direction, as viewed from the specific location on the display device 16 (see above), in which the face of person A looking at that location exists. Note that the specific location is not limited to the center position of the panel.
  • the face direction detection method can be realized using known techniques.
  • the feature extracting unit 12A detects the face part of the person A from the image data and also detects the facial expression of the person A.
  • the facial expression detection method can be realized using known techniques.
  • the feature extraction unit 12A detects the motion of person A from the image data.
  • the motion detection method can be realized using known techniques.
  • the feature extraction unit 12A detects the speaking time of person A from the voice data.
  • the speaking time may be detected by, for example, integrating the durations of the portions of the voice data in which person A's voice signal is detected. Note that the speech time detection method may be implemented using other known techniques. Furthermore, the feature extraction unit 12A calculates the rates at which person A and person B are each speaking, based on the detected speaking times.
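  • A minimal sketch of the integration described above, assuming that voiced portions have already been detected as (start, end) segments in seconds; the segment detection itself is left to the known techniques mentioned.

```python
def speaking_time(voiced_segments):
    """Total speaking time: sum of the durations of (start_sec, end_sec) segments
    in which person A's voice signal was detected."""
    return sum(end - start for start, end in voiced_segments)

# Two voiced portions totalling roughly 1.0 s within a 2.5 s window.
print(speaking_time([(0.2, 0.8), (1.5, 1.9)]))  # ~1.0
```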
  • the feature extraction unit 12A detects the emotion of person A from the voice data.
  • the feature extraction unit 12A detects the emotion by detecting, for example, the intensity of the voice, the number of moras per unit time, the intensity of each word, the volume, or the spectrum of the voice, etc. from the voice data.
  • the emotion detection method is not limited to this, and may be realized by other known techniques.
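  • A minimal sketch, assuming the raw waveform is available as a NumPy array, of computing two of the cues mentioned above (volume as frame-wise RMS energy and a magnitude spectrum via the FFT); mapping such cues to an emotion label is left to the known techniques mentioned.

```python
import numpy as np

def frame_rms(waveform, frame_len=400):
    """Frame-wise RMS energy, a simple stand-in for voice intensity/volume."""
    n_frames = len(waveform) // frame_len
    frames = waveform[:n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))

def magnitude_spectrum(waveform):
    """Magnitude spectrum of the waveform via the FFT."""
    return np.abs(np.fft.rfft(waveform))

# One second of a synthetic 200 Hz tone at 16 kHz, standing in for conversation audio.
sr = 16_000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 200 * t)
print(frame_rms(tone).mean())             # close to 1/sqrt(2) for a pure tone
print(magnitude_spectrum(tone).argmax())  # FFT bin near 200 Hz
```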
  • the server 2A includes a communication I/F 21, a memory 22, an input device 23, an I/F 26, and a processor 24.
  • the communication I/F 21 transmits and receives data to and from each of the one or more terminal devices 1 via the network NW.
  • the communication I/F 21 transmits data of a screen output from the I/F 26 to be displayed on the display device 16 to the terminal device 1A.
  • the memory 22 includes, for example, a RAM as a work memory used when the processor 24 executes each process, and a ROM that stores programs and data that define the operations of the processor 24. Data or information generated or acquired by the processor 24 is temporarily stored in the RAM. A program that defines the operation of the processor 24 is written in the ROM.
  • the memory 22 also stores a satisfaction level estimation algorithm.
  • the input device 23 receives input from a user (for example, an administrator of the evaluation system 100).
  • the input device 23 is, for example, a touch panel display or a keyboard.
  • the input device 23 accepts the setting of threshold values (see below) for the logic-based algorithm.
  • the I/F 26 is a software interface.
  • the I/F 26 is communicably connected to the communication I/F 21, the memory 22, the input device 23, and the processor 24, and exchanges data with each device. Note that the I/F 26 may be omitted from the server 2A, and data may be exchanged between the devices of the server 2A.
  • the processor 24 is a semiconductor chip on which at least one of electronic devices such as a CPU, a DSP, a GPU, and an FPGA is mounted.
  • the processor 24 functions as a controller that governs the overall operation of the server 2A, and performs control processing for supervising the operation of each part of the server 2A, data input/output processing with the I/F 26, data arithmetic processing, and data storage processing.
  • the processor 24 implements the functions of the satisfaction level estimation section 24A and the drawing screen creation section 24B.
  • the processor 24 uses the RAM of the memory 22 during operation, and temporarily stores data generated or obtained by the processor 24 in the RAM of the memory 22.
  • the satisfaction estimation unit 24A calculates the satisfaction of the person A using the feature amount data acquired from the terminal device 1A and the satisfaction estimation algorithm recorded in the memory 22.
  • the satisfaction level estimation unit 24A may calculate the satisfaction level using a logic-based algorithm, or may calculate the satisfaction level using a machine learning-based algorithm.
  • the satisfaction estimation unit 24A outputs information regarding the calculated satisfaction level to the drawing screen creation unit 24B.
  • the drawing screen creation unit 24B creates a screen to be displayed on the display device 16 of the terminal device 1A using the satisfaction level acquired from the satisfaction level estimation unit 24A.
  • the screen includes, for example, a captured video of person A, information regarding satisfaction, a button for controlling the start of satisfaction evaluation, and the like. Note that the items included on the screen are not limited to these.
  • Methods of displaying information regarding the satisfaction level include, for example, plotting satisfaction values calculated at predetermined time intervals numerically or on a graph each time, or displaying the satisfaction value during or at the end of a meeting.
  • the graph regarding the satisfaction value is a graph in which values are plotted, a bar graph, a meter, or the like.
  • the drawing screen creation unit 24B outputs the created screen to the I/F 26.
  • FIG. 4 is a diagram showing a method for calculating the satisfaction level of the logic-based algorithm.
  • satisfaction is calculated by adding and subtracting points according to predetermined rules (hereinafter referred to as a determination method).
  • the feature amounts used in the determination method are referred to as determination elements.
  • Points are added and subtracted at predetermined time intervals, throughout the conversation, from the start of the conversation to the current time, or over the last 30% of the conversation. Note that the time range over which points are added and subtracted is not limited to these and may be arbitrarily determined by the user.
  • the speech rate represents the percentage of time that a user (for example, person A or person B) speaks within a predetermined period of time. For example, if the user speaks for a total of 1.0 seconds out of 2.5 seconds, the speaking rate is 1.0/2.5, which is 0.4 (that is, 40%).
  • the speech rate can be calculated, for example, by extracting the user's speech time at specific time intervals and dividing the total of the extracted speech times by the extracted total time. Note that the method of calculating the speech rate is one example and is not limited to this. Calculation of the utterance rate may be performed by the feature amount extraction section 12A, or may be performed by the satisfaction level estimation section 24A based on the feature amount data acquired from the feature amount extraction section 12A.
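  • A minimal sketch of the speech-rate calculation above; the values reproduce the worked example of 1.0 seconds of speech within a 2.5-second window, and the function name is an assumption.

```python
def speech_rate(total_speaking_time, window_length):
    """Proportion of a time window in which the user was speaking."""
    return total_speaking_time / window_length

print(speech_rate(1.0, 2.5))  # -> 0.4, i.e. 40 %
```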
  • when the speech rate of person A is equal to or greater than the speech rate of person B, the satisfaction estimation algorithm adds 0.5 points to the satisfaction level. Note that the numerical values added and subtracted below are merely examples; they are not limited to 0.5 and may be any predetermined value. If the speech rate of person A is less than the speech rate of person B, the satisfaction estimation algorithm deducts 0.5 points from the satisfaction level.
  • points may be added or subtracted by taking into consideration not only the speech rate of person A relative to the speech rate of person B, but also whether the speech rate of person A is equal to or higher than a preset threshold. That is, when the speech rate of person A is equal to or greater than the speech rate of person B and the speech rate of person A is equal to or greater than the first threshold value, the satisfaction level estimation algorithm adds 0.5 points to the satisfaction level. If the speech rate of person A is less than the speech rate of person B and the speech rate of person A is less than a second threshold that is less than or equal to the first threshold, the satisfaction level estimation algorithm subtracts 0.5 points from the satisfaction level.
  • the first threshold is, for example, 50%
  • the second threshold is, for example, 40%. Note that the values of the first threshold value and the second threshold value are merely examples, and may be changed as appropriate by the user (for example, person B).
  • Note that when only one of these two conditions is satisfied (for example, when the speech rate of person A is equal to or greater than that of person B but less than the first threshold), points may be added, points may be subtracted, or no points may be added or subtracted. Further, when the speech rate of person A is equal to or greater than the first threshold, points may be added regardless of the speech rate of person B, and when the speech rate of person A is less than the second threshold, points may be deducted regardless of the speech rate of person B.
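  • A minimal sketch of the speech-rate point rule, assuming the example thresholds above (first threshold 50%, second threshold 40%) and a 0.5-point step; leaving the score unchanged in the remaining cases is only one of the options described, not the only one.

```python
def update_by_speech_rate(satisfaction, rate_a, rate_b,
                          first_threshold=0.5, second_threshold=0.4, step=0.5):
    """Add or deduct points based on person A's speech rate relative to person B's
    and the first/second thresholds; otherwise leave the satisfaction level unchanged."""
    if rate_a >= rate_b and rate_a >= first_threshold:
        satisfaction += step
    elif rate_a < rate_b and rate_a < second_threshold:
        satisfaction -= step
    return satisfaction

print(update_by_speech_rate(3.0, rate_a=0.55, rate_b=0.45))  # -> 3.5
```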
  • Emotion is an index calculated from conversation audio data.
  • a positive rate, a neutral rate, and a negative rate are calculated based on the emotion, facial expression, or action.
  • the satisfaction estimation algorithm uses the positive rate and negative rate to add and subtract points.
  • the positive rate indicates the rate at which the user's (for example, person A) emotion is determined to be positive within a predetermined period of time.
  • Examples of feature amounts that are determined to be positive include person A's voice becoming louder, the pitch of person A's voice becoming higher, person A nodding, or person A smiling. Note that the feature amounts that are determined to be positive are merely examples and are not limited to these.
  • the neutral rate indicates the rate at which the emotions of the user (for example, person A) are determined to be neutral within a predetermined period of time.
  • the neutral state is a state in which it is assumed that person A's emotions are neither positive nor negative.
  • a neutral state is a state in which person A is calm.
  • the feature amount that is determined to be neutral is, for example, that person A has a straight face or that person A is standing still. Note that the feature amounts that are determined to be neutral are merely examples, and are not limited to these.
  • the negative rate indicates the rate at which the emotions of the user (for example, person A) are determined to be negative within a predetermined period of time.
  • Feature amounts that are determined to be negative include, for example, person A having a crying face, person A's voice becoming quieter, the pitch of person A's voice becoming lower, or person A tilting his or her head. Note that the feature amounts that are determined to be negative are merely examples and are not limited to these.
  • For example, suppose the evaluation system 100 makes two positive determinations, two negative determinations, and one neutral determination within 2.5 seconds.
  • the positive rate is (1+1)/5, which is 0.4 (that is, 40%).
  • the negative rate is (1+1)/5, which is 0.4 (that is, 40%).
  • the neutral rate is 1/5, which is 0.2 (that is, 20%).
  • the calculation of the positive rate, neutral rate, and negative rate may be performed by the feature amount extraction unit 12A, or may be performed by the satisfaction level estimation unit 24A based on the feature amount data acquired from the feature amount extraction unit 12A.
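  • A minimal sketch that reproduces the worked example above (two positive, two negative, and one neutral determination within 2.5 seconds); counting discrete determinations this way is an assumption about how the rates are aggregated.

```python
from collections import Counter

def emotion_rates(determinations):
    """Ratio of positive / neutral / negative determinations within a time window."""
    counts = Counter(determinations)
    total = len(determinations)
    return {label: counts[label] / total
            for label in ("positive", "neutral", "negative")}

# Two positive, two negative, and one neutral determination within 2.5 seconds.
print(emotion_rates(["positive", "positive", "negative", "negative", "neutral"]))
# {'positive': 0.4, 'neutral': 0.2, 'negative': 0.4}
```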
  • if the positive rate of person A is equal to or greater than the threshold for adding points, the satisfaction estimation algorithm adds 0.5 points to the satisfaction level. If the negative rate of person A is equal to or greater than the threshold for deducting points, the satisfaction estimation algorithm deducts 0.5 points from the satisfaction level.
  • the threshold for adding points is 50%. In this case, if the positive rate is 50% or more, the satisfaction estimation algorithm adds 0.5 points to the satisfaction. Note that the threshold value for adding points is not limited to 50% and may be changed as appropriate by the user.
  • the threshold for deducting points is 50%. In this case, if the negative rate is 50% or more, the satisfaction estimation algorithm reduces the satisfaction level by 0.5 points. Note that the threshold value for deducting points is not limited to 50% and may be changed as appropriate by the user.
  • if the time during which person A looks in the direction of the display is equal to or greater than a third threshold, the satisfaction estimation algorithm adds 0.5 points to the satisfaction level. If the time during which person A looks in the direction of the display is less than a fourth threshold, which is equal to or less than the third threshold, 0.5 points are deducted from the satisfaction level.
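  • A minimal sketch of the two point rules above; the 50% adding/deducting thresholds follow the earlier example, while the third and fourth thresholds (20 seconds and 10 seconds of gaze time) are values assumed purely for illustration.

```python
def update_by_emotion(satisfaction, positive_rate, negative_rate,
                      add_threshold=0.5, deduct_threshold=0.5, step=0.5):
    """Add points when the positive rate reaches the adding threshold and deduct
    points when the negative rate reaches the deducting threshold."""
    if positive_rate >= add_threshold:
        satisfaction += step
    if negative_rate >= deduct_threshold:
        satisfaction -= step
    return satisfaction

def update_by_gaze(satisfaction, gaze_time_sec,
                   third_threshold=20.0, fourth_threshold=10.0, step=0.5):
    """Add points when person A looks toward the display for at least the third
    threshold, deduct points when the time falls below the fourth threshold."""
    if gaze_time_sec >= third_threshold:
        satisfaction += step
    elif gaze_time_sec < fourth_threshold:
        satisfaction -= step
    return satisfaction

print(update_by_gaze(update_by_emotion(3.0, positive_rate=0.6, negative_rate=0.2),
                     gaze_time_sec=25.0))  # -> 4.0
```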
  • FIG. 5 is a diagram illustrating an example of calculating satisfaction levels at predetermined time intervals.
  • the satisfaction value at the start of evaluation is set to 3, and the satisfaction estimation algorithm repeatedly adds and subtracts satisfaction points.
  • the satisfaction value at the start of the evaluation is not limited to 3 and may be any value.
  • the satisfaction level is assumed to take a value between 0 and 5 points. Note that the range of values that the satisfaction level can take is not limited to 0 to 5 points, but may be in other ranges, and the range does not need to be set.
  • the graphs for Case CA and Case CB are plots of satisfaction values calculated every 30 seconds.
  • the horizontal axis of the graphs for case CA and case CB represents elapsed time, and the vertical axis represents satisfaction level.
  • Case CA and case CB are, for example, cases in which the conversation ends in 5 minutes.
  • in case CA, the satisfaction level estimation algorithm repeatedly adds or subtracts points every 30 seconds, and when 5 minutes have passed, the satisfaction level is 5 points, indicating that the user (for example, person A) ended the conversation with a high level of satisfaction.
  • in case CB, the satisfaction level estimation algorithm repeatedly adds or subtracts points every 30 seconds, and when 5 minutes have passed, the satisfaction level is 0 points, indicating that the user (for example, person A) ended the conversation with a low level of satisfaction.
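  • A minimal sketch of the repeated calculation shown in FIG. 5: starting from 3 points, applying one add/subtract result every 30 seconds over a 5-minute conversation, and clamping to the 0-to-5 range; the per-interval deltas are made up for illustration.

```python
def run_evaluation(deltas, start=3.0, lower=0.0, upper=5.0):
    """Apply one add/subtract result per 30-second interval, clamping to the 0-5 range."""
    history, satisfaction = [start], start
    for delta in deltas:
        satisfaction = min(upper, max(lower, satisfaction + delta))
        history.append(satisfaction)
    return history

# Ten 30-second intervals = a 5-minute conversation; mostly positive updates end
# near 5 points, as in case CA.
print(run_evaluation([0.5, 0.5, 0.0, 0.5, -0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))
```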
  • FIG. 6 is a sequence diagram of satisfaction evaluation processing according to the first embodiment.
  • the satisfaction level is evaluated by two terminal devices (terminal device 1AA, terminal device 1AB) and server 2A.
  • the evaluation system 100A calculates the satisfaction level of person A from a conversation between person A and person B.
  • person A who is the person to be evaluated, uses terminal device 1AA, and person B uses terminal device 1AB.
  • the number of terminal devices is not limited to two, and may be one, or three or more.
  • the terminal device 1AA sets the values of each threshold value related to addition and deduction of satisfaction points by the satisfaction estimation algorithm (St100). Note that the setting of the threshold value in the terminal device 1AA may be omitted from the process related to FIG. 6.
  • the terminal device 1AA starts evaluating the satisfaction level of person A (St101).
  • the start of the satisfaction evaluation is executed, for example, by the user (for example, person B) pressing a button to start evaluation displayed on the display device 16.
  • the terminal device 1AA acquires image data and audio data of person A (St102).
  • the terminal device 1AA extracts feature amounts based on the imaging data and audio data acquired in the process of step St102 (St103).
  • the terminal device 1AB sets the values of each threshold regarding addition and deduction of satisfaction points by the satisfaction estimation algorithm (St104).
  • the threshold value may be set arbitrarily by the person B, or may be automatically set based on a set value stored in the memory 14 in advance. Further, the setting of the threshold value may be performed not in the terminal device 1AB but in the server 2A.
  • the terminal device 1AB starts evaluating the satisfaction level of person A (St105).
  • the terminal device 1AB acquires image data and audio data of person B (St106).
  • the terminal device 1AB extracts feature amounts based on the imaging data and audio data acquired in the process of step St106 (St107).
  • the terminal device 1AA transmits the threshold value set in the process of step St100 and the feature amount extracted in the process of step St103 to the server 2A.
  • the terminal device 1AB transmits the threshold setting value set in the process of step St104 and the feature amount extracted in the process of step St107 to the server 2A (St108).
  • the server 2A calculates the satisfaction level based on the threshold setting value, the feature amount, and the satisfaction level estimation algorithm obtained in the process of step St108 (St109).
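  • A minimal sketch of the exchange in steps St108 and St109, assuming the thresholds and feature amounts are serialized as JSON; the field names and the simplified server-side calculation are assumptions, not the actual protocol or algorithm.

```python
import json

# Example of what a terminal device might send after steps St103/St107:
# threshold settings plus the extracted feature amounts (field names are assumed).
payload = json.dumps({
    "person": "A",
    "thresholds": {"first": 0.5, "second": 0.4, "add": 0.5, "deduct": 0.5},
    "features": {"speech_rate": 0.55, "positive_rate": 0.6,
                 "negative_rate": 0.1, "gaze_time_sec": 25.0},
})

def handle_satisfaction_request(raw, satisfaction=3.0):
    """Simplified server-side handling corresponding to step St109."""
    data = json.loads(raw)
    thresholds, features = data["thresholds"], data["features"]
    if features["speech_rate"] >= thresholds["first"]:
        satisfaction += 0.5
    if features["positive_rate"] >= thresholds["add"]:
        satisfaction += 0.5
    return satisfaction

print(handle_satisfaction_request(payload))  # -> 4.0
```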
  • the terminal device 1AA requests the server 2A to transmit the satisfaction results (St110). Note that the process of step St110 may be omitted from the process related to FIG. 6.
  • the terminal device 1AB requests the server 2A to send the satisfaction results (St111).
  • the server 2A draws a screen related to the satisfaction level results.
  • the server 2A transmits a screen on which the satisfaction level results are drawn to the terminal device 1AB (St112).
  • the server 2A transmits a screen on which the satisfaction level results are drawn to the terminal device 1AA (St113).
  • the process of step St113 may be omitted from the process related to FIG. 6.
  • the terminal device 1AA displays the screen acquired in the process of step St113 on the display of the terminal device 1AA (St114).
  • the process of step St114 may be omitted from the process related to FIG. 6.
  • the terminal device 1AB displays the screen acquired in the process of step St112 on the display of the terminal device 1AB (St115).
  • the server 2A transmits a signal to end the evaluation to the terminal device 1AA and the terminal device 1AB (St116).
  • the terminal device 1AA ends the satisfaction evaluation based on the signal acquired in the process of step St116 (St117).
  • the terminal device 1AB ends the satisfaction evaluation based on the signal acquired in the process of step St116 (St118).
  • the terminal device 1AA transmits a request to send the final satisfaction result to the server 2A (St119).
  • the process of step St119 may be omitted from the process related to FIG. 6.
  • the terminal device 1AB transmits a request to transmit the final satisfaction result to the server 2A (St120).
  • the server 2A draws a screen related to the final satisfaction result based on the request obtained in the process of step St120.
  • the server 2A transmits a screen showing the final result of the satisfaction level to the terminal device 1AB (St121).
  • the server 2A draws a screen showing the final result of the satisfaction level.
  • the server 2A transmits a screen showing the final result of the satisfaction level to the terminal device 1AA (St122).
  • the process of step St122 may be omitted from the process of FIG. 6.
  • the terminal device 1AA displays the screen acquired in the process of step St122 on the display of the terminal device 1AA (St123).
  • the process of step St123 may be omitted from the process related to FIG. 6.
  • the terminal device 1AB displays the screen acquired in the process of step St121 on the display of the terminal device 1AB (St124).
  • Embodiment 2: In the evaluation system according to Embodiment 2, the server collectively performs the processing from extraction of the feature amounts to calculation of the satisfaction level, based on the imaging data and audio data acquired by the terminal devices.
  • the same reference numerals will be used for the same components as in Embodiment 1, and the description thereof will be omitted.
  • FIG. 7 is a diagram showing an example of the internal configuration of a terminal device and a server according to the second embodiment. Only the parts that are different from the hardware block diagram according to the first embodiment shown in FIG. 3 will be explained.
  • the feature extraction unit 12A is incorporated into the processor 24 of the server 2B. That is, the terminal device 1B includes a communication I/F 13, a memory 14, an input device 15, a display device 16, an I/F 17, an audio acquisition device 10, and an imaging device 11.
  • the server 2B includes a communication I/F 21, a memory 22, an input device 23, an I/F 26, and a processor 24.
  • the processor 24 realizes the functions of the feature amount extraction section 12A, the satisfaction estimation section 24A, and the drawing screen creation section 24B.
  • the feature amount extraction unit 12A extracts feature amounts based on the audio data and image data acquired from the terminal device 1B.
  • FIG. 8 is a sequence diagram of satisfaction evaluation processing according to the second embodiment. Processes similar to those in the sequence diagram of FIG. 6 of the first embodiment are given the same reference numerals, and only different processes will be described.
  • the terminal device 1BA transmits the threshold value set in the process of step St100 and the imaging data and audio data acquired in the process of step St102 to the server 2B (St200).
  • the terminal device 1BB transmits the threshold value set in the process of step St104 and the imaging data and audio data acquired in the process of step St106 to the server 2B (St200).
  • the server 2B extracts feature amounts based on the imaging data and audio data acquired in the process of step St200 (St201).
  • the server 2B calculates the degree of satisfaction based on the feature amount extracted in the process of step St201 (St202).
  • the following processing is the same as each processing related to the sequence diagram of FIG. 6, so the explanation will be omitted.
  • Embodiment 3: In the evaluation system according to Embodiment 3, the terminal device or the server calculates the satisfaction level based on imaging data and audio data previously acquired by the terminal device (that is, video and audio recorded in the past).
  • the same reference numerals will be used for the same components as in Embodiment 1, and the description thereof will be omitted.
  • FIG. 9 is a diagram showing an example of the internal configuration of a terminal device according to the third embodiment. Only the parts that are different from the hardware block diagram according to the first embodiment shown in FIG. 3 will be explained.
  • the terminal device 1C includes a communication I/F 13, a memory 14, an input device 15, a display device 16, an audio acquisition device 10, an imaging device 11, and a processor 12.
  • the audio acquisition device 10 and the imaging device 11 may be omitted.
  • the communication I/F 13 may transmit the screen drawn by the drawing screen creation unit 24B of the processor 12 to another terminal device or the like. Further, when the audio acquisition device 10 and the imaging device 11 are external devices, the communication I/F 13 acquires image data captured in the past and audio data captured in the past from the external devices.
  • the feature amount extraction unit 12A of the processor 12 extracts feature amounts based on image data captured in the past and audio data captured in the past.
  • the feature extraction unit 12A outputs the extracted feature data to the satisfaction estimation unit 24A.
  • the feature extraction unit 12A obtains one file that includes the imaging data and audio data of both person A and person B.
  • the feature extraction unit 12A separates the one file into four pieces of data, namely the imaging data and audio data of person A and the imaging data and audio data of person B, using known techniques such as image recognition or voice recognition.
  • Note that the feature extraction unit 12A may obtain two files: a file containing the imaging data and audio data of person A, and a file containing the imaging data and audio data of person B.
  • the feature extraction unit 12A separates each file into image data and audio data using a known technique.
  • the input device 15 may obtain an input from a user (for example, person A or person B) regarding whether each of the two files is associated with person A or person B.
  • Alternatively, the feature extraction unit 12A may obtain four files: a file of imaging data of person A, a file of audio data of person A, a file of imaging data of person B, and a file of audio data of person B.
  • the input device 15 may obtain input from the user (for example, person A or person B) regarding whether each of the four files is associated with person A or person B.
  • the hardware block diagram of the third embodiment is similar to FIG. 7 of the second embodiment.
  • the server 2B acquires audio data previously acquired by the audio acquisition device 10 of the terminal device 1B and imaging data previously acquired by the imaging device 11 of the terminal device 1B.
  • the feature amount extraction unit 12A of the server 2B extracts the feature amount based on the acquired audio data and image data, and the satisfaction level estimation unit 24A calculates the satisfaction level based on the extracted feature amount.
  • FIG. 10 is a flowchart illustrating the process of calculating the degree of satisfaction on the terminal device. Each process related to FIG. 10 is executed by the processor 12.
  • the processor 12 sets the values of each threshold regarding addition and deduction of satisfaction points by the satisfaction estimation algorithm (St300).
  • the processor 12 may set the threshold values by obtaining an input signal from the user (for example, person B) via the input device 15, or may set them automatically based on setting values stored in advance in the memory 14.
  • the processor 12 acquires previously captured image data and captured audio data stored in the memory 14 (St301). Note that the processor 12 is not limited to past data, and may acquire data currently being acquired by the audio acquisition device 10 and the imaging device 11 of the terminal device 1C.
  • the processor 12 extracts feature amounts from the imaging data and audio data acquired in the process of step St301 (St302).
  • the processor 12 calculates the satisfaction level of the user (for example, person A) based on the feature amount extracted in the process of step St302 (St303).
  • the processor 12 draws a screen showing the satisfaction level calculated in the process of step St303 (St304).
  • FIG. 11 is a sequence diagram showing a process in which the server calculates the satisfaction level from previously captured image data and audio data.
  • the terminal device 1B transmits to the server 2B threshold setting information regarding addition and deduction of satisfaction points based on the satisfaction estimation algorithm (St400).
  • the terminal device 1B transmits image data captured in the past and audio data captured in the past to the server 2B (St401).
  • the server 2B extracts feature amounts based on the imaging data and audio data acquired in the process of step St401 (St402).
  • the server 2B calculates the degree of satisfaction based on the feature amount acquired in the process of step St402 (St403).
  • the terminal device 1B requests the server 2B to send the final result of the satisfaction level (St404).
  • the server 2B draws a screen including the final satisfaction result based on the request received from the terminal device 1B in the process of step St404.
  • the server 2B transmits the drawn screen to the terminal device 1B (St405).
  • the terminal device 1B displays a screen including the final result of the satisfaction level obtained in the process of step St405 (St406).
  • FIG. 12 is a diagram showing an example of a screen displayed on a terminal device.
  • Screen MN1 is an example of a screen displayed on terminal device 1 at a certain moment during a meeting. For example, if Person A and Person B are having a meeting and Person A is the person to be evaluated, screen MN1 is the screen that Person B refers to. Screen MN1 includes display areas IT1, IT2 and buttons BT1, BT2, BT3, BT4, BT5, and BT6.
  • the display area IT2 is an area where the captured video of the person A is displayed in real time.
  • the drawing screen creation unit 24B displays the captured video of the person A acquired from the imaging device 11 in the display area IT2.
  • the display area IT1 is an area where the satisfaction results are displayed.
  • the display area IT1 displays a graph in which satisfaction values calculated at predetermined time intervals are plotted.
  • the drawing screen creation unit 24B may display the satisfaction level in the display area IT1 at the timing when the satisfaction level is acquired from the satisfaction level estimation unit 24A.
  • Note that the display area IT1 is not limited to graphs; it may display satisfaction values calculated from the start of the meeting up to the present as numbers, or may display satisfaction values calculated at predetermined time intervals numerically each time.
  • Further, the display area IT1 may display the current satisfaction level as text such as "high", "medium", or "low" based on the calculated satisfaction value, or may display an emoticon or pictogram corresponding to the satisfaction level.
  • the button BT1 is a button that turns on the display of the captured image of the user on the other party's terminal device 1.
  • the button BT2 is a button for turning off the display of the user's captured video on the other party's terminal device 1.
  • the button BT3 is a button that turns on the output of your own voice to the terminal device 1 of the other party.
  • the button BT4 is a button for turning off the output of your own voice to the terminal device 1 of the other party.
  • the button BT5 is a button for starting or ending satisfaction evaluation. Button BT5 may be omitted from screen MN1.
  • the button BT6 is a button for starting or ending a conference.
  • Screen MN2 is an example of a screen displayed on the terminal device 1 at a certain moment during the meeting.
  • Screen MN2 is a screen displayed on terminal device 1 when one minute has passed since screen MN1 was displayed on terminal device 1.
  • the display area IT3 is an area where the satisfaction results are displayed.
  • the display area IT3 displays a graph in which satisfaction values calculated at predetermined time intervals are plotted.
  • the display area IT3 displays a graph in which two satisfaction results are additionally plotted on the graph displayed in the display area IT1 as the conversation between person A and person B progresses for one minute. In this way, in the display area IT3, satisfaction results are additionally plotted in real time according to the elapsed time.
  • FIG. 13 is a diagram showing an example of a screen on which a message is displayed according to the satisfaction level result.
  • elements that overlap with those in FIG. 12 are given the same reference numerals to simplify or omit the description, and different contents will be described.
  • Screen MN3 is an example of a screen displayed on terminal device 1 at a certain moment during a meeting.
  • Screen MN3 is a screen that is displayed on terminal device 1 when one minute has elapsed since screen MN1 was displayed on terminal device 1.
  • the display area IT4 is an area where the satisfaction results are displayed.
  • the display area IT4 displays a graph in which satisfaction values calculated at predetermined time intervals are plotted.
  • the display area IT4 displays a graph in which two satisfaction results are additionally plotted on the graph displayed in the display area IT1 as the conversation between person A and person B progresses for one minute.
  • the message Mes is a message displayed according to the satisfaction level. For example, the message Mes is displayed according to person A's speaking rate.
  • when determining that the speech rate of person A is less than the speech rate of person B, the satisfaction estimation unit 24A outputs a signal indicating that the speech rate of person A is less than the speech rate of person B to the drawing screen creation unit 24B.
  • the satisfaction level estimation unit 24A when determining that the speech rate of person A is less than the second threshold, the satisfaction level estimation unit 24A outputs a signal indicating that the speech rate of person A is less than the second threshold to the drawing screen creation unit 24B.
  • Note that the determination as to whether the speech rate of person A is less than the speech rate of person B and the determination as to whether the speech rate of person A is less than the second threshold may be performed by the feature amount extraction unit 12A.
  • the drawing screen creation unit 24B creates a message for the person B to refrain from speaking based on the signal acquired from the satisfaction level estimation unit 24A, and causes the message to be displayed on the screen MN3.
  • the message to refrain from speaking is, for example, "Let's listen to what Person A has to say.” Note that the message to refrain from speaking is one example and is not limited to this.
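  • A minimal sketch of selecting this message, assuming the trigger is the speech-rate conditions described above; the function and its arguments are illustrative and the message text follows the example.

```python
def choose_message(rate_a, rate_b, second_threshold=0.4):
    """Return a prompt asking person B to refrain from speaking when person A's
    speech rate is below person B's or below the second threshold."""
    if rate_a < rate_b or rate_a < second_threshold:
        return "Let's listen to what Person A has to say."
    return None

print(choose_message(rate_a=0.3, rate_b=0.6))
```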
  • the terminal device 1 or the server 2 may calculate the satisfaction level of each of a plurality of people and calculate the average value of the satisfaction levels of all the people. In this way, the terminal device 1 or the server 2 may aggregate the satisfaction level and notify the user without identifying the individual.
  • the terminal device 1 may display something that attracts the viewer's attention, such as an avatar, on the screen that person B is viewing.
  • the avatar and the like may also be displayed on the screen of person A.
  • the evaluation system can improve the satisfaction level of person A by displaying an avatar on the screen to attract the attention of person A and person B to the screen.
  • the terminal device 1 may display a notification that the person A is currently thinking.
  • the terminal device 1 or the server 2 may calculate the degree of satisfaction without displaying the captured video of the person having the conversation on the display device 16 (that is, with the display of the captured video turned off).
  • the evaluation system according to this embodiment includes an acquisition unit (for example, the audio acquisition device 10) that acquires audio data related to a conversation between a first person and a second person, and an imaging unit (for example, the imaging device 11) that images the first person and the second person.
  • the evaluation system also includes an extraction unit (for example, the feature amount extraction unit 12A) that extracts, based on the imaging data of the imaging unit, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data.
  • the evaluation system further includes a satisfaction calculation unit (for example, the satisfaction level estimation unit 24A) that calculates the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
  • the evaluation system can calculate the satisfaction level based on two pieces of information: information related to the first person's line of sight or face direction, and information related to the first person's voice data. Thereby, the evaluation system can perform highly accurate satisfaction evaluation using a plurality of pieces of information included in conversations between people.
  • the satisfaction calculation unit of the evaluation system of this embodiment calculates the satisfaction at predetermined time intervals from the start to the end of the conversation. Thereby, the evaluation system can calculate the satisfaction level of the first person at each time from the start of the conversation until the end of the conversation, and can perform a flexible evaluation of the satisfaction level.
  • the extraction unit of the evaluation system of the present embodiment calculates, as second feature amounts, a first ratio indicating the proportion of the conversation in which the first person is speaking and a second ratio indicating the proportion in which the second person is speaking.
  • the calculation algorithm adds a predetermined value to the satisfaction value when the first ratio is greater than or equal to the second ratio, and subtracts a predetermined value from the satisfaction value when the first ratio is less than the second ratio.
  • the evaluation system can evaluate the degree of satisfaction according to the speech rate of the first person relative to the speech rate of the second person.
  • the calculation algorithm of the evaluation system of the present embodiment adds a predetermined value to the satisfaction value when the first ratio is equal to or higher than the second ratio and the first ratio is equal to or higher than the first threshold value.
  • the calculation algorithm subtracts a predetermined value from the satisfaction value when the first ratio is less than the second ratio and the first ratio is less than the second threshold, which is less than or equal to the first threshold.
  • the extraction unit of the evaluation system of the present embodiment detects the emotion of the first person from the voice data as the second feature amount, and calculates from the detected emotion a positive rate, which is the rate at which the first person is determined to feel positive, and a negative rate, which is the rate at which the first person is determined to feel negative.
  • the calculation algorithm adds a predetermined value to the satisfaction value when the positive rate is greater than or equal to a threshold for adding points, and subtracts a predetermined value from the satisfaction value when the negative rate is greater than or equal to a threshold for subtracting points.
  • the evaluation system can evaluate the degree of satisfaction based on the emotion detected from the voice data of the first person.
  • the evaluation system of this embodiment further includes a first display section (for example, display device 16) on which a second person is displayed when the first person has a conversation.
  • the extraction unit calculates a time period during which the first person looks at the first display unit as the first feature amount.
  • the calculation algorithm adds a predetermined value to the satisfaction value when the time is equal to or greater than a third threshold, and subtracts a predetermined value from the satisfaction value when the time is less than a fourth threshold that is equal to or less than the third threshold.
  • the evaluation system can evaluate the degree of satisfaction based on the time the first person looks at the first display section.
  • the calculation algorithm of the evaluation system of this embodiment calculates the degree of satisfaction based on machine learning.
  • the evaluation system can calculate the degree of satisfaction from the feature amount data using a calculation algorithm based on machine learning.
  • the evaluation system of the present embodiment further includes a second display unit (for example, the display device 16) that displays a screen that the second person refers to when having a conversation, and a screen creation unit (for example, the drawing screen creation unit 24B) that creates the screen.
  • the screen creation unit creates a screen including the satisfaction result calculated by the satisfaction calculation unit and causes the second display unit to display the screen. This allows the second person, who is the evaluator, to confirm the satisfaction level results of the first person. Thereby, the evaluation system can support the first person to have a conversation with a high level of satisfaction by notifying the second person of the result of the first person's satisfaction level.
  • the evaluation system further includes a second display unit that displays a screen that the second person refers to when having a conversation, and a screen creation unit that creates the screen.
  • the screen creation unit acquires, from the satisfaction level calculation unit, the satisfaction level calculated as the conversation between the first person and the second person progresses, and displays the satisfaction level on the second display unit.
  • This allows the second person to check the first person's current satisfaction level while conversing with the first person.
  • the evaluation system can support the first person to have a conversation so that the first person has a high degree of satisfaction.
  • the evaluation system further includes a second display unit that displays a screen that the second person refers to when having a conversation, and a screen creation unit that creates the screen.
  • the screen creation unit displays on the screen a message to the effect that the second person should refrain from speaking.
  • the evaluation system can display a message that helps increase the satisfaction level of the first person based on the speech rates of the first person and the second person.
  • the screen created by the screen creation unit of the evaluation system includes a display area in which the captured video of the first person is displayed, a display area in which the satisfaction result is displayed, a button for displaying the captured video of the second person on the screen referenced by the first person, a button for outputting the audio data of the second person from the terminal device used by the first person, and a button for controlling the start or end of the conference.
  • the evaluation system can display a screen including the satisfaction result to the second person.
  • the evaluation system can support the second person to have a smooth conversation with the first person by notifying the second person of the satisfaction level of the first person.
  • the extraction unit of the evaluation system extracts the first feature amount related to the line of sight or face direction of each of the first person and the second person and the second feature amount of the audio data, based on audio data acquired in advance by the acquisition unit and imaging data of the first person and the second person captured in advance by the imaging unit. Thereby, the evaluation system can extract the feature amounts from previously captured imaging data and previously recorded audio data and evaluate the satisfaction level of the first person.
  • the extraction unit of the evaluation system extracts a third feature amount related to the facial expressions of each of the first person and the second person based on the imaging data of the imaging unit.
  • the satisfaction level calculation unit calculates the satisfaction level of the first person based on the third feature amount and the calculation algorithm. Thereby, the evaluation system can evaluate the degree of satisfaction from the feature amount based on the facial expression of the first person.
  • the extraction unit of the evaluation system extracts a fourth feature amount related to each of the actions of the first person and the second person based on the imaging data of the imaging unit.
  • the satisfaction level calculation unit calculates the satisfaction level of the first person based on the fourth feature amount and the calculation algorithm. Thereby, the evaluation system can evaluate the degree of satisfaction from the feature amount based on the behavior of the first person.
  • the second feature amount used in the evaluation system according to the present embodiment is at least one of the voice intensity, the number of moras per unit time, the intensity of each word, the volume, or the voice spectrum. Thereby, the evaluation system can calculate the first person's emotion from the second feature amount.
  • the second person in this embodiment has an interpersonal relationship with the first person, and the interpersonal relationship includes at least one of the following: between a boss and a subordinate, between an employee and a customer, between colleagues, or between an interviewer and an interviewee.
  • the evaluation system can evaluate the satisfaction level of the first person in a situation where the second person has a conversation with the first person with whom he or she has an interpersonal relationship.
  • the evaluation system further includes a calculation algorithm storage unit (for example, the memory 14 or the memory 22) that stores the calculation algorithm.
  • the evaluation system can evaluate the satisfaction level of the first person based on the calculation algorithm stored in the calculation algorithm storage unit.
  • the technology of the present disclosure is useful as an evaluation system, an evaluation device, and an evaluation method that perform highly accurate satisfaction evaluation using multiple pieces of information included in conversations between people.

Abstract

This evaluation system comprises: an acquisition unit for acquiring speech data pertaining to a conversation between a first person and a second person; an imaging unit which images the first person and the second person; an extraction unit which extracts a first feature amount pertaining to the line of sight or the orientation of the face of each of the first person and the second person on the basis of the imaging data from the imaging unit, and a second feature amount of the speech data; and a satisfaction level calculation unit which calculates the satisfaction level of the first person on the basis of the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.

Description

Evaluation system, evaluation device, and evaluation method
The present disclosure relates to an evaluation system, an evaluation device, and an evaluation method.
Patent Document 1 discloses a store management system that calculates the employee satisfaction level of a store clerk based on a conversation between the store clerk and a conversation partner. The store management system stores a calculation algorithm for calculating employee satisfaction for each type of person who can be a conversation partner. The store management system acquires a conversation between the store clerk and the conversation partner, recognizes the store clerk's emotion based on the store clerk's voice included in the conversation, and determines the type of the conversation partner. The store management system then calculates the employee satisfaction level based on the recognition result of the store clerk's emotion and the calculation algorithm corresponding to the determined type of conversation partner.
Japanese Patent Application Publication No. 2011-237957
In recent years, there has been a demand to measure and calculate a person's satisfaction level from the viewpoint of objectively understanding the person's psychological state. For example, there has been a need to use a technique for calculating a person's satisfaction level in order to maintain or improve employee motivation in the workplace. Note that situations in which such a technique is required are not limited to the above-mentioned example; it may also be used, for example, to improve services based on customer satisfaction.
In Patent Document 1, emotion information is calculated from voice data based on a conversation between people (for example, a store clerk and a customer), and the store clerk's satisfaction level (that is, the employee satisfaction level) is calculated from the calculated emotion information. However, evaluating a person's satisfaction level based only on the voice data included in a conversation may not be sufficiently accurate, and a more accurate evaluation of satisfaction has been required.
The present disclosure has been devised in view of the conventional situation described above, and aims to perform a highly accurate evaluation of satisfaction using multiple pieces of information included in a conversation between people.
The present disclosure provides an evaluation system including: an acquisition unit that acquires audio data related to a conversation between a first person and a second person; an imaging unit that images the first person and the second person; an extraction unit that extracts, based on imaging data of the imaging unit, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data; and a satisfaction level calculation unit that calculates the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
The present disclosure also provides an evaluation device including: an extraction unit that acquires audio data related to a conversation between a first person and a second person and imaging data obtained by imaging the first person and the second person, and extracts, based on the imaging data, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data; and a satisfaction level calculation unit that calculates the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
The present disclosure also provides an evaluation method including: acquiring audio data related to a conversation between a first person and a second person; imaging the first person and the second person; extracting, based on the imaging data, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data; and calculating the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
Note that these comprehensive or specific aspects may be realized by a system, a device, a method, an integrated circuit, a computer program, or a recording medium, or by any combination of a system, a device, a method, an integrated circuit, a computer program, and a recording medium.
According to the present disclosure, it is possible to perform a highly accurate evaluation of satisfaction using multiple pieces of information included in a conversation between people.
Diagram showing an overview of this embodiment
Diagram showing an example of feature amounts
Block diagram showing an example of the internal configuration of a terminal device and a server according to Embodiment 1
Diagram showing a method of calculating the satisfaction level with a logic-based algorithm
Diagram showing an example in which the satisfaction level is calculated at predetermined time intervals
Sequence diagram of the satisfaction evaluation process according to Embodiment 1
Diagram showing an example of the internal configuration of a terminal device and a server according to Embodiment 2
Sequence diagram of the satisfaction evaluation process according to Embodiment 2
Diagram showing an example of the internal configuration of a terminal device according to Embodiment 3
Flowchart showing the process of calculating the satisfaction level on a terminal device
Sequence diagram showing the process in which the server calculates the satisfaction level from previously captured image and audio data
Diagram showing an example of a screen displayed on a terminal device
Diagram showing an example of a screen on which a message is displayed according to the satisfaction result
Hereinafter, embodiments that specifically disclose an evaluation system, an evaluation device, and an evaluation method according to the present disclosure will be described in detail with reference to the drawings as appropriate. However, more detailed description than necessary may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially the same configurations may be omitted. This is to avoid making the following description unnecessarily redundant and to facilitate understanding by those skilled in the art. The accompanying drawings and the following description are provided to enable those skilled in the art to fully understand the present disclosure, and are not intended to limit the subject matter described in the claims.
<Summary of the invention>
First, an overview of this embodiment will be described with reference to FIG. 1. FIG. 1 is a diagram showing an overview of this embodiment.
FIG. 1 shows a case where person A is having a conversation with person B using a terminal device 1 connected via a network to the terminal device used by person B. Person A and person B have an interpersonal relationship; for example, person A is person B's subordinate and person B is person A's superior. Note that the relationship between person A and person B is not limited to that between a boss and a subordinate, and may be between an employee and a customer, between colleagues, between an interviewer and an interviewee, or another relationship (for example, between a teacher and a student). For example, a case is assumed in which person B, who is a boss, interviews person A, who is a subordinate, online. In the following description, person A may be read as the first person, and person B may be read as the second person.
The audio acquisition device 10 is, for example, a microphone, and picks up person A's utterance CO "○○××." The audio acquisition device 10 may be built into the terminal device 1 or may be an external device communicably connected to the terminal device 1. Hereinafter, the data picked up by the audio acquisition device 10 is referred to as audio data.
The imaging device 11 is, for example, a camera, and images person A. The imaging device 11 may be built into the terminal device 1 or may be an external device communicably connected to the terminal device 1. Hereinafter, the data of person A captured by the imaging device 11 is referred to as imaging data.
The terminal device 1 transmits the audio data acquired by the audio acquisition device 10 and the imaging data acquired by the imaging device 11 to a device that extracts feature amounts, for example, a server. Note that the terminal device 1 may extract the feature amounts itself without transmitting the audio data and the imaging data to the server.
The feature amounts extracted from the imaging data and the audio data are, for example, facial expression, line of sight, speech, or behavior. Note that the feature amounts extracted from the imaging data and the audio data are not limited to these. Information on facial expression or line of sight is extracted from the image FR1 representing the face of person A in the imaging data. Information on behavior is extracted from the image FR2 representing the upper body of person A in the imaging data. Information on speech is extracted from the audio data. The information related to facial expression, line of sight, speech, or behavior is extracted by the terminal device 1 or the server 2 (see FIG. 7), as will be described in detail later.
The satisfaction level is calculated using the extracted feature amount data (hereinafter referred to as feature amount data) and an algorithm for estimating the satisfaction level (hereinafter referred to as a satisfaction estimation algorithm). The satisfaction level is an index representing the degree of satisfaction of person A with the conversation with person B, estimated by the satisfaction estimation algorithm based on the feature amount data. Satisfaction estimation algorithms include algorithms based on predetermined logic (hereinafter referred to as logic-based algorithms) and algorithms based on machine learning (hereinafter referred to as machine-learning-based algorithms).
A logic-based algorithm is an algorithm that defines a procedure for calculating the satisfaction level by repeatedly adding and deducting points based on predetermined logic.
A machine-learning-based algorithm is, for example, an algorithm that uses deep learning based on a multilayer perceptron, a random forest, or a convolutional neural network and directly outputs the satisfaction level from the feature amount data.
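The disclosure does not fix a specific model or feature layout for the machine-learning-based algorithm. The following is a minimal sketch, assuming a scikit-learn random forest regressor and a hypothetical five-element feature vector, of how such an algorithm could output a satisfaction level directly from feature amount data.

```python
# Illustrative sketch only: the model choice, feature layout, and training data
# below are assumptions, not part of the disclosure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature vector per evaluation window:
# [speech ratio of person A, speech ratio of person B, positive rate,
#  negative rate, fraction of time looking at the display]
X_train = np.array([
    [0.6, 0.4, 0.5, 0.1, 0.9],
    [0.3, 0.7, 0.1, 0.6, 0.4],
    [0.5, 0.5, 0.3, 0.2, 0.8],
])
y_train = np.array([4.5, 1.0, 3.5])  # satisfaction labels on a 0-5 scale

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

new_window = np.array([[0.55, 0.45, 0.4, 0.1, 0.85]])
print(model.predict(new_window))  # estimated satisfaction for the new window
```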
In this way, when it is difficult to read the other party's psychological state, such as in an online interview, person B can check the satisfaction level calculated from the audio data and the imaging data and thereby communicate smoothly so that person A is highly satisfied. Note that the conversation between person A and person B is not limited to an online interview or the like, and may be a face-to-face conversation.
Next, an example of the feature amounts will be described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the feature amounts.
The feature amounts extracted from "facial expression" are, for example, a smiling face, a straight face, or a crying face.
The feature amounts extracted from "line of sight" are, for example, the angle of the line of sight or the angle of the face direction.
The feature amounts extracted from "speech" are, for example, the speech time or the emotion.
The feature amounts extracted from "movement" are, for example, nodding, remaining still, or tilting the head.
Note that the feature amounts extracted from the above-mentioned "facial expression", "line of sight", "speech", and "movement" are merely examples and are not limited to these.
<Embodiment 1>
In the evaluation system 100 according to Embodiment 1, when users (for example, person A and person B) are having an online conversation, the feature amount data of each user is extracted on the terminal device, the extracted feature amount data is transmitted to the server, and the satisfaction level is calculated on the server.
An example of the internal configuration of each of the terminal device and the server according to Embodiment 1 will be described with reference to FIG. 3. FIG. 3 is a block diagram showing an example of the internal configuration of each of the terminal device and the server according to Embodiment 1.
The evaluation system 100A includes at least a terminal device 1A and a server 2A. The number of terminal devices is not limited to one and may be two or more. The terminal device 1A is an example of a terminal used by a user. Note that when the evaluation system, the terminal devices, and the servers are distinguished from one another, a letter is appended after the reference numeral; when they are not distinguished, only the numeral is used.
The terminal device 1A and the server 2A are communicably connected via a network NW. Note that the terminal device 1A and the server 2A may be communicably connected via a wired LAN (Local Area Network). The terminal device 1A and the server 2A may also communicate wirelessly (for example, over a wireless LAN such as Wi-Fi (registered trademark)) without going through the network NW.
The terminal device 1A includes at least a communication I/F 13, a memory 14, an input device 15, a display device 16, an I/F 17, the audio acquisition device 10, the imaging device 11, and a processor 12. The terminal device 1A is, for example, a PC (Personal Computer), a tablet, a mobile terminal, or a housing equipped with the audio acquisition device 10 and the imaging device 11.
The communication I/F 13 is a network interface circuit that communicates wirelessly or by wire with the network NW. Here, I/F stands for interface. The terminal device 1A is communicably connected to the server 2A via the communication I/F 13 and the network NW. The communication I/F 13 transmits the feature amount data extracted by the feature amount extraction unit 12A (described later) to the server 2A. Communication methods used by the communication I/F 13 include, for example, WAN (Wide Area Network), LAN (Local Area Network), mobile communication such as LTE (Long Term Evolution) and 5G, power line communication, short-range wireless communication (for example, Bluetooth (registered trademark) communication), and communication for mobile phones.
The memory 14 includes, for example, a RAM (Random Access Memory) serving as a work memory used when each process of the processor 12 is executed, and a ROM (Read Only Memory) that stores programs and data defining the operation of the processor 12. Data or information generated or acquired by the processor 12 is temporarily stored in the RAM. A program that defines the operation of the processor 12 is written in the ROM.
The input device 15 receives input from a user (for example, person A or person B). The input device 15 is, for example, a touch panel display or a keyboard. The input device 15 receives operations in response to instructions displayed on the display device 16.
The display device 16 displays a screen (described later) created by the drawing screen creation unit 24B of the server 2. The display device 16 is, for example, a display or the monitor of a notebook PC.
The I/F 17 is a software interface. The I/F 17 is communicably connected to the communication I/F 13, the memory 14, the input device 15, the display device 16, the audio acquisition device 10, the imaging device 11, and the processor 12, and exchanges data with each device. Note that the I/F 17 may be omitted from the terminal device 1A, and the devices of the terminal device 1A may exchange data with one another directly.
The audio acquisition device 10 picks up the utterances of a user (for example, person A or person B). The audio acquisition device 10 is configured with a microphone device capable of picking up the sound generated by the user's utterances (that is, detecting an audio signal). The audio acquisition device 10 picks up the sound generated by the user's utterances, converts it into an electrical signal as audio data, and outputs it to the I/F 17.
The imaging device 11 is a camera that images a user (for example, person A or person B). The imaging device 11 includes at least a lens (not shown) as an optical element and an image sensor (not shown). The lens receives light reflected by an object within the angle of view of the area imaged by the imaging device 11 and forms an optical image of the object on the light receiving surface (in other words, the imaging surface) of the image sensor. The image sensor is a solid-state imaging element such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor). The image sensor converts the optical image formed on the imaging surface through the lens into an electrical signal and sends it to the I/F 17 at predetermined time intervals (for example, every 1/30 second).
Note that the audio acquisition device 10 and the imaging device 11 may be external devices communicably connected to the terminal device 1A.
The processor 12 is a semiconductor chip on which at least one of electronic devices such as a CPU (Central Processing Unit), a DSP (Digital Signal Processor), a GPU (Graphics Processing Unit), and an FPGA (Field Programmable Gate Array) is mounted. The processor 12 functions as a controller that governs the overall operation of the terminal device 1A, and performs control processing for supervising the operation of each unit of the terminal device 1A, data input/output processing with the I/F 17, data arithmetic processing, and data storage processing. The processor 12 realizes the function of the feature amount extraction unit 12A. The processor 12 uses the RAM of the memory 14 during operation and temporarily stores data generated or acquired by the processor 12 in the RAM of the memory 14.
The feature amount extraction unit 12A extracts the feature amounts (see FIG. 2) based on the audio data acquired from the audio acquisition device 10 and the imaging data acquired from the imaging device 11. The feature amount extraction unit 12A may extract each feature amount from the audio data and the imaging data using, for example, trained model data for AI (Artificial Intelligence) processing stored in the memory 14 (in other words, based on AI).
The feature amount extraction unit 12A detects the face of person A from the imaging data and also detects the direction of both eyes (that is, the left eye and the right eye) of the detected face (in other words, the line of sight). The feature amount extraction unit 12A detects the line of sight of person A, who is viewing the screen displayed on the display device 16 (for example, the captured video of person B). The line-of-sight detection method can be realized by a known technique; for example, the line of sight may be detected based on the difference in the direction of both eyes appearing in each of a plurality of captured images (frames), or another detection method may be used.
The feature amount extraction unit 12A detects the face of person A from the imaging data and also detects the direction of the face. The direction of the face is the angle of the face with respect to a specific location on the display device 16 (for example, the center position of the panel of the display device 16). In other words, the angle of the face is a vector representation composed of an azimuth angle and an elevation angle indicating the three-dimensional direction, as seen from that specific location on the display device 16, in which the face of person A looking at that location exists. Note that the specific location is not limited to the center position of the panel. The face direction detection method can be realized by a known technique.
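As a minimal sketch of the azimuth/elevation representation described above, the following computes the two angles from an assumed 3-D position of the face relative to the panel center; the coordinate convention, values, and function name are illustrative assumptions and not part of the disclosure.

```python
# Sketch only: coordinate convention (x right, y up, z toward the person) is assumed.
import numpy as np

def face_direction_angles(panel_center, face_position):
    """Return (azimuth, elevation) in degrees of the face as seen from the panel center."""
    v = np.asarray(face_position, dtype=float) - np.asarray(panel_center, dtype=float)
    azimuth = np.degrees(np.arctan2(v[0], v[2]))                     # left/right angle
    elevation = np.degrees(np.arctan2(v[1], np.hypot(v[0], v[2])))   # up/down angle
    return azimuth, elevation

# Example: face 0.1 m to the right, 0.05 m above, 0.6 m in front of the panel center
print(face_direction_angles((0.0, 0.0, 0.0), (0.1, 0.05, 0.6)))
```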
The feature amount extraction unit 12A detects the face of person A from the imaging data and also detects the facial expression of person A. The facial expression detection method can be realized by a known technique.
The feature amount extraction unit 12A detects the movement of person A from the imaging data. The movement detection method can be realized by a known technique.
The feature amount extraction unit 12A detects the speech time of person A from the audio data. The speech time may be detected, for example, by accumulating the time of the portions of the audio data in which person A's voice signal is detected. Note that the speech time detection method may be realized by another known technique. The feature amount extraction unit 12A also calculates the proportion of time during which each of person A and person B is speaking, based on the detected speech times.
The feature amount extraction unit 12A detects the emotion of person A from the audio data. The feature amount extraction unit 12A detects the emotion by detecting, for example, the voice intensity, the number of moras per unit time, the intensity of each word, the volume, or the voice spectrum from the audio data. Note that the emotion detection method is not limited to this and may be realized by another known technique.
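As a rough illustration, the following sketch computes two of the acoustic quantities listed above (volume and magnitude spectrum) over short windows with NumPy. Mora counting and the actual emotion decision would require dedicated speech models and are not shown; all names and the 0.5-second frame length are assumptions.

```python
# Sketch only: simple per-frame acoustic quantities, not an emotion detector.
import numpy as np

def frame_features(samples, sample_rate, frame_len=0.5):
    n = int(frame_len * sample_rate)
    feats = []
    for start in range(0, len(samples) - n + 1, n):
        frame = samples[start:start + n]
        rms = np.sqrt(np.mean(frame ** 2))      # volume (RMS intensity)
        spectrum = np.abs(np.fft.rfft(frame))   # magnitude spectrum of the frame
        feats.append((rms, spectrum))
    return feats

# Example with a synthetic 1-second, 16 kHz signal
sr = 16000
t = np.arange(sr) / sr
signal = 0.1 * np.sin(2 * np.pi * 220 * t)
print(len(frame_features(signal, sr)))  # number of 0.5 s frames analysed
```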
The server 2A includes a communication I/F 21, a memory 22, an input device 23, an I/F 26, and a processor 24.
The communication I/F 21 transmits and receives data to and from each of the one or more terminal devices 1 via the network NW. The communication I/F 21 transmits, to the terminal device 1A, the data of the screen output from the I/F 26 to be displayed on the display device 16.
The memory 22 includes, for example, a RAM serving as a work memory used when the processor 24 executes each process, and a ROM that stores programs and data defining the operation of the processor 24. Data or information generated or acquired by the processor 24 is temporarily stored in the RAM. A program that defines the operation of the processor 24 is written in the ROM. The memory 22 also stores the satisfaction estimation algorithm.
The input device 23 receives input from a user (for example, an administrator of the evaluation system 100). The input device 23 is, for example, a touch panel display or a keyboard. For example, the input device 23 receives settings of the thresholds (described later) of the logic-based algorithm.
The I/F 26 is a software interface. The I/F 26 is communicably connected to the communication I/F 21, the memory 22, the input device 23, and the processor 24, and exchanges data with each device. Note that the I/F 26 may be omitted from the server 2A, and the devices of the server 2A may exchange data with one another directly.
The processor 24 is a semiconductor chip on which at least one of electronic devices such as a CPU, a DSP, a GPU, and an FPGA is mounted. The processor 24 functions as a controller that governs the overall operation of the server 2A, and performs control processing for supervising the operation of each unit of the server 2A, data input/output processing with the I/F 26, data arithmetic processing, and data storage processing. The processor 24 realizes the functions of the satisfaction level estimation unit 24A and the drawing screen creation unit 24B. The processor 24 uses the RAM of the memory 22 during operation and temporarily stores data generated or acquired by the processor 24 in the RAM of the memory 22.
The satisfaction level estimation unit 24A calculates the satisfaction level of person A using the feature amount data acquired from the terminal device 1A and the satisfaction estimation algorithm recorded in the memory 22. The satisfaction level estimation unit 24A may calculate the satisfaction level using a logic-based algorithm or using a machine-learning-based algorithm. The satisfaction level estimation unit 24A outputs information on the calculated satisfaction level to the drawing screen creation unit 24B.
The drawing screen creation unit 24B creates a screen to be displayed on the display device 16 of the terminal device 1A using the satisfaction level acquired from the satisfaction level estimation unit 24A. The screen includes, for example, the captured video of person A, information on the satisfaction level, and a button for controlling the start of the satisfaction evaluation. Note that the items included on the screen are not limited to these. The information on the satisfaction level may be displayed, for example, by plotting the satisfaction value calculated at predetermined time intervals as a numerical value or on a graph each time, or by displaying the satisfaction value during or at the end of the meeting. The graph of the satisfaction value is, for example, a graph in which the values are plotted, a bar graph, or a meter. The drawing screen creation unit 24B outputs the created screen to the I/F 26.
Next, a method of calculating the satisfaction level with the logic-based algorithm will be described with reference to FIG. 4. FIG. 4 is a diagram showing a method of calculating the satisfaction level with the logic-based algorithm.
In the logic-based algorithm, the satisfaction level is calculated by adding and deducting points according to predetermined rules (hereinafter referred to as determination methods). The feature amounts used in a determination method are referred to as determination elements. Points are added and deducted over, for example, a predetermined time interval, the entire duration of the conversation, the period from the start of the conversation to the current time, or the last 30% of the conversation time. Note that the time range over which points are added and deducted is not limited to these and may be set arbitrarily by the user.
First, the determination method when the determination element is the "speech rate" will be described. The speech rate represents the proportion of time during which a user (for example, person A or person B) speaks within a predetermined period. For example, if the user speaks for a total of 1.0 second out of 2.5 seconds, the speech rate is 1.0/2.5, that is, 0.4 (40%). The speech rate can be calculated, for example, by extracting the user's speech times in a specific time interval and dividing the total of the extracted speech times by the length of that interval. Note that this method of calculating the speech rate is merely an example and is not limited to this. The speech rate may be calculated by the feature amount extraction unit 12A, or by the satisfaction level estimation unit 24A based on the feature amount data acquired from the feature amount extraction unit 12A.
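The speech-rate calculation described above can be expressed, for example, as follows (the function name is illustrative); the example reproduces the case of 1.0 s of speech within a 2.5 s window, giving 0.4.

```python
# Sketch of the speech-rate calculation; names are illustrative.
def speech_rate(speech_durations, window_seconds):
    """Fraction of the window during which the person was speaking."""
    return sum(speech_durations) / window_seconds

# Example from the text: 1.0 s of speech within a 2.5 s window -> 0.4 (40%)
print(speech_rate([0.4, 0.6], 2.5))  # 0.4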
When the speech rate of person A is equal to or higher than the speech rate of person B, the satisfaction estimation algorithm adds 0.5 to the satisfaction level. Note that the values added and deducted below are merely examples; they are not limited to 0.5 and may be any predetermined values. When the speech rate of person A is lower than the speech rate of person B, the satisfaction estimation algorithm deducts 0.5 from the satisfaction level.
As a determination method, points may also be added or deducted in consideration not only of person A's speech rate relative to person B's speech rate, but also of whether person A's speech rate is equal to or higher than a preset threshold. That is, when the speech rate of person A is equal to or higher than the speech rate of person B and the speech rate of person A is equal to or higher than a first threshold, the satisfaction estimation algorithm adds 0.5 to the satisfaction level. When the speech rate of person A is lower than the speech rate of person B and the speech rate of person A is lower than a second threshold that is equal to or less than the first threshold, the satisfaction estimation algorithm deducts 0.5 from the satisfaction level. The first threshold is, for example, 50%, and the second threshold is, for example, 40%. Note that the values of the first threshold and the second threshold are merely examples and may be changed as appropriate by the user (for example, person B).
When the speech rate of person A is equal to or higher than the speech rate of person B but lower than the first threshold, points may be deducted, or no points may be added or deducted.
When the speech rate of person A is lower than the speech rate of person B but equal to or higher than the second threshold, points may be added, or no points may be added or deducted.
When the speech rate of person A is equal to or higher than the first threshold, points may be added regardless of the speech rate of person B.
When the speech rate of person A is lower than the second threshold, points may be deducted regardless of the speech rate of person B.
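A minimal sketch of the speech-rate determination described above is shown below, assuming the example step of 0.5 and thresholds of 50% and 40%; for the borderline cases for which the text allows several options, this sketch simply leaves the score unchanged.

```python
# Sketch of one add/deduct step of the logic-based algorithm for the speech-rate element.
def update_by_speech_rate(score, rate_a, rate_b,
                          first_threshold=0.5, second_threshold=0.4, step=0.5):
    """Add when person A speaks at least as much as person B and above the first
    threshold; deduct when person A speaks less than person B and below the second
    threshold; otherwise leave the score unchanged (one of the options described)."""
    if rate_a >= rate_b and rate_a >= first_threshold:
        return score + step
    if rate_a < rate_b and rate_a < second_threshold:
        return score - step
    return score

print(update_by_speech_rate(3.0, 0.55, 0.45))  # 3.5
print(update_by_speech_rate(3.0, 0.30, 0.70))  # 2.5
```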
Next, the determination method when the determination element is "emotion, facial expression, or movement" will be described. The emotion is an index calculated from the audio data of the conversation. When the determination element is "emotion, facial expression, or movement", a positive rate, a neutral rate, and a negative rate (described later) are calculated based on the emotion, facial expression, or movement. The satisfaction estimation algorithm adds and deducts points using the positive rate and the negative rate.
The positive rate indicates the proportion of time within a predetermined period during which the emotion of the user (for example, person A) is determined to be positive. Feature amounts determined to be positive are, for example, person A's voice becoming louder, the pitch of person A's voice becoming higher, person A nodding, or person A smiling. Note that the feature amounts determined to be positive are merely examples and are not limited to these.
The neutral rate indicates the proportion of time within a predetermined period during which the emotion of the user (for example, person A) is determined to be neutral. A neutral state is a state in which person A's emotion is presumed to be neither positive nor negative, for example, a state in which person A appears calm. Feature amounts determined to be neutral are, for example, person A having a straight face or person A remaining still. Note that the feature amounts determined to be neutral are merely examples and are not limited to these.
The negative rate indicates the proportion of time within a predetermined period during which the emotion of the user (for example, person A) is determined to be negative. Feature amounts determined to be negative are, for example, person A having a crying face, person A's voice becoming quieter, the pitch of person A's voice becoming lower, or person A tilting his or her head. Note that the feature amounts determined to be negative are merely examples and are not limited to these.
As an example of calculating the positive rate, the neutral rate, and the negative rate, consider a case in which the emotion is determined once every 0.5 seconds over a period of 2.5 seconds. Suppose that within the 2.5 seconds the evaluation system 100 makes, for example, two positive determinations, two negative determinations, and one neutral determination. In this case, the positive rate is (1+1)/5, that is, 0.4 (40%); the negative rate is (1+1)/5, that is, 0.4 (40%); and the neutral rate is 1/5, that is, 0.2 (20%).
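The rate calculation in this example can be expressed, for instance, as follows (the labels and function name are illustrative); it reproduces the 40%/20%/40% result above from five judgments made every 0.5 seconds.

```python
# Sketch of the positive/neutral/negative rate calculation; labels are illustrative.
from collections import Counter

def emotion_rates(judgments):
    counts = Counter(judgments)
    total = len(judgments)
    return {label: counts[label] / total for label in ("positive", "neutral", "negative")}

judgments = ["positive", "negative", "positive", "negative", "neutral"]
print(emotion_rates(judgments))
# {'positive': 0.4, 'neutral': 0.2, 'negative': 0.4}
```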
Note that the positive rate, the neutral rate, and the negative rate may be calculated by the feature amount extraction unit 12A, or by the satisfaction level estimation unit 24A based on the feature amount data acquired from the feature amount extraction unit 12A.
When the positive rate of person A is equal to or higher than a threshold for adding points, the satisfaction estimation algorithm adds 0.5 to the satisfaction level. When the negative rate of person A is equal to or higher than a threshold for deducting points, the satisfaction estimation algorithm deducts 0.5 from the satisfaction level.
For example, the threshold for adding points is 50%. In this case, when the positive rate is 50% or more, the satisfaction estimation algorithm adds 0.5 to the satisfaction level. Note that the threshold for adding points is not limited to 50% and may be changed as appropriate by the user.
For example, the threshold for deducting points is 50%. In this case, when the negative rate is 50% or more, the satisfaction estimation algorithm deducts 0.5 from the satisfaction level. Note that the threshold for deducting points is not limited to 50% and may be changed as appropriate by the user.
Next, the determination method when the determination element is the "line of sight" will be described. When the determination element is the "line of sight", points are added or deducted based on the time during which the user (for example, person A) looks in the direction of the display (for example, the display device 16).
When the time during which person A looks in the direction of the display is equal to or longer than a third threshold, the satisfaction estimation algorithm adds 0.5 to the satisfaction level. When the time during which person A looks in the direction of the display is shorter than a fourth threshold that is equal to or less than the third threshold, 0.5 is deducted from the satisfaction level.
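A minimal sketch of the line-of-sight determination is shown below; the threshold values in the example call are hypothetical, since the disclosure does not give concrete values for the third and fourth thresholds.

```python
# Sketch of the gaze-time rule; thresholds in the example call are hypothetical.
def update_by_gaze_time(score, gaze_seconds, third_threshold, fourth_threshold, step=0.5):
    """Add when the person looked at the display long enough, deduct when clearly not."""
    if gaze_seconds >= third_threshold:
        return score + step
    if gaze_seconds < fourth_threshold:  # fourth_threshold <= third_threshold
        return score - step
    return score

print(update_by_gaze_time(3.0, gaze_seconds=25.0, third_threshold=20.0, fourth_threshold=10.0))  # 3.5
```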
 次に、図5を参照して、所定の時間間隔で満足度を算出した例を説明する。図5は、所定の時間間隔で満足度を算出した例を表す図である。 Next, an example in which satisfaction is calculated at predetermined time intervals will be described with reference to FIG. FIG. 5 is a diagram illustrating an example of calculating satisfaction levels at predetermined time intervals.
 ケースCAおよびケースCBで、評価開始時の満足度の値を3とし、満足度推定アルゴリズムによって満足度の加点および減点を繰り返す。なお、評価開始時の満足度の値は3に限られず任意の値でもよい。ケースCAおよびケースCBでは、満足度は0から5点の間の値を取るものとする。なお、満足度の取りうる値の範囲は0から5点に限られず他の範囲でもよいし、範囲は設定されなくてもよい。 In case CA and case CB, the satisfaction value at the start of evaluation is set to 3, and the satisfaction estimation algorithm repeatedly adds and subtracts satisfaction points. Note that the satisfaction value at the start of the evaluation is not limited to 3 and may be any value. In case CA and case CB, the satisfaction level is assumed to take a value between 0 and 5 points. Note that the range of values that the satisfaction level can take is not limited to 0 to 5 points, but may be in other ranges, and the range does not need to be set.
 ケースCAおよびケースCBのグラフは、30秒ごとに算出された満足度の値をプロットしたものである。ケースCAおよびケースCBのグラフの横軸は経過時間を表し、縦軸は満足度を表す。ケースCAおよびケースCBは、一例として、5分で会話が終了したケースとする。 The graphs for Case CA and Case CB are plots of satisfaction values calculated every 30 seconds. The horizontal axis of the graphs for case CA and case CB represents elapsed time, and the vertical axis represents satisfaction level. Case CA and case CB are, for example, cases in which the conversation ends in 5 minutes.
 ケースCAでは、満足度推定アルゴリズムによって30秒ごとに加点または減点が繰り返され、5分時間が経過した際に満足度が5点となっており、ユーザ(例えば、人物A)が高い満足度で会話を終了したことを表す。 In case CA, the satisfaction level estimation algorithm repeatedly adds or subtracts points every 30 seconds, and when 5 minutes have passed, the satisfaction level is 5 points, and the user (for example, person A) is highly satisfied. Indicates that the conversation has ended.
 ケースCBでは、満足度推定アルゴリズムによって30秒ごとに加点または減点が繰り返され、5分時間が経過した際に満足度が0点となっており、ユーザ(例えば、人物A)が低い満足度で会話を終了したことを表す。 In case CB, the satisfaction estimation algorithm repeatedly adds or subtracts points every 30 seconds; when 5 minutes have elapsed, the satisfaction level is 0 points, indicating that the user (for example, person A) ended the conversation with a low level of satisfaction.
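 A minimal sketch of the periodic scoring described for cases CA and CB is shown below (Python, for illustration). The starting value of 3, the 0 to 5 range, and the 30-second interval come from the description; the particular sequence of deltas in the example is invented.

    def run_evaluation(interval_deltas, initial_score=3.0, low=0.0, high=5.0):
        # Accumulates the satisfaction score over a conversation, clamping it to the allowed range.
        score = initial_score
        history = [score]
        for delta in interval_deltas:
            score = min(high, max(low, score + delta))
            history.append(score)
        return history

    # Ten 30-second intervals (a 5-minute conversation); mostly positive signals
    # drive the score toward 5, as in case CA.
    print(run_evaluation([0.5, 0.5, 0.5, 0.5, -0.5, 0.5, 0.5, 0.5, 0.5, 0.5]))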
 次に、図6を参照して、実施の形態1に係る満足度の評価の処理を説明する。図6は、実施の形態1に係る満足度の評価の処理のシーケンス図である。 Next, the satisfaction evaluation process according to the first embodiment will be described with reference to FIG. 6. FIG. 6 is a sequence diagram of satisfaction evaluation processing according to the first embodiment.
 2つの端末装置(端末装置1AA、端末装置1AB)とサーバ2Aとによって満足度の評価が行われる。例えば、評価システム100Aが、人物Aと人物Bとの会話から人物Aの満足度を算出するとする。被評価者である人物Aが端末装置1AAを使用し、人物Bが端末装置1ABを使用するとする。なお、端末装置の数は2つに限られず1つでもよいし2つ以上の複数でもよい。 The satisfaction level is evaluated by two terminal devices (terminal device 1AA, terminal device 1AB) and server 2A. For example, assume that the evaluation system 100A calculates the satisfaction level of person A from a conversation between person A and person B. Assume that person A, who is the person to be evaluated, uses terminal device 1AA, and person B uses terminal device 1AB. Note that the number of terminal devices is not limited to two, and may be one or two or more.
 端末装置1AAは、満足度推定アルゴリズムによる満足度の加点および減点に係る各閾値の値を設定する(St100)。なお、端末装置1AAでの閾値の設定は図6に係る処理から省略されてもよい。 The terminal device 1AA sets the values of each threshold value related to addition and deduction of satisfaction points by the satisfaction estimation algorithm (St100). Note that the setting of the threshold value in the terminal device 1AA may be omitted from the process related to FIG. 6.
 端末装置1AAは、人物Aの満足度の評価を開始する(St101)。満足度の評価の開始は、例えば、ユーザ(例えば人物B)により表示デバイス16に表示された評価を開始するボタンを押下する等によって実行される。 The terminal device 1AA starts evaluating the satisfaction level of person A (St101). The start of the satisfaction evaluation is executed, for example, by the user (for example, person B) pressing a button to start evaluation displayed on the display device 16.
 端末装置1AAは、人物Aの撮像データおよび音声データを取得する(St102)。 The terminal device 1AA acquires image data and audio data of person A (St102).
 端末装置1AAは、ステップSt102の処理で取得した撮像データおよび音声データに基づき特徴量を抽出する(St103)。 The terminal device 1AA extracts feature amounts based on the imaging data and audio data acquired in the process of step St102 (St103).
 端末装置1ABは、満足度推定アルゴリズムによる満足度の加点および減点に関する各閾値の値を設定する(St104)。なお、閾値の設定は、人物Bによって任意に設定されてもよいしメモリ14に予め保存されている設定値に基づき自動で行われてもよい。また、閾値の設定は、端末装置1ABではなくサーバ2Aで行われてもよい。 The terminal device 1AB sets the values of each threshold regarding addition and deduction of satisfaction points by the satisfaction estimation algorithm (St104). Note that the threshold value may be set arbitrarily by the person B, or may be automatically set based on a set value stored in the memory 14 in advance. Further, the setting of the threshold value may be performed not in the terminal device 1AB but in the server 2A.
 端末装置1ABは、人物Aの満足度の評価を開始する(St105)。 The terminal device 1AB starts evaluating the satisfaction level of person A (St105).
 端末装置1ABは、人物Bの撮像データおよび音声データを取得する(St106)。 The terminal device 1AB acquires image data and audio data of person B (St106).
 端末装置1ABは、ステップSt106の処理で取得した撮像データおよび音声データに基づき特徴量を抽出する(St107)。 The terminal device 1AB extracts feature amounts based on the imaging data and audio data acquired in the process of step St106 (St107).
 端末装置1AAは、ステップSt100の処理で設定した閾値の設定値と、ステップSt103の処理で抽出した特徴量と、をサーバ2Aに送信する。端末装置1ABは、ステップSt104の処理で設定した閾値の設定値と、ステップSt107の処理で抽出した特徴量と、をサーバ2Aに送信する(St108)。 The terminal device 1AA transmits the threshold value set in the process of step St100 and the feature amount extracted in the process of step St103 to the server 2A. The terminal device 1AB transmits the threshold setting value set in the process of step St104 and the feature amount extracted in the process of step St107 to the server 2A (St108).
 サーバ2Aは、ステップSt108の処理で取得した閾値の設定値と特徴量と満足度推定アルゴリズムとに基づき満足度を算出する(St109)。 The server 2A calculates the satisfaction level based on the threshold setting value, the feature amount, and the satisfaction level estimation algorithm obtained in the process of step St108 (St109).
 端末装置1AAは、サーバ2Aに満足度の結果の送信を要求する(St110)。なお、ステップSt110の処理は、図6に係る処理から省略されてもよい。 The terminal device 1AA requests the server 2A to transmit the satisfaction results (St110). Note that the process of step St110 may be omitted from the process related to FIG. 6.
 端末装置1ABは、サーバ2Aに満足度の結果の送信を要求する(St111)。 The terminal device 1AB requests the server 2A to send the satisfaction results (St111).
 サーバ2Aは、満足度の結果に係る画面を描画する。サーバ2Aは、満足度の結果を描画した画面を端末装置1ABに送信する(St112)。サーバ2Aは、満足度の結果を描画した画面を端末装置1AAに送信する(St113)。ステップSt113の処理は、図6に係る処理から省略されてもよい。 The server 2A draws a screen related to the satisfaction level results. The server 2A transmits a screen on which the satisfaction level results are drawn to the terminal device 1AB (St112). The server 2A transmits a screen on which the satisfaction level results are drawn to the terminal device 1AA (St113). The process of step St113 may be omitted from the process related to FIG. 6.
 端末装置1AAは、ステップSt113の処理で取得した画面を端末装置1AAのディスプレイに表示させる(St114)。ステップSt114の処理は、図6に係る処理から省略されてもよい。 The terminal device 1AA displays the screen acquired in the process of step St113 on the display of the terminal device 1AA (St114). The process of step St114 may be omitted from the process related to FIG. 6.
 端末装置1ABは、ステップSt112の処理で取得した画面を端末装置1ABのディスプレイに表示させる(St115)。 The terminal device 1AB displays the screen acquired in the process of step St112 on the display of the terminal device 1AB (St115).
 なお、ステップSt103およびステップSt107の特徴量を抽出する処理から、ステップSt114およびステップSt115の結果を描画する処理までの処理は、繰り返し実行されてもよい。 Note that the processes from the process of extracting feature amounts in steps St103 and St107 to the process of drawing the results of steps St114 and St115 may be repeatedly executed.
 サーバ2Aは、評価を終了させる旨の信号を端末装置1AAおよび端末装置1ABに送信する(St116)。 The server 2A transmits a signal to end the evaluation to the terminal device 1AA and the terminal device 1AB (St116).
 端末装置1AAは、ステップSt116の処理で取得した信号を基に満足度の評価を終了する(St117)。 The terminal device 1AA ends the satisfaction evaluation based on the signal acquired in the process of step St116 (St117).
 端末装置1ABは、ステップSt116の処理で取得した信号を基に満足度の評価を終了する(St118)。 The terminal device 1AB ends the satisfaction evaluation based on the signal acquired in the process of step St116 (St118).
 端末装置1AAは、満足度の最終結果の送信要求をサーバ2Aに送信する(St119)。ステップSt119の処理は、図6の処理から省略されてもよい。 The terminal device 1AA transmits a request to send the final satisfaction result to the server 2A (St119). The process of step St119 may be omitted from the process of FIG. 6.
 端末装置1ABは、満足度の最終結果の送信要求をサーバ2Aに送信する(St120)。 The terminal device 1AB transmits a request to transmit the final satisfaction result to the server 2A (St120).
 サーバ2Aは、ステップSt120の処理で取得した要求に基づき、満足度の最終結果に係る画面を描画する。サーバ2Aは、満足度の最終結果の画面を端末装置1ABに送信する(St121)。 The server 2A draws a screen related to the final satisfaction result based on the request obtained in the process of step St120. The server 2A transmits a screen showing the final result of the satisfaction level to the terminal device 1AB (St121).
 サーバ2Aは、ステップSt119の処理で取得した要求に基づき、満足度の最終結果を示す画面を描画する。サーバ2Aは、満足度の最終結果の画面を端末装置1AAに送信する(St122)。ステップSt122の処理は、図6の処理から省略されてもよい。 Based on the request obtained in the process of step St119, the server 2A draws a screen showing the final result of the satisfaction level. The server 2A transmits a screen showing the final result of the satisfaction level to the terminal device 1AA (St122). The process of step St122 may be omitted from the process of FIG. 6.
 端末装置1AAは、ステップSt122の処理で取得した画面を端末装置1AAのディスプレイに表示させる(St123)。ステップSt123の処理は、図6の処理から省略されてもよい。 The terminal device 1AA displays the screen acquired in the process of step St122 on the display of the terminal device 1AA (St123). The process of step St123 may be omitted from the process of FIG. 6.
 端末装置1ABは、ステップSt121の処理で取得した画面を端末装置1ABのディスプレイに表示させる(St124)。 The terminal device 1AB displays the screen acquired in the process of step St121 on the display of the terminal device 1AB (St124).
<実施の形態2> <Embodiment 2>
 実施の形態2に係る評価システムは、端末装置で取得した撮像データおよび音声データに基づき、サーバが特徴量の抽出から満足度の算出まで一括して実施する。以下、実施の形態1と同一の構成要素については同一の符号を用いることで、その説明を省略する。 In the evaluation system according to Embodiment 2, a server performs everything from extraction of feature amounts to calculation of satisfaction level all at once based on imaging data and audio data acquired by a terminal device. Hereinafter, the same reference numerals will be used for the same components as in Embodiment 1, and the description thereof will be omitted.
 図7を参照して、実施の形態2に係る端末装置とサーバとのそれぞれの内部構成例を説明する。図7は、実施の形態2に係る端末装置とサーバとのそれぞれの内部構成例を示した図である。図3の実施の形態1に係るハードウェアブロック図と異なる部分のみを説明する。 With reference to FIG. 7, an example of the internal configuration of the terminal device and the server according to the second embodiment will be described. FIG. 7 is a diagram showing an example of the internal configuration of a terminal device and a server according to the second embodiment. Only the parts that are different from the hardware block diagram according to the first embodiment shown in FIG. 3 will be explained.
 実施の形態2に係る評価システム100Bでは、特徴量抽出部12Aがサーバ2Bのプロセッサ24に組み込まれる。つまり、端末装置1Bは、通信I/F13、メモリ14、入力デバイス15、表示デバイス16、I/F17、音声取得デバイス10および撮像デバイス11を有する。サーバ2Bは、通信I/F21、メモリ22、入力デバイス23、I/F26およびプロセッサ24を有する。 In the evaluation system 100B according to the second embodiment, the feature extraction unit 12A is incorporated into the processor 24 of the server 2B. That is, the terminal device 1B includes a communication I/F 13, a memory 14, an input device 15, a display device 16, an I/F 17, an audio acquisition device 10, and an imaging device 11. The server 2B includes a communication I/F 21, a memory 22, an input device 23, an I/F 26, and a processor 24.
 プロセッサ24は、特徴量抽出部12A、満足度推定部24Aおよび描画画面作成部24Bの各部の機能を実現する。 The processor 24 realizes the functions of the feature amount extraction section 12A, the satisfaction estimation section 24A, and the drawing screen creation section 24B.
 特徴量抽出部12Aは、端末装置1Bから取得した音声データおよび撮像データを基に特徴量を抽出する。 The feature amount extraction unit 12A extracts feature amounts based on the audio data and image data acquired from the terminal device 1B.
 図8を参照して、実施の形態2に係る満足度の評価の処理を説明する。図8は、実施の形態2に係る満足度の評価の処理のシーケンス図である。実施の形態1の図6のシーケンス図と同様の処理は同一符号を付記し、異なる処理のみを説明する。 With reference to FIG. 8, satisfaction evaluation processing according to the second embodiment will be described. FIG. 8 is a sequence diagram of satisfaction evaluation processing according to the second embodiment. Processes similar to those in the sequence diagram of FIG. 6 of the first embodiment are given the same reference numerals, and only different processes will be described.
 端末装置1BAは、ステップSt100の処理で設定した閾値の値と、ステップSt102の処理で取得した撮像データおよび音声データとをサーバ2Bに送信する(St200)。 The terminal device 1BA transmits the threshold value set in the process of step St100 and the imaging data and audio data acquired in the process of step St102 to the server 2B (St200).
 端末装置1BBは、ステップSt104の処理で設定した閾値の値と、ステップSt106の処理で取得した撮像データおよび音声データとをサーバ2Bに送信する(St200)。 The terminal device 1BB transmits the threshold value set in the process of step St104 and the imaging data and audio data acquired in the process of step St106 to the server 2B (St200).
 サーバ2Bは、ステップSt200の処理で取得した撮像データおよび音声データを基に特徴量を抽出する(St201)。 The server 2B extracts feature amounts based on the imaging data and audio data acquired in the process of step St200 (St201).
 サーバ2Bは、ステップSt201の処理で抽出した特徴量を基に満足度を算出する(St202)。以下の処理は、図6のシーケンス図に係る各処理と同様であるため説明を省略する。 The server 2B calculates the degree of satisfaction based on the feature amount extracted in the process of step St201 (St202). The following processing is the same as each processing related to the sequence diagram of FIG. 6, so the explanation will be omitted.
<実施の形態3> <Embodiment 3>
 実施の形態3に係る評価システムは、過去に端末装置で取得した撮像データおよび音声データ(つまり、過去に録画または録音したデータ)に基づき、端末装置もしくはサーバで満足度の算出を実施する。以下、実施の形態1と同一の構成要素については同一の符号を用いることで、その説明を省略する。 The evaluation system according to Embodiment 3 calculates the satisfaction level on the terminal device or the server based on imaging data and audio data previously acquired by the terminal device (that is, previously recorded video or audio). Hereinafter, the same reference numerals will be used for the same components as in Embodiment 1, and the description thereof will be omitted.
 図9を参照して、実施の形態3に係る端末装置の内部構成例を説明する。図9は、実施の形態3に係る端末装置の内部構成例を示した図である。図3の実施の形態1に係るハードウェアブロック図と異なる部分のみを説明する。 An example of the internal configuration of the terminal device according to the third embodiment will be described with reference to FIG. FIG. 9 is a diagram showing an example of the internal configuration of a terminal device according to the third embodiment. Only the parts that are different from the hardware block diagram according to the first embodiment shown in FIG. 3 will be explained.
 端末装置1Cは、通信I/F13、メモリ14、入力デバイス15、表示デバイス16、音声取得デバイス10、撮像デバイス11およびプロセッサ12を有する。音声取得デバイス10および撮像デバイス11は省略されてもよい。 The terminal device 1C includes a communication I/F 13, a memory 14, an input device 15, a display device 16, an audio acquisition device 10, an imaging device 11, and a processor 12. The audio acquisition device 10 and the imaging device 11 may be omitted.
 通信I/F13は、プロセッサ12の描画画面作成部24Bによって描画された画面を他の端末装置等に送信してもよい。また、音声取得デバイス10および撮像デバイス11が外部装置である場合に、通信I/F13は外部装置から過去に撮像された撮像データおよび過去に収音された音声データを取得する。 The communication I/F 13 may transmit the screen drawn by the drawing screen creation unit 24B of the processor 12 to another terminal device or the like. Further, when the audio acquisition device 10 and the imaging device 11 are external devices, the communication I/F 13 acquires image data captured in the past and audio data captured in the past from the external devices.
 プロセッサ12の特徴量抽出部12Aは、過去に撮像された撮像データおよび過去に収音された音声データを基に特徴量を抽出する。特徴量抽出部12Aは、抽出した特徴量データを満足度推定部24Aに出力する。 The feature amount extraction unit 12A of the processor 12 extracts feature amounts based on image data captured in the past and audio data captured in the past. The feature extraction unit 12A outputs the extracted feature data to the satisfaction estimation unit 24A.
 特徴量抽出部12Aは、人物Aと人物Bとの両方の撮像データおよび音声データが含まれた1つのファイルを取得する。特徴量抽出部12Aは、1つのファイルから画像認識または音声認識等の公知の技術を用いて、人物Aの撮像データと音声データと、人物Bの撮像データと音声データとの4つのデータに分離する。 The feature extraction unit 12A obtains one file that includes the image data and audio data of both person A and person B. Using known techniques such as image recognition or voice recognition, the feature extraction unit 12A separates the file into four sets of data: the image data and audio data of person A, and the image data and audio data of person B.
 また、特徴量抽出部12Aは、人物Aの撮像データと音声データとが含まれたファイルと、人物Bの撮像データと音声データとが含まれたファイルとの2つのファイルを取得してもよい。特徴量抽出部12Aは、公知の技術を用いて各ファイルから撮像データと音声データとに分離する。なお、入力デバイス15は、ユーザ(例えば、人物Aまたは人物B)から、2つの各ファイルが人物Aと人物Bとのどちらに紐づくのかに係る入力を取得してもよい。 Alternatively, the feature extraction unit 12A may obtain two files: a file containing the image data and audio data of person A, and a file containing the image data and audio data of person B. The feature extraction unit 12A separates each file into image data and audio data using a known technique. Note that the input device 15 may obtain an input from the user (for example, person A or person B) indicating whether each of the two files is associated with person A or person B.
 また、特徴量抽出部12Aは、人物Aの撮像データのファイルと、人物Aの音声データのファイルと、人物Bの撮像データのファイルと、人物Bの音声データのファイルとの4つのファイルを取得してもよい。なお、入力デバイス15は、ユーザ(例えば、人物Aまたは人物B)から、4つの各ファイルが人物Aと人物Bとのどちらに紐づくのかに関する入力を取得してもよい。 The feature extraction unit 12A may also obtain four files: a file of the image data of person A, a file of the audio data of person A, a file of the image data of person B, and a file of the audio data of person B. Note that the input device 15 may obtain an input from the user (for example, person A or person B) indicating whether each of the four files is associated with person A or person B.
 また、過去に端末装置で取得した撮像データおよび音声データに基づきサーバで満足度を算出する場合、実施の形態3のハードウェアブロック図は、実施の形態2の図7と同様となる。端末装置1Bの音声取得デバイス10で過去に取得した音声データと、端末装置1Bの撮像デバイス11で過去に取得した撮像データとを、サーバ2Bは取得する。サーバ2Bの特徴量抽出部12Aは、取得した音声データと撮像データとに基づき特徴量を抽出し、満足度推定部24Aは、抽出された特徴量に基づき満足度を算出する。 Furthermore, when the satisfaction level is calculated by the server based on the imaging data and audio data acquired by the terminal device in the past, the hardware block diagram of the third embodiment is similar to FIG. 7 of the second embodiment. The server 2B acquires audio data previously acquired by the audio acquisition device 10 of the terminal device 1B and imaging data previously acquired by the imaging device 11 of the terminal device 1B. The feature amount extraction unit 12A of the server 2B extracts the feature amount based on the acquired audio data and image data, and the satisfaction level estimation unit 24A calculates the satisfaction level based on the extracted feature amount.
 次に、図10を参照して、端末装置で満足度を算出する処理を説明する。図10は、端末装置で満足度を算出する処理を表すフローチャートである。図10に係る各処理はプロセッサ12によって実行される。 Next, with reference to FIG. 10, the process of calculating the satisfaction level on the terminal device will be described. FIG. 10 is a flowchart illustrating the process of calculating the degree of satisfaction on the terminal device. Each process related to FIG. 10 is executed by the processor 12.
 プロセッサ12は、満足度推定アルゴリズムによる満足度の加点および減点に関する各閾値の値を設定する(St300)。プロセッサ12は、入力デバイス15から取得したユーザ(例えば、人物B)の入力信号を取得して閾値の設定をしてもよいし、メモリ14に予め保存された設定値を基に自動で設定してもよい。 The processor 12 sets the value of each threshold for adding and deducting satisfaction points in the satisfaction estimation algorithm (St300). The processor 12 may set the thresholds based on an input signal from the user (for example, person B) obtained via the input device 15, or may set them automatically based on setting values stored in advance in the memory 14.
 プロセッサ12は、メモリ14に保存されている過去に撮像した撮像データおよび収音した音声データを取得する(St301)。なお、プロセッサ12は、過去のデータに限られず、端末装置1Cの音声取得デバイス10と撮像デバイス11とで現在取得しているデータを取得してもよい。 The processor 12 acquires previously captured image data and captured audio data stored in the memory 14 (St301). Note that the processor 12 is not limited to past data, and may acquire data currently being acquired by the audio acquisition device 10 and the imaging device 11 of the terminal device 1C.
 プロセッサ12は、ステップSt301の処理で取得した撮像データと音声データとから特徴量を抽出する(St302)。 The processor 12 extracts feature amounts from the imaging data and audio data acquired in the process of step St301 (St302).
 プロセッサ12は、ステップSt302の処理で抽出した特徴量を基にユーザ(例えば、人物A)の満足度を算出する(St303)。 The processor 12 calculates the satisfaction level of the user (for example, person A) based on the feature amount extracted in the process of step St302 (St303).
 プロセッサ12は、ステップSt303の処理で算出した満足度の結果を示す画面を描画する(St304)。 The processor 12 draws a screen showing the satisfaction level calculated in the process of step St303 (St304).
 次に、図11を参照して過去に撮像および収音したデータからサーバで満足度を算出する処理を説明する。図11は、過去に撮像および収音したデータからサーバで満足度を算出する処理を示すシーケンス図である。 Next, with reference to FIG. 11, the process in which the server calculates the satisfaction level from previously captured video and audio data will be described. FIG. 11 is a sequence diagram showing the process in which the server calculates the satisfaction level from previously captured video and audio data.
 端末装置1Bは、満足度推定アルゴリズムによる満足度の加点および減点に関する閾値の設定情報をサーバ2Bに送信する(St400)。 The terminal device 1B transmits to the server 2B threshold setting information regarding addition and deduction of satisfaction points based on the satisfaction estimation algorithm (St400).
 端末装置1Bは、過去に撮像した撮像データおよび過去に収音した音声データをサーバ2Bに送信する(St401)。 The terminal device 1B transmits image data captured in the past and audio data captured in the past to the server 2B (St401).
 サーバ2Bは、ステップSt401の処理で取得した撮像データおよび音声データを基に特徴量を抽出する(St402)。 The server 2B extracts feature amounts based on the imaging data and audio data acquired in the process of step St401 (St402).
 サーバ2Bは、ステップSt402の処理で取得した特徴量を基に満足度を算出する(St403)。 The server 2B calculates the degree of satisfaction based on the feature amount acquired in the process of step St402 (St403).
 端末装置1Bは、満足度の最終結果の送信をサーバ2Bに要求する(St404)。 The terminal device 1B requests the server 2B to send the final result of the satisfaction level (St404).
 サーバ2Bは、ステップSt404の処理で端末装置1Bから受けた要求に基づき満足度の最終結果を含む画面を描画する。サーバ2Bは、描画した画面を端末装置1Bに送信する(St405)。 The server 2B draws a screen including the final satisfaction result based on the request received from the terminal device 1B in the process of step St404. The server 2B transmits the drawn screen to the terminal device 1B (St405).
 端末装置1Bは、ステップSt405の処理で取得した満足度の最終結果を含む画面を表示する(St406)。 The terminal device 1B displays a screen including the final result of the satisfaction level obtained in the process of step St405 (St406).
 次に、図12を参照して、端末装置に表示される画面の一例を説明する。図12は、端末装置に表示される画面の一例を示す図である。 Next, an example of a screen displayed on the terminal device will be described with reference to FIG. 12. FIG. 12 is a diagram showing an example of a screen displayed on a terminal device.
 画面MN1は、会議をしている間のある瞬間において端末装置1に表示される画面の一例である。例えば、人物Aと人物Bとが会議をしており、人物Aが被評価者である場合、画面MN1は人物Bが参照する画面である。画面MN1は、表示領域IT1,IT2およびボタンBT1,BT2,BT3,BT4,BT5,BT6を含む。 Screen MN1 is an example of a screen displayed on terminal device 1 at a certain moment during a meeting. For example, if Person A and Person B are having a meeting and Person A is the person to be evaluated, screen MN1 is the screen that Person B refers to. Screen MN1 includes display areas IT1, IT2 and buttons BT1, BT2, BT3, BT4, BT5, and BT6.
 表示領域IT2は、人物Aの撮像映像がリアルタイムで表示される領域である。描画画面作成部24Bは、撮像デバイス11から取得した人物Aの撮像映像を表示領域IT2に表示させる。 The display area IT2 is an area where the captured video of the person A is displayed in real time. The drawing screen creation unit 24B displays the captured video of the person A acquired from the imaging device 11 in the display area IT2.
 表示領域IT1は、満足度の結果が表示される領域である。表示領域IT1は、所定の時間間隔で算出された満足度の値をプロットしたグラフを表示する。描画画面作成部24Bは、満足度推定部24Aから満足度を取得したタイミングで、満足度を表示領域IT1に表示させてもよい。なお、表示領域IT1は、グラフに限られず、会議開始時から今までの間で算出された満足度の値を数字で表示してもよいし、所定の時間間隔で算出された満足度の値を数字で都度表示してもよい。また、表示領域IT1は、算出された満足度の値をもとに「高」、「中」、「低」のようなテキストで現在の満足度を表示してもよいし、満足度に応じた顔文字または絵文字等を表示してもよい。 The display area IT1 is an area where the satisfaction results are displayed. The display area IT1 displays a graph in which the satisfaction values calculated at predetermined time intervals are plotted. The drawing screen creation unit 24B may display the satisfaction level in the display area IT1 at the timing when the satisfaction level is acquired from the satisfaction estimation unit 24A. Note that the display area IT1 is not limited to a graph; it may display, as a number, the satisfaction value calculated from the start of the meeting up to the present, or it may display, as a number each time, the satisfaction value calculated at each predetermined time interval. The display area IT1 may also display the current satisfaction level as text such as "high", "medium", or "low" based on the calculated satisfaction value, or may display an emoticon or pictogram corresponding to the satisfaction level.
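 One way to derive the "high", "medium", or "low" text mentioned above is a simple mapping such as the following (Python, for illustration). Only the 0 to 5 scale and the label wording come from the description; the cut-off values are assumptions.

    def satisfaction_label(score):
        # Maps a 0-5 satisfaction score to the kind of text shown in display area IT1.
        if score >= 4.0:
            return "high"
        if score >= 2.0:
            return "medium"
        return "low"

    print(satisfaction_label(4.5))  # -> high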
 ボタンBT1は、相手の端末装置1へ自分の撮像映像の表示をオンにするボタンである。ボタンBT2は、相手の端末装置1へ自分の撮像映像の表示をオフにするボタンである。 The button BT1 is a button that turns on the display of the captured image of the user on the other party's terminal device 1. The button BT2 is a button for turning off the display of the user's captured video on the other party's terminal device 1.
 ボタンBT3は、相手の端末装置1へ自分の音声の出力をオンにするボタンである。ボタンBT4は、相手の端末装置1へ自分の音声の出力をオフにするボタンである。 The button BT3 is a button that turns on the output of your own voice to the terminal device 1 of the other party. The button BT4 is a button for turning off the output of your own voice to the terminal device 1 of the other party.
 ボタンBT5は、満足度の評価を開始または終了させるボタンである。ボタンBT5は、画面MN1から省略されてもよい。 The button BT5 is a button for starting or ending satisfaction evaluation. Button BT5 may be omitted from screen MN1.
 ボタンBT6は、会議を開始または終了させるボタンである。 The button BT6 is a button for starting or ending a conference.
 画面MN2は、会議をしている間のある瞬間において端末装置1に表示される画面の一例である。画面MN2は、画面MN1が端末装置1に表示されてから時間が1分経過した時に端末装置1に表示される画面である。 Screen MN2 is an example of a screen displayed on the terminal device 1 at a certain moment during the meeting. Screen MN2 is a screen displayed on terminal device 1 when one minute has passed since screen MN1 was displayed on terminal device 1.
 表示領域IT3は、満足度の結果が表示される領域である。表示領域IT3は、所定の時間間隔で算出された満足度の値をプロットしたグラフを表示する。表示領域IT3は、表示領域IT1に表示されたグラフに、人物Aと人物Bとの会話が1分さらに進行したことによってさらに2つの満足度の結果が追加でプロットされたグラフを表示する。このように、表示領域IT3は、経過した時間に応じてリアルタイムで満足度の結果が追加でプロットされていく。 The display area IT3 is an area where the satisfaction results are displayed. The display area IT3 displays a graph in which satisfaction values calculated at predetermined time intervals are plotted. The display area IT3 displays a graph in which two satisfaction results are additionally plotted on the graph displayed in the display area IT1 as the conversation between person A and person B progresses for one minute. In this way, in the display area IT3, satisfaction results are additionally plotted in real time according to the elapsed time.
 次に、図13を参照して、満足度の結果に応じてメッセージを表示された画面の一例を説明する。図13は、満足度の結果に応じてメッセージを表示された画面の一例を示す図である。図13の説明において、図12と重複する要素については同一の符号を付与して説明を簡略化あるいは省略し、異なる内容について説明する。 Next, with reference to FIG. 13, an example of a screen on which a message is displayed according to the satisfaction level result will be described. FIG. 13 is a diagram showing an example of a screen on which a message is displayed according to the satisfaction level result. In the description of FIG. 13, elements that overlap with those in FIG. 12 are given the same reference numerals to simplify or omit the description, and different contents will be described.
 画面MN3は、会議をしている間のある瞬間において端末装置1に表示される画面の一例である。画面MN3は、画面MN1が端末装置1に表示されてから時間が1分経過した時に端末装置1に表示される画面である。 Screen MN3 is an example of a screen displayed on terminal device 1 at a certain moment during a meeting. Screen MN3 is a screen that is displayed on terminal device 1 when one minute has elapsed since screen MN1 was displayed on terminal device 1.
 表示領域IT4は、満足度の結果が表示される領域である。表示領域IT4は、所定の時間間隔で算出された満足度の値をプロットしたグラフを表示する。表示領域IT4は、表示領域IT1に表示されたグラフに、人物Aと人物Bとの会話が1分さらに進行したことによってさらに2つの満足度の結果が追加でプロットされたグラフを表示する。 The display area IT4 is an area where the satisfaction results are displayed. The display area IT4 displays a graph in which satisfaction values calculated at predetermined time intervals are plotted. The display area IT4 displays a graph in which two satisfaction results are additionally plotted on the graph displayed in the display area IT1 as the conversation between person A and person B progresses for one minute.
 メッセージMesは、満足度の値に応じて表示されるメッセージである。例えば、メッセージMesは、人物Aの発話率に応じて表示される。満足度推定部24Aは、人物Aの発話率が人物Bの発話率未満であると判定した場合、人物Aの発話率が人物Bの発話率未満である旨の信号を描画画面作成部24Bに出力する。なお、満足度推定部24Aは、所定の時間継続して人物Aの発話率が人物Bの発話率未満であると判定した場合、人物Aの発話率が人物Bの発話率未満である旨の信号を描画画面作成部24Bに出力してもよい。また、満足度推定部24Aは、人物Aの発話率が第2閾値未満であると判定した場合、人物Aの発話率が第2閾値未満である旨の信号を描画画面作成部24Bに出力してもよい。なお、人物Aの発話率が人物Bの発話率未満であるか否かに係る判定および人物Aの発話率が第2閾値未満であるか否かに係る判定は、特徴量抽出部12Aが行ってもよい。描画画面作成部24Bは、満足度推定部24Aから取得した信号に基づき、人物Bに対し発話を控える旨のメッセージを作成し画面MN3に表示させる。発話を控える旨のメッセージとは、例えば「人物Aの話を聞きましょう」である。なお、発話を控える旨のメッセージは一例でありこれに限られない。 The message Mes is a message displayed according to the satisfaction level. For example, the message Mes is displayed according to person A's speech rate. When the satisfaction estimation unit 24A determines that the speech rate of person A is less than the speech rate of person B, it outputs a signal to that effect to the drawing screen creation unit 24B. The satisfaction estimation unit 24A may output this signal to the drawing screen creation unit 24B when it determines that the speech rate of person A has remained less than the speech rate of person B for a predetermined period of time. Further, when the satisfaction estimation unit 24A determines that the speech rate of person A is less than the second threshold, it may output a signal to that effect to the drawing screen creation unit 24B. Note that the determination as to whether the speech rate of person A is less than the speech rate of person B and the determination as to whether the speech rate of person A is less than the second threshold may be performed by the feature extraction unit 12A. Based on the signal acquired from the satisfaction estimation unit 24A, the drawing screen creation unit 24B creates a message asking person B to refrain from speaking and displays it on the screen MN3. The message to refrain from speaking is, for example, "Let's listen to what Person A has to say." Note that this message is merely an example and is not limited to this.
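 The message logic described above can be sketched as follows (Python, for illustration). The comparison of the two speech rates and the example message text come from the description; the default value used here for the second threshold is an assumption.

    def coaching_message(speech_rate_a, speech_rate_b, second_threshold=0.3):
        # Returns a message for person B when person A is not getting enough speaking time;
        # None means no message is shown on screen MN3.
        if speech_rate_a < speech_rate_b or speech_rate_a < second_threshold:
            return "Let's listen to what Person A has to say."
        return None

    print(coaching_message(0.2, 0.8))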
 なお、端末装置1もしくはサーバ2は、複数の人物のそれぞれの満足度を算出し全員の満足度の平均値を算出してもよい。このように、端末装置1もしくはサーバ2は、個人を特定しない形で満足度を集計しユーザに通知してもよい。 Note that the terminal device 1 or the server 2 may calculate the satisfaction level of each of a plurality of people and calculate the average value of the satisfaction levels of all the people. In this way, the terminal device 1 or the server 2 may aggregate the satisfaction level and notify the user without identifying the individual.
 端末装置1は、人物Aの満足度の値に応じて、人物Bの見ている画面にアバタ等の視聴者の注意を引き付けるもの表示してもよい。なお、アバタ等は人物Aの画面にも表示されてもよい。例えば、評価システムは、満足度が低い場合、アバタが画面に表示されることで、人物Aおよび人物Bの注意を画面にひきつけ人物Aの満足度を向上させることができる。 Depending on the satisfaction level of person A, the terminal device 1 may display something that attracts the viewer's attention, such as an avatar, on the screen that person B is viewing. Note that the avatar and the like may also be displayed on the screen of person A. For example, when the satisfaction level is low, the evaluation system can improve the satisfaction level of person A by displaying an avatar on the screen to attract the attention of person A and person B to the screen.
 端末装置1は、人物Aの動作に応じて、人物Aが今考え中であることを伝える通知を表示してもよい。 Depending on the person A's actions, the terminal device 1 may display a notification that the person A is currently thinking.
 端末装置1またはサーバ2は、会話をしている相手の撮像映像を表示デバイス16に表示させず(つまり、撮像映像の表示をオフにした状態で)に満足度を算出してもよい。 The terminal device 1 or the server 2 may calculate the degree of satisfaction without displaying the captured video of the person having the conversation on the display device 16 (that is, with the display of the captured video turned off).
 以上により、本実施の形態に係る評価システム(例えば、評価システム100、評価システム100A、評価システム100B)は、第1の人物と第2の人物との会話に係る音声データを取得する取得部(例えば、音声取得デバイス10)と、第1の人物と第2の人物とを撮像する撮像部(例えば、撮像デバイス11)とを備える。評価システムは、撮像部の撮像データに基づいて第1の人物と第2の人物とのそれぞれの視線もしくは顔の向きに係る第1の特徴量と、音声データの第2の特徴量と、を抽出する抽出部(例えば、特徴量抽出部12A)を備える。評価システムは、第1の特徴量と、第2の特徴量と、第1の人物の満足度を算出するための算出アルゴリズムと、に基づいて、第1の人物の満足度を算出する満足度算出部(例えば、満足度推定部24A)、を備える。 As described above, the evaluation system according to the present embodiment (for example, the evaluation system 100, 100A, or 100B) includes an acquisition unit (for example, the audio acquisition device 10) that acquires audio data of a conversation between a first person and a second person, and an imaging unit (for example, the imaging device 11) that images the first person and the second person. The evaluation system includes an extraction unit (for example, the feature amount extraction unit 12A) that extracts, based on the imaging data of the imaging unit, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data. The evaluation system further includes a satisfaction calculation unit (for example, the satisfaction estimation unit 24A) that calculates the satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person.
 これにより、評価システムは、第1の人物の視線もしくは顔の向きに係る情報と、第1の人物の音声データに係る情報と、の2つの情報に基づき満足度を算出することができる。これにより、評価システムは、人物間の会話に含まれる複数の情報を用いた高精度な満足度の評価を行うことができる。 Thereby, the evaluation system can calculate the satisfaction level based on two pieces of information: information related to the first person's line of sight or face direction, and information related to the first person's voice data. Thereby, the evaluation system can perform highly accurate satisfaction evaluation using a plurality of pieces of information included in conversations between people.
 また、本実施の形態の評価システムの満足度算出部は、会話の開始から終了までの間、所定の時間間隔で満足度を算出する。これにより、評価システムは、会話が開始してから会話が終了するまでの各時間で第1の人物の満足度を算出することができ柔軟な満足度の評価を行うことができる。 Furthermore, the satisfaction calculation unit of the evaluation system of this embodiment calculates the satisfaction at predetermined time intervals from the start to the end of the conversation. Thereby, the evaluation system can calculate the satisfaction level of the first person at each time from the start of the conversation until the end of the conversation, and can perform a flexible evaluation of the satisfaction level.
 また、本実施の形態の評価システムの抽出部は、第2の特徴量として、会話の中の前記第1の人物が発話している割合を示す第1の割合と第2の人物が発話している割合を示す第2の割合とを算出する。算出アルゴリズムは、第1の割合が第2の割合以上の場合、満足度の値に所定値を加点し、第1の割合が第2の割合未満の場合、満足度の値に所定値を減点する。これにより、評価システムは、第2の人物の発話率に対する第1の人物の発話率に応じて満足度を評価することができる。 Furthermore, the extraction unit of the evaluation system of the present embodiment calculates, as the second feature amount, a first ratio indicating the proportion of the conversation in which the first person is speaking and a second ratio indicating the proportion in which the second person is speaking. The calculation algorithm adds a predetermined value to the satisfaction value when the first ratio is greater than or equal to the second ratio, and subtracts the predetermined value from the satisfaction value when the first ratio is less than the second ratio. This allows the evaluation system to evaluate the satisfaction level according to the first person's speech rate relative to the second person's speech rate.
 また、本実施の形態の評価システムの算出アルゴリズムは、第1の割合が第2の割合以上かつ第1の割合が第1閾値以上の場合、満足度の値に所定値を加点する。算出アルゴリズムは、第1の割合が第2の割合未満かつ第1の割合が第1閾値以下の第2閾値未満の場合、満足度の値に所定値を減点する。これにより、評価システムは、第2の人物の発話率に対する第1の人物の発話率および閾値に対する第1の人物の発話率に応じて満足度を評価することができる。 Further, the calculation algorithm of the evaluation system of the present embodiment adds a predetermined value to the satisfaction value when the first ratio is equal to or higher than the second ratio and the first ratio is equal to or higher than the first threshold value. The calculation algorithm subtracts a predetermined value from the satisfaction value when the first ratio is less than the second ratio and the first ratio is less than the second threshold, which is less than or equal to the first threshold. Thereby, the evaluation system can evaluate the degree of satisfaction according to the speech rate of the first person relative to the speech rate of the second person and the speech rate of the first person relative to the threshold value.
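 A sketch of this speech-ratio rule is shown below (Python, for illustration). The comparison structure follows the two paragraphs above; the threshold values and the 0.5-point step are assumptions, since the disclosure leaves them configurable.

    def speech_ratio_delta(first_ratio, second_ratio,
                           first_threshold=0.5, second_threshold=0.3, step=0.5):
        # Add/subtract rule based on the speaking-time ratios and the first and second thresholds.
        if first_ratio >= second_ratio and first_ratio >= first_threshold:
            return step       # add points
        if first_ratio < second_ratio and first_ratio < second_threshold:
            return -step      # subtract points (the second threshold is at most the first threshold)
        return 0.0

    print(speech_ratio_delta(0.6, 0.4))   # -> 0.5
    print(speech_ratio_delta(0.2, 0.8))   # -> -0.5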
 また、本実施の形態の評価システムの抽出部は、第2の特徴量として、音声データから第1の人物の感情を検出し、感情から第1の人物がポジティブに感じた割合であるポジティブ率と、感情から第1の人物がネガティブに感じた割合であるネガティブ率と、を算出する。算出アルゴリズムは、ポジティブ率が加点に係る閾値以上の場合、満足度の値に所定値を加点し、ネガティブ率が減点に係る閾値以上の場合、満足度の値に所定値を減点する。これにより、評価システムは、第1の人物の音声データから検出した感情に基づき満足度の評価を行うことができる。 Furthermore, the extraction unit of the evaluation system of the present embodiment detects, as the second feature amount, the emotion of the first person from the audio data, and calculates from the emotion a positive rate, which is the proportion in which the first person felt positive, and a negative rate, which is the proportion in which the first person felt negative. The calculation algorithm adds a predetermined value to the satisfaction value when the positive rate is greater than or equal to a threshold for adding points, and subtracts the predetermined value from the satisfaction value when the negative rate is greater than or equal to a threshold for deducting points. This allows the evaluation system to evaluate the satisfaction level based on the emotion detected from the first person's audio data.
 また、本実施の形態の評価システムは、第1の人物が会話をする際に第2の人物が表示される第1表示部(例えば、表示デバイス16)、をさらに備える。抽出部は、第1の特徴量として、第1の人物が第1表示部を見ている時間を算出する。算出アルゴリズムは、時間が第3閾値以上の場合、満足度の値に所定値を加点し、時間が第3閾値以下の第4閾値未満の場合、満足度の値に所定値を減点する。これにより、評価システムは、第1の人物が第1表示部を見ている時間に基づき満足度の評価を行うことができる。 Furthermore, the evaluation system of this embodiment further includes a first display section (for example, display device 16) on which a second person is displayed when the first person has a conversation. The extraction unit calculates a time period during which the first person looks at the first display unit as the first feature amount. The calculation algorithm adds a predetermined value to the satisfaction value when the time is equal to or greater than a third threshold, and subtracts a predetermined value from the satisfaction value when the time is less than a fourth threshold that is equal to or less than the third threshold. Thereby, the evaluation system can evaluate the degree of satisfaction based on the time the first person looks at the first display section.
 また、本実施の形態の評価システムの算出アルゴリズムは、機械学習に基づき満足度を算出する。これにより、評価システムは、特徴量データから機械学習に基づいた算出アルゴリズムで満足度を算出することができる。 Furthermore, the calculation algorithm of the evaluation system of this embodiment calculates the degree of satisfaction based on machine learning. Thereby, the evaluation system can calculate the degree of satisfaction from the feature amount data using a calculation algorithm based on machine learning.
 また、本実施の形態の評価システムは、第2の人物が会話をする際に参照する画面を表示する第2表示部(例えば、表示デバイス16)と、画面を作成する画面作成部(例えば、描画画面作成部24B)と、をさらに備える。画面作成部は、満足度算出部によって算出された満足度の結果を含む画面を作成し第2表示部に表示させる。これにより、評価者である第2の人物は、第1の人物の満足度の結果を確認することができる。これにより、評価システムは、第2の人物に第1の人物の満足度の結果を通知することで、第1の人物が高い満足度で会話をすることを支援することができる。 Furthermore, the evaluation system of the present embodiment includes a second display section (e.g., display device 16) that displays a screen that the second person refers to when having a conversation, and a screen creation section (e.g., It further includes a drawing screen creation section 24B). The screen creation unit creates a screen including the satisfaction result calculated by the satisfaction calculation unit and causes the second display unit to display the screen. This allows the second person, who is the evaluator, to confirm the satisfaction level results of the first person. Thereby, the evaluation system can support the first person to have a conversation with a high level of satisfaction by notifying the second person of the result of the first person's satisfaction level.
 また、本実施の形態に係る評価システムは、第2の人物が会話をする際に参照する画面を表示する第2表示部と、画面を作成する画面作成部と、をさらに備える。画面作成部は、第1の人物と第2の人物との会話に応じて満足度算出部が算出する満足度を満足度算出部から取得している間満足度を第2表示部に表示させる。これにより、第2の人物は、第1の人物と会話している間に第1の人物の現在の満足度を確認することができる。評価システムは、第2の人物に第1の人物の現在の満足度を通知することで、第1の人物が高い満足度となるように会話することを支援することができる。 The evaluation system according to the present embodiment further includes a second display unit that displays a screen that the second person refers to when having a conversation, and a screen creation unit that creates the screen. The screen creation unit causes the second display unit to display the satisfaction level while acquiring, from the satisfaction calculation unit, the satisfaction level calculated by the satisfaction calculation unit in accordance with the conversation between the first person and the second person. This allows the second person to check the first person's current satisfaction level while conversing with the first person. By notifying the second person of the first person's current satisfaction level, the evaluation system can help the second person converse in a way that raises the first person's satisfaction level.
 また、本実施の形態に係る評価システムは、第2の人物が会話をする際に参照する画面を表示する第2表示部と、画面を作成する画面作成部と、をさらに備える。画面作成部は、第1の割合が第2の割合未満の場合、第2の人物に対し発話を控える旨のメッセージを画面に表示させる。これにより、評価システムは、第1の人物と第2の人物との発話率に基づき、第1の人物の満足度が高くなることを支援するメッセージを表示することができる。 The evaluation system according to the present embodiment further includes a second display unit that displays a screen that the second person refers to when having a conversation, and a screen creation unit that creates the screen. When the first ratio is less than the second ratio, the screen creation unit displays on the screen a message to the effect that the second person should refrain from speaking. Thereby, the evaluation system can display a message that helps increase the satisfaction level of the first person based on the speech rates of the first person and the second person.
 また、本実施の形態に係る評価システムの画面作成部が作成した画面は、第1の人物の撮像映像が映し出される表示領域と、満足度の結果が表示される表示領域と、第2の人物の撮像映像を第1の人物が参照する画面に表示させるボタンと、第2の人物の前記音声データを第1の人物が使用する端末装置から出力させるボタンと、会議の開始または終了を制御するボタンと、を含む。これにより、評価システムは、満足度の結果が含まれた画面を第2の人物に対し表示することができる。これにより、評価システムは、第2の人物に第1の人物の満足度を通知することで、第2の人物が第1の人物と円滑に会話することを支援することができる。 Furthermore, the screen created by the screen creation unit of the evaluation system according to the present embodiment includes a display area in which the captured video of the first person is shown, a display area in which the satisfaction results are displayed, a button for displaying the captured video of the second person on the screen referenced by the first person, a button for outputting the audio data of the second person from the terminal device used by the first person, and a button for controlling the start or end of the conference. This allows the evaluation system to display a screen including the satisfaction results to the second person. By notifying the second person of the first person's satisfaction level, the evaluation system can help the second person converse smoothly with the first person.
 また、本実施の形態に係る評価システムの抽出部は、取得部で予め取得された音声データと、撮像部で予め撮像された第1の人物と第2の人物との撮像データと、に基づいて、第1の人物と第2の人物とのそれぞれの視線もしくは顔の向きに係る第1の特徴量と、音声データの第2の特徴量と、を抽出する。これにより、評価システムは、過去に撮像した撮像データおよび収音した音声データから特徴量を抽出し、第1の人物の満足度を評価することができる。 Further, the extraction unit of the evaluation system according to the present embodiment is based on the audio data acquired in advance by the acquisition unit and the imaged data of the first person and the second person that were imaged in advance by the imaging unit. Then, a first feature amount related to the line of sight or face direction of each of the first person and the second person, and a second feature amount of the audio data are extracted. Thereby, the evaluation system can extract the feature amount from the image data captured in the past and the audio data captured, and evaluate the satisfaction level of the first person.
 また、本実施の形態に係る評価システムの抽出部は、撮像部の撮像データに基づいて第1の人物と第2の人物とのそれぞれの表情に係る第3の特徴量を抽出し、満足度算出部は、第3の特徴量と算出アルゴリズムとに基づいて第1の人物の満足度を算出する。これにより、評価システムは、第1の人物の表情に基づく特徴量から満足度を評価することができる。 Furthermore, the extraction unit of the evaluation system according to the present embodiment extracts a third feature amount related to the facial expressions of each of the first person and the second person based on the imaging data of the imaging unit, and the satisfaction calculation unit calculates the satisfaction level of the first person based on the third feature amount and the calculation algorithm. This allows the evaluation system to evaluate the satisfaction level from a feature amount based on the first person's facial expression.
 また、本実施の形態に係る評価システムの抽出部は、撮像部の撮像データに基づいて第1の人物と第2の人物とのそれぞれの行動に係る第4の特徴量を抽出する。満足度算出部は、第4の特徴量と算出アルゴリズムとに基づいて人物Aの満足度を算出する。これにより、評価システムは、第1の人物の行動に基づく特徴量から満足度を評価することができる。 Furthermore, the extraction unit of the evaluation system according to the present embodiment extracts a fourth feature amount related to each of the actions of the first person and the second person based on the imaging data of the imaging unit. The satisfaction level calculation unit calculates the satisfaction level of person A based on the fourth feature amount and the calculation algorithm. Thereby, the evaluation system can evaluate the degree of satisfaction from the feature amount based on the behavior of the first person.
 また、本実施の形態に係る評価システムで用いられる第2の特徴量は、音声の強度、単位時間あたりのモーラ数、単語別の強度、音量または音声のスペクトルのうち少なくとも1つである。これにより、評価システムは、第2の特徴量から第1の人物の感情を算出することができる。 Further, the second feature used in the evaluation system according to the present embodiment is at least one of the voice intensity, the number of moras per unit time, the intensity of each word, the volume, or the voice spectrum. Thereby, the evaluation system can calculate the first person's emotion from the second feature amount.
 また、本実施の形態の第2の人物は、第1の人物と対人関係にあり、対人関係には、上司と部下、従業員と顧客、同僚同士または面接官と面接を受ける人のうち少なくとも1つが含まれることを特徴とする。これにより、評価システムは、第2の人物が対人関係にある第1の人物と会話する状況において第1の人物の満足度を評価することができる。 Furthermore, the second person in the present embodiment has an interpersonal relationship with the first person, and the interpersonal relationship includes at least one of a boss and a subordinate, an employee and a customer, colleagues, or an interviewer and an interviewee. This allows the evaluation system to evaluate the satisfaction level of the first person in a situation where the second person converses with the first person with whom he or she has an interpersonal relationship.
 また、本実施の形態に係る評価システムは、算出アルゴリズムを記憶する算出アルゴリズム記憶部(例えば、メモリ14またはメモリ22)、をさらに備える。これにより、評価システムは、算出アルゴリズム記憶部に記憶された算出アルゴリズムに基づき、第1の人物の満足度を評価することができる。 Furthermore, the evaluation system according to the present embodiment further includes a calculation algorithm storage unit (for example, the memory 14 or the memory 22) that stores the calculation algorithm. Thereby, the evaluation system can evaluate the satisfaction level of the first person based on the calculation algorithm stored in the calculation algorithm storage unit.
 以上、添付図面を参照しながら実施の形態について説明したが、本開示はかかる例に限定されない。当業者であれば、特許請求の範囲に記載された範疇内において、各種の変更例、修正例、置換例、付加例、削除例、均等例に想到し得ることは明らかであり、それらについても本開示の技術的範囲に属すると了解される。また、発明の趣旨を逸脱しない範囲において、上述した実施の形態における各構成要素を任意に組み合わせてもよい。 Although the embodiments have been described above with reference to the accompanying drawings, the present disclosure is not limited to these examples. It is clear that those skilled in the art can conceive of various changes, modifications, substitutions, additions, deletions, and equivalents within the scope of the claims, and it is understood that these also fall within the technical scope of the present disclosure. Furthermore, the constituent elements of the embodiments described above may be combined as desired without departing from the spirit of the invention.
 なお、本出願は、2022年7月4日出願の日本特許出願(特願2022-107706)に基づくものであり、その内容は本出願の中に参照として援用される。 Note that this application is based on a Japanese patent application (Japanese Patent Application No. 2022-107706) filed on July 4, 2022, the contents of which are incorporated as a reference in this application.
 本開示の技術は、人物間の会話に含まれる複数の情報を用いた高精度な満足度の評価を行う評価システム、評価装置および評価方法として有用である。 The technology of the present disclosure is useful as an evaluation system, an evaluation device, and an evaluation method that perform highly accurate satisfaction evaluation using multiple pieces of information included in conversations between people.
 100,100A,100B 評価システム Evaluation system
 1,1A,1B,1C,1AA,1AB,1BA,1BB 端末装置 Terminal device
 2,2A,2B サーバ Server
 10 音声取得デバイス Audio acquisition device
 11 撮像デバイス Imaging device
 12,24 プロセッサ Processor
 12A 特徴量抽出部 Feature extraction unit
 13 通信I/F Communication I/F
 14,22 メモリ Memory
 15,23 入力デバイス Input device
 16 表示デバイス Display device
 17,26 I/F
 21 通信I/F Communication I/F
 24A 満足度推定部 Satisfaction estimation unit
 24B 描画画面作成部 Drawing screen creation unit
 NW ネットワーク Network
 A,B 人物 Person
 CO 発話 Speech
 FR1,FR2 画像 Image
 CA,CB ケース Case
 MN1,MN2,MN3 画面 Screen
 IT1,IT2,IT3,IT4 表示領域 Display area
 BT1,BT2,BT3,BT4,BT5,BT6 ボタン Button
 Mes メッセージ Message

Claims (19)

  1.  第1の人物と第2の人物との会話に係る音声データを取得する取得部と、
     前記第1の人物と前記第2の人物とを撮像する撮像部と、
     前記撮像部の撮像データに基づいて前記第1の人物と前記第2の人物とのそれぞれの視線もしくは顔の向きに係る第1の特徴量と、前記音声データの第2の特徴量と、を抽出する抽出部と、
     前記第1の特徴量と、前記第2の特徴量と、前記第1の人物の満足度を算出するための算出アルゴリズムと、に基づいて、前記第1の人物の満足度を算出する満足度算出部と、を備える、
     評価システム。
    an acquisition unit that acquires audio data of a conversation between a first person and a second person;
    an imaging unit that images the first person and the second person;
    an extraction unit that extracts, based on imaging data of the imaging unit, a first feature amount related to a line of sight or a face direction of each of the first person and the second person, and a second feature amount of the audio data; and
    a satisfaction calculation unit that calculates a satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person,
    An evaluation system.
  2.  前記満足度算出部は、前記会話の開始から終了までの間、所定の時間間隔で前記満足度を算出する、
     請求項1の評価システム。
    The satisfaction level calculation unit calculates the satisfaction level at predetermined time intervals from the start to the end of the conversation.
    The evaluation system according to claim 1.
  3.  前記抽出部は、前記第2の特徴量として、前記会話の中の前記第1の人物が発話している割合を示す第1の割合と前記第2の人物が発話している割合を示す第2の割合とを算出し、
     前記算出アルゴリズムは、
      前記第1の割合が前記第2の割合以上の場合、前記満足度の値に所定値を加点し、
      前記第1の割合が前記第2の割合未満の場合、前記満足度の値に前記所定値を減点する、
     請求項1に記載の評価システム。
    The extraction unit calculates, as the second feature amount, a first ratio indicating the proportion of the conversation in which the first person is speaking and a second ratio indicating the proportion in which the second person is speaking, and
    The calculation algorithm is
    If the first ratio is greater than or equal to the second ratio, a predetermined value is added to the satisfaction value;
    If the first ratio is less than the second ratio, subtracting the predetermined value from the satisfaction value;
    The evaluation system according to claim 1.
  4.  前記算出アルゴリズムは、
      前記第1の割合が前記第2の割合以上かつ前記第1の割合が第1閾値以上の場合、前記満足度の値に前記所定値を加点し、
      前記第1の割合が前記第2の割合未満かつ前記第1の割合が前記第1閾値以下の第2閾値未満の場合、前記満足度の値に前記所定値を減点する、
     請求項3に記載の評価システム。
    The calculation algorithm is
    If the first proportion is equal to or greater than the second proportion and the first proportion is equal to or greater than a first threshold, the predetermined value is added to the satisfaction value;
    If the first percentage is less than the second percentage and the first percentage is less than a second threshold that is less than or equal to the first threshold, subtracting the predetermined value from the satisfaction value;
    The evaluation system according to claim 3.
  5.  前記抽出部は、前記第2の特徴量として、前記音声データから前記第1の人物の感情を検出し、前記感情から前記第1の人物がポジティブに感じた割合であるポジティブ率と、前記感情から前記第1の人物がネガティブに感じた割合であるネガティブ率と、を算出し、
     前記算出アルゴリズムは、
      前記ポジティブ率が加点に係る閾値以上の場合、前記満足度の値に所定値を加点し、
      前記ネガティブ率が減点に係る閾値以上の場合、前記満足度の値に前記所定値を減点する、
     請求項1に記載の評価システム。
    The extraction unit detects, as the second feature amount, an emotion of the first person from the audio data, and calculates, from the emotion, a positive rate, which is a proportion in which the first person felt positive, and a negative rate, which is a proportion in which the first person felt negative, and
    The calculation algorithm is
    If the positive rate is equal to or higher than a threshold for adding points, add a predetermined value to the satisfaction value,
    If the negative rate is equal to or higher than a threshold for point deduction, subtracting the predetermined value from the satisfaction level;
    The evaluation system according to claim 1.
  6.  前記第1の人物が前記会話をする際に前記第2の人物が表示される第1表示部、をさらに備え、
     前記抽出部は、前記第1の特徴量として、前記第1の人物が前記第1表示部を見ている時間を算出し、
     前記算出アルゴリズムは、
      前記時間が第3閾値以上の場合、前記満足度の値に所定値を加点し、
      前記時間が前記第3閾値以下の第4閾値未満の場合、前記満足度の値に前記所定値を減点する、
     請求項1に記載の評価システム。
    further comprising a first display section on which the second person is displayed when the first person has the conversation,
    The extraction unit calculates, as the first feature amount, a time period during which the first person looks at the first display unit,
    The calculation algorithm is
    If the time is equal to or greater than a third threshold, a predetermined value is added to the satisfaction value;
    if the time is less than a fourth threshold that is equal to or less than the third threshold, subtracting the predetermined value from the satisfaction value;
    The evaluation system according to claim 1.
  7.  前記算出アルゴリズムは、機械学習に基づき満足度を算出する、
     請求項1に記載の評価システム。
    The calculation algorithm calculates the satisfaction level based on machine learning.
    The evaluation system according to claim 1.
  8.  前記第2の人物が前記会話をする際に参照する画面を表示する第2表示部と、
     前記画面を作成する画面作成部と、をさらに備え、
     前記画面作成部は、前記満足度算出部によって算出された前記満足度の結果を含む画面を作成し前記第2表示部に表示させる、
     請求項1に記載の評価システム。
    a second display unit that displays a screen that the second person refers to when having the conversation;
    further comprising a screen creation unit that creates the screen,
    The screen creation unit creates a screen including the satisfaction result calculated by the satisfaction calculation unit and displays it on the second display unit.
    The evaluation system according to claim 1.
  9.  前記第2の人物が前記会話をする際に参照する画面を表示する第2表示部と、
     前記画面を作成する画面作成部と、をさらに備え、
     前記画面作成部は、前記第1の人物と前記第2の人物との前記会話に応じて前記満足度算出部が算出する前記満足度を前記満足度算出部から取得している間前記満足度を前記第2表示部に表示させる、
     請求項2に記載の評価システム。
    a second display unit that displays a screen that the second person refers to when having the conversation;
    further comprising a screen creation unit that creates the screen,
    The screen creation unit causes the second display unit to display the satisfaction level while acquiring, from the satisfaction calculation unit, the satisfaction level calculated by the satisfaction calculation unit in accordance with the conversation between the first person and the second person,
    The evaluation system according to claim 2.
  10.  前記第2の人物が前記会話をする際に参照する画面を表示する第2表示部と、
     前記画面を作成する画面作成部と、をさらに備え、
     前記画面作成部は、前記第1の割合が前記第2の割合未満の場合、前記第2の人物に対し発話を控える旨のメッセージを前記画面に表示させる、
     請求項3に記載の評価システム。
    a second display unit that displays a screen that the second person refers to when having the conversation;
    further comprising a screen creation unit that creates the screen,
    When the first ratio is less than the second ratio, the screen creation unit displays a message on the screen to the effect that the second person should refrain from speaking.
    The evaluation system according to claim 3.
  11.  前記画面は、前記第1の人物の撮像映像が映し出される表示領域と、前記満足度の結果が表示される表示領域と、前記第2の人物の撮像映像を前記第1の人物が参照する画面に表示させるボタンと、前記第2の人物の前記音声データを前記第1の人物が使用する端末装置から出力させるボタンと、会議の開始または終了を制御するボタンと、を含む、
     請求項8に記載の評価システム。
    The screen includes a display area in which a captured video of the first person is displayed, a display area in which the satisfaction result is displayed, a button for displaying a captured video of the second person on a screen referenced by the first person, a button for outputting the audio data of the second person from a terminal device used by the first person, and a button for controlling the start or end of a conference,
    The evaluation system according to claim 8.
  12.  前記抽出部は、前記取得部で予め取得された前記音声データと、前記撮像部で予め撮像された前記第1の人物と前記第2の人物との撮像データと、に基づいて、前記第1の人物と前記第2の人物とのそれぞれの前記視線もしくは前記顔の向きに係る前記第1の特徴量と、前記音声データの前記第2の特徴量と、を抽出する、
     請求項1に記載の評価システム。
    The extraction unit extracts, based on the audio data acquired in advance by the acquisition unit and on imaging data of the first person and the second person captured in advance by the imaging unit, the first feature amount related to the line of sight or the face direction of each of the first person and the second person, and the second feature amount of the audio data,
    The evaluation system according to claim 1.
  13.  前記抽出部は、前記撮像部の前記撮像データに基づいて前記第1の人物と前記第2の人物とのそれぞれの表情に係る第3の特徴量を抽出し、
     前記満足度算出部は、前記第3の特徴量と前記算出アルゴリズムとに基づいて前記第1の人物の満足度を算出する、
     請求項1に記載の評価システム。
    The extraction unit extracts a third feature amount related to facial expressions of each of the first person and the second person based on the imaging data of the imaging unit,
    The satisfaction level calculation unit calculates the satisfaction level of the first person based on the third feature amount and the calculation algorithm.
    The evaluation system according to claim 1.
  14.  前記抽出部は、前記撮像部の前記撮像データに基づいて前記第1の人物と前記第2の人物とのそれぞれの行動に係る第4の特徴量を抽出し、
     前記満足度算出部は、前記第4の特徴量と前記算出アルゴリズムとに基づいて前記人物Aの満足度を算出する、
     請求項1に記載の評価システム。
    The extraction unit extracts a fourth feature amount relating to the behavior of each of the first person and the second person based on the imaging data of the imaging unit, and
    the satisfaction level calculation unit calculates the satisfaction level of the person A based on the fourth feature amount and the calculation algorithm,
    The evaluation system according to claim 1.
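Claim 14's fourth feature amount relates to the participants' behavior. As a hedged illustration only (the publication does not say which behaviors are used), the sketch below counts nods by looking for direction changes in a sequence of per-frame vertical head positions taken from the imaging data.

```python
from typing import List


def count_nods(head_y_positions: List[float], min_amplitude: float = 3.0) -> int:
    """Count downward-then-upward head movements larger than `min_amplitude` pixels.

    `head_y_positions` is a per-frame vertical head coordinate (image y grows
    downward). Nodding is only an example behavior; the publication does not
    specify which behaviors feed the fourth feature amount.
    """
    nods = 0
    for i in range(1, len(head_y_positions) - 1):
        prev_y, cur_y, next_y = head_y_positions[i - 1], head_y_positions[i], head_y_positions[i + 1]
        # A local maximum of y (lowest head position) bounded by sufficiently
        # higher positions on both sides is counted as one nod.
        if cur_y > prev_y and cur_y > next_y and \
           cur_y - prev_y >= min_amplitude and cur_y - next_y >= min_amplitude:
            nods += 1
    return nods


print(count_nods([100, 100, 108, 101, 100, 109, 100]))  # -> 2
```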
  15.  前記第2の特徴量は、音声の強度、単位時間あたりのモーラ数、単語別の強度、音量または音声のスペクトルのうち少なくとも1つである、
     請求項1に記載の評価システム。
    The second feature amount is at least one of the intensity of the voice, the number of moras per unit time, the intensity of each word, the volume, or the spectrum of the voice.
    The evaluation system according to claim 1.
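The voice features listed in claim 15 can all be computed with standard signal processing. The sketch below, which is not taken from the publication, derives intensity (RMS), volume on a dB scale, a magnitude spectrum, and a mora rate from a mono waveform plus a transcript whose mora count is assumed to be known; word-by-word intensity would follow the same pattern per word segment.

```python
import numpy as np


def audio_second_features(samples: np.ndarray, sample_rate: int, mora_count: int) -> dict:
    """Compute illustrative versions of the claim-15 voice features.

    `samples` is a mono waveform in the range [-1, 1]; `mora_count` is assumed
    to come from a transcript (mora counting itself is not shown here).
    """
    duration_s = len(samples) / sample_rate
    rms = float(np.sqrt(np.mean(np.square(samples))))          # intensity
    volume_db = float(20.0 * np.log10(rms + 1e-12))            # volume on a dB scale
    spectrum = np.abs(np.fft.rfft(samples))                    # magnitude spectrum
    peak_hz = float(np.fft.rfftfreq(len(samples), 1.0 / sample_rate)[int(np.argmax(spectrum))])
    moras_per_second = mora_count / duration_s if duration_s else 0.0
    return {
        "rms_intensity": rms,
        "volume_db": volume_db,
        "spectrum_peak_hz": peak_hz,
        "moras_per_second": moras_per_second,
    }


# One second of a 220 Hz tone, with an assumed 7-mora utterance.
sr = 16_000
t = np.arange(sr) / sr
print(audio_second_features(0.5 * np.sin(2 * np.pi * 220 * t), sr, mora_count=7))
```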
  16.  前記第2の人物は、前記第1の人物と対人関係にあり、
     前記対人関係には、上司と部下、従業員と顧客、同僚同士または面接官と面接を受ける人のうち少なくとも1つが含まれることを特徴とする、
     請求項1に記載の評価システム。
    The second person has an interpersonal relationship with the first person, and
    the interpersonal relationship includes at least one of the following: a relationship between a boss and a subordinate, between an employee and a customer, between colleagues, or between an interviewer and an interviewee,
    The evaluation system according to claim 1.
  17.  前記算出アルゴリズムを記憶する算出アルゴリズム記憶部、をさらに備える、
     請求項1に記載の評価システム。
    further comprising a calculation algorithm storage unit that stores the calculation algorithm;
    The evaluation system according to claim 1.
  18.  第1の人物と第2の人物との会話に係る音声データと、前記第1の人物と前記第2の人物とを撮像した撮像データを取得し、前記撮像データに基づいて前記第1の人物と前記第2の人物とのそれぞれの視線もしくは顔の向きに係る第1の特徴量と、前記音声データの第2の特徴量と、を抽出する抽出部と、
     前記第1の特徴量と、前記第2の特徴量と、前記第1の人物の満足度を算出するための算出アルゴリズムと、に基づいて、前記第1の人物の満足度を算出する満足度算出部と、を備える、
     評価装置。
    an extraction unit that acquires audio data relating to a conversation between a first person and a second person and imaging data obtained by imaging the first person and the second person, and that extracts, based on the imaging data, a first feature amount relating to the line of sight or the face direction of each of the first person and the second person, and a second feature amount of the audio data; and
    a satisfaction level calculation unit that calculates a satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person,
    An evaluation device comprising the above units.
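Claim 18, the device claim, pairs an extraction unit with a satisfaction level calculation unit. The minimal sketch below mirrors that two-unit structure; the feature-extraction bodies are stubs, and the weighted-sum "calculation algorithm" is an assumption, since the actual algorithm is not disclosed in this excerpt.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ExtractedFeatures:
    gaze_features: Dict[str, float]   # first feature amount (line of sight / face direction)
    voice_features: Dict[str, float]  # second feature amount (audio)


class ExtractionUnit:
    """Stands in for the extraction unit of claim 18 (feature bodies are stubs)."""

    def extract(self, audio_samples: Sequence[float], frames: List[object]) -> ExtractedFeatures:
        # Real gaze/face-direction and voice analysis would go here; placeholder
        # values keep the sketch runnable.
        gaze = {"mutual_gaze_ratio": 0.6}
        voice = {"rms_intensity": 0.3, "moras_per_second": 6.5}
        return ExtractedFeatures(gaze_features=gaze, voice_features=voice)


class SatisfactionCalculationUnit:
    """Applies an assumed weighted-sum calculation algorithm to the features."""

    def __init__(self, weights: Dict[str, float]):
        self.weights = weights  # plays the role of the stored calculation algorithm

    def calculate(self, features: ExtractedFeatures) -> float:
        merged = {**features.gaze_features, **features.voice_features}
        return sum(self.weights.get(name, 0.0) * value for name, value in merged.items())


extraction_unit = ExtractionUnit()
calc_unit = SatisfactionCalculationUnit({"mutual_gaze_ratio": 50.0, "rms_intensity": 20.0, "moras_per_second": 2.0})
features = extraction_unit.extract(audio_samples=[], frames=[])
print(calc_unit.calculate(features))  # 50*0.6 + 20*0.3 + 2*6.5 = 49.0
```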
  19.  第1の人物と第2の人物との会話に係る音声データを取得し、
     前記第1の人物と前記第2の人物とを撮像し、
     撮像データに基づいて前記第1の人物と前記第2の人物とのそれぞれの視線もしくは顔の向きに係る第1の特徴量と、前記音声データの第2の特徴量と、を抽出し、
     前記第1の特徴量と、前記第2の特徴量と、前記第1の人物の満足度を算出するための算出アルゴリズムと、に基づいて、前記第1の人物の満足度を算出する、
     評価方法。
    acquiring audio data relating to a conversation between a first person and a second person;
    imaging the first person and the second person;
    extracting, based on the imaging data, a first feature amount relating to the line of sight or the face direction of each of the first person and the second person, and a second feature amount of the audio data; and
    calculating a satisfaction level of the first person based on the first feature amount, the second feature amount, and a calculation algorithm for calculating the satisfaction level of the first person,
    An evaluation method comprising the above steps.
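Claim 19 restates the same processing as an ordered method: acquire audio, image the participants, extract the feature amounts, then apply the calculation algorithm. The self-contained sketch below only shows the wiring of those four steps; every data source and stage passed in is a placeholder, not something specified by the publication.

```python
from typing import Callable, Dict, List, Sequence


def evaluate_first_person(acquire_audio: Callable[[], Sequence[float]],
                          capture_frames: Callable[[], List[object]],
                          extract: Callable[[Sequence[float], List[object]], Dict[str, float]],
                          calculate: Callable[[Dict[str, float]], float]) -> float:
    """Run the four steps of claim 19 in order (illustrative wiring only)."""
    audio_samples = acquire_audio()             # step 1: acquire audio of the conversation
    frames = capture_frames()                   # step 2: image the first and second person
    features = extract(audio_samples, frames)   # step 3: extract the feature amounts
    return calculate(features)                  # step 4: apply the calculation algorithm


# Placeholder stages standing in for a microphone, a camera, the extraction unit
# and the calculation algorithm.
score = evaluate_first_person(
    acquire_audio=lambda: [0.0] * 16_000,
    capture_frames=lambda: [],
    extract=lambda audio, frames: {"mutual_gaze_ratio": 0.6, "rms_intensity": 0.3},
    calculate=lambda f: 50.0 * f["mutual_gaze_ratio"] + 20.0 * f["rms_intensity"],
)
print(score)  # -> 36.0
```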
PCT/JP2023/018500 2022-07-04 2023-05-17 Evaluation system, evaluation device, and evaluation method WO2024009623A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022107706A JP2024006627A (en) 2022-07-04 2022-07-04 Evaluation system, evaluation device, and evaluation method
JP2022-107706 2022-07-04

Publications (1)

Publication Number Publication Date
WO2024009623A1 true WO2024009623A1 (en) 2024-01-11

Family

ID=89453003

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/018500 WO2024009623A1 (en) 2022-07-04 2023-05-17 Evaluation system, evaluation device, and evaluation method

Country Status (2)

Country Link
JP (1) JP2024006627A (en)
WO (1) WO2024009623A1 (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011210133A (en) * 2010-03-30 2011-10-20 Seiko Epson Corp Satisfaction degree calculation method, satisfaction degree calculation device and program
JP2011237957A (en) * 2010-05-10 2011-11-24 Seiko Epson Corp Satisfaction calculation device, satisfaction calculation method and program
JP2018041120A (en) * 2016-09-05 2018-03-15 富士通株式会社 Business assessment method, business assessment device and business assessment program
JP2018124604A (en) * 2017-01-30 2018-08-09 グローリー株式会社 Customer service support system, customer service support device and customer service support method
JP2020113197A (en) * 2019-01-16 2020-07-27 オムロン株式会社 Information processing apparatus, information processing method, and information processing program
JP2020160425A (en) * 2019-09-24 2020-10-01 株式会社博報堂Dyホールディングス Evaluation system, evaluation method, and computer program
JP2021072497A (en) * 2019-10-29 2021-05-06 株式会社Zenkigen Analysis device and program
WO2022064621A1 (en) * 2020-09-24 2022-03-31 株式会社I’mbesideyou Video meeting evaluation system and video meeting evaluation server
JP2022075662A (en) * 2020-10-27 2022-05-18 株式会社I’mbesideyou Information extraction apparatus
WO2022137547A1 (en) * 2020-12-25 2022-06-30 株式会社日立製作所 Communication assistance system


Also Published As

Publication number Publication date
JP2024006627A (en) 2024-01-17

Similar Documents

Publication Publication Date Title
US9674485B1 (en) System and method for image processing
JP2016149063A (en) Emotion estimation system and emotion estimation method
WO2015110880A1 (en) A wearable device, system and method for name recollection
JP2019058625A (en) Emotion reading device and emotion analysis method
WO2019137147A1 (en) Method for identifying identity in video conference and related apparatus
US20200058302A1 (en) Lip-language identification method and apparatus, and augmented reality device and storage medium
JP2016103081A (en) Conversation analysis device, conversation analysis system, conversation analysis method and conversation analysis program
JP2019144917A (en) Stay situation display system and stay situation display method
CN110569726A (en) interaction method and system for service robot
JP7153888B2 (en) Two-way video communication system and its operator management method
WO2024009623A1 (en) Evaluation system, evaluation device, and evaluation method
US11100944B2 (en) Information processing apparatus, information processing method, and program
JP6598227B1 (en) Cat-type conversation robot
JP7206741B2 (en) HEALTH CONDITION DETERMINATION SYSTEM, HEALTH CONDITION DETERMINATION DEVICE, SERVER, HEALTH CONDITION DETERMINATION METHOD, AND PROGRAM
JP5847646B2 (en) Television control apparatus, television control method, and television control program
US11935140B2 (en) Initiating communication between first and second users
JP6711621B2 (en) Robot, robot control method, and robot program
JP6550951B2 (en) Terminal, video conference system, and program
US10440183B1 (en) Cognitive routing of calls based on derived employee activity
KR101878155B1 (en) Method for controlling of mobile terminal
JP7253371B2 (en) Extraction program, extraction method and extraction device
EP3956748A1 (en) Headset signals to determine emotional states
WO2024038699A1 (en) Expression processing device, expression processing method, and expression processing program
EP4242943A1 (en) Information processing system, information processing method, and carrier means
JP7468689B2 (en) Analytical device, analytical method, and analytical program

Legal Events

Date Code Title Description
121 Ep: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23835160

Country of ref document: EP

Kind code of ref document: A1