WO2023032014A1 - Estimation method, estimation device, and estimation program - Google Patents

Estimation method, estimation device, and estimation program Download PDF

Info

Publication number
WO2023032014A1
Authority
WO
WIPO (PCT)
Prior art keywords
mind
data
state
estimation
input data
Prior art date
Application number
PCT/JP2021/031791
Other languages
French (fr)
Japanese (ja)
Inventor
佑樹 北岸
岳至 森
太一 浅見
歩相名 神山
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2023544819A priority Critical patent/JPWO2023032014A1/ja
Priority to PCT/JP2021/031791 priority patent/WO2023032014A1/en
Publication of WO2023032014A1 publication Critical patent/WO2023032014A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present invention relates to an estimation method, an estimation device, and an estimation program.
  • Estimation of the state of mind appearing in such nonverbal/paralinguistic information is generally defined as supervised learning in which, for inputs such as feature values extracted from speech or video, or the data itself, the model outputs, for example, the posterior probability of each label representing a defined state of mind (see Non-Patent Document 1).
  • the present invention has been made in view of the above, and aims to accurately estimate a label representing a state of mind appearing in nonverbal/paralinguistic information.
  • an estimation method according to the present invention is an estimation method executed by an estimation device, and includes: an acquisition step of acquiring learning data that includes nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information; a calculation step of calculating, using the respective feature values of input data and reference data in the acquired learning data, an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data; and an estimation step of estimating the state of mind of the input data using a result of comparing the embedded representation calculated from the input data with the embedded representation calculated from the reference data.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of an estimation device.
  • FIG. 2 is a diagram for explaining the processing of the estimation device.
  • FIG. 3 is a diagram illustrating a data configuration of learning data.
  • FIG. 4 is a diagram for explaining the processing of the calculator and the estimator.
  • FIG. 5 is a flowchart showing an estimation processing procedure.
  • FIG. 6 is a diagram for explaining the example.
  • FIG. 7 is a diagram illustrating a computer that executes an estimation program.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of an estimation device.
  • FIG. 2 is a diagram for explaining the processing of the estimation device.
  • the estimation device 10 of the present embodiment applies a neural network to a moving image showing the upper body of a subject, which is nonverbal/paralinguistic information, and estimates, in five levels, the degree of understanding as the state of mind appearing in the nonverbal/paralinguistic information. The degree of understanding is defined, for example, as 1: does not understand, 2: somewhat does not understand, 3: normal state, 4: somewhat understands, 5: understands, with larger numbers indicating better understanding.
  • the estimation device 10 of the present embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
  • the input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information such as processing start to the control unit 15 in response to input operations by the practitioner.
  • the output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like.
  • the communication control unit 13 is realized by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a server or a device for managing learning data via a network.
  • the storage unit 14 is implemented by semiconductor memory devices such as RAM (Random Access Memory) and flash memory, or storage devices such as hard disks and optical disks. Note that the storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, learning data 14a used for estimation processing, which will be described later, model parameters 14d generated and updated in the estimation processing, and the like.
  • the learning data 14a of this embodiment includes the input data 14b and the reference data 14c, which share the same data configuration.
  • FIG. 3 is a diagram illustrating a data configuration of learning data.
  • the learning data 14a includes at least video data showing the upper body of a subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, a personal ID identifying the subject, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data.
  • the learning data 14a may include labels representing attributes of a person such as age and gender.
  • the learning data 14a may be split into training, development, and evaluation sets, and data augmentation may be performed, as necessary.
  • preprocessing such as contrast normalization and face detection may be performed so that only certain regions of the video data are used.
  • the codec and other properties of the input data (video data) are not particularly limited.
  • for example, H.264 video data recorded by a web camera at 30 frames per second may be resized so that one side is 224 pixels.
  • Each of the X pieces of video data is assigned the personal ID of one of the S subjects and a correct label for the degree of understanding.
  • the input data and the reference data should not contain the same data.
  • the input data 14b and the reference data 14c may be generated by any combination of the learning data 14a so as to avoid mixing of the same data.
  • the control unit 15 is implemented using a CPU (Central Processing Unit), NP (Network Processor), FPGA (Field Programmable Gate Array), etc., and executes a processing program stored in memory. Thereby, the control unit 15 functions as an acquisition unit 15a, a calculation unit 15b, an estimation unit 15c, and a learning unit 15d, as illustrated in FIG. Note that these functional units may be implemented in different hardware. For example, the acquisition unit 15a may be implemented in hardware different from other functional units. Also, the control unit 15 may include other functional units.
  • the acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and correct labels representing states of mind appearing in the nonverbal information or paralinguistic information. Specifically, the acquisition unit 15a acquires, via the input unit 11 or, via the communication control unit 13, from a device that generates learning data, the learning data 14a including video data showing the upper body of the subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data.
  • the acquisition unit 15a causes the storage unit 14 to store learning data 14a acquired in advance prior to the following processing.
  • the acquiring unit 15a may transfer the acquired learning data 14a to the estimating unit 15c described below without storing it in the storage unit 14.
  • the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data.
  • processing using the neural network described below is not limited to this embodiment; for example, elements of well-known techniques such as batch normalization, dropout, and L1/L2 regularization may be added at arbitrary points.
  • the calculation unit 15b first extracts feature values from the input data 14b and the reference data 14c for the same subject. For example, the calculation unit 15b extracts, as feature values, the log mel-filterbank of the audio, MFCCs (mel-frequency cepstral coefficients), or per-frame HOG (Histogram of Oriented Gradients) and HOF (Histogram of Optical Flow) of the video.
  • the moving image itself may be used as the feature quantity.
  • the calculation unit 15b may perform preprocessing such as speech enhancement, noise removal, contrast normalization, cropping of the face-peripheral region, and feature value normalization as necessary.
  • the calculation unit 15b may also perform data augmentation, such as superimposing noise or reverberation, rotating the video, or adding noise, before extracting the feature values.
  • specifically, for the video data x_{1:T} of the input data 14b with frame length T and the video data y_{1:T}^(1,...,N) of the N pieces of reference data 14c, the calculation unit 15b crops only the face-peripheral region into a square and resizes it again so that one side is 224 pixels. The calculation unit 15b also normalizes each pixel value to the range 0.0 to 1.0. All of the video data of the N pieces of reference data 14c may be given the correct label of understanding level 3, or labels of understanding levels 1 to 5 may be mixed.
  • the input data 14b and the reference data 14c are selected so as not to contain the same moving image data.
  • for the video data of the reference data 14c, mixing in of the same video data may be avoided by preprocessing such as deleting or transforming the metadata.
  • the calculation unit 15b calculates embedded representations from the feature values of the input data 14b and the reference data 14c. For example, the calculation unit 15b calculates an embedded representation H at each time step using a 2D CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). Note that the calculation unit 15b may replace the 2D CNN with a 3D CNN, or replace the RNN with a Transformer.
  • the model parameters 14d may include parameters pretrained on any other task, or their initial values may be generated from arbitrary random numbers. When pretrained model parameters 14d are used, whether or not to update them may be decided arbitrarily.
  • the calculation unit 15b uses a 2D CNN and an RNN with a D-dimensional output to compute the embedded representation tensor H_x from the input video data x_{1:T}, as shown in equation (1).
  • here, θ is the CNN parameter set and φ is the RNN parameter set.
  • the calculation unit 15b also calculates an embedded representation tensor H_y from the reference video data y_{1:T}^(1,...,N), as shown in equation (2).
  • the calculation unit 15b compares the embedded representation calculated from the input video data x_{1:T} with the embedded representations calculated from the reference video data y_{1:T}^(1,...,N). Specifically, as shown in FIG. 4, the calculation unit 15b compares the embedded representation tensor H_x calculated from the input data 14b with the embedded representation tensors H_y calculated from the reference data 14c to obtain e^(1,...,N).
  • the calculation unit 15b makes a comparison using a source-target attention mechanism.
  • in this case, the calculation unit 15b calculates the comparison result vectors e^(1,...,N) as shown in equation (3).
  • in equation (3), the calculation unit 15b calculates attention weights from the queries Q_i^(1,...,N) and the keys K_i, applies them to the values V_i, and finally sums over the time direction.
  • here, d_1 is the number of attention heads, i indexes each attention head, and W_i^Q, W_i^K, and W_i^V are the weights for the query, key, and value in each attention head.
  • the calculation unit 15b is not limited to the source-target attention mechanism, and may perform the comparison using predetermined arithmetic operations, concatenation, or the like between the embedded representations.
  • next, as shown in FIG. 4, the calculation unit 15b compares the e^(1,...,N) with one another to calculate a comparison result vector v.
  • at that time, the calculation unit 15b may add arbitrary information, such as metadata of y_{1:T}^(1,...,N), to e^(1,...,N).
  • for example, using the multi-head self-attention mechanism, the calculation unit 15b concatenates e^(1,...,N) with the metadata m^(1,...,N) of y_{1:T}^(1,...,N), as shown in equation (4), and generates a tensor E_{1:T} that combines them. Here, the metadata m^(1,...,N) is a one-hot vector representation of the C-level (C = 5) understanding label.
  • the calculation unit 15b calculates v from E_{1:T} using the multi-head self-attention mechanism, as shown in equation (5).
  • here, d_2 is the number of attention heads, j indexes each attention head, and W_j^Q, W_j^K, and W_j^V are the weights for the query, key, and value in each attention head.
  • the estimating unit 15c estimates the state of mind of the input data 14b using the result of comparison between the embedded representation calculated from the input data 14b and the embedded representation calculated from the reference data 14c.
  • for example, the estimation unit 15c estimates the state of mind of the input data 14b using the result obtained by the calculation unit 15b comparing, with the multi-head self-attention mechanism, the embedded representation calculated from the input data 14b and the embedded representations calculated from the reference data 14c.
  • the estimation unit 15c estimates the state of mind of the input data 14b from the comparison result vector v calculated by the calculation unit 15b. At that time, the estimating unit 15c may calculate the posterior probability for each class as a classification problem of an arbitrary number of classes to estimate the state of mind. That is, the estimating unit 15c may estimate the state of mind by calculating the posterior probability for each class of the state of mind classification. Alternatively, the estimation unit 15c may estimate a numerical value representing the state of mind as a regression problem.
  • for example, as shown in equation (6), the estimating unit 15c uses two fully connected layers to calculate the posterior probability p(C | x_{1:T}, y_{1:T}^(1,...,N)) for each of the five levels of understanding. Here, W_1^FC and W_2^FC are the weights of the two fully connected layers, D_FC is the number of output dimensions of the first fully connected layer, and C is the number of predicted labels (C = 5 in this embodiment). A ReLU function is used as the activation function of the first fully connected layer.
  • the learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn model parameters 14d of a model that estimates the state of mind appearing in input nonverbal information or paralinguistic information.
  • the learning unit 15d updates the model parameter set ⁇ and acquires the learned model parameter set ⁇ '.
  • the learning unit 15d can apply well-known loss functions and update methods.
  • the model parameter set Ω may include parameters pretrained on any other task, initial values may be generated from arbitrary random numbers, and some model parameters may be left without being updated.
  • the learning unit 15d uses stochastic gradient descent (SGD) to update the model parameter set Ω, with the cross entropy L shown in equation (7) as the loss function.
  • m_x is the correct-answer distribution of the input video data x_{1:T}.
  • the method of expressing the correct answer distribution is not particularly limited, and may be expressed as a one-hot vector, for example.
  • the correct distribution may be represented by approximating a normal distribution centered on the correct class.
  • the learning unit 15d causes the storage unit 14 to store the acquired learned model parameter set ⁇ ' as the model parameter 14d.
  • in this case, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c.
  • FIG. 5 is a flowchart showing an estimation processing procedure.
  • the flowchart of FIG. 5 is started, for example, when an input instructing the start of the estimation process is received.
  • the acquisition unit 15a acquires the learning data 14a including nonverbal information or paralinguistic information and correct labels representing states of mind appearing in the nonverbal information or paralinguistic information (step S1).
  • next, the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data (step S2).
  • the calculation unit 15b also compares the embedding expression calculated from the input data 14b and the embedding expression calculated from the reference data 14c (step S3).
  • the estimating unit 15c estimates the state of mind of the input data 14b using the result of comparison between the embedded expression calculated from the input data 14b and the embedded expression calculated from the reference data 14c (step S4). This completes a series of estimation processes.
  • as described above, in the estimation device 10 of the present embodiment, the acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and correct labels representing states of mind appearing in the nonverbal information or paralinguistic information.
  • the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data 14b and an embedded representation of the state of mind of the reference data 14c.
  • the estimation unit 15c estimates the state of mind of the input data 14b by using the result of comparison between the embedded representation calculated from the input data 14b and the embedded representation calculated from the reference data 14c.
  • the estimation unit 15c compares the embedding expression calculated from the input data 14b and the embedding expression calculated from the reference data 14c using a multi-head self-attention mechanism.
  • the estimating unit 15c also calculates the posterior probability for each class of the state of mind classification to estimate the state of mind.
  • the estimation unit 15c estimates a numerical value representing the state of mind as a regression problem.
  • the estimation device 10 estimates the state of mind using a plurality of correct labels other than the normal state registered in advance as reference information. As a result, even if the variance of the labels is so large that individual differences cannot be normalized or absorbed, it is possible to accurately estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information.
  • the learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn the model parameters 14d of the model that estimates the state of mind appearing in input nonverbal information or paralinguistic information.
  • in this case, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c. As a result, it becomes possible to estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information with even higher accuracy.
  • FIG. 6 is a diagram for explaining the example.
  • FIG. 6 shows the accuracy of each of three methods, including the present invention, when estimating the five levels of understanding for unknown video data of the same person.
  • here, three methods were applied: a general method that does not use reference data (none), a method that absorbs individual differences using reference data of the normal state only (understanding level 3 only), and a method applying the present invention (understanding levels 2, 3, 4).
  • with the none method, the degree of understanding was estimated by applying a self-attention mechanism and a fully connected layer to the embedded representation H_x without using reference data. With the understanding level 3 only method, only reference data of understanding level 3 was used (N = 3), and with the understanding levels 2, 3, 4 method, reference data of understanding levels 2 to 4 was used (N = 3).
  • the estimating device 10 can be implemented by installing an estimating program that executes the above estimating process as package software or online software on a desired computer.
  • the information processing device can function as the estimation device 10 by causing the information processing device to execute the above estimation program.
  • information processing devices include mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the functions of the estimation device 10 may be implemented in a cloud server.
  • FIG. 7 is a diagram showing an example of a computer that executes an estimation program.
  • Computer 1000 includes, for example, memory 1010, CPU 1020, hard disk drive interface 1030, disk drive interface 1040, serial port interface 1050, video adapter 1060, and network interface 1070. These units are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1031.
  • Disk drive interface 1040 is connected to disk drive 1041.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example.
  • a mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example.
  • a display 1061 is connected to the video adapter 1060.
  • the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or the memory 1010, for example.
  • the estimation program is stored in the hard disk drive 1031 as a program module 1093 in which instructions to be executed by the computer 1000 are written, for example.
  • the hard disk drive 1031 stores a program module 1093 that describes each process executed by the estimation device 10 described in the above embodiment.
  • data used for information processing by the estimation program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.
  • program module 1093 and program data 1094 related to the estimation program are not limited to being stored in the hard disk drive 1031.
  • for example, they may be stored in a removable storage medium and read out by the CPU 1020 via the disk drive 1041 or the like.
  • alternatively, the program module 1093 and program data 1094 related to the estimation program may be stored in another computer connected via a network such as a LAN (Local Area Network) or WAN (Wide Area Network) and read out by the CPU 1020 via the network interface 1070.
  • Reference signs: 10 estimation device; 11 input unit; 12 output unit; 13 communication control unit; 14 storage unit; 14a learning data; 14b input data; 14c reference data; 14d model parameters; 15 control unit; 15a acquisition unit; 15b calculation unit; 15c estimation unit; 15d learning unit

Abstract

An acquisition unit (15a) acquires learning data that includes nonverbal information or paralinguistic information and correct labels representing the states of mind indicated in the nonverbal information or paralinguistic information. A calculation unit (15b) uses a feature quantity of each of input data (14b) and reference data (14c) in the acquired learning data (14a) to calculate an embedded representation of the state of mind indicated by the input data (14b) and an embedded representation of the state of mind indicated by the reference data (14c). An estimation unit (15c) estimates the state of mind indicated by the input data (14b), using the result of a comparison between the embedded representation calculated from the input data (14b) and the embedded representation calculated from the reference data (14c).

Description

Estimation method, estimation device, and estimation program
The present invention relates to an estimation method, an estimation device, and an estimation program.
Conventionally, research and development has been conducted on technology for automatically estimating the state of mind that appears in nonverbal/paralinguistic information such as a person's voice, face, and gestures. For example, such technology is expected to be used to reflect the state of mind of a dialogue partner when an agent or robot generates its responses, to utilize estimation results as part of mental health care, and to quantify the state of mind of participants in web conferences and the like so that it can be grasped more easily.
Estimation of the state of mind appearing in such nonverbal/paralinguistic information is generally defined as supervised learning in which, for inputs such as feature values extracted from speech or video, or the data itself, the model outputs, for example, the posterior probability of each label representing a defined state of mind (see Non-Patent Document 1).
It is said that there are individual differences in the state of mind and its expression. To address this, a technique of collecting data from many people and letting a machine learning model absorb the individual differences is generally used. A technique of normalizing or absorbing individual differences using pre-registered reference speech or video is also known (see Non-Patent Documents 2 and 3).
However, with the conventional techniques, it has been difficult to accurately estimate a label representing the state of mind appearing in nonverbal/paralinguistic information. For example, as the variance of the corresponding labels increases, whether through general emotion recognition (neutral, joy, sadness, surprise, fear, hatred, anger, contempt, and so on) or through finer gradations of a specific index such as the degree of understanding, it can become difficult to sufficiently absorb or normalize individual differences using the normal state alone. In addition, when retraining a model, it is difficult to secure a sufficient amount of data to stably adapt the model parameters to the person being evaluated and to carry out the training reliably.
The present invention has been made in view of the above, and aims to accurately estimate a label representing the state of mind appearing in nonverbal/paralinguistic information.
In order to solve the above problems and achieve the object, an estimation method according to the present invention is an estimation method executed by an estimation device, and includes: an acquisition step of acquiring learning data that includes nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information; a calculation step of calculating, using the respective feature values of input data and reference data in the acquired learning data, an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data; and an estimation step of estimating the state of mind of the input data using a result of comparing the embedded representation calculated from the input data with the embedded representation calculated from the reference data.
According to the present invention, it is possible to accurately estimate a label representing the state of mind appearing in nonverbal/paralinguistic information.
FIG. 1 is a schematic diagram illustrating a schematic configuration of an estimation device. FIG. 2 is a diagram for explaining the processing of the estimation device. FIG. 3 is a diagram illustrating a data configuration of learning data. FIG. 4 is a diagram for explaining the processing of the calculation unit and the estimation unit. FIG. 5 is a flowchart showing an estimation processing procedure. FIG. 6 is a diagram for explaining the example. FIG. 7 is a diagram illustrating a computer that executes an estimation program.
An embodiment of the present invention will be described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals.
[Configuration of the estimation device]
FIG. 1 is a schematic diagram illustrating a schematic configuration of the estimation device, and FIG. 2 is a diagram for explaining the processing of the estimation device. The estimation device 10 of the present embodiment applies a neural network to a moving image showing the upper body of a subject, which is nonverbal/paralinguistic information, and estimates, in five levels, the degree of understanding as the state of mind appearing in the nonverbal/paralinguistic information. The degree of understanding is defined, for example, as 1: does not understand, 2: somewhat does not understand, 3: normal state, 4: somewhat understands, 5: understands, with larger numbers indicating better understanding.
First, as illustrated in FIG. 1, the estimation device 10 of the present embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
The input unit 11 is realized using input devices such as a keyboard and a mouse, and inputs various kinds of instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator. The output unit 12 is realized by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is realized by a NIC (Network Interface Card) or the like, and controls communication, via a network, between the control unit 15 and external devices such as a server or a device that manages learning data.
The storage unit 14 is realized by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, the learning data 14a used in the estimation processing described later and the model parameters 14d generated and updated in that processing.
Here, as shown in FIG. 1, the learning data 14a of this embodiment includes the input data 14b and the reference data 14c, which share the same data configuration. FIG. 3 is a diagram illustrating the data configuration of the learning data. As shown in FIG. 3, the learning data 14a includes at least video data showing the upper body of a subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, a personal ID identifying the subject, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data. The learning data 14a may also include labels representing attributes of the person, such as age and gender. In addition, the learning data 14a may be split into training, development, and evaluation sets, and data augmentation may be performed, as necessary.
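To make the data configuration concrete, the following is a minimal Python sketch of how one record of such learning data might be represented; the class and field names are hypothetical choices for illustration and are not taken from the publication.

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class LearningSample:
    """One record of the learning data 14a (hypothetical representation)."""
    data_id: str          # data ID identifying this piece of video data
    person_id: str        # personal ID identifying the subject
    frames: np.ndarray    # video frames, e.g. shape (T, H, W, 3), RGB
    understanding: int    # correct label: degree of understanding, 1 to 5
    attributes: Optional[dict] = field(default=None)  # optional, e.g. {"age": 30, "gender": "F"}


# Example: a 10-frame dummy clip labeled with understanding level 3 (normal state).
sample = LearningSample(
    data_id="vid_0001",
    person_id="subj_01",
    frames=np.zeros((10, 224, 224, 3), dtype=np.uint8),
    understanding=3,
)
```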
Note that preprocessing such as contrast normalization and face detection may be performed so that only certain regions of the video data are used. The codec and other properties of the input data (video data) are not particularly limited.
Specifically, when the degree of understanding is estimated from video data in the estimation processing described later, H.264 video data recorded by a web camera at 30 frames per second may, for example, be resized so that one side is 224 pixels. Each of the X pieces of video data is assigned the personal ID of one of the S subjects and a correct label for the degree of understanding.
The input data and the reference data simply need to be chosen so that the same data is not mixed between them. In the estimation processing described later, the input data 14b and the reference data 14c may be generated from arbitrary combinations of the learning data 14a so as to avoid mixing in the same data.
The control unit 15 is realized using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. The control unit 15 thereby functions as an acquisition unit 15a, a calculation unit 15b, an estimation unit 15c, and a learning unit 15d, as illustrated in FIG. 1. These functional units may each be implemented on different hardware; for example, the acquisition unit 15a may be implemented on hardware different from that of the other functional units. The control unit 15 may also include other functional units.
The acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and a correct label representing the state of mind appearing in the nonverbal information or paralinguistic information. Specifically, the acquisition unit 15a acquires, via the input unit 11 or, via the communication control unit 13, from a device that generates learning data, the learning data 14a including video data showing the upper body of a subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data.
The acquisition unit 15a stores the learning data 14a acquired in advance, prior to the following processing, in the storage unit 14. Alternatively, the acquisition unit 15a may transfer the acquired learning data 14a to the estimation unit 15c described below without storing it in the storage unit 14.
Returning to FIG. 1, the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data.
The processing using the neural network described below is not limited to this embodiment; for example, elements of well-known techniques such as batch normalization, dropout, and L1/L2 regularization may be added at arbitrary points.
Specifically, the calculation unit 15b first extracts feature values from the input data 14b and the reference data 14c for the same subject. For example, the calculation unit 15b extracts, as feature values, the log mel-filterbank of the audio, MFCCs (mel-frequency cepstral coefficients), or per-frame HOG (Histogram of Oriented Gradients) and HOF (Histogram of Optical Flow) of the video. The video itself may also be used as the feature value.
The calculation unit 15b may perform preprocessing such as speech enhancement, noise removal, contrast normalization, cropping of the face-peripheral region, and feature value normalization as necessary. The calculation unit 15b may also perform data augmentation, such as superimposing noise or reverberation, rotating the video, or adding noise, before extracting the feature values.
Specifically, for the video data x_{1:T} of the input data 14b with frame length T and the video data y_{1:T}^(1,...,N) of the N pieces of reference data 14c, the calculation unit 15b crops only the face-peripheral region into a square and resizes it again so that one side is 224 pixels. The calculation unit 15b also normalizes each pixel value to the range 0.0 to 1.0. All of the video data of the N pieces of reference data 14c may be given the correct label of understanding level 3, or labels of understanding levels 1 to 5 may be mixed.
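As an illustration of the per-frame preprocessing described above (cropping the face-peripheral region to a square, resizing so that one side is 224 pixels, and normalizing pixel values to 0.0-1.0), here is a minimal sketch assuming OpenCV and NumPy; the face bounding box is taken as a given input, since the publication does not prescribe a specific face detector.

```python
import cv2
import numpy as np


def preprocess_frame(frame, face_box):
    """Crop a square around the face region and return a 224x224 float32 frame in [0, 1].

    frame: HxWx3 uint8 image; face_box: (x, y, w, h) from any face detector.
    """
    x, y, w, h = face_box
    side = max(w, h)                          # expand the box to a square
    cx, cy = x + w // 2, y + h // 2
    x0, y0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
    crop = frame[y0:y0 + side, x0:x0 + side]
    resized = cv2.resize(crop, (224, 224))    # resize to 224 x 224
    return resized.astype(np.float32) / 255.0 # normalize pixel values to [0, 1]


def preprocess_clip(frames, face_boxes):
    """Apply preprocess_frame to every frame, giving an array of shape (T, 224, 224, 3)."""
    return np.stack([preprocess_frame(f, b) for f, b in zip(frames, face_boxes)])
```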
It is assumed that the input data 14b and the reference data 14c are selected so that the same video data is not mixed between them. For the video data of the reference data 14c, mixing in of the same video data may be avoided by preprocessing such as deleting or transforming the metadata.
Next, the calculation unit 15b calculates embedded representations from the feature values of the input data 14b and the reference data 14c. For example, the calculation unit 15b calculates an embedded representation H at each time step using a 2D CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). The calculation unit 15b may replace the 2D CNN with a 3D CNN, or replace the RNN with a Transformer.
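As an illustration of the kind of frame-wise 2D CNN followed by an RNN that the calculation unit 15b could use, the following PyTorch sketch encodes a clip of shape (T, 3, 224, 224) into a sequence of D-dimensional embeddings; the architectural details (channel counts, a GRU as the RNN, D = 256) are assumptions made for this example and are not prescribed by the publication.

```python
import torch
import torch.nn as nn


class ClipEncoder(nn.Module):
    """Frame-wise 2D CNN followed by an RNN (GRU), giving per-time-step embeddings H."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(  # CNN with parameter set theta
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.GRU(64, embed_dim, batch_first=True)  # RNN with parameter set phi

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(clip).flatten(1)    # (T, 3, 224, 224) -> (T, 64) per-frame features
        h, _ = self.rnn(feats.unsqueeze(0))  # treat the clip as one sequence of length T
        return h.squeeze(0)                  # H: (T, D)


# Example: encode an input clip x and one reference clip y with shared parameters.
encoder = ClipEncoder()
H_x = encoder(torch.rand(30, 3, 224, 224))  # (30, 256)
H_y = encoder(torch.rand(30, 3, 224, 224))  # (30, 256)
```

The same encoder (that is, the same θ and φ) would be applied to both the input data and each of the N reference clips, so that the resulting embeddings can be compared in the next step.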
The model parameters 14d may include parameters pretrained on any other task, or their initial values may be generated from arbitrary random numbers. When pretrained model parameters 14d are used, whether or not to update them may be decided arbitrarily.
Specifically, as shown in FIG. 4, the calculation unit 15b uses a 2D CNN and an RNN with a D-dimensional output to compute the embedded representation tensor H_x from the input video data x_{1:T}, as shown in equation (1) below. Here, θ is the CNN parameter set and φ is the RNN parameter set.
[Equation (1) appears as an image in the original publication.]
The calculation unit 15b also calculates an embedded representation tensor H_y from the reference video data y_{1:T}^(1,...,N), as shown in equation (2) below.
[Equation (2) appears as an image in the original publication.]
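Based on the description above, equations (1) and (2) plausibly take the following form, in which the CNN with parameter set θ is applied frame by frame and the RNN with parameter set φ produces a D-dimensional embedding per time step; this is a reconstruction from the surrounding text, not a reproduction of the original images.

```latex
H_x = \mathrm{RNN}_{\phi}\bigl(\mathrm{CNN}_{\theta}(x_{1:T})\bigr) \in \mathbb{R}^{T \times D}
\qquad \text{(1)}

H_y^{(n)} = \mathrm{RNN}_{\phi}\bigl(\mathrm{CNN}_{\theta}(y_{1:T}^{(n)})\bigr) \in \mathbb{R}^{T \times D},
\qquad n = 1, \dots, N
\qquad \text{(2)}
```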
Next, the calculation unit 15b compares the embedded representation calculated from the input video data x_{1:T} with the embedded representations calculated from the reference video data y_{1:T}^(1,...,N). Specifically, as shown in FIG. 4, the calculation unit 15b compares the embedded representation tensor H_x calculated from the input data 14b with the embedded representation tensors H_y calculated from the reference data 14c to obtain e^(1,...,N).
For example, the calculation unit 15b performs the comparison using a source-target attention mechanism. In this case, the calculation unit 15b calculates the comparison result vectors e^(1,...,N) as shown in equation (3) below.
[Equation (3) appears as an image in the original publication.]
In equation (3), the calculation unit 15b calculates attention weights from the queries Q_i^(1,...,N) and the keys K_i, applies them to the values V_i, and finally sums over the time direction.
Here, d_1 is the number of attention heads, i indexes each attention head, and W_i^Q, W_i^K, and W_i^V are the weights for the query, key, and value in each attention head.
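One plausible instantiation of the source-target attention comparison in equation (3), written as scaled dot-product attention: here the queries are assumed to be derived from the reference embeddings H_y^(n) and the keys and values from the input embedding H_x (the original image may assign these roles differently), and the head outputs are summed over the time direction as stated above.

```latex
Q_i^{(n)} = H_y^{(n)} W_i^{Q}, \qquad K_i = H_x W_i^{K}, \qquad V_i = H_x W_i^{V}, \qquad i = 1, \dots, d_1

e^{(n)} = \sum_{t=1}^{T} \Bigl[ \operatorname{Concat}_{i=1}^{d_1}
\Bigl( \operatorname{softmax}\Bigl( \tfrac{Q_i^{(n)} K_i^{\top}}{\sqrt{d_k}} \Bigr) V_i \Bigr) \Bigr]_{t}
\qquad \text{(3)}
```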
Note that the calculation unit 15b is not limited to the source-target attention mechanism, and may perform the comparison using predetermined arithmetic operations, concatenation, or the like between the embedded representations.
Next, as shown in FIG. 4, the calculation unit 15b compares the e^(1,...,N) with one another to calculate a comparison result vector v. At that time, the calculation unit 15b may add arbitrary information, such as metadata of y_{1:T}^(1,...,N), to e^(1,...,N).
For example, using the multi-head self-attention mechanism, the calculation unit 15b concatenates e^(1,...,N) with the metadata m^(1,...,N) of y_{1:T}^(1,...,N), as shown in equation (4) below, and generates a tensor E_{1:T} that combines them. Here, the metadata m^(1,...,N) is a one-hot vector representation of the C-level (C = 5) understanding label.
[Equation (4) appears as an image in the original publication.]
The calculation unit 15b then calculates v from E_{1:T} using the multi-head self-attention mechanism, as shown in equation (5) below.
[Equation (5) appears as an image in the original publication.]
Here, d_2 is the number of attention heads, j indexes each attention head, and W_j^Q, W_j^K, and W_j^V are the weights for the query, key, and value in each attention head.
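A hedged reading of equations (4) and (5): each comparison vector e^(n) is concatenated with the one-hot understanding label m^(n) of the corresponding reference clip, the N results are stacked into E, and multi-head self-attention over E (with weights W_j^Q, W_j^K, W_j^V for j = 1, ..., d_2) is pooled into the single comparison result vector v. The exact stacking and pooling used in the original images may differ.

```latex
E = \bigl[\, e^{(1)} \,\Vert\, m^{(1)} \; ; \; \dots \; ; \; e^{(N)} \,\Vert\, m^{(N)} \,\bigr]
\qquad \text{(4)}

v = \operatorname{Pool}\Bigl( \operatorname{Concat}_{j=1}^{d_2}
\Bigl( \operatorname{softmax}\Bigl( \tfrac{(E W_j^{Q})(E W_j^{K})^{\top}}{\sqrt{d_k}} \Bigr) E W_j^{V} \Bigr) \Bigr)
\qquad \text{(5)}
```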
Returning to FIG. 1, the estimation unit 15c estimates the state of mind of the input data 14b using the result of comparing the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c.
For example, as described above, the estimation unit 15c estimates the state of mind of the input data 14b using the result obtained by the calculation unit 15b comparing, with the multi-head self-attention mechanism, the embedded representation calculated from the input data 14b and the embedded representations calculated from the reference data 14c.
Specifically, as shown in FIG. 4, the estimation unit 15c estimates the state of mind of the input data 14b from the comparison result vector v calculated by the calculation unit 15b. At that time, the estimation unit 15c may treat this as a classification problem with an arbitrary number of classes and calculate the posterior probability for each class to estimate the state of mind. That is, the estimation unit 15c may estimate the state of mind by calculating the posterior probability for each class of the state-of-mind classification. Alternatively, the estimation unit 15c may estimate a numerical value representing the state of mind as a regression problem.
For example, as shown in equation (6) below, the estimation unit 15c uses two fully connected layers to calculate the posterior probability p(C | x_{1:T}, y_{1:T}^(1,...,N)) for each of the five levels of understanding.
[Equation (6) appears as an image in the original publication.]
Here, W_1^FC and W_2^FC are the weights of the two fully connected layers, D_FC is the number of output dimensions of the first fully connected layer, and C is the number of predicted labels (C = 5 in this embodiment). A ReLU function is used as the activation function of the first fully connected layer.
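Given the description of two fully connected layers with a ReLU after the first, equation (6) plausibly has the following softmax form; this is an inferred reconstruction rather than the original image.

```latex
p\bigl(C \mid x_{1:T},\, y_{1:T}^{(1,\dots,N)}\bigr)
= \operatorname{softmax}\bigl( W_2^{FC}\, \operatorname{ReLU}( W_1^{FC} v ) \bigr),
\qquad W_1^{FC} \in \mathbb{R}^{D_{FC} \times \dim(v)}, \quad W_2^{FC} \in \mathbb{R}^{C \times D_{FC}}
\qquad \text{(6)}
```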
Returning to FIG. 1, the learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn the model parameters 14d of a model that estimates the state of mind appearing in input nonverbal information or paralinguistic information.
Specifically, the learning unit 15d updates the model parameter set Ω and obtains a learned model parameter set Ω'. The learning unit 15d can apply well-known loss functions and update methods. For example, the model parameter set Ω may include parameters pretrained on any other task, initial values may be generated from arbitrary random numbers, and some model parameters may be left without being updated.
For example, the learning unit 15d uses stochastic gradient descent (SGD) to update the model parameter set Ω, with the cross entropy L shown in equation (7) below as the loss function.
[Equation (7) appears as an image in the original publication.]
Here, m_x is the correct-answer distribution of the input video data x_{1:T}. The method of expressing the correct-answer distribution is not particularly limited; for example, it may be expressed as a one-hot vector. Alternatively, the correct-answer distribution may be expressed by approximating a normal distribution centered on the correct class.
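The cross-entropy loss in equation (7) presumably takes the standard form below, where m_x(c) is the correct-answer distribution over the C classes for the input clip x_{1:T}; this, too, is a reconstruction from the surrounding text rather than the original image.

```latex
L = - \sum_{c=1}^{C} m_x(c)\,
\log p\bigl(c \mid x_{1:T},\, y_{1:T}^{(1,\dots,N)}\bigr)
\qquad \text{(7)}
```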
The learning unit 15d stores the obtained learned model parameter set Ω' in the storage unit 14 as the model parameters 14d.
In this case, as described above, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c.
[Estimation processing]
Next, the estimation processing by the estimation device 10 will be described. FIG. 5 is a flowchart showing the estimation processing procedure. The flowchart of FIG. 5 is started, for example, at the timing when an input instructing the start of the estimation processing is received.
First, the acquisition unit 15a acquires the learning data 14a including nonverbal information or paralinguistic information and a correct label representing the state of mind appearing in the nonverbal information or paralinguistic information (step S1).
Next, the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data (step S2).
The calculation unit 15b also compares the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c (step S3).
Then, the estimation unit 15c estimates the state of mind of the input data 14b using the result of comparing the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c (step S4). This completes the series of estimation processes.
[Effects]
As described above, in the estimation device 10 of the present embodiment, the acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and a correct label representing the state of mind appearing in the nonverbal information or paralinguistic information. The calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data 14b and an embedded representation of the state of mind of the reference data 14c. The estimation unit 15c estimates the state of mind of the input data 14b using the result of comparing the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c.
Specifically, the estimation unit 15c compares the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c using the multi-head self-attention mechanism. The estimation unit 15c also estimates the state of mind by calculating the posterior probability for each class of the state-of-mind classification. Alternatively, the estimation unit 15c estimates a numerical value representing the state of mind as a regression problem.
In this way, the estimation device 10 estimates the state of mind using, as reference information, pre-registered data with a plurality of correct labels other than the normal state. As a result, even when the variance of the labels is too large for individual differences to be fully normalized or absorbed, it is possible to accurately estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information.
In addition, since retraining of the model is not required, it is unnecessary to collect an appropriate, well-balanced amount of data for each class of the classification or to monitor the training, and the processing can therefore be performed with few resources.
The learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn the model parameters 14d of the model that estimates the state of mind appearing in input nonverbal information or paralinguistic information. In this case, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c. This makes it possible to estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information with even higher accuracy.
[Example]
FIG. 6 is a diagram for explaining the embodiment. FIG. 6 shows the accuracy of each of three methods, including the method of the present invention, when estimating five levels of understanding from unknown video data of the same person. Three methods were applied here: a general method that does not use reference data ("none"), a method that absorbs individual differences using reference data of the normal state only ("level 3 only"), and a method to which the present invention is applied ("levels 2, 3, 4"). In "none", the level of understanding was estimated by applying a self-attention mechanism and a fully connected layer to the embedded representation Hx without using reference data. In "level 3 only", the level of understanding was estimated using only reference data with an understanding level of 3 (N=3). In "levels 2, 3, 4", the level of understanding was estimated using reference data with understanding levels of 2 to 4 (N=3).
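The three conditions differ only in which reference data are supplied to the estimation device. A hypothetical configuration might look like the following, where the clip names and data structures are placeholders introduced for explanation only.

```python
# Illustrative only: hypothetical reference clips per understanding level.
reference_pool = {
    2: ["u2_a.mp4", "u2_b.mp4", "u2_c.mp4"],
    3: ["u3_a.mp4", "u3_b.mp4", "u3_c.mp4"],  # level 3 corresponds to the normal state
    4: ["u4_a.mp4", "u4_b.mp4", "u4_c.mp4"],
}

conditions = {
    "none": [],                                                   # baseline without reference data
    "level 3 only": reference_pool[3],                            # N=3, normal state only
    "levels 2, 3, 4": [reference_pool[k][0] for k in (2, 3, 4)],  # N=3, one clip per state
}
```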
As shown in FIG. 6, it was confirmed that, compared with the general method that does not use reference data ("none") and the method that uses reference data of the normal state only ("level 3 only"), the method of the present invention, which uses reference data of a plurality of states ("levels 2, 3, 4"), stably improves the accuracy of estimating the level of understanding appearing in the video data.
[program]
A program in which the processing executed by the estimation device 10 according to the above embodiment is described in a computer-executable language can also be created. In one embodiment, the estimation device 10 can be implemented by installing an estimation program that executes the above estimation processing on a desired computer as packaged software or online software. For example, an information processing device can be caused to function as the estimation device 10 by causing the information processing device to execute the above estimation program. Such information processing devices include smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) terminals, and slate terminals such as PDAs (Personal Digital Assistants). The functions of the estimation device 10 may also be implemented in a cloud server.
FIG. 7 is a diagram showing an example of a computer that executes the estimation program. The computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. A mouse 1051 and a keyboard 1052, for example, are connected to the serial port interface 1050. A display 1061, for example, is connected to the video adapter 1060.
Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
The estimation program is stored in the hard disk drive 1031 as, for example, a program module 1093 in which instructions to be executed by the computer 1000 are written. Specifically, a program module 1093 describing each process executed by the estimation device 10 described in the above embodiment is stored in the hard disk drive 1031.
Data used for information processing by the estimation program is stored as program data 1094 in, for example, the hard disk drive 1031. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.
Note that the program module 1093 and the program data 1094 related to the estimation program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 related to the estimation program may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.
Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the descriptions and drawings that form part of the disclosure of the present invention based on this embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment are all included within the scope of the present invention.
10 estimation device
11 input unit
12 output unit
13 communication control unit
14 storage unit
14a learning data
14b input data
14c reference data
14d model parameter
15 control unit
15a acquisition unit
15b calculation unit
15c estimation unit
15d learning unit

Claims (7)

  1.  An estimation method executed by an estimation device, the method comprising:
     an acquisition step of acquiring learning data including nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information;
     a calculation step of calculating, using respective feature amounts of input data and reference data in the acquired learning data, an embedded representation of a state of mind of the input data and an embedded representation of a state of mind of the reference data; and
     an estimation step of estimating the state of mind of the input data using a result of comparison between the embedded representation calculated from the input data and the embedded representation calculated from the reference data.
  2.  The estimation method according to claim 1, wherein the estimation step compares the embedded representation calculated from the input data with the embedded representation calculated from the reference data using a multi-head self-attention mechanism.
  3.  The estimation method according to claim 1, wherein the estimation step estimates the state of mind by calculating a posterior probability for each class of a classification of the state of mind.
  4.  The estimation method according to claim 1, wherein the estimation step estimates a numerical value representing the state of mind as a regression problem.
  5.  The estimation method according to claim 1, further comprising a learning step of learning, using the input data and the estimated state of mind of the input data, model parameters of a model that estimates a state of mind appearing in input nonverbal information or paralinguistic information,
     wherein the calculation step calculates the embedded representation of the state of mind of the input data and the embedded representation of the state of mind of the reference data using the learned model parameters.
  6.  An estimation device comprising:
     an acquisition unit that acquires learning data including nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information;
     a calculation unit that calculates, using respective feature amounts of input data and reference data in the acquired learning data, an embedded representation of a state of mind of the input data and an embedded representation of a state of mind of the reference data; and
     an estimation unit that estimates the state of mind of the input data using a result of comparison between the embedded representation calculated from the input data and the embedded representation calculated from the reference data.
  7.  An estimation program for causing a computer to execute:
     an acquisition step of acquiring learning data including nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information;
     a calculation step of calculating, using respective feature amounts of input data and reference data in the acquired learning data, an embedded representation of a state of mind of the input data and an embedded representation of a state of mind of the reference data; and
     an estimation step of estimating the state of mind of the input data using a result of comparison between the embedded representation calculated from the input data and the embedded representation calculated from the reference data.
PCT/JP2021/031791 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program WO2023032014A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023544819A JPWO2023032014A1 (en) 2021-08-30 2021-08-30
PCT/JP2021/031791 WO2023032014A1 (en) 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/031791 WO2023032014A1 (en) 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program

Publications (1)

Publication Number Publication Date
WO2023032014A1 true WO2023032014A1 (en) 2023-03-09

Family

ID=85412275

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/031791 WO2023032014A1 (en) 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program

Country Status (2)

Country Link
JP (1) JPWO2023032014A1 (en)
WO (1) WO2023032014A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114057A1 (en) * 2016-10-21 2018-04-26 Samsung Electronics Co., Ltd. Method and apparatus for recognizing facial expression
WO2021166207A1 (en) * 2020-02-21 2021-08-26 日本電信電話株式会社 Recognition device, learning device, method for same, and program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114057A1 (en) * 2016-10-21 2018-04-26 Samsung Electronics Co., Ltd. Method and apparatus for recognizing facial expression
WO2021166207A1 (en) * 2020-02-21 2021-08-26 日本電信電話株式会社 Recognition device, learning device, method for same, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIMURA ATSUSHI, HAGIWARA MASAFUMI: "Understanding Presumption System from Facial Images", DENKI GAKKAI RONBUNSHI. C, EREKUTORONIKUSU, JOHO KOGAKU, SHISUTEMU, THE INSTITUTE OF ELECTRICAL ENGINEERS OF JAPAN, 1 February 2000 (2000-02-01), pages 273 - 278, XP093043173, Retrieved from the Internet <URL:https://www.jstage.jst.go.jp/article/ieejeiss1987/120/2/120_2_273/_pdf/-char/ja> [retrieved on 20230501], DOI: 10.1541/ieejeiss1987.120.2_273 *

Also Published As

Publication number Publication date
JPWO2023032014A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
US11562145B2 (en) Text classification method, computer device, and storage medium
JP7306062B2 (en) Knowledge transfer method, information processing device and storage medium
WO2022007823A1 (en) Text data processing method and device
WO2020007129A1 (en) Context acquisition method and device based on voice interaction
CN111814620A (en) Face image quality evaluation model establishing method, optimization method, medium and device
KR20190081243A (en) Method and apparatus of recognizing facial expression based on normalized expressiveness and learning method of recognizing facial expression
WO2020098083A1 (en) Call separation method and apparatus, computer device and storage medium
WO2021051497A1 (en) Pulmonary tuberculosis determination method and apparatus, computer device, and storage medium
WO2020073533A1 (en) Automatic question answering method and device
Salathé et al. Focus group on artificial intelligence for health
WO2020252903A1 (en) Au detection method and apparatus, electronic device, and storage medium
WO2020244151A1 (en) Image processing method and apparatus, terminal, and storage medium
US20240152770A1 (en) Neural network search method and related device
CN112884326A (en) Video interview evaluation method and device based on multi-modal analysis and storage medium
US20230107505A1 (en) Classifying out-of-distribution data using a contrastive loss
CN114817612A (en) Method and related device for calculating multi-modal data matching degree and training calculation model
CN112037904B (en) Online diagnosis and treatment data processing method and device, computer equipment and storage medium
WO2023032014A1 (en) Estimation method, estimation device, and estimation program
CN113327212B (en) Face driving method, face driving model training device, electronic equipment and storage medium
WO2023032016A1 (en) Estimation method, estimation device, and estimation program
JP6992725B2 (en) Para-language information estimation device, para-language information estimation method, and program
JP5931021B2 (en) Personal recognition tendency model learning device, personal recognition state estimation device, personal recognition tendency model learning method, personal recognition state estimation method, and program
CN113515935A (en) Title generation method, device, terminal and medium
US11875785B2 (en) Establishing user persona in a conversational system
WO2023119672A1 (en) Inference method, inference device, and inference program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21955909

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023544819

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE