WO2023032014A1 - Estimation method, estimation device, and estimation program - Google Patents

Estimation method, estimation device, and estimation program Download PDF

Info

Publication number
WO2023032014A1
Authority
WO
WIPO (PCT)
Prior art keywords
mind
data
state
estimation
input data
Prior art date
Application number
PCT/JP2021/031791
Other languages
French (fr)
Japanese (ja)
Inventor
佑樹 北岸
岳至 森
太一 浅見
歩相名 神山
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to JP2023544819A priority Critical patent/JPWO2023032014A1/ja
Priority to PCT/JP2021/031791 priority patent/WO2023032014A1/en
Publication of WO2023032014A1 publication Critical patent/WO2023032014A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present invention relates to an estimation method, an estimation device, and an estimation program.
  • Estimation of the state of mind appearing in such nonverbal/paralinguistic information is generally defined as supervised learning in which, for inputs such as feature values extracted from speech or video, or the data itself, the model outputs, for example, the posterior probability of each label representing a defined state of mind (see Non-Patent Document 1).
  • the present invention has been made in view of the above, and aims to accurately estimate a label representing a state of mind appearing in nonverbal/paralinguistic information.
  • an estimation method according to the present invention is an estimation method executed by an estimation device, and includes: an acquisition step of acquiring learning data that includes nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information; a calculation step of calculating, using the respective feature values of input data and reference data in the acquired learning data, an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data; and an estimation step of estimating the state of mind of the input data using a result of comparing the embedded representation calculated from the input data with the embedded representation calculated from the reference data.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of an estimation device.
  • FIG. 2 is a diagram for explaining the processing of the estimation device.
  • FIG. 3 is a diagram illustrating a data configuration of learning data.
  • FIG. 4 is a diagram for explaining the processing of the calculator and the estimator.
  • FIG. 5 is a flowchart showing an estimation processing procedure.
  • FIG. 6 is a diagram for explaining the example.
  • FIG. 7 is a diagram illustrating a computer that executes an estimation program.
  • FIG. 1 is a schematic diagram illustrating a schematic configuration of an estimation device.
  • FIG. 2 is a diagram for explaining the processing of the estimation device.
  • the estimation device 10 of the present embodiment applies a neural network to a moving image showing the upper body of a subject, which is nonverbal/paralinguistic information, and estimates, in five levels, the degree of understanding as the state of mind appearing in the nonverbal/paralinguistic information. The degree of understanding is defined, for example, as 1: does not understand, 2: somewhat does not understand, 3: normal state, 4: somewhat understands, 5: understands, with larger numbers indicating better understanding.
  • the estimation device 10 of the present embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
  • the input unit 11 is implemented using input devices such as a keyboard and a mouse, and inputs various instruction information such as processing start to the control unit 15 in response to input operations by the practitioner.
  • the output unit 12 is implemented by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like.
  • the communication control unit 13 is realized by a NIC (Network Interface Card) or the like, and controls communication between the control unit 15 and an external device such as a server or a device for managing learning data via a network.
  • the storage unit 14 is implemented by semiconductor memory devices such as RAM (Random Access Memory) and flash memory, or storage devices such as hard disks and optical disks. Note that the storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, learning data 14a used for estimation processing, which will be described later, model parameters 14d generated and updated in the estimation processing, and the like.
  • the learning data 14a of this embodiment includes the input data 14b and the reference data 14c, which share the same data configuration.
  • FIG. 3 is a diagram illustrating a data configuration of learning data.
  • the learning data 14a includes at least video data showing the upper body of a subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, a personal ID identifying the subject, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data.
  • the learning data 14a may include labels representing attributes of a person such as age and gender.
  • the learning data 14a may be split into training, development, and evaluation sets, and data augmentation may be performed, as necessary.
  • preprocessing such as contrast normalization and face detection may be performed so that only certain regions of the video data are used.
  • the codec and other properties of the input data (video data) are not particularly limited.
  • for example, H.264 video data recorded by a web camera at 30 frames per second may be resized so that one side is 224 pixels.
  • Each of the X pieces of video data is assigned the personal ID of one of the S subjects and a correct label for the degree of understanding.
  • the input data and the reference data should not contain the same data.
  • the input data 14b and the reference data 14c may be generated by any combination of the learning data 14a so as to avoid mixing of the same data.
  • the control unit 15 is implemented using a CPU (Central Processing Unit), NP (Network Processor), FPGA (Field Programmable Gate Array), etc., and executes a processing program stored in memory. Thereby, the control unit 15 functions as an acquisition unit 15a, a calculation unit 15b, an estimation unit 15c, and a learning unit 15d, as illustrated in FIG. Note that these functional units may be implemented in different hardware. For example, the acquisition unit 15a may be implemented in hardware different from other functional units. Also, the control unit 15 may include other functional units.
  • the acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and correct labels representing states of mind appearing in the nonverbal information or paralinguistic information. Specifically, the acquisition unit 15a acquires, via the input unit 11 or, via the communication control unit 13, from a device that generates learning data, the learning data 14a including video data showing the upper body of the subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data.
  • the acquisition unit 15a causes the storage unit 14 to store learning data 14a acquired in advance prior to the following processing.
  • the acquiring unit 15a may transfer the acquired learning data 14a to the estimating unit 15c described below without storing it in the storage unit 14.
  • the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data.
  • processing using the neural network described below is not limited to this embodiment; for example, elements of well-known techniques such as batch normalization, dropout, and L1/L2 regularization may be added at arbitrary points.
  • the calculation unit 15b first extracts feature values from the input data 14b and the reference data 14c for the same subject. For example, the calculation unit 15b extracts, as feature values, the log mel-filterbank of the audio, MFCCs (mel-frequency cepstral coefficients), or per-frame HOG (Histogram of Oriented Gradients) and HOF (Histogram of Optical Flow) of the video.
  • the moving image itself may be used as the feature quantity.
  • the calculation unit 15b may perform preprocessing such as speech enhancement, noise removal, contrast normalization, cropping of the face-peripheral region, and feature value normalization as necessary.
  • the calculation unit 15b may also perform data augmentation, such as superimposing noise or reverberation, rotating the video, or adding noise, before extracting the feature values.
  • specifically, for the video data x_{1:T} of the input data 14b with frame length T and the video data y_{1:T}^(1,...,N) of the N pieces of reference data 14c, the calculation unit 15b crops only the face-peripheral region into a square and resizes it again so that one side is 224 pixels. The calculation unit 15b also normalizes each pixel value to the range 0.0 to 1.0. All of the video data of the N pieces of reference data 14c may be given the correct label of understanding level 3, or labels of understanding levels 1 to 5 may be mixed.
  • the input data 14b and the reference data 14c are selected so as not to contain the same moving image data.
  • for the video data of the reference data 14c, mixing in of the same video data may be avoided by preprocessing such as deleting or transforming the metadata.
  • the calculation unit 15b calculates embedded representations from the feature values of the input data 14b and the reference data 14c. For example, the calculation unit 15b calculates an embedded representation H at each time step using a 2D CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). Note that the calculation unit 15b may replace the 2D CNN with a 3D CNN, or replace the RNN with a Transformer.
  • the model parameters 14d may include parameters pretrained on any other task, or their initial values may be generated from arbitrary random numbers. When pretrained model parameters 14d are used, whether or not to update them may be decided arbitrarily.
  • the calculation unit 15b uses a 2D CNN and an RNN with a D-dimensional output to compute the embedded representation tensor H_x from the input video data x_{1:T}, as shown in equation (1).
  • here, θ is the CNN parameter set and φ is the RNN parameter set.
  • the calculation unit 15b also calculates an embedded representation tensor H_y from the reference video data y_{1:T}^(1,...,N), as shown in equation (2).
  • the calculation unit 15b compares the embedded representation calculated from the input video data x_{1:T} with the embedded representations calculated from the reference video data y_{1:T}^(1,...,N). Specifically, as shown in FIG. 4, the calculation unit 15b compares the embedded representation tensor H_x calculated from the input data 14b with the embedded representation tensors H_y calculated from the reference data 14c to obtain e^(1,...,N).
  • the calculation unit 15b makes a comparison using a source-target attention mechanism.
  • in this case, the calculation unit 15b calculates the comparison result vectors e^(1,...,N) as shown in equation (3).
  • in equation (3), the calculation unit 15b calculates attention weights from the queries Q_i^(1,...,N) and the keys K_i, applies them to the values V_i, and finally sums over the time direction.
  • here, d_1 is the number of attention heads, i indexes each attention head, and W_i^Q, W_i^K, and W_i^V are the weights for the query, key, and value in each attention head.
  • the calculation unit 15b is not limited to the source-target attention mechanism, and may perform the comparison using predetermined arithmetic operations, concatenation, or the like between the embedded representations.
  • next, as shown in FIG. 4, the calculation unit 15b compares the e^(1,...,N) with one another to calculate a comparison result vector v.
  • at that time, the calculation unit 15b may add arbitrary information, such as metadata of y_{1:T}^(1,...,N), to e^(1,...,N).
  • for example, using the multi-head self-attention mechanism, the calculation unit 15b concatenates e^(1,...,N) with the metadata m^(1,...,N) of y_{1:T}^(1,...,N), as shown in equation (4), and generates a tensor E_{1:T} that combines them. Here, the metadata m^(1,...,N) is a one-hot vector representation of the C-level (C = 5) understanding label.
  • the calculation unit 15b calculates v from E_{1:T} using the multi-head self-attention mechanism, as shown in equation (5).
  • here, d_2 is the number of attention heads, j indexes each attention head, and W_j^Q, W_j^K, and W_j^V are the weights for the query, key, and value in each attention head.
  • the estimating unit 15c estimates the state of mind of the input data 14b using the result of comparison between the embedded representation calculated from the input data 14b and the embedded representation calculated from the reference data 14c.
  • for example, the estimation unit 15c estimates the state of mind of the input data 14b using the result obtained by the calculation unit 15b comparing, with the multi-head self-attention mechanism, the embedded representation calculated from the input data 14b and the embedded representations calculated from the reference data 14c.
  • the estimation unit 15c estimates the state of mind of the input data 14b from the comparison result vector v calculated by the calculation unit 15b. At that time, the estimating unit 15c may calculate the posterior probability for each class as a classification problem of an arbitrary number of classes to estimate the state of mind. That is, the estimating unit 15c may estimate the state of mind by calculating the posterior probability for each class of the state of mind classification. Alternatively, the estimation unit 15c may estimate a numerical value representing the state of mind as a regression problem.
  • for example, as shown in equation (6), the estimating unit 15c uses two fully connected layers to calculate the posterior probability p(C | x_{1:T}, y_{1:T}^(1,...,N)) for each of the five levels of understanding. Here, W_1^FC and W_2^FC are the weights of the two fully connected layers, D_FC is the number of output dimensions of the first fully connected layer, and C is the number of predicted labels (C = 5 in this embodiment). A ReLU function is used as the activation function of the first fully connected layer.
  • the learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn model parameters 14d of a model that estimates the state of mind appearing in input nonverbal information or paralinguistic information.
  • the learning unit 15d updates the model parameter set ⁇ and acquires the learned model parameter set ⁇ '.
  • the learning unit 15d can apply well-known loss functions and update methods.
  • the model parameter set Ω may include parameters pretrained on any other task, initial values may be generated from arbitrary random numbers, and some model parameters may be left without being updated.
  • the learning unit 15d uses stochastic gradient descent (SGD) to update the model parameter set Ω, with the cross entropy L shown in equation (7) as the loss function.
  • m_x is the correct-answer distribution of the input video data x_{1:T}.
  • the method of expressing the correct answer distribution is not particularly limited, and may be expressed as a one-hot vector, for example.
  • the correct distribution may be represented by approximating a normal distribution centered on the correct class.
  • the learning unit 15d causes the storage unit 14 to store the acquired learned model parameter set ⁇ ' as the model parameter 14d.
  • in this case, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c.
  • FIG. 5 is a flowchart showing an estimation processing procedure.
  • the flowchart of FIG. 5 is started, for example, when an input instructing the start of the estimation process is received.
  • the acquisition unit 15a acquires the learning data 14a including nonverbal information or paralinguistic information and correct labels representing states of mind appearing in the nonverbal information or paralinguistic information (step S1).
  • next, the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data (step S2).
  • the calculation unit 15b also compares the embedding expression calculated from the input data 14b and the embedding expression calculated from the reference data 14c (step S3).
  • the estimating unit 15c estimates the state of mind of the input data 14b using the result of comparison between the embedded expression calculated from the input data 14b and the embedded expression calculated from the reference data 14c (step S4). This completes a series of estimation processes.
  • as described above, in the estimation device 10 of the present embodiment, the acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and correct labels representing states of mind appearing in the nonverbal information or paralinguistic information.
  • the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data 14b and an embedded representation of the state of mind of the reference data 14c.
  • the estimation unit 15c estimates the state of mind of the input data 14b by using the result of comparison between the embedded representation calculated from the input data 14b and the embedded representation calculated from the reference data 14c.
  • the estimation unit 15c compares the embedding expression calculated from the input data 14b and the embedding expression calculated from the reference data 14c using a multi-head self-attention mechanism.
  • the estimating unit 15c also calculates the posterior probability for each class of the state of mind classification to estimate the state of mind.
  • the estimation unit 15c estimates a numerical value representing the state of mind as a regression problem.
  • the estimation device 10 estimates the state of mind using a plurality of correct labels other than the normal state registered in advance as reference information. As a result, even if the variance of the labels is so large that individual differences cannot be normalized or absorbed, it is possible to accurately estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information.
  • the learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn the model parameters 14d of the model that estimates the state of mind appearing in input nonverbal information or paralinguistic information.
  • in this case, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c. As a result, it becomes possible to estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information with even higher accuracy.
  • FIG. 6 is a diagram for explaining the example.
  • FIG. 6 shows the accuracy of each of three methods, including the present invention, when estimating the five levels of understanding for unknown video data of the same person.
  • here, three methods were applied: a general method that does not use reference data (none), a method that absorbs individual differences using reference data of the normal state only (understanding level 3 only), and a method applying the present invention (understanding levels 2, 3, 4).
  • with the none method, the degree of understanding was estimated by applying a self-attention mechanism and a fully connected layer to the embedded representation H_x without using reference data. With the understanding level 3 only method, only reference data of understanding level 3 was used (N = 3), and with the understanding levels 2, 3, 4 method, reference data of understanding levels 2 to 4 was used (N = 3).
  • the estimating device 10 can be implemented by installing an estimating program that executes the above estimating process as package software or online software on a desired computer.
  • the information processing device can function as the estimation device 10 by causing the information processing device to execute the above estimation program.
  • information processing devices include mobile communication terminals such as smartphones, mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the functions of the estimation device 10 may be implemented in a cloud server.
  • FIG. 7 is a diagram showing an example of a computer that executes an estimation program.
  • Computer 1000 includes, for example, memory 1010, CPU 1020, hard disk drive interface 1030, disk drive interface 1040, serial port interface 1050, video adapter 1060, and network interface 1070. These units are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1031.
  • Disk drive interface 1040 is connected to disk drive 1041.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example.
  • a mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example.
  • a display 1061 is connected to the video adapter 1060.
  • the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or the memory 1010, for example.
  • the estimation program is stored in the hard disk drive 1031 as a program module 1093 in which instructions to be executed by the computer 1000 are written, for example.
  • the hard disk drive 1031 stores a program module 1093 that describes each process executed by the estimation device 10 described in the above embodiment.
  • data used for information processing by the estimation program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.
  • program module 1093 and program data 1094 related to the estimation program are not limited to being stored in the hard disk drive 1031.
  • for example, they may be stored in a removable storage medium and read out by the CPU 1020 via the disk drive 1041 or the like.
  • alternatively, the program module 1093 and program data 1094 related to the estimation program may be stored in another computer connected via a network such as a LAN (Local Area Network) or WAN (Wide Area Network) and read out by the CPU 1020 via the network interface 1070.
  • Reference signs: 10 estimation device; 11 input unit; 12 output unit; 13 communication control unit; 14 storage unit; 14a learning data; 14b input data; 14c reference data; 14d model parameters; 15 control unit; 15a acquisition unit; 15b calculation unit; 15c estimation unit; 15d learning unit

Abstract

An acquisition unit (15a) acquires learning data that includes nonverbal information or paralinguistic information and correct labels representing the states of mind indicated in the nonverbal information or paralinguistic information. A calculation unit (15b) uses a feature quantity of each of input data (14b) and reference data (14c) in the acquired learning data (14a) to calculate an embedded representation of the state of mind indicated by the input data (14b) and an embedded representation of the state of mind indicated by the reference data (14c). An estimation unit (15c) estimates the state of mind indicated by the input data (14b), using the result of a comparison between the embedded representation calculated from the input data (14b) and the embedded representation calculated from the reference data (14c).

Description

Estimation method, estimation device, and estimation program
The present invention relates to an estimation method, an estimation device, and an estimation program.
Conventionally, research and development has been conducted on technology for automatically estimating the state of mind that appears in nonverbal/paralinguistic information such as a person's voice, face, and gestures. For example, such technology is expected to be used to reflect the state of mind of a dialogue partner when an agent or robot generates its responses, to utilize estimation results as part of mental health care, and to quantify the state of mind of participants in web conferences and the like so that it can be grasped more easily.
Estimation of the state of mind appearing in such nonverbal/paralinguistic information is generally defined as supervised learning in which, for inputs such as feature values extracted from speech or video, or the data itself, the model outputs, for example, the posterior probability of each label representing a defined state of mind (see Non-Patent Document 1).
It is said that there are individual differences in the state of mind and its expression. To address this, a technique of collecting data from many people and letting a machine learning model absorb the individual differences is generally used. A technique of normalizing or absorbing individual differences using pre-registered reference speech or video is also known (see Non-Patent Documents 2 and 3).
However, with the conventional techniques, it has been difficult to accurately estimate a label representing the state of mind appearing in nonverbal/paralinguistic information. For example, as the variance of the corresponding labels increases, whether through general emotion recognition (neutral, joy, sadness, surprise, fear, hatred, anger, contempt, and so on) or through finer gradations of a specific index such as the degree of understanding, it can become difficult to sufficiently absorb or normalize individual differences using the normal state alone. In addition, when retraining a model, it is difficult to secure a sufficient amount of data to stably adapt the model parameters to the person being evaluated and to carry out the training reliably.
The present invention has been made in view of the above, and aims to accurately estimate a label representing the state of mind appearing in nonverbal/paralinguistic information.
In order to solve the above problems and achieve the object, an estimation method according to the present invention is an estimation method executed by an estimation device, and includes: an acquisition step of acquiring learning data that includes nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information; a calculation step of calculating, using the respective feature values of input data and reference data in the acquired learning data, an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data; and an estimation step of estimating the state of mind of the input data using a result of comparing the embedded representation calculated from the input data with the embedded representation calculated from the reference data.
According to the present invention, it is possible to accurately estimate a label representing the state of mind appearing in nonverbal/paralinguistic information.
FIG. 1 is a schematic diagram illustrating a schematic configuration of an estimation device. FIG. 2 is a diagram for explaining the processing of the estimation device. FIG. 3 is a diagram illustrating a data configuration of learning data. FIG. 4 is a diagram for explaining the processing of the calculation unit and the estimation unit. FIG. 5 is a flowchart showing an estimation processing procedure. FIG. 6 is a diagram for explaining the example. FIG. 7 is a diagram illustrating a computer that executes an estimation program.
An embodiment of the present invention will be described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals.
[Configuration of the estimation device]
FIG. 1 is a schematic diagram illustrating a schematic configuration of the estimation device, and FIG. 2 is a diagram for explaining the processing of the estimation device. The estimation device 10 of the present embodiment applies a neural network to a moving image showing the upper body of a subject, which is nonverbal/paralinguistic information, and estimates, in five levels, the degree of understanding as the state of mind appearing in the nonverbal/paralinguistic information. The degree of understanding is defined, for example, as 1: does not understand, 2: somewhat does not understand, 3: normal state, 4: somewhat understands, 5: understands, with larger numbers indicating better understanding.
First, as illustrated in FIG. 1, the estimation device 10 of the present embodiment is realized by a general-purpose computer such as a personal computer, and includes an input unit 11, an output unit 12, a communication control unit 13, a storage unit 14, and a control unit 15.
The input unit 11 is realized using input devices such as a keyboard and a mouse, and inputs various kinds of instruction information, such as an instruction to start processing, to the control unit 15 in response to input operations by the operator. The output unit 12 is realized by a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like. The communication control unit 13 is realized by a NIC (Network Interface Card) or the like, and controls communication, via a network, between the control unit 15 and external devices such as a server or a device that manages learning data.
The storage unit 14 is realized by a semiconductor memory device such as a RAM (Random Access Memory) or flash memory, or by a storage device such as a hard disk or an optical disk. The storage unit 14 may be configured to communicate with the control unit 15 via the communication control unit 13. In the present embodiment, the storage unit 14 stores, for example, the learning data 14a used in the estimation processing described later and the model parameters 14d generated and updated in that processing.
Here, as shown in FIG. 1, the learning data 14a of this embodiment includes the input data 14b and the reference data 14c, which share the same data configuration. FIG. 3 is a diagram illustrating the data configuration of the learning data. As shown in FIG. 3, the learning data 14a includes at least video data showing the upper body of a subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, a personal ID identifying the subject, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data. The learning data 14a may also include labels representing attributes of the person, such as age and gender. In addition, the learning data 14a may be split into training, development, and evaluation sets, and data augmentation may be performed, as necessary.
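To make the data configuration concrete, the following is a minimal Python sketch of how one record of such learning data might be represented; the class and field names are hypothetical choices for illustration and are not taken from the publication.

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class LearningSample:
    """One record of the learning data 14a (hypothetical representation)."""
    data_id: str          # data ID identifying this piece of video data
    person_id: str        # personal ID identifying the subject
    frames: np.ndarray    # video frames, e.g. shape (T, H, W, 3), RGB
    understanding: int    # correct label: degree of understanding, 1 to 5
    attributes: Optional[dict] = field(default=None)  # optional, e.g. {"age": 30, "gender": "F"}


# Example: a 10-frame dummy clip labeled with understanding level 3 (normal state).
sample = LearningSample(
    data_id="vid_0001",
    person_id="subj_01",
    frames=np.zeros((10, 224, 224, 3), dtype=np.uint8),
    understanding=3,
)
```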
Note that preprocessing such as contrast normalization and face detection may be performed so that only certain regions of the video data are used. The codec and other properties of the input data (video data) are not particularly limited.
Specifically, when the degree of understanding is estimated from video data in the estimation processing described later, H.264 video data recorded by a web camera at 30 frames per second may, for example, be resized so that one side is 224 pixels. Each of the X pieces of video data is assigned the personal ID of one of the S subjects and a correct label for the degree of understanding.
The input data and the reference data simply need to be chosen so that the same data is not mixed between them. In the estimation processing described later, the input data 14b and the reference data 14c may be generated from arbitrary combinations of the learning data 14a so as to avoid mixing in the same data.
The control unit 15 is realized using a CPU (Central Processing Unit), an NP (Network Processor), an FPGA (Field Programmable Gate Array), or the like, and executes a processing program stored in memory. The control unit 15 thereby functions as an acquisition unit 15a, a calculation unit 15b, an estimation unit 15c, and a learning unit 15d, as illustrated in FIG. 1. These functional units may each be implemented on different hardware; for example, the acquisition unit 15a may be implemented on hardware different from that of the other functional units. The control unit 15 may also include other functional units.
The acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and a correct label representing the state of mind appearing in the nonverbal information or paralinguistic information. Specifically, the acquisition unit 15a acquires, via the input unit 11 or, via the communication control unit 13, from a device that generates learning data, the learning data 14a including video data showing the upper body of a subject as nonverbal/paralinguistic information, a data ID identifying each piece of video data, and a correct label representing the state of mind, such as the degree of understanding, appearing in each piece of video data.
The acquisition unit 15a stores the learning data 14a acquired in advance, prior to the following processing, in the storage unit 14. Alternatively, the acquisition unit 15a may transfer the acquired learning data 14a to the estimation unit 15c described below without storing it in the storage unit 14.
Returning to FIG. 1, the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data.
The processing using the neural network described below is not limited to this embodiment; for example, elements of well-known techniques such as batch normalization, dropout, and L1/L2 regularization may be added at arbitrary points.
Specifically, the calculation unit 15b first extracts feature values from the input data 14b and the reference data 14c for the same subject. For example, the calculation unit 15b extracts, as feature values, the log mel-filterbank of the audio, MFCCs (mel-frequency cepstral coefficients), or per-frame HOG (Histogram of Oriented Gradients) and HOF (Histogram of Optical Flow) of the video. The video itself may also be used as the feature value.
The calculation unit 15b may perform preprocessing such as speech enhancement, noise removal, contrast normalization, cropping of the face-peripheral region, and feature value normalization as necessary. The calculation unit 15b may also perform data augmentation, such as superimposing noise or reverberation, rotating the video, or adding noise, before extracting the feature values.
Specifically, for the video data x_{1:T} of the input data 14b with frame length T and the video data y_{1:T}^(1,...,N) of the N pieces of reference data 14c, the calculation unit 15b crops only the face-peripheral region into a square and resizes it again so that one side is 224 pixels. The calculation unit 15b also normalizes each pixel value to the range 0.0 to 1.0. All of the video data of the N pieces of reference data 14c may be given the correct label of understanding level 3, or labels of understanding levels 1 to 5 may be mixed.
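As an illustration of the per-frame preprocessing described above (cropping the face-peripheral region to a square, resizing so that one side is 224 pixels, and normalizing pixel values to 0.0-1.0), here is a minimal sketch assuming OpenCV and NumPy; the face bounding box is taken as a given input, since the publication does not prescribe a specific face detector.

```python
import cv2
import numpy as np


def preprocess_frame(frame, face_box):
    """Crop a square around the face region and return a 224x224 float32 frame in [0, 1].

    frame: HxWx3 uint8 image; face_box: (x, y, w, h) from any face detector.
    """
    x, y, w, h = face_box
    side = max(w, h)                          # expand the box to a square
    cx, cy = x + w // 2, y + h // 2
    x0, y0 = max(cx - side // 2, 0), max(cy - side // 2, 0)
    crop = frame[y0:y0 + side, x0:x0 + side]
    resized = cv2.resize(crop, (224, 224))    # resize to 224 x 224
    return resized.astype(np.float32) / 255.0 # normalize pixel values to [0, 1]


def preprocess_clip(frames, face_boxes):
    """Apply preprocess_frame to every frame, giving an array of shape (T, 224, 224, 3)."""
    return np.stack([preprocess_frame(f, b) for f, b in zip(frames, face_boxes)])
```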
It is assumed that the input data 14b and the reference data 14c are selected so that the same video data is not mixed between them. For the video data of the reference data 14c, mixing in of the same video data may be avoided by preprocessing such as deleting or transforming the metadata.
Next, the calculation unit 15b calculates embedded representations from the feature values of the input data 14b and the reference data 14c. For example, the calculation unit 15b calculates an embedded representation H at each time step using a 2D CNN (Convolutional Neural Network) or an RNN (Recurrent Neural Network). The calculation unit 15b may replace the 2D CNN with a 3D CNN, or replace the RNN with a Transformer.
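As an illustration of the kind of frame-wise 2D CNN followed by an RNN that the calculation unit 15b could use, the following PyTorch sketch encodes a clip of shape (T, 3, 224, 224) into a sequence of D-dimensional embeddings; the architectural details (channel counts, a GRU as the RNN, D = 256) are assumptions made for this example and are not prescribed by the publication.

```python
import torch
import torch.nn as nn


class ClipEncoder(nn.Module):
    """Frame-wise 2D CNN followed by an RNN (GRU), giving per-time-step embeddings H."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(  # CNN with parameter set theta
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.rnn = nn.GRU(64, embed_dim, batch_first=True)  # RNN with parameter set phi

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        feats = self.cnn(clip).flatten(1)    # (T, 3, 224, 224) -> (T, 64) per-frame features
        h, _ = self.rnn(feats.unsqueeze(0))  # treat the clip as one sequence of length T
        return h.squeeze(0)                  # H: (T, D)


# Example: encode an input clip x and one reference clip y with shared parameters.
encoder = ClipEncoder()
H_x = encoder(torch.rand(30, 3, 224, 224))  # (30, 256)
H_y = encoder(torch.rand(30, 3, 224, 224))  # (30, 256)
```

The same encoder (that is, the same θ and φ) would be applied to both the input data and each of the N reference clips, so that the resulting embeddings can be compared in the next step.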
The model parameters 14d may include parameters pretrained on any other task, or their initial values may be generated from arbitrary random numbers. When pretrained model parameters 14d are used, whether or not to update them may be decided arbitrarily.
Specifically, as shown in FIG. 4, the calculation unit 15b uses a 2D CNN and an RNN with a D-dimensional output to compute the embedded representation tensor H_x from the input video data x_{1:T}, as shown in equation (1) below. Here, θ is the CNN parameter set and φ is the RNN parameter set.
[Equation (1) appears as an image in the original publication.]
The calculation unit 15b also calculates an embedded representation tensor H_y from the reference video data y_{1:T}^(1,...,N), as shown in equation (2) below.
[Equation (2) appears as an image in the original publication.]
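Based on the description above, equations (1) and (2) plausibly take the following form, in which the CNN with parameter set θ is applied frame by frame and the RNN with parameter set φ produces a D-dimensional embedding per time step; this is a reconstruction from the surrounding text, not a reproduction of the original images.

```latex
H_x = \mathrm{RNN}_{\phi}\bigl(\mathrm{CNN}_{\theta}(x_{1:T})\bigr) \in \mathbb{R}^{T \times D}
\qquad \text{(1)}

H_y^{(n)} = \mathrm{RNN}_{\phi}\bigl(\mathrm{CNN}_{\theta}(y_{1:T}^{(n)})\bigr) \in \mathbb{R}^{T \times D},
\qquad n = 1, \dots, N
\qquad \text{(2)}
```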
Next, the calculation unit 15b compares the embedded representation calculated from the input video data x_{1:T} with the embedded representations calculated from the reference video data y_{1:T}^(1,...,N). Specifically, as shown in FIG. 4, the calculation unit 15b compares the embedded representation tensor H_x calculated from the input data 14b with the embedded representation tensors H_y calculated from the reference data 14c to obtain e^(1,...,N).
For example, the calculation unit 15b performs the comparison using a source-target attention mechanism. In this case, the calculation unit 15b calculates the comparison result vectors e^(1,...,N) as shown in equation (3) below.
[Equation (3) appears as an image in the original publication.]
In equation (3), the calculation unit 15b calculates attention weights from the queries Q_i^(1,...,N) and the keys K_i, applies them to the values V_i, and finally sums over the time direction.
Here, d_1 is the number of attention heads, i indexes each attention head, and W_i^Q, W_i^K, and W_i^V are the weights for the query, key, and value in each attention head.
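One plausible instantiation of the source-target attention comparison in equation (3), written as scaled dot-product attention: here the queries are assumed to be derived from the reference embeddings H_y^(n) and the keys and values from the input embedding H_x (the original image may assign these roles differently), and the head outputs are summed over the time direction as stated above.

```latex
Q_i^{(n)} = H_y^{(n)} W_i^{Q}, \qquad K_i = H_x W_i^{K}, \qquad V_i = H_x W_i^{V}, \qquad i = 1, \dots, d_1

e^{(n)} = \sum_{t=1}^{T} \Bigl[ \operatorname{Concat}_{i=1}^{d_1}
\Bigl( \operatorname{softmax}\Bigl( \tfrac{Q_i^{(n)} K_i^{\top}}{\sqrt{d_k}} \Bigr) V_i \Bigr) \Bigr]_{t}
\qquad \text{(3)}
```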
Note that the calculation unit 15b is not limited to the source-target attention mechanism, and may perform the comparison using predetermined arithmetic operations, concatenation, or the like between the embedded representations.
Next, as shown in FIG. 4, the calculation unit 15b compares the e^(1,...,N) with one another to calculate a comparison result vector v. At that time, the calculation unit 15b may add arbitrary information, such as metadata of y_{1:T}^(1,...,N), to e^(1,...,N).
For example, using the multi-head self-attention mechanism, the calculation unit 15b concatenates e^(1,...,N) with the metadata m^(1,...,N) of y_{1:T}^(1,...,N), as shown in equation (4) below, and generates a tensor E_{1:T} that combines them. Here, the metadata m^(1,...,N) is a one-hot vector representation of the C-level (C = 5) understanding label.
[Equation (4) appears as an image in the original publication.]
The calculation unit 15b then calculates v from E_{1:T} using the multi-head self-attention mechanism, as shown in equation (5) below.
[Equation (5) appears as an image in the original publication.]
Here, d_2 is the number of attention heads, j indexes each attention head, and W_j^Q, W_j^K, and W_j^V are the weights for the query, key, and value in each attention head.
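A hedged reading of equations (4) and (5): each comparison vector e^(n) is concatenated with the one-hot understanding label m^(n) of the corresponding reference clip, the N results are stacked into E, and multi-head self-attention over E (with weights W_j^Q, W_j^K, W_j^V for j = 1, ..., d_2) is pooled into the single comparison result vector v. The exact stacking and pooling used in the original images may differ.

```latex
E = \bigl[\, e^{(1)} \,\Vert\, m^{(1)} \; ; \; \dots \; ; \; e^{(N)} \,\Vert\, m^{(N)} \,\bigr]
\qquad \text{(4)}

v = \operatorname{Pool}\Bigl( \operatorname{Concat}_{j=1}^{d_2}
\Bigl( \operatorname{softmax}\Bigl( \tfrac{(E W_j^{Q})(E W_j^{K})^{\top}}{\sqrt{d_k}} \Bigr) E W_j^{V} \Bigr) \Bigr)
\qquad \text{(5)}
```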
Returning to FIG. 1, the estimation unit 15c estimates the state of mind of the input data 14b using the result of comparing the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c.
For example, as described above, the estimation unit 15c estimates the state of mind of the input data 14b using the result obtained by the calculation unit 15b comparing, with the multi-head self-attention mechanism, the embedded representation calculated from the input data 14b and the embedded representations calculated from the reference data 14c.
Specifically, as shown in FIG. 4, the estimation unit 15c estimates the state of mind of the input data 14b from the comparison result vector v calculated by the calculation unit 15b. At that time, the estimation unit 15c may treat this as a classification problem with an arbitrary number of classes and calculate the posterior probability for each class to estimate the state of mind. That is, the estimation unit 15c may estimate the state of mind by calculating the posterior probability for each class of the state-of-mind classification. Alternatively, the estimation unit 15c may estimate a numerical value representing the state of mind as a regression problem.
For example, as shown in equation (6) below, the estimation unit 15c uses two fully connected layers to calculate the posterior probability p(C | x_{1:T}, y_{1:T}^(1,...,N)) for each of the five levels of understanding.
[Equation (6) appears as an image in the original publication.]
Here, W_1^FC and W_2^FC are the weights of the two fully connected layers, D_FC is the number of output dimensions of the first fully connected layer, and C is the number of predicted labels (C = 5 in this embodiment). A ReLU function is used as the activation function of the first fully connected layer.
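Given the description of two fully connected layers with a ReLU after the first, equation (6) plausibly has the following softmax form; this is an inferred reconstruction rather than the original image.

```latex
p\bigl(C \mid x_{1:T},\, y_{1:T}^{(1,\dots,N)}\bigr)
= \operatorname{softmax}\bigl( W_2^{FC}\, \operatorname{ReLU}( W_1^{FC} v ) \bigr),
\qquad W_1^{FC} \in \mathbb{R}^{D_{FC} \times \dim(v)}, \quad W_2^{FC} \in \mathbb{R}^{C \times D_{FC}}
\qquad \text{(6)}
```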
Returning to FIG. 1, the learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn the model parameters 14d of a model that estimates the state of mind appearing in input nonverbal information or paralinguistic information.
Specifically, the learning unit 15d updates the model parameter set Ω and obtains a learned model parameter set Ω'. The learning unit 15d can apply well-known loss functions and update methods. For example, the model parameter set Ω may include parameters pretrained on any other task, initial values may be generated from arbitrary random numbers, and some model parameters may be left without being updated.
For example, the learning unit 15d uses stochastic gradient descent (SGD) to update the model parameter set Ω, with the cross entropy L shown in equation (7) below as the loss function.
[Equation (7) appears as an image in the original publication.]
Here, m_x is the correct-answer distribution of the input video data x_{1:T}. The method of expressing the correct-answer distribution is not particularly limited; for example, it may be expressed as a one-hot vector. Alternatively, the correct-answer distribution may be expressed by approximating a normal distribution centered on the correct class.
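The cross-entropy loss in equation (7) presumably takes the standard form below, where m_x(c) is the correct-answer distribution over the C classes for the input clip x_{1:T}; this, too, is a reconstruction from the surrounding text rather than the original image.

```latex
L = - \sum_{c=1}^{C} m_x(c)\,
\log p\bigl(c \mid x_{1:T},\, y_{1:T}^{(1,\dots,N)}\bigr)
\qquad \text{(7)}
```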
The learning unit 15d stores the obtained learned model parameter set Ω' in the storage unit 14 as the model parameters 14d.
In this case, as described above, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c.
[Estimation processing]
Next, the estimation processing by the estimation device 10 will be described. FIG. 5 is a flowchart showing the estimation processing procedure. The flowchart of FIG. 5 is started, for example, at the timing when an input instructing the start of the estimation processing is received.
First, the acquisition unit 15a acquires the learning data 14a including nonverbal information or paralinguistic information and a correct label representing the state of mind appearing in the nonverbal information or paralinguistic information (step S1).
Next, the calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data and an embedded representation of the state of mind of the reference data (step S2).
The calculation unit 15b also compares the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c (step S3).
Then, the estimation unit 15c estimates the state of mind of the input data 14b using the result of comparing the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c (step S4). This completes the series of estimation processes.
[Effects]
As described above, in the estimation device 10 of the present embodiment, the acquisition unit 15a acquires learning data including nonverbal information or paralinguistic information and a correct label representing the state of mind appearing in the nonverbal information or paralinguistic information. The calculation unit 15b uses the respective feature values of the input data 14b and the reference data 14c in the acquired learning data 14a to calculate an embedded representation of the state of mind of the input data 14b and an embedded representation of the state of mind of the reference data 14c. The estimation unit 15c estimates the state of mind of the input data 14b using the result of comparing the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c.
Specifically, the estimation unit 15c compares the embedded representation calculated from the input data 14b with the embedded representations calculated from the reference data 14c using the multi-head self-attention mechanism. The estimation unit 15c also estimates the state of mind by calculating the posterior probability for each class of the state-of-mind classification. Alternatively, the estimation unit 15c estimates a numerical value representing the state of mind as a regression problem.
In this way, the estimation device 10 estimates the state of mind using, as reference information, pre-registered data with a plurality of correct labels other than the normal state. As a result, even when the variance of the labels is too large for individual differences to be fully normalized or absorbed, it is possible to accurately estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information.
In addition, since retraining of the model is not required, it is unnecessary to collect an appropriate, well-balanced amount of data for each class of the classification or to monitor the training, and the processing can therefore be performed with few resources.
The learning unit 15d uses the input data 14b and the estimated state of mind of the input data 14b to learn the model parameters 14d of the model that estimates the state of mind appearing in input nonverbal information or paralinguistic information. In this case, the calculation unit 15b uses the learned model parameters 14d to calculate the embedded representation of the state of mind of the input data 14b and the embedded representation of the state of mind of the reference data 14c. This makes it possible to estimate the label representing the state of mind appearing in the nonverbal/paralinguistic information with even higher accuracy.
[Example]
FIG. 6 is a diagram for explaining the embodiment. FIG. 6 shows the accuracy of each of three methods, including the method of the present invention, when estimating five levels of understanding from unknown video data of the same person. Three methods were applied here: a general method that does not use reference data ("none"), a method that absorbs individual differences using reference data of the normal state only ("level 3 only"), and a method to which the present invention is applied ("levels 2, 3, 4"). In "none", the level of understanding was estimated by applying a self-attention mechanism and a fully connected layer to the embedded representation Hx without using reference data. In "level 3 only", the level of understanding was estimated using only reference data with an understanding level of 3 (N=3). In "levels 2, 3, 4", the level of understanding was estimated using reference data with understanding levels of 2 to 4 (N=3).
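The three conditions differ only in which reference data are supplied to the estimation device. A hypothetical configuration might look like the following, where the clip names and data structures are placeholders introduced for explanation only.

```python
# Illustrative only: hypothetical reference clips per understanding level.
reference_pool = {
    2: ["u2_a.mp4", "u2_b.mp4", "u2_c.mp4"],
    3: ["u3_a.mp4", "u3_b.mp4", "u3_c.mp4"],  # level 3 corresponds to the normal state
    4: ["u4_a.mp4", "u4_b.mp4", "u4_c.mp4"],
}

conditions = {
    "none": [],                                                   # baseline without reference data
    "level 3 only": reference_pool[3],                            # N=3, normal state only
    "levels 2, 3, 4": [reference_pool[k][0] for k in (2, 3, 4)],  # N=3, one clip per state
}
```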
As shown in FIG. 6, it was confirmed that, compared with the general method that does not use reference data ("none") and the method that uses reference data of the normal state only ("level 3 only"), the method of the present invention, which uses reference data of a plurality of states ("levels 2, 3, 4"), stably improves the accuracy of estimating the level of understanding appearing in the video data.
[program]
A program in which the processing executed by the estimation device 10 according to the above embodiment is described in a computer-executable language can also be created. In one embodiment, the estimation device 10 can be implemented by installing an estimation program that executes the above estimation processing on a desired computer as packaged software or online software. For example, an information processing device can be caused to function as the estimation device 10 by causing the information processing device to execute the above estimation program. Such information processing devices include smartphones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone System) terminals, and slate terminals such as PDAs (Personal Digital Assistants). The functions of the estimation device 10 may also be implemented in a cloud server.
FIG. 7 is a diagram showing an example of a computer that executes the estimation program. The computer 1000 includes, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. A mouse 1051 and a keyboard 1052, for example, are connected to the serial port interface 1050. A display 1061, for example, is connected to the video adapter 1060.
Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
The estimation program is stored in the hard disk drive 1031 as, for example, a program module 1093 in which instructions to be executed by the computer 1000 are written. Specifically, a program module 1093 describing each process executed by the estimation device 10 described in the above embodiment is stored in the hard disk drive 1031.
Data used for information processing by the estimation program is stored as program data 1094 in, for example, the hard disk drive 1031. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.
Note that the program module 1093 and the program data 1094 related to the estimation program are not limited to being stored in the hard disk drive 1031; for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 related to the estimation program may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.
Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the descriptions and drawings that form part of the disclosure of the present invention based on this embodiment. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment are all included within the scope of the present invention.
10 estimation device
11 input unit
12 output unit
13 communication control unit
14 storage unit
14a learning data
14b input data
14c reference data
14d model parameter
15 control unit
15a acquisition unit
15b calculation unit
15c estimation unit
15d learning unit

Claims (7)

  1.  An estimation method executed by an estimation device, the method comprising:
     an acquisition step of acquiring learning data including nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information;
     a calculation step of calculating, using respective feature amounts of input data and reference data in the acquired learning data, an embedded representation of a state of mind of the input data and an embedded representation of a state of mind of the reference data; and
     an estimation step of estimating the state of mind of the input data using a result of comparison between the embedded representation calculated from the input data and the embedded representation calculated from the reference data.
  2.  The estimation method according to claim 1, wherein the estimation step compares the embedded representation calculated from the input data with the embedded representation calculated from the reference data using a multi-head self-attention mechanism.
  3.  The estimation method according to claim 1, wherein the estimation step estimates the state of mind by calculating a posterior probability for each class of a classification of the state of mind.
  4.  The estimation method according to claim 1, wherein the estimation step estimates a numerical value representing the state of mind as a regression problem.
  5.  The estimation method according to claim 1, further comprising a learning step of learning, using the input data and the estimated state of mind of the input data, model parameters of a model that estimates a state of mind appearing in input nonverbal information or paralinguistic information,
     wherein the calculation step calculates the embedded representation of the state of mind of the input data and the embedded representation of the state of mind of the reference data using the learned model parameters.
  6.  An estimation device comprising:
     an acquisition unit that acquires learning data including nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information;
     a calculation unit that calculates, using respective feature amounts of input data and reference data in the acquired learning data, an embedded representation of a state of mind of the input data and an embedded representation of a state of mind of the reference data; and
     an estimation unit that estimates the state of mind of the input data using a result of comparison between the embedded representation calculated from the input data and the embedded representation calculated from the reference data.
  7.  An estimation program for causing a computer to execute:
     an acquisition step of acquiring learning data including nonverbal information or paralinguistic information and a correct label representing a state of mind appearing in the nonverbal information or paralinguistic information;
     a calculation step of calculating, using respective feature amounts of input data and reference data in the acquired learning data, an embedded representation of a state of mind of the input data and an embedded representation of a state of mind of the reference data; and
     an estimation step of estimating the state of mind of the input data using a result of comparison between the embedded representation calculated from the input data and the embedded representation calculated from the reference data.
PCT/JP2021/031791 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program WO2023032014A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023544819A JPWO2023032014A1 (en) 2021-08-30 2021-08-30
PCT/JP2021/031791 WO2023032014A1 (en) 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/031791 WO2023032014A1 (en) 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program

Publications (1)

Publication Number Publication Date
WO2023032014A1 true WO2023032014A1 (en) 2023-03-09

Family

ID=85412275

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/031791 WO2023032014A1 (en) 2021-08-30 2021-08-30 Estimation method, estimation device, and estimation program

Country Status (2)

Country Link
JP (1) JPWO2023032014A1 (en)
WO (1) WO2023032014A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114057A1 (en) * 2016-10-21 2018-04-26 Samsung Electronics Co., Ltd. Method and apparatus for recognizing facial expression
WO2021166207A1 (en) * 2020-02-21 2021-08-26 日本電信電話株式会社 Recognition device, learning device, method for same, and program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114057A1 (en) * 2016-10-21 2018-04-26 Samsung Electronics Co., Ltd. Method and apparatus for recognizing facial expression
WO2021166207A1 (en) * 2020-02-21 2021-08-26 日本電信電話株式会社 Recognition device, learning device, method for same, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MIMURA ATSUSHI, HAGIWARA MASAFUMI: "Understanding Presumption System from Facial Images", DENKI GAKKAI RONBUNSHI. C, EREKUTORONIKUSU, JOHO KOGAKU, SHISUTEMU, THE INSTITUTE OF ELECTRICAL ENGINEERS OF JAPAN, 1 February 2000 (2000-02-01), pages 273 - 278, XP093043173, Retrieved from the Internet <URL:https://www.jstage.jst.go.jp/article/ieejeiss1987/120/2/120_2_273/_pdf/-char/ja> [retrieved on 20230501], DOI: 10.1541/ieejeiss1987.120.2_273 *

Also Published As

Publication number Publication date
JPWO2023032014A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
US11562145B2 (en) Text classification method, computer device, and storage medium
JP7306062B2 (en) Knowledge transfer method, information processing device and storage medium
WO2022007823A1 (en) Text data processing method and device
WO2020007129A1 (en) Context acquisition method and device based on voice interaction
CN111814620A (en) Face image quality evaluation model establishing method, optimization method, medium and device
KR20190081243A (en) Method and apparatus of recognizing facial expression based on normalized expressiveness and learning method of recognizing facial expression
WO2020098083A1 (en) Call separation method and apparatus, computer device and storage medium
WO2021051497A1 (en) Pulmonary tuberculosis determination method and apparatus, computer device, and storage medium
WO2020073533A1 (en) Automatic question answering method and device
Salathé et al. Focus group on artificial intelligence for health
WO2020252903A1 (en) Au detection method and apparatus, electronic device, and storage medium
WO2020244151A1 (en) Image processing method and apparatus, terminal, and storage medium
US20240152770A1 (en) Neural network search method and related device
CN112884326A (en) Video interview evaluation method and device based on multi-modal analysis and storage medium
US20230107505A1 (en) Classifying out-of-distribution data using a contrastive loss
CN114817612A (en) Method and related device for calculating multi-modal data matching degree and training calculation model
CN112037904B (en) Online diagnosis and treatment data processing method and device, computer equipment and storage medium
WO2023032014A1 (en) Estimation method, estimation device, and estimation program
CN113327212B (en) Face driving method, face driving model training device, electronic equipment and storage medium
WO2023032016A1 (en) Estimation method, estimation device, and estimation program
JP6992725B2 (en) Para-language information estimation device, para-language information estimation method, and program
JP5931021B2 (en) Personal recognition tendency model learning device, personal recognition state estimation device, personal recognition tendency model learning method, personal recognition state estimation method, and program
CN113515935A (en) Title generation method, device, terminal and medium
US11875785B2 (en) Establishing user persona in a conversational system
WO2023119672A1 (en) Inference method, inference device, and inference program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21955909

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023544819

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE