WO2022185457A1 - Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program - Google Patents

Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program Download PDF

Info

Publication number
WO2022185457A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
feature
feature amount
feature quantity
model
Prior art date
Application number
PCT/JP2021/008258
Other languages
French (fr)
Japanese (ja)
Inventor
康仁 大杉
いつみ 斉藤
京介 西田
仙 吉田
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2021/008258 priority Critical patent/WO2022185457A1/en
Publication of WO2022185457A1 publication Critical patent/WO2022185457A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology

Definitions

  • A learning device in which: a first model using a neural network extracts a first feature quantity of each piece of information in an information series; a second model using a recurrent neural network extracts, using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; a third model using a neural network extracts a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities; a task is executed using the third feature quantities; and model parameters of the neural networks constituting the first model, the second model, and the third model are updated based on the execution result of the task and correct answer information.
  • A feature quantity extraction method executed by a computer, comprising: extracting a first feature quantity of each piece of information in an information series; extracting, by a model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; and extracting a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities.
  • A learning method executed by a computer, comprising: extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series; extracting, by a second model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities; executing a task using the third feature quantities; and updating model parameters of the first model, the second model, and the third model based on the execution result of the task and correct answer information.
  • A non-transitory storage medium storing a program executable by a computer to perform a feature quantity extraction process, wherein the feature quantity extraction process includes: extracting a first feature quantity of each piece of information in an information series; extracting, by a model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; and extracting a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities.
  • A non-transitory storage medium storing a program executable by a computer to perform a learning process, wherein the learning process includes: extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series; extracting, by a second model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities; executing a task using the third feature quantities; and updating model parameters of the neural networks constituting the first model, the second model, and the third model based on the execution result of the task and correct answer information.
  • Reference signs: 100 Feature quantity extraction device, 110 Token feature quantity extraction unit, 120 Position feature quantity extraction unit, 130 Context encoding unit, 140 Classification unit, 150 Update unit, 160 Model parameter (1) storage unit, 170 Model parameter (2) storage unit, 180 Text data storage unit, 200 Learning device, 1000 Drive device, 1001 Recording medium, 1002 Auxiliary storage device, 1003 Memory device, 1004 CPU, 1005 Interface device, 1006 Display device, 1007 Input device, 1008 Output device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

Provided is a feature quantity extraction device comprising: a first feature quantity extraction unit for extracting a first feature quantity of each item of information in an information series; a second feature quantity extraction unit, which is a model using a recurrent neural network, for extracting, using the first feature quantities, a second feature quantity related to the position of each item of information in the information series; and a third feature quantity extraction unit for extracting a third feature quantity of each item of information in the information series by using the first and second feature quantities.

Description

Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program
The present invention relates to a neural network model that obtains distributed representations of tokens.
In recent years, research on language models has been actively conducted, including BERT (Bidirectional Encoder Representations from Transformers) disclosed in Non-Patent Document 1. A language model here is a neural network model that obtains distributed representations of tokens. In this specification, a token denotes a unit of distributed representation, such as a word. For example, in Non-Patent Document 1, words are further divided into subwords and distributed representations are computed per subword; in that case, the token is a subword.
Because a language model is given not a single token but the entire text in which the token appears, it can obtain a distributed representation that reflects the semantic relationships with the other tokens in the text.
The step of learning these distributed representations is called pre-training. Pre-trained distributed representations can then be used to solve various tasks such as text classification and question answering; this step is called fine-tuning.
The model disclosed in Non-Patent Document 1 achieves high performance on each fine-tuning task by learning accurate distributed representations of tokens through pre-training on large-scale language resources.
In the language model disclosed in Non-Patent Document 1, the Transformer's attention mechanism and position embedding are key elements. As described in Section 3.2 of Non-Patent Document 2, the attention mechanism computes weights representing how strongly a given token is related to the other tokens and computes the token's distributed representation based on those weights. Position embedding (Section 3.5 of Non-Patent Document 2) is a feature quantity representing the position of a token within the text.
The language model disclosed in Non-Patent Document 1 cannot handle long texts (long token sequences) well. The reason is that only a number of position embeddings fixed at the pre-training stage are learned. The position embeddings of Non-Patent Document 1 are vectors that depend on the absolute position of each token and are among the learned parameters.
For example, the language model of Non-Patent Document 1 prepares 512 position embeddings and can therefore handle positions of up to 512 tokens in a text. If a text is longer than 512 tokens, the 513th and subsequent tokens cannot be processed together with the preceding tokens, and their relationships with the other tokens may not be reflected appropriately.
The difficulty of properly handling long sequences described above is not limited to token sequences; it can arise in any sequence of information.
The present invention has been made in view of the above points, and its purpose is to provide a technique that makes it possible to extract, for each piece of information in an information sequence of arbitrary length, a feature quantity that appropriately reflects its relationships with the other pieces of information.
According to the disclosed technology, a feature quantity extraction device is provided that comprises: a first feature quantity extraction unit that extracts a first feature quantity of each piece of information in an information series; a second feature quantity extraction unit, which is a model using a recurrent neural network, that extracts from the first feature quantities a second feature quantity representing the position of each piece of information in the information series; and a third feature quantity extraction unit that extracts a third feature quantity of each piece of information in the information series using the first and second feature quantities.
According to the disclosed technology, it is possible to extract, for each piece of information in an information sequence of arbitrary length, a feature quantity that appropriately reflects its relationships with the other pieces of information.
FIG. 1 is a configuration diagram of the feature quantity extraction device. FIG. 2 is a flowchart showing the operation procedure of the feature quantity extraction device. FIG. 3 is a configuration diagram of the learning device. FIG. 4 is a flowchart showing the operation procedure of the learning device. FIG. 5 is a hardware configuration diagram of the apparatus. FIG. 6 is a diagram showing experimental results.
An embodiment of the present invention (the present embodiment) is described below with reference to the drawings. The embodiments described below are merely examples, and embodiments to which the present invention can be applied are not limited to them. For example, although tokens are used below as an example of the information in an information sequence, the processing described below can also be applied to information other than tokens (for example, images).
(Overview of Embodiment)
In the present embodiment, the position embedding of the language model of Non-Patent Document 1 is replaced with position embedding based on a recurrent neural network (RNN). This removes the limit on sequence length and makes it possible, for text of arbitrary length, to extract for each token a feature quantity that reflects its relationships with the other tokens.
In the present embodiment, the continuity of tokens is modeled by the RNN, and the RNN outputs a feature quantity representing the position of each token in the token sequence.
As mentioned above, the language model disclosed in Non-Patent Document 1 prepares only a fixed number of position embeddings and therefore cannot handle texts longer than that. In the present embodiment, by contrast, token continuity is modeled by the RNN, so even for an unseen position the RNN can compute the token's position information from its relative positional relationships with the preceding and following tokens. Texts of unknown length can therefore be handled, solving the problem of the language model disclosed in Non-Patent Document 1.
Examples 1 and 2 are described below as detailed examples of the present invention. Example 1 describes a feature quantity extraction device 100 comprising a language model that extracts contextual feature quantities from text. Example 2 describes a learning device 200 that learns the model parameters of the language model constituting the feature quantity extraction device 100.
(Example 1)
<Device configuration>
FIG. 1 shows a configuration example of the feature quantity extraction device 100 according to Example 1. As shown in FIG. 1, the feature quantity extraction device 100 has a token feature quantity extraction unit 110, a position feature quantity extraction unit 120, and a context encoding unit 130. The token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 may also be called a first feature quantity extraction unit, a second feature quantity extraction unit, and a third feature quantity extraction unit, respectively. Likewise, the token feature quantity, position feature quantity, and context feature quantity may be called a first feature quantity, a second feature quantity, and a third feature quantity, respectively.
The context feature quantities obtained by the feature quantity extraction device 100 may be used for task execution by an external device, or the feature quantity extraction device 100 may itself include a task execution unit and execute tasks using the context feature quantities.
Text is input to the feature quantity extraction device 100, and the feature quantity extraction device 100 extracts contextual feature quantities from the input text.
By inputting the context feature quantities obtained by the feature quantity extraction device 100 to a classification unit specialized for a specific task (which may also be called a task execution unit), a specific task such as a word fill-in task or a text classification task can be solved. The token feature quantity extraction unit 110, the position feature quantity extraction unit 120, the context encoding unit 130, and the classification unit are all implemented as neural networks.
The configuration of the language model constituting the feature quantity extraction device 100 is based on that of the language model disclosed in Non-Patent Document 1. In the language model of Non-Patent Document 1, however, the model corresponding to the token feature quantity extraction unit 110 and the model corresponding to the position feature quantity extraction unit 120 operate independently, whereas in Example 1 the output of the token feature quantity extraction unit 110 is input to the position feature quantity extraction unit 120, so the two are not independent.
<Operation example>
An operation example in which the feature quantity extraction device 100 of Example 1 obtains contextual feature quantities from text is described in detail below, following the flowchart of FIG. 2. In the following description, a text is a token sequence, and the length of the text is the length of that token sequence. The feature quantities obtained below may each have a different number of dimensions, but for simplicity they all have the same dimensionality d in this example.
In S101, a text S = {s_1, s_2, ..., s_L}, which is a sequence of tokens s_i, is input to the token feature quantity extraction unit 110.
In S102, the token feature quantity extraction unit 110 extracts a sequence of token feature quantities {w_1, w_2, ..., w_L} from the text, where each token feature quantity w_i satisfies w_i ∈ R^d, i.e., w_i is a d-dimensional real vector.
The token feature quantity extraction unit 110 may have any configuration as long as it is a model that outputs a feature quantity (vector) corresponding to each token in the text. For example, as in Non-Patent Document 1, given a predetermined vocabulary set V, one vector is assigned to each token in the vocabulary and treated as a learned parameter, thereby providing the feature quantity corresponding to each token. That is, the feature quantity of a token is a vector of d weight parameters per token learned in the neural network. If the number of tokens in the vocabulary is V, then V × d is the number of weight parameters of the neural network model constituting the token feature quantity extraction unit 110. In the following, this vector is called an embedding.
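As an illustration, this kind of token feature quantity extraction amounts to a lookup table of learned embeddings. The following is a minimal sketch in PyTorch; the vocabulary size, dimensionality d, and variable names are illustrative assumptions, not values taken from this publication.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not specified in this publication)
vocab_size = 30000   # |V|: number of tokens in the vocabulary
d = 768              # dimensionality shared by all feature quantities

# One learned d-dimensional vector (embedding) per vocabulary token,
# i.e. V x d weight parameters in total.
token_embedding = nn.Embedding(vocab_size, d)

# A text is a sequence of L token ids s_1, ..., s_L.
token_ids = torch.tensor([[12, 845, 3, 997, 65]])   # shape (batch=1, L=5)

# Token feature quantities {w_1, ..., w_L}, each w_i in R^d.
w = token_embedding(token_ids)                       # shape (1, L, d)
print(w.shape)  # torch.Size([1, 5, 768])
```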
The sequence of token feature quantities {w_1, w_2, ..., w_L} obtained by the token feature quantity extraction unit 110 is input to the position feature quantity extraction unit 120 and the context encoding unit 130.
In S103, the position feature quantity extraction unit 120 extracts from the token feature quantity sequence {w_1, w_2, ..., w_L} a sequence of position feature quantities (position embeddings) {p_1, p_2, ..., p_L} reflecting the positional relationships of the tokens, where p_i ∈ R^d. The position feature quantity extraction unit 120 may be any model that extracts from the token feature quantities a feature quantity (vector) reflecting the positional relationships of the tokens.
In this example, a model consisting of a recurrent neural network (RNN) is used as the position feature quantity extraction unit 120. Regarding the order of the tokens in the token sequence as the progression of time, the token feature quantities are input to the RNN in time order, i.e., in the order w_1, w_2, ..., w_L.
At each time step, the RNN receives the token feature quantity for that time step together with the hidden-state information of the previous time step, and from these it computes and outputs the hidden-state information for the current time step. This hidden-state information corresponds to the position feature quantity.
The RNN of this example may be unidirectional or bidirectional. Using a bidirectional RNN in particular makes it possible to extract the relative positional relationship of a token from the preceding and following tokens. There are various kinds of RNNs, such as LSTMs and GRUs, and any of them may be adopted. Furthermore, the RNN may be stacked in multiple layers; the number of layers is not particularly limited and may be one or more.
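A bidirectional RNN of this kind can be sketched as follows. This is a hedged example using a PyTorch bidirectional LSTM; projecting the concatenated forward and backward hidden states back to d dimensions is one possible design choice, not a detail prescribed by the publication.

```python
import torch
import torch.nn as nn

class RNNPositionEmbedding(nn.Module):
    """Illustrative sketch of the position feature quantity extraction unit 120.

    Consumes the token feature quantities {w_1, ..., w_L} in order and
    outputs one position feature quantity p_i in R^d per token, taken
    from the RNN hidden states.
    """
    def __init__(self, d: int, num_layers: int = 1):
        super().__init__()
        # A bidirectional LSTM lets p_i reflect both preceding and
        # following tokens; a GRU or a unidirectional RNN would also fit.
        self.rnn = nn.LSTM(input_size=d, hidden_size=d, num_layers=num_layers,
                           batch_first=True, bidirectional=True)
        # Map the 2*d-dimensional bidirectional hidden state back to d.
        self.proj = nn.Linear(2 * d, d)

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # w: (batch, L, d) token feature quantities
        hidden, _ = self.rnn(w)          # (batch, L, 2*d) hidden states
        p = self.proj(hidden)            # (batch, L, d) position features
        return p

d = 768
pos_extractor = RNNPositionEmbedding(d)
w = torch.randn(1, 5, d)                 # token features from the embedding
p = pos_extractor(w)                     # position features {p_1, ..., p_L}
print(p.shape)                           # torch.Size([1, 5, 768])
```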
The token feature quantity sequence {w_1, w_2, ..., w_L} and the position feature quantity sequence {p_1, p_2, ..., p_L} are input to the context encoding unit 130.
In S104, the context encoding unit 130 computes a sequence of context feature quantities {h_1, h_2, ..., h_L} from the token feature quantity sequence {w_1, w_2, ..., w_L} and the position feature quantity sequence {p_1, p_2, ..., p_L}. In S105, the sequence of context feature quantities {h_1, h_2, ..., h_L} is output, where each context feature quantity h_i satisfies h_i ∈ R^d.
The context encoding unit 130 may be any neural network model that has a mechanism for taking the surrounding context (that is, information about the surrounding tokens other than the i-th token) into account when computing the feature quantity of the i-th token.
For example, the Transformer Encoder disclosed in Non-Patent Document 2 can be used as the context encoding unit 130. In this case, the vector obtained by adding the token feature quantity w_i and the position feature quantity p_i is fed to the Transformer Encoder (context encoding unit 130) as the i-th input.
Regarding the input to the Transformer Encoder (context encoding unit 130), further feature quantities may be added on top of the sum of the token feature quantity w_i and the position feature quantity p_i. For example, in the technique disclosed in Non-Patent Document 1, for the task of estimating the semantic relationship between two sentences, a segment feature quantity is newly created to distinguish the first sentence from the second and is added token by token (Fig. 2 of Non-Patent Document 1). In this example as well, a feature quantity g_i for distinguishing sentences, analogous to that segment feature quantity, may additionally be added to the sum of the token feature quantity w_i and the position feature quantity p_i.
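Putting the pieces together, a forward pass of the feature quantity extraction device 100 could look like the following sketch, which feeds the element-wise sum w_i + p_i into a standard Transformer encoder. The layer counts, head count, sizes, and the use of torch.nn.TransformerEncoder are assumptions made for illustration; the optional segment feature g_i is omitted.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Illustrative sketch of the feature quantity extraction device 100."""
    def __init__(self, vocab_size=30000, d=768, nhead=12, num_encoder_layers=12):
        super().__init__()
        # Token feature quantity extraction unit 110
        self.token_embedding = nn.Embedding(vocab_size, d)
        # Position feature quantity extraction unit 120 (bidirectional RNN)
        self.pos_rnn = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.pos_proj = nn.Linear(2 * d, d)
        # Context encoding unit 130 (Transformer Encoder with attention)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_encoder_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        w = self.token_embedding(token_ids)          # {w_i}, (batch, L, d)
        hidden, _ = self.pos_rnn(w)
        p = self.pos_proj(hidden)                    # {p_i}, (batch, L, d)
        # The i-th encoder input is the sum w_i + p_i (a segment feature
        # g_i could additionally be added here).
        h = self.encoder(w + p)                      # {h_i}, (batch, L, d)
        return h

model = FeatureExtractor()
token_ids = torch.randint(0, 30000, (1, 700))        # longer than 512 tokens
h = model(token_ids)                                  # context feature quantities
print(h.shape)                                        # torch.Size([1, 700, 768])
```

Because the position features come from the RNN rather than a fixed table, the same sketch runs unchanged on sequences longer than 512 tokens, which is the point of the proposed configuration.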
Like the Transformer Encoder, the context encoding unit 130 of this example uses an attention mechanism to take into account, for each token, its relationships with the other tokens, and outputs a feature quantity reflecting them. Since the attention mechanism itself is a technique disclosed in Non-Patent Document 2, only its outline is described here.
As disclosed in Non-Patent Document 2, the attention mechanism is expressed by the following equation (1).
    Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

When considering the relationships between the token whose feature quantity is being computed and the other tokens, Q, K, and V are matrices obtained by linearly transforming the tokens' feature quantities (here, the feature quantity obtained by adding the token feature quantity w_i and the position feature quantity p_i), with Q, K, V ∈ R^(d×L). In equation (1), the factor

    softmax(QK^T / √d)

computes scores (probabilities) representing how strongly the token in question is related to the other tokens, based on the inner products between the tokens' feature quantities. Using these scores as weights, the weighted sum of the vectors in V corresponding to the tokens is the output of attention, that is, a feature quantity representing how strongly the other tokens are related to the token in question. Adding this Attention(Q, K, V) to the token's own feature quantity (the sum of the token feature quantity w_i and the position feature quantity p_i) yields a feature quantity h_i that reflects the relevance between the token and the other tokens.
Because the context encoding unit 130 of this example is built around an attention mechanism, the position feature quantity extraction unit 120 (the RNN) can concentrate on capturing the positional relationships of the tokens while the high-accuracy context-capturing ability of attention is retained. In particular for long texts, this makes it possible to build a model that outputs more accurate context feature quantities than the conventional techniques disclosed in Non-Patent Documents 1 and 2.
(Example 2)
Next, Example 2 is described. Example 2 describes a method of learning the model parameters (1) of the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 that constitute the feature quantity extraction device 100 described in Example 1. In Example 2, the model parameters of the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 constituting the feature quantity extraction device 100 are referred to as "model parameters (1)", and the model parameters of the classification unit 140 described later are referred to as "model parameters (2)".
The learning method is not limited to a specific one; it suffices to learn the model parameters so that some task performed on the input text yields the correct answer. In this example, as one example, a method of learning the model parameters using a word fill-in task (Task #1 Masked LM in Section 3.1 of Non-Patent Document 1) is described.
In Example 2, the device that learns the model parameters is called the learning device 200. The learning device 200 may also be used for actual task execution after learning.
FIG. 3 shows a configuration example of the learning device 200. As shown in FIG. 3, in addition to the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 described in Example 1, the learning device 200 includes a classification unit 140, an update unit 150, a model parameter (1) storage unit 160, a model parameter (2) storage unit 170, and a text data storage unit 180.
The classification unit 140 is a mechanism (a neural network model) that predicts the masked words (tokens) in the word fill-in task. The update unit 150 is a mechanism that simultaneously updates model parameters (1) and (2) so that the error between the correct token and the predicted token becomes small. As the update method, for example, error backpropagation, a common method of supervised learning, can be used. The learning method is described below following the flowchart of FIG. 4.
In S201, preparations are made. First, a set of text data is prepared and stored in the text data storage unit 180. As the text data, data published on the Web, such as Wikipedia, can be used.
Next, masked texts are created from the text data. For example, each Wikipedia paragraph is extracted as one text, the text is split into tokens with an appropriate tokenizer, some tokens are selected, and each selected token is replaced with the mask token ([MASK]) or with another randomly chosen token, or kept as it is. This yields a text in which some tokens of the token sequence are masked (referred to as a "masked text"). The conditions for replacement and retention may be the same as those disclosed in Non-Patent Document 1. A token selected for replacement or retention is treated as a correct token, and this token becomes a prediction target. Model parameters (1) and (2) are initialized, for example with random values.
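The masking step can be sketched as follows. The 15% selection rate and the 80/10/10 split between [MASK], a random token, and keeping the original are the usual BERT-style settings from Non-Patent Document 1; treat the exact numbers and function names as illustrative assumptions.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_rate=0.15, seed=None):
    """Return (masked_tokens, targets); targets[i] holds the correct token
    at positions chosen as prediction targets, and None elsewhere."""
    rng = random.Random(seed)
    masked, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_rate:
            continue
        targets[i] = tok                     # correct token to predict
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK_TOKEN           # replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)    # replace with a random token
        # else: keep the token as it is
    return masked, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab, seed=0)
print(masked, targets)
```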
In S202, a masked text is input to the token feature quantity extraction unit 110, and, as described in Example 1, the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 process it to obtain the sequence of context feature quantities {h_1, h_2, ..., h_L} corresponding to the masked text. Here, h_i in {h_1, h_2, ..., h_L} is the context feature quantity for the i-th token. A context feature quantity may also be called a distributed representation.
In S203, the sequence of context feature quantities {h_1, h_2, ..., h_L} is input to the classification unit 140, and the classification unit 140 outputs predicted tokens.
The classification unit 140 is a mechanism that predicts the i-th token from a predetermined vocabulary based on the feature quantity h_i for the i-th token. For example, the classification unit 140 uses a one-layer feed-forward network to convert h_i into a feature quantity y_i ∈ R^(d') whose dimensionality is the vocabulary size d', and predicts a token from the vocabulary using the index of the maximum element of y_i (an index indicating one of the d' vocabulary entries (tokens)).
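Such a classification unit amounts to a single linear layer over the vocabulary followed by an argmax. A minimal sketch, with the vocabulary size d' chosen purely for illustration:

```python
import torch
import torch.nn as nn

d, d_vocab = 768, 30000            # d' = vocabulary size (illustrative)

# One-layer feed-forward network mapping h_i in R^d to y_i in R^{d'}.
classifier = nn.Linear(d, d_vocab)

h = torch.randn(1, 5, d)           # context feature quantities {h_i}
y = classifier(h)                  # (1, L, d') scores over the vocabulary

# The predicted token is the index of the maximum element of y_i.
predicted_ids = y.argmax(dim=-1)   # (1, L)
print(predicted_ids)
```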
In S204, the predicted tokens and the correct tokens are input to the update unit 150, and model parameters (1) and (2) are updated by supervised learning. S202 to S204 are then repeated with the updated model parameters (1) and (2). In this way, model parameters (1) and (2) are learned so that accurate predictions can be made.
In S205, learning ends if a termination condition is satisfied. The termination condition may be that the number of iterations has reached a predetermined number, that the update amount of the model parameters has become smaller than a threshold, or some other condition.
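A single iteration of S202 to S204 can be sketched as below: a cross-entropy loss between the predicted and correct tokens is backpropagated, and one optimizer updates model parameters (1) and (2) simultaneously. For brevity a plain embedding stands in for the full extractor of Example 1, and the Adam optimizer, the ignore-index marking of non-target positions, and all sizes are illustrative assumptions, not requirements of the publication.

```python
import torch
import torch.nn as nn

# Stand-ins: `extractor` plays the role of the units holding parameters (1)
# (in practice the token/position/context units of Example 1), and
# `classifier` is the classification unit 140 holding parameters (2).
extractor = nn.Embedding(30000, 768)
classifier = nn.Linear(768, 30000)

optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(classifier.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # -100 marks non-target positions

def train_step(masked_ids, target_ids):
    # masked_ids, target_ids: (batch, L); target_ids is -100 wherever the
    # token was not selected as a prediction target.
    h = extractor(masked_ids)                          # S202: context feature quantities
    logits = classifier(h)                             # S203: scores over the vocabulary
    loss = loss_fn(logits.view(-1, logits.size(-1)),   # S204: error between the
                   target_ids.view(-1))                #       predicted and correct tokens
    optimizer.zero_grad()
    loss.backward()                                    # error backpropagation
    optimizer.step()                                   # update parameters (1) and (2) together
    return loss.item()

masked_ids = torch.randint(0, 30000, (2, 128))
target_ids = torch.full((2, 128), -100)
target_ids[:, 10] = 42                                 # one prediction target per text
print(train_step(masked_ids, target_ids))
```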
(Hardware configuration example)
Both the feature quantity extraction device 100 and the learning device 200 can be realized, for example, by causing a computer to execute a program. This computer may be a physical computer or a virtual machine on the cloud. The feature quantity extraction device 100 and the learning device 200 are collectively referred to as the "apparatus".
That is, the apparatus can be realized by executing a program corresponding to the processing performed by the apparatus using hardware resources such as the CPU and memory built into the computer. The program can be recorded on a computer-readable recording medium (such as portable memory) and saved or distributed. The program can also be provided through a network such as the Internet or by e-mail.
FIG. 5 is a diagram showing a hardware configuration example of the computer. The computer of FIG. 5 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and so on, which are interconnected by a bus BS. Some of these devices may be omitted; for example, the display device 1006 may be omitted when no display is performed.
 当該コンピュータでの処理を実現するプログラムは、例えば、CD-ROM又はメモリカード等の記録媒体1001によって提供される。プログラムを記憶した記録媒体1001がドライブ装置1000にセットされると、プログラムが記録媒体1001からドライブ装置1000を介して補助記憶装置1002にインストールされる。但し、プログラムのインストールは必ずしも記録媒体1001より行う必要はなく、ネットワークを介して他のコンピュータよりダウンロードするようにしてもよい。補助記憶装置1002は、インストールされたプログラムを格納すると共に、必要なファイルやデータ等を格納する。 A program that implements the processing in the computer is provided by a recording medium 1001 such as a CD-ROM or memory card, for example. When the recording medium 1001 storing the program is set in the drive device 1000 , the program is installed from the recording medium 1001 to the auxiliary storage device 1002 via the drive device 1000 . However, the program does not necessarily need to be installed from the recording medium 1001, and may be downloaded from another computer via the network. The auxiliary storage device 1002 stores installed programs, as well as necessary files and data.
 メモリ装置1003は、プログラムの起動指示があった場合に、補助記憶装置1002からプログラムを読み出して格納する。CPU1004は、メモリ装置1003に格納されたプログラムに従って、当該装置に係る機能を実現する。インタフェース装置1005は、ネットワークに接続するためのインタフェースとして用いられ、送信部及び受信部として機能する。表示装置1006はプログラムによるGUI(Graphical User Interface)等を表示する。入力装置1007はキーボード及びマウス、ボタン、又はタッチパネル等で構成され、様々な操作指示を入力させるために用いられる。出力装置1008は演算結果を出力する。 The memory device 1003 reads and stores the program from the auxiliary storage device 1002 when a program activation instruction is received. The CPU 1004 implements functions related to the device according to programs stored in the memory device 1003 . The interface device 1005 is used as an interface for connecting to a network and functions as a transmitter and a receiver. A display device 1006 displays a GUI (Graphical User Interface) or the like by a program. An input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, or the like, and is used to input various operational instructions. The output device 1008 outputs the calculation result.
 (Experiments)
 For various tasks, experiments were conducted using the technique of Non-Patent Document 1 and the proposed technique according to the present invention described in the first and second embodiments. The experimental results are shown in FIG. 6, which lists the task names and the evaluation results (acc. (correct answer rate) or F1 score).
 The specific tasks targeted in the experiments are word fill-in, question answering, text classification, and interactive question answering. An outline of each task is as follows.
 The word fill-in task is Task #1 Masked LM described in Section 3.1 of Non-Patent Document 1.
 The question answering task is given a long text and a question, and extracts the answer span from the long text.
 The text classification task is given answer choices, their corresponding explanatory texts (long texts), and a question, and selects the choice that answers the question.
 The interactive question answering task is given the text of a long dialogue history and a question, and extracts the answer to the question from that text.
 As experimental conditions, the word fill-in experiments were run with a maximum text length of 512 tokens, and the other tasks with a maximum text length of 1024 tokens.
 As shown in FIG. 6, in the experiments with texts longer than 512 tokens, the proposed technique obtained better results than the technique disclosed in Non-Patent Document 1.
 (Effects of the embodiment, etc.)
 With the technique according to the present embodiment described above, the model for extracting context features uses an RNN-based position embedding that takes the preceding and following positional relationships into account. Consequently, for text of arbitrary length (for example, longer than 512 tokens), it is possible to obtain accurate features, that is, features that appropriately reflect the relationships between each token and the other tokens.
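 As a concrete illustration of this pipeline, the following is a minimal sketch assuming a GRU is used as the recurrent network that produces the position features and a standard Transformer encoder produces the context features, with the token and position features combined by simple addition. These choices, and the names (RNNPositionContextEncoder, hidden_dim, etc.), are illustrative assumptions rather than the exact configuration of the embodiment.

```python
import torch
import torch.nn as nn

class RNNPositionContextEncoder(nn.Module):
    """Sketch of the three-stage feature extraction:
    token features -> RNN-based position features -> context features."""

    def __init__(self, vocab_size: int, hidden_dim: int = 768,
                 num_heads: int = 12, num_layers: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)             # first features
        self.pos_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # second features
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.context_enc = nn.TransformerEncoder(layer, num_layers=num_layers)  # third features

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        e = self.token_emb(input_ids)   # token (first) features
        p, _ = self.pos_rnn(e)          # position (second) features from the GRU states
        h = self.context_enc(e + p)     # context (third) features via self-attention
        return h

# Because the GRU carries order through its recurrence, no fixed table of 512
# absolute position embeddings is needed, so longer inputs can be encoded.
model = RNNPositionContextEncoder(vocab_size=32000)
ids = torch.randint(0, 32000, (1, 1024))   # a text longer than 512 tokens
h = model(ids)                             # (1, 1024, 768) context features
```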
 (Appendix)
 The following supplementary notes are further disclosed with respect to the embodiments described above.
 (Appendix 1)
 A feature quantity extraction device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor:
 extracts a first feature quantity of each piece of information in an information series;
 extracts, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
 extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
 (Appendix 2)
 The feature quantity extraction device according to appendix 1, wherein the processor extracts, for each piece of information in the information series, the third feature quantity reflecting a relationship between that information and other information in the information series.
 (Appendix 3)
 A learning device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor:
 extracts, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
 extracts, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
 extracts, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
 executes a task using the third feature quantity; and
 updates model parameters of the neural networks constituting the first model, the second model, and the third model based on an execution result of the task and correct answer information.
 (Appendix 4)
 A feature quantity extraction method in which a computer including a memory and at least one processor connected to the memory:
 extracts a first feature quantity of each piece of information in an information series;
 extracts, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
 extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
 (Appendix 5)
 A learning method in which a computer including a memory and at least one processor connected to the memory:
 extracts, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
 extracts, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
 extracts, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
 executes a task using the third feature quantity; and
 updates model parameters of the first model, the second model, and the third model based on an execution result of the task and correct answer information.
 (Appendix 6)
 A non-transitory storage medium storing a program executable by a computer to perform a feature quantity extraction process, the feature quantity extraction process comprising:
 extracting a first feature quantity of each piece of information in an information series;
 extracting, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
 extracting a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
 (Appendix 7)
 A non-transitory storage medium storing a program executable by a computer to perform a learning process, the learning process comprising:
 extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
 extracting, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
 extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
 executing a task using the third feature quantity; and
 updating model parameters of the neural networks constituting the first model, the second model, and the third model based on an execution result of the task and correct answer information.
 Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
100 Feature quantity extraction device
110 Token feature extraction unit
120 Position feature extraction unit
130 Context encoding unit
140 Classification unit
150 Update unit
160 Model parameter (1) storage unit
170 Model parameter (2) storage unit
180 Text data storage unit
200 Learning device
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims (7)

  1.  A feature quantity extraction device comprising:
     a first feature quantity extraction unit that extracts a first feature quantity of each piece of information in an information series;
     a second feature quantity extraction unit, which is a model using a recurrent neural network, that extracts, using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
     a third feature quantity extraction unit that extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
  2.  The feature quantity extraction device according to claim 1, wherein the third feature quantity extraction unit extracts, for each piece of information in the information series, the third feature quantity reflecting a relationship between that information and other information in the information series.
  3.  A learning device comprising:
     a first feature quantity extraction unit that extracts a first feature quantity of each piece of information in an information series;
     a second feature quantity extraction unit, which is a model using a recurrent neural network, that extracts, using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
     a third feature quantity extraction unit that extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
     a task execution unit that executes a task using the third feature quantity; and
     an update unit that updates model parameters of the neural networks constituting the first feature quantity extraction unit, the second feature quantity extraction unit, and the third feature quantity extraction unit based on a task execution result output from the task execution unit and correct answer information.
  4.  A feature quantity extraction method executed by a feature quantity extraction device, the method comprising:
     a step of extracting a first feature quantity of each piece of information in an information series;
     a step of extracting, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
     a step of extracting a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
  5.  A learning method executed by a learning device, the method comprising:
     a step of extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
     a step of extracting, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
     a step of extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
     a step of executing a task using the third feature quantity; and
     a step of updating model parameters of the first model, the second model, and the third model based on an execution result of the task and correct answer information.
  6.  A program for causing a computer to function as each unit of the feature quantity extraction device according to claim 1 or 2.
  7.  A program for causing a computer to function as each unit of the learning device according to claim 3.
PCT/JP2021/008258 2021-03-03 2021-03-03 Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program WO2022185457A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/008258 WO2022185457A1 (en) 2021-03-03 2021-03-03 Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program

Publications (1)

Publication Number Publication Date
WO2022185457A1 true WO2022185457A1 (en) 2022-09-09

Family

ID=83155167

Country Status (1)

Country Link
WO (1) WO2022185457A1 (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen, Kehai; Wang, Rui; Utiyama, Masao; Sumita, Eiichiro. "Recurrent Positional Embedding for Neural Machine Translation." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Stroudsburg, PA, USA, November 2019, pages 1361-1367. DOI: 10.18653/v1/D19-1139 *

Similar Documents

Publication Publication Date Title
JP7285895B2 (en) Multitask learning as question answering
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
US11062179B2 (en) Method and device for generative adversarial network training
CN110737758B (en) Method and apparatus for generating a model
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
CN114641779A (en) Countermeasure training of machine learning models
CN115485696A (en) Countermeasure pretraining of machine learning models
JP6772213B2 (en) Question answering device, question answering method and program
JP6649536B1 (en) Dialogue processing device, learning device, dialogue processing method, learning method and program
WO2019212006A1 (en) Phenomenon prediction device, prediction model generation device, and phenomenon prediction program
JP2019049604A (en) Instruction statement estimation system and instruction statement estimation method
WO2014073206A1 (en) Information-processing device and information-processing method
JP2015169951A (en) information processing apparatus, information processing method, and program
US20230107409A1 (en) Ensembling mixture-of-experts neural networks
JP2022145623A (en) Method and device for presenting hint information and computer program
Thomas et al. Chatbot using gated end-to-end memory networks
US20220383119A1 (en) Granular neural network architecture search over low-level primitives
US11829722B2 (en) Parameter learning apparatus, parameter learning method, and computer readable recording medium
CN114528387A (en) Deep learning conversation strategy model construction method and system based on conversation flow bootstrap
JP6605997B2 (en) Learning device, learning method and program
JP6586026B2 (en) Word vector learning device, natural language processing device, method, and program
WO2023116572A1 (en) Word or sentence generation method and related device
CN111832699A (en) Computationally efficient expressive output layer for neural networks
WO2022185457A1 (en) Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 21929033
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 21929033
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: JP