WO2022185457A1 - Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program - Google Patents

Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program Download PDF

Info

Publication number
WO2022185457A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
feature
feature amount
feature quantity
model
Prior art date
Application number
PCT/JP2021/008258
Other languages
French (fr)
Japanese (ja)
Inventor
康仁 大杉
いつみ 斉藤
京介 西田
仙 吉田
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2021/008258 priority Critical patent/WO2022185457A1/en
Publication of WO2022185457A1 publication Critical patent/WO2022185457A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology

Definitions

  • A learning device in which: a first model using a neural network extracts a first feature quantity of each piece of information in an information series; a second model using a recurrent neural network extracts, using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; a third model using a neural network extracts a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities; a task is executed using the third feature quantities; and model parameters of the neural networks constituting the first model, the second model, and the third model are updated based on the execution result of the task and correct answer information.
  • A feature quantity extraction method executed by a computer, comprising: extracting a first feature quantity of each piece of information in an information series; extracting, by a model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; and extracting a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities.
  • A learning method executed by a computer, comprising: extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series; extracting, by a second model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities; executing a task using the third feature quantities; and updating model parameters of the first model, the second model, and the third model based on the execution result of the task and correct answer information.
  • A non-transitory storage medium storing a program executable by a computer to perform a feature quantity extraction process, wherein the feature quantity extraction process includes: extracting a first feature quantity of each piece of information in an information series; extracting, by a model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; and extracting a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities.
  • A non-transitory storage medium storing a program executable by a computer to perform a learning process, wherein the learning process includes: extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series; extracting, by a second model using a recurrent neural network and using the first feature quantities, a second feature quantity that is a feature quantity related to the position of each piece of information in the information series; extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantities and the second feature quantities; executing a task using the third feature quantities; and updating model parameters of the neural networks constituting the first model, the second model, and the third model based on the execution result of the task and correct answer information.
  • Reference signs: 100 Feature quantity extraction device, 110 Token feature quantity extraction unit, 120 Position feature quantity extraction unit, 130 Context encoding unit, 140 Classification unit, 150 Update unit, 160 Model parameter (1) storage unit, 170 Model parameter (2) storage unit, 180 Text data storage unit, 200 Learning device, 1000 Drive device, 1001 Recording medium, 1002 Auxiliary storage device, 1003 Memory device, 1004 CPU, 1005 Interface device, 1006 Display device, 1007 Input device, 1008 Output device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

Provided is a feature quantity extraction device comprising: a first feature quantity extraction unit for extracting a first feature quantity of each item of information in an information series; a second feature quantity extraction unit, which is a model using a recurrent neural network, for extracting, using the first feature quantities, a second feature quantity related to the position of each item of information in the information series; and a third feature quantity extraction unit for extracting a third feature quantity of each item of information in the information series by using the first and second feature quantities.

Description

Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program
The present invention relates to a neural network model that obtains distributed representations of tokens.
In recent years, research on language models has been actively conducted, including BERT (Bidirectional Encoder Representations from Transformers) disclosed in Non-Patent Document 1. A language model here is a neural network model that obtains distributed representations of tokens. In this specification, a token denotes a unit of distributed representation, such as a word. For example, in Non-Patent Document 1, words are further divided into subwords and distributed representations are computed per subword; in that case, the token is a subword.
Because a language model is given not a single token but the entire text in which the token appears, it can obtain a distributed representation that reflects the semantic relationships with the other tokens in the text.
The step of learning these distributed representations is called pre-training. Pre-trained distributed representations can then be used to solve various tasks such as text classification and question answering; this step is called fine-tuning.
The model disclosed in Non-Patent Document 1 achieves high performance on each fine-tuning task by learning accurate distributed representations of tokens through pre-training on large-scale language resources.
In the language model disclosed in Non-Patent Document 1, the Transformer's attention mechanism and position embedding are key elements. As described in Section 3.2 of Non-Patent Document 2, the attention mechanism computes weights representing how strongly a given token is related to the other tokens and computes the token's distributed representation based on those weights. Position embedding (Section 3.5 of Non-Patent Document 2) is a feature quantity representing the position of a token within the text.
The language model disclosed in Non-Patent Document 1 cannot handle long texts (long token sequences) well. The reason is that only a number of position embeddings fixed at the pre-training stage are learned. The position embeddings of Non-Patent Document 1 are vectors that depend on the absolute position of each token and are among the learned parameters.
For example, the language model of Non-Patent Document 1 prepares 512 position embeddings and can therefore handle positions of up to 512 tokens in a text. If a text is longer than 512 tokens, the 513th and subsequent tokens cannot be processed together with the preceding tokens, and their relationships with the other tokens may not be reflected appropriately.
The difficulty of properly handling long sequences described above is not limited to token sequences; it can arise in any sequence of information.
The present invention has been made in view of the above points, and its purpose is to provide a technique that makes it possible to extract, for each piece of information in an information sequence of arbitrary length, a feature quantity that appropriately reflects its relationships with the other pieces of information.
According to the disclosed technology, a feature quantity extraction device is provided that comprises: a first feature quantity extraction unit that extracts a first feature quantity of each piece of information in an information series; a second feature quantity extraction unit, which is a model using a recurrent neural network, that extracts from the first feature quantities a second feature quantity representing the position of each piece of information in the information series; and a third feature quantity extraction unit that extracts a third feature quantity of each piece of information in the information series using the first and second feature quantities.
According to the disclosed technology, it is possible to extract, for each piece of information in an information sequence of arbitrary length, a feature quantity that appropriately reflects its relationships with the other pieces of information.
FIG. 1 is a configuration diagram of the feature quantity extraction device. FIG. 2 is a flowchart showing the operation procedure of the feature quantity extraction device. FIG. 3 is a configuration diagram of the learning device. FIG. 4 is a flowchart showing the operation procedure of the learning device. FIG. 5 is a hardware configuration diagram of the apparatus. FIG. 6 is a diagram showing experimental results.
An embodiment of the present invention (the present embodiment) is described below with reference to the drawings. The embodiments described below are merely examples, and embodiments to which the present invention can be applied are not limited to them. For example, although tokens are used below as an example of the information in an information sequence, the processing described below can also be applied to information other than tokens (for example, images).
(Overview of Embodiment)
In the present embodiment, the position embedding of the language model of Non-Patent Document 1 is replaced with position embedding based on a recurrent neural network (RNN). This removes the limit on sequence length and makes it possible, for text of arbitrary length, to extract for each token a feature quantity that reflects its relationships with the other tokens.
In the present embodiment, the continuity of tokens is modeled by the RNN, and the RNN outputs a feature quantity representing the position of each token in the token sequence.
As mentioned above, the language model disclosed in Non-Patent Document 1 prepares only a fixed number of position embeddings and therefore cannot handle texts longer than that. In the present embodiment, by contrast, token continuity is modeled by the RNN, so even for an unseen position the RNN can compute the token's position information from its relative positional relationships with the preceding and following tokens. Texts of unknown length can therefore be handled, solving the problem of the language model disclosed in Non-Patent Document 1.
Examples 1 and 2 are described below as detailed examples of the present invention. Example 1 describes a feature quantity extraction device 100 comprising a language model that extracts contextual feature quantities from text. Example 2 describes a learning device 200 that learns the model parameters of the language model constituting the feature quantity extraction device 100.
(Example 1)
<Device configuration>
FIG. 1 shows a configuration example of the feature quantity extraction device 100 according to Example 1. As shown in FIG. 1, the feature quantity extraction device 100 has a token feature quantity extraction unit 110, a position feature quantity extraction unit 120, and a context encoding unit 130. The token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 may also be called a first feature quantity extraction unit, a second feature quantity extraction unit, and a third feature quantity extraction unit, respectively. Likewise, the token feature quantity, position feature quantity, and context feature quantity may be called a first feature quantity, a second feature quantity, and a third feature quantity, respectively.
The context feature quantities obtained by the feature quantity extraction device 100 may be used for task execution by an external device, or the feature quantity extraction device 100 may itself include a task execution unit and execute tasks using the context feature quantities.
Text is input to the feature quantity extraction device 100, and the feature quantity extraction device 100 extracts contextual feature quantities from the input text.
By inputting the context feature quantities obtained by the feature quantity extraction device 100 to a classification unit specialized for a specific task (which may also be called a task execution unit), a specific task such as a word fill-in task or a text classification task can be solved. The token feature quantity extraction unit 110, the position feature quantity extraction unit 120, the context encoding unit 130, and the classification unit are all implemented as neural networks.
The configuration of the language model constituting the feature quantity extraction device 100 is based on that of the language model disclosed in Non-Patent Document 1. In the language model of Non-Patent Document 1, however, the model corresponding to the token feature quantity extraction unit 110 and the model corresponding to the position feature quantity extraction unit 120 operate independently, whereas in Example 1 the output of the token feature quantity extraction unit 110 is input to the position feature quantity extraction unit 120, so the two are not independent.
<Operation example>
An operation example in which the feature quantity extraction device 100 of Example 1 obtains contextual feature quantities from text is described in detail below, following the flowchart of FIG. 2. In the following description, a text is a token sequence, and the length of the text is the length of that token sequence. The feature quantities obtained below may each have a different number of dimensions, but for simplicity they all have the same dimensionality d in this example.
In S101, a text S = {s_1, s_2, ..., s_L}, which is a sequence of tokens s_i, is input to the token feature quantity extraction unit 110.
In S102, the token feature quantity extraction unit 110 extracts a sequence of token feature quantities {w_1, w_2, ..., w_L} from the text, where each token feature quantity w_i satisfies w_i ∈ R^d, i.e., w_i is a d-dimensional real vector.
The token feature quantity extraction unit 110 may have any configuration as long as it is a model that outputs a feature quantity (vector) corresponding to each token in the text. For example, as in Non-Patent Document 1, given a predetermined vocabulary set V, one vector is assigned to each token in the vocabulary and treated as a learned parameter, thereby providing the feature quantity corresponding to each token. That is, the feature quantity of a token is a vector of d weight parameters per token learned in the neural network. If the number of tokens in the vocabulary is V, then V × d is the number of weight parameters of the neural network model constituting the token feature quantity extraction unit 110. In the following, this vector is called an embedding.
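As an illustration, this kind of token feature quantity extraction amounts to a lookup table of learned embeddings. The following is a minimal sketch in PyTorch; the vocabulary size, dimensionality d, and variable names are illustrative assumptions, not values taken from this publication.

```python
import torch
import torch.nn as nn

# Illustrative sizes (not specified in this publication)
vocab_size = 30000   # |V|: number of tokens in the vocabulary
d = 768              # dimensionality shared by all feature quantities

# One learned d-dimensional vector (embedding) per vocabulary token,
# i.e. V x d weight parameters in total.
token_embedding = nn.Embedding(vocab_size, d)

# A text is a sequence of L token ids s_1, ..., s_L.
token_ids = torch.tensor([[12, 845, 3, 997, 65]])   # shape (batch=1, L=5)

# Token feature quantities {w_1, ..., w_L}, each w_i in R^d.
w = token_embedding(token_ids)                       # shape (1, L, d)
print(w.shape)  # torch.Size([1, 5, 768])
```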
The sequence of token feature quantities {w_1, w_2, ..., w_L} obtained by the token feature quantity extraction unit 110 is input to the position feature quantity extraction unit 120 and the context encoding unit 130.
In S103, the position feature quantity extraction unit 120 extracts from the token feature quantity sequence {w_1, w_2, ..., w_L} a sequence of position feature quantities (position embeddings) {p_1, p_2, ..., p_L} reflecting the positional relationships of the tokens, where p_i ∈ R^d. The position feature quantity extraction unit 120 may be any model that extracts from the token feature quantities a feature quantity (vector) reflecting the positional relationships of the tokens.
In this example, a model consisting of a recurrent neural network (RNN) is used as the position feature quantity extraction unit 120. Regarding the order of the tokens in the token sequence as the progression of time, the token feature quantities are input to the RNN in time order, i.e., in the order w_1, w_2, ..., w_L.
At each time step, the RNN receives the token feature quantity for that time step together with the hidden-state information of the previous time step, and from these it computes and outputs the hidden-state information for the current time step. This hidden-state information corresponds to the position feature quantity.
The RNN of this example may be unidirectional or bidirectional. Using a bidirectional RNN in particular makes it possible to extract the relative positional relationship of a token from the preceding and following tokens. There are various kinds of RNNs, such as LSTMs and GRUs, and any of them may be adopted. Furthermore, the RNN may be stacked in multiple layers; the number of layers is not particularly limited and may be one or more.
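A bidirectional RNN of this kind can be sketched as follows. This is a hedged example using a PyTorch bidirectional LSTM; projecting the concatenated forward and backward hidden states back to d dimensions is one possible design choice, not a detail prescribed by the publication.

```python
import torch
import torch.nn as nn

class RNNPositionEmbedding(nn.Module):
    """Illustrative sketch of the position feature quantity extraction unit 120.

    Consumes the token feature quantities {w_1, ..., w_L} in order and
    outputs one position feature quantity p_i in R^d per token, taken
    from the RNN hidden states.
    """
    def __init__(self, d: int, num_layers: int = 1):
        super().__init__()
        # A bidirectional LSTM lets p_i reflect both preceding and
        # following tokens; a GRU or a unidirectional RNN would also fit.
        self.rnn = nn.LSTM(input_size=d, hidden_size=d, num_layers=num_layers,
                           batch_first=True, bidirectional=True)
        # Map the 2*d-dimensional bidirectional hidden state back to d.
        self.proj = nn.Linear(2 * d, d)

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # w: (batch, L, d) token feature quantities
        hidden, _ = self.rnn(w)          # (batch, L, 2*d) hidden states
        p = self.proj(hidden)            # (batch, L, d) position features
        return p

d = 768
pos_extractor = RNNPositionEmbedding(d)
w = torch.randn(1, 5, d)                 # token features from the embedding
p = pos_extractor(w)                     # position features {p_1, ..., p_L}
print(p.shape)                           # torch.Size([1, 5, 768])
```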
The token feature quantity sequence {w_1, w_2, ..., w_L} and the position feature quantity sequence {p_1, p_2, ..., p_L} are input to the context encoding unit 130.
In S104, the context encoding unit 130 computes a sequence of context feature quantities {h_1, h_2, ..., h_L} from the token feature quantity sequence {w_1, w_2, ..., w_L} and the position feature quantity sequence {p_1, p_2, ..., p_L}. In S105, the sequence of context feature quantities {h_1, h_2, ..., h_L} is output, where each context feature quantity h_i satisfies h_i ∈ R^d.
The context encoding unit 130 may be any neural network model that has a mechanism for taking the surrounding context (that is, information about the surrounding tokens other than the i-th token) into account when computing the feature quantity of the i-th token.
For example, the Transformer Encoder disclosed in Non-Patent Document 2 can be used as the context encoding unit 130. In this case, the vector obtained by adding the token feature quantity w_i and the position feature quantity p_i is fed to the Transformer Encoder (context encoding unit 130) as the i-th input.
Regarding the input to the Transformer Encoder (context encoding unit 130), further feature quantities may be added on top of the sum of the token feature quantity w_i and the position feature quantity p_i. For example, in the technique disclosed in Non-Patent Document 1, for the task of estimating the semantic relationship between two sentences, a segment feature quantity is newly created to distinguish the first sentence from the second and is added token by token (Fig. 2 of Non-Patent Document 1). In this example as well, a feature quantity g_i for distinguishing sentences, analogous to that segment feature quantity, may additionally be added to the sum of the token feature quantity w_i and the position feature quantity p_i.
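Putting the pieces together, a forward pass of the feature quantity extraction device 100 could look like the following sketch, which feeds the element-wise sum w_i + p_i into a standard Transformer encoder. The layer counts, head count, sizes, and the use of torch.nn.TransformerEncoder are assumptions made for illustration; the optional segment feature g_i is omitted.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Illustrative sketch of the feature quantity extraction device 100."""
    def __init__(self, vocab_size=30000, d=768, nhead=12, num_encoder_layers=12):
        super().__init__()
        # Token feature quantity extraction unit 110
        self.token_embedding = nn.Embedding(vocab_size, d)
        # Position feature quantity extraction unit 120 (bidirectional RNN)
        self.pos_rnn = nn.LSTM(d, d, batch_first=True, bidirectional=True)
        self.pos_proj = nn.Linear(2 * d, d)
        # Context encoding unit 130 (Transformer Encoder with attention)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_encoder_layers)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        w = self.token_embedding(token_ids)          # {w_i}, (batch, L, d)
        hidden, _ = self.pos_rnn(w)
        p = self.pos_proj(hidden)                    # {p_i}, (batch, L, d)
        # The i-th encoder input is the sum w_i + p_i (a segment feature
        # g_i could additionally be added here).
        h = self.encoder(w + p)                      # {h_i}, (batch, L, d)
        return h

model = FeatureExtractor()
token_ids = torch.randint(0, 30000, (1, 700))        # longer than 512 tokens
h = model(token_ids)                                  # context feature quantities
print(h.shape)                                        # torch.Size([1, 700, 768])
```

Because the position features come from the RNN rather than a fixed table, the same sketch runs unchanged on sequences longer than 512 tokens, which is the point of the proposed configuration.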
Like the Transformer Encoder, the context encoding unit 130 of this example uses an attention mechanism to take into account, for each token, its relationships with the other tokens, and outputs a feature quantity reflecting them. Since the attention mechanism itself is a technique disclosed in Non-Patent Document 2, only its outline is described here.
As disclosed in Non-Patent Document 2, the attention mechanism is expressed by the following equation (1).
    Attention(Q, K, V) = softmax(QK^T / √d) V    (1)

When considering the relationships between the token whose feature quantity is being computed and the other tokens, Q, K, and V are matrices obtained by linearly transforming the tokens' feature quantities (here, the feature quantity obtained by adding the token feature quantity w_i and the position feature quantity p_i), with Q, K, V ∈ R^(d×L). In equation (1), the factor

    softmax(QK^T / √d)

computes scores (probabilities) representing how strongly the token in question is related to the other tokens, based on the inner products between the tokens' feature quantities. Using these scores as weights, the weighted sum of the vectors in V corresponding to the tokens is the output of attention, that is, a feature quantity representing how strongly the other tokens are related to the token in question. Adding this Attention(Q, K, V) to the token's own feature quantity (the sum of the token feature quantity w_i and the position feature quantity p_i) yields a feature quantity h_i that reflects the relevance between the token and the other tokens.
Because the context encoding unit 130 of this example is built around an attention mechanism, the position feature quantity extraction unit 120 (the RNN) can concentrate on capturing the positional relationships of the tokens while the high-accuracy context-capturing ability of attention is retained. In particular for long texts, this makes it possible to build a model that outputs more accurate context feature quantities than the conventional techniques disclosed in Non-Patent Documents 1 and 2.
(Example 2)
Next, Example 2 is described. Example 2 describes a method of learning the model parameters (1) of the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 that constitute the feature quantity extraction device 100 described in Example 1. In Example 2, the model parameters of the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 constituting the feature quantity extraction device 100 are referred to as "model parameters (1)", and the model parameters of the classification unit 140 described later are referred to as "model parameters (2)".
The learning method is not limited to a specific one; it suffices to learn the model parameters so that some task performed on the input text yields the correct answer. In this example, as one example, a method of learning the model parameters using a word fill-in task (Task #1 Masked LM in Section 3.1 of Non-Patent Document 1) is described.
In Example 2, the device that learns the model parameters is called the learning device 200. The learning device 200 may also be used for actual task execution after learning.
FIG. 3 shows a configuration example of the learning device 200. As shown in FIG. 3, in addition to the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 described in Example 1, the learning device 200 includes a classification unit 140, an update unit 150, a model parameter (1) storage unit 160, a model parameter (2) storage unit 170, and a text data storage unit 180.
The classification unit 140 is a mechanism (a neural network model) that predicts the masked words (tokens) in the word fill-in task. The update unit 150 is a mechanism that simultaneously updates model parameters (1) and (2) so that the error between the correct token and the predicted token becomes small. As the update method, for example, error backpropagation, a common method of supervised learning, can be used. The learning method is described below following the flowchart of FIG. 4.
In S201, preparations are made. First, a set of text data is prepared and stored in the text data storage unit 180. As the text data, data published on the Web, such as Wikipedia, can be used.
Next, masked texts are created from the text data. For example, each Wikipedia paragraph is extracted as one text, the text is split into tokens with an appropriate tokenizer, some tokens are selected, and each selected token is replaced with the mask token ([MASK]) or with another randomly chosen token, or kept as it is. This yields a text in which some tokens of the token sequence are masked (referred to as a "masked text"). The conditions for replacement and retention may be the same as those disclosed in Non-Patent Document 1. A token selected for replacement or retention is treated as a correct token, and this token becomes a prediction target. Model parameters (1) and (2) are initialized, for example with random values.
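The masking step can be sketched as follows. The 15% selection rate and the 80/10/10 split between [MASK], a random token, and keeping the original are the usual BERT-style settings from Non-Patent Document 1; treat the exact numbers and function names as illustrative assumptions.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, select_rate=0.15, seed=None):
    """Return (masked_tokens, targets); targets[i] holds the correct token
    at positions chosen as prediction targets, and None elsewhere."""
    rng = random.Random(seed)
    masked, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_rate:
            continue
        targets[i] = tok                     # correct token to predict
        r = rng.random()
        if r < 0.8:
            masked[i] = MASK_TOKEN           # replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)    # replace with a random token
        # else: keep the token as it is
    return masked, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = "the cat sat on the mat".split()
masked, targets = mask_tokens(tokens, vocab, seed=0)
print(masked, targets)
```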
In S202, a masked text is input to the token feature quantity extraction unit 110, and, as described in Example 1, the token feature quantity extraction unit 110, the position feature quantity extraction unit 120, and the context encoding unit 130 process it to obtain the sequence of context feature quantities {h_1, h_2, ..., h_L} corresponding to the masked text. Here, h_i in {h_1, h_2, ..., h_L} is the context feature quantity for the i-th token. A context feature quantity may also be called a distributed representation.
In S203, the sequence of context feature quantities {h_1, h_2, ..., h_L} is input to the classification unit 140, and the classification unit 140 outputs predicted tokens.
The classification unit 140 is a mechanism that predicts the i-th token from a predetermined vocabulary based on the feature quantity h_i for the i-th token. For example, the classification unit 140 uses a one-layer feed-forward network to convert h_i into a feature quantity y_i ∈ R^(d') whose dimensionality is the vocabulary size d', and predicts a token from the vocabulary using the index of the maximum element of y_i (an index indicating one of the d' vocabulary entries (tokens)).
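Such a classification unit amounts to a single linear layer over the vocabulary followed by an argmax. A minimal sketch, with the vocabulary size d' chosen purely for illustration:

```python
import torch
import torch.nn as nn

d, d_vocab = 768, 30000            # d' = vocabulary size (illustrative)

# One-layer feed-forward network mapping h_i in R^d to y_i in R^{d'}.
classifier = nn.Linear(d, d_vocab)

h = torch.randn(1, 5, d)           # context feature quantities {h_i}
y = classifier(h)                  # (1, L, d') scores over the vocabulary

# The predicted token is the index of the maximum element of y_i.
predicted_ids = y.argmax(dim=-1)   # (1, L)
print(predicted_ids)
```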
In S204, the predicted tokens and the correct tokens are input to the update unit 150, and model parameters (1) and (2) are updated by supervised learning. S202 to S204 are then repeated with the updated model parameters (1) and (2). In this way, model parameters (1) and (2) are learned so that accurate predictions can be made.
In S205, learning ends if a termination condition is satisfied. The termination condition may be that the number of iterations has reached a predetermined number, that the update amount of the model parameters has become smaller than a threshold, or some other condition.
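A single iteration of S202 to S204 can be sketched as below: a cross-entropy loss between the predicted and correct tokens is backpropagated, and one optimizer updates model parameters (1) and (2) simultaneously. For brevity a plain embedding stands in for the full extractor of Example 1, and the Adam optimizer, the ignore-index marking of non-target positions, and all sizes are illustrative assumptions, not requirements of the publication.

```python
import torch
import torch.nn as nn

# Stand-ins: `extractor` plays the role of the units holding parameters (1)
# (in practice the token/position/context units of Example 1), and
# `classifier` is the classification unit 140 holding parameters (2).
extractor = nn.Embedding(30000, 768)
classifier = nn.Linear(768, 30000)

optimizer = torch.optim.Adam(
    list(extractor.parameters()) + list(classifier.parameters()), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # -100 marks non-target positions

def train_step(masked_ids, target_ids):
    # masked_ids, target_ids: (batch, L); target_ids is -100 wherever the
    # token was not selected as a prediction target.
    h = extractor(masked_ids)                          # S202: context feature quantities
    logits = classifier(h)                             # S203: scores over the vocabulary
    loss = loss_fn(logits.view(-1, logits.size(-1)),   # S204: error between the
                   target_ids.view(-1))                #       predicted and correct tokens
    optimizer.zero_grad()
    loss.backward()                                    # error backpropagation
    optimizer.step()                                   # update parameters (1) and (2) together
    return loss.item()

masked_ids = torch.randint(0, 30000, (2, 128))
target_ids = torch.full((2, 128), -100)
target_ids[:, 10] = 42                                 # one prediction target per text
print(train_step(masked_ids, target_ids))
```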
(Hardware configuration example)
Both the feature quantity extraction device 100 and the learning device 200 can be realized, for example, by causing a computer to execute a program. This computer may be a physical computer or a virtual machine on the cloud. The feature quantity extraction device 100 and the learning device 200 are collectively referred to as the "apparatus".
That is, the apparatus can be realized by executing a program corresponding to the processing performed by the apparatus using hardware resources such as the CPU and memory built into the computer. The program can be recorded on a computer-readable recording medium (such as portable memory) and saved or distributed. The program can also be provided through a network such as the Internet or by e-mail.
FIG. 5 is a diagram showing a hardware configuration example of the computer. The computer of FIG. 5 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and so on, which are interconnected by a bus BS. Some of these devices may be omitted; for example, the display device 1006 may be omitted when no display is performed.
 当該コンピュータでの処理を実現するプログラムは、例えば、CD-ROM又はメモリカード等の記録媒体1001によって提供される。プログラムを記憶した記録媒体1001がドライブ装置1000にセットされると、プログラムが記録媒体1001からドライブ装置1000を介して補助記憶装置1002にインストールされる。但し、プログラムのインストールは必ずしも記録媒体1001より行う必要はなく、ネットワークを介して他のコンピュータよりダウンロードするようにしてもよい。補助記憶装置1002は、インストールされたプログラムを格納すると共に、必要なファイルやデータ等を格納する。 A program that implements the processing in the computer is provided by a recording medium 1001 such as a CD-ROM or memory card, for example. When the recording medium 1001 storing the program is set in the drive device 1000 , the program is installed from the recording medium 1001 to the auxiliary storage device 1002 via the drive device 1000 . However, the program does not necessarily need to be installed from the recording medium 1001, and may be downloaded from another computer via the network. The auxiliary storage device 1002 stores installed programs, as well as necessary files and data.
 メモリ装置1003は、プログラムの起動指示があった場合に、補助記憶装置1002からプログラムを読み出して格納する。CPU1004は、メモリ装置1003に格納されたプログラムに従って、当該装置に係る機能を実現する。インタフェース装置1005は、ネットワークに接続するためのインタフェースとして用いられ、送信部及び受信部として機能する。表示装置1006はプログラムによるGUI(Graphical User Interface)等を表示する。入力装置1007はキーボード及びマウス、ボタン、又はタッチパネル等で構成され、様々な操作指示を入力させるために用いられる。出力装置1008は演算結果を出力する。 The memory device 1003 reads and stores the program from the auxiliary storage device 1002 when a program activation instruction is received. The CPU 1004 implements functions related to the device according to programs stored in the memory device 1003 . The interface device 1005 is used as an interface for connecting to a network and functions as a transmitter and a receiver. A display device 1006 displays a GUI (Graphical User Interface) or the like by a program. An input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, or the like, and is used to input various operational instructions. The output device 1008 outputs the calculation result.
 (Experiments)
 For various tasks, experiments were conducted using the technique of Non-Patent Document 1 and the proposed technique according to the present invention described in the first and second embodiments. The experimental results are shown in FIG. 6, which lists the task names and the evaluation results (acc. (correct answer rate) or F1 score).
 The specific tasks targeted in the experiments are word fill-in, question answering, text classification, and interactive question answering. An outline of each task is as follows.
 The word fill-in task is Task #1 Masked LM described in Section 3.1 of Non-Patent Document 1.
 The question answering task is given a long text and a question, and extracts the answer span from the long text.
 The text classification task is given answer choices, their corresponding explanatory texts (long texts), and a question, and selects the choice that answers the question.
 The interactive question answering task is given the text of a long dialogue history and a question, and extracts the answer to the question from that text.
 As experimental conditions, the word fill-in experiments were run with a maximum text length of 512 tokens, and the other tasks with a maximum text length of 1024 tokens.
 As shown in FIG. 6, in the experiments with texts longer than 512 tokens, the proposed technique obtained better results than the technique disclosed in Non-Patent Document 1.
 (Effects of the embodiment, etc.)
 With the technique according to the present embodiment described above, the model for extracting context features uses an RNN-based position embedding that takes the preceding and following positional relationships into account. Consequently, for text of arbitrary length (for example, longer than 512 tokens), it is possible to obtain accurate features, that is, features that appropriately reflect the relationships between each token and the other tokens.
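 As a concrete illustration of this pipeline, the following is a minimal sketch assuming a GRU is used as the recurrent network that produces the position features and a standard Transformer encoder produces the context features, with the token and position features combined by simple addition. These choices, and the names (RNNPositionContextEncoder, hidden_dim, etc.), are illustrative assumptions rather than the exact configuration of the embodiment.

```python
import torch
import torch.nn as nn

class RNNPositionContextEncoder(nn.Module):
    """Sketch of the three-stage feature extraction:
    token features -> RNN-based position features -> context features."""

    def __init__(self, vocab_size: int, hidden_dim: int = 768,
                 num_heads: int = 12, num_layers: int = 2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)             # first features
        self.pos_rnn = nn.GRU(hidden_dim, hidden_dim, batch_first=True)   # second features
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads,
                                           batch_first=True)
        self.context_enc = nn.TransformerEncoder(layer, num_layers=num_layers)  # third features

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        e = self.token_emb(input_ids)   # token (first) features
        p, _ = self.pos_rnn(e)          # position (second) features from the GRU states
        h = self.context_enc(e + p)     # context (third) features via self-attention
        return h

# Because the GRU carries order through its recurrence, no fixed table of 512
# absolute position embeddings is needed, so longer inputs can be encoded.
model = RNNPositionContextEncoder(vocab_size=32000)
ids = torch.randint(0, 32000, (1, 1024))   # a text longer than 512 tokens
h = model(ids)                             # (1, 1024, 768) context features
```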
 (Appendix)
 The following supplementary notes are further disclosed with respect to the embodiments described above.
 (Appendix 1)
 A feature quantity extraction device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor:
 extracts a first feature quantity of each piece of information in an information series;
 extracts, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
 extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
 (Appendix 2)
 The feature quantity extraction device according to appendix 1, wherein the processor extracts, for each piece of information in the information series, the third feature quantity reflecting a relationship between that information and other information in the information series.
 (Appendix 3)
 A learning device comprising:
 a memory; and
 at least one processor connected to the memory,
 wherein the processor:
 extracts, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
 extracts, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
 extracts, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
 executes a task using the third feature quantity; and
 updates model parameters of the neural networks constituting the first model, the second model, and the third model based on an execution result of the task and correct answer information.
 (Appendix 4)
 A feature quantity extraction method in which a computer including a memory and at least one processor connected to the memory:
 extracts a first feature quantity of each piece of information in an information series;
 extracts, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
 extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
 (Appendix 5)
 A learning method in which a computer including a memory and at least one processor connected to the memory:
 extracts, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
 extracts, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
 extracts, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
 executes a task using the third feature quantity; and
 updates model parameters of the first model, the second model, and the third model based on an execution result of the task and correct answer information.
 (Appendix 6)
 A non-transitory storage medium storing a program executable by a computer to perform a feature quantity extraction process, the feature quantity extraction process comprising:
 extracting a first feature quantity of each piece of information in an information series;
 extracting, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
 extracting a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
 (Appendix 7)
 A non-transitory storage medium storing a program executable by a computer to perform a learning process, the learning process comprising:
 extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
 extracting, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
 extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
 executing a task using the third feature quantity; and
 updating model parameters of the neural networks constituting the first model, the second model, and the third model based on an execution result of the task and correct answer information.
 Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
100 Feature quantity extraction device
110 Token feature extraction unit
120 Position feature extraction unit
130 Context encoding unit
140 Classification unit
150 Update unit
160 Model parameter (1) storage unit
170 Model parameter (2) storage unit
180 Text data storage unit
200 Learning device
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims (7)

  1.  A feature quantity extraction device comprising:
     a first feature quantity extraction unit that extracts a first feature quantity of each piece of information in an information series;
     a second feature quantity extraction unit, which is a model using a recurrent neural network, that extracts, using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
     a third feature quantity extraction unit that extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
  2.  The feature quantity extraction device according to claim 1, wherein the third feature quantity extraction unit extracts, for each piece of information in the information series, the third feature quantity reflecting a relationship between that information and other information in the information series.
  3.  A learning device comprising:
     a first feature quantity extraction unit that extracts a first feature quantity of each piece of information in an information series;
     a second feature quantity extraction unit, which is a model using a recurrent neural network, that extracts, using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
     a third feature quantity extraction unit that extracts a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
     a task execution unit that executes a task using the third feature quantity; and
     an update unit that updates model parameters of the neural networks constituting the first feature quantity extraction unit, the second feature quantity extraction unit, and the third feature quantity extraction unit based on a task execution result output from the task execution unit and correct answer information.
  4.  A feature quantity extraction method executed by a feature quantity extraction device, the method comprising:
     a step of extracting a first feature quantity of each piece of information in an information series;
     a step of extracting, by a model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series; and
     a step of extracting a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity.
  5.  A learning method executed by a learning device, the method comprising:
     a step of extracting, by a first model using a neural network, a first feature quantity of each piece of information in an information series;
     a step of extracting, by a second model using a recurrent neural network and using the first feature quantity, a second feature quantity that is a feature quantity relating to the position of each piece of information in the information series;
     a step of extracting, by a third model using a neural network, a third feature quantity of each piece of information in the information series using the first feature quantity and the second feature quantity;
     a step of executing a task using the third feature quantity; and
     a step of updating model parameters of the first model, the second model, and the third model based on an execution result of the task and correct answer information.
  6.  A program for causing a computer to function as each unit of the feature quantity extraction device according to claim 1 or 2.
  7.  A program for causing a computer to function as each unit of the learning device according to claim 3.
PCT/JP2021/008258 2021-03-03 2021-03-03 Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program WO2022185457A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/008258 WO2022185457A1 (en) 2021-03-03 2021-03-03 Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program

Publications (1)

Publication Number Publication Date
WO2022185457A1 true WO2022185457A1 (en) 2022-09-09

Family

ID=83155167

Country Status (1)

Country Link
WO (1) WO2022185457A1 (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Chen, Kehai; Wang, Rui; Utiyama, Masao; Sumita, Eiichiro. "Recurrent Positional Embedding for Neural Machine Translation." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Stroudsburg, PA, USA, November 2019, pages 1361-1367. DOI: 10.18653/v1/D19-1139 *

Similar Documents

Publication Publication Date Title
JP7285895B2 (en) Multitask learning as question answering
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
US11062179B2 (en) Method and device for generative adversarial network training
CN110737758B (en) Method and apparatus for generating a model
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
CN114641779A (en) Countermeasure training of machine learning models
CN115485696A (en) Countermeasure pretraining of machine learning models
JP6772213B2 (en) Question answering device, question answering method and program
JP6649536B1 (en) Dialogue processing device, learning device, dialogue processing method, learning method and program
WO2019212006A1 (en) Phenomenon prediction device, prediction model generation device, and phenomenon prediction program
JP2019049604A (en) Instruction statement estimation system and instruction statement estimation method
WO2014073206A1 (en) Information-processing device and information-processing method
JP2015169951A (en) information processing apparatus, information processing method, and program
US20230107409A1 (en) Ensembling mixture-of-experts neural networks
JP2022145623A (en) Method and device for presenting hint information and computer program
Thomas et al. Chatbot using gated end-to-end memory networks
US20220383119A1 (en) Granular neural network architecture search over low-level primitives
US11829722B2 (en) Parameter learning apparatus, parameter learning method, and computer readable recording medium
CN114528387A (en) Deep learning conversation strategy model construction method and system based on conversation flow bootstrap
JP6605997B2 (en) Learning device, learning method and program
JP6586026B2 (en) Word vector learning device, natural language processing device, method, and program
WO2023116572A1 (en) Word or sentence generation method and related device
CN111832699A (en) Computationally efficient expressive output layer for neural networks
WO2022185457A1 (en) Feature quantity extraction device, learning device, feature quantity extraction method, learning method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 21929033
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 21929033
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: JP