WO2025009063A1 - Learning device, learning method, and program - Google Patents

Learning device, learning method, and program

Info

Publication number
WO2025009063A1
Authority
WO
WIPO (PCT)
Prior art keywords
cost function
learning
cost
feature representation
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2023/024781
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
昌弘 安田
登 原田
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp
Priority to JP2025530858A (JPWO2025009063A1, ja)
Priority to PCT/JP2023/024781 (WO2025009063A1, ja)
Publication of WO2025009063A1
Anticipated expiration
Ceased (current legal status)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present invention relates to feature representation learning, which embeds sensor signals into a feature space, and to learning various tasks (so-called downstream tasks) such as object detection using the obtained features, and in particular to a learning device, learning method, and program for performing this multitask learning.
  • MIM: masked image modeling
  • SSL: self-supervised learning
  • Multi-task learning is a machine learning technique that solves multiple tasks with a single model.
  • a learning device includes a feature representation extraction unit that extracts a feature representation vector from an input signal using an encoder that performs feature representation extraction; a downstream task execution unit that executes one or more downstream tasks using the feature representation vector; a first cost calculation unit that uses the feature representation vector to calculate a first cost function as a constraint on the learning of the encoder; a second cost calculation unit that calculates a second cost function for learning the downstream task using the output of the downstream task and correct answer data for the downstream task; a cost function weight calculation unit that calculates real weights for at least one of the first cost function and the second cost function; and a third cost calculation unit that calculates a multitasking cost function using the first cost function, the second cost function, and the real weights, and updates the encoder parameters and the downstream task parameters so as to reduce the multitasking cost function.
  • the present invention makes it possible to perform SSL-based feature representation learning even when data is scarce or the task is difficult to set up, and further makes it possible to prevent undesired local solutions from arising in the early stages of learning when SSL-based feature representation learning and downstream tasks are solved together by multitask learning.
  • FIG. 1 is a functional block diagram of the learning device according to the first embodiment.
  • FIG. 2 is a diagram showing an example of a processing flow of the learning device according to the first embodiment.
  • FIG. 3 is a functional block diagram of the multitask execution device according to the first embodiment.
  • FIG. 4 is a diagram showing an example of a processing flow of the multitask execution device according to the first embodiment.
  • FIG. 5 shows experimental results when a distributed multimodal event detection task was performed using models trained under the experimental conditions.
  • FIG. 6 is a diagram showing an example of the configuration of a computer to which the present technique is applied.
  • a signal is input, and an encoder such as a network that fuses feature representations of multiple modalities is used to extract a feature representation of the input signal. Furthermore, the output of the encoder is input, and one or more downstream tasks are executed using a predetermined function for each of them. Furthermore, a cost function is obtained as a constraint on the learning of the encoder, and a cost function for learning the downstream task is obtained. A real weight for each cost function is obtained using a formula given in advance. Furthermore, a multitask cost function is calculated using the obtained cost function and the weight of the cost function. By performing the above processing, it is possible to calculate the cost used in multitask learning.
  • $z = E(x; \theta)$ (1)
  • $x$ is an arbitrary input signal
  • $E$ is an encoder that extracts a feature representation from the input signal
  • $\theta$ is a parameter of the encoder
  • $z$ is a feature representation vector obtained by the encoder
  • $D_n$ is a function for performing the n-th downstream task among $N$ downstream tasks
  • $\phi_n$ is a parameter of the function $D_n$
  • $y_n$ is the output of the n-th downstream task.
  • N may be 1.
  • $L_{E_m}$ is the m-th of $M$ constraints on the learning of the encoder $E$, given in the form of a loss function.
  • the notation A_B denotes $A_B$, that is, A with subscript B.
  • this constraint may be any constraint, such as the cost of restoring the masked input in MIM (see Non-Patent Document 1) or $L_2$ regularization of the parameter $\theta$ of the encoder $E$.
  • $L_{D_n}$ is the cost function for the n-th downstream task. $\lambda_{E_m}$ and $\lambda_{D_n}$ are the weights for the respective loss functions.
  • the weights $\lambda_{E_m}$ and $\lambda_{D_n}$ are scheduled according to the following equation:
  • $\lambda_* = S_*(\mathrm{epoch}) \quad (0 \le \lambda_* \le 1)$ (4)
  • epoch is the number of learning epochs (the number of repetitions when the same learning data is used repeatedly for learning)
  • $S_*$ is an arbitrary scheduling function that returns a real number between 0 and 1.
  • the scheduling function determines the weight for each iteration; for example, it makes it possible to increase the weight of the downstream tasks in the early stages of learning and to emphasize the constraint $L_{E_m}$ on the encoder as learning progresses.
  • FIG. 1 is a functional block diagram of a learning device 100 according to the first embodiment, and FIG. 2 shows the processing flow thereof.
  • the learning device 100 includes a feature representation extraction unit 201, a downstream task execution unit 301, a first cost calculation unit 401, a second cost calculation unit 402, a cost function weight calculation unit 403, and a third cost calculation unit 404.
  • the learning device 100 receives learning data as input, performs multitask learning, and outputs a learned model.
  • the learning data includes sensor information $x$, which is data acquired by one or more sensors, and correct answer data $y'_n$ for the downstream tasks.
  • the learning device 100 is a special device configured by loading a special program into a publicly known or dedicated computer having, for example, a central processing unit (CPU), a main memory (RAM), etc.
  • the learning device 100 executes each process under the control of the central processing unit, for example.
  • Data input to the learning device 100 and data obtained in each process are stored, for example, in the main memory, and the data stored in the main memory is read out to the central processing unit as necessary and used for other processes.
  • At least a part of each processing unit of the learning device 100 may be configured by hardware such as an integrated circuit.
  • Each memory unit provided in the learning device 100 can be configured by, for example, a main memory such as a RAM (Random Access Memory), or middleware such as a relational database or a key-value store.
  • each storage unit does not necessarily need to be provided inside the learning device 100, but may be configured as an auxiliary storage device made up of a hard disk, optical disk, or semiconductor memory element such as flash memory, and may be configured to be provided
  • the feature representation extraction unit 201 accepts sensor information x, which is data acquired by one or more sensors, as an input signal to the system, and extracts and outputs a feature representation vector z according to equation (1) using an encoder E that performs feature representation extraction (S201).
  • the sensor may be a camera or a microphone
  • the sensor information may be a video signal or image (such as an RGB image) acquired by a camera, a sound signal acquired by a microphone, or data obtained from the sound signal (such as an acoustic spectrogram or an acoustic feature such as MFCC).
  • the encoder E includes, for example, ResNet34 for image input, VGGish for acoustic spectrogram input, and a network that fuses feature representations of multiple modalities.
  • the downstream task execution unit 301 receives the feature representation vector $z$ as input, executes the $N$ downstream tasks using the functions $D_n$ of equation (2) (S301), and obtains and outputs the outputs $y_n$ of the downstream tasks.
  • Possible downstream tasks include, for example, an event detection task for identifying the time and type of an event contained in an input signal provided by an image or sound, and a segmentation task for identifying the location of the event.
  • the first cost calculation unit 401 receives the feature representation vector $z$ as input, calculates a cost function $L_{E_m}$ as a constraint on the learning of the encoder $E$ that extracts the feature representation (S401), and outputs the calculated cost function $L_{E_m}$.
  • for example, the $L_1$ cost function given in Non-Patent Document 1 for restoring a masked input signal may be used as $L_{E_m}$.
  • the second cost calculation unit 402 receives the output $y_n$ of the downstream task and the correct answer data $y'_n$ for the downstream task, and uses them to calculate and output a cost function $L_{D_n}$ for learning the downstream task (S402).
  • the cost function $L_{D_n}$ may be, for example, the MSE (Mean Squared Error) loss between the correct answer label of the event detection, given in one-hot representation, and the output of the downstream task.
  • the cost function weight calculation unit 403 calculates and outputs the real-valued weights $\lambda_{E_m}$, $\lambda_{D_n}$ of the cost functions according to equation (4), using the number of learning iterations epoch (S403).
  • the scheduling function is a function that determines the weight according to the number of iterations.
  • for example, a scheduling function can be used that prioritizes the cost function $L_{D_n}$ of the downstream tasks in the early stages of learning and prioritizes the constraint $L_{E_m}$ imposed on encoder learning as learning progresses.
  • the scheduling function may, for example, be set in one of the following ways: (1) as the number of iterations increases, the weight $\lambda_{E_m}$ is set to a larger value and the weight $\lambda_{D_n}$ is set to a smaller value.
  • (2) a constant that does not depend on the number of iterations is set as the weight $\lambda_{D_n}$, and the weight $\lambda_{E_m}$ is set to a larger value as the number of iterations increases.
  • (3) a constant that does not depend on the number of iterations is set as the weight $\lambda_{E_m}$, and the weight $\lambda_{D_n}$ is set to a smaller value as the number of iterations increases.
  • the scheduling function may set the weights $\lambda_{E_m}$ and $\lambda_{D_n}$ to values that are independent of the number of iterations, or the weights $\lambda_{E_m}$ and $\lambda_{D_n}$ may be constants.
  • the cost function weight calculation unit 403 may calculate the weights $\lambda_{E_m}$ and $\lambda_{D_n}$ according to the difficulty of the downstream task.
  • the cost function weight calculation unit 403 may be configured to calculate the real-valued weights $\lambda_{E_m}$, $\lambda_{D_n}$ prior to learning, store them in a storage unit (not shown), and retrieve them from the storage unit during learning.
  • the weights $\lambda_{E_m}$, $\lambda_{D_n}$ may be determined manually, in which case the cost function weight calculation unit 403 receives the manually determined weights as input without using a scheduling function and outputs them as is, without performing any calculation.
  • the third cost calculation unit 404 updates the parameter $\theta$ in equation (1) and the parameters $\phi_n$ in equation (2) so as to reduce the multitask cost function $L$ (S404A), and outputs the updated parameters $\theta$ and $\phi_n$ to the feature representation extraction unit 201 and the downstream task execution unit 301, respectively.
  • backpropagation can be used to update the parameters $\theta$ and $\phi_n$.
  • when a predetermined condition is satisfied (YES in S404B), the learning device 100 outputs the parameters $\theta$ and $\phi_n$ at that point as the trained model; when the condition is not satisfied (NO in S404B), the learning device 100 controls each unit to repeat processes S201 to S404A.
  • the predetermined condition may be a condition for determining whether the updates of the parameters $\theta$ and $\phi_n$ have converged; for example, whether learning has been repeated a certain number of times (e.g., several times), whether the difference between the parameters before and after the update is at or below a predetermined threshold, or whether the loss is at or below a predetermined threshold.
  • a multitask execution device 500 using a trained model will be described.
  • FIG. 3 is a functional block diagram of a multitask execution device 500 according to the first embodiment, and FIG. 4 shows the processing flow thereof.
  • the multitask execution device 500 includes a feature representation extraction unit 501 and a downstream task execution unit 601.
  • the multitask execution device 500 receives the learned parameters $\theta$ and $\phi_n$ before executing the tasks.
  • the multitask execution device 500 receives as input sensor information $p$ (the multitask target data), which is data acquired by one or more sensors, executes the downstream tasks, and outputs the outputs $q_n$ of the downstream tasks.
  • the feature representation extraction unit 501 has the same configuration as the feature representation extraction unit 201; it receives the sensor information $p$, which is data acquired by one or more sensors, and extracts and outputs a feature representation vector $r$ using the learned parameter $\theta$ and the encoder $E$ that extracts feature representations (S501).
  • Distributed multimodal event detection is a task described in Reference 1, in which the input is the observation signals of multiple spatially distributed cameras and microphones, and the output is the identification of the type and time of an event, such as a human action, that occurs in the space in which they are installed.
  • the feature representation extraction unit 201 extracts embeddings from the video signal included in the sensor information x using a video signal encoder, and extracts embeddings from the sound signal using a sound signal encoder.
  • the mask for sensor $s$ is a real vector of the same dimension as the feature sequence of sensor $s$, each element of which takes the value 0 or 1 with a certain probability.
  • "$\odot$" denotes the Hadamard (element-wise) product.
  • the masked sensor feature sequence is fused using a module such as that described in Reference 1.
  • the parameters $\theta$ and $\phi_n$ are updated in the third cost calculation unit 404 so that the fusion feature $z$ obtained from the masked sensor feature sequence and the fusion feature $\tilde{z}$ obtained from the original sensor feature sequence approach each other.
  • the model parameters of the “Target” network are updated using the exponential moving average of the model parameters of the “Online” network.
  • the decay rate of this moving average is a value between 0 and 1. This configuration smooths the fluctuations of $\tilde{z}$ during the learning process, stabilizing the learning of the system.
  • the “Online” and “Target” networks output the fused feature sequences $z$ and $\tilde{z}$, respectively.
  • the downstream task in this experiment is an event detection task that uses weak labels as the ground truth labels, and the MSE loss between the ground truth labels and the time average of the output is given as $L_{D_1}$.
  • $g^{\mathrm{bag}} = (g_1^{\mathrm{bag}}, \ldots, g_N^{\mathrm{bag}}) \in \{0,1\}^N$ is used as the weak label.
  • the third cost calculation unit 404 calculates the multitask cost function $L$ according to equation (6)', using the cost functions $L_{E_1}$, $L_{D_1}$ and the weight $\lambda$.
  • two settings of $\lambda$ are used: equation (7), which increases $\lambda$ according to the number of learning iterations, and equation (8), which fixes $\lambda$ to the average value of $\lambda$ in equation (7) within the range of the learning epochs.
  • MaxEpoch is the maximum number of epochs for learning
  • the hyperparameter of the schedule was set to 1.05 in this experiment.
  • $\lambda_0$ is a predetermined value chosen so that $\lambda$ takes a value between 0 and 1
  • $\lambda_{\max}$ is the maximum value of $\lambda$ in equation (7).
  • the cost function weight calculation unit 403 calculates the real-valued weight $\lambda$ according to equation (7), or the real-valued weight $\bar{\lambda}$ according to equation (8).
  • the weight $\lambda$ calculated by equation (7), or the weight $\bar{\lambda}$ calculated by equation (8), is used as $\lambda$ in equation (6)'.
  • the weight $\bar{\lambda}$ is calculated using the number of learning iterations, but it is a constant and can therefore be regarded as a value independent of the number of iterations.
  • the cost function weight calculation unit 403 needs to find only the weight $\lambda$ for the cost function $L_{E_1}$ of the encoder.
  • FIG. 5 shows the experimental results when a distributed multimodal event detection task was performed using a model trained under the above-mentioned experimental conditions.
  • the error ranges indicate the standard error over three experiments. Increasing the weight $\lambda$ as learning progresses gave the highest performance on all metrics. Learning is also stable, and the final score depends less on the initial values used for learning. This shows the effectiveness of introducing the scheduling according to the first embodiment into multitask learning.
  • the present invention is not limited to the above-mentioned embodiment and modified examples.
  • the above-mentioned various processes may be executed not only in chronological order as described, but also in parallel or individually depending on the processing capacity of the device executing the processes or as necessary.
  • appropriate modifications are possible within the scope of the present invention.
  • <Program and recording medium> The various processes described above can be implemented by loading a program that executes each step of the above method into the recording unit 2020 of the computer 2000 shown in FIG. 6, and operating the control unit 2010, input unit 2030, output unit 2040, display unit 2050, etc.
  • the program describing this processing can be recorded on a computer-readable recording medium.
  • Examples of computer-readable recording media include magnetic recording devices, optical disks, magneto-optical recording media, and semiconductor memories.
  • the program may be distributed, for example, by selling, transferring, or lending portable recording media such as DVDs or CD-ROMs on which the program is recorded. Furthermore, the program may be distributed by storing the program in a storage device of a server computer and transferring the program from the server computer to other computers via a network.
  • a computer that executes such a program first stores, for example, the program recorded on a portable recording medium or the program transferred from a server computer in its own storage device. Then, when executing a process, the computer reads the program stored on its own recording medium and executes the process according to the read program. As another execution form of the program, the computer may read the program directly from the portable recording medium and execute the process according to the program, or may execute the process according to the received program each time a program is transferred from the server computer to the computer.
  • the above-mentioned process may also be executed by a so-called ASP (Application Service Provider) type service that does not transfer the program from the server computer to the computer, but realizes the processing function only by issuing an execution instruction and obtaining the results.
  • the program in this form includes information used for processing by an electronic computer that is equivalent to a program (such as data that is not a direct command to the computer but has properties that specify the processing of the computer).
  • the device is configured by executing a specific program on a computer, but at least a portion of the processing may be realized by hardware.
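
The weighting scheme described in the Definitions above (a real-valued weight per cost function, obtained from a scheduling function of the epoch and then used to combine the costs) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicants' implementation: the weighted-sum form of the multitask cost, the linear ramp, and the names `linear_ramp` and `multitask_cost` are hypothetical; the text only requires that each scheduling function return a real number between 0 and 1.

```python
from typing import Callable, Sequence

# A schedule maps the current epoch to a weight in [0, 1].
Schedule = Callable[[int], float]


def linear_ramp(max_epoch: int, rising: bool = True) -> Schedule:
    """Hypothetical scheduling function (assumption: a clamped linear ramp)."""
    if max_epoch <= 0:
        raise ValueError("max_epoch must be positive")

    def schedule(epoch: int) -> float:
        # Clamp the progress fraction so the weight always stays in [0, 1].
        frac = min(max(epoch / max_epoch, 0.0), 1.0)
        return frac if rising else 1.0 - frac

    return schedule


def multitask_cost(
    encoder_costs: Sequence[float],
    downstream_costs: Sequence[float],
    encoder_schedules: Sequence[Schedule],
    downstream_schedules: Sequence[Schedule],
    epoch: int,
) -> float:
    """Weighted sum of encoder-constraint costs and downstream-task costs.

    Assumption: the multitask cost is a plain weighted sum; the source
    only says it is computed from the costs and their weights.
    """
    if len(encoder_costs) != len(encoder_schedules):
        raise ValueError("one schedule is needed per encoder cost")
    if len(downstream_costs) != len(downstream_schedules):
        raise ValueError("one schedule is needed per downstream cost")

    total = 0.0
    for cost, schedule in zip(encoder_costs, encoder_schedules):
        total += schedule(epoch) * cost
    for cost, schedule in zip(downstream_costs, downstream_schedules):
        total += schedule(epoch) * cost
    return total
```

With a rising ramp for the encoder constraint and a falling ramp for the downstream cost, the downstream task dominates early in learning and the encoder constraint dominates later, which corresponds to option (1) among the schedules listed above.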
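
The masking of sensor features and the update of the “Target” network mentioned in the experiment description above can be sketched in Python as follows. This is only an illustration: the corresponding equations are not reproduced in this text, so the Bernoulli mask with keep probability `keep_prob`, the update rule target = decay * target + (1 - decay) * online, and all function names are assumptions rather than the method as filed.

```python
import random
from typing import List, Optional


def bernoulli_mask(dim: int, keep_prob: float,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Vector whose elements are 1.0 with probability keep_prob, else 0.0."""
    if not 0.0 <= keep_prob <= 1.0:
        raise ValueError("keep_prob must lie in [0, 1]")
    if rng is None:
        rng = random.Random()
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(dim)]


def apply_mask(features: List[float], mask: List[float]) -> List[float]:
    """Hadamard (element-wise) product of a feature vector and a mask."""
    if len(features) != len(mask):
        raise ValueError("features and mask must have the same dimension")
    return [f * m for f, m in zip(features, mask)]


def ema_update(target: List[float], online: List[float],
               decay: float) -> List[float]:
    """Exponential moving average of the online parameters.

    Assumed convention: a decay close to 1 keeps the target nearly
    unchanged, which smooths the target network's output over time.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if len(target) != len(online):
        raise ValueError("parameter vectors must have the same length")
    return [decay * t + (1.0 - decay) * o for t, o in zip(target, online)]
```

In a training loop, `ema_update` would be called once per parameter update of the “Online” network, so that the “Target” network trails it smoothly.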

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)
PCT/JP2023/024781 2023-07-04 2023-07-04 Learning device, learning method, and program Ceased WO2025009063A1 (ja)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2025530858A JPWO2025009063A1 (ja) 2023-07-04 2023-07-04
PCT/JP2023/024781 WO2025009063A1 (ja) 2023-07-04 2023-07-04 Learning device, learning method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/024781 WO2025009063A1 (ja) 2023-07-04 2023-07-04 Learning device, learning method, and program

Publications (1)

Publication Number Publication Date
WO2025009063A1 (ja) 2025-01-09

Family

ID=94171289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/024781 Ceased WO2025009063A1 (ja) 2023-07-04 2023-07-04 Learning device, learning method, and program

Country Status (2)

Country Link
JP (1) JPWO2025009063A1 (ja)
WO (1) WO2025009063A1 (ja)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021501391A (ja) * 2017-10-26 2021-01-14 Magic Leap, Inc. Gradient normalization systems and methods for adaptive loss balancing in deep multitask networks
JP2021503662A (ja) * 2017-11-20 2021-02-12 Koninklijke Philips N.V. Training a neural network model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TAWARA, NAOHIRO ET AL.: "Towards the realization of disentangling neural networks for speaker and phonetic feature extraction", PROCEEDINGS OF THE 2019 SPRING MEETING OF THE ACOUSTICAL SOCIETY OF JAPAN; TOKYO, JAPAN; MARCH 3-5, 2019, vol. 2019, 1 January 2019 (2019-01-01) - 5 March 2019 (2019-03-05), pages 1003 - 1004, XP009560766 *

Also Published As

Publication number Publication date
JPWO2025009063A1 (ja) 2025-01-09

Similar Documents

Publication Publication Date Title
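CN113902926B (zh) General image object detection method and device based on a self-attention mechanism
US11468324B2 (en) Method and apparatus with model training and/or sequence recognition
CN112633419B (zh) Few-shot learning method and apparatus, electronic device, and storage medium
US10990852B1 (en) Method and apparatus for training model for object classification and detection
CN110520871B (zh) Training machine learning models using learning progress measurements
KR102158683B1 (ko) Augmenting neural networks with external memory
CN113065013B (zh) Image annotation model training and image annotation method, system, device, and medium
US20180268285A1 (en) Neural network cooperation
KR20200129639A (ko) Model training method and apparatus
KR20200128938A (ko) Model training method and apparatus
US20240127586A1 (en) Neural networks with adaptive gradient clipping
JP2019159058A (ja) Speech recognition system, speech recognition method, and trained model
JP6827911B2 (ja) Acoustic model learning device, speech recognition device, methods therefor, and program
CN110930996B (zh) Model training method, speech recognition method, apparatus, storage medium, and device
WO2019235283A1 (ja) Model learning device, method, and program
CN110490304A (zh) Data processing method and device
WO2019138897A1 (ja) Learning device and method, and program
Vaněk et al. A regularization post layer: An additional way how to make deep neural networks robust
CN113449840A (zh) Neural network training method and apparatus, and image classification method and apparatus
CN108475346A (zh) Neural random access machines
US20240020553A1 (en) Interactive electronic device for performing functions of providing responses to questions from users and real-time conversation with the users using models learned by deep learning technique and operating method thereof
JP7095747B2 (ja) Acoustic model learning device, model learning device, methods therefor, and program
CN120726421B (zh) Multimodal recognition method, apparatus, device, storage medium, and program product based on noisy labels
JP7047665B2 (ja) Learning device, learning method, and learning program
WO2025009063A1 (ja) Learning device, learning method, and program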

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23944315

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2025530858

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2025530858

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE