WO2023014398A1 - Self-supervised learning with model augmentation - Google Patents

Self-supervised learning with model augmentation

Info

Publication number
WO2023014398A1
WO2023014398A1 (PCT/US2022/013743)
Authority
WO
WIPO (PCT)
Prior art keywords
encoder
embedding
masking
neural network
layers
Prior art date
Application number
PCT/US2022/013743
Other languages
French (fr)
Inventor
Zhiwei Liu
Caiming Xiong
Jia Li
Yongjun Chen
Original Assignee
Salesforce.Com, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US 17/579,377 (published as US20230042327A1)
Application filed by Salesforce.Com, Inc.
Priority to CN202280060208.5A (published as CN117918014A)
Publication of WO2023014398A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0985Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A method for providing a neural network system includes performing contrastive learning to the neural network system to generate a trained neural network system. The performing the contrastive learning includes performing first model augmentation to a first encoder of the neural network system to generate a first embedding of a sample, performing second model augmentation to the first encoder to generate a second embedding of the sample, and optimizing the first encoder using a contrastive loss based on the first embedding and the second embedding. The trained neural network system is provided to perform a task.

Description

SELF-SUPERVISED LEARNING WITH MODEL AUGMENTATION
RELATED APPLICATIONS
[0001] This application claims priority to U.S. Non-Provisional Patent Application No. 17/579,377, filed January 19, 2022, U.S. Provisional Patent Application No. 63/252,375, filed October 5, 2021, and U.S. Provisional Patent Application No. 63/230,474, filed August 6, 2021, which are incorporated by reference herein in their entireties.
TECHNICAL FIELD
[0002] The present disclosure relates generally to neural networks and more specifically to machine learning systems and contrastive self-supervised learning (SSL) with model augmentation.
BACKGROUND
[0003] Sequential recommendation in machine learning aims at predicting future items in sequences, where one crucial part is to characterize item relationships in sequences. Traditional sequence modeling in machine learning may be used to verify the superiority of the transformer, e.g., the self-attention mechanism, in revealing item correlations in sequences. For example, a transformer may be used to infer the sequence embedding at specified positions by weighted aggregation of item embeddings, where the weights are learned via self-attention.
[0004] However, the data sparsity issue and noise in sequences undermine the performance of a neural network model (also referred to as a model) in sequential recommendation. The former hinders performance due to insufficient training, since the complex structure of a sequential model requires a dense corpus to be adequately trained. The latter also impedes the recommendation ability of a model because noisy item sequences are unable to reveal actual item correlations.
[0005] Accordingly, it would be advantageous to develop systems and methods for improved sequential recommendation.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] FIG. 1 is a simplified diagram of a computing device according to some embodiments described herein.
[0007] FIG. 2 is a simplified diagram of a method of performing contrastive learning using model augmentation according to some embodiments described herein.
[0008] FIG. 3 is a simplified diagram illustrating an example contrastive self-supervised learning system, according to some embodiments described herein.
[0009] FIG. 4 is a simplified diagram of a method of performing model augmentation for contrastive learning, according to some embodiments described herein.
[0010] FIG. 5A is a simplified diagram of an example neuron masking module for implementing model augmentation using neuron masking, according to some embodiments described herein.
[0011] FIG. 5B illustrates another example contrastive learning system with model augmentation using neuron masking, according to some embodiments described herein.
[0012] FIG. 6A illustrates an example layer dropping module for implementing model augmentation using layer dropping, according to some embodiments described herein.
[0013] FIG. 6B illustrates another example contrastive learning system with model augmentation using layer dropping, according to some embodiments described herein.
[0014] FIG. 7 illustrates another example contrastive learning system with model augmentation using neuron masking and layer dropping, according to some embodiments described herein.
[0015] FIG. 8 illustrates an example encoder complementing module for implementing model augmentation using encoder complementing, according to some embodiments described herein.
[0016] FIG. 9 is a simplified diagram of a computing device that implements the contrastive learning with model augmentation, according to some embodiments described herein.
[0017] In the figures, elements having the same designations have the same or similar functions.
DETAILED DESCRIPTION
[0018] As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
[0019] As used herein, the term “module” may comprise a hardware or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
[0020] FIG. 1 is a simplified diagram of a computing device 100 according to some embodiments. As shown in FIG. 1, computing device 100 includes a processor 110 coupled to memory 120. Operation of computing device 100 is controlled by processor 110. And although computing device 100 is shown with only one processor 110, it is understood that processor 110 may be representative of one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs) and/or the like in computing device 100. Computing device 100 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.
[0021] Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH- EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
[0022] Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system- on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
[0023] As shown, memory 120 includes a neural network module 130 that may be used to implement and/or emulate the neural network systems and models described further herein and/or to implement any of the methods described further herein. In some examples, neural network module 130 may be used to translate structured text. In some examples, neural network module 130 may also handle the iterative training and/or evaluation of a translation system or model used to translate the structured text. In some examples, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the contrastive learning with model augmentation methods described in further detail herein. In some examples, neural network module 130 may be implemented using hardware, software, and/or a combination of hardware and software. As shown, computing device 100 receives input 140, which is provided to neural network module 130; neural network module 130 then generates output 150.
[0024] As described above, sequential recommendation aims at predicting the next items in user behaviors, which can be solved by characterizing item relationships in sequences. To address the data sparsity and noise issues in sequences, a self-supervised learning (SSL) paradigm may be used to improve the performance, which employs contrastive learning between positive and negative views of sequences. Various methods may construct views by adopting augmentation from data perspectives, but such data augmentation has various issues. For example, optimal data augmentation methods may be hard to devise. Further, data augmentation methods may destroy sequential correlations. Moreover, such data augmentation may fail to incorporate comprehensive self-supervised signals. To address these issues, systems and methods for contrastive SSL using model augmentation are described below.
[0025] Referring to FIGS. 2 and 3, FIG. 2 is a simplified diagram of method 200 for performing contrastive SSL using model augmentation, and FIG. 3 is an example neural network system 300 for performing the method 200. One or more of the processes of method 200 may be implemented, at least in part, in the form of executable code stored on non-transitory, tangible, machine-readable media that when run by one or more processors may cause the one or more processors to perform one or more of the processes of method 200. In some embodiments, the method 200 may correspond to the method to determine neural network models used by neural network module 130 to perform training and/or perform inference using the neural network model for various tasks. While sequential recommendation tasks are used in the examples described below, method 200 may be used for various tasks including for example sequential recommendation, image recognition, and language modeling.
[0026] In various embodiments, method 200 implements model augmentation to construct view pairs for contrastive learning, e.g., as a complement to the data augmentation methods. Moreover, both single-level and multi-level model augmentation methods for constructing view pairs are described. In an example, a multi-level model augmentation method may include multiple levels of various model augmentation methods, including for example, the neuron mask method, the layer drop method, and the encoder complementing method. In another example, a single-level model augmentation method may include a single level of model augmentation, wherein the type of model augmentation may be determined based on a particular task. By using model augmentation, method 200 improves the performance (e.g., for sequential recommendation or other tasks) by constructing views for contrastive SSL with model augmentation.
[0027] The method 200 begins at block 201, where contrastive training is performed on a neural network model using one or more batches of training data, and each batch may include one or more original samples. For the description below, an example neural network model for sequential recommendation is used, and the original samples are also referred to as original sequences. For each original sequence in a training batch, blocks 202 through 212 may be performed.
[0028] At block 202, an original sequence from the training data is provided. Referring to the example of FIG. 3, an original sequence 302, denoted as sequence $s_u$, is received. Here, user and item sets are denoted as $\mathcal{U}$ and $\mathcal{V}$ respectively. Each user $u$ is associated with a sequence of items in chronological order $s_u = [v_1, \dots, v_t, \dots, v_{|s_u|}]$, where $v_t$ denotes the item that user $u$ has interacted with at time $t$ and $|s_u|$ is the total number of items. A sequential recommendation problem may be formulated as follows:
$$v_u^{*} = \arg\max_{v \in \mathcal{V}} P\big(v_{|s_u|+1} = v \mid s_u\big),$$
where $v_{|s_u|+1}$ denotes the next item in the sequence, and in an example, the neural network system for sequential recommendation may select a candidate item that has a highest probability for recommendation.
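As a minimal illustration of the formulation above (and not the patent's specific encoder architecture), the sketch below scores every candidate item by a dot product between the sequence embedding and an item-embedding table and recommends the highest-scoring item. The `encoder` and `item_embeddings` arguments are hypothetical stand-ins.

```python
# Minimal sketch of next-item selection under the formulation above; the
# encoder and item-embedding table are hypothetical stand-ins.
import torch

def recommend_next_item(encoder, item_embeddings, item_seq):
    """Return the id of the candidate item with the highest score.

    encoder:         maps a (1, T) item-id sequence to a (1, d) sequence embedding
    item_embeddings: (|V|, d) table of candidate item embeddings
    item_seq:        (1, T) tensor of item ids for one user sequence s_u
    """
    with torch.no_grad():
        seq_emb = encoder(item_seq)              # (1, d)
        scores = seq_emb @ item_embeddings.T     # (1, |V|), proportional to P(v_{|s_u|+1} = v | s_u)
        return scores.argmax(dim=-1).item()      # highest-probability candidate
```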
[0029] The method 200 may proceed to block 204, where first data augmentation is performed to the original sequence to generate a first augmented sequence. In some examples, the first data augmentation is optional. Various data augmentation techniques may be used, including for example, crop, mask, reorder, insert, substitute, and/or a combination thereof. Referring to the example of FIG. 3, an augmented sequence 304 is generated by performing first data augmentation to the original sequence 302. The first augmented sequence 304 is provided to an input of an encoder 306.
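For reference, a minimal sketch of two of the sequence-level data augmentations named above (crop and mask) follows; the crop/mask ratios and the reserved mask-token id are illustrative assumptions rather than values from the disclosure.

```python
# Illustrative sequence-level data augmentations (crop and mask only); the
# ratios and the mask-token id are assumptions for illustration.
import random

MASK_TOKEN = 0  # hypothetical id reserved for a masked item

def crop(seq, ratio=0.6):
    """Keep a random contiguous sub-sequence covering about `ratio` of the items."""
    keep = max(1, int(len(seq) * ratio))
    start = random.randint(0, len(seq) - keep)
    return seq[start:start + keep]

def mask(seq, ratio=0.3):
    """Replace a random subset of items with the mask token."""
    out = list(seq)
    for idx in random.sample(range(len(out)), k=int(len(out) * ratio)):
        out[idx] = MASK_TOKEN
    return out

original_sequence = [5, 12, 7, 33, 2, 18]
augmented_1, augmented_2 = crop(original_sequence), mask(original_sequence)
```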
[0030] The method 200 may proceed to block 206, where first model augmentation is performed to the encoder (e.g., encoder 306 of FIG. 3) to augment the encoder and generate a first embedding (e.g., embedding 310) of the first augmented sequence (e.g., augmented sequence 304 of FIG. 3). As shown in the example of FIG. 3, model augmentation module 307 performs model augmentation to the encoder to provide an augmented encoder, which generates an output embedding. In some embodiments, concatenation (e.g., by concatenation module 308) may be performed to the output of the augmented encoder to generate embedding 310.
[0031] The method 200 may proceed to block 208, where a second data augmentation is performed to the original sequence to generate a second augmented sequence. In some examples, the second data augmentation is optional. In some examples, the second data augmentation is different from the first data augmentation, and the second augmented sequence is different from the first augmented sequence. Referring to the example of FIG. 3, an augmented sequence 312 is generated by performing a second data augmentation to the original sequence 302. The second augmented sequence 312 is provided to an input of the encoder 306.
[0032] The method 200 may proceed to block 210, where second model augmentation is performed to the encoder (e.g., encoder 306 of FIG. 3) to generate a second embedding (e.g., embedding 316 of FIG. 3) of the second augmented sequence (e.g., augmented sequence 312 of FIG. 3). As shown in the example of FIG. 3, model augmentation module 307 performs model augmentation to the encoder 306 to provide an augmented encoder, which generates an output embedding. In some embodiments, concatenation (e.g., by concatenation module 308) may be performed to the output of the augmented encoder to generate embedding 316.
[0033] The method 200 may proceed to block 212, where an optimization process is performed to the encoder (e.g., encoder 306 of FIG. 3) using a contrastive loss based on the first embedding and the second embedding (e.g., embeddings 310 and 316 of FIG. 3). An example contrastive loss is provided as follows:
$$\mathcal{L}_{cl}\big(\tilde{h}_u^{(1)}, \tilde{h}_u^{(2)}\big) = -\log \frac{\exp\big(\mathrm{sim}(\tilde{h}_u^{(1)}, \tilde{h}_u^{(2)})\big)}{\sum_{v=1}^{2N} \mathbb{1}_{[v \neq u]} \exp\big(\mathrm{sim}(\tilde{h}_u^{(1)}, \tilde{h}_v)\big)},$$
where $\tilde{h}_u^{(1)}$ and $\tilde{h}_u^{(2)}$ denote two views (e.g., two embeddings) constructed for an original sequence $s_u$; $\mathbb{1}_{[v \neq u]}$ is an indicator function; and $\mathrm{sim}(\cdot, \cdot)$ is a similarity function, e.g., a dot-product function. Because each original sequence has 2 views, for a batch with N original sequences, there are 2N samples/views for training. The numerator of the contrastive loss function indicates the agreement maximization between a positive pair, while the denominator can be interpreted as pushing away those negative pairs.
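A minimal sketch of this contrastive objective follows, assuming dot-product similarity and a batch layout in which row i of each view tensor comes from the same original sequence; the optional temperature term is a common practical addition and an assumption here, not part of the formula as stated.

```python
# Minimal sketch of the contrastive loss above (NT-Xent style), assuming
# dot-product similarity; the temperature term is an assumption.
import torch
import torch.nn.functional as F

def contrastive_loss(view1, view2, temperature=1.0):
    """view1, view2: (N, d) embeddings; row i of each is a view of sequence i."""
    n = view1.shape[0]
    z = torch.cat([view1, view2], dim=0)                    # (2N, d) stacked views
    sim = (z @ z.T) / temperature                           # pairwise dot-product similarity
    # exclude self-similarity so the denominator sums over v != u only
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=sim.device)
    sim = sim.masked_fill(self_mask, float("-inf"))
    # the positive for row i is row (i + N) mod 2N
    pos_idx = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(sim.device)
    # cross-entropy yields -log( exp(sim(u, pos)) / sum_v exp(sim(u, v)) ), averaged over 2N rows
    return F.cross_entropy(sim, pos_idx)
```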
[0034] In various embodiments, for sequential recommendation, both the SSL and the next item prediction characterize the item relationships in sequences, which may be combined to generate a final loss $\mathcal{L}$ to optimize the encoder. An exemplary final loss is provided as follows:
$$\mathcal{L} = \mathcal{L}_{rec} + \lambda \, \mathcal{L}_{cl},$$
where $\mathcal{L}_{rec}$ is a loss associated with the next item prediction, $\lambda$ is a weighting hyper-parameter, and $\mathcal{L}_{cl}$ is a contrastive loss as discussed above. $\mathcal{L}_{cl}$ may be generated using two different views of a same original sequence for contrast, wherein the two different views are generated using data augmentation, model augmentation, and/or a combination thereof.
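The sketch below shows how the two terms might be combined in one training step; the cross-entropy form of the next-item loss and the weight `lam` are illustrative assumptions, and `contrastive_loss` refers to the sketch given earlier.

```python
# Sketch of one training step combining the next-item prediction loss with the
# contrastive loss; the cross-entropy form of L_rec and the weight lam are
# assumptions, and contrastive_loss is the earlier sketch.
import torch.nn.functional as F

def training_step(encoder, item_embeddings, seqs, view1_seqs, view2_seqs, target_items, lam=0.1):
    # L_rec: next-item prediction on the original sequences in the batch
    seq_emb = encoder(seqs)                            # (N, d)
    logits = seq_emb @ item_embeddings.T               # (N, |V|)
    rec_loss = F.cross_entropy(logits, target_items)   # target_items: (N,) next-item ids

    # L_cl: contrast two (data/model) augmented views of the same sequences
    h1, h2 = encoder(view1_seqs), encoder(view2_seqs)  # (N, d) each
    cl_loss = contrastive_loss(h1, h2)

    return rec_loss + lam * cl_loss                    # final loss L = L_rec + lam * L_cl
```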
[0035] At block 216, a trained neural network generated by contrastive learning at block 201 may be used to perform a task, e.g., a sequence recommendation task or any other suitable task. For example, the trained neural network may be used to generate a next item prediction for an input sequence.
[0036] Referring to FIGS. 4, 5A, 5B, 6A, 6B, 7, and 8, example model augmentation methods (e.g., for implementing blocks 206 and 210 of FIG. 2) are described. FIG. 4 illustrates an example model augmentation method 400; FIG. 5A illustrates an example neuron masking module (also referred to as neuron dropout module or dropout module) for implementing model augmentation using neuron masking; FIG. 5B illustrates an example contrastive learning system including a neuron masking module for model augmentation; FIG. 6A illustrates an example layer dropping module for implementing model augmentation using layer dropping; FIG. 6B illustrates an example contrastive learning system including a layer dropping module for model augmentation; FIG. 7 illustrates an example contrastive learning system including a model augmentation module with both neuron masking and layer dropping; and FIG. 8 illustrates an example encoder complementing module for implementing model augmentation using encoder complementing.
[0037] Referring to FIG. 4, an example model augmentation method 400 is illustrated that may use one or more levels of different model augmentation methods. The method 400 may proceed to block 402, where neuron masking is performed. Referring to FIG. 5A, an example neuron masking module 500 of an encoder (e.g., encoder 306 of FIG. 3) is illustrated. The neuron masking module 500 may receive hidden embeddings 502 (also referred to as input embeddings), and generate an output embedding 504 to the next layer, through a feedforward network (FFN) 506. In some embodiments, during training, the neuron masking module 500 may randomly mask partial neurons in each FFN layer 506, based on a respective masking probability p. A larger value of p leads to more intensive embedding perturbations. As such, by applying different masking probabilities in encoder 306 (e.g., with data augmented sequences 304 and 312 of FIG. 3 or without data augmentation), a pair of different views (e.g., embeddings 310 and 316 of FIG. 3) is generated from the same original sequence from model perspectives. In some embodiments, during each batch of training, the masked neurons are randomly selected, which results in comprehensive contrastive learning on model augmentation. In some embodiments, different probability values may be utilized for different FFN layers of the encoder. In some embodiments, the neuron masking probability is the same for different FFN layers of the encoder. Additionally or alternatively, the neuron masking method may be applied to any neural layer in a model of a neural network system to inject more perturbations.
[0038] Referring to FIG. 5B, an example contrastive learning system 550 includes a model augmentation module including neuron masking module 500. The contrastive learning system 550 is substantially similar to the contrastive learning system 300 of FIG. 3 except for the differences described below. In the example of FIG. 5B, during training, neuron masking module 500 performs neuron masking twice, by randomly masking partial neurons in one or more layers of the original encoder 306, for generating two different views of the same original sequence 302. Contrastive learning is performed using the two different views.
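A minimal sketch of the neuron masking described above (FIGS. 5A and 5B) follows, assuming a Bernoulli mask over the hidden neurons of one FFN layer applied only during training; the layer sizes, activation, and the choice not to rescale the surviving neurons are illustrative assumptions.

```python
# Minimal sketch of neuron masking inside one FFN layer; the Bernoulli mask,
# layer sizes, and activation are illustrative assumptions.
import torch
import torch.nn as nn

class NeuronMaskedFFN(nn.Module):
    def __init__(self, d_model, d_hidden, mask_prob=0.2):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.mask_prob = mask_prob  # p: a larger p injects more intensive perturbations

    def forward(self, hidden_embeddings):
        h = torch.relu(self.fc1(hidden_embeddings))
        if self.training:
            keep = torch.rand_like(h) >= self.mask_prob  # randomly mask a fraction p of neurons
            h = h * keep
        return self.fc2(h)

# Two forward passes with independently sampled masks yield two model-augmented views.
ffn = NeuronMaskedFFN(d_model=64, d_hidden=256, mask_prob=0.2).train()
x = torch.randn(8, 64)
view_a, view_b = ffn(x), ffn(x)
```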
[0039] At block 404, layer dropping is performed. In various embodiments, dropping partial layers of a neural network model (e.g., encoder 306 of FIG. 3) decreases the depth of the neural network model and reduces complexity. In some embodiments, only shallow embeddings for users and items may be required. In some embodiments, embeddings at shallow layers and deep layers are both important to reflect the comprehensive information of the data. As such, randomly dropping a fraction of layers during training may function as a form of regularization. Model augmentation using layer dropping may enable contrastive learning between embeddings with different layer depths. In an example, contrastive learning is achieved between shallow embeddings and deep embeddings, thus providing an enhancement, e.g., over models that only contrast between deep features.
[0040] In some embodiments, layers in the original encoder are dropped. In some of these embodiments, dropping layers, especially those necessary layers in the original encoder, may destroy original sequential correlations, and views generated by dropping layers may not be a positive pair. Alternatively, in some embodiments, instead of manipulating the original encoder, K FFN layers are stacked after the encoder, and M of them are dropped during each batch of training, where M and K are integers and M < K. In those embodiments, during layer dropping, layers of the original encoder are not dropped.
[0041] Referring to FIG. 6A, an example layer dropping module 600 for implementing layer dropping is illustrated. The layer dropping module 600 may receive embeddings 602 (also referred to as input embeddings) from the original encoder (e.g., encoder 306 of FIG. 3), and generate an output embedding 604. In various embodiments, the layer dropping module 600 may append K layers 606-1 through 606-K after the encoder 306, and randomly drop M of them (e.g., layer 606-2) during each batch of training, where M may be an integer between 0 and K-1. In some embodiments, the same K and M numbers of appended and dropped FFN layers are applied to encoder 306 for the separate views. In other embodiments, different K and M numbers are applied to encoder 306 for the separate views.
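A minimal sketch of this appended-layer dropping follows; the values of K and M, and the linear-plus-ReLU form of each appended layer, are illustrative assumptions, and the layers of the original encoder are left untouched.

```python
# Sketch of layer dropping: K layers appended after the encoder output, with M
# of them randomly skipped per training batch; K, M, and the layer form are
# illustrative assumptions.
import random
import torch
import torch.nn as nn

class AppendedLayerDrop(nn.Module):
    def __init__(self, d_model, num_layers_k=4, num_dropped_m=1):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU()) for _ in range(num_layers_k)]
        )
        self.num_dropped_m = num_dropped_m  # 0 <= M <= K-1

    def forward(self, encoder_embedding):
        active = list(range(len(self.layers)))
        if self.training:
            dropped = set(random.sample(active, k=self.num_dropped_m))
            active = [i for i in active if i not in dropped]
        h = encoder_embedding                  # layers of the original encoder are untouched
        for i in active:
            h = self.layers[i](h)
        return h

appended = AppendedLayerDrop(d_model=64).train()
emb = torch.randn(8, 64)                       # embeddings from the original encoder
view_a, view_b = appended(emb), appended(emb)  # two views with different effective depths
```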
[0042] Referring to FIG. 6B, an example contrastive learning system 650 includes a model augmentation module including layer dropping module 600. The layer dropping module 600 may perform model augmentation by appending a number of layers (e.g., multilayer perceptron (MLP) layers, self-attention layers, residual layers, other suitable layers, and/or a combination thereof) to the sequence encoder 306. In the example of FIG. 6B, during training, layer dropping module 600 performs layer dropping twice, by appending layers to the original encoder 306 and randomly dropping one or more of the appended layers, for generating two different views of the same original sequence 302. Contrastive learning is performed using the two different views.
[0043] Referring to FIG. 7, an example contrastive learning system 700 includes a model augmentation module 702 performing both neuron masking and layer dropping. The contrastive learning system 700 is substantially similar to the contrastive learning system 300 of FIG. 3 except the differences described below. The model augmentation module 702 may perform layer dropping by appending a number of layers (e.g., MLP layers or other suitable layers) to the sequence encoder 306, and randomly dropping M layers out of the K total appended layers during training. Furthermore, the model augmentation module 702 may perform neuron masking to randomly mask partial neurons in one or more layers in the original encoder 306 and/or the appended layers.
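One possible arrangement of the two augmentations in FIG. 7 is sketched below, reusing the NeuronMaskedFFN and AppendedLayerDrop classes from the earlier sketches; placing the masked FFN between the encoder and the appended layers is an illustrative choice, since the description allows masking in the original encoder and/or the appended layers.

```python
# Illustrative composition of neuron masking and layer dropping into one
# model-augmented branch; NeuronMaskedFFN and AppendedLayerDrop are the
# classes from the earlier sketches, and the ordering is an assumption.
import torch.nn as nn

class ModelAugmentedBranch(nn.Module):
    def __init__(self, encoder, d_model):
        super().__init__()
        self.encoder = encoder                                   # original encoder 306, unmodified
        self.masked_ffn = NeuronMaskedFFN(d_model, 4 * d_model)  # neuron masking
        self.layer_drop = AppendedLayerDrop(d_model)             # appended-layer dropping

    def forward(self, item_seq):
        h = self.encoder(item_seq)
        h = self.masked_ffn(h)
        return self.layer_drop(h)   # two calls in training mode yield two different views
```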
[0044] The method 400 may proceed to block 406, where encoder complementing is performed.
[0045] In various embodiments, during self-supervised learning, one single encoder may be employed to generate embeddings of two views of one sequence. While in some embodiments using a single encoder might be effective in revealing complex sequential correlations, contrasting on one single encoder may result in embedding collapse problems for self-supervised learning. Moreover, one single encoder may only be able to reflect the item relationships from a unitary perspective. For example, a Transformer encoder adopts the attentive aggregation of item embeddings to infer sequence embedding, while an RNN structure is more suitable in encoding direct item transitions. Therefore, in some embodiments, distinct encoders may be used to generate views for contrastive learning, which may enable the model to learn comprehensive sequential relationships of items. However, in some embodiments, embeddings from two views of a sequence with distinct encoders may lead to a non-Siamese paradigm for self-supervised learning, which may be hard to train and may suffer from the embedding collapse problem. Additionally, in examples where two distinct encoders reveal significantly diverse sequential correlations, the embeddings may be so far away from each other that they become poor views for contrastive learning. Moreover, in some embodiments, two distinct encoders may be optimized during a training phase, but it may still be problematic to combine them for the inference of sequence embeddings to conduct recommendations.
[0046] The encoder complementing method described herein may address issues from using a single encoder or using two distinct encoders to generate the views for contrastive learning. In various embodiments, instead of contrastive learning with a single encoder or two distinct encoders, encoder complementing uses a pre-trained encoder to complement model augmentation for the original encoder. Referring to FIG. 8, illustrated is an example encoder complementing module 800 for performing encoder complementing. During the pre-training stage, an encoder 806 that is different from encoder 306 is pre-trained with the next-item prediction target to generate a pre-trained encoder 808. Then, during the contrastive self-supervised training stage, this pre-trained encoder 808 is utilized to generate another embedding 810 for a view. A combiner 812 combines the view embedding 814 generated from a model encoder 306 and the view embedding 810 from the pre-trained encoder 808. In some embodiments, this model augmentation is in one branch of the SSL paradigm. The embedding 810 from the pre-trained encoder 808 may be re-scaled by a hyper-parameter γ (also referred to as a weight γ) before combining with (e.g., adding to) the embedding 814 from the model encoder 306. A smaller value of the hyper-parameter γ corresponds to injecting fewer perturbations from the distinct encoder 806. The output embedding of the combiner 812 may be passed to the next layer through a feed-forward network (FFN). By applying different weights γ to the output from the pre-trained encoder 808, two different views of the same sequence are provided for contrastive learning.
[0047] In some embodiments, the parameters of this pre-trained encoder 808 are fixed during the contrastive self-supervised training. In those embodiments, there is no optimization of the pre-trained encoder 808 during the contrastive self-supervised training. Furthermore, during the inference stage, it is no longer necessary to take both encoders 306 and 808 into account; only the model encoder 306 is used.
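The disclosure specifies only that a contrastive loss is computed over the two view embeddings; the in-batch-negative InfoNCE form below is one common choice and is included purely as an illustrative assumption, with PyTorch again assumed.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(view_a: torch.Tensor, view_b: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """Pulls matching views together; other sequences in the batch are negatives."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                     # pairwise cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


# Sketch of one training step: two model-augmented passes give two views,
# and only the model encoder's parameters receive gradient updates.
# z1 = augmented_encoder(batch)
# z2 = augmented_encoder(batch)
# loss = info_nce_loss(z1, z2)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```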
[0048] Referring to FIG. 9, illustrated is an example computing device 900 that may be used to implement contrastive learning with model augmentation, according to some embodiments described herein. As shown in FIG. 9, computing device 900 includes a processor 910 coupled to memory 920. Operation of computing device 900 is controlled by processor 910. Although computing device 900 is shown with only one processor 910, it is understood that processor 910 may be representative of one or more central processing units, multi-core processors, microprocessors, microcontrollers, digital signal processors, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), graphics processing units (GPUs), and/or the like in computing device 900. Computing device 900 may be implemented as a stand-alone subsystem, as a board added to a computing device, and/or as a virtual machine.
[0049] Memory 920 may be used to store software executed by computing device 900 and/or one or more data structures used during operation of computing device 900. Memory 920 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
[0050] Processor 910 and/or memory 920 may be arranged in any suitable physical arrangement. In some embodiments, processor 910 and/or memory 920 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 910 and/or memory 920 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 910 and/or memory 920 may be located in one or more data centers and/or cloud computing facilities.
[0051] In some examples, memory 920 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 910) may cause the one or more processors to perform the methods described in further detail herein. For example, as shown, memory 920 includes instructions for a neural network module 930 (e.g., neural network module 130 of FIG. 1) that may be used to implement and/or emulate the systems and models, and/or to implement any of the methods described further herein. In some examples, neural network module 930 implements contrastive learning with model augmentation, and may also be referred to as contrastive learning with model augmentation module 930. The contrastive learning with model augmentation module 930 may receive an input 940, such as original sequences, via a data interface 915. The data interface 915 may be a user interface that receives user-uploaded input sequences, or a communication interface that receives or retrieves previously stored sequences from a database. The contrastive learning with model augmentation module 930 may generate an output 950, such as a prediction of a next item for an input sequence, and/or the like, in response to the input 940.
[0052] In some embodiments, the contrastive learning with model augmentation module 930 may further include the encoder module 931 for providing an encoder, the neuron masking module 932 for performing neuron masking, the layer dropping module 933 for performing layer dropping, and the encoder complementing module 934 for performing encoder complementing.
[0053] Some examples of computing devices, such as computing devices 100 and 900, may include non-transitory, tangible, machine readable media that include executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the processes of method 200. Some common forms of machine readable media that may store the executable code for the methods/systems described herein (e.g., the methods/systems of FIGS. 2-8) are, for example, floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
[0054] This description and the accompanying drawings that illustrate inventive aspects, embodiments, implementations, or applications should not be taken as limiting. Various mechanical, compositional, structural, electrical, and operational changes may be made without departing from the spirit and scope of this description and the claims. In some instances, well-known circuits, structures, or techniques have not been shown or described in detail in order not to obscure the embodiments of this disclosure. Like numbers in two or more figures represent the same or similar elements.
[0055] In this description, specific details are set forth describing some embodiments consistent with the present disclosure. Numerous specific details are set forth in order to provide a thorough understanding of the embodiments. It will be apparent, however, to one skilled in the art that some embodiments may be practiced without some or all of these specific details. The specific embodiments disclosed herein are meant to be illustrative but not limiting. One skilled in the art may realize other elements that, although not specifically described here, are within the scope and the spirit of this disclosure. In addition, to avoid unnecessary repetition, one or more features shown and described in association with one embodiment may be incorporated into other embodiments unless specifically described otherwise or if the one or more features would make an embodiment non-functional.
[0056] Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.

Claims

WHAT IS CLAIMED IS:
1. A method for providing a neural network system, comprising:
    performing contrastive learning to the neural network system to generate a trained neural network system, wherein the performing the contrastive learning includes:
        performing first model augmentation to a first encoder of the neural network system to generate a first embedding of a sample;
        performing second model augmentation to the first encoder to generate a second embedding of the sample;
        optimizing the first encoder using a contrastive loss based on the first embedding and the second embedding; and
    providing the trained neural network system to perform a task.

2. The method of claim 1, wherein the performing the first model augmentation includes:
    performing neuron masking by randomly masking one or more neurons associated with the first encoder;
    performing layer dropping by dropping one or more layers associated with the first encoder; or
    performing encoder complementing using a second encoder.
3. The method of claim 2, wherein the performing the neuron masking includes: randomly masking the one or more neurons of one or more layers associated with the first encoder based on a masking probability.
4. The method of claim 3, wherein the same masking probability is applied to each layer.
5. The method of claim 3, wherein different masking probabilities are applied to different layers.
6. The method of claim 2, wherein the performing the layer dropping includes:
    appending a plurality of appended layers to the first encoder; and
    randomly dropping one or more of the plurality of appended layers.
7. The method of claim 6, wherein the neuron masking is performed to an original layer of the first encoder or one of the plurality of appended layers.
8. The method of claim 2, wherein the performing the encoder complementing includes:
    providing a pre-trained encoder by pre-training a second encoder;
    providing, by the first encoder, a first intermediate embedding of the sample;
    providing, by the pre-trained encoder, a second intermediate embedding of the sample; and
    combining the first intermediate embedding and a weighted second intermediate embedding for generating the first embedding for contrastive learning.
9. The method of claim 1, wherein the first encoder and the second encoder have different types.
10. The method of claim 6, wherein the first encoder is a Transformer-based encoder, and the second encoder is a recurrent neural network (RNN) based encoder.
11. A non-transitory machine-readable medium comprising a plurality of machine-readable instructions which, when executed by one or more processors, are adapted to cause the one or more processors to perform a method comprising:
    performing contrastive learning to a neural network system to generate a trained neural network system, wherein the performing the contrastive learning includes:
        performing first model augmentation to a first encoder of the neural network system to generate a first embedding of a sample;
        performing second model augmentation to the first encoder to generate a second embedding of the sample;
        optimizing the first encoder using a contrastive loss based on the first embedding and the second embedding; and
    providing the trained neural network system to perform a task.

12. The non-transitory machine-readable medium of claim 11, wherein the performing the first model augmentation includes:
    performing neuron masking by randomly masking one or more neurons associated with the first encoder;
    performing layer dropping by dropping one or more layers associated with the first encoder; or
    performing encoder complementing using a second encoder.
13. The non-transitory machine-readable medium of claim 12, wherein the performing the neuron masking includes: randomly masking the one or more neurons of one or more layers associated with the first encoder based on a masking probability.
14. The non-transitory machine-readable medium of claim 12, wherein the performing the layer dropping includes:
    appending a plurality of appended layers to the first encoder; and
    randomly dropping one or more of the plurality of appended layers.

15. The non-transitory machine-readable medium of claim 12, wherein the performing the encoder complementing includes:
    providing a pre-trained encoder by pre-training a second encoder;
    providing, by the first encoder, a first intermediate embedding of the sample;
    providing, by the pre-trained encoder, a second intermediate embedding of the sample; and
    combining the first intermediate embedding and a weighted second intermediate embedding for generating the first embedding for contrastive learning.
16. A system, comprising:
    a non-transitory memory; and
    one or more hardware processors coupled to the non-transitory memory and configured to read instructions from the non-transitory memory to cause the system to perform a method comprising:
        performing contrastive learning to a neural network system to generate a trained neural network system, wherein the performing the contrastive learning includes:
            performing first model augmentation to a first encoder of the neural network system to generate a first embedding of a sample;
            performing second model augmentation to the first encoder to generate a second embedding of the sample;
            optimizing the first encoder using a contrastive loss based on the first embedding and the second embedding; and
        providing the trained neural network system to perform a task.

17. The system of claim 16, wherein the performing the first model augmentation includes:
    performing neuron masking by randomly masking one or more neurons associated with the first encoder;
    performing layer dropping by dropping one or more layers associated with the first encoder; or
    performing encoder complementing using a second encoder.
18. The system of claim 17, wherein the performing the neuron masking includes: randomly masking the one or more neurons of one or more layers associated with the first encoder based on a masking probability.
19. The system of claim 17, wherein the performing the layer dropping includes:
    appending a plurality of appended layers to the first encoder; and
    randomly dropping one or more of the plurality of appended layers.

20. The system of claim 17, wherein the performing the encoder complementing includes:
    providing a pre-trained encoder by pre-training a second encoder;
    providing, by the first encoder, a first intermediate embedding of the sample;
    providing, by the pre-trained encoder, a second intermediate embedding of the sample; and
    combining the first intermediate embedding and a weighted second intermediate embedding for generating the first embedding for contrastive learning.
PCT/US2022/013743 2021-08-06 2022-01-25 Self-supervised learning with model augmentation WO2023014398A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280060208.5A CN117918014A (en) 2021-08-06 2022-01-25 Self-supervised learning with model enhancement

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US202163230474P 2021-08-06 2021-08-06
US63/230,474 2021-08-06
US202163252375P 2021-10-05 2021-10-05
US63/252,375 2021-10-05
US17/579,377 US20230042327A1 (en) 2021-08-06 2022-01-19 Self-supervised learning with model augmentation
US17/579,377 2022-01-19

Publications (1)

Publication Number Publication Date
WO2023014398A1 true WO2023014398A1 (en) 2023-02-09

Family

ID=80446549

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/013743 WO2023014398A1 (en) 2021-08-06 2022-01-25 Self-supervised learning with model augmentation

Country Status (1)

Country Link
WO (1) WO2023014398A1 (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792818A (en) * 2021-10-18 2021-12-14 平安科技(深圳)有限公司 Intention classification method and device, electronic equipment and computer readable storage medium

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
KUN ZHOU ET AL: "S^3-Rec: Self-Supervised Learning for Sequential Recommendation with Mutual Information Maximization", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 August 2020 (2020-08-18), XP081743350, DOI: 10.1145/3340531.3411954 *
XU XIE ET AL: "Contrastive Learning for Sequential Recommendation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 February 2021 (2021-02-28), XP081885869 *
YUNING YOU ET AL: "Bringing Your Own View: Graph Contrastive Learning without Prefabricated Data Augmentations", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 4 January 2022 (2022-01-04), XP091133715, DOI: 10.1145/3488560.3498416 *
YUNING YOU ET AL: "Graph Contrastive Learning with Augmentations", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 3 April 2021 (2021-04-03), XP081926644 *
ZHIWEI LIU ET AL: "Contrastive Self-supervised Sequential Recommendation with Robust Augmentation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 14 August 2021 (2021-08-14), XP091033508 *
ZHUOFENG WU ET AL: "CLEAR: Contrastive Learning for Sentence Representation", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 31 December 2020 (2020-12-31), XP081849378 *

Similar Documents

Publication Publication Date Title
JP7408574B2 (en) Multitask learning as question answering
CN111699498B (en) Multitask learning as question and answer
Kiperwasser et al. Simple and accurate dependency parsing using bidirectional LSTM feature representations
US11620515B2 (en) Multi-task knowledge distillation for language model
US10699060B2 (en) Natural language processing using a neural network
CN113519001A (en) Generating common sense interpretations using language models
EP4348506A1 (en) Systems and methods for vision-and-language representation learning
WO2021061555A1 (en) Contrastive pre-training for language tasks
CN110610234B (en) Integrating external applications into deep neural networks
Rotman et al. Shuffling recurrent neural networks
Arora et al. Deep learning with h2o
Kostadinov Recurrent Neural Networks with Python Quick Start Guide: Sequential learning and language modeling with TensorFlow
US11836438B2 (en) ML using n-gram induced input representation
KR20240011164A (en) Transfer learning in image recognition systems
US20230280985A1 (en) Systems and methods for a conversational framework of program synthesis
Milutinovic et al. End-to-end training of differentiable pipelines across machine learning frameworks
WO2019106132A1 (en) Gated linear networks
US20230042327A1 (en) Self-supervised learning with model augmentation
US20220050964A1 (en) Structured graph-to-text generation with two step fine-tuning
US20220067534A1 (en) Systems and methods for mutual information based self-supervised learning
CN111259673A (en) Feedback sequence multi-task learning-based law decision prediction method and system
Huang et al. Flow of renyi information in deep neural networks
WO2023014398A1 (en) Self-supervised learning with model augmentation
WO2022164613A1 (en) Ml using n-gram induced input representation
Gurunath et al. Insights Into Deep Steganography: A Study of Steganography Automation and Trends

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22704154

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280060208.5

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE