WO2023244567A1 - Self-supervised representation learning with multi-segmental informational coding - Google Patents

Self-supervised representation learning with multi-segmental informational coding

Info

Publication number
WO2023244567A1
Authority
WO
WIPO (PCT)
Prior art keywords
circuitry
training
ssrl
segment
batch
Prior art date
Application number
PCT/US2023/025137
Other languages
English (en)
Inventor
Chuang NIU
Ge Wang
Original Assignee
Rensselaer Polytechnic Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Rensselaer Polytechnic Institute filed Critical Rensselaer Polytechnic Institute
Publication of WO2023244567A1


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/2163Partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/32Normalisation of the pattern dimensions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates to self-supervised representation learning, in particular to self-supervised representation learning with multi-segmental informational coding.
  • SSRL Self-supervised representation learning
  • SSRL maps high-dimensional data into a meaningful embedding space, where samples of similar semantic content are close to each other.
  • SSRL has been a core task in machine learning and has experienced relatively rapid progress over the past few years.
  • Deep neural networks pre-trained on large-scale unlabeled datasets via SSRL have demonstrated desirable characteristics, including relatively strong robustness and generalizability, improving various down-stream tasks when annotations are scarce.
  • An effective approach for SSRL is to enforce that semantically similar samples (i.e., different transformations of the same instance) are close to each other in an embedding space. Simply maximizing similarity or minimizing Euclidean distance between embedding features of semantically similar samples tends to produce trivial solutions, e.g., all samples having the same embedding.
  • a self-supervised representation learning (SSRL) circuitry includes a transformer circuitry configured to receive input data.
  • the input data includes an input batch containing a number, N, of input data sets.
  • the transformer circuitry is configured to transform the input batch into a plurality of training batches. Each training batch contains the number N training data sets.
  • the SSRL circuitry further includes, for each training batch: a respective encoder circuitry, a respective projector circuitry, and a respective partitioning circuitry.
  • the respective encoder circuitry is configured to encode each training data set into a respective representation feature.
  • the respective projector circuitry is configured to map each representation feature into an embedding space as a respective embedding feature vector.
  • the respective partitioning circuitry is configured to partition each embedding feature vector into a number, S, segments.
  • Each segment has a dimension, Ds.
  • Each segment corresponds to a respective attribute type, and each segment contains at least one instantiated attribute corresponding to the associated attribute type for the segment.
  • the SSRL circuitry further includes, for each training batch, a respective normalizing circuitry configured to normalize each segment of the corresponding partitioned embedding feature vector to a probability distribution over Ds instantiated attributes using a softmax function.
  • the SSRL circuitry further includes a joint probability circuitry configured to determine an empirical joint probability distribution between the embedding features of the training data sets over the plurality of training batches.
  • each encoder circuitry and each projector circuitry corresponds to a respective multilayer perceptron (MLP).
  • the input data is selected from the group including image data, text, and speech data.
  • a number of training batches is two.
  • a method for self-supervised representation learning includes receiving, by a transformer circuitry, input data.
  • the input data includes an input batch containing a number, N, of input data sets.
  • the method further includes transforming, by the transformer circuitry, the input batch into a plurality of training batches. Each training batch contains the number N training data sets.
  • the method further includes, for each training batch: encoding, by a respective encoder circuitry, each training data set into a respective representation feature, mapping, by a respective projector circuitry, each representation feature into an embedding space as a respective embedding feature vector; and partitioning, by a respective partitioning circuitry, each embedding feature vector into a number, S, segments.
  • Each segment has a dimension, Ds.
  • Each segment corresponds to a respective attribute type.
  • Each segment contains at least one instantiated attribute corresponding to the associated attribute type for the segment.
  • the method further includes, for each training batch, normalizing, by a respective normalizing circuitry, each segment of the corresponding partitioned embedding feature vector to a probability distribution over Ds instantiated attributes using a softmax function.
  • the method further includes determining, by a joint probability circuitry, an empirical joint probability distribution between the embedding features of the training data sets over the plurality of training batches.
  • each encoder circuitry and each projector circuitry corresponds to a respective multilayer perceptron (MLP).
  • the input data is selected from the group including image data, text, and speech data.
  • the method further includes determining, by a training circuitry, a pure entropy loss based, at least in part, on an empirical joint probability distribution. Minimizing the pure entropy loss during training is configured to maximize a joint entropy over a number of selected segments.
  • a pure entropy loss function is:
  • the method further includes determining by the training circuitry, an enhanced loss based, at least in part on the pure entropy loss, and based, at least in part on an inner product term.
  • the enhanced loss is configured to enhance a transformation invariance of a plurality of features.
  • a self-supervised representation learning (SSRL) system includes a computing device and an SSRL circuitry.
  • the computing device includes a processor, a memory, an input/output circuitry, and a data store.
  • the SSRL circuitry includes a transformer circuitry configured to receive input data.
  • the input data includes an input batch containing a number, N, of input data sets.
  • the transformer circuitry is configured to transform the input batch into a plurality of training batches. Each training batch contains the number N training data sets.
  • the SSRL circuitry further includes, for each training batch: a respective encoder circuitry, a respective projector circuitry, and a respective partitioning circuitry.
  • the respective encoder circuitry is configured to encode each training data set into a respective representation feature.
  • the respective projector circuitry is configured to map each representation feature into an embedding space as a respective embedding feature vector.
  • the respective partitioning circuitry is configured to partition each embedding feature vector into a number, S, segments. Each segment has a dimension, Ds. Each segment corresponds to a respective attribute type, and each segment contains at least one instantiated attribute corresponding to the associated attribute type for the segment.
  • the SSRL circuitry further includes, for each training batch, a respective normalizing circuitry configured to normalize each segment of the corresponding partitioned embedding feature vector to a probability distribution over Ds instantiated attributes using a softmax function.
  • the SSRL circuitry further includes a joint probability circuitry configured to determine an empirical joint probability distribution between the embedding features of the training data sets over the plurality of training batches.
  • each encoder circuitry and each projector circuitry corresponds to a respective multilayer perceptron (MLP).
  • the input data is selected from the group including image data, text, and speech data.
  • the SSRL system further includes a training circuitry configured to determine a pure entropy loss based, at least in part, on an empirical joint probability distribution. Minimizing the pure entropy loss during training is configured to maximize a joint entropy over a number of selected segments.
  • a pure entropy loss function is:
  • the training circuitry is configured to determine an enhanced loss based, at least in part on the pure entropy loss, and based, at least in part on an inner product term.
  • the enhanced loss is configured to enhance a transformation invariance of a plurality of features.
  • a computer readable storage device has stored thereon instructions that when executed by one or more processors result in the following operations including: any embodiment of the method.
  • FIG. 1 is a sketch illustrating one example feature vector including embedded feature partitions, according to several embodiments of the present disclosure.
  • FIG. 2 illustrates a functional block diagram of one example self-supervised representation learning circuitry that graphically illustrates joint entropy, according to one embodiment of the present disclosure.
  • FIG. 3 illustrates a functional block diagram of a self-supervised representation learning (SSRL) system that includes a self-supervised representation learning circuitry, according to several embodiments of the present disclosure.
  • FIG. 4 is a flowchart of operations for self-supervised representation learning, according to various embodiments of the present disclosure.
  • this disclosure relates to a self-supervised representation learning (SSRL) system, in particular to, an SSRL system with MUlti-Segmental Informational Coding (“MUSIC”).
  • An apparatus, system, and/or method is configured to divide, i.e., partition, an embedding feature vector corresponding to a batch of input data sets into a plurality of segments, with each segment corresponding to a respective attribute type (i.e., general attribute). Each segment is configured to contain at least one instantiated attribute that corresponds to the associated attribute type for the segment.
  • the apparatus, system, and/or method are configured to utilize information theory, e.g., entropy, and an entropy-based cost function, to help avoid trivial solutions.
  • an object may be represented by a plurality of attributes, including, but not limited to, object parts, textures, shapes, etc.
  • An embedding vector may be divided into a number, S, segments (e.g., Seg-1, Seg-2, ..., Seg-S). Different segments are configured to represent different attributes. For example, Seg-1 may represent object part, Seg-2 may represent texture, and Seg-3 may represent shape, respectively. Each segment is configured to instantiate a number, Ds, different features. Continuing with this example, Seg-2 may be configured to represent samples with different textures (e.g., dot texture, stripe texture, etc.).
  • MUSIC allows an information theory-based representation learning framework.
  • Theoretical analysis supports that optimized MUSIC embedding features are transform-invariant, discriminative, diverse, and non-trivial.
  • the MUSIC technique does not require an asymmetric network architecture with an extra predictor module, a large batch size of contrastive samples, a memory bank, gradient stopping, or momentum updating.
  • Empirical results suggest that MUSIC does not depend on a relatively high dimension of embedding features or a relatively deep projection head, thus, efficiently reducing a memory and computation cost.
  • experimental data suggests that MUSIC achieves acceptable results in terms of linear probing on the ImageNet dataset.
  • FIG. 1 is a sketch 100 illustrating one example 102 embedding feature vector including embedded feature partitions, according to several embodiments of the present disclosure.
  • an image may be represented by a plurality of attributes including, but not limited to, general object parts, textures, shapes, etc.
  • Other types of input data for example, text data, speech data, etc., may be similarly represented by a plurality of associated attributes.
  • the example embedding feature vector 102 includes a number, S, segments Seg-1, Seg-2, ..., Seg-S.
  • an SSRL circuitry, e.g., SSRL circuitry 302 of FIG. 3, as will be described in more detail below, may be configured to divide an embedding feature vector into a plurality of segments (Seg-1, Seg-2, ..., Seg-S). Each segment corresponds to a respective attribute type (i.e., “general attribute”). Each segment may then include a plurality of instantiated attributes corresponding to the associated attribute type for the segment. For example, for image data, segment Seg-1 may correspond to an object part attribute, segment Seg-2 may correspond to a texture attribute, and segment Seg-S may correspond to a shape attribute.
  • a respective general attribute of each segment may include a plurality of instantiations, and different instantiated attributes within a same segment are configured to be discriminative from each other.
  • Each attribute has an associated probability p(s, d) 106 corresponding to the probability that an input data set, e.g., image data, belongs to the d-th instantiated attribute of the s-th segment.
  • each attribute Seg-2 Attribute-1, ..., Seg-2 Attribute-Ds may represent a respective texture, e.g., dot texture, stripe texture, etc.
  • Each attribute may have one or more associated samples, e.g., grouping 104 - 1 that includes Seg-2 Attribute-1 samples sample-1 through sample-R.
  • each sample corresponds to image data that includes the associated attribute.
  • each embedding feature vector may be partitioned into a plurality of segments.
  • Each segment is configured to correspond to a respective attribute type.
  • Each segment is configured to contain at least one instantiated attribute corresponding to the associated attribute type for the segment.
  • FIG. 2 illustrates a functional block diagram 200 of one example self-supervised representation learning (SSRL) circuitry that graphically illustrates joint entropy, according to one embodiment of the present disclosure.
  • example 200 illustrates a twin architecture and may be configured to use a same network for both branches.
  • Example 200 includes an input data set (X) 201, and two transformed, i.e., training, data sets (X', X'') 223 - 1, 223 - 2.
  • a first training data set, X' 223 - 1, may be provided to a first branch that includes a first encoder 224 - 1 and a first projector 226 - 1.
  • An output of the first encoder 224 - 1 corresponds to an input to the first projector 226 - 1.
  • An output of the first projector corresponds to a first embedding feature (z').
  • a second training data set, X'' 223 - 2, may be provided to a second branch that includes a second encoder 224 - 2 and a second projector 226 - 2.
  • An output of the second encoder 224 - 2 corresponds to an input to the second projector 226 - 2.
  • An output of the second projector corresponds to a second embedding feature (z'').
  • a common transformation distribution, i.e., random crops combined with color distortions, may be used to generate a number of training samples.
  • Two batches of distorted images X' and X" may then be respectively fed to the two branches.
  • Each encoder may correspond to a function F(·; θF), where the symbol · corresponds to a training data set.
  • Each projector may correspond to a function P(·; θP), where the symbol · corresponds to F(·; θF).
  • An output of each encoder 224 - 1 , 224 - 2 may be used as a respective representation feature.
  • Each projector, i.e., projection head, is configured to map the representation feature into an embedding space during training.
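  • As a concrete illustration, the following is a minimal PyTorch-style sketch of such a twin configuration, using the ResNet-50 backbone and the two-layer projector sizes reported in the experimental description below; the class and variable names are illustrative assumptions, not the claimed implementation.

```python
import torch.nn as nn
import torchvision.models as models

class TwinSSRL(nn.Module):
    """Minimal twin-branch sketch: a shared encoder F(.; theta_F) and
    projector P(.; theta_P) applied to two transformed views."""
    def __init__(self, hidden_dim=8192, embed_dim=8160):
        super().__init__()
        backbone = models.resnet50()
        backbone.fc = nn.Identity()      # keep the 2,048-unit representation
        self.encoder = backbone          # F(.; theta_F)
        self.projector = nn.Sequential(  # P(.; theta_P), two-layer MLP
            nn.Linear(2048, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, x_prime, x_double_prime):
        z_prime = self.projector(self.encoder(x_prime))                # z'
        z_double_prime = self.projector(self.encoder(x_double_prime))  # z''
        return z_prime, z_double_prime
```

Both branches share parameters in this sketch, matching the twin architecture above; the variants with different parameters or heterogeneous networks mentioned next would instantiate two separate encoder/projector pairs.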
  • an SSRL circuitry, system and/or method are not limited to this twin architecture.
  • an SSRL circuitry, system and/or method may include two branches with different parameters or heterogeneous networks.
  • a SSRL circuitry, system and/or method may be configured to receive input data corresponding to other input modalities (e.g., text, audio, etc.).
  • MUlti-Segmental Informational Coding is configured for self-supervised representation learning.
  • the embedding features of the two branches may be denoted as z' and z'', each of dimension D, where D is the feature dimension.
  • S is the number of segments.
  • Ds is the dimension of each segment.
  • D = Ds × S corresponds to the dimension of the embedding space.
  • the MUSIC technique may be configured to evenly split the embedding vector. It is contemplated that the MUSIC technique may be configured to implement uneven configurations.
  • Each segment may be normalized to a probability distribution p'(s, d) over Ds instantiated attributes using a softmax function, i.e., Eq. (1).
  • a probability distribution p''(s, d) for the other branch may be similarly determined.
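  • Assuming Eq. (1) is the conventional per-segment softmax, p'(s, d) = exp(z'(s, d)) / Σk exp(z'(s, k)) (an inference from the surrounding description), a minimal sketch of the partition-and-normalize step is:

```python
import torch

def segment_softmax(z, num_segments, segment_dim):
    """Partition embeddings z of shape (N, S * Ds) into S segments of
    dimension Ds, then softmax-normalize each segment so it becomes a
    probability distribution over the Ds instantiated attributes."""
    n = z.shape[0]
    z = z.view(n, num_segments, segment_dim)  # (N, S, Ds)
    return torch.softmax(z, dim=-1)           # p(s, d) per sample

# Example with the configuration reported below (S = 102, Ds = 80):
# p_prime = segment_softmax(z_prime, 102, 80)
# p_double_prime = segment_softmax(z_double_prime, 102, 80)
```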
  • the MUSIC technique may be interpreted as a combination of a plurality of classifiers or cluster operators configured to implement different classification criteria learned in a data-driven fashion.
  • an empirical joint distribution between the embedding features of two transformations may be determined over a batch of samples as:
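  • A plausible form of the referenced Eq. (2), consistent with the surrounding description and offered as an inference rather than a quotation, is p(s', s'', d', d'') = (1/N) Σi p'i(s', d') p''i(s'', d''), i.e., a batch average of products of the two branches' segment-wise probabilities. A sketch under that assumption:

```python
import torch

def empirical_joint(p_prime, p_double_prime):
    """Empirical joint distribution between the two branches' segment-wise
    probabilities, averaged over a batch of N samples.

    p_prime, p_double_prime: (N, S, Ds) tensors from segment_softmax.
    Returns a (S, Ds, S, Ds) tensor indexed as p[s', d', s'', d''].
    """
    n = p_prime.shape[0]
    return torch.einsum("nad,nbe->adbe", p_prime, p_double_prime) / n
```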
  • two versions of the loss function may be defined.
  • An empirical joint distribution can be modeled as a block matrix 252 as illustrated in FIG. 2.
  • crosshatch squares correspond to elements that are maintained, e.g., 254 and 256 - 1
  • white squares correspond to elements that are not maintained, e.g., 256 - 2. It may be appreciated that minimizing this loss function is configured to maximize a joint entropy over the selected elements. It may be further appreciated that this single loss function is configured to facilitate learning relatively meaningful features.
  • an additional term may be included configured to maximize an inner product between the embedding features from the plurality of transformations.
  • An enhanced loss function may then be defined as in Eq. (4), where λ is a balancing factor. In one nonlimiting example, λ may be set equal to 1. Based on experimental data, it appears that λ need not be relatively very small or relatively large to achieve adequate balancing. Since p'(s, d) and p''(s, d) are probabilities, maximizing their inner product is configured to ensure the corresponding network makes relatively consistent assignments over all segments between two transformations of a same image. Each segment may then be encouraged to be a one-hot vector for a maximum inner product.
  • this additional term is configured to promote a transformation invariance and relatively confident assignments over a number of different attributes.
  • One difference between this term and the entropy loss term is that this term is a sample-specific constraint, while entropy is a statistical measure.
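  • The sketch below implements one reading of Eqs. (3) and (4) as described above (an assumption, not a quotation of the claimed equations): the pure entropy loss sums p·log p over the maintained elements, i.e., the diagonal elements of the diagonal blocks plus all elements of the off-diagonal blocks, so that minimizing it maximizes the joint entropy over those elements; the enhanced loss subtracts λ times the batch-averaged inner product between p' and p''.

```python
import torch

def music_losses(p_prime, p_double_prime, lam=1.0, eps=1e-12):
    """Entropy-based losses over the empirical joint distribution.

    p_prime, p_double_prime: (N, S, Ds) per-segment probabilities.
    Returns (pure_entropy_loss, enhanced_loss).
    """
    n = p_prime.shape[0]
    joint = torch.einsum("nad,nbe->adbe", p_prime, p_double_prime) / n  # (S, Ds, S, Ds)
    plogp = joint * torch.log(joint + eps)

    # Maintained elements, part 1: diagonal elements of diagonal blocks
    # (s' == s'' and d' == d''), e.g., element 256-1 in FIG. 2.
    s_idx = torch.arange(joint.shape[0]).unsqueeze(1)   # (S, 1)
    d_idx = torch.arange(joint.shape[1]).unsqueeze(0)   # (1, Ds)
    diag_elements = plogp[s_idx, d_idx, s_idx, d_idx].sum()

    # Maintained elements, part 2: all elements of off-diagonal blocks
    # (s' != s''), e.g., block 254 in FIG. 2.
    block_sums = plogp.sum(dim=(1, 3))                  # (S, S): each block summed over d', d''
    off_diag_blocks = block_sums.sum() - block_sums.diagonal().sum()

    pure_entropy_loss = diag_elements + off_diag_blocks  # minimizing maximizes entropy

    # Inner product term: consistent assignments between the two views.
    inner = (p_prime * p_double_prime).sum(dim=(1, 2)).mean()
    enhanced_loss = pure_entropy_loss - lam * inner
    return pure_entropy_loss, enhanced_loss
```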
  • In an SSRL, i.e., MUSIC, embodiment, the entropy loss is configured to optimize relatively meaningful embedding features.
  • the entropy loss function includes two parts: the entropy over diagonal elements, e.g., element 256 - 1, and the entropy over the elements of off-diagonal blocks, e.g., block 254 of FIG. 2.
  • the two-part entropy loss function may be written as:
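  • A reconstruction of this two-part form, consistent with the description above and offered as an inference rather than a quotation, is:

```latex
L_{ent} = \sum_{s=1}^{S}\sum_{d=1}^{D_s} p(s,s,d,d)\,\log p(s,s,d,d)
        + \sum_{s'=1}^{S}\;\sum_{\substack{s''=1 \\ s'' \neq s'}}^{S}\;\sum_{d'=1}^{D_s}\sum_{d''=1}^{D_s} p(s',s'',d',d'')\,\log p(s',s'',d',d'')
```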
  • a plurality of segments of the MUSIC embedding vector may be configured to focus on complementary attributes.
  • the redundancy or mutual information between any two segments may be minimized. Minimizing redundancy or mutual information between any two segments may be useful for feature selection. It may be appreciated that the redundancy or mutual information between any two segments is minimized when an optimal solution is obtained.
  • Mutual information I(s', s'') between any two segments s' and s'' may be written as:
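  • Using the standard definition of mutual information over the empirical joint distribution and its marginals p(s', d') and p(s'', d'') (an assumption about the intended notation), the expression would read:

```latex
I(s', s'') = \sum_{d'=1}^{D_s}\sum_{d''=1}^{D_s} p(s', s'', d', d'')\,
             \log \frac{p(s', s'', d', d'')}{p(s', d')\, p(s'', d'')}
```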
  • MUSIC embedding features may be both discriminative and diverse.
  • the entropy-based loss function as described herein, is configured to reduce redundancy in a non-linear way. It may be appreciated that optimal MUSIC embedding features are configured to have zero covariance between any two features in different segments and negative covariance between the features within the same segment.
  • contrastive learning is relatively effective for representation learning by maximizing a similarity between different transformations of a same instance and minimizing a similarity between a reference and other instances.
  • MUSIC according to the present disclosure, is configured to be consistent with contrastive learning.
  • an optimal MUSIC embedding may encode (Ds)^S different samples.
  • MUSIC may be configured to represent 80^102 different samples. Maximizing the joint entropy may be configured to evenly assign a batch of samples into most or all embeddings.
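  • To make the capacity concrete (assuming Ds = 80 and S = 102, as in the experimental configuration reported below):

```latex
(D_s)^S = 80^{102} \approx 1.3 \times 10^{194} \ \text{distinct assignment codes}
```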
  • the embedding features of most or all instances may be configured to be different from each other, similar to contrastive learning, given a sufficiently large coding capacity. It may be appreciated that contrastive learning is configured to differentiate instances by directly enforcing their features to be dissimilar, while MUSIC is configured to statistically assign instances with different assignment codes.
  • a self-supervised representation learning (SSRL) circuitry includes a transformer circuitry configured to receive input data.
  • the input data includes an input batch containing a number, N, of input data sets.
  • the transformer circuitry is configured to transform the input batch into a plurality of training batches. Each training batch contains the number N training data sets.
  • the SSRL circuitry further includes, for each training batch: a respective encoder circuitry, a respective projector circuitry, and a respective partitioning circuitry.
  • the respective encoder circuitry is configured to encode each training data set into a respective representation feature.
  • the respective projector circuitry is configured to map each representation feature into an embedding space as a respective embedding feature vector.
  • the respective partitioning circuitry is configured to partition each embedding feature vector into a number, S, segments.
  • Each segment has a dimension, Ds.
  • Each segment corresponds to a respective attribute type, and each segment contains at least one instantiated attribute corresponding to the associated attribute type for the segment.
  • FIG. 3 illustrates a functional block diagram of a self-supervised representation learning (SSRL) system 300 that includes an SSRL circuitry 302, according to several embodiments of the present disclosure.
  • the SSRL system 300 may be configured to implement a MUSIC technique, as described herein.
  • the SSRL system 300 includes the SSRL circuitry 302, a computing device 306, and may include a training circuitry 308.
  • the SSRL circuitry 302, and/or training circuitry 308 may be coupled to or included in computing device 306.
  • the SSRL circuitry 302 is configured to receive input data 301 (e.g., input batch 309 from the training circuitry 308) and to provide a joint probability distribution 333 to the training circuitry 308.
  • the training circuitry 308, e.g., training management circuitry 340, may then be configured to evaluate a pure entropy loss function 342 - 1, and/or an enhanced entropy loss function 342 - 2 based, at least in part, on the joint probability distribution 333, as described herein.
  • the training circuitry 308, e.g., training management circuitry 340, may then be configured to adjust one or more network parameters 303 associated with SSRL circuitry 302, and one or more of the circuitries contained therein, to optimize the entropy associated with elements of an embedding feature vector, as described herein.
  • SSRL circuitry 302 includes a transformer circuitry 322, a plurality of encoder circuitries 324 - 1, 324 - 2, a plurality of projector circuitries 326 - 1, 326 - 2, a plurality of partitioning circuitries 328 - 1, 328 - 2, a plurality of normalizing circuitries 330 - 1, 330 - 2, and a joint probability circuitry 332.
  • Transformer circuitry 322 is coupled to a plurality of branches 334 - 1, 334 - 2, and each branch includes a respective plurality of circuitries coupled in series.
  • a first branch 334 - 1 includes a first encoder circuitry 324 - 1 coupled to a first projector circuitry 326 - 1 coupled to a first partitioning circuitry 328 - 1 coupled to a first normalizing circuitry 330 - 1.
  • a second branch 334 - 2 includes a second encoder circuitry 324 - 2 coupled to a second projector circuitry 326 - 2 coupled to a second partitioning circuitry 328 - 2 coupled to a second normalizing circuitry 330 - 2.
  • a respective normalizing circuitry 330 - 1, 330 - 2 of each of the plurality of branches 334 - 1, 334 - 2 is coupled to a joint probability circuitry 332.
  • the encoder circuitries 324 - 1, 324 - 2 correspond to the encoders 224 - 1, 224 - 2 of FIG. 2.
  • the projector circuitries 326 - 1, 326 - 2 correspond to the projectors 226 - 1, 226 - 2.
  • the encoder circuitries, encoders, projector circuitries, and/or projectors may each correspond to an artificial neural network, e.g., a multilayer perceptron. However, this disclosure is not limited in this regard.
  • Computing device 306 may include, but is not limited to, a computing system (e.g., a server, a workstation computer, a desktop computer, a laptop computer, a tablet computer, an ultraportable computer, an ultramobile computer, a netbook computer and/or a subnotebook computer, etc.), and/or a smart phone.
  • Computing device 306 includes a processor 310, a memory 312, input/output (I/O) circuitry 314, a user interface (UI) 316, and data store 318.
  • Processor 310 is configured to perform operations of SSRL circuitry 302, and/or training circuitry 308.
  • Memory 312 may be configured to store data associated with SSRL circuitry 302, and/or training circuitry 308.
  • I/O circuitry 314 may be configured to provide wired and/or wireless communication functionality for SSRL system 300.
  • I/O circuitry 314 may be configured to receive input data 301 and/or system input data 307 (including, e.g., training data 344) and to provide output data 305.
  • UI 316 may include a user input device (e.g., keyboard, mouse, microphone, touch sensitive display, etc.) and/or a user output device, e.g., a display.
  • Data store 318 may be configured to store one or more of system input data 307, training data 344, input data 301, output data 305, network parameters 303, and/or other data associated with SSRL circuitry 302, and/or training circuitry 308.
  • Other data may include, for example, function parameters related to loss function(s) 342 (e.g., related to pure entropy loss function 342 - 1, and/or enhanced entropy loss function 342 - 2), training constraints (e.g., hyperparameters, including, but not limited to, number of epochs, batch size, projector depth, segment dimension, feature dimension, convergence criteria, etc.), etc.
  • Training circuitry 308 may be configured to receive and store system input data 307.
  • System input data 307 may include training data 344, loss function(s) 342 parameters, etc.
  • Training data 344 may include, for example, one or more input batches of input data sets.
  • a batch may be configured to contain a number, N, input data sets.
  • each input data set may correspond to image data.
  • the input data sets may not correspond to image data and may include text, audio, and/or speech data.
  • Training circuitry 308 may be further configured to receive and/or store one or more loss function(s) 342, e.g., a pure entropy loss function 342 - 1, and/or an enhanced entropy loss function 342 - 2, as described herein.
  • SSRL circuitry 302, e.g., transformer circuitry 322, is configured to receive input data 301.
  • input data 301 is configured to correspond to an input batch 309 that includes a number, N, input data sets.
  • Training operations may be managed by training management circuitry 340.
  • training management circuitry 340 may be configured to provide the input batch 309 to SSRL circuitry 302, capture the joint probability distribution 333 from the SSRL circuitry 302, evaluate the loss function(s) 342, and adjust network parameters 303 to optimize operation of SSRL circuitry 302, e.g., maximizing an entropy of an associated embedding feature vector, as described herein.
  • the network parameters 303 may be related to one or more of the encoder circuitries 324 - 1, 324 - 2, and/or the projector circuitries 326 - 1, 326 - 2.
  • the encoder circuitries 324 - 1, 324 - 2, and/or the projector circuitries 326 - 1, 326 - 2 may correspond to artificial neural networks.
  • the encoder circuitries 324 - 1, 324 - 2, and/or the projector circuitries 326 - 1, 326 - 2 may correspond to multilayer perceptrons. However, this disclosure is not limited in this regard.
  • Training operations may repeat until a stop criterion is met, e.g., a cost function threshold value is achieved, a maximum number of iterations has been reached, etc.
  • network parameters 303 may be set for operation.
  • the SSRL circuitry 302 may then be configured to map relatively high dimensional data into a meaningful embedding space, where samples of similar semantic content are configured to be relatively close to each other, while avoiding trivial solutions in which all samples have a same embedding feature.
  • the input data 301 may correspond to an input batch 309.
  • a batch may have a batch size, N, where N corresponds to the number of data sets in a batch.
  • a batch may be an input batch, e.g., input batch 309, or a training batch, e.g., a first training batch 323 - 1 or a second training batch 323 - 2.
  • a subscript, i, is an index corresponding to a data set in a batch of data sets.
  • x_i: the i-th input data set.
  • X': batch of N first training data sets 323 - 1.
  • x''_i: the i-th second training data set, i.e., the i-th training data set in the second batch of training data sets.
  • input data 301 including an input batch 309, that contains N input data sets may be received by transformer circuitry 322 and/or retrieved by training management circuitry 340 and provided to transformer circuitry 322.
  • the input batch 309 may be transformed into a plurality of training batches 323 - 1, 323 - 2, by transformer circuitry 322. Transformations may include, but are not limited to, random cropping, horizontal flip, color jittering, grayscale, Gaussian blur, solarization, etc.
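  • A minimal torchvision-style sketch of one such transformation pipeline is shown below; the specific probabilities and parameter values are illustrative assumptions, not values taken from this disclosure.

```python
import torchvision.transforms as T

def make_view_transform(size=224):
    """One example image transformation used to generate a training view."""
    return T.Compose([
        T.RandomResizedCrop(size),                                  # random cropping
        T.RandomHorizontalFlip(p=0.5),                              # horizontal flip
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),  # color jittering
        T.RandomGrayscale(p=0.2),                                   # grayscale
        T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),     # Gaussian blur
        T.RandomSolarize(threshold=128, p=0.2),                     # solarization
        T.ToTensor(),
    ])

# transform = make_view_transform()
# x_prime, x_double_prime = transform(image), transform(image)  # two training views
```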
  • Each training batch may contain N training, e.g., transformed, data sets.
  • a first training batch 323 - 1 may include N first training data sets, x'_i, and a second training batch 323 - 2 may include N second training data sets, x''_i.
  • Each segment corresponds to a respective attribute type (“general attribute”).
  • Each segment is configured to contain at least one instantiated attribute corresponding to the associated attribute type for the segment, as described herein.
  • Each segment may then be normalized by a respective normalizing circuitry 330 - 1, 330 - 2 to a respective probability distribution p_i(s, d) over Ds instantiated attributes 331 - 1, 331 - 2 using a softmax function (Eq. (1)).
  • An empirical joint probability distribution p(s', s'', d', d'') 333 may be determined between the embedding features of the training data sets over the batches of training data sets (Eq. (2)) by the joint probability circuitry 332 based, at least in part, on the respective normalized probability distributions 331 - 1, 331 - 2.
  • a pure entropy loss L_ent may be determined using the pure entropy loss function 342 - 1 by the training management circuitry 340 based, at least in part, on the empirical joint probability distribution (Eq. (3)). It may be appreciated that the entropy loss includes entropy over diagonal elements and entropy over elements of off-diagonal blocks of the block structure illustrated in FIG. 2. Minimizing the pure entropy loss function is configured to maximize the joint entropy over selected elements.
  • An enhanced loss, L, may be determined (Eq. (4)) using the enhanced entropy loss function 342 - 2 by the training management circuitry 340 based, at least in part, on the pure entropy loss, L_ent, and based, at least in part, on an inner product term and a balancing factor, λ.
  • the enhanced loss is configured to enhance a transformation invariance of the features.
  • Respective encoder network parameters, θF, and respective projector network parameters, θP, may be adjusted by training management circuitry 340 based, at least in part, on at least one of the pure entropy loss, L_ent, and/or the enhanced loss, L, to optimize entropy.
  • the trained SSRL framework/system/circuitry 302 may then be applied to a selected downstream task.
  • an SSRL system with MUlti-Segmental Informational Coding may be configured to divide, i.e., partition, an embedding feature vector corresponding to a batch of input data sets into a plurality of segments, with each segment corresponding to a respective attribute type (i.e., general attribute). Each segment is configured to contain at least one instantiated attribute that corresponds to the associated attribute type for the segment.
  • the apparatus, system, and/or method are configured to utilize information theory, e.g., entropy, and an entropy-based cost function, to help avoid trivial solutions.
  • FIG. 4 is a flowchart 400 of operations for self-supervised representation learning, according to various embodiments of the present disclosure.
  • the flowchart 400 illustrates training an SSRL circuitry.
  • the operations may be performed, for example, by the SSRL system 300 (e.g., SSRL circuitry 302, and/or training circuitry 308) of FIG. 3.
  • Operations of this embodiment may begin with receiving input data at operation 402.
  • the input data may include an input batch containing a number, N, input data sets.
  • Operation 404 includes transforming the input batch into a plurality of training batches, each training batch containing the number, N, training (i.e., transformed) data sets.
  • Operation 406 includes encoding each training data set into a respective representation feature.
  • Operation 408 includes mapping each representation feature into an embedding space as a respective embedding feature.
  • Operation 410 includes partitioning each embedding feature into a number, S, segments, each segment, s, having a dimension, Ds.
  • Operation 412 includes normalizing each segment to a probability distribution over Ds instantiated attributes using a softmax function.
  • Operation 414 includes repeating operations 406 through 412 for each training batch. In other words, operations 406 through 412 may be performed for each training batch.
  • Operation 420 includes determining an empirical joint probability distribution between the embedding features of the training data sets over the batches of training data sets.
  • Operation 422 includes determining a pure entropy loss based, at least in part, on the empirical joint probability distribution.
  • Operation 424 includes determining an enhanced loss based, at least in part, on the pure entropy loss and based, at least in part, on an inner product term and a balancing factor, λ.
  • Operation 426 includes adjusting respective encoder network parameters and respective projector network parameters based, at least in part, on at least one of the pure entropy loss, and/or the enhanced loss to optimize entropy.
  • Operation 428 includes applying the trained SSRL framework to a selected downstream task. Program flow may then continue at operation 430.
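  • A compact sketch tying operations 404 through 426 together is shown below; it reuses the illustrative helpers sketched earlier in this description (TwinSSRL, make_view_transform, segment_softmax, music_losses), and the optimizer choice and other details are assumptions, not the claimed implementation.

```python
import torch

def train_step(model, optimizer, batch, transform,
               num_segments=102, segment_dim=80, lam=1.0):
    """One training iteration: transform (404), encode and project (406-408),
    partition and normalize for each branch (410-414), joint distribution and
    losses (420-424), and parameter update (426)."""
    x_prime = torch.stack([transform(img) for img in batch])         # training batch X'
    x_double_prime = torch.stack([transform(img) for img in batch])  # training batch X''
    z_prime, z_double_prime = model(x_prime, x_double_prime)
    p_prime = segment_softmax(z_prime, num_segments, segment_dim)
    p_double_prime = segment_softmax(z_double_prime, num_segments, segment_dim)
    _, loss = music_losses(p_prime, p_double_prime, lam=lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# model = TwinSSRL()
# optimizer = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9)
# transform = make_view_transform()
```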
  • an SSRL circuitry and SSRL system may be trained based, at least in part, on a segmented embedded feature vector, and utilizing entropy loss functions.
  • a standard ResNet-50 backbone was used as the encoder that outputs a representation vector of 2,048 units.
  • the base learning rate, base lr, was set to 0.6.
  • a two-layer MLP (multilayer perceptron) was used for the projector (8,192-8,160), with the number of segments S set to 102 and the segment dimension Ds set to 80.
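  • The reported projector output dimension is consistent with the segment layout defined earlier, D = Ds × S, assuming a segment dimension Ds of 80:

```latex
D = S \times D_s = 102 \times 80 = 8{,}160
```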
  • this disclosure relates to a multi-segment informational coding (MUSIC) optimized with an entropy-based loss function for self-supervised representation learning.
  • Experimental results indicate that MUSIC achieves equivalent or better representation learning results compared with existing methods in terms of linear classification.
  • the SSRL framework is configured to ensure that MUSIC can avoid trivial solutions and learn discriminative and diverse features.
  • Experimental data suggest that MUSIC may be effective using a projector with a relatively shallow MLP, and a batch size and an embedding feature dimension smaller than those used in existing methods, while achieving comparable or better results.
  • an SSRL circuitry and SSRL system including MUSIC support an information theory-based representation learning framework.
  • optimized MUSIC embedding features are transform-invariant, discriminative, diverse, and non-trivial.
  • the MUSIC technique does not require an asymmetric network architecture with an extra predictor module, a large batch size of contrastive samples, a memory bank, gradient stopping, or momentum updating.
  • Empirical results suggest that MUSIC does not depend on a relatively high dimension of embedding features or a relatively deep projection head, thus, efficiently reducing a memory and computation cost.
  • As used herein, ANN refers to an artificial neural network, and NN refers to a neural network.
  • Network architectures may include one or more layers that may be sparse, dense, linear, convolutional, and/or fully connected. It may be appreciated that deep learning includes training an ANN.
  • Each ANN may include, but is not limited to, a deep NN (DNN), a convolutional neural network (CNN), a deep CNN (DCNN), a multilayer perceptron (MLP), etc.
  • Training generally corresponds to “optimizing” the ANN, according to a defined metric, e.g., minimizing a cost (e.g., loss) function.
  • “logic” and/or “module” may refer to an app, software, firmware and/or circuitry configured to perform any of the aforementioned operations.
  • Software may be embodied as a software package, code, instructions, instruction sets and/or data recorded on a non-transitory computer readable storage medium.
  • Firmware may be embodied as code, instructions or instruction sets and/or data that are hard-coded (e.g., nonvolatile) in memory devices.
  • Circuitry may include, for example, singly or in any combination, hardwired circuitry, programmable circuitry such as computer processors comprising one or more individual instruction processing cores, state machine circuitry, and/or firmware that stores instructions executed by programmable circuitry.
  • the logic and/or module may, collectively or individually, be embodied as circuitry that forms part of a larger system, for example, an integrated circuit (IC), an application-specific integrated circuit (ASIC), a system on-chip (SoC), desktop computers, laptop computers, tablet computers, servers, smart phones, etc.
  • Memory 312 may include one or more of the following types of memory: semiconductor firmware memory, programmable memory, non-volatile memory, read only memory, electrically programmable memory, random access memory, flash memory, magnetic disk memory, and/or optical disk memory. Either additionally or alternatively system memory may include other and/or later-developed types of computer-readable memory.
  • Embodiments of the operations described herein may be implemented in a computer-readable storage device having stored thereon instructions that, when executed by one or more processors, perform the methods.
  • the processor may include, for example, a processing unit and/or programmable circuitry.
  • the storage device may include a machine readable storage device including any type of tangible, non-transitory storage device, for example, any type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic and static RAMs, erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), flash memories, magnetic or optical cards, or any type of storage devices suitable for storing electronic instructions.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Editing Of Facsimile Originals (AREA)

Abstract

In one embodiment, a self-supervised representation learning (SSRL) circuitry is described. The SSRL circuitry includes a transformer circuitry configured to receive input data. The input data includes an input batch containing a number, N, of input data sets. The transformer circuitry is configured to transform the input batch into a plurality of training batches. Each training batch contains the number N of training data sets. The SSRL circuitry further includes, for each training batch: a respective encoder circuitry, a respective projector circuitry, and a respective partitioning circuitry. The respective encoder circuitry is configured to encode each training data set into a respective representation feature. The respective projector circuitry is configured to map each representation feature into an embedding space as a respective embedding feature vector.
PCT/US2023/025137 2022-06-13 2023-06-13 Self-supervised representation learning with multi-segmental informational coding WO2023244567A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263351610P 2022-06-13 2022-06-13
US63/351,610 2022-06-13
US202363472618P 2023-06-13 2023-06-13
US63/472,618 2023-06-13

Publications (1)

Publication Number Publication Date
WO2023244567A1 true WO2023244567A1 (fr) 2023-12-21

Family

ID=89191881

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/025137 WO2023244567A1 (fr) Self-supervised representation learning with multi-segmental informational coding

Country Status (1)

Country Link
WO (1) WO2023244567A1 (fr)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210350176A1 (en) * 2019-03-12 2021-11-11 Hoffmann-La Roche Inc. Multiple instance learner for prognostic tissue pattern identification
US20200367974A1 (en) * 2019-05-23 2020-11-26 Surgical Safety Technologies Inc. System and method for surgical performance tracking and measurement
WO2021220008A1 (fr) * 2020-04-29 2021-11-04 Deep Render Ltd Procédés et systèmes de compression et décodage d'image, et de compression et décodage vidéo
US20210383225A1 (en) * 2020-06-05 2021-12-09 Deepmind Technologies Limited Self-supervised representation learning using bootstrapped latent representations
US20220129699A1 (en) * 2020-10-26 2022-04-28 Robert Bosch Gmbh Unsupervised training of a video feature extractor
US20220171938A1 (en) * 2020-11-30 2022-06-02 Oracle International Corporation Out-of-domain data augmentation for natural language processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
NIU CHUANG, WANG GE: "Self-Supervised Representation Learning With MUlti-Segmental Informational Coding (MUSIC)", ARXIV (CORNELL UNIVERSITY), CORNELL UNIVERSITY LIBRARY, ARXIV.ORG, ITHACA, 13 June 2022 (2022-06-13), Ithaca, pages 1 - 12, XP093122736, DOI: 10.48550/arxiv.2206.06461 *
XIAOKANG CHEN; MINGYU DING; XIAODI WANG; YING XIN; SHENTONG MO; YUNHAO WANG; SHUMIN HAN; PING LUO; GANG ZENG; JINGDONG WANG: "Context Autoencoder for Self-Supervised Representation Learning", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 30 May 2022 (2022-05-30), 201 Olin Library Cornell University Ithaca, NY 14853, XP091228217 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23824487

Country of ref document: EP

Kind code of ref document: A1