WO2024072924A2 - Scalable feature selection via sparse learnable masks - Google Patents

Scalable feature selection via sparse learnable masks Download PDF

Info

Publication number
WO2024072924A2
WO2024072924A2 PCT/US2023/033924 US2023033924W WO2024072924A2 WO 2024072924 A2 WO2024072924 A2 WO 2024072924A2 US 2023033924 W US2023033924 W US 2023033924W WO 2024072924 A2 WO2024072924 A2 WO 2024072924A2
Authority
WO
WIPO (PCT)
Prior art keywords
features
learnable
mask vector
training
sparse
Prior art date
Application number
PCT/US2023/033924
Other languages
English (en)
Other versions
WO2024072924A3 (fr)
Inventor
Sercan Omer ARIK
Yihe DONG
Original Assignee
Google Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US18/372,900 external-priority patent/US20240112084A1/en
Application filed by Google Llc filed Critical Google Llc
Publication of WO2024072924A2 publication Critical patent/WO2024072924A2/fr
Publication of WO2024072924A3 publication Critical patent/WO2024072924A3/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0495Quantised networks; Sparse networks; Compressed networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/094Adversarial learning

Definitions

  • a smaller number of features can yield superior generalization, and hence better test accuracy, by minimizing information extraction from spurious patterns that do not hold consistently and by more optimally utilizing the model capacity on the most relevant features.
  • reducing the number of input features can decrease the computational complexity and cost for deployed models, allowing for a decrease in infrastructure requirements to support the features, as the deployed models can learn mappings from input data with smaller dimensions.
  • reducing the number of input features can improve interpretability and controllability, as users can focus on understanding outputs of deployed models from a smaller subset of input features.
  • feature selection can consider the predictive model itself, as an optimal set of features would depend on how mapping occurs between inputs and outputs. This may be referred to as embedded feature selection and can include regularization techniques and extensions.
  • SLM sparse learnable masks
  • SLM further employs an objective that increases mutual information (MI) between selected features and labels in an efficient and scalable manner.
  • MI mutual information
  • SLM can achieve or improve upon state-of-the-art results on several benchmark datasets, often by a significant margin, while reducing computational complexity and cost.
  • An aspect of the disclosure provides for a method for training a machine learning model with scalable feature selection, including: receiving, by one or more processors, a plurality of features for training the machine learning model; initializing, by the one or more processors, a learnable mask vector representing the plurality of features; receiving, by the one or more processors, a number of features to be selected; generating, by the one or more processors, a sparse mask vector from the learnable mask vector; selecting, by the one or more processors, a selected set of features of the plurality of features based on the sparse mask vector and the number of features to be selected; computing, by the one or more processors, a mutual information based error based on the selected set of features being input into the machine learning model; and updating, by the one or more processors, the learnable mask vector based on the mutual information based error.
  • the method further includes receiving, by the one or more processors, a total number of training steps.
  • the receiving, generating, selecting, computing, and updating is iterative for the total number of training steps.
  • the learnable mask vector updated after the total number of training steps includes a final selected set of features to be utilized by the machine learning model.
  • training the machine learning model further includes gradient-descent based learning.
  • the method further includes removing non-selected features of the plurality of features.
  • the method further includes applying a sparsemax normalization to the learnable mask vector.
  • the method further includes gradually decreasing the number of features over a total number of training steps until reaching a target number of features to be selected. In yet another example, gradually decreasing the number of features to be selected is based on a discrete number of evenly spaced steps.
  • selecting the selected set of features further includes multiplying the sparse mask vector by a positive scalar based on a predetermined number of features.
  • computing the mutual information based error is based on maximizing mutual information between a distribution of the selected set of features and a distribution of labels for the selected set of features.
  • updating the learnable mask vector is based on minimizing the mutual information based error.
  • Another aspect of the disclosure provides for a system including: one or more processors; and one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for training a machine learning model with scalable feature selection, the operations including: receiving a plurality of features for training the machine learning model; initializing a learnable mask vector representing the plurality of features; receiving a number of features to be selected; generating a sparse mask vector from the learnable mask vector; selecting a selected set of features of the plurality of features based on the sparse mask vector and the number of features to be selected; computing a mutual information based error based on the selected set of features being input into the machine learning model; and updating the learnable mask vector based on the mutual information based error.
  • the operations further include receiving a total number of training steps; the receiving, generating, selecting, computing, and updating is iterative for the total number of training steps; and the learnable mask vector updated after the total number of training steps includes a final selected set of features to be utilized by the machine learning model.
  • the operations further include removing non-selected features of the plurality of features.
  • the operations further include applying a sparsemax normalization to the learnable mask vector.
  • the operations further include gradually decreasing the number of features over a total number of training steps until reaching a target number of features to be selected.
  • selecting the selected set of features further includes multiplying the sparse mask vector by a positive scalar based on a predetermined number of features.
  • computing the mutual information based error is based on maximizing mutual information between a distribution of the selected set of features and a distribution of labels for the selected set of features.
  • Yet another aspect of the disclosure provides for a non-transitory computer readable medium for storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for training a machine learning model with scalable feature selection, the operations including: receiving a plurality of features for training the machine learning model; initializing a learnable mask vector representing the plurality of features; receiving a number of features to be selected; generating a sparse mask vector from the learnable mask vector; selecting a selected set of features of the plurality of features based on the sparse mask vector and the number of features to be selected; computing a mutual information based error based on the selected set of features being input into the machine learning model; and updating the learnable mask vector based on the mutual information based error.
  • FIG. 1 depicts a block diagram of an example sparse learnable masks system for scalable feature selection according to aspects of the disclosure.
  • FIG. 2 depicts a block diagram of an example environment for implementing a sparse learnable masks system according to aspects of the disclosure.
  • FIG. 3 depicts a block diagram illustrating one or more machine learning model architectures according to aspects of the disclosure.
  • FIG. 4 depicts a flow diagram of an example process for training a machine learning model using scalable feature selection according to aspects of the disclosure.
  • FIG. 5 depicts a flow diagram of an example process for performing a training step for training the machine learning model using scalable feature selection according to aspects of the disclosure.
  • FIG. 6 depicts a table comparing accuracy for the sparse learnable masks system to other feature selection approaches over various datasets according to aspects of the disclosure.
  • FIG. 7 depicts a table comparing accuracy for the sparse learnable masks system to other feature selection approaches over various numbers of features to be selected according to aspects of the disclosure.
  • DETAILED DESCRIPTION: The technology relates generally to scalable feature selection, which may be referred to herein as sparse learnable masks (SLM).
  • SLM can be integrated into any deep learning or machine learning architecture due to its gradient-descent based optimization.
  • SLM can utilize end-to-end learning through joint training with predictive models.
  • SLM can improve scaling of feature selection, yielding a target number of features even when the number of input features or samples is large.
  • SLM can modify learnable masks to select the target number of features while addressing differentiability challenges.
  • SLM can utilize improved mutual information (MI) regularization based on a quadratic relaxation of the MI between labels and selected features, conditioned on the probability that a feature is selected.
  • the sparse learnable masks system 100 can be implemented on one or more computing devices in one or more locations.
  • the sparse learnable masks system 100 can be configured to receive input data 102, such as inference data and/or training data, for use in selecting features to train one or more machine learning models.
  • the sparse learnable masks system 100 can receive the input data 102 as part of a call to an application programming interface (API) exposing the sparse learnable masks system 100 to one or more computing devices.
  • API application programming interface
  • the input data 102 can also be provided to the sparse learnable masks system 100 through a storage medium, such as remote storage connected to the one or more computing devices over a network.
  • the input data 102 can further be provided as input through a user interface on a client computing device coupled to the sparse learnable masks system 100.
  • the input data 102 can include training data associated with feature selection, such as covariate input data and target labels.
  • the input data 102 can be numerical, such as categorical features mapped to embeddings.
  • the input data 102 can include training data for any machine learning task, such as medical diagnosis, image classification, speech recognition, price forecasting, and/or fraud detection.
  • the training data can be split into a training set, a validation set, and/or a testing set.
  • An example training/validation/testing split can be an 80/10/10 split, although any other split may be possible.
  • the training data can include examples of features and labels associated with the machine learning task.
  • the training data can be in any form suitable for training a machine learning model, according to one of a variety of different learning techniques.
  • Learning techniques for training a model can include supervised learning, unsupervised learning, semi-supervised learning techniques, parameter-efficient techniques, and reinforcement learning techniques.
  • the training data can include multiple training examples that can be received as input by a model.
  • the training examples can be labeled with a desired output for the model when processing the labeled training examples.
  • the label and the model output can be evaluated through a loss function to determine an error, which can be back propagated through the model to update weights for the model.
  • a loss function can be applied to calculate an error between the model outputs and a ground-truth label of a training example processed by the model.
  • Any of a variety of loss or error functions appropriate for the type of the task the model is being trained for can be utilized, such as cross-entropy loss for classification tasks, or mean square error for regression tasks.
  • the gradient of the error with respect to the different weights of the candidate model on candidate hardware can be calculated, for example using a backpropagation algorithm, and the weights for the model can be updated.
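  • As an illustration of the generic supervised update described above, the following sketch applies a cross-entropy loss and backpropagation to a toy predictor; the model, optimizer, shapes, and learning rate are assumptions chosen only for this example.

```python
# Illustrative sketch of the generic training update described above (not the
# disclosed system): compute a loss between model outputs and ground-truth labels,
# backpropagate the error, and update the model weights.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(in_features=16, out_features=3)   # toy predictor
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

features = torch.randn(8, 16)           # batch of training examples
labels = torch.randint(0, 3, (8,))      # ground-truth labels

logits = model(features)
loss = F.cross_entropy(logits, labels)  # error between outputs and labels
loss.backward()                         # gradient of the error w.r.t. the weights
optimizer.step()                        # update the weights
optimizer.zero_grad()
```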
  • the sparse learnable masks system 100 can be configured to output one or more results related to scalable feature selection, generated as output data 104.
  • the output data 104 can include selected features associated with a machine learning task.
  • the sparse learnable masks system 100 can be configured to send the output data 104 for display on a client or user display.
  • the sparse learnable masks system 100 can be configured to provide the output data 104 as a set of computer-readable instructions, such as one or more computer programs.
  • the computer programs can be written in any type of programming language, and according to any programming paradigm, e.g., declarative, procedural, assembly, object-oriented, data-oriented, functional, or imperative.
  • the computer programs can be written to perform one or more different functions and to operate within a computing environment, e.g., on a physical device, virtual machine, or across multiple devices.
  • the computer programs can also implement functionality described herein, for example, as performed by a system, engine, module, or model.
  • the sparse learnable masks system 100 can further be configured to forward the output data 104 to one or more other devices configured for translating the output data into an executable program written in a computer programming language.
  • the sparse learnable masks system 100 can also be configured to send the output data 104 to a storage device for storage and later retrieval.
  • the sparse learnable masks system 100 can integrate a feature selection layer into a machine learning architecture guided by gradient-descent based learning. As an example, let x ∈ R^d denote covariate input data and y denote a target, such as class labels.
  • the sparse learnable masks system 100 can be integrated with a predictor model f_θ, with learnable parameters θ, that is applied to selected features x_s ∈ R^{k_s}, where k_s denotes the number of selected features at step s.
  • the predictor model can be any architecture trained via gradient descent, such as a multi-layer perceptron or deep tabular data learning. Multiplication by a binary mask m can indicate the feature selection operation.
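  • As a minimal sketch of this setup, feature selection can be expressed as elementwise multiplication of the input by a binary mask; the dimensions, indices, and names below are assumptions for illustration only.

```python
# Minimal sketch: multiplying the input by a binary mask implements the feature
# selection operation; a predictor f_theta would then consume the masked input.
import numpy as np

d = 10                                  # number of input features (assumed)
x = np.random.randn(d)                  # covariate input for one sample
mask = np.zeros(d)                      # binary mask m with entries in {0, 1}
mask[[1, 4, 7]] = 1.0                   # suppose three features are selected

x_selected = x * mask                   # non-selected features are zeroed out
```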
  • the sparse learnable masks system 100 can perform training for scalable feature selection as follows.
  • Task loss may refer to a target prediction task of the dataset, such as medical diagnosis, image classification, speech recognition, price forecasting, and/or fraud detection.
  • MI loss may refer to how well one or more selected features align with target labels of the dataset. For instance, MI loss may refer to a reduction in uncertainty for the one or more selected features given the target labels of the dataset. Higher values of MI loss may indicate a stronger correlation between the one or more selected features and the target labels, while lower values of MI loss may indicate a weaker correlation between the one or more selected features and the target labels.
  • the sparse learnable masks system 100 can include a normalization engine 106, a tempering engine 108, a mask scaling engine 110, and a mutual information engine 112.
  • the normalization engine 106, tempering engine 108, mask scaling engine 110, and mutual information engine 112 can be implemented as one or more computer programs, specially configured electronic circuitry, or any combination thereof.
  • the normalization engine 106 can be configured to combine sparse non-linear normalization with learnable feature selection vectors in SLM.
  • the normalization engine 106 can perform a sparsemax normalization to achieve feature sparsity.
  • Sparsemax may refer to any normalization operation able to achieve sparsity, e.g., the output includes more than a threshold amount of 0 elements.
  • the normalization engine 106 can perform sparsemax normalization by returning a Euclidean projection of an input vector onto a probability simplex.
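  • The sketch below shows one common way to compute such a sparsemax normalization, the Euclidean projection onto the probability simplex; it is an illustrative implementation, not necessarily the one used by the normalization engine 106.

```python
# Sparsemax as the Euclidean projection of a vector onto the probability simplex.
# Illustrative sketch only; many entries of the output become exactly zero.
import numpy as np

def sparsemax(z: np.ndarray) -> np.ndarray:
    z_sorted = np.sort(z)[::-1]               # sort entries in decreasing order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = k * z_sorted > cumsum - 1       # entries kept in the support
    k_max = k[support][-1]
    tau = (cumsum[k_max - 1] - 1) / k_max     # projection threshold
    return np.maximum(z - tau, 0.0)           # sparse output that sums to 1

print(sparsemax(np.array([0.1, 1.2, 0.3, 0.9])))   # e.g., [0., 0.65, 0., 0.35]
```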
  • the tempering engine 108 can be configured to gradually decrease a number of features selected until reaching a target number of selected features.
  • the tempering engine 108 can further be configured to decrease the number of features based on a discrete number of steps. The discrete number of steps can be evenly spaced. For example, the tempering engine 108 can decrease the number of features after every five steps. The tempering engine 108 allows the predictor model to learn from more than the final target number of features during training.
  • the tempering engine 108 further allows for a more robust initialization for training the predictor model based on learning from all features initially compared to starting learning with the target number of features, as the randomness in the initial selection is seldom optimal.
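  • A tempering schedule of this kind can be sketched as follows; the linear decrease and the parameter names are assumptions for illustration, with the decrease applied every few evenly spaced steps until the target is reached.

```python
# Sketch of a tempering schedule: start from all features and decrease the number
# of selected features every few evenly spaced steps until the target is reached.
def num_features_at_step(step: int, total_steps: int, num_features: int,
                         target_features: int, decrease_every: int = 5) -> int:
    stages = max(total_steps // decrease_every, 1)          # number of decreases
    drop_per_stage = (num_features - target_features) / stages
    current = num_features - drop_per_stage * (step // decrease_every)
    return max(int(round(current)), target_features)        # clamp at the target

# Example: 100 input features tempered down toward 50 over 200 training steps.
print([num_features_at_step(s, 200, 100, 50) for s in (0, 50, 100, 199)])
```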
  • the mask scaling engine 110 can be configured to scale the sparse feature mask to achieve a predetermined number of non-zero features.
  • the sparsity in the sparsemax normalization can be based on where the projection lands on the probability simplex Δ^{d−1}.
  • the mask scaling engine 110 can adjust the projection of the mask vector z onto Δ^{d−1}, such as by multiplying z by a positive scalar. Larger scalars may increase sparsity while smaller scalars may decrease sparsity.
  • the probability simplex Δ^1 in R^2 is the line connecting (0,1) and (1,0), with these two points as the simplex boundary.
  • Let z = (α, β) be a point in R^2 and let (x, y) be the projection of the point onto Δ^1. Depending on how z is scaled, sparsemax(cz) can have varying degrees of sparsity.
  • The squared distance from a candidate point (x, y) to (α, β) can be written as f(x, y) ≜ (β − y)^2 + (α − x)^2 = β^2 − 2βy + y^2 + α^2 − 2αx + x^2, so that f(0, 1) − f(0.5, 0.5) = α − β + 0.5.
  • Accordingly, sparsemax(c(α, β)) is closer to (0,1) ∈ Δ^1 whenever c > 1/(2(β − α)) and closer to (0.5, 0.5) otherwise. Since the projection is linear, varying the multiplier c varies the sparsity of sparsemax(c(α, β)).
  • the mask scaling engine 110 can obtain a predetermined number of nonzero elements in the sparse feature mask by multiplying by a scalar. For example, given a vector z ∈ R^d, the mask scaling engine 110 can obtain k nonzero elements in sparsemax(z) by multiplying z by the scalar σ = (Σ_{i=1}^{k+1} z_(i) − (k+1) z_(k+1))^{−1} (4), where z_(i) denotes the i-th largest entry of z.
  • the mutual information engine 112 can be configured to increase the mutual information (MI) between the distribution of the selected features and the distribution of the labels as an inductive bias to the model that accounts for sample labels during feature selection.
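  • Equation (4) can be sketched as follows, reusing the sparsemax() function from the earlier sketch; it assumes the entries of z are distinct around rank k so that the scaled projection has exactly k nonzero entries.

```python
# Sketch of equation (4): scale z by a positive scalar so that sparsemax of the
# scaled vector has exactly k nonzero entries. Reuses sparsemax() from the sketch
# above and assumes distinct entries around rank k.
import numpy as np

def scale_for_k_nonzeros(z: np.ndarray, k: int) -> np.ndarray:
    if k >= z.size:                                 # keep all features unchanged
        return z
    z_sorted = np.sort(z)[::-1]                     # z_(1) >= z_(2) >= ...
    sigma = 1.0 / (np.sum(z_sorted[:k + 1]) - (k + 1) * z_sorted[k])
    return sigma * z

z = np.array([0.4, 0.1, 0.9, 0.3, 0.7])
for k in (1, 2, 3):
    m = sparsemax(scale_for_k_nonzeros(z, k))
    print(k, int(np.count_nonzero(m)))              # prints 1, 2, 3 nonzeros
```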
  • the mutual information engine 112 can maximize the MI between the distribution of the selected features and the distribution of the labels.
  • the mutual information engine 112 can condition the MI on the probability that a feature is selected given by the mask ⁇ .
  • X can denote a random variable representing features and Y can denote a random variable representing labels, with values x ∈ 𝒳 and y ∈ 𝒴. Maximizing the conditional or the joint MI between selected features and labels can require computation of an exponential number of probabilities, the optimization of which can be intractable. Therefore, the mutual information engine 112 can conduct a quadratic relaxation of the MI which is end-to-end differentiable.
  • the mutual information engine 112 can perform optimization to increase MI based on a quadratic relaxation I_Q(X, Y) that simplifies the mutual information I(X, Y) while retaining much of its properties, allowing for a reduction in computation cost and memory usage.
  • the quadratic relaxation can be written as I_Q(X, Y) ≜ Σ_{x∈𝒳} Σ_{y∈𝒴} p_{X,Y}(x, y)^2 / p_X(x) − Σ_{y∈𝒴} p_Y(y)^2 (6), where both terms are convex with respect to p_{X,Y} and p_Y.
  • I_Q(X, Y) can approximate I(X, Y) when p_{X,Y}(x, y)/p_X(x) and p_Y(y) are within (1 − ε, 1 + ε).
  • the relaxed quantity t can be p_{X,Y}(x, y) in the first term and p_Y(y) in the second. Since both p_{X,Y}(x, y) and p_Y(y) are probabilities and sum to 1 across the label space for any given sample, the linear term −3t/2 does not affect gradient descent optimization. Normalization can be a hard constraint enforced during training that supersedes this linear term in the objective. Therefore, during optimization, the linear term contributes only a constant.
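  • The quadratic relaxation in equation (6) can be sketched as below from an explicit joint probability table; during training the probabilities would instead be estimated from the batch, so the table here is only a stand-in.

```python
# Sketch of the quadratic MI relaxation in equation (6):
#   I_Q(X, Y) = sum_x sum_y p_XY(x, y)^2 / p_X(x) - sum_y p_Y(y)^2
import numpy as np

def quadratic_mi(p_xy: np.ndarray) -> float:
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p_X(x)
    p_y = p_xy.sum(axis=0)                     # marginal p_Y(y)
    return float((p_xy ** 2 / p_x).sum() - (p_y ** 2).sum())

independent = np.outer([0.5, 0.5], [0.5, 0.5])   # X and Y independent
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])   # X fully determines Y
print(quadratic_mi(independent), quadratic_mi(dependent))   # 0.0 and 0.5
```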
  • the mutual information engine 112 can connect I_Q(X, Y) with predictions from the predictor model using Lagrange multipliers.
  • Let f(x, y): 𝒳 × 𝒴 → [0, 1] denote a probability outcome of the predictor model for sample x and outcome y.
  • the below equations model a discrete label case, such as for classification, but a case where labels are continuous can be reduced to the discrete label case through quantization.
  • an error can be defined as E(X, Y) ≜ 1 − Σ_{y∈𝒴} p_Y(y)^2 − I_Q(X, Y) (8); expressing it in terms of p_{X,Y} and p_X expresses the error as a function of I_Q(X, Y).
  • the mutual information engine 112 can select a given number of features that reduce, e.g., minimize, E(X, Y).
  • For notation, ℐ can denote the index set of the dataset samples, 𝒟 can denote the index set of the features, and 𝒴 can denote the set of possible labels.
  • S ⊆ 𝒟 can denote the index set of the selected features, and X_S^i can denote the random variable representing the selected subset of features for the i-th sample.
  • the error can be defined empirically by summing, over samples x and labels y weighted by p_{X,Y}(x, y), squared terms of the form (1 − f(x, y))^2 + Σ_{y′∈𝒴∖{y}} f(x, y′)^2, and reducing, e.g., minimizing, this error under one or more consistency constraints.
  • the mutual information engine 112 can vectorize the regularization term for a parallel computation of the (X_S^i, X_S^j) pairs per batch.
  • the reduction, e.g., minimization, objective with the consistency regularization can be derived similarly.
  • the consistency regularization can be expressed in terms of squared differences involving the selected-feature representations X_S^i across the samples in a batch.
  • the consistency regularization in the MI-increasing objective E(X, Y) can have a complexity of O(n^2·k), where n is the batch size and k is the number of selected features, as the calculation occurs over the selected feature index set and is done between each sample and the other samples in its batch.
  • the non-regularization component in E(X, Y) has a complexity of O(n·C), where C is the constant number of discrete or binned labels.
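  • A heavily simplified sketch of this error is shown below: for each sample it penalizes (1 − f(x, y))^2 on the observed label and f(x, y′)^2 on the remaining labels, averaged over the batch; the probability weighting and the consistency constraints of the disclosure are not reproduced.

```python
# Simplified sketch of the MI-based error described above; the exact weighting and
# the consistency regularization from the disclosure are not reproduced here.
import torch

def mi_based_error(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # probs: (batch, num_labels) predictor outputs f(x, y); labels: (batch,)
    one_hot = torch.nn.functional.one_hot(labels, probs.shape[1]).float()
    per_label = (one_hot - probs) ** 2      # (1 - f)^2 on the true label,
    return per_label.sum(dim=1).mean()      # f^2 on all other labels

probs = torch.tensor([[0.8, 0.1, 0.1], [0.2, 0.6, 0.2]])
labels = torch.tensor([0, 1])
print(mi_based_error(probs, labels))        # tensor(0.1500)
```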
  • FIG. 2 depicts a block diagram of an example environment 200 for implementing a sparse learnable masks system 218.
  • the sparse learnable masks system 218 can be implemented on one or more devices having one or more processors in one or more locations, such as in server computing device 202.
  • Client computing device 204 and the server computing device 202 can be communicatively coupled to one or more storage devices 206 over a network 208.
  • the storage devices 206 can be a combination of volatile and non-volatile memory and can be at the same or different physical locations than the computing devices 202, 204.
  • the storage devices 206 can include any type of non-transitory computer readable medium capable of storing information, such as a hard-drive, solid state drive, tape drive, optical storage, memory card, ROM, RAM, DVD, CD-ROM, write-capable, and read-only memories.
  • the server computing device 202 can include one or more processors 210 and memory 212.
  • the memory 212 can store information accessible by the processors 210, including instructions 214 that can be executed by the processors 210.
  • the memory 212 can also include data 216 that can be retrieved, manipulated, or stored by the processors 210.
  • the memory 212 can be a type of transitory or non-transitory computer readable medium capable of storing information accessible by the processors 210, such as volatile and non-volatile memory.
  • the processors 210 can include one or more central processing units (CPUs), graphic processing units (GPUs), field-programmable gate arrays (FPGAs), and/or application-specific integrated circuits (ASICs), such as tensor processing units (TPUs).
  • the instructions 214 can include one or more instructions that, when executed by the processors 210, cause the one or more processors 210 to perform actions defined by the instructions 214.
  • the instructions 214 can be stored in object code format for direct processing by the processors 210, or in other formats including interpretable scripts or collections of independent source code modules that are interpreted on demand or compiled in advance.
  • the instructions 214 can include instructions for implementing a sparse learnable masks system 218, which can correspond to the sparse learnable masks system 100 of FIG. 1.
  • the sparse learnable masks system 218 can be executed using the processors 210, and/or using other processors remotely located from the server computing device 202.
  • the data 216 can be retrieved, stored, or modified by the processors 210 in accordance with the instructions 214.
  • the data 216 can be stored in computer registers, in a relational or non-relational database as a table having a plurality of different fields and records, or as JSON, YAML, proto, or XML documents.
  • the data 216 can also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII, or Unicode.
  • the data 216 can include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories, including other network locations, or information that is used by a function to calculate relevant data.
  • the client computing device 204 can also be configured similarly to the server computing device 202, with one or more processors 220, memory 222, instructions 224, and data 226.
  • the client computing device 204 can also include a user input 228 and a user output 230.
  • the user input 228 can include any appropriate mechanism or technique for receiving input from a user, such as keyboard, mouse, mechanical actuators, soft actuators, touchscreens, microphones, and sensors.
  • the server computing device 202 can be configured to transmit data to the client computing device 204, and the client computing device 204 can be configured to display at least a portion of the received data on a display implemented as part of the user output 230.
  • the user output 230 can also be used for displaying an interface between the client computing device 204 and the server computing device 202.
  • the user output 230 can alternatively or additionally include one or more speakers, transducers or other audio outputs, a haptic interface or other tactile feedback that provides non-visual and non-audible information to the platform user of the client computing device 204.
  • Although FIG. 2 illustrates the processors 210, 220 and the memories 212, 222 as being within the respective computing devices 202, 204, components described herein can include multiple processors and memories that can operate in different physical locations and not within the same computing device.
  • some of the instructions 214, 224 and the data 216, 226 can be stored on a removable SD card and others within a read-only computer chip. Some or all of the instructions 214, 224 and data 216, 226 can be stored in a location physically remote from, yet still accessible by, the processors 210, 220. Similarly, the processors 210, 220 can include a collection of processors that can perform concurrent and/or sequential operation.
  • the computing devices 202, 204 can each include one or more internal clocks providing timing information, which can be used for time measurement for operations and programs run by the computing devices 202, 204.
  • the server computing device 202 can be connected over the network 208 to a data center 232 housing any number of hardware accelerators 234.
  • the data center 232 can be one of multiple data centers or other facilities in which various types of computing devices, such as hardware accelerators, are located. Computing resources housed in the data center 232 can be specified for deploying models with scalable feature selection, as described herein.
  • the server computing device 202 can be configured to receive requests to process data from the client computing device 204 on computing resources in the data center 232.
  • the environment 200 can be part of a computing platform configured to provide a variety of services to users, through various user interfaces and/or application programming interfaces (APIs) exposing the platform services.
  • the variety of services can include medical diagnosis, image classification, speech recognition, price forecasting, and/or fraud detection utilizing the scalable feature selection as described herein.
  • the client computing device 204 can transmit input data associated with feature selection, such as covariate input data and target labels.
  • the sparse learnable masks system 218 can receive the input data, and in response, generate output data including a selected predetermined number of features with increased mutual information between the selected features and the labels.
  • the server computing device 202 can maintain a variety of models in accordance with different constraints available at the data center 232. For example, the server computing device 202 can maintain different families for deploying models on various types of TPUs and/or GPUs housed in the data center 232 or otherwise available for processing.
  • FIG. 3 depicts a block diagram 300 illustrating one or more machine learning model architectures 302, more specifically 302A-N for each architecture, for deployment in a datacenter 304 housing a hardware accelerator 306 on which the deployed machine learning models 302 will execute, such as for scalable feature selection as described herein.
  • the hardware accelerator 306 can be any type of processor, such as a CPU, GPU, FPGA, or ASIC such as a TPU.
  • An architecture 302 of a machine learning model can refer to characteristics defining the model, such as characteristics of layers for the model, how the layers process input, or how the layers interact with one another.
  • the architecture 302 of the machine learning model can also define types of operations performed within each layer.
  • One or more machine learning model architectures 302 can be generated that can output results, such as for scalable feature selection.
  • Example model architectures 302 can correspond to a multi-layer perceptron and/or deep tabular data learning.
  • the devices 202, 204 and the data center 232 can be capable of direct and indirect communication over the network 208.
  • the client computing device 204 can connect to a service operating in the data center 232 through an Internet protocol.
  • the devices 202, 204 can set up listening sockets that may accept an initiating connection for sending and receiving information.
  • the network 208 can include various configurations and protocols including the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, and private networks using communication protocols proprietary to one or more companies.
  • the network 208 can support a variety of short- and long-range connections.
  • the short- and long-range connections may be made over different bandwidths, such as 2.402 GHz to 2.480 GHz, commonly associated with the Bluetooth® standard; 2.4 GHz and 5 GHz, commonly associated with the Wi-Fi® communication protocol; or with a variety of communication standards, such as the LTE® standard for wireless broadband communication.
  • the network 208 in addition or alternatively, can also support wired connections between the devices 202, 204 and the data center 232, including over various types of Ethernet connection.
  • FIG. 4 depicts a flow diagram of an example process 400 for training a machine learning model using scalable feature selection.
  • the example process 400 can be performed on a system of one or more processors in one or more locations, such as the sparse learnable masks system 100 as depicted in FIG. 1.
  • the sparse learnable masks system 100 can be configured to receive a plurality of features for training a machine learning model.
  • the plurality of features can be numerical, such as categorical features mapped to embeddings.
  • the plurality of features can be associated with any machine learning task, such as medical diagnosis, image classification, speech recognition, price forecasting, and/or fraud detection.
  • the machine learning model can include any architecture trained using gradient descent, such as a multi- layer perceptron or a deep tabular data learning model.
  • the sparse learnable masks system 100 can further receive a plurality of labels associated with the plurality of features.
  • each of the plurality of labels can denote a target based on one or more of the plurality of features, such as classification labels.
  • the sparse learnable masks system 100 can be configured to receive a total number of training steps.
  • the total number of training steps can correspond to a number of iterations in training the machine learning model, such as for a particular machine learning task.
  • the number of total training steps is a technical parameter that is usable to trade off the computational and memory resource demands of the training phase against the accuracy of the feature selection and neural network training process.
  • the sparse learnable masks system 100 can be configured to initialize a learnable mask vector representing the plurality of features.
  • the learnable mask vector can be initialized with all ones to indicate all of the plurality of features are initially selected.
  • the sparse learnable masks system 100 can be configured to iteratively perform a training step for the total number of training steps. Performing the training step can include receiving a number of features to be selected for the training step; generating a sparse mask vector from the learnable mask vector; selecting a set of features of the plurality of features based on the sparse mask vector and the number of features to be selected; computing a mutual information based error based on the selected set of features being input into the machine learning model; and updating the learnable mask vector based on the mutual information-based error.
  • the sparse learnable masks system 100 can be configured to output a final selected set of features, represented by the updated learnable mask vector, to be utilized by the machine learning model.
  • the updated learnable mask vector can have ones indicating selected features and zeros indicating non-selected features.
  • the sparse learnable masks system 100 can further output a trained machine learning model that utilizes the final selected set of features when performing the machine learning task.
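  • An outline of this training process is sketched below; the function and variable names are assumptions, and slm_training_step stands for the per-step procedure of FIG. 5, for which a sketch follows that description.

```python
# Outline of the training process of FIG. 4 (illustrative, with assumed names):
# initialize the learnable mask to all ones, run the per-step procedure for the
# given number of steps, and return the final mask of selected features.
import torch

def train_with_slm(features, labels, model, total_steps: int, target_k: int):
    num_features = features.shape[1]
    mask = torch.ones(num_features, requires_grad=True)   # all features selected
    for step in range(total_steps):
        # slm_training_step is sketched after the description of FIG. 5 below.
        mask = slm_training_step(features, labels, model, mask,
                                 step, total_steps, target_k)
    return mask   # nonzero entries indicate the final selected features
```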
  • FIG. 5 depicts a flow diagram of an example process 500 for performing a training step for training the machine learning model using scalable feature selection.
  • the example process 500 can be performed on a system of one or more processors in one or more locations, such as the sparse learnable masks system 100 as depicted in FIG. 1.
  • the sparse learnable masks system 100 can be configured to receive a number of features to be selected for the training step.
  • the number of features to be selected can be gradually decreased over a total number of training steps until reaching a target number of features to be selected.
  • the target number of features to be selected is a technical parameter, since it permits control of the resource demands, in particular the memory requirements, during training and also during the subsequent inference process. Gradually decreasing the number of features to be selected can be based on a discrete number of evenly spaced steps.
  • the number of features to be selected can start at 50 features and gradually decrease every 5th training step.
  • the gradual decrease of selected features can be based on a tempering threshold, which can be a fraction of the number of training steps.
  • the sparse learnable masks system 100 can be configured to generate a sparse mask vector from the learnable mask vector.
  • the sparse learnable masks system 100 can apply a sparsemax normalization to the learnable mask vector.
  • the sparse learnable masks system 100 can return a Euclidean projection of the learnable mask vector onto a probability simplex. The projection can scale values in the learnable mask vector to be equidistributed over [0,1].
  • the sparse learnable masks system 100 can be configured to select a set of features of the plurality of features based on the sparse mask vector and the number of features to be selected.
  • the sparse learnable masks system 100 can scale the sparse feature mask to achieve a predetermined number of non-zero values representing a set of features to be selected.
  • the sparse learnable masks system 100 can adjust the projection in the learnable mask vector by multiplying by a positive scalar. Larger positive scalars may increase sparsity while smaller positive scalars may decrease sparsity in the sparse mask vector.
  • the sparse learnable masks system 100 can be configured to compute a mutual information based error based on the selected set of features being input into the machine learning model. For example, computing the mutual information based error can be based on maximizing mutual information between a distribution of the selected set of features and a distribution of labels for the selected set of features. The mutual information can be conditioned on the probability that a feature is selected based on the sparse feature mask.
  • the sparse learnable masks system 100 can further be configured to compute a training task loss based on the selected set of features being input into the machine learning model, such as by calculating a difference between model predictions and dataset labels.
  • the sparse learnable masks system 100 can update the learnable mask vector based on the mutual information based error.
  • the sparse learnable masks system 100 can further update the learnable mask vector based on the training task loss.
  • the sparse learnable masks system 100 can further update one or more parameters for the machine learning model based on the mutual information based error and/or the training task loss. Updating the learnable mask vector can be based on an objective of reducing, e.g., minimizing, the mutual information based error and/or training task loss. Updating the learnable mask vector can include removing non-selected features of the plurality of features.
  • the updated learnable mask vector can have ones indicating selected features and zeros indicating non-selected features, or floating point numbers to indicate the probability of selecting a feature.
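  • The training step can be sketched as below under simplifying assumptions: it reuses the sparsemax(), scale_for_k_nonzeros(), num_features_at_step(), and mi_based_error() sketches above, uses cross-entropy as the task loss, routes gradients to the learnable mask with a straight-through simplification, and updates the mask and model by plain gradient descent; it completes the outline shown after the description of FIG. 4 and is illustrative only.

```python
# Illustrative sketch of the training step of FIG. 5 (not the exact procedure of
# the disclosure); relies on the helper sketches defined earlier in this document.
import torch
import torch.nn.functional as F

def slm_training_step(features, labels, model, mask, step, total_steps,
                      target_k, lr: float = 0.1):
    num_features = mask.numel()
    k = num_features_at_step(step, total_steps, num_features, target_k)

    # Sparse mask with exactly k nonzero entries, via the NumPy sketches above.
    scaled = scale_for_k_nonzeros(mask.detach().numpy(), k)
    sparse_mask = torch.as_tensor(sparsemax(scaled), dtype=mask.dtype)

    # Straight-through simplification (an assumption of this sketch): the forward
    # pass uses the sparse mask while gradients are routed to the learnable mask.
    effective_mask = mask + (sparse_mask - mask).detach()

    logits = model(features * effective_mask)
    probs = torch.softmax(logits, dim=1)
    loss = F.cross_entropy(logits, labels) + mi_based_error(probs, labels)

    grads = torch.autograd.grad(loss, [mask] + list(model.parameters()))
    with torch.no_grad():
        mask -= lr * grads[0]                          # update the learnable mask
        for p, g in zip(model.parameters(), grads[1:]):
            p -= lr * g                                # update model parameters
    return mask
```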
  • SLM can achieve or improve upon other approaches to feature selection while reducing computational complexity and cost.
  • the various datasets include the following domains: Mice, MNIST, Fashion-MNIST, Isolet, Coil-20, Activity, Ames, Fraud.
  • Mice refers to protein expression levels measured in the cortex of normal and trisomic mice that had been exposed to different experimental conditions. Each feature is the expression level of one protein.
  • MNIST and Fashion-MNIST refer to 28-by-28 grayscale images of hand-written digits and clothing items, respectively.
  • the images are converted to tabular data by treating each pixel as a separate feature.
  • Isolet refers to preprocessed speech data of people speaking the names of the letters in the English alphabet with each feature being one of the preprocessed quantities, including spectral coefficients and sonorant features.
  • Coil-20 refers to centered grayscale images of 20 objects taken at pose intervals of 5 degrees amounting to 72 images for each object. During preprocessing, the images were resized to produce 20-by-20 images, with each feature being one of the pixels.
  • Activity refers to sensor data collected from a smartphone mounted on subjects while they performed several activities such as walking upstairs, standing, and laying, with each feature being one of the 561 raw or processed quantities from the sensors on the phone.
  • Ames refers to a housing dataset with the goal of predicting residential housing prices based on features of the home.
  • IEEE-CIS Fraud Detection refers to a dataset with the goal of identifying fraudulent transactions from numerous transaction and identity dependent features. The adversarial nature of the task, with fraudsters adapting themselves and yielding different fraud patterns, causes the data to be highly non-i.i.d., thus making feature selection important given that high capacity models can be prone to overfitting and poor generalization.
  • FIG. 6 depicts a table comparing accuracy for the sparse learnable masks system to other feature selection approaches over various datasets.
  • the table illustrates selecting 50 features across a wide range of high dimensional datasets, most with > 400 features.
  • the table shows that the SLM consistently yields competitive performance, outperforming all other approaches in all cases except on Mice and Ames, for both of which the performance was saturated due to the small numbers of original features, making feature selection less relevant.
  • Most other feature selection approaches are not consistent in their performance while SLM had consistently strong performance. SLM even improved upon a baseline of using all features, which can likely be attributed to superior generalization when the limited model capacity is focused on the most salient features.
  • FIG. 7 depicts a table comparing accuracy for the sparse learnable masks system to other feature selection approaches over various numbers of features to be selected.
  • the table focuses on the Fraud dataset and reports performance at different numbers of selected features.
  • the table illustrates that SLM outperforms the other approaches, and its performance degradation is smaller with fewer features.
  • SLM can also be used for interpretation of global feature importance during inference, yielding the importance ranking of selected features. This can be highly desired in high-stakes applications, such as healthcare or finance, where an importance score can be more useful than simply whether a feature is selected or not.
  • SLM does not need to sample from the joint or marginal distributions, a potentially computationally intensive process, and does not require a contrastive term in the estimation of MI, resulting in less computational cost.
  • SLM accounts for feature inter-dependence by learning inter-dependent probabilities for the selected features, where the inter-dependent probabilities jointly maximize the MI between features and labels.
  • SLM learns feature selection and the task objective in an end-to-end manner, which alleviates the selection of repetitive features that may individually be predictive but are redundant given other selected features.
  • SLM can improve generalization, especially for high capacity models like deep neural networks, as they can easily overfit patterns from spurious features that do not hold across training and test data splits. For instance, the table in FIG. 6 shows SLM improving upon the baseline that uses all features.
  • aspects of this disclosure can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, and/or in computer hardware, such as the structure disclosed herein, their structural equivalents, or combinations thereof.
  • aspects of this disclosure can further be implemented as one or more computer programs, such as one or more modules of computer program instructions encoded on a tangible non-transitory computer storage medium for execution by, or to control the operation of, one or more data processing apparatus.
  • the computer storage medium can be a machine-readable storage device, a machine- readable storage substrate, a random or serial access memory device, or combinations thereof.
  • the computer program instructions can be encoded on an artificially generated propagated signal, such as a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.
  • the term “configured” is used herein in connection with systems and computer program components.
  • a system of one or more computers to be configured to perform particular operations or actions means that the system has installed thereon software, firmware, hardware, or a combination thereof that cause the system to perform the operations or actions.
  • the one or more programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by one or more data processing apparatus, cause the apparatus to perform the operations or actions.
  • the term “data processing apparatus” or “data processing system” refers to data processing hardware and encompasses various apparatus, devices, and machines for processing data, including programmable processors, computers, or combinations thereof.
  • the data processing apparatus can include special purpose logic circuitry, such as a field programmable gate array (FPGA) or an application specific integrated circuit (ASIC).
  • the data processing apparatus can include code that creates an execution environment for computer programs, such as code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or combinations thereof.
  • the term “computer program” refers to a program, software, a software application, an app, a module, a software module, a script, or code.
  • the computer program can be written in any form of programming language, including compiled, interpreted, declarative, or procedural languages, or combinations thereof.
  • the computer program can be deployed in any form, including as a standalone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
  • the computer program can correspond to a file in a file system and can be stored in a portion of a file that holds other programs or data, such as one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, such as files that store one or more modules, sub programs, or portions of code.
  • the computer program can be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.
  • the term "database" refers to any collection of data.
  • the data can be unstructured or structured in any manner.
  • the data can be stored on one or more storage devices in one or more locations.
  • an index database can include multiple collections of data, each of which may be organized and accessed differently.
  • the term "engine" refers to a software-based system, subsystem, or process that is programmed to perform one or more specific functions.
  • the engine can be implemented as one or more software modules or components or can be installed on one or more computers in one or more locations.
  • a particular engine can have one or more computers dedicated thereto, or multiple engines can be installed and running on the same computer or computers.
  • the processes and logic flows described herein can be performed by one or more computers executing one or more computer programs to perform functions by operating on input data and generating output data.
  • the processes and logic flows can also be performed by special purpose logic circuitry, or by a combination of special purpose logic circuitry and one or more computers.
  • a computer or special purpose logic circuitry executing the one or more computer programs can include a central processing unit, including general or special purpose microprocessors, for performing or executing instructions and one or more memory devices for storing the instructions and data.
  • the central processing unit can receive instructions and data from the one or more memory devices, such as read only memory, random access memory, or combinations thereof, and can perform or execute the instructions.
  • the computer or special purpose logic circuitry can also include, or be operatively coupled to, one or more storage devices for storing data, such as magnetic, magneto optical, or optical disks, for receiving data from or transferring data to them.
  • the computer or special purpose logic circuitry can be embedded in another device, such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS), or a portable storage device, e.g., a universal serial bus (USB) flash drive, as examples.
  • Computer readable media suitable for storing the one or more computer programs can include any form of volatile or non-volatile memory, media, or memory devices. Examples include semiconductor memory devices, e.g., EPROM, EEPROM, or flash memory devices, magnetic disks, e.g., internal hard disks or removable disks, magneto optical disks, CD-ROM disks, DVD-ROM disks, or combinations thereof.
  • aspects of the disclosure can be implemented in a computing system that includes a back end component, e.g., as a data server, a middleware component, e.g., an application server, or a front end component, e.g., a client computer having a graphical user interface, a web browser, or an app, or any combination thereof.
  • the components of the system can be interconnected by any form or medium of digital data communication, such as a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.
  • the computing system can include clients and servers. A client and server can be remote from each other and interact through a communication network.
  • a server can transmit data, e.g., an HTML page, to a client device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the client device.
  • data generated at the client device e.g., a result of the user interaction, can be received at the server from the client device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Complex Calculations (AREA)

Abstract

Aspects of the disclosure relate to a canonical approach for feature selection referred to as sparse learnable masks (SLM). SLM integrates learnable sparse masks into end-to-end training. For the fundamental non-differentiability challenge of selecting a desired number of features, SLM includes dual mechanisms for automatic mask scaling, achieving a desired feature sparsity and gradually tempering this sparsity for effective learning. SLM further employs an objective that increases the mutual information (MI) between selected features and labels in an efficient and scalable manner. Empirically, SLM can achieve or improve upon state-of-the-art results on several benchmark datasets, often by a significant margin, while reducing computational complexity and cost.
PCT/US2023/033924 2022-09-28 2023-09-28 Scalable feature selection via sparse learnable masks WO2024072924A2 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202263410883P 2022-09-28 2022-09-28
US63/410,883 2022-09-28
US18/372,900 2023-09-26
US18/372,900 US20240112084A1 (en) 2022-09-28 2023-09-26 Scalable Feature Selection Via Sparse Learnable Masks

Publications (2)

Publication Number Publication Date
WO2024072924A2 true WO2024072924A2 (fr) 2024-04-04
WO2024072924A3 WO2024072924A3 (fr) 2024-05-23

Family

ID=88695632

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/033924 WO2024072924A2 (fr) 2022-09-28 2023-09-28 Sélection de caractéristiques évolutives par l'intermédiaire de masques à apprentissage épars

Country Status (1)

Country Link
WO (1) WO2024072924A2 (fr)

Also Published As

Publication number Publication date
WO2024072924A3 (fr) 2024-05-23

Similar Documents

Publication Publication Date Title
US20200265301A1 (en) Incremental training of machine learning tools
US20210287048A1 (en) System and method for efficient generation of machine-learning models
US11803744B2 (en) Neural network learning apparatus for deep learning and method thereof
CN111279362B (zh) Capsule neural network
Qi et al. Feature selection and multiple kernel boosting framework based on PSO with mutation mechanism for hyperspectral classification
US20230102337A1 (en) Method and apparatus for training recommendation model, computer device, and storage medium
US11521372B2 (en) Utilizing machine learning models, position based extraction, and automated data labeling to process image-based documents
WO2018212710A1 (fr) Methods and systems for predictive analysis
US11087215B1 (en) Machine learning classification system
US20200167690A1 (en) Multi-task Equidistant Embedding
US20220230048A1 (en) Neural Architecture Scaling For Hardware Accelerators
US20220215298A1 (en) Method for training sequence mining model, method for processing sequence data, and device
US20190311258A1 (en) Data dependent model initialization
US20220237890A1 (en) Method and apparatus with neural network training
US11100428B2 (en) Distributable event prediction and machine learning recognition system
Zhang Deep generative model for multi-class imbalanced learning
Papakyriakou et al. Data mining methods: a review
WO2022154829A1 (fr) Neural architecture scaling for hardware accelerators
Yuan et al. Deep learning from a statistical perspective
US20240112084A1 (en) Scalable Feature Selection Via Sparse Learnable Masks
US11921821B2 (en) System and method for labelling data for trigger identification
WO2024072924A2 (fr) Scalable feature selection via sparse learnable masks
US20230206134A1 (en) Rank Distillation for Training Supervised Machine Learning Models
US20220084306A1 (en) Method and system of guiding a user on a graphical interface with computer vision
De Bortoli et al. A fast face recognition CNN obtained by distillation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23801050

Country of ref document: EP

Kind code of ref document: A2