WO2022251602A9 - Systems and methods for machine-learned models having convolution and attention - Google Patents

Systems and methods for machine-learned models having convolution and attention

Info

Publication number
WO2022251602A9
Authority
WO
WIPO (PCT)
Prior art keywords
stage
machine
attention
computer
network
Prior art date
Application number
PCT/US2022/031304
Other languages
English (en)
Other versions
WO2022251602A1 (fr)
Inventor
Zihang Dai
Hanxiao LIU
Mingxing TAN
Quoc V. LE
Original Assignee
Google Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google Llc
Priority to CN202280026409.3A (CN117377983A, zh)
Priority to EP22731945.6A (EP4288939A1, fr)
Priority to JP2023557195A (JP2024517056A, ja)
Publication of WO2022251602A1 (fr)
Publication of WO2022251602A9 (fr)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • the present disclosure relates generally to machine learning. More particularly, the present disclosure relates to systems and methods for machine-learned models having convolution and attention.
  • Machine-learning refers to a class of learned algorithms that provide predictions over input data.
  • Convolutional Neural Networks, or CNNs, are a class of machine-learned model that employs convolutional frames in a neural network.
  • Transformers are a class of machine-learned model that employ an attention mechanism to weight distinct portions of input data.
  • Existing approaches to combine convolution and attention face drawbacks such as increased computational cost.
  • the computer-implemented method includes obtaining, by a computing system including one or more computing devices, input data including an input tensor having one or more dimensions.
  • the computer-implemented method includes providing, by the computing system, the input data to a machine-learned convolutional attention network, the machine-learned convolutional attention network including two or more network stages, each of the two or more network stages including one of an attention stage or a convolutional stage.
  • the computer-implemented method includes, in response to providing the input data to the machine-learned convolutional attention network, receiving, by the computing system, a machine-learning prediction from the machine-learned convolutional attention network.
  • the attention stage includes a relative attention mechanism, the relative attention mechanism including the sum of a static convolution kernel with an adaptive attention matrix.
  • the computer-implemented method includes obtaining, by a computing system including one or more computing devices, input data including an input tensor having one or more dimensions.
  • the computer-implemented method includes providing, by the computing system, the input data to a machine-learned convolutional attention network.
  • the machine-learned convolutional attention network includes a downsampling stage configured to reduce a spatial resolution relative to the input tensor and one or more attention blocks including a relative attention mechanism, the relative attention mechanism including the sum of a static convolution kernel with an adaptive attention matrix.
  • the computer-implemented method includes, in response to providing the input data to the machine-learned convolutional attention network, receiving, by the computing system, a machine-learning prediction from the machine-learned convolutional attention network.
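By way of non-limiting illustration, a downsampling stage that reduces spatial resolution relative to the input tensor might be sketched as 2x2 average pooling with stride 2 over a single-channel 2D grid. The pooling operation, the function name `downsample_2x`, and the toy input are assumptions made for this sketch; the disclosure does not restrict the downsampling stage to this operation.

```python
def downsample_2x(grid):
    """Halve the height and width of a 2D list by 2x2 average pooling (stride 2).

    Assumes even height and width; this is an illustrative choice only.
    """
    height, width = len(grid), len(grid[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    pooled = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            window = (grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
            row.append(sum(window) / 4.0)
        pooled.append(row)
    return pooled


image = [
    [1.0, 3.0, 5.0, 7.0],
    [1.0, 3.0, 5.0, 7.0],
    [2.0, 2.0, 4.0, 4.0],
    [2.0, 2.0, 4.0, 4.0],
]
pooled = downsample_2x(image)  # a 4x4 grid becomes a 2x2 grid
```

Reducing resolution before the attention blocks keeps the number of positions in the global receptive field small, which is one way the quadratic cost of attention can be contained.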
  • the computer-implemented method includes obtaining, by a computing system including one or more computing devices, input data including an input tensor having one or more dimensions.
  • the computer-implemented method includes providing, by the computing system, the input data to a machine-learned convolutional attention network, the machine-learned convolutional attention network including a plurality of network stages.
  • the plurality of network stages includes an S0 stage including a two-layer convolutional stem network, an S1 stage including a convolutional block with squeeze-excitation, an S2 stage including a convolutional block, an S3 stage including a convolutional block, an S4 stage including an attention block, and an S5 stage including an attention block.
  • Each of the S4 stage and the S5 stage includes a relative attention mechanism including the sum of a static convolution kernel with an adaptive attention matrix. Spatial resolution is decreased at each of the plurality of network stages.
  • a number of channels is increased at each of the plurality of network stages.
  • the computer-implemented method includes, in response to providing the input data to the machine-learned convolutional attention network, receiving, by the computing system, a machine-learning prediction from the machine-learned convolutional attention network.
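As a rough sketch of the multi-stage layout described in this aspect (a stem, convolutional stages, then attention stages, with spatial resolution decreasing and channel count increasing at every stage), the configuration below traces tensor shapes through the stages. The stage kinds follow the description; the input size, the factor-of-two reduction per stage, the channel widths, and the names `Stage` and `trace_shapes` are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    kind: str       # "stem", "conv", "conv+se", or "attention"
    channels: int   # hypothetical width, chosen only for illustration


# Stage kinds follow the description; channel widths are assumptions.
STAGES = [
    Stage("S0", "stem", 64),
    Stage("S1", "conv+se", 96),
    Stage("S2", "conv", 192),
    Stage("S3", "conv", 384),
    Stage("S4", "attention", 768),
    Stage("S5", "attention", 1536),
]


def trace_shapes(input_size, stages, reduction=2):
    """Return (name, spatial_size, channels) for each stage.

    Assumes every stage divides the spatial size by `reduction`.
    """
    shapes = []
    size = input_size
    for stage in stages:
        size //= reduction
        shapes.append((stage.name, size, stage.channels))
    return shapes


shapes = trace_shapes(256, STAGES)
for name, size, channels in shapes:
    print(f"{name}: {size}x{size} spatial, {channels} channels")
```

Placing the attention blocks in the last stages means they operate on the smallest feature maps, where a global receptive field is cheapest to compute.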
  • the attached Appendices describe example implementations of the proposed techniques in greater detail.
  • the attached Appendices are incorporated into and form a part of this disclosure.
  • the present disclosure is not limited to the example implementations provided in the attached Appendices.
  • Figure 1A depicts a block diagram of an example computing system that performs computer vision with reduced computational cost and improved accuracy according to example embodiments of the present disclosure.
  • Figure 1B depicts a block diagram of an example computing device that performs computer vision with reduced computational cost and improved accuracy according to example embodiments of the present disclosure.
  • Figure 1C depicts a block diagram of an example computing device that performs computer vision with reduced computational cost and improved accuracy according to example embodiments of the present disclosure.
  • FIG. 2 depicts a block diagram of an example convolution attention network (CoAtNet) model according to example embodiments of the present disclosure.
  • Figure 3 depicts a block diagram of an example convolution attention network model according to example embodiments of the present disclosure.
  • Figure 4 depicts a block diagram of an example convolution attention network model according to example embodiments of the present disclosure.
  • Figure 5 depicts a flow chart diagram of an example method to perform computer vision with reduced computational cost and improved accuracy according to example embodiments of the present disclosure.
  • the present disclosure is directed to systems and methods for machine-learned models having convolution and attention.
  • systems and methods according to example aspects of the present disclosure can include convolutional blocks and/or attention blocks.
  • the attention block(s) can include a relative attention mechanism.
  • example aspects of the present disclosure recognize that the above-described relative attention can be considered a natural mixture of depthwise convolution and content-based attention.
  • example aspects of the present disclosure recognize that both depthwise convolution and self-attention can be expressed as a weighted sum of values in a receptive field.
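To make the "weighted sum of values in a receptive field" view concrete, the sketch below computes both operations on a 1D sequence of scalar features. The convolution uses static weights indexed by relative shift over a local window; the self-attention uses input-dependent weights over all positions. The function names, the scalar features, and the toy kernel are assumptions for illustration only.

```python
import math


def depthwise_conv_1d(x, kernel):
    """y_i = sum over a local window of w_{i-j} * x_j.

    `kernel` maps a relative shift (i - j) to a static weight.
    Positions that fall outside the sequence are skipped.
    """
    n = len(x)
    out = []
    for i in range(n):
        total = 0.0
        for shift, weight in kernel.items():
            j = i - shift
            if 0 <= j < n:
                total += weight * x[j]
        out.append(total)
    return out


def self_attention_1d(x):
    """y_i = sum over all positions j of softmax_j(x_i * x_j) * x_j."""
    out = []
    for xi in x:
        logits = [xi * xj for xj in x]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        norm = sum(exps)
        out.append(sum(e / norm * xj for e, xj in zip(exps, x)))
    return out


x = [1.0, 2.0, 3.0]
conv_out = depthwise_conv_1d(x, {-1: 0.25, 0: 0.5, 1: 0.25})
attn_out = self_attention_1d(x)
```

In both functions the output at position i is a weighted sum of inputs; only the source of the weights (static kernel versus input-dependent softmax) and the extent of the receptive field (local versus global) differ.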
  • the relative attention mechanism can include the sum of a static convolution kernel with an adaptive attention matrix. This sum can be applied prior to and/or subsequent to SoftMax normalization by the relative attention mechanism.
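  • the relative attention mechanism (e.g., applied prior to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^\top x_j + w_{i-j}\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k + w_{i-k}\right)} \, x_j$$
  • the relative attention mechanism (e.g., applied subsequent to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{post}} = \sum_{j \in \mathcal{G}} \left( \frac{\exp\left(x_i^\top x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k\right)} + w_{i-j} \right) x_j$$
  • the depthwise convolution kernel $w_{i-j}$ is an input-independent parameter of static value for a given pair of indices $(i, j)$ in the input tensor (e.g., it depends only on the relative shift between the indices, $i - j$, where the dependence on the relative shift rather than the specific values is termed translation equivalence, which can improve generalization under datasets of limited size), $x_i$ and $y_i$ are the input and output at position $i$, respectively, and $\mathcal{G}$ is a global receptive field (e.g., the entire set of positions).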
  • the use of the global receptive field can provide improved capability of capturing complicated relational interactions between different spatial positions, which can be desirable when processing higher-level concepts.
  • the denominator term can also be referred to as the attention weight, $A_{i,j}$.
  • the attention weight can be decided jointly by the translation equivalence of the depthwise convolution kernel and the input-adaptive input-output pairs, which can provide for both properties to varying degrees, improving generalization, capacity, and/or accuracy of the model.
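A minimal sketch of the two variants described above, in which a static kernel indexed by relative shift is summed with the adaptive attention term either before or after SoftMax normalization, is given below for a 1D sequence of feature vectors. The function name `relative_attention`, the dictionary representation of the kernel, the zero default for missing shifts, and the toy inputs are assumptions of this sketch rather than details taken from the disclosure.

```python
import math


def _softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def relative_attention(x, w, pre_softmax=True):
    """Relative attention over a 1D sequence of feature vectors.

    x: list of n feature vectors (lists of floats).
    w: dict mapping a relative shift (i - j) to a static scalar weight;
       missing shifts default to 0.0 (an assumption of this sketch).
    pre_softmax=True adds w_{i-j} to the logits before normalization;
    pre_softmax=False adds it to the attention weights afterwards.
    """
    n = len(x)
    dim = len(x[0])
    outputs = []
    for i in range(n):
        logits = [_dot(x[i], x[j]) for j in range(n)]
        bias = [w.get(i - j, 0.0) for j in range(n)]
        if pre_softmax:
            weights = _softmax([l + b for l, b in zip(logits, bias)])
        else:
            weights = [a + b for a, b in zip(_softmax(logits), bias)]
        outputs.append(
            [sum(weights[j] * x[j][d] for j in range(n)) for d in range(dim)]
        )
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = {-1: 0.1, 0: 0.5, 1: 0.1}
pre_out = relative_attention(x, w, pre_softmax=True)
post_out = relative_attention(x, w, pre_softmax=False)
```

In the pre-normalization variant the combined weights still sum to one at each position, whereas in the post-normalization variant the static kernel shifts the weights after normalization, so they need not sum to one.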
  • These attention blocks having relative self-attention can be employed in networks having convolution and attention (termed herein “CoAtNet” models), providing for improved fusion of benefits from convolution and attention.
  • the models can have robustness to overfitting, lower computational cost, reduced memory usage, and/or smaller parameter size, in addition to high accuracy and efficiency, associated with convolutional networks, while additionally providing the ability to learn complicated relational interactions between spatial positions in input data, associated with transformers.
  • Systems and methods according to example aspects of the present disclosure can provide for a number of technical effects and benefits, including improvements to computer technology.
  • systems and methods according to example aspects of the present disclosure can unify convolution and attention to provide improved generalization, model capacity, and/or efficiency.
  • systems and methods according to example aspects of the present disclosure can more effectively manage a trade-off between improved generalization (e.g., similar to convolutional networks) and improved model capacity (e.g., similar to Transformers).
  • some example implementations of the present disclosure can achieve state-of-the-art performances under different data sizes and computation budgets.
  • the improvements provided by the proposed model architecture can in turn provide for improved accuracy of the model, especially on unseen input data, improved scope of input datatype and/or dimension, reduced consumption of computational resources (e.g., faster computation speed, fewer computational cycles, reduced processor or memory usage, etc.), and/or other improvements over existing models.
  • a model as proposed herein can achieve comparable performance to a state-of-the-art convolutional neural network while having fewer parameters.
  • an example implementation of a CoAtNet model can achieve comparable top-1 accuracy on the ImageNet dataset with only 40% of the number of parameters and 70% of the FLOPs.
  • the hybrid convolution and attention architectures described herein can enable more efficient usage of specialized hardware such as processors (e.g., graphics processing units) which have been specialized for performing convolution mechanisms and attention mechanisms.
  • convolutional stages of the proposed hybrid model can be performed by hardware specialized for convolutional operations while attention stages of the proposed hybrid model can be performed by hardware specialized for attention operations.
  • convolutional operations of the convolutional stages of the proposed hybrid model can be performed in parallel by multiple processors.
  • the machine-learning task can be computer vision tasks such as object detection, object recognition, image classification, semantic segmentation, video recognition, video classification, video segmentation, etc.
  • the machine-learning task can be multi-modality applications such as those involving additional signals (e.g., visual signals) such as, for example, image captioning, video captioning, etc.
  • Figure 1A depicts a block diagram of an example computing system 100 that performs computer vision with reduced computational cost and improved accuracy according to example embodiments of the present disclosure.
  • the system 100 includes a user computing device 102, a server computing system 130, and a training computing system 150 that are communicatively coupled over a network 180.
  • the user computing device 102 can be any type of computing device, such as, for example, a personal computing device (e.g., laptop or desktop), a mobile computing device (e.g., smartphone or tablet), a gaming console or controller, a wearable computing device, an embedded computing device, or any other type of computing device.
  • the user computing device 102 includes one or more processors 112 and a memory 114.
  • the one or more processors 112 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 114 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 114 can store data 116 and instructions 118 which are executed by the processor 112 to cause the user computing device 102 to perform operations.
  • the user computing device 102 can store or include one or more machine-learned models 120.
  • the machine-learned models 120 can be or can otherwise include various machine-learned models such as neural networks (e.g., deep neural networks) or other types of machine-learned models, including non-linear models and/or linear models.
  • Neural networks can include feed-forward neural networks, recurrent neural networks (e.g., long short-term memory recurrent neural networks), convolutional neural networks or other forms of neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example machine-learned models 120 (e.g., CoAtNet models) are discussed with reference to Figures 2-3.
  • the one or more machine-learned models 120 can be received from the server computing system 130 over network 180, stored in the user computing device memory 114, and then used or otherwise implemented by the one or more processors 112.
  • the user computing device 102 can implement multiple parallel instances of a single machine-learned model 120 (e.g., to perform parallel computer vision across multiple instances of CoAtNet models).
  • one or more machine-learned models 140 can be included in or otherwise stored and implemented by the server computing system 130 that communicates with the user computing device 102 according to a client-server relationship.
  • the machine-learned models 140 can be implemented by the server computing system 130 as a portion of a web service (e.g., a computer vision service, such as an image classification service).
  • the user computing device 102 can also include one or more user input components 122 that receive user input.
  • the user input component 122 can be a touch-sensitive component (e.g., a touch-sensitive display screen or a touch pad) that is sensitive to the touch of a user input object (e.g., a finger or a stylus).
  • the touch-sensitive component can serve to implement a virtual keyboard.
  • Other example user input components include a microphone, a traditional keyboard, or other means by which a user can provide user input.
  • the server computing system 130 includes one or more processors 132 and a memory 134.
  • the one or more processors 132 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 134 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 134 can store data 136 and instructions 138 which are executed by the processor 132 to cause the server computing system 130 to perform operations.
  • the server computing system 130 includes or is otherwise implemented by one or more server computing devices. In instances in which the server computing system 130 includes plural server computing devices, such server computing devices can operate according to sequential computing architectures, parallel computing architectures, or some combination thereof.
  • the server computing system 130 can store or otherwise include one or more machine-learned models 140.
  • the models 140 can be or can otherwise include various machine-learned models.
  • Example machine-learned models include neural networks or other multi-layer non-linear models.
  • Example neural networks include feed forward neural networks, deep neural networks, recurrent neural networks, and convolutional neural networks.
  • Some example machine-learned models can leverage an attention mechanism such as self-attention.
  • some example machine-learned models can include multi-headed self-attention models (e.g., transformer models).
  • Example models 140 are discussed with reference to Figures 2-3.
  • the user computing device 102 and/or the server computing system 130 can train the models 120 and/or 140 via interaction with the training computing system 150 that is communicatively coupled over the network 180.
  • the training computing system 150 can be separate from the server computing system 130 or can be a portion of the server computing system 130.
  • the training computing system 150 includes one or more processors 152 and a memory 154.
  • the one or more processors 152 can be any suitable processing device (e.g., a processor core, a microprocessor, an ASIC, an FPGA, a controller, a microcontroller, etc.) and can be one processor or a plurality of processors that are operatively connected.
  • the memory 154 can include one or more non-transitory computer-readable storage media, such as RAM, ROM, EEPROM, EPROM, flash memory devices, magnetic disks, etc., and combinations thereof.
  • the memory 154 can store data 156 and instructions 158 which are executed by the processor 152 to cause the training computing system 150 to perform operations.
  • the training computing system 150 includes or is otherwise implemented by one or more server computing devices.
  • the training computing system 150 can include a model trainer 160 that trains the machine-learned models 120 and/or 140 stored at the user computing device 102 and/or the server computing system 130 using various training or learning techniques, such as, for example, backwards propagation of errors.
  • a loss function can be backpropagated through the model(s) to update one or more parameters of the model(s) (e.g., based on a gradient of the loss function).
  • Various loss functions can be used such as mean squared error, likelihood loss, cross entropy loss, hinge loss, and/or various other loss functions.
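For illustration only, three of the loss functions named above (mean squared error, cross entropy, and hinge loss) can be written for simple list inputs as follows. The function names, argument conventions, and the `eps` clamp are assumptions of this sketch.

```python
import math


def mean_squared_error(targets, predictions):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def cross_entropy(true_class, probabilities, eps=1e-12):
    """Negative log-probability assigned to the true class index.

    `eps` clamps the probability away from zero to avoid log(0).
    """
    return -math.log(max(probabilities[true_class], eps))


def hinge_loss(label, score):
    """Binary hinge loss; `label` is +1 or -1 and `score` is a raw margin."""
    return max(0.0, 1.0 - label * score)


mse = mean_squared_error([1.0, 2.0], [1.0, 4.0])
ce = cross_entropy(1, [0.25, 0.5, 0.25])
hinge = hinge_loss(-1, 0.5)
```

Each function returns a scalar that is smaller when predictions agree with the targets, which is the quantity whose gradient would be backpropagated through the model.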
  • Gradient descent techniques can be used to iteratively update the parameters over a number of training iterations.
  • performing backwards propagation of errors can include performing truncated backpropagation through time.
  • the model trainer 160 can perform a number of generalization techniques (e.g., weight decays, dropouts, etc.) to improve the generalization capability of the models being trained.
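The iterative parameter updates and the weight-decay generalization technique described above can be sketched on a deliberately tiny model. The example below fits a one-variable linear model by gradient descent on mean squared error, with an optional L2 weight-decay term; the model, the learning rate, the step count, the data, and the name `train_linear` are all assumptions for illustration and are unrelated to how the model trainer 160 is actually implemented.

```python
def train_linear(xs, ys, lr=0.05, weight_decay=0.0, steps=500):
    """Fit y = w * x + b by gradient descent on mean squared error.

    weight_decay adds an L2 penalty (weight_decay / 2) * w**2 to the loss.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        grad_w += weight_decay * w  # gradient of the L2 penalty
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b = train_linear(xs, ys)
w_decayed, _ = train_linear(xs, ys, weight_decay=1.0)
```

Without weight decay the fit approaches the generating slope and intercept; with weight decay the learned slope is pulled toward zero, which is the regularizing effect the technique is meant to provide.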
  • the model trainer 160 can train the machine-learned models 120 and/or 140 based on a set of training data 162.
  • the training data 162 can include, for example, corpuses or other datasets of task-specific training data, such as an image classification database (e.g., ImageNet, JFT-300M, etc.).
  • the training examples can be provided by the user computing device 102.
  • the model 120 provided to the user computing device 102 can be trained by the training computing system 150 on user-specific data received from the user computing device 102. In some instances, this process can be referred to as personalizing the model.
  • the model trainer 160 includes computer logic utilized to provide desired functionality.
  • the model trainer 160 can be implemented in hardware, firmware, and/or software controlling a general purpose processor.
  • the model trainer 160 includes program files stored on a storage device, loaded into a memory and executed by one or more processors.
  • the model trainer 160 includes one or more sets of computer-executable instructions that are stored in a tangible computer-readable storage medium such as RAM, hard disk, or optical or magnetic media.
  • the network 180 can be any type of communications network, such as a local area network (e.g., intranet), wide area network (e.g., Internet), or some combination thereof and can include any number of wired or wireless links.
  • communication over the network 180 can be carried via any type of wired and/or wireless connection, using a wide variety of communication protocols (e.g., TCP/IP, HTTP, SMTP, FTP), encodings or formats (e.g., HTML, XML), and/or protection schemes (e.g., VPN, secure HTTP, SSL).
  • the input to the machine-learned model(s) of the present disclosure can be image data.
  • the machine-learned model(s) can process the image data to generate an output.
  • the machine-learned model(s) can process the image data to generate an image recognition output (e.g., a recognition of the image data, a latent embedding of the image data, an encoded representation of the image data, a hash of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an image segmentation output.
  • the machine-learned model(s) can process the image data to generate an image classification output.
  • the machine-learned model(s) can process the image data to generate an image data modification output (e.g., an alteration of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an encoded image data output (e.g., an encoded and/or compressed representation of the image data, etc.).
  • the machine-learned model(s) can process the image data to generate an upscaled image data output.
  • the machine-learned model(s) can process the image data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be text or natural language data.
  • the machine-learned model(s) can process the text or natural language data to generate an output.
  • the machine-learned model(s) can process the natural language data to generate a language encoding output.
  • the machine-learned model(s) can process the text or natural language data to generate a latent text embedding output.
  • the machine-learned model(s) can process the text or natural language data to generate a translation output.
  • the machine-learned model(s) can process the text or natural language data to generate a classification output.
  • the machine-learned model(s) can process the text or natural language data to generate a textual segmentation output.
  • the machine-learned model(s) can process the text or natural language data to generate a semantic intent output.
  • the machine-learned model(s) can process the text or natural language data to generate an upscaled text or natural language output (e.g., text or natural language data that is higher quality than the input text or natural language, etc.).
  • the machine-learned model(s) can process the text or natural language data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be speech data.
  • the machine-learned model(s) can process the speech data to generate an output.
  • the machine-learned model(s) can process the speech data to generate a speech recognition output.
  • the machine-learned model(s) can process the speech data to generate a speech translation output.
  • the machine-learned model(s) can process the speech data to generate a latent embedding output.
  • the machine-learned model(s) can process the speech data to generate an encoded speech output (e.g., an encoded and/or compressed representation of the speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate an upscaled speech output (e.g., speech data that is higher quality than the input speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate a textual representation output (e.g., a textual representation of the input speech data, etc.).
  • the machine-learned model(s) can process the speech data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be latent encoding data (e.g., a latent space representation of an input, etc.).
  • the machine-learned model(s) can process the latent encoding data to generate an output.
  • the machine-learned model(s) can process the latent encoding data to generate a recognition output.
  • the machine-learned model(s) can process the latent encoding data to generate a reconstruction output.
  • the machine-learned model(s) can process the latent encoding data to generate a search output.
  • the machine-learned model(s) can process the latent encoding data to generate a reclustering output.
  • the machine-learned model(s) can process the latent encoding data to generate a prediction output.
  • the input to the machine-learned model(s) of the present disclosure can be statistical data.
  • Statistical data can be, represent, or otherwise include data computed and/or calculated from some other data source.
  • the machine-learned model(s) can process the statistical data to generate an output.
  • the machine-learned model(s) can process the statistical data to generate a recognition output.
  • the machine-learned model(s) can process the statistical data to generate a prediction output.
  • the machine-learned model(s) can process the statistical data to generate a classification output.
  • the machine-learned model(s) can process the statistical data to generate a segmentation output.
  • the machine-learned model(s) can process the statistical data to generate a visualization output.
  • the machine-learned model(s) can process the statistical data to generate a diagnostic output.
  • the input to the machine-learned model(s) of the present disclosure can be sensor data.
  • the machine-learned model(s) can process the sensor data to generate an output.
  • the machine-learned model(s) can process the sensor data to generate a recognition output.
  • the machine-learned model(s) can process the sensor data to generate a prediction output.
  • the machine-learned model(s) can process the sensor data to generate a classification output.
  • the machine-learned model(s) can process the sensor data to generate a segmentation output.
  • the machine-learned model(s) can process the sensor data to generate a visualization output.
  • the machine-learned model(s) can process the sensor data to generate a diagnostic output.
  • the machine-learned model(s) can process the sensor data to generate a detection output.
  • the machine-learned model(s) can be configured to perform a task that includes encoding input data for reliable and/or efficient transmission or storage (and/or corresponding decoding).
  • the task may be an audio compression task.
  • the input may include audio data and the output may comprise compressed audio data.
  • the input includes visual data (e.g., one or more images or videos), the output comprises compressed visual data, and the task is a visual data compression task.
  • the task may comprise generating an embedding for input data (e.g. input audio or visual data).
  • the input includes visual data and the task is a computer vision task.
  • the input includes pixel data for one or more images and the task is an image processing task.
  • the image processing task can be image classification, where the output is a set of scores, each score corresponding to a different object class and representing the likelihood that the one or more images depict an object belonging to the object class.
  • the image processing task may be object detection, where the image processing output identifies one or more regions in the one or more images and, for each region, a likelihood that region depicts an object of interest.
  • the image processing task can be image segmentation, where the image processing output defines, for each pixel in the one or more images, a respective likelihood for each category in a predetermined set of categories.
  • the set of categories can be foreground and background.
  • the set of categories can be object classes.
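  • As a minimal sketch of the segmentation output described above, the snippet below turns per-pixel scores into a per-pixel likelihood for each category. The function name `segmentation_likelihoods`, the 4x4 image size, the two categories, and the use of a softmax are illustrative assumptions, not details taken from the disclosure.

```python
import numpy as np

def segmentation_likelihoods(logits):
    """Convert (H, W, K) per-pixel scores into per-pixel likelihoods over K categories."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical scores for a 4x4 image and two categories (e.g., foreground, background).
logits = np.random.default_rng(0).normal(size=(4, 4, 2))
probs = segmentation_likelihoods(logits)
mask = probs.argmax(axis=-1)  # most likely category at each pixel
```

  • Each pixel's likelihoods sum to one, and `mask` holds one category index per pixel.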
  • the image processing task can be depth estimation, where the image processing output defines, for each pixel in the one or more images, a respective depth value.
  • the image processing task can be motion estimation, where the network input includes multiple images, and the image processing output defines, for each pixel of one of the input images, a motion of the scene depicted at the pixel between the images in the network input.
  • the input includes audio data representing a spoken utterance and the task is a speech recognition task.
  • the output may comprise a text output which is mapped to the spoken utterance.
  • the task comprises encrypting or decrypting input data.
  • the task comprises a microprocessor performance task, such as branch prediction or memory address translation.
  • Figure 1A illustrates one example computing system that can be used to implement the present disclosure.
  • the user computing device 102 can include the model trainer 160 and the training dataset 162.
  • the models 120 can be both trained and used locally at the user computing device 102.
  • the user computing device 102 can implement the model trainer 160 to personalize the models 120 based on user-specific data.
  • Figure 1B depicts a block diagram of an example computing device 10 that performs according to example embodiments of the present disclosure.
  • the computing device 10 can be a user computing device or a server computing device.
  • the computing device 10 includes a number of applications (e.g., applications 1 through N). Each application contains its own machine learning library and machine-learned model(s). For example, each application can include a machine-learned model.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components.
  • each application can communicate with each device component using an API (e.g., a public API).
  • the API used by each application is specific to that application.
  • Figure 1C depicts a block diagram of an example computing device 50 that performs according to example embodiments of the present disclosure.
  • the computing device 50 can be a user computing device or a server computing device.
  • the computing device 50 includes a number of applications (e.g., applications 1 through N). Each application is in communication with a central intelligence layer.
  • Example applications include a text messaging application, an email application, a dictation application, a virtual keyboard application, a browser application, etc.
  • each application can communicate with the central intelligence layer (and model(s) stored therein) using an API (e.g., a common API across all applications).
  • the central intelligence layer includes a number of machine-learned models. For example, as illustrated in Figure 1C, a respective machine-learned model can be provided for each application and managed by the central intelligence layer. In other implementations, two or more applications can share a single machine-learned model. For example, in some implementations, the central intelligence layer can provide a single model for all of the applications. In some implementations, the central intelligence layer is included within or otherwise implemented by an operating system of the computing device 50.
  • the central intelligence layer can communicate with a central device data layer.
  • the central device data layer can be a centralized repository of data for the computing device 50. As illustrated in Figure 1C, the central device data layer can communicate with a number of other components of the computing device, such as, for example, one or more sensors, a context manager, a device state component, and/or additional components. In some implementations, the central device data layer can communicate with each device component using an API (e.g., a private API).
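  • The arrangement of Figure 1C, in which applications reach models through a common API, might be sketched as follows. The class name `CentralIntelligenceLayer`, its methods, and the stand-in models are hypothetical and chosen only for illustration; the disclosure does not specify an interface.

```python
class CentralIntelligenceLayer:
    """Hypothetical registry that serves models to applications through one API."""

    def __init__(self, shared_model=None):
        self._models = {}                  # application name -> dedicated model
        self._shared_model = shared_model  # optional single model shared by all applications

    def register(self, app_name, model):
        """Provide a respective model for one application."""
        self._models[app_name] = model

    def predict(self, app_name, inputs):
        """Common entry point: use the application's model, else the shared one."""
        model = self._models.get(app_name, self._shared_model)
        if model is None:
            raise KeyError(f"no model available for application {app_name!r}")
        return model(inputs)

# Stand-in "models" are plain callables here.
layer = CentralIntelligenceLayer(shared_model=len)
layer.register("virtual_keyboard", str.upper)
```

  • An application with a registered model receives that model's output, while any other application falls back to the shared model.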
  • FIG. 2 depicts a block diagram of an example convolution attention network (CoAtNet) model 200 according to example embodiments of the present disclosure.
  • the model 200 is trained to receive a set of input data 202 descriptive of, for example, image data, or other task-specific input data and, as a result of receipt of the input data 202, provide output data 204 that is responsive to a particular machine-learning task, such as a computer vision task (e.g., image classification).
  • the model 200 can include a downsampling stage 210.
  • the downsampling stage 210 can reduce a spatial resolution of the input data 202. For instance, if the input data 202 comprises a tensor, the downsampling stage 210 may reduce spatial resolution such that the output of the downsampling stage 210 has at least one dimension or resolution that is lower than that of the tensor of the input data 202. Additionally and/or alternatively, the downsampling stage 210 may increase a number of channels relative to the input data.
  • the downsampling stage 210 can be or can include a convolutional stem.
  • the convolutional stem can have an aggressive stride, such as a stride of greater than 10.
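  • The downsampling arithmetic described above can be illustrated with a short sketch that computes the output shape of a strided convolutional stem. The function name `conv_output_shape`, the 224x224 input, the 16x16 kernel, the stride of 16, and the 64 output channels are illustrative assumptions rather than values from the disclosure.

```python
def conv_output_shape(height, width, kernel, stride, padding, out_channels):
    """Output (height, width, channels) of a single strided 2-D convolution."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, out_channels

# Hypothetical stem: 224x224 RGB input, 16x16 kernel, stride 16, no padding.
stem_shape = conv_output_shape(224, 224, kernel=16, stride=16, padding=0, out_channels=64)
```

  • Under these assumptions the spatial resolution falls from 224x224 to 14x14 while the channel count rises from 3 to 64, which is the trade the downsampling stage is described as making.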
  • the model 200 can include one or more attention block(s) 212.
  • the attention block(s) 212 can receive the downsampled input data from the downsampling stage 210 and produce the output data 204.
  • the attention block(s) 212 can implement a relative attention mechanism.
  • the attention block(s) 212 can be transformer block(s) which operate similarly to Transformer networks.
  • the attention block(s) 212 can include a relative attention mechanism.
  • the relative attention mechanism can include the sum of a static convolution kernel with an adaptive attention matrix. This sum can be applied prior to and/or subsequent to SoftMax normalization by the relative attention mechanism.
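  • the relative attention mechanism (e.g., applied prior to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^\top x_j + w_{i-j}\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k + w_{i-k}\right)} \, x_j$$
  • the relative attention mechanism (e.g., applied subsequent to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{post}} = \sum_{j \in \mathcal{G}} \left( \frac{\exp\left(x_i^\top x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k\right)} + w_{i-j} \right) x_j$$
  • the depthwise convolution kernel $w_{i-j}$ is an input-independent parameter of static value for a given pair of indices $(i, j)$ in the input tensor (e.g., the relative shift between the indices, $i - j$, where the dependence on the relative shift rather than the specific values is termed translation equivalence, which can improve generalization under datasets of limited size),
  • $x_i$ and $y_i$ are the input and output at position $i$, respectively, and
  • $\mathcal{G}$ is a global receptive field (e.g., the entire set of positions).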
  • the use of the global receptive field can provide improved capability of capturing complicated relational interactions between different spatial positions, which can be desirable when processing higher-level concepts.
  • the denominator term can also be referred to as the attention weight, $A_{i,j}$.
  • the attention weight can be decided jointly by the translation equivalence of the depthwise convolution kernel and the input-adaptive input-output pairs, which can provide for both properties to varying degrees, improving generalization, capacity, and/or accuracy of the model.
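  • A minimal NumPy sketch of the relative attention mechanism described above, covering both the variant applied prior to and the variant applied subsequent to SoftMax normalization. It makes several simplifying assumptions that are not from the disclosure: positions lie on a 1-D sequence rather than a 2-D grid, the same tensor serves as query, key, and value with no learned projections, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def relative_bias_matrix(w, n):
    """Expand a (2n - 1,) vector of per-shift values into an (n, n) matrix.

    Entry [i, j] holds the static kernel value for relative shift i - j;
    shift s is stored at w[s + n - 1].
    """
    idx = np.arange(n)
    return w[idx[:, None] - idx[None, :] + n - 1]

def relative_attention(x, w, pre_normalization=True):
    """Sum a static kernel with the adaptive attention matrix over all positions."""
    n = x.shape[0]
    bias = relative_bias_matrix(w, n)  # static, input-independent part
    scores = x @ x.T                   # adaptive, input-dependent part
    if pre_normalization:
        weights = softmax(scores + bias)   # sum applied before normalization
    else:
        weights = softmax(scores) + bias   # sum applied after normalization
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))   # 5 positions, 4 features each
w = rng.normal(size=9)        # one value per relative shift in [-4, 4]
bias = relative_bias_matrix(w, 5)
y_pre = relative_attention(x, w, pre_normalization=True)
y_post = relative_attention(x, w, pre_normalization=False)
```

  • Because `bias` depends only on the shift between positions, its diagonals are constant, which is the translation equivalence property noted above; when `w` is all zeros the two variants reduce to the same plain self-attention.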
  • Figure 3 depicts a block diagram of an example convolution attention network (CoAtNet) model 300 according to example embodiments of the present disclosure.
  • the model 300 is trained to receive a set of input data 202 descriptive of, for example, image data, or other task-specific input data and, as a result of receipt of the input data 202, provide output data 204 that is responsive to a particular machine-learning task, such as a computer vision task (e.g., image classification).
  • the machine-learned convolutional attention network 300 can include two or more network stages (e.g., 302, 304, 306, 308, and 310). Each of the two or more network stages can be or can include one of an attention stage or a convolutional stage such that the convolutional stages are sequentially prior to the attention stages. As one example, in some implementations, the two or more network stages can include an S0 stage 302, an S1 stage 304, an S2 stage 306, an S3 stage 308, and an S4 stage 310. Each of these stages can be a convolutional stage including one or more convolutional blocks (e.g., MBConv blocks) or an attention stage including one or more attention blocks having a relative attention mechanism.
  • the convolutional blocks can perform depthwise separable convolutions (e.g., over a plurality of channels). Additionally and/or alternatively, in some implementations, the convolutional blocks can perform an inverted bottleneck convolution. In some implementations, a spatial resolution gradually decreases over the two or more network stages. In some implementations, a number of channels can be increased (e.g., doubled) at any of the stages, such as at least one of the S1 stage 304, the S2 stage 306, the S3 stage 308, or the S4 stage 310.
  • the S0 stage 302 comprises a two-layer convolutional stem network.
  • the S1 stage 304 can include one or more convolutional blocks with squeeze-excitation.
  • the one or more convolutional blocks of the S1 stage and/or other convolutional stages can include MBConv blocks.
  • the MBConv blocks can be configured to expand channel size from an original channel size of an input to the one or more convolutional blocks and subsequently project the expanded channel size back to the original channel size.
  • the convolutional blocks can perform depthwise separable convolutions (e.g., over a plurality of channels). Additionally and/or alternatively, in some implementations, the convolutional blocks can perform an inverted bottleneck convolution.
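  • The expand-then-project behavior of an MBConv-style block can be sketched in NumPy as below. This is a simplified illustration under stated assumptions: the expansion factor of 4, the ReLU activations, the residual connection, and the function names are all hypothetical, and normalization and squeeze-excitation are omitted.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution: mixes channels at every pixel. x: (H, W, C_in), weight: (C_in, C_out)."""
    return x @ weight

def depthwise_conv3x3(x, kernels):
    """3x3 depthwise convolution, stride 1, zero padding. kernels: (3, 3, C), one filter per channel."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out

def mbconv_block(x, w_expand, k_depthwise, w_project):
    """Inverted bottleneck: expand channels, filter each channel spatially, project back."""
    hidden = np.maximum(pointwise_conv(x, w_expand), 0.0)             # expand (assumed ReLU)
    hidden = np.maximum(depthwise_conv3x3(hidden, k_depthwise), 0.0)  # per-channel 3x3 filter
    return x + pointwise_conv(hidden, w_project)                      # project back, assumed residual

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))                  # 8x8 feature map with 16 channels
w_expand = rng.normal(size=(16, 64)) * 0.1       # 16 -> 64 channels (assumed 4x expansion)
k_depthwise = rng.normal(size=(3, 3, 64)) * 0.1
w_project = rng.normal(size=(64, 16)) * 0.1      # 64 -> 16 channels
y = mbconv_block(x, w_expand, k_depthwise, w_project)
```

  • The output has the same shape as the input, so the wide intermediate representation exists only inside the block.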
  • a width of the S0 stage 302 is less than or equal to a width of the S1 stage 304.
  • each of the S0 stage 302, the S1 stage 304, and the S4 stage 310 includes (e.g., exactly) two blocks, and the S2 stage 306 and the S3 stage 308 each include greater than two blocks.
  • the two or more network stages include an S0 stage 302 comprising a two-layer convolutional stem network, an S1 stage 304 comprising a convolutional block with squeeze-excitation, an S2 stage 306 comprising a convolutional block, an S3 stage 308 comprising an attention block, and an S4 stage 310 comprising an attention block, wherein each of the S3 stage 308 and the S4 stage 310 comprises a relative attention mechanism configured to determine a sum of a static convolution kernel with an adaptive attention matrix.
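  • The five-stage layout described above can be written down as a small configuration and checked against the stated constraints. The `Stage` dataclass, the helper names, and the specific block counts and widths below are illustrative assumptions; the disclosure states only the relationships among them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    kind: str    # "conv" or "attention"
    blocks: int  # number of blocks (or layers, for the stem) in the stage
    width: int   # number of channels

def convs_precede_attention(stages):
    """True if every convolutional stage comes before every attention stage."""
    seen_attention = False
    for stage in stages:
        if stage.kind == "attention":
            seen_attention = True
        elif seen_attention:
            return False
    return True

def matches_example_layout(stages):
    """Check the example constraints on ordering, width, and block counts."""
    by_name = {s.name: s for s in stages}
    return (
        convs_precede_attention(stages)
        and by_name["S0"].width <= by_name["S1"].width
        and all(by_name[n].blocks == 2 for n in ("S0", "S1", "S4"))
        and all(by_name[n].blocks > 2 for n in ("S2", "S3"))
    )

# Hypothetical sizes chosen only to satisfy the constraints.
layout = [
    Stage("S0", "conv", 2, 64),
    Stage("S1", "conv", 2, 96),
    Stage("S2", "conv", 3, 192),
    Stage("S3", "attention", 5, 384),
    Stage("S4", "attention", 2, 768),
]
```

  • Reordering the list so that an attention stage precedes a convolutional stage makes `convs_precede_attention` return `False`.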
  • the attention blocks and/or stage(s) can include a relative attention mechanism according to example aspects of the present disclosure.
  • the relative attention mechanism can include the sum of a static convolution kernel with an adaptive attention matrix. This sum can be applied prior to and/or subsequent to SoftMax normalization by the relative attention mechanism.
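  • the relative attention mechanism (e.g., applied prior to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^\top x_j + w_{i-j}\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k + w_{i-k}\right)} \, x_j$$
  • the relative attention mechanism (e.g., applied subsequent to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{post}} = \sum_{j \in \mathcal{G}} \left( \frac{\exp\left(x_i^\top x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k\right)} + w_{i-j} \right) x_j$$
  • the depthwise convolution kernel $w_{i-j}$ is an input-independent parameter of static value for a given pair of indices $(i, j)$ in the input tensor (e.g., the relative shift between the indices, $i - j$, where the dependence on the relative shift rather than the specific values is termed translation equivalence, which can improve generalization under datasets of limited size), $x_i$ and $y_i$ are the input and output at position $i$, respectively, and $\mathcal{G}$ is a global receptive field (e.g., the entire set of positions).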
  • Figure 4 depicts a block diagram of an example convolution attention network (CoAtNet) model 400 according to example embodiments of the present disclosure.
  • the model 400 can include S0, S1, S2, S3, and S4 stages.
  • the S0 stage or stem stage can include two (e.g., 3x3) convolutional layers (e.g., with a stride of 2).
  • the convolutional S1 stage and S2 stage can each include a 1x1 convolutional layer, a 3x3 deconvolution layer, and a 1x1 convolution layer.
  • the attention (e.g., S3 and S4) stages can each include a relative attention mechanism and a feedforward network.
  • the model can additionally include a global pooling layer and a fully connected layer to produce the model output. Each of the stages can be repeated a designed number of times.
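  • The tail of the model in Figure 4 (global pooling followed by a fully connected layer) and the gradual reduction in spatial resolution can be sketched as follows. The assumption that every stage halves the resolution, the 224-pixel input, and the function names are illustrative and are not stated in the disclosure.

```python
import numpy as np

def stage_resolutions(input_size, num_stages=5, stride=2):
    """Spatial size after each stage, assuming every stage downsamples by `stride`."""
    sizes = []
    size = input_size
    for _ in range(num_stages):
        size //= stride
        sizes.append(size)
    return sizes

def classification_head(features, fc_weight, fc_bias):
    """Global average pooling over (H, W), then a fully connected layer."""
    pooled = features.mean(axis=(0, 1))   # (C,)
    return pooled @ fc_weight + fc_bias   # (num_classes,)

sizes = stage_resolutions(224)            # sizes after S0 through S4
rng = np.random.default_rng(0)
features = rng.normal(size=(sizes[-1], sizes[-1], 768))  # hypothetical S4 output
logits = classification_head(features, rng.normal(size=(768, 10)), np.zeros(10))
```

  • Under these assumptions the resolution sequence is 112, 56, 28, 14, 7, and the head maps the final feature map to one score per class.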
  • Figure 5 depicts a flow chart diagram of an example method to perform according to example embodiments of the present disclosure.
  • Although Figure 5 depicts steps performed in a particular order for purposes of illustration and discussion, the methods of the present disclosure are not limited to the particularly illustrated order or arrangement.
  • the various steps of the method 500 can be omitted, rearranged, combined, and/or adapted in various ways without deviating from the scope of the present disclosure.
  • the method 500 can include, at 502, obtaining (e.g., by a computing system including one or more computing devices) input data including an input tensor having one or more dimensions.
  • the input tensor can be a two-dimensional tensor having a length and/or a width.
  • the input tensor can have one or more channels.
  • the input tensor may be or may include image data such as an image having a length, a width, and/or a plurality of color channels.
  • the method 500 can include, at 504, providing (e.g., by the computing system) the input data to a machine-learned convolutional attention network.
  • the machine-learned convolutional attention network can be any suitable network according to example aspects of the present disclosure, such as the networks 200 and/or 300 of Figures 2 and/or 3.
  • the machine-learned convolutional attention network can include two or more network stages, each of the two or more network stages including one of an attention stage or a convolutional stage such that the convolutional stages are sequentially prior to the attention stages.
  • the two or more network stages can include an S0 stage, an S1 stage, an S2 stage, an S3 stage, and an S4 stage.
  • Each of these stages can be a convolutional stage including one or more convolutional blocks (e.g., MBConv blocks) or an attention stage including one or more attention blocks having a relative attention mechanism.
  • a spatial resolution gradually decreases over the two or more network stages.
  • a number of channels can be increased (e.g., doubled) at any of the stages, such as at least one of the S1 stage, the S2 stage, the S3 stage, or the S4 stage.
  • the S0 stage comprises a two-layer convolutional stem network.
  • the S1 stage can include one or more convolutional blocks with squeeze-excitation.
  • the one or more convolutional blocks of the S1 stage and/or other convolutional stages can include MBConv blocks.
  • the MBConv blocks can be configured to expand channel size from an original channel size of an input to the one or more convolutional blocks and subsequently project the expanded channel size back to the original channel size.
  • the convolutional blocks can perform depthwise separable convolutions (e.g., over a plurality of channels). Additionally and/or alternatively, in some implementations, the convolutional blocks can perform an inverted bottleneck convolution.
  • a width of the S0 stage is less than or equal to a width of the S1 stage.
  • each of the S0 stage, the S1 stage, and the S4 stage includes (e.g., exactly) two blocks, and the S2 stage and the S3 stage each include greater than two blocks.
  • the two or more network stages include an S0 stage comprising a two-layer convolutional stem network, an S1 stage comprising a convolutional block with squeeze-excitation, an S2 stage comprising a convolutional block, an S3 stage comprising an attention block, and an S4 stage comprising an attention block, wherein each of the S3 stage and the S4 stage comprises a relative attention mechanism configured to determine a sum of a static convolution kernel with an adaptive attention matrix.
  • the machine-learned convolutional attention network can include a downsampling stage configured to reduce a spatial resolution relative to the input tensor and one or more attention blocks comprising a relative attention mechanism.
  • the downsampling stage can reduce the spatial resolution to improve feasibility of performing computations. For instance, if the input data includes a tensor, the downsampling stage may reduce spatial resolution such that the output of the downsampling stage has at least one dimension or resolution that is lower than that of the tensor of the input data. Additionally and/or alternatively, the downsampling stage may increase a number of channels relative to the input data.
  • the downsampling stage can be or can include a convolutional stem.
  • the convolutional stem can have an aggressive stride, such as a stride of greater than 10.
  • the attention blocks and/or stage(s) can include a relative attention mechanism according to example aspects of the present disclosure.
  • the relative attention mechanism can include the sum of a static convolution kernel with an adaptive attention matrix. This sum can be applied prior to and/or subsequent to SoftMax normalization by the relative attention mechanism.
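  • the relative attention mechanism (e.g., applied prior to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\left(x_i^\top x_j + w_{i-j}\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k + w_{i-k}\right)} \, x_j$$
  • the relative attention mechanism (e.g., applied subsequent to the SoftMax normalization) may be mathematically represented by:
    $$y_i^{\text{post}} = \sum_{j \in \mathcal{G}} \left( \frac{\exp\left(x_i^\top x_j\right)}{\sum_{k \in \mathcal{G}} \exp\left(x_i^\top x_k\right)} + w_{i-j} \right) x_j$$
  • the depthwise convolution kernel $w_{i-j}$ is an input-independent parameter of static value for a given pair of indices $(i, j)$ in the input tensor (e.g., the relative shift between the indices, $i - j$, where the dependence on the relative shift rather than the specific values is termed translation equivalence, which can improve generalization under datasets of limited size),
  • $x_i$ and $y_i$ are the input and output at position $i$, respectively, and
  • $\mathcal{G}$ is a global receptive field (e.g., the entire set of positions).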
  • the method 500 can include, at 506, in response to providing the input data to the machine-learned convolutional attention network, receiving, by the computing system, a machine-learning prediction from the machine-learned convolutional attention network.
  • the machine-learning prediction can be a task-specific machine-learning prediction.
  • the output can be a computer vision output, such as a classification output (e.g., classification vector), object recognition output, etc.
  • the machine-learning prediction can be an intermediate prediction or representation such as an embedding in a latent or learned space.
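  • The three steps of the method 500 (obtain at 502, provide at 504, receive at 506) can be sketched end to end as below. The function `run_method`, the stand-in `stub_network`, and the 32x32 RGB input are hypothetical; the stub is a trivial pooling-plus-projection model, not the network of the disclosure.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of scores."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def run_method(input_tensor, network):
    # 502: obtain input data that includes an input tensor with one or more dimensions.
    input_tensor = np.asarray(input_tensor, dtype=float)
    if input_tensor.ndim < 1:
        raise ValueError("input tensor must have one or more dimensions")
    # 504: provide the input data to the network.
    logits = network(input_tensor)
    # 506: receive the prediction, here returned as a classification vector.
    return softmax(logits)

rng = np.random.default_rng(0)
projection = rng.normal(size=(3, 4))   # 3 color channels -> 4 hypothetical classes

def stub_network(image):
    """Stand-in model: average over pixels, then a linear projection."""
    return image.mean(axis=(0, 1)) @ projection

image = rng.uniform(size=(32, 32, 3))  # length, width, color channels
prediction = run_method(image, stub_network)
```

  • The returned vector has one score per class and sums to one; any callable that maps a tensor to class scores could be passed in place of the stub.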

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Neurology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

A computer-implemented method for performing computer vision with reduced computational cost and improved accuracy can include: obtaining, by a computing system including one or more computing devices, input data including an input tensor having one or more dimensions; providing, by the computing system, the input data to a machine-learned convolutional attention network, the machine-learned convolutional attention network including two or more network stages; and, in response to providing the input data to the machine-learned convolutional attention network, receiving, by the computing system, a machine-learning prediction from the machine-learned convolutional attention network. The convolutional attention network can include at least one attention block. The attention block includes a relative attention mechanism. The relative attention mechanism includes the sum of a static convolution kernel with an adaptive attention matrix. This provides for improved generalization, capacity, and efficiency of the convolutional attention network relative to some existing models.
PCT/US2022/031304 2021-05-27 2022-05-27 Systems and methods for machine-learned models having convolution and attention WO2022251602A1 (fr)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202280026409.3A CN117377983A (zh) 2021-05-27 2022-05-27 Systems and methods for machine-learned models having convolution and attention
EP22731945.6A EP4288939A1 (fr) 2021-05-27 2022-05-27 Systems and methods for machine-learned models having convolution and attention
JP2023557195A JP2024517056A (ja) 2021-05-27 2022-05-27 Systems and methods for machine-learned models having convolution and attention

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163194077P 2021-05-27 2021-05-27
US63/194,077 2021-05-27

Publications (2)

Publication Number Publication Date
WO2022251602A1 WO2022251602A1 (fr) 2022-12-01
WO2022251602A9 true WO2022251602A9 (fr) 2023-09-07

Family

ID=82115984

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/031304 WO2022251602A1 (fr) 2021-05-27 2022-05-27 Systems and methods for machine-learned models having convolution and attention

Country Status (5)

Country Link
US (2) US11755883B2 (fr)
EP (1) EP4288939A1 (fr)
JP (1) JP2024517056A (fr)
CN (1) CN117377983A (fr)
WO (1) WO2022251602A1 (fr)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6936592B2 (ja) * 2017-03-03 2021-09-15 キヤノン株式会社 Arithmetic processing device and control method thereof
CN110189334B (zh) * 2019-05-28 2022-08-09 南京邮电大学 Medical image segmentation method using a residual fully convolutional neural network based on an attention mechanism
US11651191B2 (en) * 2019-09-03 2023-05-16 Here Global B.V. Methods, apparatuses, and computer program products using a repeated convolution-based attention module for improved neural network implementations
US11403486B2 (en) * 2019-11-13 2022-08-02 Huawei Technologies Co., Ltd. Methods and systems for training convolutional neural network using built-in attention
CN112464792A (zh) * 2020-11-25 2021-03-09 北京航空航天大学 Fine-grained classification method for ship targets in remote sensing images based on dynamic convolution

Also Published As

Publication number Publication date
EP4288939A1 (fr) 2023-12-13
JP2024517056A (ja) 2024-04-19
US11755883B2 (en) 2023-09-12
US20220383069A1 (en) 2022-12-01
US20230359862A1 (en) 2023-11-09
WO2022251602A1 (fr) 2022-12-01
CN117377983A (zh) 2024-01-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22731945

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022731945

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2023557195

Country of ref document: JP

ENP Entry into the national phase

Ref document number: 2022731945

Country of ref document: EP

Effective date: 20230908

WWE Wipo information: entry into national phase

Ref document number: 202280026409.3

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE