WO2022216521A1 - Dual-flattening transformer through decomposed row and column queries for semantic segmentation - Google Patents

Dual-flattening transformer through decomposed row and column queries for semantic segmentation Download PDF

Info

Publication number
WO2022216521A1
Authority
WO
WIPO (PCT)
Prior art keywords
row
column
wise
output
mha
Prior art date
Application number
PCT/US2022/022831
Other languages
French (fr)
Inventor
Ying Wang
Guo-Jun Qi
Wenju Xu
Chiu Man HO
Ziwei XUAN
Original Assignee
Innopeak Technology, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innopeak Technology, Inc. filed Critical Innopeak Technology, Inc.
Publication of WO2022216521A1 publication Critical patent/WO2022216521A1/en

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Definitions

  • BACKGROUND Obtaining high-resolution features is important in many computer vision tasks, especially for dense prediction problems such as semantic segmentation, object detection, and pose estimation, or the like.
  • Typical approaches employ convolutional encoder-decoder architectures where an encoder outputs low-resolution features and a decoder upsamples features with simple filters such as bilinear interpolation.
  • Bilinear upsampling has limited capacity in obtaining high-resolution features, as it only conducts linear interpolation between neighboring pixels without considering nonlinear dependencies in global contexts.
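The locality described above can be seen directly in code. The following is a minimal NumPy sketch of bilinear upsampling (illustrative, not from the application); note that every output value is a fixed linear blend of at most four neighboring input values, so no nonlinear or global context can be recovered.

```python
import numpy as np

def bilinear_upsample(x, H, W):
    """Naively upsample a (h, w) feature map to (H, W) bilinearly.

    Each output pixel blends at most 4 neighboring input pixels with
    fixed linear weights, which is why bilinear upsampling cannot model
    nonlinear dependencies in global contexts.
    """
    h, w = x.shape
    # Map each output coordinate back to fractional input coordinates.
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Weighted sum of the four surrounding input values.
    return ((1 - wy) * (1 - wx) * x[np.ix_(y0, x0)]
            + (1 - wy) * wx * x[np.ix_(y0, x1)]
            + wy * (1 - wx) * x[np.ix_(y1, x0)]
            + wy * wx * x[np.ix_(y1, x1)])
```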
  • Various approaches have been proposed to improve the high-resolution feature quality, such as PointRend and DUpsample.
  • PointRend carefully selects uncertain points in the downsampled feature space and refines them by incorporating low-level features.
  • DUpsample adopts a data-dependent upsampling strategy to recover segmentation from the coarse prediction.
  • However, these approaches lack the ability to capture the long-range dependencies needed for fine-grained details.
  • Diverse non-local or self-attention based schemes have been proposed to enhance the output features, but mostly in the downsampled feature space. They still rely on a bilinear upsampling procedure to obtain high-resolution features, which tends to lose global information.
  • Recently, transformers have drawn tremendous interest, due to their great success in capturing long-range dependencies.
  • Vision Transformers ("ViT") apply the transformer architecture to sequences of image patches.
  • Multi-scale ViTs have been presented to achieve hierarchical features with different resolutions and have boosted the performance of many dense prediction tasks.
  • However, upper-level features with low spatial resolution still rely on bilinear upsampling to recover the full-resolution features.
  • Naive bilinear upsampling is inherently weak at recovering fine-grained details, since it is intrinsically linear and local, interpolating only from neighboring pixels.
  • Several efficient attention designs can reduce the model complexity, such as Axial-attention, Criss-Cross attention, and LongFormer. However, they mainly focus on feature enhancement in the downsampled space, without recovering high-resolution features or recovering fine-grained details by modeling nonlinear dependencies on a more global scale from non-local neighbors. For dense prediction tasks such as semantic segmentation, it is critical to obtain high-resolution features with long-range dependency.
  • A naive dense transformer, however, incurs an intractable complexity, limiting its application for high-resolution dense prediction.
  • the techniques of this disclosure generally relate to tools and techniques for implementing computer vision technologies, and, more particularly, to methods, systems, and apparatuses for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation.
  • A method may comprise: flattening, using a computing system, an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flattening, using the computing system, the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implementing, using the computing system, one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implementing, using the computing system, one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; generating, using the computing system, a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; generating, using the computing system, a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and generating, using the computing system, the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.
  • a dual-flattening transformer system may be provided for implementing semantic segmentation.
  • the system may comprise a computing system, which may comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor.
  • The first non-transitory computer readable medium may have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: flatten an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flatten the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implement one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implement one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; generate a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; generate a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.
  • Where a sub-label is associated with a reference numeral, it denotes one of multiple similar components.
  • Fig.1 is a schematic diagram illustrating a system for implementing dual- flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation, in accordance with various embodiments.
  • Figs.2A-2G are schematic block flow diagrams illustrating various non-limiting examples of components of the DFlatFormer for implementing semantic segmentation, in accordance with various embodiments.
  • Figs.3A and 3B are diagrams illustrating various non-limiting examples of visualization comparisons of DFlatFormer and of conventional semantic segmentation techniques using example datasets, in accordance with various embodiments.
  • Figs.4A-4E are flow diagrams illustrating a method for implementing DFlatFormer through decomposed row and column queries for semantic segmentation, in accordance with various embodiments.
  • Fig.5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • a computing system may flatten the input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map; and may flatten the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map.
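The two flattenings above amount to row-major and column-major orderings of the same feature map. A minimal NumPy sketch (the shapes are illustrative assumptions):

```python
import numpy as np

# Hypothetical input feature map: h rows, w columns, c channels.
h, w, c = 3, 4, 2
fmap = np.arange(h * w * c, dtype=float).reshape(h, w, c)

# Row-wise flattening: concatenate each successive row after the first,
# yielding h*w feature vectors in row-major order.
row_seq = fmap.reshape(h * w, c)

# Column-wise flattening: concatenate each successive column after the
# first, i.e. flatten the map in column-major order.
col_seq = fmap.transpose(1, 0, 2).reshape(h * w, c)
```

In the row-wise sequence, element `w` is the first pixel of the second row; in the column-wise sequence, element `h` is the first pixel of the second column.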
  • the computing system may implement one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; and may implement one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map.
  • the computing system may generate a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; may generate a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and may generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.
  • the computing system may comprise at least one of a dual- flattening transformer ("DFlatFormer"), a machine learning system, an AI system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map may be implemented concurrent with implementation of flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map.
  • the method may further comprise bilinearly upsampling, using the computing system, the input feature map; and combining, using the computing system, the bilinearly upsampled input feature map and the output feature map to generate a dense feature map.
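The upsample-and-combine step can be sketched as below. The separable (row-then-column) bilinear resize and the element-wise addition are illustrative assumptions; the text only says the two maps are combined.

```python
import numpy as np

def upsample_bilinear_1ch(x, H, W):
    # Separable bilinear resize: 1-D linear interpolation along each axis.
    h, w = x.shape
    rows = np.array([np.interp(np.linspace(0, w - 1, W), np.arange(w), r)
                     for r in x])                       # (h, W)
    cols = np.array([np.interp(np.linspace(0, h - 1, H), np.arange(h), c)
                     for c in rows.T]).T                # (H, W)
    return cols

h, w, H, W = 2, 2, 4, 4
fmap = np.array([[0., 2.], [4., 6.]])   # low-resolution input feature map
transformer_out = np.zeros((H, W))      # placeholder DFlatFormer output

# Dense feature map = bilinearly upsampled input + transformer output.
dense = upsample_bilinear_1ch(fmap, H, W) + transformer_out
```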
  • the method may further comprise performing, using the computing system, semantic segmentation based on the generated dense feature map.
  • a dual-flattening transformer (“DFlatFormer") is provided, e.g., for performing semantic segmentation or other dense prediction operations (e.g., object detection, pose estimation, etc.).
  • Semantic segmentation may be for such high-resolution dense prediction operations as medical imaging, autonomous driving, augmented or virtual reality (AR/VR), land mapping, video conferencing, and/or the like.
  • DFlatFormer allows a transformer architecture that is not only efficient to recover full- resolution features, but also able to recover fine-grained details by exploring full contexts nonlinearly and globally.
  • The dual-flattening transformer architecture is also able to obtain a high-resolution feature map with a complexity of O(hw(H + W)), where h × w and H × W are the input and output feature map sizes, respectively.
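A rough cost comparison motivates the decomposition. Assuming the row branch uses H queries and the column branch W queries, each attending over all h·w input tokens, the query-key interaction count is hw(H + W), versus hw·HW for a dense transformer that attends every output pixel to every input token (the sizes below are illustrative).

```python
# Rough attention-cost comparison (counts of query-key interactions).
h, w = 32, 32       # input feature map size (illustrative)
H, W = 128, 128     # output feature map size (illustrative)

dense_cost = (h * w) * (H * W)   # naive dense transformer
dflat_cost = (h * w) * (H + W)   # decomposed row + column queries

print(dense_cost // dflat_cost)  # prints 64
```

Even at this modest 4x upsampling, the decomposed design is 64x cheaper, and the gap widens with the output resolution.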
  • The proposed architecture can also serve as a flexible plug-in module for any CNN- or transformer-based encoder to obtain high-resolution dense predictions.
  • Some embodiments can improve the functioning of user equipment or systems themselves (e.g., computer vision systems, dense prediction systems, semantic segmentation systems, object detection systems, pose estimation systems, etc.), for example, by flattening, using a computing system, an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flattening, using the computing system, the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implementing, using the computing system, one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implementing, using the computing system, one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; and generating, using the computing system, an output feature map by combining a column-expanded row-wise output feature map and a row-expanded column-wise output feature map, and/or the like.
  • In this way, an optimized computer vision architecture (i.e., DFlatFormer) may be provided: a transformer architecture that is not only efficient to recover full-resolution features, but also able to recover fine-grained details by exploring full contexts nonlinearly and globally, where h × w and H × W are the input and output feature map sizes, respectively.
  • Figs.1-5 illustrate some of the features of the method, system, and apparatus for implementing computer vision technologies, and, more particularly, to methods, systems, and apparatuses for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation, as referred to above.
  • the methods, systems, and apparatuses illustrated by Figs.1-5 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments.
  • Fig.1 is a schematic diagram illustrating a system 100 for implementing dual-flattening transformer through decomposed row and column queries for semantic segmentation, in accordance with various embodiments.
  • System 100 may comprise computing system 105, which may include, but is not limited to, a dual-flattening transformer ("DFlatFormer") 110 and an artificial intelligence ("AI") system 115, or the like.
  • The computing system 105, the DFlatFormer 110, and/or the AI system 115 may be part of a semantic segmentation system 120, or may be separate from, yet communicatively coupled with, the semantic segmentation system 120.
  • An encoder 125 – which may include, without limitation, one of a convolutional neural network ("CNN")-based encoder or a transformer-based encoder, or the like – may also be part of semantic segmentation system 120, or may be separate from, yet communicatively coupled with, the semantic segmentation system 120.
  • the computing system 105, the DFlatFormer 110, and/or the AI system 115 may be embodied as an integrated system.
  • Alternatively, the computing system 105, the DFlatFormer 110, and/or the AI system 115 may be embodied as separate, yet communicatively coupled, systems.
  • computing system 105 may include, without limitation, at least one of DFlatFormer 110, a machine learning system, AI system 115, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • the DFlatFormer 110 and/or the AI system 115 may include a neural network including, but not limited to, at least one of a multi-layer perceptron (“MLP”) neural network, a transformer deep learning model-based network, a feed-forward artificial neural network (“ANN”), a recurrent neural network (“RNN”), a convolutional neural network (“CNN”), or a fully convolutional network (“FCN”), and/or the like.
  • System 100 may further comprise one or more content sources 130 (and corresponding database(s) 135) and content distribution system 140 (and corresponding database(s) 145) that communicatively couple with at least one of computing system 105, DFlatFormer 110, AI system 115, and/or semantic segmentation system 120, via network(s) 150.
  • System 100 may further comprise one or more user devices 155a-155n (collectively, "user devices 155" or the like) that communicatively couple with at least one of computing system 105, DFlatFormer 110, AI system 115, and/or semantic segmentation system 120, either directly via wired (not shown) or wireless communications links (denoted by lightning bolt symbols in Fig.1), or indirectly via network(s) 150 and via wired (not shown) and/or wireless communications links (denoted by lightning bolt symbols in Fig.1).
  • The user devices 155 may each include, but are not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a server computer, a digital photo album platform-compliant device, a web-based digital photo album platform-compliant device, a software application ("app")-based digital photo album platform-compliant device, a video sharing platform-compliant device, a web-based video sharing platform-compliant device, an app-based video sharing platform-compliant device, a law enforcement computing system, a security system computing system, a surveillance system computing system, a military computing system, and/or the like.
  • At least one of computing system 105, DFlatFormer 110, AI system 115, and/or semantic segmentation system 120 may receive image data (e.g., image data 160, or the like); and may extract, using a feature extractor (not shown; in some cases, as part of encoder 125, or the like), features from the received image data, and may generate an input feature map, the input feature map including, but not limited to, an image containing features extracted from an input image containing one or more objects, or the like.
  • the computing system may flatten the input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map; and may flatten the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map.
  • the computing system may implement one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; and may implement one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map.
  • the computing system may generate a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; may generate a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and may generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.
  • flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map may be implemented concurrent with implementation of flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map.
  • the computing system may bilinearly upsample the input feature map, and may combine the bilinearly upsampled input feature map and the output feature map to generate a dense feature map.
  • the computing system may perform semantic segmentation based on the generated dense feature map, and, in some cases, may send the semantic segmentation results (e.g., semantic segmentation 165, or the like) to at least one of one or more content sources (e.g., content source(s) 130, or the like), a content distribution system (e.g., content distribution system 140, or the like), or one or more user devices (e.g., user devices 155, or the like), and/or the like.
  • the computing system may calculate row positional code data, by performing linear interpolation on the input feature map to generate a first number of row positional code data, the first number corresponding to the height of the output feature map; may calculate row-flattened positional code data based on the row-wise flattened feature sequence; and may calculate, using a first row-wise multi-head attention ("MHA") model, a first row-wise MHA output based on the row-wise flattened feature sequence and based on first row query, key, and value vectors.
  • the first row query vector may be based on the calculated row positional code data
  • the first row key vector and the first row value vector may each be based on the calculated row-flattened positional code data.
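Putting the row branch's first attention together, a single-head sketch (learned projections are omitted for brevity, and the 1-D linear interpolation of positional codes is an illustrative stand-in for the interpolation described above):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (single head, projections omitted).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
h, w, d, H = 4, 5, 8, 16
row_seq = rng.standard_normal((h * w, d))   # row-wise flattened features

# Row positional codes: linearly interpolate h base codes up to H codes,
# one per output row.
base = rng.standard_normal((h, d))
t = np.linspace(0, h - 1, H)
i0 = np.floor(t).astype(int)
i1 = np.minimum(i0 + 1, h - 1)
frac = (t - i0)[:, None]
row_pos = (1 - frac) * base[i0] + frac * base[i1]   # (H, d)

# Queries come from the interpolated positional codes; keys and values
# come from the row-wise flattened sequence.
row_mha_out = attention(row_pos, row_seq, row_seq)  # (H, d)
```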
  • the computing system may calculate column positional code data, by performing linear interpolation on the input feature map to generate a second number of column positional code data, the second number corresponding to the width of the output feature map; may calculate column-flattened positional code data based on the column-wise flattened feature sequence; and may calculate, using a first column-wise MHA model, a first column-wise MHA output based on the column-wise flattened feature sequence and based on first column query, key, and value vectors.
  • the first column query vector may be based on the calculated column positional code data, while the first column key vector and the first column value vector may each be based on the calculated column- flattened positional code data.
  • the computing system may calculate, using a first row-wise row-column interactive attention model, a first layer row embedding output based on second row query, key, and value vectors.
  • the second row query vector may be based on the first column-wise MHA output, while the second row key vector and the second row value vector may each be based on the first row-wise MHA output.
  • the computing system may calculate, using a first column-wise row-column interactive attention model, a first layer column embedding output based on second column query, key, and value vectors.
  • the second column query vector may be based on the first row-wise MHA output
  • the second column key vector and the second column value vector may each be based on the first column-wise MHA output.
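The cross-branch exchange above can be sketched as follows (single head, no projections). Following the text, the row branch's queries come from the column branch's MHA output and vice versa; for this sketch the two branches are given the same sequence length so the shapes line up, which need not hold in general.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (single head, projections omitted).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 16, 8                            # take H = W = n so shapes align
row_mha = rng.standard_normal((n, d))   # first row-wise MHA output
col_mha = rng.standard_normal((n, d))   # first column-wise MHA output

# Row-column interactive attention: each branch queries the other
# branch's MHA output, so row and column embeddings exchange information.
row_embed = attention(col_mha, row_mha, row_mha)  # queries from column branch
col_embed = attention(row_mha, col_mha, col_mha)  # queries from row branch
```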
  • the one or more row transformer layers may include a plurality of row transformer layers.
  • The computing system may calculate, using a second row-wise MHA model, a second row-wise MHA output based on third row query, key, and value vectors, where the third row query vector may be based on a layer row embedding output of an immediately preceding layer among the plurality of row transformer layers, while the third row key vector and the third row value vector may each be based on the calculated row-flattened positional code data; and may calculate, using a second row-wise row-column interactive attention model, a layer row embedding output corresponding to said row transformer layer based on fourth row query, key, and value vectors, where the fourth row query vector may be based on a second column-wise MHA output from a second column-wise MHA model for a corresponding column transformer layer, while the fourth row key vector and the fourth row value vector may each be based on the second row-wise MHA output.
  • the row-wise output feature map may be based on the layer row embedding output corresponding to the last row transformer layer among the plurality of row transformer layers.
  • the one or more column transformer layers may include a plurality of column transformer layers.
  • The computing system may calculate, using the second column-wise MHA model, the second column-wise MHA output based on third column query, key, and value vectors, where the third column query vector may be based on a layer column embedding output of an immediately preceding layer among the plurality of column transformer layers, while the third column key vector and the third column value vector may each be based on the calculated column-flattened positional code data; and may calculate, using a second column-wise row-column interactive attention model, a layer column embedding output corresponding to said column transformer layer based on fourth column query, key, and value vectors, where the fourth column query vector may be based on the second row-wise MHA output from the second row-wise MHA model for a corresponding row transformer layer, while the fourth column key vector and the fourth column value vector may each be based on the second column-wise MHA output.
  • the column-wise output feature map may be based on the layer column embedding output corresponding to the last column transformer layer among the plurality of column transformer layers.
  • the computing system may perform one of: MHA via grouping; MHA via pooling; or a combination of the MHA via grouping and the MHA via pooling; and/or the like.
  • The MHA via grouping may comprise the computing system: dividing the input feature map into a plurality of groups of row-wise input feature maps; dividing the input feature map into a plurality of groups of column-wise input feature maps; calculating, using the first row-wise MHA model, a first row-wise group-combined MHA output, by independently calculating row-wise MHA output for each group of row-wise input feature maps and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps, where the first row-wise MHA output may include the first row-wise group-combined MHA output; and calculating, using the first column-wise MHA model, a first column-wise group-combined MHA output, by independently calculating column-wise MHA output for each group of column-wise input feature maps and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps, where the first column-wise MHA output includes the first column-wise group-combined MHA output.
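MHA via grouping can be sketched for the row branch as splitting the rows (and the corresponding queries) into independent groups and concatenating the per-group attention outputs. The even split and the random stand-in queries are illustrative assumptions.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (single head, projections omitted).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(1)
h, w, d, H, g = 8, 8, 4, 32, 4            # g row groups (illustrative)
fmap = rng.standard_normal((h, w, d))     # input feature map
queries = rng.standard_normal((H, d))     # stand-in row queries

# Each group of H//g queries attends only to its own group's (h//g)*w
# flattened tokens; the per-group outputs are then concatenated.
out_groups = []
for q_grp, f_grp in zip(np.split(queries, g), np.split(fmap, g)):
    kv = f_grp.reshape(-1, d)
    out_groups.append(attention(q_grp, kv, kv))
row_group_mha = np.concatenate(out_groups)   # (H, d)
```

Each group's attention is over h·w/g keys instead of h·w, reducing cost by the number of groups.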
  • The MHA via pooling may comprise the computing system: average-pooling rows of the input feature map to generate an average-pooled row-wise input feature map and to generate a row-wise flattened average-pooled feature sequence; average-pooling columns of the input feature map to generate an average-pooled column-wise input feature map and to generate a column-wise flattened average-pooled feature sequence; calculating, using the first row-wise MHA model, the first row-wise average-pooled MHA output based on the average-pooled row-wise input feature map and the row-wise flattened average-pooled feature sequence, where the first row-wise MHA output includes the first row-wise average-pooled MHA output; and calculating, using the first column-wise MHA model, the first column-wise average-pooled MHA output based on the average-pooled column-wise input feature map and the column-wise flattened average-pooled feature sequence, where the first column-wise MHA output includes the first column-wise average-pooled MHA output.
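MHA via pooling can be sketched as shrinking the key/value sequence by average-pooling before attention. Pooling each row (respectively column) down to a single token is one plausible reading of the text above, and the stand-in queries are illustrative.

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention (single head, projections omitted).
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(2)
h, w, d, H, W = 8, 8, 4, 32, 32
fmap = rng.standard_normal((h, w, d))        # input feature map
row_queries = rng.standard_normal((H, d))    # stand-in row queries
col_queries = rng.standard_normal((W, d))    # stand-in column queries

# Row branch: average-pool each row to one token (h*w -> h keys/values).
pooled_rows = fmap.mean(axis=1)              # (h, d)
row_pooled_mha = attention(row_queries, pooled_rows, pooled_rows)

# Column branch: average-pool each column (h*w -> w keys/values).
pooled_cols = fmap.mean(axis=0)              # (w, d)
col_pooled_mha = attention(col_queries, pooled_cols, pooled_cols)
```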
  • The combination of the MHA via grouping and the MHA via pooling may comprise the computing system: combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer, and combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer.
  • the MHA via grouping may comprise the computing system dividing the input feature map into a plurality of groups of row-wise input feature maps.
  • Flattening the input feature map into the row-wise flattened feature sequence may comprise the computing system flattening the row-wise input feature map for each group of row-wise input feature maps into a row-wise flattened feature sub-sequence among a plurality of groups of row-wise flattened feature sub-sequences.
  • calculating the row positional code data may comprise the computing system calculating row positional code data for each group of row-wise input feature maps, by performing linear interpolation on each group of row-wise input feature maps to generate a third number of row positional code data for each group of row-wise input feature maps, the third number corresponding to the height of the output feature map divided by the number of groups of row-wise input feature maps.
  • calculating the row- flattened positional code data may comprise the computing system calculating row-flattened positional code data for each group of row-wise flattened feature sequences.
  • Calculating the first row-wise MHA output may comprise the computing system: independently calculating row-wise MHA output for each group of row-wise input feature maps based on row query, key, and value vectors for each group of row-wise input feature maps, where the row query vector for each group of row-wise input feature maps may be based on the corresponding calculated row positional code data for said group of row-wise input feature maps, while the row key vector and the row value vector for each group of row-wise input feature maps may each be based on the corresponding calculated row-flattened positional code data for said group of row-wise flattened feature sequences; and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps to generate a first row-wise group-combined MHA output.
  • the first row-wise MHA output may comprise the first row-wise group-combined MHA output.
  • the MHA via grouping may comprise the computing system dividing the input feature map into a plurality of groups of column-wise input feature maps.
  • flattening the input feature map into the column-wise flattened feature sequence may comprise the computing system flattening the column-wise input feature map for each group of column-wise input feature maps into a column-wise flattened feature sub-sequence among a plurality of groups of column-wise flattened feature sub-sequences.
• calculating the column positional code data may comprise the computing system calculating column positional code data for each group of column-wise input feature maps, by performing linear interpolation on each group of column-wise input feature maps to generate a third number of column positional code data for each group of column-wise input feature maps, the third number corresponding to the width of the output feature map divided by the number of groups of column-wise input feature maps.
  • calculating the column-flattened positional code data may comprise the computing system calculating column-flattened positional code data for each group of column-wise flattened feature sequences.
  • calculating the first column-wise MHA output may comprise the computing system: independently calculating column-wise MHA output for each group of column-wise input feature maps based on column query, key, and value vectors for each group of column-wise input feature maps, where the column query vector for each group of column-wise input feature maps may be based on the corresponding calculated column positional code data for said group of column- wise input feature maps, while the column key vector and the column value vector for each group of column-wise input feature maps may each be based on the corresponding calculated column-flattened positional code data for said group of column-wise flattened feature sequences; and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps to generate a first column-wise group-combined MHA output.
  • the first column-wise MHA output may comprise the first column-wise group-combined MHA output.
• the MHA via pooling may comprise the computing system dividing each row of the input feature map into one or more pools of row-wise input feature maps, such that the input feature map is divided into a plurality of pools of row-wise input feature maps that includes the one or more pools of row-wise input feature maps for each row, and averaging values of features in each pool to generate average-pooled values for each pool among the plurality of pools of row-wise input feature maps, thereby producing an average-pooled row-wise input feature map.
  • flattening the input feature map into the row-wise flattened feature sequence may comprise the computing system flattening the plurality of pools of row-wise input feature maps into a row-wise flattened average-pooled feature sequence.
• calculating the row positional code data may comprise the computing system calculating average-pooled row positional code data, by performing linear interpolation on the average-pooled row-wise input feature map to generate the first number of average-pooled row positional code data, the first number corresponding to the height of the output feature map.
  • calculating the row-flattened positional code data may comprise the computing system calculating average-pooled row-flattened positional code data based on the row-wise flattened average-pooled feature sequence.
  • calculating the first row-wise MHA output may comprise the computing system calculating, using the first row- wise MHA model, a first row-wise average-pooled MHA output based on the row-wise flattened average-pooled feature sequence and based on fifth row query, key, and value vectors, where the fifth row query vector may be based on the calculated average-pooled row positional code data, while the fifth row key vector and the fifth row value vector may each be based on the calculated average-pooled row-flattened positional code data.
  • the first row-wise MHA output may comprise the first row-wise average-pooled MHA output.
• the MHA via pooling may comprise the computing system dividing each column of the input feature map into one or more pools of column-wise input feature maps, such that the input feature map is divided into a plurality of pools of column-wise input feature maps that includes the one or more pools of column-wise input feature maps for each column, and averaging values of features in each pool to generate average-pooled values for each pool among the plurality of pools of column-wise input feature maps, thereby producing an average-pooled column-wise input feature map.
  • flattening the input feature map into the column-wise flattened feature sequence may comprise the computing system flattening the plurality of pools of column- wise input feature maps into a column-wise flattened average-pooled feature sequence.
• calculating the column positional code data may comprise the computing system calculating average-pooled column positional code data, by performing linear interpolation on the average-pooled column-wise input feature map to generate the second number of average-pooled column positional code data, the second number corresponding to the width of the output feature map.
  • calculating the column-flattened positional code data may comprise the computing system calculating average-pooled column-flattened positional code data based on the column-wise flattened average-pooled feature sequence.
  • calculating the first column-wise MHA output may comprise the computing system calculating, using the first column-wise MHA model, a first column-wise average- pooled MHA output based on the column-wise flattened average-pooled feature sequence and based on fifth column query, key, and value vectors, where the fifth column query vector may be based on the calculated average-pooled column positional code data, while the fifth column key vector and the fifth column value vector may each be based on the calculated average-pooled column-flattened positional code data.
  • the first column-wise MHA output may comprise the first column-wise average-pooled MHA output.
  • Figs.2A-2G are schematic block flow diagrams illustrating various non-limiting examples 200 of components of the DFlatFormer for implementing semantic segmentation, in accordance with various embodiments.
• a conventional dense transformer needs a full-size query sequence to embed a flattened sequence of low-resolution input to a high-resolution output. This is intractable, as the full-size sequence of H × W queries would result in demanding memory and computational overhead.
• full-size queries in a naive dense transformer are decomposed into H row and W column queries, and these decomposed row and column queries are used to embed rows and columns separately through multi-head attention ("MHA") and interactive attention modules or systems (e.g., MHA modules or systems 245a-245n and 255a-255n, as well as Row-Column Interactive Attention modules or systems 250a-250n and 260a-260n as shown in Fig.2A, or the like).
• An input sequence from an encoder (e.g., encoder 125 of Fig.1, or the like) will be flattened row-wise and column-wise to spatially align with the sequences of decomposed queries before it is fed into row and column transformers (e.g., row transformers or row transformer layers 235a-235n and column transformers or column transformer layers 240a-240n of Fig.2A, or the like) separately.
  • Fig.2A summarizes the overall pipeline.
  • the DFlatFormer 210 comprises a row transformer (e.g., row transformers or row transformer layers 235a-235n (collectively, “row transformer 235,” “row transformers 235,” or “row transformer layers 235,” or the like), or the like) and a column transformer (e.g., column transformers or column transformer layers 240a-240n (collectively, “column transformer 240,” “column transformers 240,” or “column transformer layers 240,” or the like), or the like).
  • the resultant row/column embeddings for each layer are further refined by interacting with their column/row counterparts through an interactive attention module (e.g., row-column interactive attention modules or systems 250a-250n and 260a-260n, or the like).
  • the row (column) transformer consists of L layers
• the resultant last-layer embeddings of rows (265) and columns (270) are finally expanded (into 275 and 280), respectively, and combined to output a full-resolution feature map S (285).
  • the full-resolution feature map S (285) may then be combined with a bilinearly upsampled input feature map (205b) to generate a dense feature map (not shown) that may be used to perform semantic segmentation (such as shown, e.g., in Figs.3A and 3B).
• a layer may include, without limitation, an MHA module and a row-column interactive attention module (e.g., MHA modules or systems 245a-245n and 255a-255n (collectively, "MHA 245 and/or 255," "MHA module(s) 245 and/or 255," or "MHA system(s) 245 and/or 255," or the like), and row-column interactive attention systems 250a-250n and 260a-260n (collectively, "row-column interactive attention 250 and/or 260," "row-column interactive attention module(s) 250 and/or 260," or "row-column interactive attention system(s) 250 and/or 260," or the like), respectively, or the like).
• the MHA module 245 takes as inputs the original row-flattened sequence R (215) as well as the row query sequence (or the last layer's output of row embeddings), and it produces an output after, in some cases, a multi-layer perceptron ("MLP") of a Feed-Forward Network ("FFN"), or the like. Then, the row-column interactive attention modules 250 and 260 follow, coupling the intermediate embeddings output from the corresponding row MHA 245 and column MHA 255. This module allows a row (column) representation to aggregate all column (row) embeddings in vertical (horizontal) contexts as they cross the row (column). It outputs the row (column) representation for layer l, which will be fed into the next layer for further modeling.
• output (205a) from the encoder or the feature extractor 205 is flattened row-wise and column-wise to align with the row and column queries separately. This preserves the row and column structures by putting entire rows and columns together in the flattened sequences, respectively. Meanwhile, row-wise and column-wise positional encodings will also be applied to the dual-flattened sequences. This will enable the row and column queries to apply the multi-head attentions to the respective flattened sequences. Referring to Figs.2B and 2C, row-wise and column-wise positional encodings are first derived. Typically, the target feature map has a much larger size than the encoder output.
• the resultant encoding aligns with the size of the input feature map and thus can be row-wise flattened and used as the positional encoding of the row-wise flattened sequence R.
  • This is a row-wise positional encoding since each row has the same code, thereby aligning with the row-wise flattening.
  • Similar column-wise positional encoding can be applied to the corresponding column-wise flattening, as shown in Fig.2C.
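As a rough illustration of this encoding step, the following sketch (function and variable names are illustrative, not from the disclosure) linearly interpolates a length-H row code to the encoder-output height h and repeats it across each row, so the result aligns with the row-wise flattened sequence:

```python
import numpy as np

def row_positional_encoding(code_H, h, w):
    """Interpolate a length-H row code to height h, then repeat across width w.

    Every position within a row receives the same code, aligning the
    encoding with the row-wise flattened sequence."""
    H, c = code_H.shape
    src = np.linspace(0.0, 1.0, H)   # sample positions of the H query codes
    dst = np.linspace(0.0, 1.0, h)   # sample positions of the h input rows
    code_h = np.stack(
        [np.interp(dst, src, code_H[:, k]) for k in range(c)], axis=1
    )                                 # (h, c) interpolated row codes
    return np.repeat(code_h, w, axis=0)  # (h*w, c): row-wise flattened
```

For example, with H = 4, h = 2, and w = 3, positions 0-2 of the flattened encoding share one code and positions 3-5 share another, matching the "same code per row" property described above.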
• a single-head attention is formulated as

$\mathrm{head} = \mathrm{softmax}\!\left(\frac{(Q_r W_q)(R W_k)^\top}{\sqrt{d_m}}\right) R W_v$,  (Eqn. 1)

where $W_q$, $W_k$, and $W_v$ are the linear projection matrices, $d_m$ is the embedding dimension for each head, and $\top$ denotes the transpose of a matrix; $Q_r$ is the learnable sequence of row queries, and $R$ is the row-flattened sequence carrying the positional encodings as defined above with respect to Fig.2B.
• the multi-head attention is obtained by putting together single-head outputs,

$\mathrm{MHA}(Q_r, R) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_M)\, W_O$,  (Eqn. 2)

where $W_O$ is a linear projection matrix.
• each layer further goes through a feed-forward network ("FFN") with a residual connection:

$X \leftarrow X + \mathrm{FFN}(X)$.  (Eqn. 3)
  • layer normalization is omitted for notational simplicity.
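A minimal numerical sketch of the single-head attention of Eqn. 1 follows (all names are illustrative; the projection matrices are assumed generic learnable parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(Q_r, R, W_q, W_k, W_v):
    """Single-head attention per Eqn. 1: the H row queries Q_r attend over
    the row-flattened sequence R of hw tokens."""
    d_m = W_q.shape[1]
    scores = (Q_r @ W_q) @ (R @ W_k).T / np.sqrt(d_m)  # (H, hw) attention logits
    return softmax(scores) @ (R @ W_v)                  # (H, d_m) attended output
```

Multi-head attention (Eqn. 2) would concatenate several such heads and apply one more linear projection; the FFN of Eqn. 3 adds a residual on top.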
• After the MHA module, the row-column interactive attention module is implemented. As illustrated in Figs.2D and 2E, this interactive attention aims to capture the relevant information when a row (column) crosses all columns (rows) spatially. This allows the learned row (column) representation to further integrate the vertical (horizontal) contexts along the crossed columns (rows) through an interactive attention mechanism.
• the row and column outputs from their interactive attentions can be obtained as follows:

$R^{(l)} = \mathrm{softmax}\!\left(\hat{R}^{(l)} \hat{C}^{(l)\top} / \sqrt{d_m}\right) \hat{C}^{(l)}$,  (Eqn. 4)

$C^{(l)} = \mathrm{softmax}\!\left(\hat{C}^{(l)} \hat{R}^{(l)\top} / \sqrt{d_m}\right) \hat{R}^{(l)}$,  (Eqn. 5)

where $\hat{R}^{(l)}$ and $\hat{C}^{(l)}$ denote the intermediate row and column outputs from the MHA modules at layer $l$.
• the row-column interactive attention module 250 (or 260) takes the intermediate row output (or column output) as the query to aggregate the relevant information from all crossing columns (or rows). For simplicity, linear projections as in multi-head attention are not used to map the intermediate outputs into query, key, and value sequences.
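The projection-free interactive attention described above can be sketched as follows (symbols and names are illustrative; the intermediate row and column embeddings simply attend to each other, per Eqns. 4-5):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interactive_attention(row_emb, col_emb):
    """Projection-free cross attention: each row embedding aggregates all
    column embeddings, and vice versa (no W_q/W_k/W_v projections)."""
    d = row_emb.shape[1]
    rows_out = softmax(row_emb @ col_emb.T / np.sqrt(d)) @ col_emb  # (H, d)
    cols_out = softmax(col_emb @ row_emb.T / np.sqrt(d)) @ row_emb  # (W, d)
    return rows_out, cols_out
```

Each refined row embedding is thus a convex combination of the column embeddings it crosses, and symmetrically for columns.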
  • the final dense feature map may then be generated. At the output end of the transformer layers 235 and 240, each pixel at (i, j) in the final feature map is represented as a direct combination of the outputs from the row and column transformers at the same location.
  • the resultant last-layer embeddings of rows (265) and columns (270) are finally expanded into column-expanded row-wise output feature map 275 and row-expanded column-wise output feature map 280, respectively, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map and by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map, respectively.
  • the column-expanded row-wise output feature map 275 and row-expanded column-wise output feature map 280 are then combined to output a full-resolution feature map S (285).
  • the input feature map S o (205a) and the output feature map S (285) may be combined to generate a dense feature map, and semantic segmentation may be performed based on the generated dense feature map.
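The expansion-and-combination step can be sketched as follows (names are illustrative; the combination is shown here as elementwise addition, one plausible reading of "direct combination"):

```python
import numpy as np

def expand_and_combine(row_emb, col_emb):
    """Repeat H row embeddings across width W and W column embeddings across
    height H, then add, so pixel (i, j) combines row i and column j."""
    H, c = row_emb.shape
    W, _ = col_emb.shape
    row_exp = np.repeat(row_emb[:, None, :], W, axis=1)  # (H, W, c) column-expanded
    col_exp = np.repeat(col_emb[None, :, :], H, axis=0)  # (H, W, c) row-expanded
    return row_exp + col_exp                              # (H, W, c) full-resolution S
```

Note that the full H × W map is produced from only H + W embeddings, which is the source of the decomposition's efficiency.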
• a grouping and/or pooling module(s) that reduces the number of row/column flattened tokens fed into each layer may be used. Referring to Figs.2F and 2G, efficient attention via grouping and/or pooling is shown. If all the pixels in the encoder output are taken to form keys/values, the computational complexity would be O((H + W) · hw), as there are in total H rows and W columns for queries and hw keys/values fed into the DFlatFormer 210.
  • multi-head attention via grouping and (average) pooling may be performed.
  • Grouping divides both queries and the input feature map into several groups (as shown in Figs.2F and 2G), where each query can only access features within the corresponding feature group.
  • the features in a row and a column are average-pooled to form shorter flattened sequences (as shown in Figs.2F and 2G), further reducing the model complexity.
• the row queries and the row-flattened sequence are equally divided into n_p groups.
  • Multi-head attention may be conducted within each group in parallel such that a row query only performs the multi-head attention on the corresponding group of the flattened sequence.
• the row-flattened sequence may be average-pooled through a non-overlapping window of size n_w over each row, resulting in a shorter sequence where each row query can access the whole one for performing the multi-head attention.
  • the resultant outputs from both grouping and pooling may be added to give the output.
  • Grouping and pooling complement each other. For the grouping, each query accesses a smaller part of the flattened sequence at its original granular level. In contrast, for the pooling, the query accesses the whole sequence but at a coarse level with pooled features.
  • the output can reach a good balance between the computational costs involved and the representation ability.
• Each query only accesses a group of hw/n_p features. There are only hw/n_w pooled features in total, and they are shared amongst all the queries.
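A back-of-envelope count illustrates the savings (the resolution and the group/window settings below are made-up examples, not values from the disclosure):

```python
# Illustrative sizes only:
h, w = 32, 32        # encoder output resolution -> hw = 1024 tokens per flattening
n_p, n_w = 4, 8      # number of groups; pooling window per row

full    = h * w            # naive: each query attends to all hw keys/values
grouped = (h * w) // n_p   # grouping: one group at the original granularity
pooled  = (h * w) // n_w   # pooling: the whole sequence, but coarsened
```

With these numbers, grouping cuts the per-query key/value count from 1024 to 256 at full granularity, while pooling offers whole-sequence access over only 128 coarsened tokens, matching the complementary trade-off described above.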
• DFlatFormer 210 can serve as a plug-in module to be connected to any CNN or transformer-based encoder (e.g., encoder 125 of Fig.1, or the like) and to generate high-resolution output.
  • Figs.3A and 3B illustrate the efficacy of DFlatFormer compared with conventional techniques.
• As shown in Table 2, for Cityscapes semantic segmentation, DFlatFormer with single-scale inference outperforms DeepLabv3+ by 2.0% and 1.8% with backbone ResNet-50 and ResNet-101, respectively.
  • Tables 3 and 4 provide performance comparisons of DFlatFormer with other architectures on ADE20K and Cityscapes val datasets, respectively.
• Table 3. Semantic segmentation performance on ADE20K val dataset, with transformer backbone. † denotes a model pretrained on ImageNet-22K. * denotes AlignedResize used in inference.
• As shown in Table 3, for ADE20K segmentation with Swin-T as the backbone, DFlatFormer has a 2.6% gain over UperNet for single-scale inference. With Swin-S as the backbone, DFlatFormer surpasses UperNet by 0.6%. On the other hand, the model size and GFLOPs of DFlatFormer are much smaller than the baselines. When comparing with SegFormer, with MiT-B2 as the backbone, DFlatFormer outperforms SegFormer by 0.9%. With MiT-B4 as the backbone, DFlatFormer outperforms SegFormer by 0.5%.
  • Figs.3A and 3B are diagrams illustrating various non- limiting examples 300 and 300' of visualization comparisons of DFlatFormer and of conventional semantic segmentation techniques using example datasets, in accordance with various embodiments.
  • segmentation results are presented in Fig.3A to compare DFlatFormer with each of DeepLabv3+ and SegFormer on Cityscapes dataset.
  • DFlatFormer provides better or more complete predictions for thin and/or small objects, including, but not limited to, poles and traffic lights, and/or the like. As shown with respect to the bottom image, DFlatFormer also provides more precise predictions on the boundaries of people and terrain, and/or the like.
  • Fig.3B more comparisons are presented between DFlatFormer and each of DeepLabv3+ (ResNet-101) and SegFormer on ADE20K dataset.
  • DFlatFormer can predict more accurate and complete boundaries for objects, including, but not limited to, curtains and lamps, and/or the like.
  • Figs.4A-4E are flow diagrams illustrating a method 400 for implementing DFlatFormer through decomposed row and column queries for semantic segmentation, in accordance with various embodiments.
• While the method 400 illustrated in Figs.4A-4E can be implemented by or with (and, in some cases, is described below with respect to) the systems, examples, or embodiments 100, 200, 300, and 300' of Figs.1, 2, 3A, and 3B, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation.
  • method 400 at block 405, may comprise receiving, using a computing system, an input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects.
  • the computing system may include, without limitation, at least one of a dual-flattening transformer ("DFlatFormer”), a machine learning system, an AI system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like.
  • method 400 may comprise flattening, using the computing system, the input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map.
  • method 400 may comprise flattening, using the computing system, the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map.
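These two flattening steps can be sketched as follows (function name illustrative): successive rows are concatenated for the row-wise sequence, and successive columns for the column-wise sequence.

```python
import numpy as np

def dual_flatten(feature_map):
    """Row-wise and column-wise flattening of an (h, w, c) feature map:
    successive rows (columns) are concatenated into one token sequence."""
    h, w, c = feature_map.shape
    row_seq = feature_map.reshape(h * w, c)                      # rows concatenated
    col_seq = feature_map.transpose(1, 0, 2).reshape(h * w, c)   # columns concatenated
    return row_seq, col_seq
```

For a 2 × 3 map, the row-wise order visits (0,0), (0,1), (0,2), (1,0), …, while the column-wise order visits (0,0), (1,0), (0,1), …, so each flattening keeps entire rows or columns contiguous.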
  • Method 400 may further comprise implementing, using the computing system, one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map (block 420).
  • Method 400 may continue onto the process at block 425 or may continue onto the process at block 455a in Fig.4B following the circular marker denoted, "A.” Method 400 may further comprise implementing, using the computing system, one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map (block 425). Method 400 may continue onto the process at block 430, may continue onto the process at block 455b in Fig.4C following the circular marker denoted, "B,” or may loop back to the process at block 420 to repeat the processes at blocks 420 and 425 for each set of row/column transformer layers (where the number of loop backs may be any suitable number between 1 and 20, in some cases, between 1 and 10, or the like).
  • Method 400 may further comprise, at block 430, generating, using the computing system, a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map.
  • Method 400 may further comprise, at block 435, generating, using the computing system, a row- expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map.
• the method 400 may further comprise calculating, using the computing system and using a second row-wise MHA model, a second row-wise MHA output based on third row query, key, and value vectors, wherein the third row query vector is based on a layer row embedding output of an immediately preceding layer among the plurality of row transformer layers, wherein the third row key vector and the third row value vector are each based on the calculated row-flattened positional code data; and calculating, using the computing system and using a second row-wise row-column interactive attention model, a layer row embedding output corresponding to said row transformer layer based on fourth row query, key, and value vectors, wherein the fourth row query vector is based on a second column-wise MHA output from a second column-wise MHA model for a corresponding column transformer layer, and wherein the fourth row key vector and the second row value vector are each based on the second row-wise MHA output.
  • the method 400 may further comprise calculating, using the computing system and using the second column-wise MHA model, the second column-wise MHA output based on third column query, key, and value vectors, wherein the third column query vector is based on a layer column embedding output of an immediately preceding layer among the plurality of column transformer layers, wherein the third column key vector and the third column value vector are each based on the calculated column-flattened positional code data; and calculating, using the computing system and using a second column-wise row-column interactive attention model, a layer column embedding output corresponding to said column transformer layer based on fourth column query, key, and value vectors, wherein the fourth column query vector is based on the second row-wise MHA output from the second row-wise MHA model for a corresponding row transformer layer, and wherein the fourth column key vector and the second column value vector are each based on the second column-wise MHA output
  • flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map may be implemented concurrent with implementation of flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map.
  • method 400 may comprise generating, using the computing system, the output feature map, by combining the column- expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.
  • method 400 may further comprise bilinearly upsampling, using the computing system, the input feature map; and combining, using the computing system, the bilinearly upsampled input feature map and the output feature map to generate a dense feature map (block 445); and performing, using the computing system, semantic segmentation based on the generated dense feature map (block 450).
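A hedged sketch of the upsample-and-combine step at block 445 follows (bilinear interpolation is hand-rolled here for self-containment; the combination is shown as addition, which is one plausible reading of "combining"; names are illustrative):

```python
import numpy as np

def bilinear_upsample(x, H, W):
    """Bilinearly upsample an (h, w, c) map to (H, W, c); assumes h, w >= 2."""
    h, w, c = x.shape
    ys = np.linspace(0, h - 1, H)
    xs = np.linspace(0, w - 1, W)
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = x[y0][:, x0] * (1 - wx) + x[y0][:, x0 + 1] * wx
    bot = x[y0 + 1][:, x0] * (1 - wx) + x[y0 + 1][:, x0 + 1] * wx
    return top * (1 - wy) + bot * wy

def dense_feature_map(S0, S):
    """Combine the upsampled low-resolution input map S0 with the
    full-resolution output map S to form the dense feature map."""
    H, W, _ = S.shape
    return bilinear_upsample(S0, H, W) + S
```

The dense map then feeds a per-pixel classifier for semantic segmentation, as described at block 450.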
  • method 400 may comprise calculating, using the computing system, row positional code data, by performing linear interpolation on the input feature map to generate a first number of row positional code data, the first number corresponding to the height of the output feature map.
  • Method 400 may further comprise calculating, using the computing system, row-flattened positional code data based on the row-wise flattened feature sequence (block 460a); and calculating, using the computing system and using a first row-wise multi-head attention ("MHA") model, a first row-wise MHA output based on the row-wise flattened feature sequence and based on first row query, key, and value vectors, wherein the first row query vector is based on the calculated row positional code data, wherein the first row key vector and the first row value vector are each based on the calculated row-flattened positional code data (block 465a).
  • Method 400 may further comprise, at block 470a, calculating, using the computing system and using a first row-wise row-column interactive attention model, a first layer row embedding output based on second row query, key, and value vectors, wherein the second row query vector is based on the first column-wise MHA output, and wherein the second row key vector and the second row value vector are each based on the first row-wise MHA output.
• Method 400 may return to the process at block 430 in Fig.4A following the circular marker denoted, "C." At block 455b in Fig.4C (following the circular marker denoted, "B"), method 400 may comprise calculating, using the computing system, column positional code data, by performing linear interpolation on the input feature map to generate a second number of column positional code data, the second number corresponding to the width of the output feature map.
  • Method 400 may further comprise calculating, using the computing system, column-flattened positional code data based on the column-wise flattened feature sequence (block 460b); and calculating, using the computing system and using a first column-wise MHA model, a first column-wise MHA output based on the column-wise flattened feature sequence and based on first column query, key, and value vectors, wherein the first column query vector is based on the calculated column positional code data, wherein the first column key vector and the first column value vector are each based on the calculated column- flattened positional code data (block 465b).
  • Method 400 may further comprise, at block 470b, calculating, using the computing system and using a first column-wise row-column interactive attention model, a first layer column embedding output based on second column query, key, and value vectors, wherein the second column query vector is based on the first row-wise MHA output, and wherein the second column key vector and the second column value vector are each based on the first column-wise MHA output.
  • Method 400 may return to the process at block 430 in Fig.4A following the circular marker denoted, "C.”
  • the one or more row transformer layers may comprise performing one of: MHA via grouping (block 475); MHA via pooling (block 480); or a combination of the MHA via grouping (block 475) and the MHA via pooling (block 480).
  • performing MHA via grouping may comprise: dividing, using the computing system, the input feature map into a plurality of groups of row-wise input feature maps (block 475a); and calculating, using the computing system and using the first row-wise MHA model, a first row-wise group-combined MHA output, by independently calculating row-wise MHA output for each group of row-wise input feature maps and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps, wherein the first row-wise MHA output comprises the first row-wise group-combined MHA output (block 475b).
  • performing MHA via pooling may comprise: average-pooling, using the computing system, rows of the input feature map to generate an average-pooled row-wise input feature map and to generate a row-wise flattened average-pooled feature sequence (block 480a); and calculating, using the computing system and using the first row-wise MHA model, the first row-wise average-pooled MHA output based on the average-pooled row-wise input feature map and the row-wise flattened average-pooled feature sequence, wherein the first row-wise MHA output comprises the first row-wise average-pooled MHA output (block 480b).
  • method 400 may further comprise combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer (optional block 485a).
• performing MHA via grouping may comprise: dividing, using the computing system, the input feature map into a plurality of groups of row-wise input feature maps (similar to block 475a); flattening, using the computing system, the row-wise input feature map for each group of row-wise input feature maps into a row-wise flattened feature sub-sequence among a plurality of groups of row-wise flattened feature sub-sequences; calculating, using the computing system, row positional code data for each group of row-wise input feature maps, by performing linear interpolation on each group of row-wise input feature maps to generate a third number of row positional code data for each group of row-wise input feature maps, the third number corresponding to the height of the output feature map divided by the number of groups of row-wise input feature maps; calculating, using the computing system, row-flattened positional code data for each group of row-wise flattened feature sequences; independently calculating row-wise MHA output for each group of row-wise input feature maps; and combining, using the computing system, the calculated row-wise MHA outputs for each group of row-wise input feature maps to generate a first row-wise group-combined MHA output, wherein the first row-wise MHA output comprises the first row-wise group-combined MHA output.
  • performing MHA via pooling may comprise: dividing, using the computing system, each row of the input feature map into one or more pools of row-wise input feature maps, such that the input feature map is divided into a plurality of pools of row-wise input feature maps that includes the one or more pools of row-wise input feature maps for each row, and averaging, using the computing system, values of features in each pool to generate average-pooled values for each pool among the plurality of pools of row-wise input feature maps, thereby producing an average-pooled row-wise input feature map; flattening, using the computing system, the plurality of pools of row-wise input feature maps into a row-wise flattened average-pooled feature sequence; calculating, using the computing system, average-pooled row positional code data, by performing linear interpolation on the average-pooled row-wise input feature map to generate the first number of average-pooled row positional code data, the first number corresponding to the height of the output
  • method 400 may further comprise combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer (similar to optional block 485a).
  • one or more column transformer layers may comprise performing one of: MHA via grouping (block 490); MHA via pooling (block 495); or a combination of the MHA via grouping (block 490) and the MHA via pooling (block 495).
  • performing MHA via grouping may comprise: dividing, using the computing system, the input feature map into a plurality of groups of column-wise input feature maps (block 490a); and calculating, using the computing system and using the first column-wise MHA model, a first column-wise group-combined MHA output, by independently calculating column-wise MHA output for each group of column-wise input feature maps and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps, wherein the first column-wise MHA output comprises the first column-wise group-combined MHA output (block 490b).
  • performing MHA via pooling may comprise: average-pooling, using the computing system, columns of the input feature map to generate an average-pooled column-wise input feature map and to generate a column-wise flattened average-pooled feature sequence (block 495a); and calculating, using the computing system and using the first column-wise MHA model, the first column-wise average-pooled MHA output based on the average-pooled column-wise input feature map and the column-wise flattened average-pooled feature sequence, wherein the first column-wise MHA output comprises the first column-wise average-pooled MHA output (block 495b).
  • method 400 may further comprise combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer (optional block 485b).
  • performing MHA via grouping may comprise: dividing, using the computing system, the input feature map into a plurality of groups of column-wise input feature maps (similar to block 490a); flattening, using the computing system, the column-wise input feature map for each group of column-wise input feature maps into a column-wise flattened feature sub-sequence among a plurality of groups of column-wise flattened feature sub-sequences; calculating, using the computing system, column positional code data for each group of column-wise input feature maps, by performing linear interpolation on each group of column-wise input feature maps to generate a third number of column positional code data for each group of column-wise input feature maps, the third number corresponding to the width of the output feature map divided by the number of groups of column-wise input feature maps; calculating, using the computing system, column-flattened positional code data for each group of column-wise flattened feature sequences; independently calculating column-
  • performing MHA via pooling may comprise: dividing, using the computing system, each column of the input feature map into one or more pools of column-wise input feature maps, such that the input feature map is divided into a plurality of pools of column-wise input feature maps that includes the one or more pools of column-wise input feature maps for each column, and averaging, using the computing system, values of features in each pool to generate average-pooled values for each pool among the plurality of pools of column-wise input feature maps, thereby producing an average-pooled column-wise input feature map; flattening, using the computing system, the plurality of pools of column-wise input feature maps into a column-wise flattened average-pooled feature sequence; calculating, using the computing system, average-pooled column positional code data, by performing linear interpolation on the average-pooled column-wise input feature map to generate the first number of average-pooled column positional code data, the first number corresponding to the width of the
  • method 400 may further comprise combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer (similar to optional block 485b).
  • Fig.5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments.
  • Fig.5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing system 105, dual-flattening transformers ("DFlatFormers") 110 and 210, artificial intelligence ("AI") system 115, semantic segmentation system 120, encoder 125, content source(s) 130, content distribution system 140, and user devices 155a-155n, etc.), as described above.
  • Fig.5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate.
  • Fig.5 therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.
  • the computer or hardware system 500 – which might represent an embodiment of the computer or hardware system (i.e., computing system 105, DFlatFormers 110 and 210, AI system 115, semantic segmentation system 120, encoder 125, content source(s) 130, content distribution system 140, and user devices 155a-155n, etc.), described above with respect to Figs.1-4 – is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate).
  • the hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like.
  • the computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, a solid-state storage device such as a random access memory (“RAM”) and/or a read-only memory (“ROM”), which can be programmable, flash-updateable, and/or the like.
  • Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.
  • the computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like.
  • the communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein.
  • the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above.
  • the computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein.
  • one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
  • a set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500.
  • the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon.
  • These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.
  • some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535.
  • Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525.
  • execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.
  • The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion.
  • various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals).
  • a computer readable medium is a non-transitory, physical, and/or tangible storage medium.
  • a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like.
  • Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525.
  • Volatile media includes, without limitation, dynamic memory, such as the working memory 535.
  • a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communications subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices).
  • transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).
  • Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.
  • Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution.
  • the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer.
  • a remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500.
  • These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.
  • the communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc., carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions.
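The MHA-via-grouping (blocks 475/490) and MHA-via-pooling (blocks 480/495) operations itemized earlier in this section can be sketched in simplified form. The following is a hypothetical single-head NumPy illustration of the row-wise case only; the function names, contiguous group splitting, and width-wise average pooling are illustrative assumptions, not the claimed multi-head implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: (n_q, d) queries against (n_k, d) keys.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def grouped_row_attention(feat, queries, n_groups):
    # feat: (h, w, d) input feature map; queries: (H, d) decomposed row queries.
    # Rows and queries are split into contiguous groups; attention is computed
    # independently per group and the per-group outputs are concatenated,
    # mirroring the "group-combined MHA output" of block 475b.
    h, w, d = feat.shape
    row_groups = np.split(feat, n_groups, axis=0)       # each (h/g, w, d)
    query_groups = np.split(queries, n_groups, axis=0)  # each (H/g, d)
    outs = []
    for rows, q in zip(row_groups, query_groups):
        kv = rows.reshape(-1, d)                        # row-wise flatten group
        outs.append(attention(q, kv, kv))
    return np.concatenate(outs, axis=0)                 # (H, d)

def pooled_row_attention(feat, queries, pool):
    # Average-pool each row along the width before attending, shrinking the
    # key/value sequence from h*w to h*(w/pool), as in block 480a/480b.
    h, w, d = feat.shape
    pooled = feat.reshape(h, w // pool, pool, d).mean(axis=2)
    kv = pooled.reshape(-1, d)
    return attention(queries, kv, kv)                   # (H, d)
```

The two outputs may then be combined (optional block 485a), e.g., by summation, to form the row-wise MHA output for the layer.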

Abstract

Novel tools and techniques are provided for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation. In various embodiments, a computing system may flatten the input feature map into a row-wise (and a column-wise) flattened feature sequence, by concatenating each successive row (and column) of the input feature map to a first row (and a first column, respectively) thereof; may implement one or more row (and column) transformer layers based on the row-wise (and column-wise) flattened feature sequence, to output a row-wise (and a column-wise) output feature map; may generate a column-expanded row-wise (and a row-expanded column-wise) output feature map; and may generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.

Description

DUAL-FLATTENING TRANSFORMER THROUGH DECOMPOSED ROW AND COLUMN QUERIES FOR SEMANTIC SEGMENTATION CROSS-REFERENCES TO RELATED APPLICATIONS [0001] This application claims priority to U.S. Patent Application Ser. No.63/277,656 (the " '656 Application"), filed November 10, 2021, by Ying Wang et al. (attorney docket no. INNOPEAK-1121-161-P), entitled, "Dual-Flattened Transformer Through Decomposed Row and Column Queries for Semantic Segmentation," the disclosure of which is incorporated herein by reference in its entirety for all purposes. COPYRIGHT STATEMENT [0002] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. FIELD [0003] The present disclosure relates, in general, to methods, systems, and apparatuses for implementing computer vision technologies, and, more particularly, to methods, systems, and apparatuses for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation. BACKGROUND [0004] Obtaining high-resolution features is important in many computer vision tasks, especially for dense prediction problems such as semantic segmentation, object detection, and pose estimation, or the like. Typical approaches employ convolutional encoder-decoder architectures where an encoder outputs low-resolution features and a decoder upsamples features with simple filters such as bilinear interpolation. Bilinear upsampling has limited capacity in obtaining high-resolution features, as it only conducts linear interpolation between neighboring pixels without considering nonlinear dependencies in global contexts. 
Various approaches have been proposed to improve high-resolution feature quality, such as PointRend and DUpsample. PointRend carefully selects uncertain points in the downsampled feature space and refines them by incorporating low-level features. DUpsample adopts a data-dependent upsampling strategy to recover segmentation from the coarse prediction. However, these approaches lack the ability to capture long-range dependency for fine-grained details. Diverse non-local or self-attention based schemes have been proposed to enhance the output features, but mostly in the downsampled feature space. They still rely on a bilinear upsampling procedure to obtain high-resolution features, which tends to lose global information. [0005] Recently, transformers have drawn tremendous interest, due to their great success in capturing long-range dependency. As the first stand-alone transformer for computer vision, Vision Transformer ("ViT") shows impressive results on image classification with patch-based self-attention. However, there is still much room to improve when applied to dense prediction tasks, due to its low-resolution feature map caused by non-overlapping patches and intractable computing costs. The complexity of a naive dense transformer scales rapidly with respect to the high-resolution output, limiting its application to dense prediction problems. [0006] Multi-scale ViTs have been presented to achieve hierarchical features with different resolutions and have boosted the performance of many dense prediction tasks. However, upper-level features with low spatial resolution still rely on bilinear upsampling to recover the full-resolution features. Naive bilinear upsampling is inherently weak, since it is intrinsically linear and local, recovering fine-grained details by interpolating only from local neighbors. Several efficient attention designs can reduce the model complexity, such as Axial-attention, Criss-Cross attention, and LongFormer.
However, they mainly focus on feature enhancement in the downsampled space without recovering the high-resolution features or recovering the fine-grained details by modeling the nonlinear dependency on a more global scale from non-local neighbors. [0007] For dense prediction tasks such as semantic segmentation, it is critical to obtain high-resolution features with long-range dependency. To generate high-resolution output of size H × W from a low-resolution feature map of size h × w (hw ≪ HW), a naive dense transformer incurs an intractable complexity of O(HW × hw),
limiting its application for high-resolution dense prediction. [0008] Hence, there is a need for more robust and scalable solutions for implementing computer vision technologies. SUMMARY [0009] The techniques of this disclosure generally relate to tools and techniques for implementing computer vision technologies, and, more particularly, to methods, systems, and apparatuses for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation. [0010] In an aspect, a method may comprise: flattening, using a computing system, an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flattening, using the computing system, the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implementing, using the computing system, one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implementing, using the computing system, one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; generating, using the computing system, a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; generating, using the computing system, a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and generating, using the computing system, the output feature map, by combining the column-expanded row-wise
output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map. [0011] In another aspect, a dual-flattening transformer system may be provided for implementing semantic segmentation. The system may comprise a computing system, which may comprise at least one first processor and a first non-transitory computer readable medium communicatively coupled to the at least one first processor. The first non-transitory computer readable medium may have stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: flatten an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flatten the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implement one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implement one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; generate a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; generate a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise 
output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map. [0012] Various modifications and additions can be made to the embodiments discussed without departing from the scope of the invention. For example, while the embodiments described above refer to particular features, the scope of this invention also includes embodiments having different combinations of features and embodiments that do not include all of the above-described features. [0013] The details of one or more aspects of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques described in this disclosure will be apparent from the description and drawings, and from the claims. BRIEF DESCRIPTION OF THE DRAWINGS [0014] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components. [0015] Fig.1 is a schematic diagram illustrating a system for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation, in accordance with various embodiments. [0016] Figs.2A-2G are schematic block flow diagrams illustrating various non-limiting examples of components of the DFlatFormer for implementing semantic segmentation, in accordance with various embodiments. 
[0017] Figs.3A and 3B are diagrams illustrating various non-limiting examples of visualization comparisons of DFlatFormer and of conventional semantic segmentation techniques using example datasets, in accordance with various embodiments. [0018] Figs.4A-4E are flow diagrams illustrating a method for implementing DFlatFormer through decomposed row and column queries for semantic segmentation, in accordance with various embodiments. [0019] Fig.5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments. DETAILED DESCRIPTION [0020] Overview [0021] Various embodiments provide tools and techniques for implementing computer vision technologies, and, more particularly, to methods, systems, and apparatuses for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation. [0022] In various embodiments, a computing system may flatten the input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map; and may flatten the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map. The computing system may implement one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; and may implement one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map. 
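The row-wise and column-wise flattening described in paragraph [0022] above can be illustrated with a small NumPy example; the array shapes and variable names here are hypothetical, chosen only for illustration:

```python
import numpy as np

# Toy input feature map: h = 2 rows, w = 3 columns, d = 4 channels.
feat = np.arange(24, dtype=float).reshape(2, 3, 4)

# Row-wise flattening: concatenate each successive row to the first row,
# yielding an (h*w, d) sequence ordered row 0, row 1, ...
row_seq = feat.reshape(-1, 4)

# Column-wise flattening: concatenate each successive column to the first
# column, yielding an (h*w, d) sequence ordered column 0, column 1, ...
col_seq = feat.transpose(1, 0, 2).reshape(-1, 4)

assert row_seq.shape == (6, 4) and col_seq.shape == (6, 4)
```

The two sequences contain the same hw feature vectors but in different scan orders, which is what lets the row and column transformer layers attend along complementary axes.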
The computing system may generate a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; may generate a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and may generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map. [0023] In some embodiments, the computing system may comprise at least one of a dual-flattening transformer ("DFlatFormer"), a machine learning system, an AI system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. [0024] According to some embodiments, flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map may be implemented concurrently with flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map. In some embodiments, the method may further comprise bilinearly upsampling, using the computing system, the input feature map; and combining, using the computing system, the bilinearly upsampled input feature map and the output feature map to generate a dense feature map. In some instances, the method may further comprise performing, using the computing system, semantic segmentation based on the generated dense feature map. 
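The column-expansion, row-expansion, and combination steps described above can be sketched as follows. This is a minimal NumPy sketch that assumes the row-wise output is a sequence of H d-dimensional row features and the column-wise output is a sequence of W column features; combining by elementwise addition is an assumption here, as the disclosure leaves the combination operator general:

```python
import numpy as np

def combine_outputs(row_out, col_out):
    # row_out: (H, d) row-wise output; col_out: (W, d) column-wise output.
    # Repeat the row-wise output W times along a new width axis
    # (column-expansion) and the column-wise output H times along a new
    # height axis (row-expansion), then combine on the full (H, W, d) grid.
    H, d = row_out.shape
    W, _ = col_out.shape
    col_expanded_rows = np.repeat(row_out[:, None, :], W, axis=1)  # (H, W, d)
    row_expanded_cols = np.repeat(col_out[None, :, :], H, axis=0)  # (H, W, d)
    return col_expanded_rows + row_expanded_cols                   # combined map
```

Because H × W exceeds the input resolution h × w, the combined map is the higher-resolution output feature map; it may then be fused with a bilinearly upsampled copy of the input to form the dense feature map.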
[0025] In the various aspects described herein, a dual-flattening transformer ("DFlatFormer") is provided, e.g., for performing semantic segmentation or other dense prediction operations (e.g., object detection, pose estimation, etc.). Semantic segmentation may be for such high-resolution dense prediction operations as medical imaging, autonomous driving, augmented or virtual reality (AR/VR), land mapping, video conferencing, and/or the like. DFlatFormer allows a transformer architecture that is not only efficient in recovering full-resolution features, but also able to recover fine-grained details by exploring full contexts nonlinearly and globally. The dual-flattening transformer architecture is also able to obtain a high-resolution feature map, with a complexity of O((H + W) × hw), where h × w and H × W are the input and output feature map sizes, respectively. The proposed architecture can also serve as a flexible plug-in module into any CNN- or transformer-based encoder to obtain high-resolution dense predictions. For the dual-flattening transformer, input sequences are flattened row-wise and column-wise before they are fed into multi-head attentions along with decomposed queries. Interactive attentions may also be implemented to share vertical and horizontal contexts between rows and columns. Efficient attentions via grouping and pooling in the feature space may, in some cases, be implemented, which further reduce the complexity in proportion to the fractions of grouped and pooled
features, respectively. Experimental results demonstrate the superior performance of DFlatFormer as a universal plug-in into various CNN and transformer backbones for semantic segmentation on multiple datasets. [0026] These and other aspects of the system and method for DFlatFormer through decomposed row and column queries for semantic segmentation are described in greater detail with respect to the figures. [0027] The following detailed description illustrates a few embodiments in further detail to enable one of skill in the art to practice such embodiments. The described examples are provided for illustrative purposes and are not intended to limit the scope of the invention. [0028] In the following description, for the purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments of the present invention may be practiced without some of these details. In other instances, some structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features. [0029] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth used should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. 
Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise. [0030] Various embodiments as described herein – while embodying (in some cases) software products, computer-performed methods, and/or computer systems – represent tangible, concrete improvements to existing technological areas, including, without limitation, computer vision technology, dense prediction technology, semantic segmentation technology, object detection technology, pose estimation technology, and/or the like. In other aspects, some embodiments can improve the functioning of user equipment or systems themselves (e.g., computer vision systems, dense prediction systems, semantic segmentation systems, object detection systems, pose estimation systems, etc.), for example, by flattening, using a computing system, an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flattening, using the computing system, the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implementing, using the computing system, one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implementing, using the computing system, one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; generating, using the computing system, 
a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; generating, using the computing system, a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and generating, using the computing system, the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map; and/or the like. [0031] In particular, to the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as, implementing a dual-flattening transformer, for which input sequences are flattened row-wise and column-wise before they are fed into multi-head attentions along with decomposed queries, in some cases, with interactive attentions also implemented to share vertical and horizontal contexts between rows and columns, and in still other cases, with efficient attentions via grouping and pooling in the feature space being further implemented, and/or the like, to name a few examples, that extend beyond mere conventional computer processing operations. These functionalities can produce tangible results outside of the implementing computer system, including, merely by way of example, an optimized computer vision architecture (i.e., DFlatFormer) that allows a transformer architecture that is not only efficient to recover full-resolution features, but also able to recover fine-grained details by exploring full contexts nonlinearly and globally; that is also able to obtain a high-resolution feature map, with a complexity of
O(hw(H + W)),
where h × w and H × W are the input and output feature map sizes, respectively; that is further able to reduce that complexity in proportion to the fractions of grouped
and pooled features, respectively, in the case that grouping and/or pooling is used; that can also serve as a flexible plug-in module into any CNN or transformer-based encoders to obtain high-resolution dense predictions; at least some of which may be observed or measured by users, game/content developers, and/or user device manufacturers. [0032] Some Embodiments [0033] We now turn to the embodiments as illustrated by the drawings. Figs.1-5 illustrate some of the features of the method, system, and apparatus for implementing computer vision technologies, and, more particularly, to methods, systems, and apparatuses for implementing dual-flattening transformer ("DFlatFormer") through decomposed row and column queries for semantic segmentation, as referred to above. The methods, systems, and apparatuses illustrated by Figs.1-5 refer to examples of different embodiments that include various components and steps, which can be considered alternatives or which can be used in conjunction with one another in the various embodiments. The description of the illustrated methods, systems, and apparatuses shown in Figs.1-5 is provided for purposes of illustration and should not be considered to limit the scope of the different embodiments. [0034] With reference to the figures, Fig.1 is a schematic diagram illustrating a system 100 for implementing dual-flattening transformer through decomposed row and column queries for semantic segmentation, in accordance with various embodiments. [0035] In the non-limiting embodiment of Fig.1, system 100 may comprise computing system 105, which may include, but is not limited to, dual-flattening transformer ("DFlatFormer") 110 and an artificial intelligence ("AI") system 115, or the like. The computing system 105, the DFlatFormer 110, and/or the AI system 115 may be part of a semantic segmentation system 120, or may be separate from, yet communicatively coupled with, the semantic segmentation system 120. 
In some cases, an encoder 125 – which may include, without limitation, one of a convolutional neural network ("CNN") -based encoder or a transformer-based encoder, or the like – may also be part of semantic segmentation system 120, or may be separate from, yet communicatively coupled with, the semantic segmentation system 120. In some instances, the computing system 105, the DFlatFormer 110, and/or the AI system 115 may be embodied as an integrated system. Alternatively, the computing system 105, the DFlatFormer 110, and/or the AI system 115 may be embodied as separate, yet communicatively coupled, systems. In some embodiments, computing system 105 may include, without limitation, at least one of DFlatFormer 110, a machine learning system, AI system 115, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. In some instances, the DFlatFormer 110 and/or the AI system 115 may include a neural network including, but not limited to, at least one of a multi-layer perceptron ("MLP") neural network, a transformer deep learning model-based network, a feed-forward artificial neural network ("ANN"), a recurrent neural network ("RNN"), a convolutional neural network ("CNN"), or a fully convolutional network ("FCN"), and/or the like. [0036] System 100 may further comprise one or more content sources 130 (and corresponding database(s) 135) and content distribution system 140 (and corresponding database(s) 145) that communicatively couple with at least one of computing system 105, DFlatFormer 110, AI system 115, and/or semantic segmentation system 120, via network(s) 150. 
System 100 may further comprise one or more user devices 155a-155n (collectively, "user devices 155" or the like) that communicatively couple with at least one of computing system 105, DFlatFormer 110, AI system 115, and/or semantic segmentation system 120, either directly via wired (not shown) or wireless communications links (denoted by lightning bolt symbols in Fig.1), or indirectly via network(s) 150 and via wired (not shown) and/or wireless communications links (denoted by lightning bolt symbols in Fig.1). According to some embodiments, the user devices 155 may each include, but are not limited to, a portable gaming device, a smart phone, a tablet computer, a laptop computer, a desktop computer, a server computer, a digital photo album platform-compliant device, a web-based digital photo album platform-compliant device, a software application ("app") -based digital photo album platform-compliant device, a video sharing platform-compliant device, a web-based video sharing platform-compliant device, an app-based video sharing platform-compliant device, a law enforcement computing system, a security system computing system, a surveillance system computing system, a military computing system, and/or the like. [0037] In operation, at least one of computing system 105, DFlatFormer 110, AI system 115, and/or semantic segmentation system 120 (collectively, "computing system") may receive image data (e.g., image data 160, or the like); and may extract, using a feature extractor (not shown; in some cases, as part of encoder 125, or the like), features from the received image data, and may generate an input feature map, the input feature map including, but not limited to, an image containing features extracted from an input image containing one or more objects, or the like. 
[0038] The computing system may flatten the input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map; and may flatten the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map. The computing system may implement one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; and may implement one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map. The computing system may generate a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; may generate a row-expanded column- wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and may generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map. [0039] In some embodiments, flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map may be implemented concurrent with implementation of flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map. 
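The flatten–transform–expand–combine pipeline described in the two preceding paragraphs can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function names are hypothetical, and combining the two expanded maps by summation is one assumed choice of combination, not the claimed implementation.

```python
import numpy as np

def row_flatten(fmap):
    # Concatenate each successive row to the first row: (h, w, d) -> (h*w, d).
    h, w, d = fmap.shape
    return fmap.reshape(h * w, d)

def col_flatten(fmap):
    # Concatenate each successive column to the first column: (h, w, d) -> (w*h, d).
    h, w, d = fmap.shape
    return fmap.transpose(1, 0, 2).reshape(w * h, d)

def expand_and_combine(row_out, col_out):
    # row_out: (H, d), one embedding per output row; col_out: (W, d), one per column.
    H, d = row_out.shape
    W, _ = col_out.shape
    # Repeat the row-wise output W times along the width, and the
    # column-wise output H times along the height, then combine.
    row_expanded = np.repeat(row_out[:, None, :], W, axis=1)  # (H, W, d)
    col_expanded = np.repeat(col_out[None, :, :], H, axis=0)  # (H, W, d)
    return row_expanded + col_expanded                        # (H, W, d) output map
```

For example, a 2 × 3 input map flattens to sequences of length 6, and expanding H = 4 row embeddings against W = 6 column embeddings yields a 4 × 6 output map, a resolution greater than the 2 × 3 input.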
In some embodiments, the computing system may bilinearly upsample the input feature map, and may combine the bilinearly upsampled input feature map and the output feature map to generate a dense feature map. In some instances, the computing system may perform semantic segmentation based on the generated dense feature map, and, in some cases, may send the semantic segmentation results (e.g., semantic segmentation 165, or the like) to at least one of one or more content sources (e.g., content source(s) 130, or the like), a content distribution system (e.g., content distribution system 140, or the like), or one or more user devices (e.g., user devices 155, or the like), and/or the like. [0040] According to some embodiments, for a first row transformer layer among the one or more row transformer layers, the computing system: may calculate row positional code data, by performing linear interpolation on the input feature map to generate a first number of row positional code data, the first number corresponding to the height of the output feature map; may calculate row-flattened positional code data based on the row-wise flattened feature sequence; and may calculate, using a first row-wise multi-head attention ("MHA") model, a first row-wise MHA output based on the row-wise flattened feature sequence and based on first row query, key, and value vectors. In some cases, the first row query vector may be based on the calculated row positional code data, while the first row key vector and the first row value vector may each be based on the calculated row-flattened positional code data. 
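The row-wise attention of paragraph [0040] can be illustrated with a single-head sketch. Several simplifications are assumed for brevity: learned query/key/value projection matrices are omitted, a single attention head stands in for the MHA, and the row positional codes are linearly interpolated from the input rows averaged over the width. The function names are hypothetical.

```python
import numpy as np

def interp_row_codes(fmap, H):
    # Linearly interpolate the h input rows (summarized here by their
    # mean over the width) up to H row positional codes, one per output row.
    h, w, d = fmap.shape
    rows = fmap.mean(axis=1)                   # (h, d)
    src = np.linspace(0.0, h - 1.0, H)         # fractional source positions
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, h - 1)
    t = (src - lo)[:, None]
    return (1.0 - t) * rows[lo] + t * rows[hi]  # (H, d)

def row_wise_attention(fmap, H):
    # Queries: the H interpolated row codes; keys/values: the row-flattened
    # h*w sequence. The attention map is H x (h*w), not (H*W) x (h*w).
    h, w, d = fmap.shape
    Q = interp_row_codes(fmap, H)              # (H, d)
    KV = fmap.reshape(h * w, d)                # (h*w, d)
    logits = Q @ KV.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ KV                        # (H, d): one embedding per output row
```

The column branch is symmetric, with W interpolated column codes as queries over the column-flattened sequence.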
[0041] Similarly, for a first column transformer layer among the one or more column transformer layers, the computing system: may calculate column positional code data, by performing linear interpolation on the input feature map to generate a second number of column positional code data, the second number corresponding to the width of the output feature map; may calculate column-flattened positional code data based on the column-wise flattened feature sequence; and may calculate, using a first column-wise MHA model, a first column-wise MHA output based on the column-wise flattened feature sequence and based on first column query, key, and value vectors. In some instances, the first column query vector may be based on the calculated column positional code data, while the first column key vector and the first column value vector may each be based on the calculated column- flattened positional code data. [0042] In some embodiments, further for the first row transformer layer, the computing system may calculate, using a first row-wise row-column interactive attention model, a first layer row embedding output based on second row query, key, and value vectors. In such cases, the second row query vector may be based on the first column-wise MHA output, while the second row key vector and the second row value vector may each be based on the first row-wise MHA output. [0043] Similarly, further for the first column transformer layer, the computing system may calculate, using a first column-wise row-column interactive attention model, a first layer column embedding output based on second column query, key, and value vectors. In such cases, the second column query vector may be based on the first row-wise MHA output, and wherein the second column key vector and the second column value vector may each be based on the first column-wise MHA output. [0044] According to some embodiments, the one or more row transformer layers may include a plurality of row transformer layers. 
For each row transformer layer among the plurality of row transformer layers that follows the first row transformer layer in sequence, the computing system: may calculate, using a second row-wise MHA model, a second row-wise MHA output based on third row query, key, and value vectors, where the third row query vector may be based on a layer row embedding output of an immediately preceding layer among the plurality of row transformer layers, while the third row key vector and the third row value vector may each be based on the calculated row-flattened positional code data; and may calculate, using a second row-wise row-column interactive attention model, a layer row embedding output corresponding to said row transformer layer based on fourth row query, key, and value vectors, where the fourth row query vector may be based on a second column-wise MHA output from a second column-wise MHA model for a corresponding column transformer layer, while the fourth row key vector and the fourth row value vector may each be based on the second row-wise MHA output. In such cases, the row-wise output feature map may be based on the layer row embedding output corresponding to the last row transformer layer among the plurality of row transformer layers. [0045] Similarly, the one or more column transformer layers may include a plurality of column transformer layers. 
For each column transformer layer among the plurality of column transformer layers that follows the first column transformer layer in sequence, the computing system: may calculate, using the second column-wise MHA model, the second column-wise MHA output based on third column query, key, and value vectors, where the third column query vector may be based on a layer column embedding output of an immediately preceding layer among the plurality of column transformer layers, while the third column key vector and the third column value vector may each be based on the calculated column-flattened positional code data; and may calculate, using a second column-wise row-column interactive attention model, a layer column embedding output corresponding to said column transformer layer based on fourth column query, key, and value vectors, where the fourth column query vector may be based on the second row-wise MHA output from the second row-wise MHA model for a corresponding row transformer layer, while the fourth column key vector and the fourth column value vector may each be based on the second column-wise MHA output. In such cases, the column-wise output feature map may be based on the layer column embedding output corresponding to the last column transformer layer among the plurality of column transformer layers. [0046] In some embodiments, for each of the first row transformer layer and the first column transformer layer, the computing system may perform one of: MHA via grouping; MHA via pooling; or a combination of the MHA via grouping and the MHA via pooling; and/or the like. 
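The layered structure described in paragraphs [0040]–[0045] — a per-layer MHA step whose queries are the previous layer's embeddings, followed by an interactive step coupling the row and column branches — can be sketched as follows. This single-head sketch makes one simplifying assumption: each branch's interactive step uses its own MHA output as the query and the other branch's MHA output as keys/values, a shape-preserving arrangement (the row branch keeps H embeddings, the column branch W) chosen for the example; the claimed query/key/value assignments are as stated in the text. Function names are hypothetical.

```python
import numpy as np

def attend(Q, K, V):
    # Single-head scaled dot-product attention.
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def dflat_layers(R, C, row_q, col_q, L=2):
    # R: (h*w, d) row-flattened sequence; C: (h*w, d) column-flattened
    # sequence; row_q: (H, d) and col_q: (W, d) decomposed queries.
    row_emb, col_emb = row_q, col_q
    for _ in range(L):
        # MHA: queries are the current embeddings (the decomposed queries
        # at the first layer, the previous layer's output afterwards);
        # keys/values come from the flattened input sequences.
        row_mha = attend(row_emb, R, R)              # (H, d)
        col_mha = attend(col_emb, C, C)              # (W, d)
        # Interactive attention: each branch aggregates context from the
        # other branch's MHA output.
        row_emb = attend(row_mha, col_mha, col_mha)  # (H, d)
        col_emb = attend(col_mha, row_mha, row_mha)  # (W, d)
    return row_emb, col_emb
```

After L layers, the two returned embeddings are the row-wise and column-wise outputs that are expanded and combined into the high-resolution map.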
[0047] In some instances, the MHA via grouping may comprise the computing system: dividing the input feature map into a plurality of groups of row-wise input feature maps; dividing the input feature map into a plurality of groups of column-wise input feature maps; calculating, using the first row-wise MHA model, a first row-wise group-combined MHA output, by independently calculating row-wise MHA output for each group of row-wise input feature maps and combining the calculated row-wise MHA outputs for each group of row- wise input feature maps, where the first row-wise MHA output may include the first row- wise group-combined MHA output; and may calculate, using the first column-wise MHA model, a first column-wise group-combined MHA output, by independently calculating column-wise MHA output for each group of column-wise input feature maps and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps, where the first column-wise MHA output includes the first column-wise group-combined MHA output. 
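The grouped MHA of paragraph [0047] can be sketched for the row branch as follows: the input rows and the row queries are divided into g groups, attention is computed independently within each group, and the per-group outputs are combined by concatenation. Since each group's attention map is (H/g) × (hw/g), the total attention cost falls by roughly a factor of g relative to the ungrouped H × hw map. This is a hypothetical single-head sketch, not the claimed implementation.

```python
import numpy as np

def attend(Q, K, V):
    # Single-head scaled dot-product attention.
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def grouped_row_mha(fmap, queries, g):
    # Divide the input rows and the row queries into g groups, compute
    # attention independently per group, then combine the group outputs.
    h, w, d = fmap.shape
    outputs = []
    for kv_rows, q in zip(np.array_split(fmap, g, axis=0),
                          np.array_split(queries, g, axis=0)):
        KV = kv_rows.reshape(-1, d)        # flatten this group's rows
        outputs.append(attend(q, KV, KV))  # (H/g, d) per group
    return np.concatenate(outputs, axis=0)  # (H, d) group-combined output
```

With g = 1 this degenerates to the ungrouped row-wise MHA; the column-wise case is symmetric over groups of columns.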
[0048] In some cases, the MHA via pooling may comprise the computing system: average-pooling rows of the input feature map to generate an average-pooled row-wise input feature map and to generate a row-wise flattened average-pooled feature sequence; average- pooling columns of the input feature map to generate an average-pooled column-wise input feature map and to generate a column-wise flattened average-pooled feature sequence; calculating, using the first row-wise MHA model, the first row-wise average-pooled MHA output based on the average-pooled row-wise input feature map and the row-wise flattened average-pooled feature sequence, where the first row-wise MHA output includes the first row-wise average-pooled MHA output; and calculating, using the first column-wise MHA model, the first column-wise average-pooled MHA output based on the average-pooled column-wise input feature map and the column-wise flattened average-pooled feature sequence, wherein the first column-wise MHA output includes the first column-wise average- pooled MHA output. [0049] In some instances, the combination of the MHA via grouping and the MHA via pooling may comprise the computing system: combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row- wise MHA output for the first row transformer layer, and combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer. [0050] Alternatively, for the first row transformer layer, the MHA via grouping may comprise the computing system dividing the input feature map into a plurality of groups of row-wise input feature maps. 
In some cases, flattening the input feature map into the row- wise flattened feature sequence may comprise the computing system flattening the row-wise input feature map for each group of row-wise input feature maps into a row-wise flattened feature sub-sequence among a plurality of groups of row-wise flattened feature sub- sequences. In some instances, calculating the row positional code data may comprise the computing system calculating row positional code data for each group of row-wise input feature maps, by performing linear interpolation on each group of row-wise input feature maps to generate a third number of row positional code data for each group of row-wise input feature maps, the third number corresponding to the height of the output feature map divided by the number of groups of row-wise input feature maps. In some cases, calculating the row- flattened positional code data may comprise the computing system calculating row-flattened positional code data for each group of row-wise flattened feature sequences. In some instances, calculating the first row-wise MHA output may comprise the computing system: independently calculating row-wise MHA output for each group of row-wise input feature maps based on row query, key, and value vectors for each group of row-wise input feature maps, where the row query vector for each group of row-wise input feature maps may be based on the corresponding calculated row positional code data for said group of row-wise input feature maps, while the row key vector and the row value vector for each group of row- wise input feature maps may each be based on the corresponding calculated row-flattened positional code data for said group of row-wise flattened feature sequences; and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps to generate a first row-wise group-combined MHA output. In such cases, the first row-wise MHA output may comprise the first row-wise group-combined MHA output. 
[0051] Similarly, for the first column transformer layer, the MHA via grouping may comprise the computing system dividing the input feature map into a plurality of groups of column-wise input feature maps. In some cases, flattening the input feature map into the column-wise flattened feature sequence may comprise the computing system flattening the column-wise input feature map for each group of column-wise input feature maps into a column-wise flattened feature sub-sequence among a plurality of groups of column-wise flattened feature sub-sequences. In some instances, calculating the column positional code data may comprise the computing system calculating column positional code data for each group of column-wise input feature maps, by performing linear interpolation on each group of column-wise input feature maps to generate a third number of column positional code data for each group of column-wise input feature maps, the third number corresponding to the height of the output feature map divided by the number of groups of column-wise input feature maps. In some cases, calculating the column-flattened positional code data may comprise the computing system calculating column-flattened positional code data for each group of column-wise flattened feature sequences. 
In some instances, calculating the first column-wise MHA output may comprise the computing system: independently calculating column-wise MHA output for each group of column-wise input feature maps based on column query, key, and value vectors for each group of column-wise input feature maps, where the column query vector for each group of column-wise input feature maps may be based on the corresponding calculated column positional code data for said group of column- wise input feature maps, while the column key vector and the column value vector for each group of column-wise input feature maps may each be based on the corresponding calculated column-flattened positional code data for said group of column-wise flattened feature sequences; and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps to generate a first column-wise group-combined MHA output. In such cases, the first column-wise MHA output may comprise the first column-wise group-combined MHA output. [0052] Alternatively, for the first row transformer layer, the MHA via pooling may comprise the computing system dividing each row of the input feature map into one or more pools of row-wise input feature maps, such that the input feature map is divided into a plurality of pools of row-wise input feature maps that includes the one or more pools of row- wise input feature maps for each row, and averaging values of features in each pool to generate average-pooled values for each pool among the plurality of pools of row-wise input feature maps, thereby producing an average-pooled row-wise input feature map. In some cases, flattening the input feature map into the row-wise flattened feature sequence may comprise the computing system flattening the plurality of pools of row-wise input feature maps into a row-wise flattened average-pooled feature sequence. 
In some instances, calculating the row positional code data may comprise the computing system calculating average-pooled row positional code data, by performing linear interpolation on the average- pooled row-wise input feature map to generate the first number of average-pooled row positional code data, the first number corresponding to the height of the output feature map. In some cases, calculating the row-flattened positional code data may comprise the computing system calculating average-pooled row-flattened positional code data based on the row-wise flattened average-pooled feature sequence. In some instances, calculating the first row-wise MHA output may comprise the computing system calculating, using the first row- wise MHA model, a first row-wise average-pooled MHA output based on the row-wise flattened average-pooled feature sequence and based on fifth row query, key, and value vectors, where the fifth row query vector may be based on the calculated average-pooled row positional code data, while the fifth row key vector and the fifth row value vector may each be based on the calculated average-pooled row-flattened positional code data. In such cases, the first row-wise MHA output may comprise the first row-wise average-pooled MHA output. [0053] Similarly, for the first column transformer layer, the MHA via pooling may comprise the computing system dividing each column of the input feature map into one or more pools of column-wise input feature maps, such that the input feature map is divided into a plurality of pools of column-wise input feature maps that includes the one or more pools of column-wise input feature maps for each column, and averaging values of features in each pool to generate average-pooled values for each pool among the plurality of pools of column- wise input feature maps, thereby producing an average-pooled column-wise input feature map. 
In some cases, flattening the input feature map into the column-wise flattened feature sequence may comprise the computing system flattening the plurality of pools of column- wise input feature maps into a column-wise flattened average-pooled feature sequence. In some instances, calculating the column positional code data may comprise the computing system calculating average-pooled column positional code data, by performing linear interpolation on the average-pooled column-wise input feature map to generate the first number of average-pooled column positional code data, the first number corresponding to the height of the output feature map. In some cases, calculating the column-flattened positional code data may comprise the computing system calculating average-pooled column-flattened positional code data based on the column-wise flattened average-pooled feature sequence. In some instances, calculating the first column-wise MHA output may comprise the computing system calculating, using the first column-wise MHA model, a first column-wise average- pooled MHA output based on the column-wise flattened average-pooled feature sequence and based on fifth column query, key, and value vectors, where the fifth column query vector may be based on the calculated average-pooled column positional code data, while the fifth column key vector and the fifth column value vector may each be based on the calculated average-pooled column-flattened positional code data. In such cases, the first column-wise MHA output may comprise the first column-wise average-pooled MHA output. 
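The pooled MHA of paragraphs [0048] and [0052]–[0053] can be sketched for the row branch: the values inside each pool of adjacent features in every row are averaged, so the keys and values come from a flattened sequence of length h · (w / pool) instead of h · w. As before, this is a hypothetical single-head sketch with projections omitted, assuming for simplicity that the width divides evenly into pools.

```python
import numpy as np

def attend(Q, K, V):
    # Single-head scaled dot-product attention.
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V

def pooled_row_mha(fmap, queries, pool):
    # Average the features inside each pool of `pool` adjacent entries in
    # every row, then attend over the shorter flattened sequence.
    h, w, d = fmap.shape
    assert w % pool == 0, "this sketch assumes w is divisible by the pool size"
    pooled = fmap.reshape(h, w // pool, pool, d).mean(axis=2)  # (h, w//pool, d)
    KV = pooled.reshape(-1, d)                                 # (h*(w//pool), d)
    return attend(queries, KV, KV)                             # (H, d)
```

The column-wise case pools along each column instead, and the grouped and pooled outputs may then be combined as described in paragraph [0054].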
[0054] In some cases, the combination of the MHA via grouping and the MHA via pooling may comprise the computing system: combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row- wise MHA output for the first row transformer layer, and combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer. [0055] These and other functions of the system 100 (and its components) are described in greater detail below with respect to Figs.2-4. [0056] Figs.2A-2G (collectively, "Fig.2") are schematic block flow diagrams illustrating various non-limiting examples 200 of components of the DFlatFormer for implementing semantic segmentation, in accordance with various embodiments. [0057] Naively, a conventional dense transformer needs a full-size query sequence to embed a flattened sequence of low-resolution input to a high-resolution output. This is intractable as the full size sequence of H ×W queries would result in demanding memory and computational overhead. [0058] On the contrary, in the DFlatFormer according to the various embodiments, full- size queries in a naive dense transformer are decomposed into H row and W column queries, and these decomposed row and column queries are used to embed rows and columns separately through multi-head attention ("MHA") and interactive attention modules or systems (e.g., MHA modules or systems 245a-245n and 255a-255n, as well as Row-Column Interactive Attention modules or systems 250a-250n and 260a-260n as shown in Fig.2A, or the like). 
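The contrast drawn in paragraphs [0057]–[0058] can be made concrete by counting attention-map entries. The sizes below are illustrative examples, not values from the disclosure:

```python
# Example sizes: a 32 x 32 encoder output upsampled to 128 x 128.
h, w = 32, 32      # input feature map size (h x w)
H, W = 128, 128    # output feature map size (H x W)

# Naive dense transformer: a full-size sequence of H*W queries, each
# attending over the h*w flattened input.
naive = (H * W) * (h * w)

# Dual-flattening transformer: H row queries plus W column queries.
decomposed = (H + W) * (h * w)

print(naive, decomposed, naive // decomposed)  # 16777216 262144 64
```

Decomposing the full-size query sequence shrinks the attention maps by a factor of HW/(H + W) — 64× in this example — before any grouping or pooling is applied.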
An input sequence from an encoder (e.g., encoder 125 of Fig.1, or the like) will be flattened row-wise and column-wise to spatially align with the sequences of decomposed queries before it is fed into row and column transformers (e.g., row transformers or row transformer layers 235a-235n and column transformers or column transformer layers 240a- 240n of Fig.2A, or the like) separately. These embedded row and column representations will further interact with each other to aggregate vertical and horizontal contexts at the row- column crossing points (such as shown, e.g., in Figs.2D and 2E, or the like), before they are eventually expanded and combined to generate a dense feature map of high resolution (e.g., feature map S 285 of Fig.2A, or the like). [0059] Fig.2A summarizes the overall pipeline. Specially, the DFlatFormer 210 comprises a row transformer (e.g., row transformers or row transformer layers 235a-235n (collectively, "row transformer 235," "row transformers 235," or "row transformer layers 235," or the like), or the like) and a column transformer (e.g., column transformers or column transformer layers 240a-240n (collectively, "column transformer 240," "column transformers 240," or "column transformer layers 240," or the like), or the like). Given the low-resolution output So (205a) from an encoder (e.g., encoder 125 of Fig.1, or the like; in this case, from feature extractor 205 that performs feature extraction on the encoded image from the encoder, or the like), it is flattened row-wise and column-wise into two separate sequences R (215) and C (220) with the respective positional encodings. The flattened row and column sequences together with the decomposed sequences of learnable row queries
Q_r ∈ ℝ^(H×d)
and column queries
Q_c ∈ ℝ^(W×d)
are fed through the row and column transformers to output corresponding embeddings. The resultant row/column embeddings for each layer are further refined by interacting with their column/row counterparts through an interactive attention module (e.g., row-column interactive attention modules or systems 250a-250n and 260a-260n, or the like). Suppose the row (column) transformer consists of L layers, the resultant last-layer embeddings of rows
E_r^(L)
(265) and columns
E_c^(L)
(270) are finally expanded (into 275 and 280), respectively and combined to output a full-resolution feature map S (285). The full-resolution feature map S (285) may then be combined with a bilinearly upsampled input feature map
up(S₀)
(205b) to generate a dense feature map (not shown) that may be used to perform semantic segmentation (such as shown, e.g., in Figs.3A and 3B). [0060] For each layer of DFlatFormer, and taking the row transformer as an example, a layer may include, without limitation, an MHA module and a row-column interactive attention module (e.g., MHA modules or systems 245a-245n and 255a-255n (collectively, "MHA 245 and/or 255," "MHA module(s) 245 and/or 255," or "MHA system(s) 245 and/or 255," or the like), and row-column interactive attention systems 250a-250n and 260a-260n (collectively, "row-column interactive attention 250 and/or 260," "row-column interactive attention module(s) 250 and/or 260," or "row-column interactive attention system(s) 250 and/or 260," or the like), respectively, or the like). The MHA module 245 takes as inputs the original row-flattened sequence R (215) as well as the row query sequence
Q_r
(or the last layer output of row embeddings
E_r^(l−1)),
and it outputs
Ê_r^(l)
after, in some cases, a multi-layer perceptron ("MLP") of a Feed-Forward Network ("FFN"), or the like. Then, the row-column interactive attention modules 250 and 260 follow by coupling intermediate embeddings of
Ê_r^(l)
and
Ê_c^(l)
output from the corresponding row MHA 245 and column MHA 255. This module allows a row (column) representation to aggregate all column (row) embeddings in vertical contexts as they cross the row (column). It outputs the row (column) representation of
E_r^(l) (E_c^(l))
for layer l, which will be fed into the next layer for further modeling. [0061] In other words, output
S₀ ∈ ℝ^(h×w×d)
(205a) from the encoder or the feature extractor 205 is flattened row-wise and column-wise to align with the row and column queries separately. This preserves the row and column structures by putting entire rows and columns together in the flattened sequences, respectively. Meanwhile, row-wise and column-wise positional encodings will also be applied to the dual-flattened sequences. This will enable the row and column queries to apply the multi-head attentions to the respective flattened sequences. [0062] Referring to Figs.2B and 2C, row-wise and column-wise positional encodings are first derived. Typically, the target feature map has a much larger size than the encoder output. To align the query sequence with the row/column flattened key/value sequence, row/column-wise positional encodings are needed as well. [0063] Formally, taking the row part for example, we have a sequence of H row queries. Meanwhile, the input feature map only has h rows before it is row-wise flattened to R ∈ ℝ^(hw×d). As shown in Fig.2B, one starts with an initial 1D positional encoding
p_r ∈ ℝ^(h×d)
through a sinusoidal function. This encoding can be upsampled by linear interpolation to
P̂_r ∈ ℝ^(H×d),
which aligns with the size of query sequence and thus can be used to initialize the positional encoding of queries directly. Meanwhile,
p_r
can also be column-wise replicated to
P̄_r ∈ ℝ^(h×w×d). The resultant encoding aligns with the size of input feature map
S₀
and thus can be row-wise flattened to
P_r ∈ ℝ^(hw×d)
and used as the positional encoding of the row-wise flattened R. This is a row-wise positional encoding since each row has the same code, thereby aligning with the row-wise flattening. Similar column-wise positional encoding can be applied to the corresponding column-wise flattening, as shown in Fig.2C. [0064] While the above depicts the mechanism of aligning absolute positional encodings with row/column-wise flattening, it is not difficult to apply a similar idea of row/column-wise expansion to extend relative positional encodings as well. [0065] Turning back to Fig.2A, details pertaining to MHA and row-column interactive attention are described. [0066] Regarding multi-head attention, first, row and column queries are embedded separately through MHA modules or systems 245a-245n and 255a-255n with dual-flattened sequences R and C, respectively. Herein, E_r^(l) (E_c^(l)) refers to the row (column) output sequence for layer l, which will be fed into the next layer as input. The first layer input
E_r^(0)
is set to 0. Then, taking a row transformer for example, a single-head attention is formulated as (Eqn.1)
head = softmax( ((E_r^(l−1) + Q_r) W_q) ((R + P_r) W_k)^T / √d_m ) (R W_v)
where W_q, W_k, and W_v are the linear projection matrices, d_m is the embedding dimension for each head, and T denotes the transpose of a matrix (in this case, the transpose of the matrix (R + P_r) W_k). Q_r is the learnable sequence of row queries, and P_r is the positional encoding as defined above with respect to Fig.2B. Then, the multi-head attention is obtained by putting together the single-head outputs,
MHA(E_r^(l−1), R) = Concat(head_1, …, head_n) W_O
(Eqn.2) where W_O is a linear projection matrix. The output of each layer further goes through a feed-forward network ("FFN") with a residual connection:
Ê_r^(l) = MHA(E_r^(l−1), R) + FFN(MHA(E_r^(l−1), R))
(Eqn.3) [0067] Herein, layer normalization is omitted for notational simplicity. [0068] Regarding row-column interactive attention, after the MHA module, the row-column interactive attention module is implemented. As illustrated in Figs.2D and 2E, this interactive attention aims to capture the relevant information when a row (column) crosses all columns (rows) spatially. This allows the learned row (column) representation to further integrate the vertical (horizontal) contexts along the crossed columns (rows) through an interactive attention mechanism. [0069] Formally, the row and column outputs from their interactive attentions can be obtained as follows:

E_r^(l) = softmax( Ê_r^(l) (Ê_c^(l))^T / √d ) Ê_c^(l) + Ê_r^(l) (Eqn.4)

E_c^(l) = softmax( Ê_c^(l) (Ê_r^(l))^T / √d ) Ê_r^(l) + Ê_c^(l) (Eqn.5)
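As a runnable sketch of this row-column interaction (in NumPy; the √d scaling and the residual addition of the intermediate embeddings are assumptions here, since the exact formulas of Eqns.4 and 5 appear only as images in the original filing):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def interactive_attention(Er_hat, Ec_hat):
    # Rows query all crossing columns, and vice versa.
    # Per paragraph [0070], no linear projections are used here.
    d = Er_hat.shape[-1]
    Er = softmax(Er_hat @ Ec_hat.T / np.sqrt(d)) @ Ec_hat + Er_hat
    Ec = softmax(Ec_hat @ Er_hat.T / np.sqrt(d)) @ Er_hat + Ec_hat
    return Er, Ec

rng = np.random.default_rng(0)
Er_hat = rng.standard_normal((6, 8))   # intermediate row embeddings, (H, d)
Ec_hat = rng.standard_normal((10, 8))  # intermediate column embeddings, (W, d)
Er, Ec = interactive_attention(Er_hat, Ec_hat)
```

Note that the row output keeps shape (H, d) and the column output keeps shape (W, d), so the interaction adds context without growing the sequences.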
[0070] The row-column interactive attention module 250 (or 260) takes the intermediate row output (or column output
Ê_c^(l))
as the query to aggregate the relevant information from all crossing columns. For simplicity, we do not use linear projections as in multi-head attention to map
Ê_r^(l)
and
Ê_c^(l)
into query, key, and value sequences. [0071] Turning back to Fig.2A, the final dense feature map may then be generated. At the output end of the transformer layers 235 and 240, each pixel at (i, j) in the final feature map is represented as a direct combination of the outputs from the row and column transformers at the same location. Formally, it can be written as follows:
S_ij = E_(r,i)^(L) + E_(c,j)^(L)
(Eqn.6) where S_ij is the final feature vector in the dense map at (i, j), and
E_(r,i)^(L) and E_(c,j)^(L)
are the representations of row i and column j in the last layer, respectively. Herein, through row- column interactive attention, each row (column) representation has already aggregated the information from all columns (rows) before it is finally added to the column (row) representations to generate the dense feature output. Thus, the DFlatFormer does not wait until the final layer to re-couple rows and columns. The early interaction between rows and columns allows them to fully explore the vertical and horizontal contexts across layers, while still keeping the DFlatFormer tractable. [0072] As shown in Fig.2A, the resultant last-layer embeddings of rows
E_r^(L)
(265) and columns (270) are finally expanded into column-expanded row-wise output feature map 275 and row-expanded column-wise output feature map 280, respectively, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map and by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map, respectively. The column-expanded row-wise output feature map 275 and row-expanded column-wise output feature map 280 are then combined to output a full-resolution feature map S (285). In some embodiments, the input feature map So (205a) and the output feature map S (285) may be combined to generate a dense feature map, and semantic segmentation may be performed based on the generated dense feature map. [0073] To further improve computational efficiency, a grouping and/or pooling module(s) that reduces the number of row/column flattened tokens fed into each layer may be used. Referring to Figs.2F and 2G, efficient attention via grouping and/or pooling is shown. If all the pixels in the encoder output are taken to form keys/values, the computational complexity would be
O((H + W)·hw)
, as there are in total H rows and W columns for queries and hw keys/values fed into the DFlatFormer 210. To further reduce complexity, multi-head attention via grouping and (average) pooling may be performed. Grouping divides both queries and the input feature map into several groups (as shown in Figs.2F and 2G), where each query can only access features within the corresponding feature group. On the other hand, for pooling, the features in a row and a column are average-pooled to form shorter flattened sequences (as shown in Figs.2F and 2G), further reducing the model complexity. [0074] Specifically, as shown in Fig.2F, for a row transformer, the row queries and the row-flattened sequence are equally divided into n_p groups. Multi-head attention may be conducted within each group in parallel such that a row query only performs the multi-head attention on the corresponding group of the flattened sequence. On the other hand, the row-flattened sequence may be average-pooled through a non-overlapping window of size n_w over each row, resulting in a shorter sequence that each row query can access in its entirety when performing the multi-head attention. In some embodiments, the resultant outputs from both grouping and pooling may be added to give the output. [0075] Grouping and pooling complement each other. For the grouping, each query accesses a smaller part of the flattened sequence at its original granular level. In contrast, for the pooling, the query accesses the whole sequence but at a coarse level with pooled features. By combining both, the output can reach a good balance between the computational costs involved and the representation ability. [0076] As mentioned above, with decomposed row and column queries, the overall complexity of DFlatFormer may be reduced to
O((H + W)·hw).
With respect to the complexity with the grouping and pooling, let β_g = 1/n_p be the fraction of the features within each group over the flattened sequence, and β_p = 1/n_w be the fraction of pooled features over the sequence. Each query only accesses a group of β_g·hw features. There are only β_p·hw pooled features in total, and they are shared amongst all the queries. Hence, the total number of features that a query can access is (β_g + β_p)·hw, resulting in an overall complexity of
O((β_g + β_p)(H + W)·hw).
In some experiments, these ratios may include
β_g = β_p = 1/4;
hence the overall complexity is further reduced by half to
O((H + W)·hw/2).
[0077] In some embodiments, DFlatFormer 210 can serve as a plug-in module to be connected to any CNN or transformer-based encoder (e.g., encoder 125 of Fig.1, or the like) and to generate high-resolution output. [0078] These and other functions of the example(s) 200 (and their components) are described in greater detail herein with respect to Figs.1 and 4. Figs.3A and 3B illustrate the efficacy of DFlatFormer compared with conventional techniques. [0079] The following results of empirical studies illustrate the effectiveness of DFlatFormer, compared with conventional techniques and systems. [0080] Herein, mean intersection over union ("mIoU") is used to measure segmentation performance. To compare memory sizes and computation complexities, the number of parameters ("Param") and Giga-floating-point operations ("GFLOPs") are used. "Size" a in all tables, if not specified, denotes a crop size of a × a, while "mIoU" in the tables refers to mIoU with single-scale inference, and "+MS" denotes mIoU with multi-scale inference and horizontal flipping. [0081] In the following, more comparisons are made between the DFlatFormer model (according to the various embodiments) and other state-of-the-art architectures, showing that DFlatFormer can be used as a universal plug-in decoder for highly accurate dense predictions for segmentation. [0082] Regarding comparisons with a CNN backbone, Tables 1 and 2 below show a comprehensive comparison of DFlatFormer with other models on the ADE20K and Cityscapes val datasets, respectively. All models adopt CNN backbones. Table 1. Semantic segmentation performance on ADE20K val dataset with CNN backbone.
[Table 1 appears as an image in the original document.]
[0083] As shown in Table 1, for ADE20K segmentation, with ResNet-50 as the backbone, DFlatFormer outperforms DeepLabv3+ baseline by 3.3 % and 3.7 % for single- scale and multiscale inference, respectively. With ResNet-101 as backbone, DFlatFormer outperforms DeepLabv3+ by 2.7 % for single-scale inference. Table 2. Semantic segmentation on Cityscapes val dataset with CNN backbone. † denotes a crop size of 512 x 1024.
[Table 2 appears as an image in the original document.]
[0084] As shown in Table 2, for Cityscapes semantic segmentation, DFlatFormer with single-scale inference outperforms DeepLabv3+ by 2.0 % and 1.8 % with backbone ResNet- 50 and ResNet-101, respectively. [0085] Regarding comparisons with a Transformer backbone, Tables 3 and 4 provide performance comparisons of DFlatFormer with other architectures on ADE20K and Cityscapes val datasets, respectively. Table 3. Semantic segmentation performance on ADE20K val dataset, with transformer backbone. denotes model pretrained on ImageNet-22K. * denotes AlignedResize used in inference.
[Table 3 appears as an image in the original document.]
[0086] As shown in Table 3, for ADE20K segmentation with Swin-T as the backbone, DFlatFormer has a 2.6 % gain over UperNet for single-scale inference. With Swin-S as the backbone, DFlatFormer surpasses UperNet by 0.6 %. On the other hand, the model size and GFLOPs of DFlatFormer are much smaller than the baselines. When comparing with SegFormer, with MiT-B2 as the backbone, DFlatFormer outperforms SegFormer by 0.9 %. With MiT-B4 as the backbone, DFlatFormer outperforms SegFormer by 0.5 %. While the model size is slightly increased over SegFormer, the DFlatFormer model enjoys much smaller GFLOPs. In SegFormer, AlignedResize is utilized, which potentially provides extra gain over the normal inference. For a fair comparison with others, the normal procedure as described in MMSegmentation (MMSegmentation Contributors, "MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark," available at github.com, which is incorporated herein by reference in its entirety for all purposes) was followed. The results demonstrate that DFlatFormer can further leverage the multi-level features to strengthen the power of hierarchical structures in dense prediction. Table 4. Semantic segmentation on Cityscapes val dataset with transformer backbone. † denotes a crop size of 512 x 1024.
[Table 4 appears as an image in the original document.]
[0087] As shown in Table 4, with MiT-B2 as the backbone, DFlatFormer outperforms SegFormer by 0.9 %. With MiT-B4 as the backbone, DFlatFormer outperforms SegFormer by 0.5 %. [0088] Figs.3A and 3B (collectively, "Fig.3") are diagrams illustrating various non- limiting examples 300 and 300' of visualization comparisons of DFlatFormer and of conventional semantic segmentation techniques using example datasets, in accordance with various embodiments. [0089] For qualitative comparison, segmentation results are presented in Fig.3A to compare DFlatFormer with each of DeepLabv3+ and SegFormer on Cityscapes dataset. As shown in the top two images of Fig.3A, DFlatFormer provides better or more complete predictions for thin and/or small objects, including, but not limited to, poles and traffic lights, and/or the like. As shown with respect to the bottom image, DFlatFormer also provides more precise predictions on the boundaries of people and terrain, and/or the like. [0090] In Fig.3B, more comparisons are presented between DFlatFormer and each of DeepLabv3+ (ResNet-101) and SegFormer on ADE20K dataset. As shown in the left and center images of Fig.3B, DFlatFormer can predict more accurate and complete boundaries for objects, including, but not limited to, curtains and lamps, and/or the like. As shown in the right image of Fig.3B, among the three methods, only DFlatFormer can accurately predict the segmentation for objects including, but not limited to, the chest of drawers, and/or the like. The results show that DFlatFormer can effectively capture more fine details through long-range attention on the contexts. [0091] These and other functions of the examples 300 and 300' (and their components) are described in greater detail herein with respect to Figs.1, 2, and 4. 
[0092] Figs.4A-4E (collectively, "Fig.4") are flow diagrams illustrating a method 400 for implementing DFlatFormer through decomposed row and column queries for semantic segmentation, in accordance with various embodiments. Method 400 of Fig.4A continues onto Fig.4B following the circular marker denoted, "A," continues onto Fig.4C following the circular marker denoted, "B," and returns to Fig.4A following the circular marker denoted, "C." [0093] While the techniques and procedures are depicted and/or described in a certain order for purposes of illustration, it should be appreciated that certain procedures may be reordered and/or omitted within the scope of various embodiments. Moreover, while the method 400 illustrated by Fig.4 can be implemented by or with (and, in some cases, are described below with respect to) the systems, examples, or embodiments 100, 200, 300, and 300' of Figs.1, 2, 3A, and 3B, respectively (or components thereof), such methods may also be implemented using any suitable hardware (or software) implementation. Similarly, while each of the systems, examples, or embodiments 100, 200, 300, and 300' of Figs.1, 2, 3A, and 3B, respectively (or components thereof), can operate according to the method 400 illustrated by Fig.4 (e.g., by executing instructions embodied on a computer readable medium), the systems, examples, or embodiments 100, 200, 300, and 300' of Figs.1, 2, 3A, and 3B can each also operate according to other modes of operation and/or perform other suitable procedures. [0094] In the non-limiting embodiment of Fig.4A, method 400, at block 405, may comprise receiving, using a computing system, an input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects. 
In some embodiments, the computing system may include, without limitation, at least one of a dual-flattening transformer ("DFlatFormer"), a machine learning system, an AI system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system, and/or the like. [0095] At block 410, method 400 may comprise flattening, using the computing system, the input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map. At block 415, method 400 may comprise flattening, using the computing system, the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map. [0096] Method 400 may further comprise implementing, using the computing system, one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map (block 420). Method 400 may continue onto the process at block 425 or may continue onto the process at block 455a in Fig.4B following the circular marker denoted, "A." Method 400 may further comprise implementing, using the computing system, one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map (block 425). Method 400 may continue onto the process at block 430, may continue onto the process at block 455b in Fig.4C following the circular marker denoted, "B," or may loop back to the process at block 420 to repeat the processes at blocks 420 and 425 for each set of row/column transformer layers (where the number of loop backs may be any suitable number between 1 and 20, in some cases, between 1 and 10, or the like). 
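The flow of blocks 410 through 425 can be sketched end to end as follows (a simplified, single-head NumPy sketch: learned projections, FFNs, multi-head splitting, and positional encodings are omitted, and all function and array names are illustrative only, not part of the original disclosure):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, KV):
    # simplified single-head attention without learned projections
    return softmax(Q @ KV.T / np.sqrt(Q.shape[-1])) @ KV

def dflatformer_layers(R, C, Qr, Qc, num_layers=2):
    Er, Ec = np.zeros_like(Qr), np.zeros_like(Qc)  # first-layer inputs set to 0
    for _ in range(num_layers):
        Er_hat = attend(Er + Qr, R)  # row MHA over the row-flattened sequence
        Ec_hat = attend(Ec + Qc, C)  # column MHA over the column-flattened sequence
        # row-column interactive attention (residual form assumed)
        Er = softmax(Er_hat @ Ec_hat.T / np.sqrt(Er_hat.shape[-1])) @ Ec_hat + Er_hat
        Ec = softmax(Ec_hat @ Er_hat.T / np.sqrt(Ec_hat.shape[-1])) @ Er_hat + Ec_hat
    return Er, Ec

h, w, d, H, W = 4, 5, 8, 8, 10
rng = np.random.default_rng(1)
S0 = rng.standard_normal((h, w, d))
R = S0.reshape(h * w, d)                     # row-wise flattening (block 410)
C = S0.transpose(1, 0, 2).reshape(h * w, d)  # column-wise flattening (block 415)
Qr, Qc = rng.standard_normal((H, d)), rng.standard_normal((W, d))
Er, Ec = dflatformer_layers(R, C, Qr, Qc)    # blocks 420 and 425, looped
```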
[0097] Method 400 may further comprise, at block 430, generating, using the computing system, a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map. Method 400 may further comprise, at block 435, generating, using the computing system, a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map. [0098] For each row transformer layer among the plurality of row transformer layers that follows the first row transformer layer in sequence, the method 400 may further comprise calculating, using the computing system and using a second row-wise MHA model, a second row-wise MHA output based on third row query, key, and value vectors, wherein the third row query vector is based on a layer row embedding output of an immediately preceding layer among the plurality of row transformer layers, wherein the third row key vector and the third row value vector are each based on the calculated row-flattened positional code data; and calculating, using the computing system and using a second row-wise row-column interactive attention model, a layer row embedding output corresponding to said row transformer layer based on fourth row query, key, and value vectors, wherein the fourth row query vector is based on a second column-wise MHA output from a second column-wise MHA model for a corresponding column transformer layer, and wherein the fourth row key vector and the fourth row value vector are each based on the second row-wise MHA output; wherein the row-wise output feature map is based on the layer row embedding output corresponding to the last row transformer layer among the plurality of row transformer layers.
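The expansion of blocks 430 and 435 amounts to repeating the H × d row output across the width W and the W × d column output down the height H, which is equivalent to a broadcast addition (illustrative NumPy sketch; names are not from the original):

```python
import numpy as np

H, W, d = 4, 6, 3
rng = np.random.default_rng(2)
Er = rng.standard_normal((H, d))  # last-layer row embeddings
Ec = rng.standard_normal((W, d))  # last-layer column embeddings

# column-expanded row-wise map: the row output repeated W times across the width
row_expanded = np.repeat(Er[:, None, :], W, axis=1)   # (H, W, d)
# row-expanded column-wise map: the column output repeated H times down the height
col_expanded = np.repeat(Ec[None, :, :], H, axis=0)   # (H, W, d)

S = row_expanded + col_expanded  # pixel (i, j) combines row i and column j
```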
[0099] For each column transformer layer among the plurality of column transformer layers that follows the first column transformer layer in sequence, the method 400 may further comprise calculating, using the computing system and using the second column-wise MHA model, the second column-wise MHA output based on third column query, key, and value vectors, wherein the third column query vector is based on a layer column embedding output of an immediately preceding layer among the plurality of column transformer layers, wherein the third column key vector and the third column value vector are each based on the calculated column-flattened positional code data; and calculating, using the computing system and using a second column-wise row-column interactive attention model, a layer column embedding output corresponding to said column transformer layer based on fourth column query, key, and value vectors, wherein the fourth column query vector is based on the second row-wise MHA output from the second row-wise MHA model for a corresponding row transformer layer, and wherein the fourth column key vector and the fourth column value vector are each based on the second column-wise MHA output; wherein the column-wise output feature map is based on the layer column embedding output corresponding to the last column transformer layer among the plurality of column transformer layers. [0100] According to some embodiments, flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map may be implemented concurrently with implementation of flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map.
At block 440, method 400 may comprise generating, using the computing system, the output feature map, by combining the column- expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map. In some embodiments, method 400 may further comprise bilinearly upsampling, using the computing system, the input feature map; and combining, using the computing system, the bilinearly upsampled input feature map and the output feature map to generate a dense feature map (block 445); and performing, using the computing system, semantic segmentation based on the generated dense feature map (block 450). [0101] At block 455a in Fig.4B (following the circular marker denoted, "A"), method 400 may comprise calculating, using the computing system, row positional code data, by performing linear interpolation on the input feature map to generate a first number of row positional code data, the first number corresponding to the height of the output feature map. Method 400 may further comprise calculating, using the computing system, row-flattened positional code data based on the row-wise flattened feature sequence (block 460a); and calculating, using the computing system and using a first row-wise multi-head attention ("MHA") model, a first row-wise MHA output based on the row-wise flattened feature sequence and based on first row query, key, and value vectors, wherein the first row query vector is based on the calculated row positional code data, wherein the first row key vector and the first row value vector are each based on the calculated row-flattened positional code data (block 465a). 
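The row positional code calculation of blocks 455a and 460a (linear interpolation up to the target height for the queries, plus column-wise replication and row-wise flattening for the key/value sequence, per Fig.2B) can be sketched as follows (the sinusoidal formula is a standard choice and an assumption here; all names are illustrative):

```python
import numpy as np

def sinusoidal_encoding(n, d):
    # standard 1D sinusoidal positional encoding, shape (n, d)
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

h, w, d, H = 4, 5, 8, 16
p = sinusoidal_encoding(h, d)                 # initial 1D encoding over h rows

# upsample by linear interpolation to length H -> positional code for the H row queries
xs = np.linspace(0, h - 1, H)
P_query = np.stack([np.interp(xs, np.arange(h), p[:, k]) for k in range(d)], axis=1)

# replicate column-wise to (h, w, d), then row-wise flatten to (h*w, d) for keys/values
P_key = np.repeat(p[:, None, :], w, axis=1).reshape(h * w, d)
```

Every position within the same row of `P_key` shares the same code, which is exactly the row-wise alignment property described with respect to Fig.2B.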
Method 400 may further comprise, at block 470a, calculating, using the computing system and using a first row-wise row-column interactive attention model, a first layer row embedding output based on second row query, key, and value vectors, wherein the second row query vector is based on the first column-wise MHA output, and wherein the second row key vector and the second row value vector are each based on the first row-wise MHA output. Method 400 may return to the process at block 430 in Fig.4A following the circular marker denoted, "C." [0102] At block 455b in Fig.4C (following the circular marker denoted, "B"), method 400 may comprise calculating, using the computing system, column positional code data, by performing linear interpolation on the input feature map to generate a second number of column positional code data, the second number corresponding to the width of the output feature map. Method 400 may further comprise calculating, using the computing system, column-flattened positional code data based on the column-wise flattened feature sequence (block 460b); and calculating, using the computing system and using a first column-wise MHA model, a first column-wise MHA output based on the column-wise flattened feature sequence and based on first column query, key, and value vectors, wherein the first column query vector is based on the calculated column positional code data, wherein the first column key vector and the first column value vector are each based on the calculated column- flattened positional code data (block 465b). 
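A single attention head of the kind computed in blocks 465a and 465b can be sketched as follows (simplified to one head, with the positional codes assumed to be already added into the query and key sequences; the function and variable names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def single_head_attention(Q, K, V, Wq, Wk, Wv):
    dm = Wq.shape[1]  # per-head embedding dimension
    A = softmax((Q @ Wq) @ (K @ Wk).T / np.sqrt(dm))
    return A @ (V @ Wv)

rng = np.random.default_rng(0)
H, hw, d, dm = 6, 20, 8, 4
Q = rng.standard_normal((H, d))       # row queries (positional codes assumed added)
K = V = rng.standard_normal((hw, d))  # row-flattened key/value sequence
Wq, Wk, Wv = (rng.standard_normal((d, dm)) for _ in range(3))
out = single_head_attention(Q, K, V, Wq, Wk, Wv)
```

A multi-head version would concatenate several such outputs and apply an output projection.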
Method 400 may further comprise, at block 470b, calculating, using the computing system and using a first column-wise row-column interactive attention model, a first layer column embedding output based on second column query, key, and value vectors, wherein the second column query vector is based on the first row-wise MHA output, and wherein the second column key vector and the second column value vector are each based on the first column-wise MHA output. Method 400 may return to the process at block 430 in Fig.4A following the circular marker denoted, "C." [0103] With reference to Fig.4D, implementing, using the computing system, the one or more row transformer layers (at block 420) may comprise performing one of: MHA via grouping (block 475); MHA via pooling (block 480); or a combination of the MHA via grouping (block 475) and the MHA via pooling (block 480). According to some embodiments, performing MHA via grouping (at block 475) may comprise: dividing, using the computing system, the input feature map into a plurality of groups of row-wise input feature maps (block 475a); and calculating, using the computing system and using the first row-wise MHA model, a first row-wise group-combined MHA output, by independently calculating row-wise MHA output for each group of row-wise input feature maps and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps, wherein the first row-wise MHA output comprises the first row-wise group-combined MHA output (block 475b). 
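The grouping of blocks 475a and 475b restricts each query group to its own slice of the flattened sequence; a minimal sketch (single head, no learned projections, illustrative names):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_attention(Q, X, n_groups):
    # queries and the flattened sequence are split into n_groups equal groups;
    # each query group attends only to its corresponding feature group
    d = Q.shape[-1]
    outs = [softmax(qg @ xg.T / np.sqrt(d)) @ xg
            for qg, xg in zip(np.split(Q, n_groups), np.split(X, n_groups))]
    return np.concatenate(outs)  # group outputs combined back into one sequence

rng = np.random.default_rng(3)
Q = rng.standard_normal((8, 4))    # 8 row queries
X = rng.standard_normal((32, 4))   # row-flattened sequence
out = grouped_attention(Q, X, n_groups=4)
```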
In some embodiments, performing MHA via pooling (at block 480) may comprise: average-pooling, using the computing system, rows of the input feature map to generate an average-pooled row-wise input feature map and to generate a row-wise flattened average-pooled feature sequence (block 480a); and calculating, using the computing system and using the first row-wise MHA model, the first row-wise average-pooled MHA output based on the average-pooled row-wise input feature map and the row-wise flattened average-pooled feature sequence, wherein the first row-wise MHA output comprises the first row-wise average-pooled MHA output (block 480b). According to some embodiments, for the combination of the MHA via grouping (block 475) and the MHA via pooling (block 480), method 400 may further comprise combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer (optional block 485a). 
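The average-pooling of blocks 480a and 480b can be sketched as follows (a non-overlapping window of size n_w = 4 is an illustrative choice):

```python
import numpy as np

h, w, d, n_w = 4, 8, 3, 4
S0 = np.arange(h * w * d, dtype=float).reshape(h, w, d)

# average each row over non-overlapping windows of size n_w
pooled = S0.reshape(h, w // n_w, n_w, d).mean(axis=2)  # (h, w/n_w, d)
R_pooled = pooled.reshape(h * (w // n_w), d)           # shorter flattened sequence
```

Every row query can then attend to this whole shortened sequence, complementing the grouping path.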
[0104] Alternatively, although not shown in Fig.4, performing MHA via grouping (at block 475) may comprise: dividing, using the computing system, the input feature map into a plurality of groups of row-wise input feature maps (similar to block 475a); flattening, using the computing system, the row-wise input feature map for each group of row-wise input feature maps into a row-wise flattened feature sub-sequence among a plurality of groups of row-wise flattened feature sub-sequences; calculating, using the computing system, row positional code data for each group of row-wise input feature maps, by performing linear interpolation on each group of row-wise input feature maps to generate a third number of row positional code data for each group of row-wise input feature maps, the third number corresponding to the height of the output feature map divided by the number of groups of row-wise input feature maps; calculating, using the computing system, row-flattened positional code data for each group of row-wise flattened feature sequences; independently calculating row-wise MHA output for each group of row-wise input feature maps based on row query, key, and value vectors for each group of row-wise input feature maps, wherein the row query vector for each group of row-wise input feature maps is based on the corresponding calculated row positional code data for said group of row-wise input feature maps, wherein the row key vector and the row value vector for each group of row-wise input feature maps are each based on the corresponding calculated row-flattened positional code data for said group of row-wise flattened feature sequences; and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps to generate a first row-wise group-combined MHA output, wherein the first row-wise MHA output comprises the first row-wise group-combined MHA output. 
[0105] In some embodiments, although not shown in Fig.4, performing MHA via pooling (at block 480) may comprise: dividing, using the computing system, each row of the input feature map into one or more pools of row-wise input feature maps, such that the input feature map is divided into a plurality of pools of row-wise input feature maps that includes the one or more pools of row-wise input feature maps for each row, and averaging, using the computing system, values of features in each pool to generate average-pooled values for each pool among the plurality of pools of row-wise input feature maps, thereby producing an average-pooled row-wise input feature map; flattening, using the computing system, the plurality of pools of row-wise input feature maps into a row-wise flattened average-pooled feature sequence; calculating, using the computing system, average-pooled row positional code data, by performing linear interpolation on the average-pooled row-wise input feature map to generate the first number of average-pooled row positional code data, the first number corresponding to the height of the output feature map; calculating, using the computing system, average-pooled row-flattened positional code data based on the row-wise flattened average-pooled feature sequence; calculating, using the computing system and using the first row-wise MHA model, a first row-wise average-pooled MHA output based on the row-wise flattened average-pooled feature sequence and based on fifth row query, key, and value vectors, wherein the fifth row query vector is based on the calculated average-pooled row positional code data, wherein the fifth row key vector and the fifth row value vector are each based on the calculated average-pooled row-flattened positional code data, wherein the first row-wise MHA output comprises the first row-wise average-pooled MHA output. 
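The pooled row-wise attention described above can be sketched in the same spirit. Again this is a hedged, single-head NumPy sketch: learned projections are random stand-ins, and the average-pooled row positional codes are approximated by interpolating the pooled map's row means up to the output height; names and shapes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def row_mha_via_pooling(feat, h_out, pool_size, rng=None):
    """Single-head sketch of row-wise attention via average pooling.

    feat: (H, W, C) input feature map; pool_size is assumed to divide W.
    Each row is split into W/pool_size pools whose features are averaged,
    shortening the flattened key/value sequence by a factor of pool_size.
    Returns an (h_out, C) average-pooled row-wise output.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    H, W, C = feat.shape
    P = W // pool_size

    # Average-pool within each row: (H, W, C) -> (H, P, C).
    pooled = feat.reshape(H, P, pool_size, C).mean(axis=2)
    seq = pooled.reshape(-1, C)               # row-wise flattened, length H*P

    # Average-pooled row positional codes: linearly interpolate the pooled
    # map's row means to h_out positions (the "first number"; an assumption).
    row_means = pooled.mean(axis=1)           # (H, C)
    src = np.linspace(0.0, 1.0, H)
    dst = np.linspace(0.0, 1.0, h_out)
    pos = np.stack(
        [np.interp(dst, src, row_means[:, c]) for c in range(C)], axis=1)

    # Hypothetical learned projections (random stand-ins for this sketch).
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    q, k, v = pos @ Wq, seq @ Wk, seq @ Wv
    attn = softmax(q @ k.T / np.sqrt(C))
    return attn @ v                           # (h_out, C)
```

Where grouping and pooling are combined, the two outputs above would be merged (for example, summed) into the first row-wise MHA output.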
[0106] According to some embodiments, for the combination of the MHA via grouping (block 475) and the MHA via pooling (block 480), method 400 may further comprise combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer (similar to optional block 485a). [0107] Similarly, referring to Fig.4E, implementing, using the computing system, one or more column transformer layers (at block 425) may comprise performing one of: MHA via grouping (block 490); MHA via pooling (block 495); or a combination of the MHA via grouping (block 490) and the MHA via pooling (block 495). According to some embodiments, performing MHA via grouping (at block 490) may comprise: dividing, using the computing system, the input feature map into a plurality of groups of column-wise input feature maps (block 490a); and calculating, using the computing system and using the first column-wise MHA model, a first column-wise group-combined MHA output, by independently calculating column-wise MHA output for each group of column-wise input feature maps and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps, wherein the first column-wise MHA output comprises the first column-wise group-combined MHA output (block 490b). 
In some embodiments, performing MHA via pooling (at block 495) may comprise: average-pooling, using the computing system, columns of the input feature map to generate an average-pooled column-wise input feature map and to generate a column-wise flattened average-pooled feature sequence (block 495a); and calculating, using the computing system and using the first column-wise MHA model, the first column-wise average-pooled MHA output based on the average-pooled column-wise input feature map and the column-wise flattened average-pooled feature sequence, wherein the first column-wise MHA output comprises the first column-wise average-pooled MHA output (block 495b). According to some embodiments, for the combination of the MHA via grouping (block 490) and the MHA via pooling (block 495), method 400 may further comprise combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer (optional block 485b). 
[0108] Alternatively, although not shown in Fig.4, performing MHA via grouping (at block 490) may comprise: dividing, using the computing system, the input feature map into a plurality of groups of column-wise input feature maps (similar to block 490a); flattening, using the computing system, the column-wise input feature map for each group of column-wise input feature maps into a column-wise flattened feature sub-sequence among a plurality of groups of column-wise flattened feature sub-sequences; calculating, using the computing system, column positional code data for each group of column-wise input feature maps, by performing linear interpolation on each group of column-wise input feature maps to generate a third number of column positional code data for each group of column-wise input feature maps, the third number corresponding to the width of the output feature map divided by the number of groups of column-wise input feature maps; calculating, using the computing system, column-flattened positional code data for each group of column-wise flattened feature sequences; independently calculating column-wise MHA output for each group of column-wise input feature maps based on column query, key, and value vectors for each group of column-wise input feature maps, wherein the column query vector for each group of column-wise input feature maps is based on the corresponding calculated column positional code data for said group of column-wise input feature maps, wherein the column key vector and the column value vector for each group of column-wise input feature maps are each based on the corresponding calculated column-flattened positional code data for said group of column-wise flattened feature sequences; and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps to generate a first column-wise group-combined MHA output, wherein the first column-wise MHA output comprises the first column-wise group-combined MHA output. 
[0109] In some embodiments, although not shown in Fig.4, performing MHA via pooling (at block 495) may comprise: dividing, using the computing system, each column of the input feature map into one or more pools of column-wise input feature maps, such that the input feature map is divided into a plurality of pools of column-wise input feature maps that includes the one or more pools of column-wise input feature maps for each column, and averaging, using the computing system, values of features in each pool to generate average-pooled values for each pool among the plurality of pools of column-wise input feature maps, thereby producing an average-pooled column-wise input feature map; flattening, using the computing system, the plurality of pools of column-wise input feature maps into a column-wise flattened average-pooled feature sequence; calculating, using the computing system, average-pooled column positional code data, by performing linear interpolation on the average-pooled column-wise input feature map to generate the second number of average-pooled column positional code data, the second number corresponding to the width of the output feature map; calculating, using the computing system, average-pooled column-flattened positional code data based on the column-wise flattened average-pooled feature sequence; calculating, using the computing system and using the first column-wise MHA model, a first column-wise average-pooled MHA output based on the column-wise flattened average-pooled feature sequence and based on fifth column query, key, and value vectors, wherein the fifth column query vector is based on the calculated average-pooled column positional code data, wherein the fifth column key vector and the fifth column value vector are each based on the calculated average-pooled column-flattened positional code data, wherein the first column-wise MHA output comprises the first column-wise average-pooled MHA output. 
[0110] According to some embodiments, for the combination of the MHA via grouping (block 490) and the MHA via pooling (block 495), method 400 may further comprise combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer (similar to optional block 485b). [0111] Examples of System and Hardware Implementation [0112] Fig.5 is a block diagram illustrating an example of computer or system hardware architecture, in accordance with various embodiments. Fig.5 provides a schematic illustration of one embodiment of a computer system 500 of the service provider system hardware that can perform the methods provided by various other embodiments, as described herein, and/or can perform the functions of computer or hardware system (i.e., computing system 105, dual-flattening transformers ("DFlatFormers") 110 and 210, artificial intelligence ("AI") system 115, semantic segmentation system 120, encoder 125, content source(s) 130, content distribution system 140, and user devices 155a-155n, etc.), as described above. It should be noted that Fig.5 is meant only to provide a generalized illustration of various components, of which one or more (or none) of each may be utilized as appropriate. Fig.5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner. [0113] The computer or hardware system 500 – which might represent an embodiment of the computer or hardware system (i.e., computing system 105, DFlatFormers 110 and 210, AI system 115, semantic segmentation system 120, encoder 125, content source(s) 130, content distribution system 140, and user devices 155a-155n, etc.), described above with respect to Figs.1-4 – is shown comprising hardware elements that can be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). 
The hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and/or the like); one or more input devices 515, which can include, without limitation, a mouse, a keyboard, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, a printer, and/or the like. [0114] The computer or hardware system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like. [0115] The computer or hardware system 500 might also include a communications subsystem 530, which can include, without limitation, a modem, a network card (wireless or wired), an infra-red communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, cellular communication facilities, etc.), and/or the like. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, and/or with any other devices described herein. In many embodiments, the computer or hardware system 500 will further comprise a working memory 535, which can include a RAM or ROM device, as described above. 
[0116] The computer or hardware system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments (including, without limitation, hypervisors, VMs, and the like), and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods. [0117] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer or hardware system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer or hardware system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) 
then takes the form of executable code. [0118] It will be apparent to those skilled in the art that substantial variations may be made in accordance with particular requirements. For example, customized hardware (such as programmable logic controllers, field-programmable gate arrays, application-specific integrated circuits, and/or the like) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed. [0119] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer or hardware system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer or hardware system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein. [0120] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in some fashion. 
In an embodiment implemented using the computer or hardware system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire, and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communications subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including without limitation radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications). [0121] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code. 
[0122] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer or hardware system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention. [0123] The communications subsystem 530 (and/or components thereof) generally will receive the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510. [0124] While particular features and aspects have been described with respect to some embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. 
Similarly, while particular functionality is ascribed to particular system components, unless the context dictates otherwise, this functionality need not be limited to such and can be distributed among various other system components in accordance with the several embodiments. [0125] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with—or without—particular features for ease of description and to illustrate some aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.
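The dual-flattening reconstruction recited in the claims below (expanding the decomposed row-wise and column-wise outputs and combining them into the full-resolution output feature map) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: element-wise addition is assumed as the "combining" operation, and the function name and shapes are illustrative.

```python
import numpy as np

def dual_flatten_combine(row_out, col_out):
    """Combine decomposed row and column outputs into a full feature map.

    row_out: (H_out, C) row-wise output feature map (one embedding per row).
    col_out: (W_out, C) column-wise output feature map (one per column).
    Returns an (H_out, W_out, C) output feature map: the row-wise output is
    repeated W_out times along the width, the column-wise output is repeated
    H_out times along the height, and the two expansions are combined
    (element-wise addition is assumed here).
    """
    h_out, _ = row_out.shape
    w_out, _ = col_out.shape
    # Column-expanded row-wise output: repeat each row embedding across width.
    col_expanded_rows = np.repeat(row_out[:, None, :], w_out, axis=1)
    # Row-expanded column-wise output: repeat each column embedding down height.
    row_expanded_cols = np.repeat(col_out[None, :, :], h_out, axis=0)
    return col_expanded_rows + row_expanded_cols  # (H_out, W_out, C)
```

Because H_out and W_out can exceed the input feature map's height and width, the combined map has a higher resolution than the input, as the claims require.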

Claims

WHAT IS CLAIMED IS: 1. A method, comprising: flattening, using a computing system, an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flattening, using the computing system, the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implementing, using the computing system, one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implementing, using the computing system, one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; generating, using the computing system, a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; generating, using the computing system, a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and generating, using the computing system, the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.
2. The method of claim 1, wherein the computing system comprises at least one of a dual-flattening transformer ("DFlatFormer"), a machine learning system, an AI system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system.
3. The method of claim 1 or 2, further comprising: for a first row transformer layer among the one or more row transformer layers: calculating, using the computing system, row positional code data, by performing linear interpolation on the input feature map to generate a first number of row positional code data, the first number corresponding to the height of the output feature map; calculating, using the computing system, row-flattened positional code data based on the row-wise flattened feature sequence; and calculating, using the computing system and using a first row-wise multi-head attention ("MHA") model, a first row-wise MHA output based on the row-wise flattened feature sequence and based on first row query, key, and value vectors, wherein the first row query vector is based on the calculated row positional code data, wherein the first row key vector and the first row value vector are each based on the calculated row-flattened positional code data; and for a first column transformer layer among the one or more column transformer layers: calculating, using the computing system, column positional code data, by performing linear interpolation on the input feature map to generate a second number of column positional code data, the second number corresponding to the width of the output feature map; calculating, using the computing system, column-flattened positional code data based on the column-wise flattened feature sequence; and calculating, using the computing system and using a first column-wise MHA model, a first column-wise MHA output based on the column-wise flattened feature sequence and based on first column query, key, and value vectors, wherein the first column query vector is based on the calculated column positional code data, wherein the first column key vector and the first column value vector are each based on the calculated column-flattened positional code data.
4. The method of claim 3, further comprising: further for the first row transformer layer: calculating, using the computing system and using a first row-wise row-column interactive attention model, a first layer row embedding output based on second row query, key, and value vectors, wherein the second row query vector is based on the first column-wise MHA output, and wherein the second row key vector and the second row value vector are each based on the first row-wise MHA output; and further for the first column transformer layer: calculating, using the computing system and using a first column-wise row-column interactive attention model, a first layer column embedding output based on second column query, key, and value vectors, wherein the second column query vector is based on the first row-wise MHA output, and wherein the second column key vector and the second column value vector are each based on the first column-wise MHA output.
5. The method of claim 4, wherein the one or more row transformer layers comprise a plurality of row transformer layers and the one or more column transformer layers comprise a plurality of column transformer layers, wherein the method further comprises: for each row transformer layer among the plurality of row transformer layers that follows the first row transformer layer in sequence: calculating, using the computing system and using a second row-wise MHA model, a second row-wise MHA output based on third row query, key, and value vectors, wherein the third row query vector is based on a layer row embedding output of an immediately preceding layer among the plurality of row transformer layers, wherein the third row key vector and the third row value vector are each based on the calculated row-flattened positional code data; and calculating, using the computing system and using a second row-wise row-column interactive attention model, a layer row embedding output corresponding to said row transformer layer based on fourth row query, key, and value vectors, wherein the fourth row query vector is based on a second column-wise MHA output from a second column-wise MHA model for a corresponding column transformer layer, and wherein the fourth row key vector and the fourth row value vector are each based on the second row-wise MHA output; and for each column transformer layer among the plurality of column transformer layers that follows the first column transformer layer in sequence: calculating, using the computing system and using the second column-wise MHA model, the second column-wise MHA output based on third column query, key, and value vectors, wherein the third column query vector is based on a layer column embedding output of an immediately preceding layer among the plurality of column transformer layers, wherein the third column key vector and the third column value vector are each based on the calculated column-flattened positional code data; and 
calculating, using the computing system and using a second column-wise row-column interactive attention model, a layer column embedding output corresponding to said column transformer layer based on fourth column query, key, and value vectors, wherein the fourth column query vector is based on the second row-wise MHA output from the second row-wise MHA model for a corresponding row transformer layer, and wherein the fourth column key vector and the fourth column value vector are each based on the second column-wise MHA output; wherein the row-wise output feature map is based on the layer row embedding output corresponding to the last row transformer layer among the plurality of row transformer layers, and wherein the column-wise output feature map is based on the layer column embedding output corresponding to the last column transformer layer among the plurality of column transformer layers.
6. The method of claim 3, further comprising, for each of the first row transformer layer and the first column transformer layer, performing one of: MHA via grouping; MHA via pooling; or a combination of the MHA via grouping and the MHA via pooling.
7. The method of claim 6, wherein: the MHA via grouping comprises: dividing, using the computing system, the input feature map into a plurality of groups of row-wise input feature maps and dividing the input feature map into a plurality of groups of column-wise input feature maps; calculating, using the computing system and using the first row-wise MHA model, a first row-wise group-combined MHA output, by independently calculating row-wise MHA output for each group of row-wise input feature maps and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps, wherein the first row-wise MHA output comprises the first row-wise group-combined MHA output; and calculating, using the computing system and using the first column-wise MHA model, a first column-wise group-combined MHA output, by independently calculating column-wise MHA output for each group of column-wise input feature maps and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps, wherein the first column-wise MHA output comprises the first column-wise group-combined MHA output; and the MHA via pooling comprises: average-pooling, using the computing system, rows of the input feature map to generate an average-pooled row-wise input feature map and to generate a row-wise flattened average-pooled feature sequence; average-pooling, using the computing system, columns of the input feature map to generate an average-pooled column-wise input feature map and to generate a column-wise flattened average-pooled feature sequence; calculating, using the computing system and using the first row-wise MHA model, the first row-wise average-pooled MHA output based on the average-pooled row-wise input feature map and the row-wise flattened average-pooled feature sequence, wherein the first row-wise MHA output comprises the first row-wise average-pooled MHA output; and calculating, using the computing system and using the first column-wise MHA model, the first column-wise average-pooled MHA output based on the average-pooled column-wise input feature map and the column-wise flattened average-pooled feature sequence, wherein the first column-wise MHA output comprises the first column-wise average-pooled MHA output; and the combination of the MHA via grouping and the MHA via pooling comprises: combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer, and combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer.
8. The method of claim 6, wherein: the MHA via grouping comprises: for the first row transformer layer: dividing, using the computing system, the input feature map into a plurality of groups of row-wise input feature maps; wherein flattening the input feature map into the row-wise flattened feature sequence comprises flattening, using the computing system, the row-wise input feature map for each group of row-wise input feature maps into a row-wise flattened feature sub-sequence among a plurality of groups of row-wise flattened feature sub-sequences; wherein calculating the row positional code data comprises calculating, using the computing system, row positional code data for each group of row-wise input feature maps, by performing linear interpolation on each group of row-wise input feature maps to generate a third number of row positional code data for each group of row-wise input feature maps, the third number corresponding to the height of the output feature map divided by the number of groups of row-wise input feature maps; wherein calculating the row-flattened positional code data comprises calculating, using the computing system, row-flattened positional code data for each group of row-wise flattened feature sequences; wherein calculating the first row-wise MHA output comprises: independently calculating row-wise MHA output for each group of row-wise input feature maps based on row query, key, and value vectors for each group of row-wise input feature maps, wherein the row query vector for each group of row-wise input feature maps is based on the corresponding calculated row positional code data for said group of row-wise input feature maps, wherein the row key vector and the row value vector for each group of row-wise input feature maps are each based on the corresponding calculated row- flattened positional code data for said group of row-wise flattened feature sequences; and combining the calculated row-wise MHA outputs for each group of row-wise input 
feature maps to generate a first row-wise group-combined MHA output; wherein the first row-wise MHA output comprises the first row-wise group-combined MHA output; and for the first column transformer layer: dividing, using the computing system, the input feature map into a plurality of groups of column-wise input feature maps; wherein flattening the input feature map into the column-wise flattened feature sequence comprises flattening, using the computing system, the column-wise input feature map for each group of column-wise input feature maps into a column-wise flattened feature sub-sequence among a plurality of groups of column-wise flattened feature sub-sequences; wherein calculating the column positional code data comprises calculating, using the computing system, column positional code data for each group of column-wise input feature maps, by performing linear interpolation on each group of column-wise input feature maps to generate a third number of column positional code data for each group of column-wise input feature maps, the third number corresponding to the width of the output feature map divided by the number of groups of column-wise input feature maps; wherein calculating the column-flattened positional code data comprises calculating, using the computing system, column-flattened positional code data for each group of column-wise flattened feature sequences; wherein calculating the first column-wise MHA output comprises: independently calculating column-wise MHA output for each group of column-wise input feature maps based on column query, key, and value vectors for each group of column-wise input feature maps, wherein the column query vector for each group of column-wise input feature maps is based on the corresponding calculated column positional code data for said group of column-wise input feature maps, wherein the column key vector and the column value vector for each group of column-wise input feature maps are each based on the 
corresponding calculated column-flattened positional code data for said group of column-wise flattened feature sequences; and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps to generate a first column-wise group-combined MHA output; wherein the first column-wise MHA output comprises the first column-wise group-combined MHA output; the MHA via pooling comprises: for the first row transformer layer: dividing, using the computing system, each row of the input feature map into one or more pools of row-wise input feature maps, such that the input feature map is divided into a plurality of pools of row-wise input feature maps that includes the one or more pools of row-wise input feature maps for each row, and averaging, using the computing system, values of features in each pool to generate average-pooled values for each pool among the plurality of pools of row-wise input feature maps, thereby producing an average-pooled row-wise input feature map; wherein flattening the input feature map into the row-wise flattened feature sequence comprises flattening, using the computing system, the plurality of pools of row-wise input feature maps into a row-wise flattened average-pooled feature sequence; wherein calculating the row positional code data comprises calculating, using the computing system, average-pooled row positional code data, by performing linear interpolation on the average-pooled row-wise input feature map to generate the first number of average-pooled row positional code data, the first number corresponding to the height of the output feature map; wherein calculating the row-flattened positional code data comprises calculating, using the computing system, average-pooled row-flattened positional code data based on the row-wise flattened average-pooled feature sequence; wherein calculating the first row-wise MHA output comprises calculating, using the computing system and using the first row-wise MHA model, a 
first row-wise average-pooled MHA output based on the row-wise flattened average-pooled feature sequence and based on fifth row query, key, and value vectors, wherein the fifth row query vector is based on the calculated average-pooled row positional code data, wherein the fifth row key vector and the fifth row value vector are each based on the calculated average-pooled row-flattened positional code data; wherein the first row-wise MHA output comprises the first row-wise average-pooled MHA output; and for the first column transformer layer: dividing, using the computing system, each column of the input feature map into one or more pools of column-wise input feature maps, such that the input feature map is divided into a plurality of pools of column-wise input feature maps that includes the one or more pools of column-wise input feature maps for each column, and averaging, using the computing system, values of features in each pool to generate average-pooled values for each pool among the plurality of pools of column-wise input feature maps, thereby producing an average-pooled column-wise input feature map; wherein flattening the input feature map into the column-wise flattened feature sequence comprises flattening, using the computing system, the plurality of pools of column-wise input feature maps into a column-wise flattened average-pooled feature sequence; wherein calculating the column positional code data comprises calculating, using the computing system, average-pooled column positional code data, by performing linear interpolation on the average-pooled column-wise input feature map to generate the second number of average-pooled column positional code data, the second number corresponding to the width of the output feature map; wherein calculating the column-flattened positional code data comprises calculating, using the computing system, average-pooled column-flattened positional code data based on the column-wise flattened average-pooled feature sequence; 
wherein calculating the first column-wise MHA output comprises calculating, using the computing system and using the first column-wise MHA model, a first column-wise average-pooled MHA output based on the column-wise flattened average-pooled feature sequence and based on fifth column query, key, and value vectors, wherein the fifth column query vector is based on the calculated average-pooled column positional code data, wherein the fifth column key vector and the fifth column value vector are each based on the calculated average-pooled column-flattened positional code data; wherein the first column-wise MHA output comprises the first column-wise average-pooled MHA output; and the combination of the MHA via grouping and the MHA via pooling comprises: combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer, and combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer.
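For illustration, the grouping and pooling variants of claim 8 amount to two ways of preparing the input feature map before attention: splitting its rows into independently attended groups, and average-pooling along each row to shorten the flattened sequence. The sketch below is a minimal numpy illustration of that data preparation only; the function names, the group count `g`, and the pooling window `pool_w` are assumptions for this sketch, not the claimed implementation.

```python
import numpy as np

def group_rows(fmap, g):
    """Split an (H, W, C) feature map into g groups of H // g rows each;
    per the claim, MHA is then computed independently per group and the
    group outputs are combined (hypothetical grouping step)."""
    H, W, C = fmap.shape
    assert H % g == 0, "H must be divisible by the number of groups"
    return fmap.reshape(g, H // g, W, C)

def average_pool_rows(fmap, pool_w):
    """Average-pool each row along the width with window pool_w, which
    shortens the row-wise flattened sequence from H*W to H*(W // pool_w)."""
    H, W, C = fmap.shape
    assert W % pool_w == 0, "W must be divisible by the pooling window"
    return fmap.reshape(H, W // pool_w, pool_w, C).mean(axis=2)
```

Under the combined option of the claim, the group-combined attention output and the average-pooled attention output would then be combined (for example, summed) to form the layer's MHA output; the column-wise branch applies the same two operations transposed.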
9. The method of any of claims 1-6, wherein flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map are implemented concurrent with implementation of flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map.
10. The method of any of claims 1-9, further comprising: bilinearly upsampling, using the computing system, the input feature map; and combining, using the computing system, the bilinearly upsampled input feature map and the output feature map to generate a dense feature map.
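Claim 10's residual-style refinement can be sketched directly: the low-resolution input feature map is bilinearly upsampled to the output resolution and combined with the transformer's output feature map. The align-corners sampling grid and the additive combination below are assumptions for this sketch; the claim specifies only bilinear upsampling and combination.

```python
import numpy as np

def bilinear_upsample(fmap, out_h, out_w):
    """Bilinearly upsample an (H, W, C) feature map to (out_h, out_w, C),
    sampling on an align-corners-style grid (an assumed convention)."""
    H, W, C = fmap.shape
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = fmap[y0][:, x0] * (1 - wx) + fmap[y0][:, x1] * wx
    bot = fmap[y1][:, x0] * (1 - wx) + fmap[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def dense_feature_map(input_fmap, output_fmap):
    """Combine the upsampled input with the transformer output; addition
    is an assumed choice of combination."""
    up = bilinear_upsample(input_fmap, *output_fmap.shape[:2])
    return up + output_fmap
```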
11. The method of claim 10, further comprising: performing, using the computing system, semantic segmentation based on the generated dense feature map.
12. A dual-flattening transformer system for implementing semantic segmentation, the system comprising: a computing system, comprising: at least one first processor; and a first non-transitory computer readable medium communicatively coupled to the at least one first processor, the first non-transitory computer readable medium having stored thereon computer software comprising a first set of instructions that, when executed by the at least one first processor, causes the computing system to: flatten an input feature map into a row-wise flattened feature sequence, by concatenating each successive row of the input feature map to a first row of the input feature map, the input feature map comprising an image containing features extracted from an input image containing one or more objects; flatten the input feature map into a column-wise flattened feature sequence, by concatenating each successive column of the input feature map to a first column of the input feature map; implement one or more row transformer layers based at least in part on the row-wise flattened feature sequence, to output a row-wise output feature map; implement one or more column transformer layers based at least in part on the column-wise flattened feature sequence, to output a column-wise output feature map; generate a column-expanded row-wise output feature map, by repeating the row-wise output feature map a number of times corresponding to a width of an output feature map; generate a row-expanded column-wise output feature map, by repeating the column-wise output feature map a number of times corresponding to a height of the output feature map; and generate the output feature map, by combining the column-expanded row-wise output feature map and the row-expanded column-wise output feature map, the output feature map having a resolution that is greater than a resolution of the input feature map.
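The dual-flattening operations recited in claim 12 can be sketched compactly: row-major and column-major flattening of the feature map, then expansion of the two branch outputs by repetition and their combination into the higher-resolution output. The (Ho, C)/(Wo, C) branch-output shapes and the additive combination are assumptions of this sketch; the claim leaves the combination operator open.

```python
import numpy as np

def flatten_rows(fmap):
    """Concatenate successive rows: (H, W, C) -> (H*W, C) row-major sequence."""
    return fmap.reshape(-1, fmap.shape[-1])

def flatten_cols(fmap):
    """Concatenate successive columns: (H, W, C) -> (W*H, C) column-major sequence."""
    return fmap.transpose(1, 0, 2).reshape(-1, fmap.shape[-1])

def expand_and_combine(row_out, col_out):
    """row_out: (Ho, C), one embedding per output row; col_out: (Wo, C),
    one per output column. Repeat each across the other axis and combine
    (addition is an assumed choice) into an (Ho, Wo, C) output map."""
    Ho, C = row_out.shape
    Wo, _ = col_out.shape
    col_expanded_rows = np.repeat(row_out[:, None, :], Wo, axis=1)  # (Ho, Wo, C)
    row_expanded_cols = np.repeat(col_out[None, :, :], Ho, axis=0)  # (Ho, Wo, C)
    return col_expanded_rows + row_expanded_cols
```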
13. The dual-flattening transformer system of claim 12, wherein the computing system comprises at least one of a dual-flattening transformer ("DFlatFormer"), a machine learning system, an AI system, a deep learning system, a processor on the user device, a server computer over a network, a cloud computing system, or a distributed computing system.
14. The dual-flattening transformer system of claim 12 or 13, wherein the first set of instructions, when executed by the at least one first processor, further causes the computing system to: for a first row transformer layer among the one or more row transformer layers: calculate row positional code data, by performing linear interpolation on the input feature map to generate a first number of row positional code data, the first number corresponding to the height of the output feature map; calculate row-flattened positional code data based on the row-wise flattened feature sequence; calculate, using a first row-wise multi-head attention ("MHA") model, a first row-wise MHA output based on the row-wise flattened feature sequence and based on first row query, key, and value vectors, wherein the first row query vector is based on the calculated row positional code data, wherein the first row key vector and the first row value vector are each based on the calculated row-flattened positional code data; and for a first column transformer layer among the one or more column transformer layers: calculate column positional code data, by performing linear interpolation on the input feature map to generate a second number of column positional code data, the second number corresponding to the width of the output feature map; calculate column-flattened positional code data based on the column-wise flattened feature sequence; calculate, using a first column-wise MHA model, a first column-wise MHA output based on the column-wise flattened feature sequence and based on first column query, key, and value vectors, wherein the first column query vector is based on the calculated column positional code data, wherein the first column key vector and the first column value vector are each based on the calculated column-flattened positional code data.
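The positional-code and attention machinery of claim 14 can be illustrated with a single attention head standing in for the claimed MHA model: linearly interpolated positional codes supply the queries, and the flattened sequence's positional codes supply the keys and values. Learned query/key/value projections are omitted, and `np.interp`-based per-channel interpolation is an assumed realization of the claimed linear interpolation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interpolate_codes(codes, target_len):
    """Linearly interpolate (n, C) positional codes up to target_len rows
    (or columns), per channel, matching the output height (or width)."""
    n, C = codes.shape
    src = np.arange(n)
    tgt = np.linspace(0, n - 1, target_len)
    return np.stack([np.interp(tgt, src, codes[:, c]) for c in range(C)], axis=1)

def attention(q, k, v):
    """One scaled dot-product head standing in for the claimed MHA model."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v
```

In the row branch, (Ho, C) interpolated row queries attend over the (H*W, C) row-wise flattened sequence to yield Ho row embeddings; the column branch is symmetric with Wo queries over the column-wise flattened sequence.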
15. The dual-flattening transformer system of claim 14, wherein the first set of instructions, when executed by the at least one first processor, further causes the computing system to: further for the first row transformer layer: calculate, using a first row-wise row-column interactive attention model, a first layer row embedding output based on second row query, key, and value vectors, wherein the second row query vector is based on the first column-wise MHA output, and wherein the second row key vector and the second row value vector are each based on the first row-wise MHA output; and further for the first column transformer layer: calculate, using a first column-wise row-column interactive attention model, a first layer column embedding output based on second column query, key, and value vectors, wherein the second column query vector is based on the first row-wise MHA output, and wherein the second column key vector and the second column value vector are each based on the first column-wise MHA output.
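The row-column interactive attention of claim 15 swaps query sources between the two branches: each branch queries with the other branch's MHA output while keeping its own output as keys and values. The sketch below follows that reading literally with a single head and no learned projections; both simplifications are assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def interactive_attention(row_mha, col_mha):
    """Row branch: queries from the column-wise MHA output, keys/values
    from the row-wise MHA output; column branch is the mirror image."""
    row_embed = attention(col_mha, row_mha, row_mha)
    col_embed = attention(row_mha, col_mha, col_mha)
    return row_embed, col_embed
```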
16. The dual-flattening transformer system of claim 15, wherein the one or more row transformer layers comprise a plurality of row transformer layers and the one or more column transformer layers comprise a plurality of column transformer layers, wherein the first set of instructions, when executed by the at least one first processor, further causes the computing system to: for each row transformer layer among the plurality of row transformer layers that follows the first row transformer layer in sequence: calculate, using a second row-wise MHA model, a second row-wise MHA output based on third row query, key, and value vectors, wherein the third row query vector is based on a layer row embedding output of an immediately preceding layer among the plurality of row transformer layers, wherein the third row key vector and the third row value vector are each based on the calculated row-flattened positional code data; and calculate, using a second row-wise row-column interactive attention model, a layer row embedding output corresponding to said row transformer layer based on fourth row query, key, and value vectors, wherein the fourth row query vector is based on a second column-wise MHA output from a second column-wise MHA model for a corresponding column transformer layer, and wherein the fourth row key vector and the fourth row value vector are each based on the second row-wise MHA output; and for each column transformer layer among the plurality of column transformer layers that follows the first column transformer layer in sequence: calculate, using the second column-wise MHA model, the second column-wise MHA output based on third column query, key, and value vectors, wherein the third column query vector is based on a layer column embedding output of an immediately preceding layer among the plurality of column transformer layers, wherein the third column key vector and the third column value vector are each based on the calculated column-flattened positional 
code data; and calculate, using a second column-wise row-column interactive attention model, a layer column embedding output corresponding to said column transformer layer based on fourth column query, key, and value vectors, wherein the fourth column query vector is based on the second row-wise MHA output from the second row-wise MHA model for a corresponding row transformer layer, and wherein the fourth column key vector and the fourth column value vector are each based on the second column-wise MHA output; wherein the row-wise output feature map is based on the layer row embedding output corresponding to the last row transformer layer among the plurality of row transformer layers, and wherein the column-wise output feature map is based on the layer column embedding output corresponding to the last column transformer layer among the plurality of column transformer layers.
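Claim 16's layer recursion, where each later layer queries the flattened sequence with the previous layer's embedding and then exchanges queries with the other branch, reduces to a short loop. The sketch below uses one unprojected attention head per step and equal-length branch embeddings; those shapes and the omission of learned projections are assumptions of this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) @ v

def run_layers(row_embed, col_embed, row_seq, col_seq, num_layers):
    """Each layer: (1) MHA over the flattened sequence with queries from
    the previous layer's embedding (the third q/k/v of the claim), then
    (2) interactive attention exchanging queries between branches (the
    fourth q/k/v). Projections and multiple heads are omitted."""
    for _ in range(num_layers):
        row_mha = attention(row_embed, row_seq, row_seq)
        col_mha = attention(col_embed, col_seq, col_seq)
        row_embed = attention(col_mha, row_mha, row_mha)
        col_embed = attention(row_mha, col_mha, col_mha)
    return row_embed, col_embed
```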
17. The dual-flattening transformer system of claim 14, wherein the first set of instructions, when executed by the at least one first processor, further causes the computing system to, for each of the first row transformer layer and the first column transformer layer, perform one of: MHA via grouping; MHA via pooling; or a combination of the MHA via grouping and the MHA via pooling.
18. The dual-flattening transformer system of claim 17, wherein: the MHA via grouping comprises: dividing the input feature map into a plurality of groups of row-wise input feature maps and dividing the input feature map into a plurality of groups of column-wise input feature maps; calculating, using the first row-wise MHA model, a first row-wise group-combined MHA output, by independently calculating row-wise MHA output for each group of row-wise input feature maps and combining the calculated row-wise MHA outputs for each group of row-wise input feature maps, wherein the first row-wise MHA output comprises the first row-wise group-combined MHA output; and calculating, using the first column-wise MHA model, a first column-wise group-combined MHA output, by independently calculating column-wise MHA output for each group of column-wise input feature maps and combining the calculated column-wise MHA outputs for each group of column-wise input feature maps, wherein the first column-wise MHA output comprises the first column-wise group-combined MHA output; and the MHA via pooling comprises: average-pooling rows of the input feature map to generate an average-pooled row-wise input feature map and to generate a row-wise flattened average-pooled feature sequence; average-pooling columns of the input feature map to generate an average-pooled column-wise input feature map and to generate a column-wise flattened average-pooled feature sequence; calculating, using the first row-wise MHA model, the first row-wise average-pooled MHA output based on the average-pooled row-wise input feature map and the row-wise flattened average-pooled feature sequence, wherein the first row-wise MHA output comprises the first row-wise average-pooled MHA output; and calculating, using the first column-wise MHA model, the first column-wise average-pooled MHA output based on the average-pooled column-wise input feature map and the column-wise flattened average-pooled feature sequence, wherein 
the first column-wise MHA output comprises the first column-wise average-pooled MHA output; and the combination of the MHA via grouping and the MHA via pooling comprises: combining the first row-wise group-combined MHA output and the first row-wise average-pooled MHA output to generate the first row-wise MHA output for the first row transformer layer, and combining the first column-wise group-combined MHA output and the first column-wise average-pooled MHA output to generate the first column-wise MHA output for the first column transformer layer.
19. The dual-flattening transformer system of any of claims 12-18, wherein flattening the input feature map into the column-wise flattened sequence, implementing the one or more column transformer layers, and generating the row-expanded column-wise output feature map are implemented concurrent with implementation of flattening the input feature map into the row-wise flattened sequence, implementing the one or more row transformer layers, and generating the column-expanded row-wise output feature map.
20. The dual-flattening transformer system of any of claims 12-19, wherein the first set of instructions, when executed by the at least one first processor, further causes the computing system to: bilinearly upsample the input feature map; combine the bilinearly upsampled input feature map and the output feature map to generate a dense feature map; and perform semantic segmentation based on the generated dense feature map.
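Claim 20 ends with semantic segmentation performed on the dense feature map. The claim does not specify the prediction head, so the sketch below uses a per-pixel linear classifier followed by argmax, which is a common but assumed choice; `class_weights` is a hypothetical learned parameter.

```python
import numpy as np

def segment(dense_fmap, class_weights):
    """Per-pixel class logits via a linear head over channels, then argmax.
    dense_fmap: (H, W, C); class_weights: (C, K) -- an assumed head, not
    specified by the claim. Returns an (H, W) integer label map."""
    logits = dense_fmap @ class_weights   # (H, W, K)
    return logits.argmax(axis=-1)
```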
PCT/US2022/022831 2021-11-10 2022-03-31 Dual-flattening transformer through decomposed row and column queries for semantic segmentation WO2022216521A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163277656P 2021-11-10 2021-11-10
US63/277,656 2021-11-10

Publications (1)

Publication Number Publication Date
WO2022216521A1 2022-10-13

Family

ID=83545616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2022/022831 WO2022216521A1 (en) 2021-11-10 2022-03-31 Dual-flattening transformer through decomposed row and column queries for semantic segmentation

Country Status (1)

Country Link
WO (1) WO2022216521A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051549A (en) * 2023-03-29 2023-05-02 山东建筑大学 Method, system, medium and equipment for dividing defects of solar cell
CN117576405A (en) * 2024-01-17 2024-02-20 深圳汇医必达医疗科技有限公司 Tongue picture semantic segmentation method, device, equipment and medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170053398A1 (en) * 2015-08-19 2017-02-23 Colorado Seminary, Owner and Operator of University of Denver Methods and Systems for Human Tissue Analysis using Shearlet Transforms
US20180018553A1 (en) * 2015-03-20 2018-01-18 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Relevance score assignment for artificial neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RIKIYA YAMASHITA, MIZUHO NISHIO, RICHARD KINH GIAN DO, KAORI TOGASHI: "Convolutional neural networks: an overview and application in radiology", INSIGHTS INTO IMAGING, vol. 9, no. 4, 1 August 2018 (2018-08-01), pages 611 - 629, XP055580998, DOI: 10.1007/s13244-018-0639-9 *



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22785187

Country of ref document: EP

Kind code of ref document: A1