CN118251704A - Processing video content using gated transformer neural networks - Google Patents

Processing video content using gated transformer neural networks

Info

Publication number
CN118251704A
CN118251704A
Authority
CN
China
Prior art keywords
frame
token set
token
video stream
tokens
Prior art date
Legal status
Pending
Application number
CN202280060174.XA
Other languages
Chinese (zh)
Inventor
Y·李
B·慕斯
T·P·F·布兰克沃特
A·哈比比安
B·艾特沙米·贝诺狄
Current Assignee
Qualcomm Inc
Original Assignee
Qualcomm Inc
Priority date
Priority claimed from US 17/933,840 (published as US 2023/0090941 A1)
Application filed by Qualcomm Inc filed Critical Qualcomm Inc
Priority claimed from PCT/US2022/076752 (published as WO 2023/049726 A1)
Publication of CN118251704A

Landscapes

  • Image Analysis (AREA)

Abstract

Certain aspects of the present disclosure provide techniques and apparatus for processing video streams using a machine learning model. An example method generally includes generating a first token set from a first frame of the video stream and generating a second token set from a second frame of the video stream. Based on a comparison of tokens from the first token set with corresponding tokens in the second token set, a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame are identified. A feature output is generated for a portion of the second frame corresponding to the second token set. Features associated with the first token set are combined with the generated feature output into a representation of the second frame.

Description

Processing video content using gated transformer neural networks
Cross Reference to Related Applications
The present application claims the benefit of and priority to U.S. patent application Ser. No. 17/933,840, entitled "Processing Video Content Using Gated Transformer Neural Networks," filed on September 20, 2022, which claims the benefit of and priority to U.S. provisional patent application Ser. No. 63/246,643, entitled "Object Detection in Video Content Using Gated Transformer Neural Networks," filed on September 21, 2021, both of which are assigned to the assignee of the present application, and the contents of each of which are hereby incorporated by reference in their entirety.
Background
Aspects of the present disclosure relate to machine learning, and more particularly to processing video content using an artificial neural network.
In various cases, an artificial neural network may be used to process video content, such as by identifying objects in the captured video content, estimating poses of people detected in the video content, or semantically segmenting the video content, and performing various operations based on identifying objects in the captured video content. For example, in an autonomous vehicle application, an artificial neural network may be used to identify obstacles or other objects in the path in which the autonomous vehicle is traveling, and the identification of these obstacles or objects may be used to control the vehicle to avoid collisions with them (e.g., by steering around these obstacles, stopping before colliding with the objects, etc.). In monitoring applications, artificial neural networks may be used to detect motion in a monitored environment.
In general, video content may be defined in terms of a spatial dimension and a temporal dimension. Motion over time may be detected in the temporal dimension based on changes in pixel values detected at given spatial locations in the video content. For example, background content may remain static or substantially static in the temporal dimension; however, when (non-camouflaged) objects move over time, the spatial positions of these objects change. Thus, motion into a region may be visualized as a change from a static pixel value to a pixel value associated with the object; likewise, motion out of the region may be visualized as a change from a pixel value associated with the object to a different pixel value (e.g., corresponding to a background value).
Various types of neural networks may be used to process visual content, such as video content. For example, convolutional or transformer neural networks (e.g., detection transformers ("DETR") or shifted-window ("Swin") transformers) may be used to detect objects in visual content, semantically segment visual content into different portions (e.g., foreground segments and background segments, static and non-static segments, etc.), and/or predict future movements of objects in visual content (e.g., perform pose prediction for articulated, multi-joint objects). However, these neural networks may process visual content on a per-image basis and may not take into account redundancy (e.g., spatial or temporal redundancy) in the visual content, which may be an inefficient use of computing resources (e.g., processor cycles, memory, etc.).
Accordingly, what is needed is an improved technique for object detection in video content.
Disclosure of Invention
Certain aspects provide a method for detecting objects in a data stream using a machine learning model. An example method generally includes extracting a first feature from a first segment of the data stream and extracting a second feature from a second segment of the data stream. The first feature and the second feature are concatenated into a combined representation of the first segment of the data stream and the second segment of the data stream. Unchanged content and changed content are identified from the combined representation of the first segment of the data stream and the second segment of the data stream. A feature output for the second segment of the data stream is generated from the first feature and the second feature based on the identified unchanged content and the identified changed content. Using a transformer neural network, a plurality of objects in the data stream are identified based on the characteristic output for the second segment of the data stream. One or more actions are taken based on identifying the plurality of objects in the data stream.
Certain aspects provide a method of processing a video stream using a machine learning model. An example method generally includes generating a first token set from a first frame of the video stream and generating a second token set from a second frame of the video stream. Based on a comparison of tokens from the first token set with corresponding tokens in the second token set, a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame are identified. A feature output is generated for a portion of the second frame corresponding to the second token set. The features associated with the first token set are combined with the generated feature output for the portion of the second frame corresponding to the second token set into a representation of the second frame of the video stream.
Other aspects provide a processing system configured to perform the foregoing methods, as well as those described herein; a non-transitory computer-readable medium comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods and those described herein; a computer program product embodied on a computer readable storage medium comprising code for performing the foregoing method and the method described herein; and a processing system comprising means for performing the foregoing methods, as well as those methods further described herein.
The following description and the annexed drawings set forth in detail certain illustrative features of the one or more aspects.
Drawings
The drawings depict certain aspects of the various features of the present disclosure and are, therefore, not to be considered limiting of the scope of the disclosure.
FIG. 1 depicts an example machine learning pipeline for object detection in visual content.
FIG. 2 depicts an example transformer neural network for detecting objects in visual content.
Fig. 3 depicts an example gated transformer neural network for efficiently detecting objects in visual content in accordance with aspects of the present disclosure.
Fig. 4 depicts example operations for efficiently detecting objects in visual content using a gated transformer neural network in accordance with aspects of the present disclosure.
Fig. 5 depicts example operations for efficiently processing visual content using a gated transformer neural network in accordance with aspects of the present disclosure.
FIG. 6 depicts an example pipeline in which binary gates are used in a transformer neural network to efficiently detect objects in visual content, in accordance with aspects of the present disclosure.
FIG. 7 depicts an example gate for selecting features to be included in a feature map for detecting objects in visual content, in accordance with aspects of the present disclosure.
Fig. 8 depicts an example gated transformer neural network in which ternary gates are used to reduce the size of feature maps used to detect objects in visual content, in accordance with aspects of the present disclosure.
Fig. 9 depicts an example implementation of a device on which efficient detection of objects in visual content using a gated transformer neural network may be performed in accordance with aspects of the present disclosure.
To facilitate understanding, identical reference numerals have been used, where possible, to designate identical elements that are common to the figures. It is contemplated that elements and features of one aspect may be beneficially incorporated in other aspects without further recitation.
Detailed Description
Aspects of the present disclosure provide techniques for efficiently processing visual content (e.g., for efficient object detection) using transformer neural networks.
Various types of neural networks may be used to process visual content, such as still images or streams of visual content (e.g., visual content captured as a series of images at a given frame rate, such as 24 frames per second, 29.97 frames per second, 60 frames per second, etc.), in order to, for example, detect objects, predict future motion of objects detected in the visual content, segment the visual content into different semantic groups, and so on. However, these neural networks typically process visual content on a per-frame basis, which can be a computationally expensive process, with complexity increasing as the frame size of each frame in the visual content increases.
In general, transformer neural networks may allow long-range dependencies in sequence data to be modeled. Techniques that reduce the length of the sequence can reduce the computational cost of the attention layers; this in turn may cause the linear projection and feed forward network (FFN) components of transformer neural networks to become computational bottlenecks, so techniques that attempt to improve the efficiency of the attention mechanism may have limited impact on the overall computational efficiency of these transformer neural networks.
To improve the efficiency of the neural network, redundancy in the visual content may be exploited. The use of these redundancies can reduce the computational expense involved in processing visual content. In general, a visual content stream such as video may have both spatial and temporal redundancy. Spatial redundancy generally refers to the portion of video content that is not or minimally relevant to a given task (e.g., object detection). For example, for images captured by cameras in autonomous vehicles, the sky portions in the captured visual content may not be relevant to detecting objects for collision avoidance; however, these neural networks may still process sky portions in the captured visual content, which may be an inefficient use of computing resources. Temporal redundancy generally refers to the temporal correlation between subsequent video frames. In general, a majority of subsequent video frames in the captured visual content may include the same information, e.g., the same content in the background of the frames, stationary objects in the foreground of the frames, etc. Using the autonomous vehicle example again, as the object moves over time, a change may be detected in only a portion of the subsequent video frame. Again, however, these neural networks may not be able to distinguish between portions of the image that have changed and portions of the image that remain unchanged when performing object detection or other computer vision tasks, and thus may use computing resources to process both the unchanged and changed portions of these images. This may be an inefficient use of computing resources (e.g., processor cycles, memory, etc.), and may result in delays in completing object detection tasks, high power utilization, etc.
Some neural network architectures may be configured to take advantage of one type of redundancy to improve the efficiency and performance of object detection and other computer vision tasks using the neural network. For example, a transformer neural network may be configured to exploit only spatial redundancy when performing object detection tasks on a per-frame basis. In another example, a skip-convolution technique may be used to recompute features only for portions of an image that have changed relative to a preceding image. However, a neural network configured to exploit spatial redundancy may still process video content on a frame-independent basis, and a neural network configured to exploit temporal redundancy may still process spatially redundant or irrelevant portions of successive images.
Aspects of the present disclosure provide techniques and apparatus that allow object detection and other computer vision tasks to be performed using neural networks that exploit both spatial and temporal redundancy. As discussed in further detail herein, aspects of the present disclosure may reduce the amount of data to be processed by a neural network to perform object detection or other computer vision tasks by using both spatial redundancy and temporal redundancy when processing successive segments of a data stream, such as successive frames in video content. Thus, these object detection or other computer vision tasks may be accomplished with fewer computing resources, which may reduce the amount of power used by a computing device to perform these tasks and speed up the processing of streaming content, relative to the amount of power and time used when spatial and temporal redundancy are not exploited in performing these tasks.
Example machine learning pipeline for efficient processing of visual content
FIG. 1 depicts an example machine learning pipeline 100 for object detection in visual content. As shown, the machine learning pipeline 100 includes a backbone 110, an encoder stage 120, a decoder stage 130, and a prediction head stage 140, and is configured to generate an output 150 that includes information about objects detected in images input into the machine learning pipeline 100.
As shown, the backbone 110 extracts feature sets from the input image using a neural network (e.g., a transformer neural network, such as DETR or swin transformers). The feature set may be flattened and passed to the encoder stage 120. The location information may be combined with the feature set extracted by the backbone 110, and the combined feature set and location information may be processed by a transformer encoder in the encoder stage 120. In some aspects, the features extracted by the backbone 110 may be features associated with each of a plurality of spatial segments in the input image. The size of the spatial segments may be determined based on the amount of data to be considered when generating the feature set describing the input image. In general, a larger spatial segment (e.g., comprising a greater number of pixels, or covering a greater portion of the input image) may include more data to be compressed into a single feature representation, which may reduce the number of features to be processed in the machine learning pipeline 100, but at the cost of some fidelity of analysis. At the same time, smaller spatial segments (e.g., comprising a smaller number of pixels, or covering a smaller portion of the input image) may allow finer granularity analysis of the input image by generating a greater number of features to be processed in the machine learning pipeline 100, but at the cost of increased computational complexity.
The encoder stage 120 is generally configured to encode the features extracted by the backbone 110 into potential spatial representations of the features. Various attention mechanisms may be used at encoder stage 120 to emphasize certain features in the feature set generated by backbone 110. The output of the encoder stage 120 may be provided as an input to a decoder stage 130 that decodes the feature into one of a plurality of categories. As shown, the decoder stage 130 includes a transformer decoder configured to take as input the encoding features of the input image from the encoder stage 120 (including the positional information associated with these encoding features). The transformer decoder in decoder stage 130 generally attempts to output predictions associated with each encoding feature received from encoder stage 120.
The predictions generated at the decoder stage 130 may be provided to a prediction head stage 140, which ultimately may predict or otherwise identify the presence of various objects in the input image. A feed-forward network may be used to determine whether an object is present at a given portion of the input image (e.g., associated with a given feature generated by the backbone 110 for a given spatial location in the input image). If the feed-forward network at the prediction head stage 140 predicts that an object is present at a given spatial location in the input image, further processing may be performed to determine the type of object located at that spatial location.
In this example, pipeline 100 may generate classifications 152 and 154 for a given input image. The classifications 152 and 154 may correspond to different objects of interest in the input image; here, the objects of interest are two of the objects shown in the input image, and the classifications 152 and 154 may correspond to bounding boxes (of any shape) in which these objects of interest are located in the input image.
Pipeline 100 generally allows objects to be identified in individual images. However, because the pipeline 100 uses convolutional neural networks in the backbone 110 to extract features from the input image, the pipeline 100 may not be able to efficiently identify objects or perform other computer vision tasks on streaming content by exploiting both the spatial redundancy and the temporal redundancy in the streaming content.
FIG. 2 illustrates an example transformer neural network 200 that may be used to detect objects in visual content or perform other computer vision tasks. In general, the transformer neural network 200 includes a self-attention module that models dependencies between different input tokens (e.g., different portions of an input image) and a feed-forward network that deepens a feature representation.
As shown, the transformer neural network may receive as input a set of input tokens 202. Tokens in the input token set 202 may correspond to features extracted from different spatial locations within the input image. The set of input tokens may be represented as a sequence:
$X = \{x_i \mid x_i \in \mathbb{R}^{d},\ i = 1, \ldots, N\}$ (1)

where $N$ represents the number of tokens and $d$ represents the embedding dimension of the space in which the tokens are generated. The token sequence may be assembled into a matrix $X \in \mathbb{R}^{N \times d}$. For image-based computer vision tasks, the input sequence may be converted from an input image $I$ represented by the following equation:

$I \in \mathbb{R}^{C \times H \times W}$ (2)

where $C$ represents a channel (e.g., the red, green, blue, or alpha (transparency) channel in an RGB image; the cyan, yellow, magenta, and black channels in a CMYK image; etc.), $H$ represents the height of the input channel, and $W$ represents the width of the input channel.
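By way of illustration only (not part of the original disclosure), the following Python sketch shows one common way an input image $I \in \mathbb{R}^{C \times H \times W}$ can be converted into a token matrix $X \in \mathbb{R}^{N \times d}$: the image is split into non-overlapping patches, and each flattened patch is linearly projected to the embedding dimension. The patch size, embedding dimension, and random projection weights are assumptions standing in for learned parameters.

```python
import numpy as np

def image_to_tokens(image, patch=16, d=256, rng=np.random.default_rng(0)):
    """Convert an image of shape (C, H, W) into a token matrix of shape (N, d).

    Assumes H and W are divisible by `patch`; N = (H // patch) * (W // patch).
    The projection matrix here is random, standing in for a learned embedding.
    """
    C, H, W = image.shape
    # Split into non-overlapping (patch x patch) spatial segments and flatten each.
    patches = image.reshape(C, H // patch, patch, W // patch, patch)
    patches = patches.transpose(1, 3, 0, 2, 4).reshape(-1, C * patch * patch)  # (N, C*p*p)
    W_embed = rng.standard_normal((C * patch * patch, d)) * 0.02  # stand-in for learned weights
    return patches @ W_embed  # token matrix X with one row per spatial location

frame = np.zeros((3, 224, 224), dtype=np.float32)   # C=3, H=W=224
tokens = image_to_tokens(frame)                      # (196, 256): N = 14 * 14 tokens
print(tokens.shape)
```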
The self-attention module 203 in the transformer neural network 200 generally includes a plurality of linear projection layers 204, 206, 208, an attention map 212, and an output projection layer 218. The linear projection layers 204, 206, and 208 may be configured to convert the input token set 202 into a triplet of query Q205, key K207, and value V209, respectively. That is, each element Y of the triplet may be represented by the following equation:
$Y = XW_Y + B_Y$ (3)

where $Y \in \{Q, K, V\}$, $W_Y \in \mathbb{R}^{d \times d}$, and $B_Y \in \mathbb{R}^{d}$.
To generate the attention map 212, the query Q205 and the key K207 may be combined at a matrix multiplier 210, which may calculate a similarity between the query Q205 and the key K207 and normalize the similarity based on, for example, a softmax function. The attention map 212 and the value V209 may be combined by a matrix multiplier 214 to generate a new set of tokens 216. The new set of tokens 216 may be calculated as a weighted sum of the values V209 with respect to the attention map 212, and may be represented by the following equation:

$X_a = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d}}\right)V$ (4)

where the softmax function is applied to the rows of the similarity matrix (e.g., the attention map 212) and $\sqrt{d}$ is a normalization factor. The output projection layer 218 may be applied to the new set of tokens 216, resulting in a token set 222 represented by the following equation:
$X_O = X_a W_O + B_O$ (5)

where the token set 222 includes the sum, calculated at adder 220, of the input tokens and the output of the output projection layer 218.
The token set 222 may then be processed by a feed forward network 224, which may include a multi-layer perceptron (MLP) with two fully connected layers. In general, MLPs can deepen the feature representation and can widen the hidden embedding dimension between two fully connected layers. The output token 226 generated by the feed forward network 224 may be represented by the following equation:
$X_{FFN} = f(X_O W_1 + B_1)W_2 + B_2$ (6)

where $W_1$ and $W_2$ are the weights of the two fully connected layers, $B_1$ and $B_2$ are the corresponding biases, and $f(\cdot)$ represents a nonlinear activation function.
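The computations in equations (3) through (6) can be sketched in a few lines of Python. This is illustrative only: the weights are randomly initialized stand-ins for the learned parameters of layers 204, 206, 208, 218, and the feed forward network 224, and the choice of ReLU for $f(\cdot)$ is an assumption.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(X, params):
    """Single self-attention block per equations (3)-(6). X has shape (N, d)."""
    d = X.shape[1]
    # Equation (3): linear projections to query, key, and value.
    Q = X @ params["Wq"] + params["Bq"]
    K = X @ params["Wk"] + params["Bk"]
    V = X @ params["Wv"] + params["Bv"]
    # Equation (4): attention map (row-wise softmax of the similarity matrix) applied to V.
    A = softmax(Q @ K.T / np.sqrt(d))
    Xa = A @ V
    # Equation (5): output projection, plus the residual sum at adder 220.
    Xo = Xa @ params["Wo"] + params["Bo"] + X
    # Equation (6): two-layer feed forward network with a nonlinearity.
    H = np.maximum(Xo @ params["W1"] + params["B1"], 0.0)   # f(.) = ReLU here
    return H @ params["W2"] + params["B2"]

N, d, dh = 196, 256, 1024
rng = np.random.default_rng(0)
p = lambda *s: rng.standard_normal(s) * 0.02
params = {"Wq": p(d, d), "Bq": p(d), "Wk": p(d, d), "Bk": p(d), "Wv": p(d, d), "Bv": p(d),
          "Wo": p(d, d), "Bo": p(d), "W1": p(d, dh), "B1": p(dh), "W2": p(dh, d), "B2": p(d)}
print(transformer_block(rng.standard_normal((N, d)), params).shape)  # (196, 256)
```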
Computational analysis of the transformer neural network 200 shows the computational expense of the various components in the transformer neural network 200. Within the backbone of the neural network architecture (e.g., backbone 110 shown in fig. 1 and discussed above), which may be the most computationally complex layer within the overall object detection pipeline, it can be seen that the linear projection layers 204, 206, and 208 consume about 29% of the total number of floating point operations in the pipeline, the attention computation (represented by equation (4) discussed above) consumes about 3% of the total number of floating point operations in the pipeline, and the feed forward network consumes about 52% of the total number of operations in the pipeline. The remainder of the floating point operations in the pipeline may be consumed by the encoder and decoder stages of the pipeline (e.g., encoder stage 120 and decoder stage 130 shown in fig. 1 and discussed above). Thus, it can be seen that attention computation is a computationally inexpensive process, while generating query Q205, key K207, and value V209 through linear projection layers 204, 206, and 208, respectively, can be a computationally expensive process.
In the self-attention module 203 in the transformer neural network 200, the computations are distributed across the linear projection layers 204, 206, and 208, and a matrix multiplier 210 used to generate the attention map 212. The computational complexity of the linear projection layers 204, 206, and 208 may be calculated as $3Nd^2 + Nd^2 = 4Nd^2$, and the computational complexity of the matrix multiplication at the matrix multiplier 210 may be calculated as $2N^2d$. The complexity ratio of the matrix multiplication at the matrix multiplier 210 to the linear projection layers 204, 206, and 208 may be represented by the following equation:

$\dfrac{2N^2 d}{4Nd^2} = \dfrac{N}{2d}$ (7)

When the token sequence is long and the value of $N$ is large, the attention calculation performed by the self-attention module 203 may become a computational bottleneck.
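A short numerical check of the ratio in equation (7), with values of N and d that are assumptions chosen for illustration:

```python
# Complexity of the QKV projections (4*N*d^2) versus the attention matrix
# multiplications (2*N^2*d), per the analysis above.
N, d = 196, 256                       # e.g., 14x14 tokens with a 256-dimensional embedding
proj_flops = 4 * N * d * d            # linear projection layers
attn_flops = 2 * N * N * d            # Q @ K^T and attention-map @ V products
print(proj_flops, attn_flops, attn_flops / proj_flops)  # ratio = N / (2d) ~= 0.38
```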
Example gated transformer neural networks for object detection in visual content
To improve the efficiency of transformer neural networks for detecting objects in a data stream and/or performing other computer vision tasks, aspects of the present disclosure may use gating mechanisms to exploit time and space redundancy in a data stream, such as video content, to reduce the amount of data processed in the transformer neural network. As discussed in further detail below, the gating mechanism may be applied to linear layers (e.g., layers 204, 206, 208 shown in fig. 2 and described above), output projection layer 218, and/or feed forward network 224 in the transformer neural network. As discussed in further detail below, a gating mechanism may be used to identify changed and unchanged content in successive segments in a data stream (such as successive frames in video content), and use the identification of the changed and unchanged content between successive segments to determine which sub-segments of a segment of the data stream should be recalculated. For example, video content may be divided into: (1) Background content, which may be static content, and (2) foreground content, which may change over time. The gating mechanism discussed in further detail herein may allow redundancy information in both the spatial and temporal domains to be transferred from the characteristic output generated for an earlier segment in the data stream to a later segment in the data stream. By doing so, there is no need to recalculate features for spatial and temporal redundancy information, which can reduce the computational expense, power utilization, and computational resource utilization involved in object detection or other computer vision tasks using transformer neural networks.
Fig. 3 illustrates an example of a gated transformer neural network 300 for efficiently detecting objects in visual content in accordance with aspects of the present disclosure. As shown, the structure of the gated transformer neural network 300 may maintain the structure of the transformer neural network 200 shown in fig. 2, and gates 302 may be introduced that are coupled to the linear projection layers 204, 206, and 208, the output projection layer 218, and the feed forward network 224 to allow sharing of information between successive segments of the data stream. As discussed in further detail below, gate 302 may be a binary gate for determining whether to use previously computed features from a previous segment in the data stream or to compute features from a current segment in the data stream, or a ternary gate for determining whether to use previously computed features from a previous segment in the data stream, to compute features from a current segment in the data stream, or to zero the computed features. In general, zeroing features may stop computation of features of particular sub-segments of segments in the data stream (e.g., spatially redundant data in video content) for each of the remaining segments of the data stream, which may reduce the size of the attention map and reduce the number of features to perform the computation. Sharing previously computed features from an earlier segment of the data stream to a later segment of the data stream may allow for reduced computation costs in processing the current segment of the data stream.
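As a conceptual sketch (not the patent's trained gate), the three gate decisions can be applied per token as follows; the state encoding, function names, and shapes are assumptions.

```python
import numpy as np

def apply_ternary_gate(states, prev_features, compute_fn, tokens):
    """Per-token gate decisions for one layer: 0 = zero out (spatially redundant),
    1 = share the previously computed feature (temporally redundant),
    2 = compute the feature from the current frame."""
    out = np.zeros_like(prev_features)
    out[states == 1] = prev_features[states == 1]          # reuse cached features
    out[states == 2] = compute_fn(tokens[states == 2])     # recompute only where needed
    return out                                             # zeroed tokens stay 0
```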
In some aspects, binary gating may be used to improve the efficiency of the feed forward network 224 in the gated transformer neural network 300. Zeroing features and copying previously computed features from previous segments of a data stream may result in similar or identical reductions in computational expense, such as the number of floating point operations (FLOP) performed during processing of segments in the data stream, when coupled to the feed forward network 224. Since zeroing and copying of previously computed features is functionally equivalent, a simpler gating structure (e.g., with fewer states) may be used.
In some aspects, ternary gating may be used to improve the efficiency of query, key and value (QKV) computations performed by the linear projection layers 204, 206, and 208 in the gated transformer neural network 300. Zeroing out the features at the linear projection layers 204, 206, and 208 may result in an overall reduction in computational expense in the self-attention module 203, as zeroing out the features may remove the features from further computation and reduce the number of features to be computed (and potentially recalculated) by the gated transformer neural network 300.
Fig. 4 illustrates example operations 400 for efficiently detecting objects in visual content using a gated transformer neural network in accordance with aspects of the present disclosure. The operations 400 may be performed, for example, by a computing device on which a gated transformer neural network (e.g., the gated transformer neural network 300 shown in fig. 3) is deployed for various computer vision tasks.
As shown, the operation 400 begins at block 410, where a first feature is extracted from a first segment of a data stream and a second feature is extracted from a second segment of the data stream. In general, each of the first features may represent a different spatial portion of a first segment of the data stream and each of the second features may represent a different spatial portion of a second segment of the data stream. The first segment of the data stream may represent data captured at a first point in time and the second segment of the data stream may represent data captured at a second point in time that is later than the first point in time. For example, the data stream may be a video data stream having a plurality of frames. The first segment of the data stream may include a first frame of the plurality of frames in the video data stream (e.g., a frame at time t), and the second segment of the data stream may include a second frame of the plurality of frames in the video data stream (e.g., a frame at time t+1) having a later timestamp than the first frame.
In some aspects, to extract the first feature from the first segment of the data stream, the first segment of the data stream may be divided into a plurality of sub-segments. For each respective sub-segment of the first segment of the data stream, a neural network (e.g., a transformer neural network) may be used to extract a characteristic representation of the data in the respective sub-segment. Similarly, a second segment of the data stream may be divided into a plurality of sub-segments, and for each respective sub-segment of the second segment of the data stream, a neural network may be used to extract a characteristic representation of the data in the respective sub-segment. In general, a given sub-segment of a first segment of a data stream may correspond to a sub-segment of a second segment of the data stream at the same spatial location in the data stream.
At block 420, the first feature and the second feature are concatenated into a combined representation of the first segment of the data stream and the second segment of the data stream. In general, cascading of these features may allow features extracted from a first segment of a data stream and a second segment of the data stream to be combined for identifying changed content and unchanged content between the first segment of the data stream and the second segment of the data stream. For example, cascading features into a combined representation may include averaging the values of each feature, calculating a difference between each feature, or other operations that may be used to mathematically combine the first feature and the second feature.
At block 430, unchanged content and changed content are identified from the combined representation of the first segment of the data stream and the second segment of the data stream. Various techniques may be used to identify changed content and unchanged content from the combined representation. For example, the difference between the average eigenvalue for a given spatial location and the eigenvalue for the given spatial location in the second segment of the data stream may be calculated. If the difference is outside of the threshold level, it may be determined that the given spatial location includes content whose characteristics are to be recalculated by the transformer neural network. Otherwise, it may be determined that the given spatial location includes unchanged content. In another example, where the combined representation includes differences between feature values from a first segment of the data stream and corresponding feature values from a second segment of the data stream, differences for a given feature corresponding to a given spatial location may be analyzed. If the difference for the given feature exceeds a certain threshold, it may be determined that the given spatial location associated with the given feature includes changed content; otherwise, it may be determined that the given spatial location associated with the given feature includes unchanged content.
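One minimal way to realize such a comparison is sketched below; the L2 difference measure and the threshold value are illustrative assumptions rather than the specific test used in the disclosure.

```python
import numpy as np

def changed_token_mask(tokens_prev, tokens_cur, threshold=0.1):
    """Return a boolean mask marking tokens whose content changed between segments.

    tokens_prev and tokens_cur have shape (N, d); a per-token L2 difference larger
    than `threshold` marks the token as changed content whose features should be
    recomputed, while the remaining tokens reuse previously computed features.
    """
    diff = np.linalg.norm(tokens_cur - tokens_prev, axis=1)
    return diff > threshold
```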
In some aspects, where the data stream comprises a video data stream having a plurality of frames, the unchanged content and the changed content may be content on different depth planes (e.g., background or foreground) in different frames. The unchanged content may be, for example, background content shared between a first frame of the video data stream and a second frame of the video data stream. Meanwhile, the changed content may be foreground content changed between the first frame and the second frame.
At block 440, a feature output for a second segment of the data stream is generated from the first feature and the second feature based on the identified unchanged content and the identified changed content. To generate a feature output for the second segment of the data stream, a gate may be used to determine how to generate the feature output. As discussed, binary gates may be used to determine whether to use previously computed features from a first segment of the data stream for a given sub-segment (e.g., a given spatial region in an image) or whether to generate feature outputs based on features extracted from a second segment of the data stream and computed by multiple layers in a transformer neural network.
In some aspects, to generate the feature output for the second segment of the data stream, a binary gate may be used to select how each respective feature in the feature output is to be generated. The first feature may be maintained when the first feature and the corresponding second feature are substantially identical. Otherwise, the binary gate may trigger generation of the output feature for the second feature using the transformer neural network. By doing so, temporal redundancy may be exploited in performing object detection or other computer vision tasks, as recomputing a feature is not required when doing so would not generate significantly different data and would waste computing resources.
In some aspects, to generate the feature output for the second segment of the data stream, a ternary gate may be used to select how each respective feature in the feature output is to be generated. For spatially redundant data, the ternary gate may output a zeroing state, because spatially redundant data may correspond to features that can be removed from the data stream without adversely affecting object detection or other computer vision tasks. The first feature may be maintained when the first feature and the corresponding second feature are substantially identical. Otherwise, the ternary gate may trigger generation of the output feature for the second feature using the transformer neural network. Using a ternary gate, both spatial and temporal redundancy may be exploited in performing object detection or other computer vision tasks, as features need not be computed for irrelevant data, and recomputation is not required when it would not generate significantly different data and would waste computational resources.
At block 450, a plurality of objects in the data stream are identified based on the feature output for the second segment of the data stream. As discussed, to identify an object from a feature output for a second segment of a data stream, the feature output may be encoded into a potential spatial representation by an encoder neural network, and the potential spatial representation of the feature output may be decoded into one of a plurality of classifications using a decoder neural network. The feed forward network may be used to determine whether a sub-segment of the second segment of the data stream corresponds to an object of interest and, if so, what kind of object is included in the sub-segment.
At block 460, one or more actions are taken based on identifying a plurality of objects in the data stream. For example, in an autonomous vehicle deployment, actions taken based on identifying a plurality of objects in a data stream may include controlling a motor vehicle to avoid collisions with the identified objects, such as applying a brake to slow or stop the motor vehicle, accelerating the motor vehicle, and/or steering the motor vehicle around the identified objects. In some aspects, in a data compression example, a compression level may be selected for each sub-segment based on whether the sub-segment of the second segment of the data stream corresponds to background data or an object of interest (e.g., in foreground data). Since the background data may not be of interest, a higher degree of compression may be used to reduce the size of the background data. In general, higher degrees of compression may correspond to higher amounts of information loss; thus, a lower degree of compression (or lossless compression) may be used to compress the sub-segments corresponding to the object of interest in order to preserve visual details in the data stream that may be considered "important".
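A toy sketch of such per-region compression selection follows; the quality values and the region descriptions are assumptions for illustration.

```python
# Assign a compression quality per sub-segment: background regions (not of
# interest) tolerate lossier compression, while regions containing detected
# objects are kept at higher quality to preserve "important" visual detail.
def quality_for_region(is_object_of_interest: bool) -> int:
    return 95 if is_object_of_interest else 40   # assumed JPEG-style quality levels

regions = [{"id": 0, "object": True}, {"id": 1, "object": False}]
plan = {r["id"]: quality_for_region(r["object"]) for r in regions}
print(plan)  # {0: 95, 1: 40}
```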
Fig. 5 illustrates example operations 500 for efficiently processing visual content using a gated transformer neural network in accordance with aspects of the present disclosure. The operations 500 may be performed, for example, by a computing device on which a gated transformer neural network (e.g., the gated transformer neural network 300 shown in fig. 3) is deployed for various computer vision tasks.
As shown, the operation 500 begins at block 510, where a first set of tokens is generated from a first frame of a video stream and a second set of tokens is generated from a second frame of the video stream. In general, each token in the first token set may represent a different spatial portion of a first segment of the data stream and each token in the second token set may represent a different spatial portion of a second segment of the data stream. The first frame may be, for example, a frame captured at time t, and the second frame may be a frame having a later timestamp than the first frame (e.g., a frame captured at time t+1).
At block 520, operation 500 continues with identifying a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame. In general, to identify the first and second sets of tokens, tokens in the first token set may be compared to corresponding tokens in the second token set (e.g., using the binary or ternary gates discussed above with respect to fig. 3). In general, a comparison between a token in the first token set and the corresponding token in the second token set may be used to determine a difference between a spatial region in the first frame and the corresponding spatial region in the second frame.
Various techniques may be used to identify the changed and unchanged portions of the second frame of the video stream relative to the first frame of the video stream. For example, the difference between an average token value for a given spatial location and the token value for that spatial location in the second frame of the video stream may be calculated. If the difference is outside of a threshold level, it may be determined that the given spatial location includes changed content whose features are to be recalculated from the second frame via the transformer neural network. Otherwise, it may be determined that the given spatial location includes unchanged content. In another example, a difference between a token value from the first frame of the video stream and the corresponding token value from the second frame of the video stream may be analyzed. If the difference for the given token exceeds a certain threshold, it may be determined that the given spatial location associated with the given token includes changed content; otherwise, it may be determined that the given spatial location associated with the given token includes unchanged content.
At block 530, operation 500 continues to generate a feature output for a portion of the second frame corresponding to the second token set. In general, to generate a feature output for each of the portions of the second frame corresponding to the second token set, the portions of the second frame may be processed by a neural network trained to extract feature representations from data in the portions of the second frame.
At block 540, operation 500 continues with combining the features associated with the first token set and the generated feature output for the portion of the second frame corresponding to the second token set into a representation of the second frame of the video stream. In general, this combination allows features extracted from the first frame of the video stream to be combined with features computed from the second frame such that a neural network, such as a transformer neural network, is used to process a portion, but not all, of the second frame. By doing so, both temporal redundancy (i.e., similarity between successive frames in the video content) and spatial redundancy (i.e., similarity between different portions of the same frame) can be exploited in processing frames from the video stream.
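A minimal sketch of this combination step, assuming a boolean recompute mask produced by the token comparison (the names and shapes are assumptions):

```python
def combine_frame_features(prev_features, new_features, recompute_mask):
    """Assemble the representation of the second frame.

    prev_features: (N, d) features cached from the first frame.
    new_features: (M, d) feature output computed only for the M tokens flagged
        for recomputation (M = recompute_mask.sum()).
    recompute_mask: (N,) boolean mask from the token comparison.
    """
    out = prev_features.copy()          # start from reused (temporally redundant) features
    out[recompute_mask] = new_features  # overwrite only the changed portions of the frame
    return out
```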
In some aspects, the unchanged content and the changed content may be content on different depth planes (e.g., background or foreground) in different frames. The unchanged content may be, for example, background content shared between a first frame of the video data stream and a second frame of the video data stream. Meanwhile, the changed content may be foreground content changed between the first frame and the second frame.
In some aspects, to generate a feature output for a second segment of the data stream, a gate may be used to determine how to generate the feature output. As discussed, binary gates may be used to determine whether to use previously computed features from a first segment of the data stream for a given sub-segment (e.g., a given spatial region in an image) or whether to generate feature outputs based on features extracted from a second segment of the data stream and computed by multiple layers in a transformer neural network.
In some aspects, to generate the feature output for the second segment of the data stream, a binary gate may be used to select how each respective feature in the feature output is to be generated. The first feature may be maintained when the first feature and the corresponding second feature are substantially identical. Otherwise, the binary gate may trigger generation of the output feature for the second feature using the transformer neural network. By doing so, temporal redundancy may be exploited in performing object detection or other computer vision tasks, as recomputing a feature is not required when doing so would not generate significantly different data and would waste computing resources.
In some aspects, to generate the feature output for the second segment of the data stream, a ternary gate may be used to select how each respective feature in the feature output is to be generated. For spatially redundant data, the ternary gate may output a zeroing state, because spatially redundant data may correspond to features that can be removed from the data stream without adversely affecting object detection or other computer vision tasks. The first feature may be maintained when the first feature and the corresponding second feature are substantially identical. Otherwise, the ternary gate may trigger generation of the output feature for the second feature using the transformer neural network. Using a ternary gate, both spatial and temporal redundancy may be exploited in performing object detection or other computer vision tasks, as features need not be computed for irrelevant data, and recomputation is not required when it would not generate significantly different data and would waste computational resources.
The feature output may be used for various computer vision tasks. For example, a plurality of objects in the data stream may be identified based on the feature output for the second segment of the data stream. As discussed, to identify an object from a feature output for a second segment of a data stream, the feature output may be encoded into a potential spatial representation by an encoder neural network, and the potential spatial representation of the feature output may be decoded into one of a plurality of classifications using a decoder neural network. The feed forward network may be used to determine whether a sub-segment of the second segment of the data stream corresponds to an object of interest and, if so, what kind of object is included in the sub-segment.
One or more actions may then be taken based on identifying the plurality of objects in the data stream. For example, in an autonomous vehicle deployment, actions taken based on identifying a plurality of objects in a data stream may include controlling a motor vehicle to avoid collisions with the identified objects, such as applying a brake to slow or stop the motor vehicle, accelerating the motor vehicle, and/or steering the motor vehicle around the identified objects. In some aspects, in a data compression example, a compression level may be selected for each sub-segment based on whether the sub-segment of the second segment of the data stream corresponds to background data or an object of interest (e.g., in foreground data). Since the background data may not be of interest, a higher degree of compression may be used to reduce the size of the background data. In general, higher degrees of compression may correspond to higher amounts of information loss; thus, a lower degree of compression (or lossless compression) may be used to compress the sub-segments corresponding to the object of interest in order to preserve visual details in the data stream that may be considered "important".
Fig. 6 depicts an example pipeline 600 in which binary gates are used in a transformer neural network to efficiently detect objects in visual content, in accordance with aspects of the present disclosure.
As shown, pipeline 600 includes a gate computation stage 610, a conditional feature computation stage 620, and a feature combination stage 630. In general, the first Frame (designated "Frame 1") may be the initial Frame in the captured video content and may be fully processed without using gates to determine whether to calculate features or to use previously calculated features from previous frames in the captured video content. Thus, for Frame1, the gate computation stage 610 may be omitted, and features may be computed from Frame1 (or tokens extracted from Frame1 representing each of a plurality of spatial fragments of Frame 1) using one or more linear projection layers (e.g., QKV projection layers for generating queries, keys, and value matrices of Frame 1). These features generated by one or more linear projection layers may be output as features of Frame 1.
For subsequent frames, a gate may be used to determine whether to calculate a feature or to use a previously calculated feature. Thus, for Frame 1 and Frame 2, tokens extracted from these frames may be input into the binary gate. For a particular feature located at a particular spatial location in Frame 1 and Frame 2, the tokens may be compared at the gate computation stage 610 to determine whether the previously computed feature may be borrowed for that token or whether the token's feature is to be recalculated. If it is determined at the gate computation stage 610 that the feature is to be recalculated, a feature may be generated for the token in Frame 2 by one or more layers in the conditional feature computation stage 620. Otherwise, the feature for the token may be borrowed from the previous frame without further computation. At the feature combination stage 630, a binary gate may be used to determine whether to output the previously computed feature for the token or the newly computed feature for that token in the feature output of Frame 2.
Similarly, for Frame 3, the gate computation stage 610 may determine whether a previously computed feature (e.g., from Frame 1 or Frame 2) for a given token (e.g., a spatial segment) may be used to represent the corresponding token in Frame 3. If a previously computed feature can be used for a given token (e.g., temporal redundancy can be exploited), the gate can generate a signal that prevents the feature from being recalculated for the given token at the conditional feature computation stage 620. Otherwise, the gate may generate a signal that triggers recalculation of the feature for the given token at the conditional feature computation stage 620. At the feature combination stage 630, a binary gate may be used to output the previously computed feature for a token or the newly computed feature for that token. In general, using these gates, a feature is computed when a threshold amount of change has occurred, and need not be computed (e.g., it can be shared from a previous frame) until enough change in content is detected between frames that the previously computed feature no longer accurately represents the given token (e.g., spatial segment) of the frame.
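The frame-by-frame flow of pipeline 600 might be sketched as follows, where extract_tokens, gate_decision, and compute_features are hypothetical stand-ins for the gate computation stage 610, the conditional feature computation stage 620, and the backbone feature computation.

```python
def process_stream(frames, extract_tokens, gate_decision, compute_features):
    """Process a sequence of frames, reusing cached features where the gate allows."""
    cached_tokens, cached_features = None, None
    outputs = []
    for frame in frames:
        tokens = extract_tokens(frame)
        if cached_features is None:
            # Frame 1: no gate computation; all features are computed.
            features = compute_features(tokens)
        else:
            recompute = gate_decision(cached_tokens, tokens)        # boolean mask, shape (N,)
            features = cached_features.copy()                       # borrow unchanged features
            features[recompute] = compute_features(tokens[recompute])
        cached_tokens, cached_features = tokens, features
        outputs.append(features)
    return outputs
```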
Fig. 7 depicts an example gate 700 for selecting features to be included in a feature map for detecting objects in visual content, in accordance with aspects of the present disclosure. As shown, a first input frame 702 and a second input frame 704 may be input into linear projection layers 706 and 708, respectively, to generate intermediate features 710 and 712. The intermediate features 710 and 712 may be concatenated at a cascade 714 into a combined feature representation, and the combined feature representation of the first input frame 702 and the second input frame 704 may be input into a linear projection layer 716 to fuse information from the first input frame 702 and the second input frame 704. The output of the linear projection layer 716 may be Logit 718, a set of raw prediction scores that may be processed to determine whether to borrow previously computed features, compute features for a given token, or zero out features (if a ternary gate is used).
To train the gate 722, a Gumbel softmax sampler may be applied to the Logit 718 generated by the linear projection layer 716. For a binary gate, each logit corresponds to a feature generated for a corresponding token in the first input frame 702 and the second input frame 704, so there is one state per token. The Logit 718 may be represented by the following vector:
$S = (S_1, \ldots, S_i, \ldots, S_N)^{T}$ (8)
where $S_i \in \mathbb{R}$, $i$ indexes the token, and $T$ here indicates a matrix transpose. A sigmoid function:

$Z = \mathrm{sigmoid}(S / T) = \dfrac{1}{1 + e^{-S/T}}$ (9)

may be applied to the Logit 718, where $T$ is here a temperature value (e.g., 2/3). The binary gate may be derived by thresholding the gate state $Z$ at a value of 0.5, such that the output $G$ of the binary gate 722 for a given token $i$ is represented by the following equation:
$G_i = \begin{cases} 1, & Z_i \geq 0.5 \\ 0, & Z_i < 0.5 \end{cases}$ (10)

The output features, and thus the output of the gate 722, may be calculated by selecting features from either the first input frame 702 or the second input frame 704 according to the following equation:

$X_g = G \odot X_c + (1 - G) \odot X_p$ (11)

where $X_p$ represents features from the first input frame 702 (i.e., the previous frame), $X_c$ represents features from the second input frame 704 (i.e., the current frame), $X_g$ represents the gated features, and $\odot$ represents the Hadamard product of two matrices of equal size. In some aspects, when $G = 0$, there is no need to calculate $X_c$, which can achieve a reduction in computational expense in the neural network.
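A compact sketch of the binary gate of fig. 7, following equations (8) through (11); the weight shapes, initialization, and the dense formulation (computing X_c for all tokens) are simplifying assumptions, since in practice X_c would simply not be computed where G = 0.

```python
import numpy as np

def binary_gate(x_prev, x_cur, w_gate, b_gate, temperature=2.0 / 3.0):
    """Binary gate per equations (8)-(11).

    x_prev, x_cur: (N, d) intermediate features 710 and 712 from the two frames.
    w_gate: (2 * d, 1) weights of the fused linear projection layer 716 (assumed shape).
    Returns the gated features X_g and the per-token gate values G.
    """
    fused = np.concatenate([x_prev, x_cur], axis=1)          # concatenation at cascade 714
    logits = fused @ w_gate + b_gate                          # Logit 718, shape (N, 1)
    z = 1.0 / (1.0 + np.exp(-logits / temperature))           # sigmoid with temperature, eq. (9)
    g = (z >= 0.5).astype(x_cur.dtype)                        # thresholded gate, eq. (10)
    x_g = g * x_cur + (1.0 - g) * x_prev                      # eq. (11): Hadamard selection
    return x_g, g
```

At inference time, tokens with G = 0 would reuse the cached features rather than computing X_c at all, which is where the computational savings come from.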
In some aspects, an $L_1$ loss function may be minimized to sparsify the binary gates. The loss function may be represented by the following equation:

$\mathcal{L}_{gate} = \sum_{l} \gamma \,\mathrm{FLOP}_l \,\left\lVert G^{l} \right\rVert_1$ (12)

where $l$ is the layer index, $\gamma$ is a regularization factor, $G^{l}$ is the vector of gate values for the layer with index $l$, and $\mathrm{FLOP}_l$ represents the computational complexity of that layer. In general, the computational complexity of a layer may be calculated based on the number of mathematical operations (e.g., additions and multiplications) performed when generating the output of the layer for a given input of features from a segment of a data stream. By regularizing the loss term based on the computational complexity of the linear projection layers in the transformer neural network, balanced compression of the different layers can be achieved.
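A sketch of a FLOP-weighted sparsity penalty along these lines (the exact functional form and the value of gamma are assumptions):

```python
def gate_sparsity_loss(gate_values_per_layer, flops_per_layer, gamma=1e-9):
    """FLOP-weighted L1 penalty on binary gate activations (illustrative form).

    gate_values_per_layer: list of per-layer gate vectors G^l with entries in [0, 1].
    flops_per_layer: list of FLOP_l values, one per gated layer.
    """
    return gamma * sum(f * g.sum() for g, f in zip(gate_values_per_layer, flops_per_layer))
```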
Fig. 8 depicts an example gated transformer neural network 800 in which ternary gates are used to reduce the size of feature maps used to detect objects in visual content, in accordance with aspects of the present disclosure. As shown, in block 810, a ternary gate is used to reduce the size of a feature (attention) map representing a segment of a data stream (e.g., a frame in a video content stream), and in block 850, the reduced feature map is used to lower the computational complexity involved in processing a subsequent segment of the data stream and identifying objects and/or performing other computer vision tasks on the subsequent segment of the data stream.
As discussed above, the ternary gates may be used for layers 814, 816, 818 (e.g., QKV projection layers discussed above) that are used to generate the query Q815, the key K817, and the value V819 in the self-attention module. The architecture of the ternary gate may follow the architecture of gate 700 shown in fig. 7 and discussed above. However, unlike binary gates, ternary gates for layers 814, 816, and 818 may have three states: a zeroing state; a shared state for using previously computed features from previous segments of the data stream; and a computation state for generating features from the current segment of the data stream. Thus, the output Logit of the final linear projection layer can be expressed as:
$S = [S_{:,1},\, S_{:,2},\, S_{:,3}]$ (13)
where $S \in \mathbb{R}^{N \times 3}$ and $S_{:,i}$ denotes the $i$-th column of $S$. For each token, the three states $S_{:,1}$, $S_{:,2}$, and $S_{:,3}$ correspond to the zeroing state, the sharing state, and the computing state, respectively. In the zeroing state, the token may be replaced with a zero value, indicating a removable token. The attention map 822 may be generated by combining the query Q815 and the key K817 at the transposer 820. Thus, at block 824, the attention computation described above with respect to equation (4) may be performed on a smaller feature set. Computing over a smaller feature set may reduce computational costs, but at the cost of information loss in the attention map 822 and in the attention maps used for processing subsequent segments in the data stream. Additionally, the new set of tokens 826 generated at block 810 may include computed tokens and zeroed tokens representing the output of the transformer neural network for a first segment of the data stream.
In block 850, as shown, input tokens 852 may be generated for a subsequent segment of the data stream and processed through linear projection layers 854, 856, and 858 to generate a query Q855, a key K857, and a value V859. Because certain features are zeroed out and removed at block 810 (where the first frame is processed through the transformer neural network), the query Q855, key K857, and value V859 in block 850 may be smaller than the query Q815, key K817, and value V819 in block 810. The query Q855 and key K857 may be processed by the transposer 860 to generate an attention map 862, which may also be smaller than the attention map 822 in block 810 (which includes zeroed values for multiple tokens). The value V859 and the attention map 862 may be combined into a set of tokens 866 by a matrix multiplier 864. The set of tokens 866 may be extended with zeroed data to generate a new set of tokens 868 representing the subsequent segment of the data stream.
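The following sketch illustrates, under assumptions, how attention might be computed only over the tokens that were not zeroed out, with the output then expanded with zeroed data in the manner described for tokens 866 and 868. The index-gathering scheme, layer shapes, and function name are illustrative assumptions rather than the disclosed implementation.

```python
import torch
import torch.nn.functional as F

def attention_on_kept_tokens(tokens: torch.Tensor,
                             keep_idx: torch.Tensor,
                             w_q: torch.nn.Linear,
                             w_k: torch.nn.Linear,
                             w_v: torch.nn.Linear) -> torch.Tensor:
    """Run single-head self-attention only over the retained tokens.

    tokens:   (num_tokens, dim) tokens for the current segment
    keep_idx: indices of tokens that were not zeroed out by the gate
    Returns a (num_tokens, dim) output, expanded with zeros at the
    positions of removed tokens.
    """
    kept = tokens[keep_idx]                         # smaller token set
    q, k, v = w_q(kept), w_k(kept), w_v(kept)       # smaller Q, K, V
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    out = torch.zeros_like(tokens)                  # extend with zeroed data
    out[keep_idx] = attn @ v
    return out

# Hypothetical usage: 16 tokens of dimension 32, keeping every other token.
dim = 32
proj_q, proj_k, proj_v = (torch.nn.Linear(dim, dim) for _ in range(3))
x = torch.randn(16, dim)
y = attention_on_kept_tokens(x, torch.arange(0, 16, 2), proj_q, proj_k, proj_v)
```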
Similar to the binary gates discussed above, in the ternary gates used in the gated transformer neural network 800, the softmax function may be applied to the output logits according to equation (9) discussed above. The value G of the ternary gate can be determined by comparing the values of the three states in the output logits described in equation (13) according to the following equation:

G_:,i = 1 if S_:,i = max(S_:,1, S_:,2, S_:,3), and G_:,i = 0 otherwise (evaluated per token)
Thus, the final output features of the ternary gate can be calculated according to the following equation:

X_g = G_:,2 ⊙ X_p + G_:,3 ⊙ X_c

where the zeroing state (G_:,1 = 1) contributes a zero output for the corresponding token.
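A minimal sketch of this three-way selection is shown below, assuming the gate state is chosen per token as the largest of the three logits, with 0-based state indices for zeroing, sharing, and computing. Both the ordering of the states and the argmax-based selection are assumptions for illustration.

```python
import torch

def ternary_gate_output(x_prev: torch.Tensor,
                        x_curr: torch.Tensor,
                        logits: torch.Tensor) -> torch.Tensor:
    """Combine features according to a three-state (ternary) gate.

    logits: (tokens, 3) outputs of the final linear projection layer,
            with columns assumed ordered as (zeroing, sharing, computing).
    A token in the zeroing state is replaced with zeros, the sharing
    state reuses the cached feature from the previous frame, and the
    computing state uses the newly computed feature.
    """
    state = logits.argmax(dim=-1, keepdim=True)   # 0 = zero, 1 = share, 2 = compute
    zeros = torch.zeros_like(x_curr)
    return torch.where(state == 2, x_curr,
                       torch.where(state == 1, x_prev, zeros))
```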
The loss function of the ternary gate may be minimized to make the ternary gate sparse. For example, a ternary gate may be trained to minimize the loss function represented by the following equation:

ℒ = Σ_l FLOP_l (γ_1 ‖G_:,1^l‖_1 + γ_2 ‖G_:,2^l‖_1 + γ_3 ‖G_:,3^l‖_1)
where l is the layer index, γ_1, γ_2, and γ_3 are regularization factors for the zeroing state, the sharing state, and the computing state, respectively, and FLOP_l represents the computational complexity of the layer with index l. For example, the regularization factors γ_1, γ_2, and γ_3 may be selected to balance the zeroing state, sharing state, and computing state so that a sufficient amount of data is retained to ensure accuracy of object detection or other computer vision tasks.
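The following sketch illustrates one way such a three-factor, FLOP-weighted regularizer could be computed from per-layer state probabilities. The averaging over tokens and the exact combination are assumptions, not the disclosed formulation.

```python
import torch

def ternary_gate_loss(state_probs: list[torch.Tensor],
                      flops_per_layer: list[float],
                      gammas: tuple[float, float, float]) -> torch.Tensor:
    """FLOP-weighted regularization over the three gate states.

    state_probs: per-layer tensors of shape (tokens, 3) holding the
                 softmax probabilities of the (zeroing, sharing,
                 computing) states for each token.
    gammas:      regularization factors (gamma_1, gamma_2, gamma_3)
                 balancing the three states.
    """
    loss = torch.zeros(())
    for probs, flops in zip(state_probs, flops_per_layer):
        occupancy = probs.mean(dim=0)  # average occupancy of each state
        loss = loss + flops * (gammas[0] * occupancy[0]
                               + gammas[1] * occupancy[1]
                               + gammas[2] * occupancy[2])
    return loss
```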
In general, aspects of the present disclosure may significantly reduce the computational expense involved in object detection tasks while maintaining similar accuracy metrics. As discussed herein, the gated transformer neural network can reduce the average computational effort by 40% with similar accuracy, as measured by the mean intersection-over-union (mIoU) metric. A further reduction in computational effort can be achieved with a minimal reduction in accuracy relative to the non-gated transformer neural networks discussed above (e.g., DETR).
Example processing system for efficient processing of visual content using gated transformer neural networks
Fig. 9 depicts an example processing system 900 for processing visual content using a gated transformer neural network (e.g., for object detection or other computer vision tasks), such as described herein, for example, with respect to figs. 4 and 5.
The processing system 900 includes a Central Processing Unit (CPU) 902, which in some examples may be a multi-core CPU. The instructions executed at the CPU 902 may be loaded, for example, from a program memory associated with the CPU 902, or may be loaded from the memory 924.
The processing system 900 also includes additional processing components tailored for specific functions, such as a Graphics Processing Unit (GPU) 904, a Digital Signal Processor (DSP) 906, a Neural Processing Unit (NPU) 908, a multimedia processing unit 910, and a wireless connectivity component 912.
NPUs, such as NPU 908, are generally dedicated circuits configured to implement the control and arithmetic logic for performing machine learning algorithms, such as algorithms for processing Artificial Neural Networks (ANNs), Deep Neural Networks (DNNs), Random Forests (RFs), and the like. An NPU may sometimes alternatively be referred to as a Neural Signal Processor (NSP), Tensor Processing Unit (TPU), Neural Network Processor (NNP), Intelligence Processing Unit (IPU), Vision Processing Unit (VPU), or graph processing unit.
NPUs, such as NPU 908, are configured to accelerate performance of common machine learning tasks, such as image classification, machine translation, object detection, and various other predictive models. In some examples, multiple NPUs may be instantiated on a single chip, such as a system on a chip (SoC), while in other examples, multiple NPUs may be part of a dedicated neural network accelerator.
The NPU may be optimized for training or inference, or in some cases configured to balance performance between the two. For NPUs that are capable of both training and inferring, these two tasks can still generally be performed independently.
NPUs designed to accelerate training are generally configured to accelerate optimization of new models, which is a highly computationally intensive operation involving inputting an existing dataset (typically labeled or tagged), iterating over the dataset, and then adjusting model parameters (such as weights and biases) in order to improve model performance. In general, optimizing based on mispredictions involves passing back through layers of the model and determining gradients to reduce prediction errors.
NPUs designed to accelerate inference are generally configured to operate on a complete model. Thus, such NPUs may be configured to input new pieces of data and quickly process the new pieces of data through a trained model to generate model outputs (e.g., inferences).
In one implementation, the NPU 908 is part of one or more of the CPU 902, GPU 904, and/or DSP 906.
In some examples, wireless connectivity component 912 may include subcomponents such as those used for third generation (3G) connectivity, fourth generation (4G) connectivity (e.g., 4G LTE), fifth generation connectivity (e.g., 5G or NR), Wi-Fi connectivity, Bluetooth connectivity, and other wireless data transmission standards. The wireless connectivity component 912 is further connected to one or more antennas 914.
The processing system 900 can also include one or more sensor processing units 916 associated with any manner of sensor, one or more Image Signal Processors (ISPs) 918 associated with any manner of image sensor, and/or a navigation component 920, which can include satellite-based positioning system components (e.g., GPS or GLONASS) as well as inertial positioning system components.
The processing system 900 may also include one or more input and/or output devices 922, such as a screen, touch-sensitive surfaces (including touch-sensitive displays), physical buttons, speakers, microphones, and so forth.
In some examples, one or more processors of processing system 900 may be based on an ARM or RISC-V instruction set.
The processing system 900 also includes a memory 924, which represents one or more static and/or dynamic memories, such as dynamic random access memory, flash-based static memory, or the like. In this example, memory 924 includes computer-executable components that are executable by one or more of the foregoing processors of processing system 900.
Specifically, in this example, the memory 924 includes a feature extraction component 924A, a feature concatenation/combination component 924B, a content change identification component 924C, a feature output generation component 924D, an object identification component 924E, an action taking component 924F, a token generation component 924G, and a token comparison component 924H. The depicted components, as well as other non-depicted components, may be configured to perform various aspects of the methods described herein.
In general, the processing system 900 and/or components thereof may be configured to perform the methods described herein.
It is noted that in other aspects, features of processing system 900 may be omitted, such as where processing system 900 is a server computer or the like. For example, in other aspects, multimedia processing unit 910, wireless connectivity component 912, sensor processing unit 916, ISP 918, and/or navigation component 920 may be omitted. Further, features of processing system 900 may be distributed, such as between devices that train a model and devices that use the trained model to generate inferences.
Example clauses
Implementation details are described in the numbered clauses below.
Clause 1: a method for detecting objects in a data stream using a machine learning model, the method comprising: extracting a first feature from a first segment of the data stream and a second feature from a second segment of the data stream; concatenating the first feature and the second feature into a combined representation of the first segment of the data stream and the second segment of the data stream; identifying unchanged content and changed content from the combined representation of the first segment of the data stream and the second segment of the data stream; generating a feature output for the second segment of the data stream from the first feature and the second feature based on the identified unchanged content and the identified changed content; identifying a plurality of objects in the data stream based on the characteristic output for the second segment of the data stream using a transformer neural network; and taking one or more actions based on identifying the plurality of objects in the data stream.
Clause 2: the method of clause 1, wherein: the data stream comprises a video data stream having a plurality of frames, the first segment of the data stream comprising a first frame of the plurality of frames in the video data stream, and the second segment of the data stream comprising a second frame of the plurality of frames in the video data stream, the second frame having a later timestamp than the first frame.
Clause 3: the method of clause 2, wherein: the unchanged content includes background content in the first frame and the second frame, and the changed content includes foreground content in the first frame and the second frame.
Clause 4: the method of any one of clauses 1-3, wherein: extracting the first feature from the first segment of the data stream includes generating a feature representation of data in the respective sub-segment of the first segment of the data stream for each respective sub-segment of the first segment of the data stream, extracting the second feature from the second segment of the data stream includes generating a feature representation of data in the respective sub-segment of the second segment of the data stream for each respective sub-segment of the second segment of the data stream, and each respective sub-segment of the first segment of the data stream corresponds to a sub-segment of the second segment of the data stream in the same spatial location in the data stream.
Clause 5: the method of any of clauses 1-4, wherein generating the feature output for the second segment of the data stream comprises, for each respective feature of the first feature and the second feature: retaining the respective feature from the first feature when the respective feature from the first feature is the same as a corresponding feature from the second feature; and generating, by the transformer neural network, an output feature for the respective feature of the second feature when the respective feature from the first feature is different from the corresponding feature from the second feature.
Clause 6: the method of clause 5, wherein generating the feature output for the second segment of the data stream comprises generating feature output by a binary gate trained to minimize a loss function as a function of computational complexity for each of a plurality of layers used to generate the first feature and the second feature.
Clause 7: the method of any of clauses 1-6, wherein generating the feature output for the second segment of the data stream comprises, for each respective feature of the first feature and the second feature: outputting a zero state based on determining that the respective feature corresponds to removable data in the data stream; retaining individual respective features from individual first features when the respective features are the same as corresponding features from individual second features; and generating, by the transformer neural network, an output feature for the respective feature of the second feature when the respective feature from the first feature is different from the corresponding feature from the second feature.
Clause 8: the method of clause 7, wherein generating the feature output for the second segment of the data stream comprises generating the feature output by a ternary gate trained to minimize a loss function as a function of a computational complexity of each of a plurality of layers used to generate the first feature and the second feature and a regularization factor for each of the zero state, a shared state in which the first feature is the same as the second feature, and a computational state in which the first feature is different from the second feature.
Clause 9: a method for processing a video stream using a machine learning model, the method comprising: generating a first token set from a first frame of the video stream and a second token set from a second frame of the video stream; identifying a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame based on a comparison of tokens from the first token set with corresponding tokens in the second token set; generating a feature output for a portion of the second frame corresponding to the second token set; and combining features associated with the first token set with the generated feature output for the portion of the second frame corresponding to the second token set into a representation of the second frame of the video stream.
Clause 10: the method of clause 9, wherein the second frame of the video stream comprises a frame having a later timestamp than the first frame.
Clause 11: the method of clause 9 or 10, wherein: the first set of tokens corresponds to unchanged content in the first and second frames, and the second set of tokens corresponds to changed content in the first and second frames.
Clause 12: the method of clause 11, wherein: the unchanged content includes background content in the first frame and the second frame, and the changed content includes foreground content in the first frame and the second frame.
Clause 13: the method of any one of clauses 9 to 12, wherein: generating the first token set includes generating a representation of data in the respective sub-segments of the first frame of the video stream for each respective sub-segment of the first frame of the video stream, generating the second token set includes generating a characteristic representation of data in the respective sub-segments of the second frame of the video stream for each respective sub-segment of the second frame of the video stream, and each respective sub-segment of the first frame of the video stream corresponds to a sub-segment of the second frame of the video stream in a same spatial location.
Clause 14: the method of any of clauses 9-13, wherein the first token set and the second token set are identified by inputting the first token set and the second token set via a binary gate trained to minimize a loss function as a function of a computational complexity of each of a plurality of layers used to generate the feature associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set.
Clause 15: the method of any of clauses 9-14, further comprising identifying a third token set corresponding to removable data in the video stream, wherein the feature output excludes features corresponding to the third token set.
Clause 16: the method of clause 15, wherein the first token set, the second token set, and the third token set are identified by a triple gate trained to minimize: a loss function as a function of the computational complexity of each of a plurality of layers for generating the features associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set; and regularization factors for each of a zero state, a shared state in which the tokens in the first token set are the same as the corresponding tokens in the second token set, and a calculated state in which the tokens in the first token set are different from the tokens in the second token set.
Clause 17: a processing system, the processing system comprising: a memory including computer-executable instructions; and one or more processors configured to execute the computer-executable instructions and to cause the processing system to perform the method according to any one of clauses 1-16.
Clause 18: a processing system comprising means for performing the method of any of clauses 1-16.
Clause 19: a non-transitory computer-readable medium comprising computer-executable instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the method of any of clauses 1-16.
Clause 20: a computer program product embodied on a computer readable storage medium, comprising code for performing the method of any of clauses 1-16.
Additional considerations
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. The examples discussed herein are not limited in scope, applicability, or aspect to the description set forth in the claims. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. For example, changes may be made in the function and arrangement of elements discussed without departing from the scope of the disclosure. Various examples may omit, replace, or add various procedures or components as appropriate. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Furthermore, features described with respect to some examples may be combined in some other examples. For example, an apparatus may be implemented or a method practiced using any number of the aspects set forth herein. In addition, the scope of the present disclosure is intended to cover such an apparatus or method that is practiced using other structure, functionality, or both in addition to or instead of the aspects of the present disclosure set forth herein. It should be understood that any aspect of the disclosure disclosed herein may be embodied by one or more elements of the invention.
As used herein, the term "exemplary" means "serving as an example, instance, or illustration." Any aspect described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other aspects.
As used herein, a phrase referring to "at least one item in a list of items" refers to any combination of these items (which includes a single member). For example, at least one of "a, b, or c" is intended to cover a, b, c, a-b, a-c, b-c, and a-b-c, as well as any combination of multiple identical elements (e.g., a-a-a, a-a-b, a-a-c, a-b-b, a-c-c, b-b-c, c-c, and c-c-c, or any other ordering of a, b, and c).
As used herein, the term "determining" encompasses a wide variety of actions. For example, "determining" may include calculating, computing, processing, deriving, researching, looking up (e.g., looking up in a table, database, or other data structure), ascertaining, and the like. Further, "determining" may include receiving (e.g., receiving information), accessing (e.g., accessing data in memory), and so forth. Further, "determining" may include resolving, selecting, establishing, and the like.
The methods disclosed herein comprise one or more steps or actions for achieving the method. The steps and/or actions of the methods may be interchanged with one another without departing from the scope of the claims. That is, unless a particular order of steps or actions is specified, the order and/or use of particular steps and/or actions may be modified without departing from the scope of the claims. Furthermore, the various operations of the methods described above may be performed by any suitable means capable of performing the corresponding functions. The component may include various hardware and/or software components and/or modules including, but not limited to, a circuit, an Application Specific Integrated Circuit (ASIC), or a processor. Generally, where there are operations shown in the figures, those operations may have corresponding components plus features.
The following claims are not intended to be limited to the aspects shown herein but are to be accorded the full scope consistent with the language of the claims. Within the claims, reference to an element in the singular is not intended to mean "one and only one" unless explicitly so stated, but rather "one or more". The term "some" means one or more unless specifically stated otherwise. No claim element should be construed in accordance with 35 U.S.C. § 112(f) unless the element is explicitly recited using the phrase "means for" or, in the case of a method claim, the element is recited using the phrase "step for." All structural and functional equivalents to the elements of the aspects described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims. Furthermore, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the claims.

Claims (25)

1. A processor-implemented method for processing a video stream using a machine learning model, the processor-implemented method comprising:
Generating a first token set from a first frame of the video stream and a second token set from a second frame of the video stream;
Identifying a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame based on a comparison of tokens from the first token set with corresponding tokens in the second token set; generating a feature output for a portion of the second frame corresponding to the second token set; and
Combining features associated with the first token set with the generated feature output for the portion of the second frame corresponding to the second token set into a representation of the second frame of the video stream.
2. The method of claim 1, wherein the second frame of the video stream comprises a frame having a later timestamp than the first frame.
3. The method according to claim 1, wherein:
The first token set corresponds to unchanged content in the first frame and the second frame, and
The second set of tokens corresponds to changed content in the first frame and the second frame.
4. A method according to claim 3, wherein:
The unchanged content includes background content in the first frame and the second frame, and
The changed content includes foreground content in the first frame and the second frame.
5. The method according to claim 1, wherein:
Generating the first token set includes generating a representation of data in each respective sub-segment of the first frame of the video stream for the respective sub-segment of the first frame of the video stream,
Generating the second token set includes generating, for each respective sub-segment of the second frame of the video stream, a characteristic representation of data in the respective sub-segment of the second frame of the video stream, and
Each respective sub-segment of the first frame of the video stream corresponds to a sub-segment of the second frame of the video stream in the same spatial location.
6. The method of claim 1, wherein the first and second token sets are identified by inputting the first and second token sets via a binary gate trained to minimize a loss function as a function of a computational complexity of each of a plurality of layers used to generate the feature associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set.
7. The method of claim 1, further comprising identifying a third token set corresponding to removable data in the video stream, wherein the feature output excludes features corresponding to the third token set.
8. The method of claim 7, wherein the first token set, the second token set, and the third token set are identified by a ternary gate trained to minimize: a loss function as a function of the computational complexity of each of a plurality of layers for generating the features associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set; and regularization factors for each of a zero state, a shared state in which the tokens in the first token set are the same as the corresponding tokens in the second token set, and a calculated state in which the tokens in the first token set are different from the tokens in the second token set.
9. A system for processing a video stream using a machine learning model, the system comprising:
A memory having stored thereon computer executable instructions; and
A processor configured to execute the computer-executable instructions to cause the system to:
Generating a first token set from a first frame of the video stream and a second token set from a second frame of the video stream;
Identifying a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame based on a comparison of tokens from the first token set with corresponding tokens in the second token set;
generating a feature output for a portion of the second frame corresponding to the second token set; and
Combining features associated with the first token set with the generated feature output for the portion of the second frame corresponding to the second token set into a representation of the second frame of the video stream.
10. The system of claim 9, wherein the second frame of the video stream comprises a frame having a later timestamp than the first frame.
11. The system of claim 9, wherein:
The first token set corresponds to unchanged content in the first frame and the second frame, and
The second set of tokens corresponds to changed content in the first frame and the second frame.
12. The system of claim 11, wherein:
The unchanged content includes background content in the first frame and the second frame, and
The changed content includes foreground content in the first frame and the second frame.
13. The system of claim 9, wherein:
To generate the first token set, the processor is configured to cause the system to generate, for each respective sub-segment of the first frame of the video stream, a representation of data in the respective sub-segment of the first frame of the video stream;
To generate the second token set, the processor is configured to cause the system to generate, for each respective sub-segment of the second frame of the video stream, a characteristic representation of data in the respective sub-segment of the second frame of the video stream; and
Each respective sub-segment of the first frame of the video stream corresponds to a sub-segment of the second frame of the video stream in the same spatial location.
14. The system of claim 9, wherein the first and second token sets are identified by inputting the first and second token sets via a binary gate trained to minimize a loss function as a function of a computational complexity of each of a plurality of layers used to generate the feature associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set.
15. The system of claim 9, wherein the processor is further configured to cause the system to identify a third token set corresponding to removable data in the video stream, wherein the feature output excludes features corresponding to the third token set.
16. The system of claim 15, wherein the first token set, the second token set, and the third token set are identified by a ternary gate trained to minimize: a loss function as a function of the computational complexity of each of a plurality of layers for generating the features associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set; and regularization factors for each of a zero state, a shared state in which the tokens in the first token set are the same as the corresponding tokens in the second token set, and a calculated state in which the tokens in the first token set are different from the tokens in the second token set.
17. A processing system for processing a video stream using a machine learning model, the processing system comprising:
Means for generating a first token set from a first frame of the video stream and a second token set from a second frame of the video stream;
Means for identifying a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame based on a comparison of tokens from the first token group with corresponding tokens in the second token group;
means for generating a feature output for a portion of the second frame corresponding to the second token set; and
Means for combining features associated with the first token set with the generated feature output for the portion of the second frame corresponding to the second token set into a representation of the second frame of the video stream.
18. The processing system of claim 17, wherein the second frame of the video stream comprises a frame having a later timestamp than the first frame.
19. The processing system of claim 17, wherein:
The first token set corresponds to unchanged content in the first frame and the second frame, and
The second set of tokens corresponds to changed content in the first frame and the second frame.
20. The processing system of claim 19, wherein:
The unchanged content includes background content in the first frame and the second frame, and
The changed content includes foreground content in the first frame and the second frame.
21. The processing system of claim 17, wherein:
the means for generating the first token set includes means for generating, for each respective sub-segment of the first frame of the video stream, a representation of data in the respective sub-segment of the first frame of the video stream;
The means for generating the second token set comprises means for generating, for each respective sub-segment of the second frame of the video stream, a characteristic representation of data in the respective sub-segment of the second frame of the video stream; and
Each respective sub-segment of the first frame of the video stream corresponds to a sub-segment of the second frame of the video stream in the same spatial location.
22. The processing system of claim 17, wherein the means for identifying the first and second token sets comprises means for inputting the first and second token sets through a binary gate trained to minimize a loss function as a function of a computational complexity of each of a plurality of layers used to generate the feature associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set.
23. The processing system of claim 17, wherein the means for identifying is configured to identify a third token set corresponding to removable data in the video stream, wherein the feature output excludes features corresponding to the third token set.
24. The processing system of claim 23, wherein the means for identifying the first token set, the second token set, and the third token set comprises a ternary gate trained to minimize: a loss function as a function of the computational complexity of each of a plurality of layers for generating the features associated with the first token set and the feature output generated for the portion of the second frame corresponding to the second token set; and regularization factors for each of a zero state, a shared state in which the tokens in the first token set are the same as the corresponding tokens in the second token set, and a calculated state in which the tokens in the first token set are different from the tokens in the second token set.
25. A non-transitory computer-readable medium having instructions stored thereon, which when executed by a processor, cause the processor to perform operations for processing a video stream using a machine learning model, the operations comprising:
Generating a first token set from a first frame of the video stream and a second token set from a second frame of the video stream;
Identifying a first set of tokens associated with features to be reused from the first frame and a second set of tokens associated with features to be calculated from the second frame based on a comparison of tokens from the first token set with corresponding tokens in the second token set; generating a feature output for a portion of the second frame corresponding to the second token set; and
Combining features associated with the first token set with the generated feature output for the portion of the second frame corresponding to the second token set into a representation of the second frame of the video stream.
CN202280060174.XA 2021-09-21 2022-09-21 Processing video content using gated transformer neural networks Pending CN118251704A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US63/246,643 2021-09-21
US17/933,840 US20230090941A1 (en) 2021-09-21 2022-09-20 Processing video content using gated transformer neural networks
US17/933,840 2022-09-20
PCT/US2022/076752 WO2023049726A1 (en) 2021-09-21 2022-09-21 Processing video content using gated transformer neural networks

Publications (1)

Publication Number Publication Date
CN118251704A true CN118251704A (en) 2024-06-25

Family

ID=91564705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280060174.XA Pending CN118251704A (en) 2021-09-21 2022-09-21 Processing video content using gated transformer neural networks

Country Status (1)

Country Link
CN (1) CN118251704A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118570711A (en) * 2024-08-03 2024-08-30 武汉理工大学 Identification method and system based on trust risk distribution token and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination