CN117115199A - Quantization method, tracking method and device of target tracking model


Info

Publication number
CN117115199A
Authority
CN
China
Prior art keywords
module
target tracking
quantization
model
branch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311107931.7A
Other languages
Chinese (zh)
Inventor
李三琦
姚早
宋周奂
闵莹
任俊臣
姚元章
张运豪
朴昶范
俞炳仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung China Semiconductor Co Ltd
Samsung Electronics Co Ltd
Original Assignee
Samsung China Semiconductor Co Ltd
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung China Semiconductor Co Ltd, Samsung Electronics Co Ltd filed Critical Samsung China Semiconductor Co Ltd
Priority to CN202311107931.7A priority Critical patent/CN117115199A/en
Publication of CN117115199A publication Critical patent/CN117115199A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A quantization method, a tracking method, and a tracking device for a target tracking model are provided. The quantization method comprises the following steps: acquiring a Transformer-based target tracking model, wherein the target tracking model comprises a template branch, a search branch, a concatenation module, and a first Transformer module, and wherein the concatenation module receives a first feature output from the template branch and a second feature output from the search branch and concatenates the first feature and the second feature into a concatenated feature; generating an optimized target tracking model by deleting the concatenation module from the target tracking model and splitting the first Transformer module into a second Transformer module and a third Transformer module, wherein the second Transformer module receives the first feature and the third Transformer module receives the second feature; and generating a quantization model corresponding to the optimized target tracking model by quantizing the second Transformer module and the third Transformer module separately.

Description

Quantization method, tracking method and device of target tracking model
Technical Field
The present disclosure relates to the field of Computer Vision (CV), and more particularly, to a quantization method, tracking method, and apparatus for a target tracking model.
Background
In recent years, artificial neural network-based methods, particularly Convolutional Neural Networks (CNNs), have achieved great success in many applications, especially in the field of computer vision. However, convolution operations lack a global understanding of the image and cannot model long-range dependencies between features, so that context information is not fully utilized. Accordingly, researchers have attempted to migrate the Transformer model from the field of natural language processing into computer vision tasks.
The Transformer is a deep neural network based mainly on the self-attention mechanism. Compared to other network types (e.g., CNNs and RNNs), Transformer-based models show competitive or even better performance on various vision benchmarks. The advent of the Vision Transformer has greatly promoted the development and application of the Transformer in the vision field.
With the popularity of deep learning as a universal tool in daily life, many electronic devices have been given intelligence, such as smartphones, unmanned aerial vehicles, and autonomous vehicles. While neural networks have succeeded in many applications, they generally incur high computational costs. To integrate state-of-the-art neural networks into edge electronic devices with stringent power consumption and computational constraints, it is critical to reduce the power consumption and latency of neural network inference.
In particular, a Transformer-based target tracking model for tracking targets across video frames has high computational costs and is difficult to integrate into edge electronic devices with stringent power consumption and computational constraints, so that it is difficult for electronic devices (e.g., smartphones, drones, and autonomous vehicles) to quickly and accurately track targets in video frames using such a model. Accordingly, a method for enabling an electronic device deployed with a Transformer-based target tracking model to quickly and accurately track targets in video frames is desired.
Disclosure of Invention
The present disclosure aims to provide a quantization method and apparatus for a target tracking model, and a target tracking method.
According to an aspect of the present disclosure, there is provided a quantization method of a Transformer-based target tracking model, the quantization method including: acquiring a Transformer-based target tracking model, wherein the target tracking model includes a template branch, a search branch, a concatenation module, and a first Transformer module, wherein the concatenation module receives a first feature output from the template branch and a second feature output from the search branch and concatenates the first feature and the second feature into a concatenated feature; generating an optimized target tracking model by deleting the concatenation module from the target tracking model and splitting the first Transformer module into a second Transformer module and a third Transformer module, wherein the second Transformer module receives the first feature and the third Transformer module receives the second feature; and generating a quantization model corresponding to the optimized target tracking model by quantizing the second Transformer module and the third Transformer module separately.
Optionally, the second Transformer module and the third Transformer module have a first multi-head attention mechanism module and a second multi-head attention mechanism module, respectively, wherein the first multi-head attention mechanism module receives a first vector generated in the second Transformer module, and the second multi-head attention mechanism module receives a concatenated vector obtained by concatenating the first vector generated in the second Transformer module and a second vector generated in the third Transformer module.
Optionally, the first vector includes a key vector corresponding to the template branch and a value vector corresponding to the template branch, the second vector includes a key vector corresponding to the search branch and a value vector corresponding to the search branch, and the concatenated vector includes: a vector generated by concatenating the key vector corresponding to the template branch with the key vector corresponding to the search branch, and a vector generated by concatenating the value vector corresponding to the template branch with the value vector corresponding to the search branch.
Optionally, the first multi-head attention mechanism module further receives a query vector corresponding to the template branch, and the second multi-head attention mechanism module further receives a query vector corresponding to the search branch.
Optionally, the step of quantizing the second Transformer module and the third Transformer module separately includes: obtaining a calibration data set comprising a video sequence, wherein the video sequence comprises a plurality of consecutive frames; forming a first target calibration data set from a first frame of the video sequence; selecting one frame from the plurality of consecutive frames according to the frame rate to represent the plurality of consecutive frames, and forming a second target calibration data set from the selected frame together with a first vector generated by the second Transformer module based on the first target calibration data set; and quantizing the second Transformer module based on the first target calibration data set and quantizing the third Transformer module based on the second target calibration data set.
Optionally, the number of the plurality of consecutive frames is equal to a value corresponding to the frame rate.
According to an aspect of the present disclosure, there is provided a Transformer-based target tracking method, the target tracking method including: acquiring a video sequence as an input to a quantization model; inputting a first frame of the video sequence to a template branch in the quantization model to extract a global template feature; inputting a plurality of frames of the video sequence to a search branch in the quantization model to extract a plurality of search features; and outputting a target tracking result from the quantization model based on the global template feature and the plurality of search features.
According to an aspect of the present disclosure, there is provided a quantization apparatus of a Transformer-based target tracking model, the quantization apparatus including: a target tracking model acquisition module configured to acquire a Transformer-based target tracking model, wherein the target tracking model includes a template branch, a search branch, a concatenation module, and a first Transformer module, wherein the concatenation module receives a first feature output from the template branch and a second feature output from the search branch and concatenates the first feature and the second feature into a concatenated feature; an optimized target tracking model generation module configured to generate an optimized target tracking model by deleting the concatenation module from the target tracking model and splitting the first Transformer module into a second Transformer module and a third Transformer module, wherein the second Transformer module receives the first feature and the third Transformer module receives the second feature; and a quantization module configured to generate a quantization model corresponding to the optimized target tracking model by quantizing the second Transformer module and the third Transformer module separately.
Optionally, the second Transformer module and the third Transformer module have a first multi-head attention mechanism module and a second multi-head attention mechanism module, respectively, wherein the first multi-head attention mechanism module receives a first vector generated in the second Transformer module, and the second multi-head attention mechanism module receives a concatenated vector obtained by concatenating the first vector generated in the second Transformer module and a second vector generated in the third Transformer module.
Optionally, the first vector includes a key vector corresponding to the template branch and a value vector corresponding to the template branch, the second vector includes a key vector corresponding to the search branch and a value vector corresponding to the search branch, and the concatenated vector includes: a vector generated by concatenating the key vector corresponding to the template branch with the key vector corresponding to the search branch, and a vector generated by concatenating the value vector corresponding to the template branch with the value vector corresponding to the search branch.
Optionally, the first multi-head attention mechanism module further receives a query vector corresponding to the template branch, and the second multi-head attention mechanism module further receives a query vector corresponding to the search branch.
Optionally, the quantization module is configured to: obtain a calibration data set comprising a video sequence, wherein the video sequence comprises a plurality of consecutive frames; form a first target calibration data set from a first frame of the video sequence; select one frame from the plurality of consecutive frames according to the frame rate to represent the plurality of consecutive frames, and form a second target calibration data set from the selected frame together with a first vector generated by the second Transformer module based on the first target calibration data set; and quantize the second Transformer module based on the first target calibration data set and quantize the third Transformer module based on the second target calibration data set.
Optionally, the number of the plurality of consecutive frames is equal to a value corresponding to the frame rate.
According to an aspect of the present disclosure, there is provided a Transformer-based target tracking device, the device including: an acquisition module configured to acquire a video sequence as an input to the quantization model described above; an input module configured to input a first frame of the video sequence to a template branch in the quantization model to extract a global template feature, and to input a plurality of frames of the video sequence to a search branch in the quantization model to extract a plurality of search features; and a tracking module configured to output a target tracking result from the quantization model based on the global template feature and the plurality of search features.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by one or more computing devices, causes the one or more computing devices to implement any of the quantization methods described above.
According to an aspect of the present disclosure, there is provided an electronic device including one or more computing devices and one or more storage devices, the one or more storage devices having a computer program recorded thereon which, when executed by the one or more computing devices, causes the one or more computing devices to implement any of the quantization methods described above.
According to the quantization method of the Transformer-based target tracking model of the present disclosure, the second Transformer module and the third Transformer module can be quantized independently of each other without changing the output value of the target tracking model. This ensures that the target tracking model retains good accuracy after quantization, so that an electronic device deployed with the Transformer-based target tracking model can quickly and accurately track targets in video frames.
According to the example embodiments of the present disclosure, in consideration of the characteristics of video, one frame is selected from the plurality of frames within a second to represent that second; thus, the amount of computation in the quantization process can be reduced while the accuracy of the computation is maintained.
In the conversion from the Transformer-based target tracking model to the optimized target tracking model according to the example embodiments of the present disclosure, the network structure is optimized on the principle that the template branch and the search branch are computed independently as far as possible, and the optimized model structure is quantization-friendly.
That is, according to the example embodiments of the present disclosure, the template branch and the search branch are computed independently as far as possible without changing the output value of the model, so as to ensure that the model retains good accuracy after quantization.
According to example embodiments of the present disclosure, since global template features may be calculated only once for a single video sequence and the calculation results are saved in global variables for use in calculation of search branches, the calculation amount may be reduced and the speed of target tracking may be improved.
According to the example embodiments of the present disclosure, since the features of the template image need to be computed only once for any one video sequence, the already-computed template features can be reused directly for each subsequent search image. The part of the network that involves the template is therefore separated from the part that involves the search, the template is computed only once, and the result is saved in a global variable for use in the computation of each search image.
Drawings
The foregoing and other objects and features of the present disclosure will become more apparent from the following description taken in conjunction with the accompanying drawings, which illustrate example embodiments, in which:
FIG. 1 illustrates a flowchart of a method of quantizing a Transformer-based target tracking model according to an example embodiment of the disclosure;
FIG. 2 illustrates a flowchart of a method of quantizing the second Transformer module and the third Transformer module separately, according to an example embodiment of the present disclosure;
FIG. 3 illustrates a flow chart of an exemplary PTQ method;
FIG. 4 illustrates a Transformer-based target tracking model and an optimized target tracking model according to an example embodiment of the present disclosure;
FIG. 5 illustrates a typical attention computing architecture;
FIG. 6 illustrates a flowchart of a Transformer-based target tracking method according to an example embodiment of the disclosure;
FIG. 7 shows a schematic diagram of a network corresponding to the Transformer-based target tracking method of FIG. 6, according to an example embodiment of the disclosure;
FIG. 8 illustrates a quantization apparatus of a Transformer-based target tracking model, according to an example embodiment of the present disclosure;
FIG. 9 illustrates a Transformer-based target tracking device according to an example embodiment of the disclosure;
fig. 10 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
The following detailed description is provided to assist the reader in obtaining a thorough understanding of the methods, apparatus, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent upon an understanding of the present disclosure. For example, the order of operations described herein is merely an example and is not limited to those set forth herein, but may be altered as will be apparent upon an understanding of the disclosure of the application, except for operations that must occur in a specific order. Furthermore, descriptions of features known in the art may be omitted for clarity and conciseness.
The features described herein may be embodied in different forms and should not be construed as limited to the examples described herein. Rather, the examples described herein have been provided to illustrate only some of the many possible ways to implement the methods, devices, and/or systems described herein, many of which will be apparent after having appreciated the present disclosure.
As used herein, the term "and/or" includes any one of the listed items associated as well as any combination of any two or more.
Although terms such as "first," "second," and "third" may be used herein to describe various elements, components, regions, layers or sections, these elements, components, regions, layers or sections should not be limited by these terms. Rather, these terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first member, first component, first region, first layer, or first portion referred to in the examples described herein may also be referred to as a second member, second component, second region, second layer, or second portion without departing from the teachings of the examples.
In the description, when an element (such as a layer, region, or substrate) is referred to as being "on," "connected to," or "coupled to" another element, it can be directly "on," "connected to," or "coupled to" the other element, or there may be one or more other elements intervening therebetween. In contrast, when an element is referred to as being "directly on," "directly connected to," or "directly coupled to" another element, there are no other elements intervening therebetween.
The terminology used herein is for the purpose of describing various examples only and is not intended to be limiting of the disclosure. Singular forms also are intended to include plural forms unless the context clearly indicates otherwise. The terms "comprises," "comprising," and "having" specify the presence of stated features, amounts, operations, components, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, amounts, operations, components, elements, and/or combinations thereof.
Unless defined otherwise, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs after having the knowledge of this disclosure. Unless explicitly so defined herein, terms (such as those defined in a general dictionary) should be construed to have meanings consistent with their meanings in the context of the relevant art and the present disclosure, and should not be interpreted idealized or overly formal.
In addition, in the description of the examples, when it is considered that a detailed description of well-known related structures or functions would render the explanation of the present disclosure ambiguous, such a detailed description is omitted.
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. However, the embodiments may be implemented in various forms and are not limited to the embodiments described herein.
To facilitate an understanding of the present disclosure, quantization is first described.
The quantization algorithm mainly comprises the following two types:
1. Post-Training Quantization (PTQ)
The PTQ algorithm can directly convert a pre-trained FP32 network into a fixed-point network without repeating the original training process. This approach is typically data-free or requires only a small calibration data set, which is usually very easy to obtain, and no labeled data set is needed. Furthermore, since this approach requires almost no hyper-parameter tuning, it can be used through a single API call, like a black box, to quantize a pre-trained neural network model in an efficient manner.
2. Quantization-Aware Training (QAT)
The QAT algorithm relies on retraining the neural network with simulated quantization, modeling the quantization noise source during training. This allows the model to find a better solution than post-training quantization. However, the higher accuracy comes with higher training costs: longer training times, labeled data sets, hyper-parameter searches, and so on.
QAT can achieve higher accuracy than PTQ, but PTQ generally takes much less time. The quantization method according to the example embodiments of the present disclosure is mainly a PTQ method, and is one way of improving the quantization accuracy of PTQ.
The method of quantizing a Transformer-based target tracking model and related aspects of the present disclosure are described in more detail below in connection with example embodiments.
Fig. 1 illustrates a flowchart of a method of quantizing a Transformer-based target tracking model according to an example embodiment of the present disclosure.
Referring to fig. 1, in step S110, a Transformer-based target tracking model may be acquired.
A Transformer-based target tracking model can be used to track targets across video frames. That is, the model may receive a video frame as input and track objects (e.g., by way of example only, people, vehicles, etc.) in the video frame by performing corresponding processing on the video frame (e.g., Transformer-based target tracking processing).
Furthermore, the Transformer-based target tracking model may be any existing Transformer-based model for tracking any target, generated and acquired in any manner. For example, a Transformer-based target tracking model may include a template branch, a search branch, a concatenation module, and a first Transformer module, the concatenation module receiving a first feature output from the template branch and a second feature output from the search branch and concatenating the first feature and the second feature into a concatenated feature.
The template branch may be used to extract features (e.g., the first feature) of the template image, which refers to the rectangular region framing the target object in the initial frame. The search branch is used to extract features (e.g., the second feature) of the search image, which refers to the search rectangle in each subsequent frame, and the template image features are matched within the search image features to locate the target. The first Transformer module may be used to perform target tracking based on the concatenated feature.
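For illustration only, the structure just described can be sketched in PyTorch-style Python as follows; the class and attribute names (TrackerWithConcat, template_branch, head, and so on) are hypothetical stand-ins, not names from the patent:

    import torch
    import torch.nn as nn

    class TrackerWithConcat(nn.Module):
        # Hypothetical Transformer-based tracker before optimization.
        def __init__(self, template_branch, search_branch, transformer, head):
            super().__init__()
            self.template_branch = template_branch  # extracts the first feature
            self.search_branch = search_branch      # extracts the second feature
            self.transformer = transformer          # the "first Transformer module"
            self.head = head                        # predicts the target location

        def forward(self, template_img, search_img):
            t_feat = self.template_branch(template_img)  # first feature
            s_feat = self.search_branch(search_img)      # second feature
            # concatenation module: both features enter one Transformer together
            feat = torch.cat([t_feat, s_feat], dim=1)
            return self.head(self.transformer(feat))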
In step S120, an optimized target tracking model may be generated by deleting the concatenation module from the target tracking model and splitting the first Transformer module into a second Transformer module and a third Transformer module, wherein the second Transformer module receives the first feature and the third Transformer module receives the second feature.
Like the Transformer-based target tracking model, the optimized target tracking model can be used to track targets across video frames. That is, the optimized target tracking model may receive a video frame as input and track objects (e.g., by way of example only, people, vehicles, etc.) in the video frame by performing corresponding processing on the video frame (e.g., Transformer-based target tracking processing).
According to example embodiments of the present disclosure, although step S120 makes the optimized target tracking model structurally different from the target tracking model, the output value of the optimized target tracking model may be the same as or similar to the output value of the target tracking model.
In one example embodiment, the second Transformer module and the third Transformer module have a first multi-head attention mechanism module and a second multi-head attention mechanism module, respectively. The first multi-head attention mechanism module receives a first vector generated in the second Transformer module, and the second multi-head attention mechanism module receives a concatenated vector obtained by concatenating the first vector generated in the second Transformer module and a second vector generated in the third Transformer module.
The first multi-head attention mechanism module may process the first vector based on the attention mechanism and provide an output vector for target tracking. Similarly, the second multi-head attention mechanism module may process the concatenated vector based on the attention mechanism and provide an output vector for target tracking.
In one embodiment, the first vector comprises a key vector corresponding to the template branch and a value vector corresponding to the template branch, the second vector comprises a key vector corresponding to the search branch and a value vector corresponding to the search branch, and the concatenated vector comprises: a vector generated by concatenating the key vector corresponding to the template branch with the key vector corresponding to the search branch, and a vector generated by concatenating the value vector corresponding to the template branch with the value vector corresponding to the search branch.
Optionally, the first multi-head attention mechanism module further receives a query vector corresponding to the template branch, and the second multi-head attention mechanism module further receives a query vector corresponding to the search branch.
In one example, the first multi-head attention mechanism module may also receive a third vector generated in the second Transformer module, and may process the first vector and the third vector based on the attention mechanism to provide an output vector for target tracking. For example, the third vector may include the query vector corresponding to the template branch. The second multi-head attention mechanism module may also receive a fourth vector generated in the third Transformer module, and may process the concatenated vector and the fourth vector based on the attention mechanism to provide an output vector for target tracking. For example, the fourth vector may include the query vector corresponding to the search branch.
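Putting these pieces together, a minimal sketch of the split attention computation follows, assuming single-head attention for brevity and assuming proj_t and proj_s are linear layers that project each branch onto its query, key, and value; only the routing (template attends to itself, search attends to the concatenated keys and values) is taken from the description above:

    import torch
    import torch.nn.functional as F

    def split_attention(t_x, s_x, proj_t, proj_s, d_k):
        # second Transformer module: Q/K/V come from the template feature only
        t_q, t_k, t_v = proj_t(t_x).chunk(3, dim=-1)
        t_scores = t_q @ t_k.transpose(-2, -1) / d_k ** 0.5
        t_out = F.softmax(t_scores, dim=-1) @ t_v
        # third Transformer module: its own query, but keys and values are the
        # concatenation of template and search keys/values (the concatenated vector)
        s_q, s_k, s_v = proj_s(s_x).chunk(3, dim=-1)
        k = torch.cat([t_k, s_k], dim=1)
        v = torch.cat([t_v, s_v], dim=1)
        s_scores = s_q @ k.transpose(-2, -1) / d_k ** 0.5
        s_out = F.softmax(s_scores, dim=-1) @ v
        return t_out, s_out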
In step S130, a quantization model corresponding to the optimized target tracking model may be generated by quantizing the second Transformer module and the third Transformer module separately.
Like the optimized target tracking model, the corresponding quantization model may be used to track targets across video frames: it may receive a video frame as input and track objects (e.g., by way of example only, people, vehicles, etc.) in the video frame by performing corresponding processing on the video frame (e.g., Transformer-based target tracking processing). Compared with the Transformer-based target tracking model and the optimized target tracking model, the quantization model reduces the power consumption and computational requirements on the system hardware while limiting the loss of target tracking accuracy.
Further, the second Transformer module and the third Transformer module may be calibrated separately, with statistics (e.g., the dynamic ranges of the weights and activation values) collected for each module from its own calibration data. The quantization of the second Transformer module may be based on the parameters corresponding to the second Transformer module, and the quantization of the third Transformer module may be based on the parameters corresponding to the third Transformer module. That is, the quantization of the second Transformer module and the quantization of the third Transformer module may be independent of each other.
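As a sketch of what such independent calibration could look like, the following helper records a per-module activation range, treating each module as a single-input callable for simplicity (the module and data-set names are placeholders, not names from the patent):

    def calibrate_range(module, calib_inputs):
        # Record a per-module min/max over the calibration activations (sketch).
        lo, hi = float("inf"), float("-inf")
        for x in calib_inputs:
            y = module(x)
            lo = min(lo, y.min().item())
            hi = max(hi, y.max().item())
        return lo, hi

    # Each module is calibrated on its own data with its own ranges, so the
    # template statistics and the search statistics are never mixed.
    t_range = calibrate_range(second_transformer, first_target_calib_set)
    s_range = calibrate_range(third_transformer, second_target_calib_set)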
In the quantization process of an existing Transformer-based target tracking model, quantization of the first Transformer module involves quantizing data generated by processing data from the template branch together with data from the search branch. Since the data from the two branches differ in size and content, and therefore in data amount and data range, concatenating them and quantizing them together with the same set of maximum and minimum values can lead to large quantization errors.
In contrast, according to the quantization method of the Transformer-based target tracking model of the example embodiments of the present disclosure, the second Transformer module and the third Transformer module can be quantized independently of each other without changing the output value of the target tracking model. This ensures that the target tracking model retains good accuracy after quantization, so that an electronic device deployed with the Transformer-based target tracking model can quickly and accurately track targets in video frames.
Further, a quantization model corresponding to the optimized target tracking model receives the input image and tracks the target in the input image.
Fig. 2 shows a flowchart of a method of quantizing the second Transformer module and the third Transformer module separately, according to an example embodiment of the present disclosure.
Referring to fig. 2, in step S210, a calibration data set may be acquired that includes a video sequence, the video sequence including a plurality of consecutive frames.
The data set for target tracking is a set of video sequences. There are various methods for selecting the calibration data list from it, such as: randomly selecting a number of video sequences; randomly selecting a proportion of the pictures from all video sequences; randomly selecting a number of video sequences and then selecting a proportion of the pictures within each video; and selecting video sequences according to the category of the object in the pictures. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
In step S220, the first frame of the video sequence may be formed into a first target calibration data set.
In step S230, the plurality of consecutive frames may be represented by selecting one frame from the plurality of consecutive frames according to the frame rate.
In one embodiment, the number of consecutive frames is equal to a value corresponding to the frame rate. For example, when the frame rate is 30 frames/second, the number of the plurality of consecutive frames may be 30 or a value close to 30. For another example, when the frame rate is 25 frames/second, the number of the plurality of consecutive frames may be 25 or a value close to 25.
This embodiment exploits a characteristic of video: most videos have a specific frame rate (for example, 30 frames/second, and generally not lower than 25 frames/second), and the 25 or so pictures within one second differ little from one another. According to the disclosure, one frame can therefore be selected to represent that second, which reduces the amount of computation in the quantization process while maintaining the accuracy of the computation.
For example, the pictures in each video may be grouped into groups of 25, with at most one picture selected from each group; random selection, equidistant selection, and the like can be used to build the calibration data list. The number of selected calibration samples may also be controlled so as not to exceed a predetermined percentage (e.g., 1%) of the total number of pictures.
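A sketch of such a selection scheme, assuming frames are grouped in 25s and that the 1% cap applies to the total number of pictures (all names are illustrative):

    import random

    def select_calibration_frames(videos, group_size=25, cap_ratio=0.01):
        # Pick at most one frame from each group of consecutive frames.
        selected = []
        total = sum(len(frames) for frames in videos)
        for frames in videos:
            for start in range(0, len(frames), group_size):
                group = frames[start:start + group_size]
                selected.append(random.choice(group))  # or group[0] for equidistant picking
        cap = max(1, int(total * cap_ratio))  # e.g., at most 1% of all pictures
        return selected[:cap]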
In step S240, the selected frame may be combined with the first vector generated by the second Transformer module based on the first target calibration data set to form a second target calibration data set.
For the same video, the first vector attached to each piece of calibration data in the second target calibration data set is unchanged, since the first vector is derived by the second Transformer module from the first frame of the video as input.
In this way, the first vector is computed not for every frame but only from the representative first frame; therefore, the computation time can be reduced and the computation efficiency improved.
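Building on the select_calibration_frames helper sketched earlier, the two target calibration data sets could be assembled as follows; template_branch and second_transformer are assumed callables, not names from the patent:

    def build_calibration_sets(videos, template_branch, second_transformer):
        first_set, second_set = [], []
        for frames in videos:
            first_set.append(frames[0])
            # the first vector is computed once per video from its first frame
            first_vector = second_transformer(template_branch(frames[0]))
            for frame in select_calibration_frames([frames]):
                # each calibration sample pairs a representative frame with
                # the cached first vector of its video
                second_set.append((frame, first_vector))
        return first_set, second_set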
In step S250, the second Transformer module may be quantized based on the first target calibration data set, and the third Transformer module may be quantized based on the second target calibration data set.
Any quantization method may be used here, for example, a PTQ method. However, the present disclosure does not limit the choice of quantization method.
Fig. 3 shows a flow chart of a typical PTQ method.
Referring to fig. 3, in step S310, quantized data (also referred to as calibration data) may be prepared. The quantized data may be a small subset of the training data set. However, the quantized data is not limited thereto, and may be other data prepared in advance.
In step S320, a trained model (e.g., a full-precision (FP32) model) may be calibrated, i.e., the dynamic ranges of the weights and activation values of the model may be collected statistically.
In step S330, quantization parameters may be obtained using the calibration result, and the model may be quantized based on the quantization parameters.
In step S340, a quantized model (e.g., a fixed-point (INT8) model) may be obtained for subsequent deployment and testing.
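For illustration, asymmetric affine INT8 quantization derived from a calibrated range [lo, hi] can be sketched as follows (a minimal sketch assuming lo <= 0 <= hi, as is typical for activation ranges):

    import torch

    def quantize_int8(x, lo, hi):
        # Map the calibrated float range [lo, hi] onto the integers 0..255.
        scale = max(hi - lo, 1e-8) / 255.0
        zero_point = int(round(-lo / scale))  # assumes lo <= 0 <= hi
        q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255)
        return q, scale, zero_point

    def dequantize(q, scale, zero_point):
        return (q - zero_point) * scale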
Although fig. 3 illustrates a typical PTQ method, the present disclosure is not limited thereto, and may be any other modification of the PTQ quantization method.
Fig. 4 illustrates a Transformer-based target tracking model and an optimized target tracking model according to an example embodiment of the present disclosure.
The left side of fig. 4 shows the Transformer-based target tracking model, and the right side of fig. 4 shows the optimized target tracking model according to an example embodiment of the present disclosure.
Referring to fig. 4, the Transformer-based target tracking model may include a template branch, a search branch, a concatenation module, and a first Transformer module. For example only, the first Transformer module may have the network structure shown in fig. 4. However, the present disclosure is not limited thereto, and the first Transformer module may have a different network structure depending on the implementation of the Transformer-based target tracking model. It should be appreciated that the Transformer-based target tracking model may be an existing model; therefore, to avoid redundancy, only a brief description of it is provided.
In fig. 4, T_q may represent the query vector generated by the template branch, T_k the key vector generated by the template branch, and T_v the value vector generated by the template branch; S_q, S_k, and S_v may represent the query, key, and value vectors generated by the search branch, respectively. K may represent the combination of T_k of the template branch and S_k of the search branch, and V may represent the combination of T_v of the template branch and S_v of the search branch. Nx may represent the block being cascaded N times.
In the conversion from the Transformer-based target tracking model to the optimized target tracking model according to the example embodiments of the present disclosure, the network structure is optimized on the principle that the template branch and the search branch are computed independently as far as possible, and the optimized structure is quantization-friendly. The quantization method needs to collect the maximum and minimum values of each layer of the network from the calibration data; since the template branch and the search branch differ in size and content, and hence in data amount and data range, concatenating them and quantizing them with the same set of maximum and minimum values leads to large quantization errors. Therefore, according to the example embodiments of the present disclosure, the template branch and the search branch are computed independently as far as possible without changing the model output values, so as to ensure that the model retains good accuracy after quantization.
Fig. 5 shows a typical attention calculation structure.
In general, attention mechanisms fall into two types: additive attention and dot-product attention. Scaled dot-product attention (Scaled Dot-Product Attention) belongs to the dot-product type, with a scaling factor added to the plain dot-product mechanism; the attention scores are scaled to keep the values numerically stable. Multi-head attention (Multi-Head Attention), proposed in the Transformer, is simply a combination of several scaled dot-product attention heads, acting like the multiple kernels of a convolutional layer. Combined multi-head attention (Combined Multi-Head Attention) computes attention over the template (template) branch and the search (search) branch together: the template branch computes scaled dot-product attention on its own, while the search branch computes scaled dot-product attention using the keys and values obtained by concatenating T_k and T_v of the template branch with S_k and S_v of the search branch, respectively. According to the example embodiments of the present disclosure, splitting the combined multi-head attention mechanism into separate computations in this way is more quantization-friendly.
In fig. 5, MatMul represents a matrix multiplication operation, Scale a scaling operation, Mask (opt.) an optional masking operation, Linear a linear projection, Concat a concatenation operation, and Split a split operation.
Fig. 5 shows example structures of scaled dot-product attention, multi-head attention, and combined multi-head attention; however, the present disclosure is not limited thereto, and each of them may have any other variant form.
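A minimal multi-head attention sketch matching the structure of fig. 5 (heads are split, attended with scaled dot products, and concatenated; the optional Mask is omitted):

    import torch
    import torch.nn.functional as F

    def multi_head_attention(q, k, v, num_heads):
        b, n, d = q.shape
        d_h = d // num_heads
        # Split: (batch, length, d) -> (batch, heads, length, d_h)
        split = lambda x: x.view(b, -1, num_heads, d_h).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        scores = q @ k.transpose(-2, -1) / d_h ** 0.5   # MatMul + Scale
        out = F.softmax(scores, dim=-1) @ v             # SoftMax + MatMul
        # Concat: merge the heads back into (batch, length, d)
        return out.transpose(1, 2).reshape(b, n, d)

In the combined variant, the k and v passed in on the search side would be the concatenations of the template and search keys and values, as in the split-attention sketch given earlier.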
Fig. 6 shows a flowchart of a Transformer-based target tracking method according to an example embodiment of the disclosure.
Referring to fig. 6, in step S610, a video sequence may be acquired as an input of a quantization model. The video sequence may include a plurality of frames.
In step S620, a first frame in the video sequence may be input to a template branch in the quantization model to extract global template features.
That is, the global template features may be calculated only once for a single video sequence.
In step S630, a plurality of frames in the video sequence may be input to a search branch in the quantization model to extract a plurality of search features.
That is, for a single video sequence, search features may be computed for each frame.
In step S640, a target tracking result may be output from the quantization model based on the global template feature and the plurality of search features.
According to example embodiments of the present disclosure, since global template features may be calculated only once for a single video sequence and the calculation results are saved in global variables for use in calculation of search branches, the calculation amount may be reduced and the speed of target tracking may be improved.
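For example, tracking over one video sequence might look like the following sketch, where the template computation is done once and its result cached; the attribute names on quant_model are assumptions for illustration, not names from the patent:

    def track_sequence(frames, quant_model):
        # Template features: computed once from the first frame and cached.
        template_feat = quant_model.template_branch(frames[0])
        cached_first_vector = quant_model.second_transformer(template_feat)
        results = []
        for frame in frames[1:]:
            search_feat = quant_model.search_branch(frame)
            # The third Transformer module reuses the cached template result.
            out = quant_model.third_transformer(search_feat, cached_first_vector)
            results.append(quant_model.head(out))
        return results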
Fig. 7 shows a schematic diagram of a network corresponding to the Transformer-based target tracking method of fig. 6, according to an example embodiment of the disclosure.
Referring to fig. 7, a video sequence may include a plurality of frames, frame1 to frameL, where L denotes the number of frames.
According to the example embodiments of the present disclosure, since the features of the template image need to be computed only once for any one video sequence, the already-computed template features can be reused directly for each subsequent search image. The part of the network that involves the template is therefore separated from the part that involves the search, the template is computed only once, and the result is saved in a global variable for use in the computation of the search images. To ensure the accuracy of the network after separation, the quantization parameters after separation may be kept consistent with the quantization parameters before separation.
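One way to honor this consistency constraint, sketched under the assumption that quantization parameters are stored per tensor name in simple dictionaries (a hypothetical layout, not the patent's):

    def align_quant_params(params_before, params_after, shared_names):
        # Copy the pre-separation (scale, zero_point) pairs onto both of the
        # separated paths so that separation does not change the quantization.
        for name in shared_names:
            params_after["template." + name] = params_before[name]
            params_after["search." + name] = params_before[name]
        return params_after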
Fig. 8 illustrates a quantization apparatus of a Transformer-based target tracking model, according to an example embodiment of the present disclosure.
Referring to fig. 8, a quantization apparatus 800 of a Transformer-based target tracking model may include a target tracking model acquisition module 810, an optimized target tracking model generation module 820, and a quantization module 830.
The target tracking model acquisition module 810 may acquire a Transformer-based target tracking model, wherein the target tracking model includes a template branch, a search branch, a concatenation module, and a first Transformer module, wherein the concatenation module receives a first feature output from the template branch and a second feature output from the search branch and concatenates the first feature and the second feature into a concatenated feature.
The optimized target tracking model generation module 820 may generate the optimized target tracking model by deleting the concatenation module from the target tracking model and splitting the first Transformer module into a second Transformer module and a third Transformer module, wherein the second Transformer module receives the first feature and the third Transformer module receives the second feature.
The quantization module 830 is configured to generate a quantization model corresponding to the optimized target tracking model by quantizing the second Transformer module and the third Transformer module separately.
The acquisition operation performed by the target tracking model acquisition module 810, the generation operation performed by the optimized target tracking model generation module 820, and the quantization operation performed by the quantization module 830 have been described with reference to one or more of fig. 1 through 7, and are not repeated here in order to avoid redundancy.
Fig. 9 illustrates a Transformer-based target tracking device according to an example embodiment of the disclosure.
Referring to fig. 9, a Transformer-based target tracking device 900 may include an acquisition module 910, an input module 920, and a tracking module 930.
The acquisition module 910 may acquire a video sequence as an input to the quantization model. The input module 920 may input a first frame of the video sequence to the template branch in the quantization model to extract a global template feature, and input a plurality of frames of the video sequence to the search branch in the quantization model to extract a plurality of search features. The tracking module 930 is configured to output a target tracking result from the quantization model based on the global template feature and the plurality of search features.
That is, the Transformer-based target tracking device 900 may perform any Transformer-based target tracking method according to the example embodiments of the disclosure. Therefore, in order to avoid redundancy, a repetitive description is not given here.
Fig. 10 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Referring to fig. 10, an electronic device 1000 in accordance with embodiments of the disclosure may include one or more computing devices (e.g., processors) 1010 and one or more storage devices 1020. Here, the one or more storage devices 1020 store a computer program that, when executed by the one or more computing devices 1010, implements any of the methods described with reference to fig. 1-7. For brevity, any of the methods described with reference to fig. 1-7 performed by the one or more computing devices 1010 will not be repeated herein.
The various modules in the illustrated apparatuses for quantizing the neural network model of the present disclosure may be configured as software, hardware, firmware, or any combination thereof that performs particular functions. For example, each module may correspond to an application-specific integrated circuit, to pure software code, or to a combination of software and hardware. Furthermore, one or more functions implemented by the respective modules may also be performed uniformly by components in a physical entity device (e.g., a processor, a client, a server, or the like).
Further, the quantization method of the neural network model of the present disclosure described above may be implemented by a program (or instructions) recorded on a computer-readable storage medium. For example, according to an example embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the quantization method of the neural network model according to the present disclosure.
The computer program in the above computer-readable storage medium may be run in an environment deployed in computer devices such as clients, hosts, proxy devices, and servers. It should be noted that the computer program may also be used to perform additional steps beyond, or more specific processing within, the steps described above; the contents of these additional steps and further processing have been mentioned in the description of the related methods with reference to one or more of fig. 1 to 7, and are not repeated here.
It should be noted that each module in the apparatus for quantifying a neural network model according to the exemplary embodiments of the present disclosure may completely rely on the execution of a computer program to implement a corresponding function, i.e., each module corresponds to each step in the functional architecture of the computer program, so that the entire system is called through a specific software package (e.g., lib library) to implement the corresponding function.
On the other hand, the various modules according to the various embodiments of the present disclosure may also be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium, such as a storage medium, so that the processor can perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present disclosure may also be implemented as a computing device including a storage component and a processor, the storage component having stored therein a set of computer-executable instructions that, when executed by the processor, perform a method of quantifying a neural network model according to exemplary embodiments of the present disclosure.
In particular, the computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the above set of instructions.
Here, the computing device need not be a single computing device, but may be any device or aggregate of circuits capable of executing the above instructions (or instruction sets), alone or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In a computing device, the processor may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the quantization method of the neural network model according to the exemplary embodiment of the present disclosure may be implemented in software, some of the operations may be implemented in hardware, and furthermore, the operations may be implemented in a combination of software and hardware.
The processor may execute instructions or code stored in one of the memory components, where the memory component may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory component may be integrated with the processor, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, network connection, etc., such that the processor is able to read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via buses and/or networks.
The quantization method of the neural network model according to the example embodiments of the present disclosure may be described in terms of various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated with non-exact boundaries.
Thus, the method of quantizing a target tracking model described with reference to at least one of fig. 1 to 7 may be implemented by a system comprising at least one computing device and at least one storage device storing instructions.
According to an example embodiment of the present disclosure, the at least one storage device stores a set of computer-executable instructions which, when executed by the at least one computing device, cause the at least one computing device to perform the quantization method of the target tracking model described with reference to the figures.
The foregoing description of exemplary embodiments of the present disclosure is illustrative only, not exhaustive, and the present disclosure is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. Accordingly, the scope of the present disclosure should be determined by the scope of the claims.

Claims (10)

1. A quantization method of a Transformer-based target tracking model, the quantization method comprising:
acquiring a Transformer-based target tracking model, wherein the target tracking model comprises a template branch, a search branch, a stitching module, and a first Transformer module, wherein the stitching module receives a first feature output from the template branch and a second feature output from the search branch and stitches the first feature and the second feature into a stitched feature;
generating an optimized target tracking model by deleting the stitching module from the target tracking model and splitting the first Transformer module into a second Transformer module and a third Transformer module, wherein the second Transformer module receives the first feature and the third Transformer module receives the second feature;
generating a quantization model corresponding to the optimized target tracking model by quantizing the second Transformer module and the third Transformer module, respectively.
2. The quantization method of claim 1, wherein the second Transformer module and the third Transformer module have a first multi-head attention mechanism module and a second multi-head attention mechanism module, respectively, wherein the first multi-head attention mechanism module receives a first vector generated in the second Transformer module, and the second multi-head attention mechanism module receives a stitched vector obtained by stitching the first vector generated in the second Transformer module and a second vector generated in the third Transformer module.
3. The quantization method of claim 2, wherein the first vector comprises a queried vector corresponding to the template branch and a vector obtained by the query corresponding to the template branch, the second vector comprises a queried vector corresponding to the search branch and a vector obtained by the query corresponding to the search branch, and the stitched vector comprises: a vector generated by concatenating the queried vector corresponding to the template branch with the queried vector corresponding to the search branch, and a vector generated by concatenating the vector obtained by the query corresponding to the template branch with the vector obtained by the query corresponding to the search branch.
4. The quantization method of claim 2, wherein the first multi-head attention mechanism module further receives a query vector corresponding to the template branch, and the second multi-head attention mechanism module further receives a query vector corresponding to the search branch.
5. The quantization method of claim 1, wherein the step of quantizing the second Transformer module and the third Transformer module respectively comprises:
obtaining a calibration data set comprising a video sequence, wherein the video sequence comprises a plurality of consecutive frames;
forming a first frame of the video sequence into a first target calibration data set;
representing the plurality of consecutive frames by selecting one frame from the plurality of consecutive frames according to a frame rate;
forming the one frame, together with a first vector generated by the second Transformer module based on the first target calibration data set, into a second target calibration data set;
quantizing the second Transformer module based on the first target calibration data set, and quantizing the third Transformer module based on the second target calibration data set.
6. The quantization method of claim 5, wherein the number of the plurality of consecutive frames is equal to the value of the frame rate.
7. A Transformer-based target tracking method, the target tracking method comprising:
acquiring a video sequence as an input to the quantization model generated by the quantization method of any one of claims 1-6;
inputting a first frame in the video sequence to a template branch in the quantization model to extract a global template feature;
inputting a plurality of frames in the video sequence to a search branch in the quantization model to extract a plurality of search features;
outputting a target tracking result from the quantization model based on the global template feature and the plurality of search features.
8. A Transformer-based quantization device for a target tracking model, the quantization device comprising:
a target tracking model acquisition module configured to acquire a Transformer-based target tracking model, wherein the target tracking model includes a template branch, a search branch, a stitching module, and a first Transformer module, wherein the stitching module receives a first feature output from the template branch and a second feature output from the search branch and stitches the first feature and the second feature into a stitched feature;
an optimized target tracking model generation module configured to generate an optimized target tracking model by deleting the stitching module from the target tracking model and splitting the first Transformer module into a second Transformer module and a third Transformer module, wherein the second Transformer module receives the first feature and the third Transformer module receives the second feature;
and a quantization module configured to generate a quantization model corresponding to the optimized target tracking model by quantizing the second Transformer module and the third Transformer module, respectively.
9. A Transformer-based target tracking device, the target tracking device comprising:
an acquisition module configured to acquire a video sequence as an input to the quantization model generated by the quantization device of claim 8;
an input module configured to input a first frame in the video sequence to a template branch in the quantization model to extract a global template feature, and to input a plurality of frames in the video sequence to a search branch in the quantization model to extract a plurality of search features;
and a tracking module configured to output a target tracking result from the quantization model based on the global template feature and the plurality of search features.
10. A computer readable storage medium having stored thereon a computer program which, when executed by one or more computing devices, causes the one or more computing devices to implement the quantization method of any of claims 1-7.
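
By way of illustration and not limitation, the following non-limiting sketches restate several of the claimed steps in executable form. First, the module split of claim 1: the stitching (concatenation) module is removed and the single Transformer module is split into one module per branch. All names and dimensions below, and the use of nn.TransformerEncoderLayer as a stand-in for the patent's Transformer module, are assumptions for exposition, not the claimed implementation.

    # Hypothetical sketch of claim 1; class names and dimensions are assumed.
    import torch
    import torch.nn as nn

    class JointBlock(nn.Module):
        """Original form: template and search features are stitched along the
        token axis and passed through one Transformer module."""
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                    batch_first=True)

        def forward(self, template_feat, search_feat):
            stitched = torch.cat([template_feat, search_feat], dim=1)  # stitching module
            return self.block(stitched)

    class SplitBlocks(nn.Module):
        """Optimized form: a second module for the template branch and a third
        module for the search branch, which can then be quantized separately."""
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.template_block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                             batch_first=True)
            self.search_block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                           batch_first=True)

        def forward(self, template_feat, search_feat):
            return self.template_block(template_feat), self.search_block(search_feat)

One plausible motivation for this split is that it removes the dynamic concatenation from the quantized graph, so each branch is calibrated against the activation statistics of its own input distribution.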
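
Claims 2 to 4 describe how the two branches still interact after the split: the search-side attention consumes keys and values stitched from both branches. A minimal sketch follows, assuming the "queried vector" and the "vector obtained by the query" play the roles of attention keys and values; that reading, along with the tensor shapes, is an interpretation rather than claim language.

    # Hedged sketch of claims 2-4; shapes and the key/value reading are assumptions.
    import torch
    import torch.nn as nn

    dim, heads = 256, 8
    first_mha = nn.MultiheadAttention(dim, heads, batch_first=True)   # template side
    second_mha = nn.MultiheadAttention(dim, heads, batch_first=True)  # search side

    q_t = k_t = v_t = torch.randn(1, 64, dim)    # template branch: query (claim 4) and first vector (claim 2)
    q_s = k_s = v_s = torch.randn(1, 256, dim)   # search branch: query and second vector

    # First MHA module: template query against the first vector only.
    template_out, _ = first_mha(q_t, k_t, v_t)

    # Second MHA module: search query against the stitched vector (claim 3),
    # i.e. template and search keys concatenated, and likewise the values.
    k_stitched = torch.cat([k_t, k_s], dim=1)
    v_stitched = torch.cat([v_t, v_s], dim=1)
    search_out, _ = second_mha(q_s, k_stitched, v_stitched)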
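
Claims 5 and 6 outline a calibration pipeline for the two split modules. One way to read it is sketched below; the helper names (build_calibration_sets, template_module) are hypothetical, and frames are assumed to arrive as a list of tensors.

    # Illustrative reading of claims 5-6, not the patent's implementation.
    import torch

    def build_calibration_sets(frames, fps, template_module):
        """frames: list of (C, H, W) tensors from one video sequence;
        fps: the frame rate, which per claim 6 also equals the number of
        consecutive frames represented by a single selected frame."""
        # Claim 5: the first frame forms the first target calibration data set.
        first_set = [frames[0]]

        # One representative frame is selected per `fps` consecutive frames.
        representatives = frames[::fps]

        # The first vector produced by the second Transformer module on the
        # first calibration set is paired with each representative frame to
        # form the second target calibration data set.
        with torch.no_grad():
            first_vector = template_module(frames[0].unsqueeze(0))
        second_set = [(frame, first_vector) for frame in representatives]
        return first_set, second_set

    # The second Transformer module is then calibrated and quantized on
    # first_set, and the third Transformer module on second_set.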
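
Finally, the tracking method of claim 7 runs the template branch once on the first frame and the search branch on the remaining frames. A minimal sketch, where the quantized_model interface (template_branch, search_branch, head) is assumed for exposition:

    # Hypothetical tracking loop for claim 7; the model interface is assumed.
    def track(quantized_model, video_frames):
        # Global template feature: extracted once from the first frame.
        template_feat = quantized_model.template_branch(video_frames[0])
        results = []
        for frame in video_frames[1:]:
            search_feat = quantized_model.search_branch(frame)  # per-frame search feature
            results.append(quantized_model.head(template_feat, search_feat))
        return results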
CN202311107931.7A 2023-08-30 2023-08-30 Quantization method, tracking method and device of target tracking model Pending CN117115199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311107931.7A CN117115199A (en) 2023-08-30 2023-08-30 Quantization method, tracking method and device of target tracking model

Publications (1)

Publication Number Publication Date
CN117115199A 2023-11-24

Family

ID=88810729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311107931.7A Pending CN117115199A (en) 2023-08-30 2023-08-30 Quantization method, tracking method and device of target tracking model

Country Status (1)

Country Link
CN (1) CN117115199A (en)

Legal Events

Date Code Title Description
PB01 Publication