CN118429388A - Visual tracking method and device based on image processing - Google Patents

Visual tracking method and device based on image processing

Info

Publication number
CN118429388A
CN118429388A
Authority
CN
China
Prior art keywords
data
feature
target object
tracking
image sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410853855.2A
Other languages
Chinese (zh)
Other versions
CN118429388B (en)
Inventor
林漫钦
尹家源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohem Technology Co ltd
Original Assignee
Hohem Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohem Technology Co ltd filed Critical Hohem Technology Co ltd
Priority to CN202410853855.2A priority Critical patent/CN118429388B/en
Publication of CN118429388A publication Critical patent/CN118429388A/en
Application granted granted Critical
Publication of CN118429388B publication Critical patent/CN118429388B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to the technical field of image processing and discloses a visual tracking method and device based on image processing. The method comprises the following steps: collecting an environment image sequence of a target object through a pan-tilt and preprocessing it to obtain a preprocessed environment image sequence; performing target detection analysis to obtain a target detection result and performing feature extraction to obtain first position data and first feature data; performing feature aggregation and tracking-parameter membership analysis through a graph convolution network to obtain an optimal tracking parameter combination and constructing a visual tracking model; carrying out Kalman filtering analysis and motion state prediction to obtain motion state prediction data; performing feature matching to obtain second position data and second feature data; and calculating the visual tracking deviation to obtain visual tracking deviation data, generating position adjustment parameters and angle adjustment parameters of the pan-tilt according to the visual tracking deviation data, and controlling the pan-tilt to perform visual tracking of the target object.

Description

Visual tracking method and device based on image processing
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a visual tracking method and device based on image processing.
Background
With the development of computer vision technology, the accuracy and real-time performance of visual tracking methods have improved remarkably. However, in complex environments, conventional visual tracking methods still face challenges such as illumination variation, object occlusion, motion blur, and multi-object tracking. Traditional methods often rely on single features or simple filtering algorithms and struggle to cope with changeable real scenes and complex target motion patterns.
Existing visual tracking techniques mostly rely on simple target tracking algorithms such as mean shift. The prior art performs reasonably well with a static background or simple targets, but performs poorly with dynamic backgrounds, multiple targets, and complex target movements. In addition, conventional methods often have difficulty handling rapid changes and complex nonlinear motion of the target, so target loss and mistracking during tracking are common. In short, the accuracy of the prior art is low.
Disclosure of Invention
The application provides a visual tracking method and device based on image processing, which are used for improving the accuracy of pan-tilt visual tracking by adopting image processing technology.
In a first aspect, the present application provides an image processing-based visual tracking method, including:
collecting a continuous environment image sequence of a target object through a pan-tilt and carrying out image preprocessing to obtain a preprocessed environment image sequence;
Performing target detection analysis on a target object in the preprocessing environment image sequence to obtain a target detection result, and performing feature extraction on the target detection result to obtain first position data and first feature data;
Inputting the first position data and the first characteristic data into a preset graph convolution network to perform characteristic aggregation and tracking parameter membership analysis to obtain an optimal tracking parameter combination, and constructing a visual tracking model according to the optimal tracking parameter combination;
Carrying out Kalman filtering analysis and motion state prediction on the visual tracking model to obtain motion state prediction data of the target object;
performing feature matching on the motion state prediction data of the target object to obtain second position data and second feature data;
And performing visual tracking deviation calculation on the second position data and the second feature data to obtain visual tracking deviation data, generating position adjustment parameters and angle adjustment parameters of the pan-tilt according to the visual tracking deviation data, and controlling the pan-tilt to perform visual tracking on the target object.
In a second aspect, the present application provides an image processing-based visual tracking apparatus, comprising:
the acquisition module is used for acquiring a continuous environment image sequence of the target object through the pan-tilt and carrying out image preprocessing to obtain a preprocessed environment image sequence;
The detection module is used for carrying out target detection analysis on a target object in the preprocessing environment image sequence to obtain a target detection result, and carrying out feature extraction on the target detection result to obtain first position data and first feature data;
The construction module is used for inputting the first position data and the first characteristic data into a preset graph convolution network to perform characteristic aggregation and tracking parameter membership analysis to obtain an optimal tracking parameter combination, and constructing a visual tracking model according to the optimal tracking parameter combination;
The prediction module is used for carrying out Kalman filtering analysis and motion state prediction on the visual tracking model to obtain motion state prediction data of the target object;
The matching module is used for carrying out feature matching on the motion state prediction data of the target object to obtain second position data and second feature data;
The control module is used for carrying out visual tracking deviation calculation on the second position data and the second feature data to obtain visual tracking deviation data, generating position adjustment parameters and angle adjustment parameters of the pan-tilt according to the visual tracking deviation data, and controlling the pan-tilt to carry out visual tracking on the target object.
A third aspect of the present application provides an image processing-based visual tracking apparatus comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the image processing-based visual tracking apparatus to perform the image processing-based visual tracking method described above.
A fourth aspect of the present application provides a computer-readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the above-described image processing-based visual tracking method.
According to the technical scheme provided by the application, the collected environment image sequence is comprehensively optimized through image preprocessing: noise is effectively removed, the contrast and edge information of the images are enhanced, and image quality is improved, so that the target object in the image sequence can be accurately detected through feature extraction. Multi-level feature extraction improves the accuracy of target detection and retains rich information such as the shape, color and texture of the target object. A preset graph convolution network is adopted to perform feature aggregation on the first position data and first feature data, and tracking parameter analysis is performed through a membership model: the multi-layer convolution operations and nonlinear activation functions realize effective feature aggregation, and the membership analysis accurately calculates the relation between each position node and the predefined tracking parameters. The motion state of the target object is predicted through Kalman filtering and particle filtering, and a regional search model and a non-homogeneous Poisson process are introduced so that the update of undiscovered targets is more accurate; posterior Cramér-Rao lower bound calculation and an optimized objective function are used to ensure the accuracy and stability of the predicted data. Feature matching and template updating of the target object are completed through local search, so that the target position and features can still be accurately matched under rapid target movement and environmental change, effectively avoiding tracking drift and mistracking. By calculating the visual tracking deviation data, the position adjustment parameters and angle adjustment parameters of the pan-tilt are generated, the target object is kept within the camera's field of view, movement changes of the target can be responded to quickly, and the accuracy of pan-tilt visual tracking is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained based on these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of an embodiment of a visual tracking method based on image processing according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an embodiment of a visual tracking device based on image processing according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a visual tracking method and device based on image processing. The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present application is described below with reference to fig. 1, and an embodiment of a visual tracking method based on image processing in an embodiment of the present application includes:
Step S101, collecting a continuous environment image sequence of a target object through a pan-tilt and performing image preprocessing to obtain a preprocessed environment image sequence;
It is to be understood that the execution subject of the present application may be a visual tracking device based on image processing, and may also be a terminal or a server, which is not limited herein. The embodiment of the application is described by taking a server as the execution subject as an example.
Specifically, a continuous environment image sequence of the target object is acquired through the cradle head, and the moving process and the surrounding environment information of the target object are continuously and completely captured. And carrying out median filtering treatment on the continuous environment image sequence, removing salt and pepper noise in the images to obtain a denoising image sequence, and improving the definition and quality of the images. And carrying out mean filtering on the denoising image sequence, and smoothing tiny noise in the image to obtain an initial smooth image sequence. And carrying out Gaussian filtering treatment on the initial smooth image sequence, removing high-frequency noise, enabling the image to be smoother, and obtaining the Gaussian smooth image sequence. And carrying out histogram equalization processing on the Gaussian smooth image sequence, enhancing the contrast of the image, enabling details in the image to be clearer, and obtaining the contrast enhanced image sequence. And gamma correction is carried out on the contrast enhancement image sequence, and the brightness of the image is adjusted through the gamma correction, so that the gray level distribution of the image is more reasonable, and the brightness correction image sequence is obtained. And carrying out edge detection on the brightness correction image sequence, extracting edge information in the image, and highlighting the outline of the target object to obtain an edge enhancement image sequence. And carrying out binarization processing on the edge enhanced image sequence, converting the image into a black-and-white image, and highlighting the difference between the target object and the background to obtain a binarized image sequence. And carrying out morphological processing on the binarized image sequence, removing noise points and small areas in the image, filling the cavity of the target object, and obtaining the image sequence after morphological processing. And (3) performing image cutting on the image sequence subjected to morphological processing, removing unnecessary background areas, and reserving main parts of the target object to obtain a preprocessing environment image sequence.
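As an illustrative sketch only, the preprocessing chain described above (median filtering, mean filtering, Gaussian smoothing, histogram equalization, gamma correction, edge detection, binarization, morphological processing and cropping) could be assembled with OpenCV roughly as follows; the function names, kernel sizes, thresholds and gamma value are assumptions for illustration rather than parameters disclosed by the application.

```python
import cv2
import numpy as np

def preprocess_frame(frame, gamma=1.2, crop_box=None):
    """Sketch of the S101 preprocessing chain on a single BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    denoised = cv2.medianBlur(gray, 3)                    # remove salt-and-pepper noise
    smoothed = cv2.blur(denoised, (3, 3))                 # mean filtering
    gauss = cv2.GaussianBlur(smoothed, (5, 5), 1.0)       # suppress high-frequency noise
    equalized = cv2.equalizeHist(gauss)                   # contrast enhancement
    corrected = np.uint8(255 * (equalized / 255.0) ** (1.0 / gamma))  # gamma correction
    edges = cv2.Canny(corrected, 50, 150)                 # edge detection
    _, binary = cv2.threshold(edges, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)    # binarization
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)        # remove small noise points
    cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)      # fill small holes
    if crop_box is not None:                              # keep only the region of interest
        x, y, w, h = crop_box
        cleaned = cleaned[y:y + h, x:x + w]
    return cleaned

# Processing a captured sequence frame by frame (crop_box is an assumed region of interest):
# preprocessed_seq = [preprocess_frame(f, crop_box=(100, 80, 320, 240)) for f in frames]
```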
Step S102, performing target detection analysis on a target object in a preprocessing environment image sequence to obtain a target detection result, and performing feature extraction on the target detection result to obtain first position data and first feature data;
Specifically, a sliding window process is performed on the preprocessed environment image sequence: a plurality of candidate region image sequences are generated by gradually sliding a window of fixed size over the image, where the candidate regions may contain the target object or parts of it, so as to cover the whole preprocessed environment image. Haar feature extraction is carried out on the target objects in the plurality of candidate region image sequences, extracting local texture, edge and other features of the target objects to obtain Haar feature data. AdaBoost classification is carried out on the Haar feature data; AdaBoost is a machine learning algorithm that improves classification accuracy and robustness by combining a plurality of weak classifiers. In this process, the AdaBoost classifier classifies the candidate regions according to the Haar feature data and determines which candidate regions contain the target object, giving the target detection result. ORB feature extraction is then performed on the target detection result; ORB (Oriented FAST and Rotated BRIEF) is an efficient feature extraction and description algorithm, and the ORB feature data is obtained by extracting local feature points and descriptors of the target object. These feature data describe the local information and overall morphology of the target object in more detail. Position and feature analysis is carried out on the ORB feature data to determine the specific position and related features of the target object in the image, giving the first position data and first feature data. The first position data describes the position coordinates of the target object in the image, and the first feature data contains feature descriptors of the target object for subsequent tracking and matching.
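A hedged sketch of this detection-and-description step using standard OpenCV components (a pre-trained Haar cascade standing in for the Haar/AdaBoost detector, followed by ORB): the cascade file name, detection parameters and feature count are placeholders, not values from the application.

```python
import cv2

def detect_and_describe(gray, cascade_path="haar_target.xml"):
    """Sketch of S102: Haar/AdaBoost detection followed by ORB feature extraction.
    cascade_path is a placeholder for a cascade trained on the target class."""
    detector = cv2.CascadeClassifier(cascade_path)           # Haar features + AdaBoost stages
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=4)
    if len(boxes) == 0:
        return None
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])        # keep the largest detection
    roi = gray[y:y + h, x:x + w]
    orb = cv2.ORB_create(nfeatures=500)                       # oriented FAST + rotated BRIEF
    keypoints, descriptors = orb.detectAndCompute(roi, None)
    first_position = (x, y, w, h)                             # first position data
    first_features = descriptors                              # first feature data
    return first_position, keypoints, first_features
```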
Step S103, inputting the first position data and the first characteristic data into a preset graph convolution network to perform characteristic aggregation and tracking parameter membership analysis to obtain an optimal tracking parameter combination, and constructing a visual tracking model according to the optimal tracking parameter combination;
Specifically, node conversion is performed on the first position data to obtain a plurality of position nodes, and each position node represents a specific position of the target object in the image. And carrying out node attribute conversion on the first characteristic data according to the plurality of position nodes to obtain node attribute data of each position node, wherein the attribute data describe the characteristic information of the target object at the position. And constructing a graph structure of the plurality of position nodes and node attribute data of each position node to generate a position graph structure, wherein each node connection represents the relation of the target object between different positions. And carrying out convolution operation and multi-layer convolution feature aggregation on the position diagram structure through a preset diagram convolution network to obtain a feature aggregation result. The graph convolution network comprises a plurality of graph convolution layers, a nonlinear activation function is arranged behind each graph convolution layer, the convolution layers and the activation function can extract and aggregate the characteristic information of the position nodes layer by layer, and the depth and accuracy of characteristic extraction are improved. And calculating the membership of each position node and the predefined tracking parameter according to the feature aggregation result through a membership model, and determining the most suitable tracking parameter of each position node by analyzing the feature aggregation result through the membership model to obtain the optimal tracking parameter combination. The optimal tracking parameter combination can optimize each parameter configuration in the tracking process, and improves the tracking precision and stability. And selecting and configuring the initial state and the noise covariance of the Kalman filter and the target speed and acceleration according to the optimal tracking parameter combination, and constructing a visual tracking model. The Kalman filter is a prediction and estimation algorithm, and by configuring the initial state and the noise covariance, the motion state of a target object is effectively predicted, and tracking parameters are adjusted in real time, so that a visual tracking model can capture and track the motion trail of the target object more accurately.
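The following minimal NumPy sketch illustrates one common graph-convolution propagation rule and a membership-style parameter selection; the symmetric normalization, the Gaussian membership function and all dimensions are illustrative assumptions, not the application's specific configuration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: symmetric normalization of the adjacency
    followed by a linear transform and ReLU (a common propagation rule)."""
    A_hat = A + np.eye(A.shape[0])                  # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)                  # ReLU activation

def select_tracking_params(H_agg, param_prototypes):
    """Membership-style selection: score each node feature against each
    predefined parameter prototype and keep the best combination."""
    d = np.linalg.norm(H_agg[:, None, :] - param_prototypes[None, :, :], axis=2)
    membership = np.exp(-d ** 2)                    # Gaussian-kernel similarity as an assumed membership function
    return membership.argmax(axis=1)                # best parameter index per node

# Example: 4 position nodes, 8-dim attributes, 3 candidate parameter sets
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
H = np.random.rand(4, 8)
W1, W2 = np.random.rand(8, 16), np.random.rand(16, 8)
H_agg = gcn_layer(A, gcn_layer(A, H, W1), W2)       # two-layer feature aggregation
best = select_tracking_params(H_agg, np.random.rand(3, 8))
```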
Step S104, kalman filtering analysis and motion state prediction are carried out on the visual tracking model to obtain motion state prediction data of the target object;
Specifically, the position of the target object at the next moment is predicted through the visual tracking model to obtain target position prediction data. Kalman filtering analysis is carried out on the target position prediction data, and the prediction data are smoothed and corrected through a Kalman filter to obtain a Kalman filtering analysis result. Particle filtering prediction is carried out on the Kalman filtering analysis result to predict the position of the target object and obtain particle position prediction data; particle filtering captures the nonlinear motion characteristics of the target object through state estimation over a plurality of particles. Multi-sensor data fusion is carried out on the particle position prediction data, and data from different sensors are fused to obtain fused position prediction data. A corresponding area search model is constructed according to the fused position prediction data, and undiscovered-target update data are generated through the area search model. The area search model determines regions where the target object may appear by analyzing the fused position prediction data and monitors these regions closely. Non-homogeneous Poisson process modeling is performed on the undiscovered-target update data to obtain a non-homogeneous Poisson process model. The non-homogeneous Poisson process model describes the probability distribution of the target object appearing in different areas and helps dynamically adjust the search strategy. Posterior Cramér-Rao lower bound (PCRLB) calculation is performed on the non-homogeneous Poisson process model to obtain PCRLB quantized data. The PCRLB quantized data are used to quantify the prediction uncertainty and provide a theoretical lower bound on the accuracy of the target object position estimate. An optimization objective function is constructed based on the PCRLB quantized data; the optimization objective function ensures the accuracy and reliability of the target object position estimation by minimizing the prediction uncertainty. The chaotic-mapping multi-objective cooperative differential evolution algorithm is solved through the optimization objective function to obtain the optimal sensor scheduling scheme. By combining chaos theory and differential evolution, this algorithm can efficiently solve the multi-objective optimization problem and provide an optimal sensor scheduling strategy. Finally, the motion state prediction data of the target object are generated through the optimal sensor scheduling scheme and the visual tracking model.
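For reference, a minimal constant-velocity Kalman filter of the kind used for such motion-state prediction might look as follows; the state layout, transition matrix and noise covariances are illustrative assumptions.

```python
import numpy as np

class ConstantVelocityKalman:
    """Minimal 2-D constant-velocity Kalman filter: state [x, y, vx, vy]."""
    def __init__(self, x0, y0, dt=1.0, q=1e-2, r=1.0):
        self.x = np.array([x0, y0, 0.0, 0.0], dtype=float)
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)      # state transition
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)       # observe position only
        self.P = np.eye(4)
        self.Q = q * np.eye(4)                                # process noise covariance
        self.R = r * np.eye(2)                                # measurement noise covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                     # predicted position

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)              # Kalman gain
        self.x = self.x + K @ (np.asarray(z, float) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]                                     # corrected position
```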
Step S105, performing feature matching on the motion state prediction data of the target object to obtain second position data and second feature data;
Specifically, the motion state prediction data of the target object is locally searched, and the search is performed in the area near the predicted target position, so as to obtain local search data. Feature point extraction is performed on the local search data, and feature point data of the target object is obtained by using feature point extraction algorithms such as SIFT, SURF or ORB, and the feature points can effectively represent key points and local information of the target object. And generating feature descriptors for the feature point data, and carrying out quantization description on image information around the feature points by generating the feature descriptors to obtain the feature descriptor data. And carrying out quick approximate nearest neighbor search on the feature descriptor data, and finding nearest neighbor matching data of each feature descriptor by using a quick search algorithm such as FLANN. Random sampling consistency analysis (RANSAC) is carried out on nearest neighbor matching data, the matching data is screened through a RANSAC algorithm, error matching is eliminated, inner point matching data is obtained, and the inner point matching data can describe the position and the state of a target object in an image more accurately. Carrying out affine transformation estimation on the interior point matching data, establishing the geometric relationship of the target object between different image frames to obtain an affine transformation model, and carrying out transformation application on the affine transformation model to obtain transformed target position data, thereby realizing accurate alignment of the target position. And performing dense corresponding matching on the transformed target position data, and performing matching in the whole area of the target object to obtain dense matching data. And updating the template of the densely matched data, ensuring that the template data is always consistent with the state of the current target object, and obtaining updated template data. And generating second position data and second characteristic data corresponding to the target object according to the updated template data. The second location data describes the exact location of the target object in the image, and the second feature data contains detailed feature information of the target object.
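A possible sketch of this matching step with standard OpenCV calls (ORB descriptors, brute-force Hamming matching, RANSAC-filtered affine estimation); the matcher choice, match count and RANSAC threshold are assumptions for illustration.

```python
import cv2
import numpy as np

def match_and_localize(template_gray, search_gray, template_box):
    """Sketch of S105: ORB matching + RANSAC affine estimation in a local search region."""
    orb = cv2.ORB_create(nfeatures=500)
    kp1, des1 = orb.detectAndCompute(template_gray, None)
    kp2, des2 = orb.detectAndCompute(search_gray, None)
    if des1 is None or des2 is None:
        return None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)   # Hamming distance for binary ORB descriptors
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:50]
    if len(matches) < 4:
        return None
    src = np.float32([kp1[m.queryIdx].pt for m in matches])
    dst = np.float32([kp2[m.trainIdx].pt for m in matches])
    M, inliers = cv2.estimateAffinePartial2D(src, dst,
                                             method=cv2.RANSAC,
                                             ransacReprojThreshold=3.0)  # reject outlier matches
    if M is None:
        return None
    x, y, w, h = template_box
    corners = np.float32([[x, y], [x + w, y], [x + w, y + h], [x, y + h]]).reshape(-1, 1, 2)
    new_corners = cv2.transform(corners, M).reshape(-1, 2)        # second position data
    return new_corners, dst[inliers.ravel() == 1]                 # matched (second) feature locations
```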
Step S106, performing visual tracking deviation calculation on the second position data and the second feature data to obtain visual tracking deviation data, and generating position adjustment parameters and angle adjustment parameters of the pan-tilt according to the visual tracking deviation data to control the pan-tilt to perform visual tracking of the target object.
Specifically, the visual tracking deviation calculation is performed on the second position data and the second characteristic data, the visual tracking deviation data is obtained by calculating the difference between the position of the target object of the current frame and the expected position, and the deviation degree of the position and the characteristic of the target object in the tracking process is quantified. And carrying out pan-tilt camera translation control through visual tracking deviation data, and calculating to obtain position adjustment parameters by using the deviation data. The position adjustment parameters can determine specific movement amounts of the cradle head in the horizontal direction and the vertical direction, so that the camera can be realigned to the position of the target object. And carrying out rotation control on the cradle head camera through the vision tracking deviation data to obtain an angle adjustment parameter, wherein the angle adjustment parameter is used for adjusting the rotation angle of the camera so as to ensure that the target object is always positioned at the center position of the field of view of the camera. And generating a visual feedback signal through the position adjustment parameter and the angle adjustment parameter to obtain the visual feedback signal. The visual feedback signal is a real-time signal and can reflect the current adjustment state of the camera and the position change condition of the target object. And generating control commands of the cradle head camera according to the visual feedback signals, wherein the control commands comprise specific adjustment commands for guiding the movement and rotation operation of the cradle head camera so as to carry out real-time target tracking. And controlling the cradle head to carry out visual tracking on the target object according to the control command of the cradle head camera. After the cradle head receives the control command, the position and the angle of the camera are automatically adjusted so as to keep the target object within the visual field of the camera. The process is a closed-loop control system, and the adjustment parameters and the control command are generated by continuously calculating the vision tracking deviation data, so that the cradle head can follow the movement of the target object in real time, and the target object is ensured to be always at the optimal observation position of the camera.
In the embodiment of the application, the collected environment image sequence is comprehensively optimized through image preprocessing: noise is effectively removed, the contrast and edge information of the images are enhanced, and image quality is improved, so that the target object in the image sequence can be accurately detected through feature extraction. Multi-level feature extraction improves the accuracy of target detection and retains rich information such as the shape, color and texture of the target object. A preset graph convolution network is adopted to perform feature aggregation on the first position data and first feature data, and tracking parameter analysis is performed through a membership model: the multi-layer convolution operations and nonlinear activation functions realize effective feature aggregation, and the membership analysis accurately calculates the relation between each position node and the predefined tracking parameters. The motion state of the target object is predicted through Kalman filtering and particle filtering, and a regional search model and a non-homogeneous Poisson process are introduced so that the update of undiscovered targets is more accurate; posterior Cramér-Rao lower bound calculation and an optimized objective function are used to ensure the accuracy and stability of the predicted data. Feature matching and template updating of the target object are completed through local search, so that the target position and features can still be accurately matched under rapid target movement and environmental change, effectively avoiding tracking drift and mistracking. By calculating the visual tracking deviation data, the position adjustment parameters and angle adjustment parameters of the pan-tilt are generated, the target object is kept within the camera's field of view, movement changes of the target can be responded to quickly, and the accuracy of pan-tilt visual tracking is improved.
In a specific embodiment, the process of executing step S101 may specifically include the following steps:
(1) Collecting a continuous environment image sequence of a target object through a pan-tilt;
(2) Performing median filtering on the continuous environment image sequence to obtain a denoising image sequence;
(3) Average filtering is carried out on the denoising image sequence to obtain an initial smooth image sequence;
(4) Carrying out Gaussian filtering on the initial smooth image sequence to obtain a Gaussian smooth image sequence;
(5) Performing histogram equalization on the Gaussian smooth image sequence to obtain a contrast enhancement image sequence;
(6) Gamma correction is carried out on the contrast enhancement image sequence to obtain a brightness correction image sequence;
(7) Performing edge detection on the brightness correction image sequence to obtain an edge enhancement image sequence;
(8) Performing binarization processing on the edge enhanced image sequence to obtain a binarized image sequence;
(9) Carrying out morphological processing on the binarized image sequence to obtain a morphological processed image sequence;
(10) And performing image clipping on the image sequence subjected to morphological processing to obtain a preprocessing environment image sequence.
Specifically, a continuous environment image sequence of the target object is acquired through the pan-tilt, and median filtering is performed on the continuous environment image sequence. Median filtering is an effective denoising method suitable for removing salt-and-pepper noise in an image: the neighborhood of each pixel is sorted and the current pixel value is replaced by the median. Assume an image I in which the value of each pixel is $I(x,y)$. The formula for median filtering can be expressed as:

$$\hat{I}(x,y)=\operatorname{median}_{(s,t)\in W}\{I(s,t)\}$$

wherein $\hat{I}(x,y)$ represents the filtered pixel value, $\operatorname{median}$ represents the median operation, and W is a defined window range. Mean filtering is then applied to the denoised image sequence: the average of each pixel and its neighborhood pixels is computed, so that the image is smoothed and tiny noise is reduced. The formula of the mean filtering is as follows:

$$\bar{I}(x,y)=\frac{1}{|W|}\sum_{(s,t)\in W} I(s,t)$$

wherein $\bar{I}(x,y)$ represents the value of the pixel after filtering and $|W|$ is the number of pixels in the window. After the mean filtering is finished, Gaussian filtering is performed on the initial smooth image sequence. Gaussian filtering is a weighted-average filtering method that further removes noise and detail by convolving the image. The kernel function of Gaussian filtering $G(x,y)$ is generally defined as:

$$G(x,y)=\frac{1}{2\pi\sigma^{2}}\exp\!\left(-\frac{x^{2}+y^{2}}{2\sigma^{2}}\right)$$

wherein $\sigma$ is the standard deviation, which determines the degree of filtering. The process of Gaussian filtering image I can be expressed as:

$$I_{G}(x,y)=(I*G)(x,y)$$

Histogram equalization is then performed on the Gaussian smooth image sequence. By stretching the gray-level distribution of the image, the contrast of the image is enhanced so that details in the image are more obvious; the gray-value distribution of the input image is homogenized, thereby enhancing the overall contrast. The basic principle is to redistribute the gray values of the pixels by means of the cumulative distribution function (CDF). Gamma correction is then performed on the contrast-enhanced image sequence. Gamma correction changes the brightness characteristics of an image by adjusting its gray values. With intensities normalized to [0,1], the formula for gamma correction is as follows:

$$I_{\text{out}}(x,y)=I_{\text{in}}(x,y)^{1/\gamma}$$

wherein $\gamma$ is the gamma value, typically greater than 1 to enhance the image brightness, or less than 1 to attenuate the image brightness. Edge detection is performed on the sequence of luminance-corrected images: by identifying areas where gray values vary drastically, the contours of the target object are highlighted. Common edge detection algorithms include the Canny algorithm and the Sobel operator; the general process of Canny edge detection includes Gaussian filtering, gradient calculation, non-maximum suppression, and double-threshold detection. Binarization is then performed on the edge-enhanced image sequence by classifying the pixel values into two classes, target and background. Binarization is usually performed with a fixed or adaptive threshold: a pixel above the threshold is set to white (255) and a pixel below it is set to black (0), yielding the binarized image sequence. Morphological processing is performed on the sequence of binarized images. Morphological processing includes erosion and dilation operations, by which noise and small regions in the image can be removed and voids in the target object filled: the erosion operation removes small white noise points, while the dilation operation fills small holes of the target object. Finally, image cropping is performed on the morphologically processed image sequence; unnecessary background regions are removed and the main part of the target object is retained, giving the preprocessed environment image sequence.
In a specific embodiment, the process of executing step S102 may specifically include the following steps:
(1) Performing sliding window processing on the preprocessing environment image sequence to generate a plurality of candidate region image sequences;
(2) Carrying out Haar feature extraction on target objects in the image sequences of the multiple candidate areas to obtain Haar feature data;
(3) Performing AdaBoost classification on the Haar characteristic data to obtain a target detection result of a target object;
(4) Performing ORB feature extraction on the target detection result to obtain ORB feature data;
(5) And carrying out position and feature analysis on the ORB feature data to obtain first position data and first feature data.
Specifically, a sliding window process is performed on the preprocessed environment image sequence. The whole image is segmented into a plurality of overlapping patches, i.e. candidate region image sequences. The size and step of the sliding window need to be selected according to the actual application scene and the size of the target object. Assume the window size is $w \times h$ (width w, height h) and the step size is s; then for an image of size $W \times H$, the number of candidate regions N that can be generated is calculated by:

$$N=\left(\left\lfloor \frac{W-w}{s}\right\rfloor+1\right)\times\left(\left\lfloor \frac{H-h}{s}\right\rfloor+1\right)$$

where N is the number of candidate regions generated by the sliding window, W is the width of the image, H is the height of the image, w is the width of the sliding window, h is the height of the sliding window, s is the step size of the sliding window, and $\lfloor x\rfloor$ represents rounding x down. The sliding window process ensures that every local image patch is covered, so that no region that may contain the target object is missed. Haar feature extraction is then carried out on the candidate region image sequence; features are extracted by calculating brightness differences between different regions in the image. Haar features are based on weighted sums of pixel values within rectangular regions, typically computed over simple black-and-white rectangle pairs. For example, a typical two-rectangle Haar feature can be expressed as the difference between the luminance sums of two adjacent rectangular regions:

$$f=\sum_{(x,y)\in R_{1}} I(x,y)-\sum_{(x,y)\in R_{2}} I(x,y)$$

wherein f is the Haar feature value, $I(x,y)$ represents the image intensity at position $(x,y)$, and $R_{1}$ and $R_{2}$ respectively represent two adjacent rectangular areas. AdaBoost classification is then performed on the Haar feature data. AdaBoost is a machine learning algorithm that constructs a strong classifier by combining multiple weak classifiers. Assume there are T weak classifiers and the t-th weak classifier has weight $\alpha_{t}$; the final classification result can be expressed as:

$$H(x)=\operatorname{sign}\!\left(\sum_{t=1}^{T}\alpha_{t}h_{t}(x)\right)$$

wherein $H(x)$ is the classification result of the strong classifier, $\operatorname{sign}(\cdot)$ is the sign function (returning $+1$ for a positive argument and $-1$ otherwise), $h_{t}(x)$ represents the classification result of the t-th weak classifier on the sample x, and $\alpha_{t}$ is the weight of the t-th weak classifier. ORB feature extraction is then carried out on the target detection result. ORB combines the FAST keypoint detector with the BRIEF descriptor and adds orientation information. In the ORB feature extraction process, keypoints are detected with the FAST algorithm, and the orientation of each keypoint is determined by calculating the gray-scale centroid of its neighborhood; the neighborhood of each keypoint is then described with a binary BRIEF descriptor, giving the ORB feature data. For example, assume the number of keypoints detected by the FAST algorithm is K and the position of the i-th keypoint is $p_{i}=(x_{i},y_{i})$; the orientation of each keypoint is determined from the gray-scale centroid of its neighborhood, and the BRIEF descriptor encodes the neighborhood of each keypoint as a binary feature vector, yielding the ORB feature data. Position and feature analysis is then carried out on the ORB feature data to obtain the first position data and the first feature data. The ORB feature data includes the keypoint locations and descriptor information; by analyzing these data, the specific position of the target object in the image is determined and a detailed characterization of the target object is extracted.
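A small numeric check of the candidate-region count formula and of a two-rectangle Haar feature computed via an integral image; the image size, window size, stride and rectangle coordinates are example values.

```python
import numpy as np

# Candidate-region count for a 640x480 image with a 64x64 window and stride 16:
W_img, H_img, w, h, s = 640, 480, 64, 64, 16
N = ((W_img - w) // s + 1) * ((H_img - h) // s + 1)   # floor division implements the floor
print(N)   # (36 + 1) * (26 + 1) = 999 candidate regions

# Two-rectangle Haar feature via an integral image (sum of left half minus right half)
def rect_sum(ii, x, y, w, h):
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

img = np.random.randint(0, 256, (480, 640)).astype(np.float64)
ii = np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)      # integral image with zero border
f = rect_sum(ii, 100, 100, 32, 64) - rect_sum(ii, 132, 100, 32, 64)   # Haar feature value
```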
In a specific embodiment, the process of executing step S103 may specifically include the following steps:
(1) Performing node conversion on the first position data to obtain a plurality of position nodes, and performing node attribute conversion on the first characteristic data according to the plurality of position nodes to obtain node attribute data of each position node;
(2) Carrying out graph structure construction on a plurality of position nodes and node attribute data of each position node to obtain a position graph structure;
(3) Carrying out convolution operation and multi-layer convolution feature aggregation on the position graph structure through a preset graph convolution network to obtain a feature aggregation result, wherein the graph convolution network comprises a plurality of graph convolution layers, and a nonlinear activation function is arranged behind each graph convolution layer;
(4) Calculating the membership of each position node and the predefined tracking parameters according to the feature aggregation result through a membership model to obtain an optimal tracking parameter combination;
(5) And selecting and configuring the initial state and the noise covariance of the Kalman filter and the target speed and acceleration according to the optimal tracking parameter combination, and constructing a visual tracking model.
Specifically, node conversion is performed on the first position data. Assume the first position data is $P=\{p_{1},p_{2},\dots,p_{N}\}$, where N represents the number of keypoints and $p_{i}=(x_{i},y_{i})$ indicates the location of the i-th keypoint. Through node conversion, each position data point $p_{i}$ is converted into a node of the graph, denoted $v_{i}$; each node $v_{i}$ corresponds to one position $p_{i}$, giving a plurality of position nodes. Node attribute conversion is then carried out on the first feature data according to the position nodes: for each position node, its corresponding feature descriptor is taken as the node attribute data. A graph structure is constructed from the position nodes and the node attribute data of each position node. The graph structure $G=(V,E)$ is composed of a node set V and an edge set E, each node $v_{i}\in V$ representing a position node. The edge set E represents the connection relationship between nodes, which is usually determined according to the distance or other relevance between the nodes. Assume the distance between nodes is:

$$d_{ij}=\lVert p_{i}-p_{j}\rVert$$

Define a threshold $\varepsilon$; when $d_{ij}<\varepsilon$, there is an edge between nodes $v_{i}$ and $v_{j}$, i.e. $(v_{i},v_{j})\in E$. Convolution operations and multi-layer convolutional feature aggregation are then performed on the position graph structure through the preset graph convolution network to obtain the feature aggregation result. The graph convolution network can effectively extract and aggregate the features of graph-structured data. Assuming the graph convolution network has L layers, the output of each convolution layer can be expressed as:

$$H^{(l+1)}=\sigma\!\left(D^{-\frac{1}{2}}AD^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$

wherein $H^{(l+1)}$ represents the node feature matrix of the $(l+1)$-th layer, $H^{(l)}$ represents the node feature matrix of the $l$-th layer, A is the adjacency matrix of the graph, D is the degree matrix, $W^{(l)}$ is the weight matrix of the $l$-th layer, and $\sigma(\cdot)$ is a nonlinear activation function such as the ReLU function. After the multi-layer convolution operation, the feature aggregation result $H^{(L)}$ is obtained; it contains the aggregated feature information of each node. The membership of each position node to the predefined tracking parameters is then calculated from the feature aggregation result through a membership model, giving the optimal tracking parameter combination. The membership model may be implemented through fuzzy logic or other methods; the membership degree between each position node and the tracking parameters is calculated, and the optimal tracking parameter combination is thereby determined. Assume the tracking parameter set is $\Theta=\{\theta_{1},\theta_{2},\dots,\theta_{M}\}$, where M is the number of parameters, and the feature aggregation result is $H^{(L)}$; the membership can be expressed as:

$$\mu_{ij}=f\!\left(h_{i},\theta_{j}\right)$$

wherein $\mu_{ij}$ represents the membership degree of node $v_{i}$ to tracking parameter $\theta_{j}$, $h_{i}$ represents the aggregated feature of node $v_{i}$, and f is the membership function. According to the membership $\mu_{ij}$, the parameter combination with the largest membership degree is selected as the optimal tracking parameter combination. The initial state and noise covariance of the Kalman filter and the target speed and acceleration are then selected and configured according to the optimal tracking parameter combination, and the visual tracking model is constructed. The Kalman filter is a classical estimation and prediction algorithm for tracking the motion state of a target. The state update formulas of the Kalman filter are as follows:

$$\hat{x}_{k}=\hat{x}_{k|k-1}+K_{k}\left(z_{k}-H\hat{x}_{k|k-1}\right)$$
$$P_{k}=\left(I-K_{k}H\right)P_{k|k-1}$$

wherein $\hat{x}_{k}$ represents the estimate of the state at time k, $\hat{x}_{k|k-1}$ is the predicted state at time k, $K_{k}$ is the Kalman gain, $z_{k}$ is the observed value, H is the observation matrix, $P_{k|k-1}$ and $P_{k}$ are state covariance matrices, and I is an identity matrix. By selecting the optimal tracking parameter combination, the initial state, the noise covariance and the target speed and acceleration of the Kalman filter are configured, and an accurate and efficient visual tracking model is constructed.
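A brief sketch of building the position graph from keypoint coordinates with the distance threshold ε and forming the normalized adjacency fed to the graph convolution; the threshold and coordinates are example values.

```python
import numpy as np

def build_position_graph(positions, eps=50.0):
    """Build the position graph: connect nodes whose Euclidean distance is below eps,
    then return the symmetrically normalized adjacency used by the graph convolution."""
    P = np.asarray(positions, dtype=float)            # shape (N, 2): one row per position node
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    A = ((d < eps) & (d > 0)).astype(float)           # edge if d_ij < eps, no self-edges yet
    A_hat = A + np.eye(len(P))                        # add self-loops before normalization
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt            # feeds the layer rule H' = sigma(A_norm H W)

# Example: four detected positions of the target across recent frames
A_norm = build_position_graph([(120, 90), (135, 95), (160, 110), (400, 300)], eps=50.0)
```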
In a specific embodiment, the process of executing step S104 may specifically include the following steps:
(1) Predicting the position of the target object at the next moment through a visual tracking model to obtain target position prediction data;
(2) Carrying out Kalman filtering analysis on the target position prediction data to obtain Kalman filtering analysis results;
(3) Carrying out particle filtering prediction on the Kalman filtering analysis result to obtain particle position prediction data;
(4) Carrying out multi-sensor data fusion on the particle position prediction data to obtain fusion position prediction data;
(5) Constructing a corresponding area search model according to the fusion position prediction data, and generating undiscovered target update data through the area search model;
(6) Modeling the non-homogeneous poisson process of the undiscovered target update data to obtain a non-homogeneous poisson process model;
(7) Performing posterior Cramér-Rao lower bound (PCRLB) calculation on the non-homogeneous Poisson process model to obtain PCRLB quantized data, and constructing an optimization objective function based on the PCRLB quantized data;
(8) Carrying out chaotic mapping-multi-target collaborative differential evolution algorithm solution through an optimized objective function to obtain an optimal sensor scheduling scheme;
(9) And generating motion state prediction data of the target object through an optimal sensor scheduling scheme and a visual tracking model.
Specifically, the position of the target object at the next moment is predicted by the visual tracking model. Assume the state at the current time is $\hat{x}_{k}$; the predicted state at the next time, $\hat{x}_{k+1|k}$, is obtained through the state transition equation:

$$\hat{x}_{k+1|k}=F\hat{x}_{k}+Bu_{k}$$

wherein $\hat{x}_{k+1|k}$ is the predicted state at the next instant, F is the state transition matrix, B is the control matrix, and $u_{k}$ is the control input. Kalman filtering analysis is then performed on the target position prediction data. Kalman filtering provides a more accurate state estimate by combining the predicted data and the observed data. The update steps of the Kalman filtering are as follows:

$$K_{k+1}=P_{k+1|k}H^{T}\left(HP_{k+1|k}H^{T}+R\right)^{-1}$$
$$\hat{x}_{k+1}=\hat{x}_{k+1|k}+K_{k+1}\left(z_{k+1}-H\hat{x}_{k+1|k}\right)$$
$$P_{k+1}=\left(I-K_{k+1}H\right)P_{k+1|k}$$

wherein $K_{k+1}$ is the Kalman gain, $P_{k+1|k}$ is the prediction error covariance matrix, H is the observation matrix, R is the observation noise covariance matrix, $z_{k+1}$ is the current observation, $\hat{x}_{k+1}$ is the updated state estimate, and I is an identity matrix. Particle filter prediction is then carried out on the Kalman filter analysis result. Particle filtering represents the state distribution with a large number of particles and estimates the state through importance sampling and resampling. Assuming N particles are used, the state of the i-th particle is $x_{k}^{(i)}$ and its weight is $w_{k}^{(i)}$; the weight update step of the particle filter is:

$$w_{k+1}^{(i)}\propto w_{k}^{(i)}\,p\!\left(z_{k+1}\mid x_{k+1}^{(i)}\right)$$

wherein $p(z_{k+1}\mid x_{k+1}^{(i)})$ is the observation probability. Multi-sensor data fusion is performed on the particle position prediction data, combining data from different sensors to improve the accuracy of position prediction; the fusion may take the form of weighted averaging or an extension of Kalman filtering. A corresponding area search model is constructed according to the fused position prediction data, and undiscovered-target update data are generated through the area search model. The area search model searches the regions where the target object may appear and generates update data for the undiscovered target; it may be constructed based on Bayesian updates or other probabilistic models. Non-homogeneous Poisson process modeling is then performed on the undiscovered-target update data to obtain a non-homogeneous Poisson process model. The non-homogeneous Poisson process describes the probability distribution of target appearances at different times and locations, and its intensity function can be expressed as:

$$\lambda(t,x)=\lambda_{0}\,e^{-\beta t}\,g(x)$$

wherein $\lambda(t,x)$ is the intensity at time t and location x, $\lambda_{0}$ is the initial intensity, $\beta$ is the attenuation factor, and $g(x)$ is a spatial distribution function. Posterior Cramér-Rao lower bound (PCRLB) calculation is performed on the non-homogeneous Poisson process model to obtain PCRLB quantized data. The PCRLB is used to quantify the uncertainty of parameter estimation, and its calculation formula is:

$$P_{k}\succeq J_{k}^{-1}$$

wherein $J_{k}$ is the Fisher information matrix. An optimization objective function is constructed based on the PCRLB quantized data; it is used to minimize the estimation error. The chaotic-mapping multi-objective cooperative differential evolution algorithm is then solved through the optimization objective function to obtain the optimal sensor scheduling scheme. The algorithm combines the initial-value sensitivity of chaotic mapping with the global search capability of cooperative differential evolution, and can effectively solve the multi-objective optimization problem. Motion state prediction data of the target object are finally generated through the optimal sensor scheduling scheme and the visual tracking model. The sensor scheduling scheme optimizes the configuration and use of the sensors, so that the visual tracking model can accurately predict the motion state of the target object.
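A minimal bootstrap particle-filter step (propagate, re-weight by the observation likelihood, resample) as a sketch of the particle filtering described above; the random-walk motion model and Gaussian observation likelihood are simplifying assumptions.

```python
import numpy as np

def particle_filter_step(particles, weights, z, motion_std=2.0, obs_std=5.0):
    """One bootstrap particle-filter step for a 2-D position state.
    particles: (N, 2) positions, weights: (N,), z: observed position (x, y)."""
    rng = np.random.default_rng()
    # Propagate particles with a random-walk motion model
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Re-weight by the observation likelihood p(z | x_i) (isotropic Gaussian)
    d2 = np.sum((particles - np.asarray(z, float)) ** 2, axis=1)
    weights = weights * np.exp(-0.5 * d2 / obs_std ** 2)
    weights = weights / weights.sum()
    # Multinomial resampling to avoid weight degeneracy
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = particles[idx]
    weights = np.full(len(particles), 1.0 / len(particles))
    estimate = particles.mean(axis=0)                  # fused position estimate
    return particles, weights, estimate
```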
In a specific embodiment, the process of executing step S105 may specifically include the following steps:
(1) Carrying out local search on the motion state prediction data of the target object to obtain local search data;
(2) Extracting feature points of the local search data to obtain feature point data, and generating feature descriptors of the feature point data to obtain feature descriptor data;
(3) Performing fast approximate nearest neighbor search on the feature descriptor data to obtain nearest neighbor matching data, and performing random sampling consistency analysis on the nearest neighbor matching data to obtain interior point matching data;
(4) Carrying out affine transformation estimation on the interior point matching data to obtain an affine transformation model, and carrying out transformation application on the affine transformation model to obtain transformed target position data;
(5) Performing dense corresponding matching on the transformed target position data to obtain dense matching data, and performing template updating on the dense matching data to obtain updated template data;
(6) And generating second position data and second characteristic data corresponding to the target object according to the updated template data.
Specifically, local search is performed on the motion state prediction data of the target object to obtain local search data. A fine search is performed near the predicted position of the target object in order to locate it more accurately; the local search may use a sliding window or a local region centered on the predicted position. For example, if the predicted position of the target object in the image is $(x_{p},y_{p})$, the search can be carried out in a small region centered at $(x_{p},y_{p})$ to obtain the local search data. Feature points are then extracted from the local search data to obtain feature point data. Feature point extraction identifies salient points in an image; common algorithms include SIFT (scale-invariant feature transform), SURF (speeded-up robust features) and ORB (oriented FAST and rotated BRIEF), all of which can detect corner points or other salient points. Assuming the ORB algorithm is used, the feature point extraction process can be expressed as:

$$\{p_{1},p_{2},\dots,p_{N}\}=\mathrm{ORB}(I)$$

wherein I represents the input image, $p_{i}$ represents the position of the i-th feature point, and N represents the number of feature points. Feature descriptors are then generated for the feature point data to obtain feature descriptor data. The feature descriptor describes the local image information around each feature point; taking the ORB algorithm as an example, the descriptor of each feature point is a binary vector generated by comparing gray values in the region around the feature point. A fast approximate nearest-neighbor search is performed on the feature descriptor data to obtain nearest-neighbor matching data; an algorithm such as FLANN (Fast Library for Approximate Nearest Neighbors) can efficiently find the nearest neighbor of each feature descriptor in feature space. Random sample consensus (RANSAC) analysis is carried out on the nearest-neighbor matching data to obtain inlier matching data; the RANSAC algorithm estimates model parameters by iteratively selecting random subsets and finding the largest set of inliers. Affine transformation estimation is carried out on the inlier matching data to obtain an affine transformation model. An affine transformation can map the points of one image into another image, and the affine transformation parameters are estimated by minimizing the error of the matched point pairs. The affine transformation model is applied to obtain the transformed target position data; the affine transformation maps the target position from the current frame to the next frame. Dense correspondence matching is then performed on the transformed target position data to obtain dense matching data. Dense correspondence matching obtains a more accurate result by matching more feature points over the whole target region; it may be implemented with optical flow or dense descriptor matching methods. Template updating is performed on the dense matching data to obtain updated template data. Template updating fuses the matching result of the current frame into the template so that the template adapts to changes of the target object. Finally, the second position data and second feature data corresponding to the target object are generated according to the updated template data. The second position data represents the position of the target object in the current frame, and the second feature data represents the updated feature descriptors of the target object.
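One way the template update could be sketched is as a running-average blend of the current target patch into the stored template; the blend rate alpha is an assumed value, not one specified by the application.

```python
import cv2
import numpy as np

def update_template(template, frame_gray, box, alpha=0.1):
    """Blend the current target patch into the template (running-average update).
    box is (x, y, w, h) from the transformed/matched target position; alpha is the
    assumed learning rate controlling how fast the template adapts."""
    x, y, w, h = [int(v) for v in box]
    patch = frame_gray[y:y + h, x:x + w]
    patch = cv2.resize(patch, (template.shape[1], template.shape[0]))
    updated = (1.0 - alpha) * template.astype(np.float32) + alpha * patch.astype(np.float32)
    return updated.astype(template.dtype)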
In a specific embodiment, the process of executing step S106 may specifically include the following steps:
(1) Performing vision tracking deviation calculation on the second position data and the second characteristic data to obtain vision tracking deviation data;
(2) Performing pan-tilt camera translation control through the vision tracking deviation data to obtain position adjustment parameters, and performing pan-tilt camera rotation control through the vision tracking deviation data to obtain angle adjustment parameters;
(3) Generating a visual feedback signal through the position adjustment parameter and the angle adjustment parameter to obtain a visual feedback signal, and generating a pan-tilt camera control command according to the visual feedback signal;
(4) And controlling the pan-tilt to carry out visual tracking on the target object according to the control command of the pan-tilt camera.
Specifically, the visual tracking deviation calculation is performed on the second position data and the second feature data to obtain visual tracking deviation data. Assume that the second position data is (x_t, y_t), i.e., the position of the target object in the current frame, and that the target center position of the camera is (x_c, y_c). The visual tracking deviation is obtained by calculating the deviation between the target object position and the camera center position:

Δx = x_t − x_c,  Δy = y_t − y_c

Pan-tilt camera translation control is then performed through the visual tracking deviation data to obtain the position adjustment parameters. Assuming that the pan-tilt camera can be adjusted in the horizontal and vertical directions, the position adjustment parameters can be expressed as:

P_x = k_p · Δx,  P_y = k_p · Δy

wherein P_x represents the horizontal adjustment parameter, P_y represents the vertical adjustment parameter, and k_p is the proportionality constant for position adjustment. Meanwhile, pan-tilt camera rotation control is performed through the visual tracking deviation data to obtain the angle adjustment parameters. Assume that the rotation angle of the pan-tilt camera is θ_h in the horizontal direction and θ_v in the vertical direction; the angle adjustment parameters can be expressed as:

Δθ_x = k_θ · Δx,  Δθ_y = k_θ · Δy

wherein Δθ_x represents the horizontal rotation angle adjustment parameter, Δθ_y represents the vertical rotation angle adjustment parameter, and k_θ is the proportionality constant for angle adjustment. A visual feedback signal is then generated from the position adjustment parameters and the angle adjustment parameters. The visual feedback signal feeds back the current state and position of the camera in real time so that dynamic adjustment can be performed, and it may be generated by fusing the position and angle adjustment parameters:

F_x = f(P_x, Δθ_x),  F_y = f(P_y, Δθ_y)

wherein F_x represents the horizontal feedback signal, F_y represents the vertical feedback signal, and f is a function that combines the position and angle adjustment parameters. A pan-tilt camera control command is then generated according to the visual feedback signal. The control command guides the actual action of the pan-tilt camera so that it is aligned with the position of the target object, and it can be expressed as:

C_x = g(F_x),  C_y = g(F_y)

wherein C_x represents the horizontal control command, C_y represents the vertical control command, and g is a function that converts the feedback signal into a control command. Finally, the pan-tilt is controlled according to the pan-tilt camera control command to perform visual tracking of the target object. After receiving the control command, the pan-tilt adjusts the position and angle of the camera so that the target object is always located at the center of the camera's field of view.
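For concreteness, the short Python sketch below implements the deviation-to-command chain described above under stated assumptions: the gain values K_P and K_THETA, the choice of a simple sum for the fusion function f, and the clamping function g are illustrative placeholders rather than values or functions taken from this application.

# Illustrative sketch of deviation -> adjustment parameters -> control command.
# Gains and the f/g functions are assumptions, not the patented implementation.
from dataclasses import dataclass

@dataclass
class PanTiltCommand:
    horizontal: float
    vertical: float

K_P = 0.05      # assumed proportionality constant for position adjustment
K_THETA = 0.02  # assumed proportionality constant for angle adjustment

def compute_command(target_xy, frame_size):
    """Convert the tracked target position into a pan-tilt control command."""
    x_t, y_t = target_xy
    x_c, y_c = frame_size[0] / 2.0, frame_size[1] / 2.0

    # Visual tracking deviation between target position and camera centre.
    dx, dy = x_t - x_c, y_t - y_c

    # Position and angle adjustment parameters (proportional control).
    p_x, p_y = K_P * dx, K_P * dy
    d_theta_x, d_theta_y = K_THETA * dx, K_THETA * dy

    # Feedback fusion f (here a plain sum) and command conversion g (here a clamp).
    f_x, f_y = p_x + d_theta_x, p_y + d_theta_y
    clamp = lambda v: max(-1.0, min(1.0, v))
    return PanTiltCommand(horizontal=clamp(f_x), vertical=clamp(f_y))

# Example: a 1280x720 frame with the target detected at pixel (820, 300).
print(compute_command((820, 300), (1280, 720)))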
The above describes the image processing-based visual tracking method in the embodiment of the present application, and the following describes the image processing-based visual tracking device in the embodiment of the present application, referring to fig. 2, one embodiment of the image processing-based visual tracking device in the embodiment of the present application includes:
The acquisition module 201 is configured to acquire a continuous environment image sequence of a target object through a pan-tilt and perform image preprocessing to obtain a preprocessed environment image sequence;
The detection module 202 is configured to perform target detection analysis on a target object in the preprocessing environment image sequence to obtain a target detection result, and perform feature extraction on the target detection result to obtain first position data and first feature data;
the construction module 203 is configured to input the first position data and the first feature data into a preset graph convolution network to perform feature aggregation and tracking parameter membership analysis, obtain an optimal tracking parameter combination, and construct a visual tracking model according to the optimal tracking parameter combination;
The prediction module 204 is used for performing Kalman filtering analysis and motion state prediction on the visual tracking model to obtain motion state prediction data of the target object;
the matching module 205 is configured to perform feature matching on the motion state prediction data of the target object to obtain second position data and second feature data;
The control module 206 is configured to perform vision tracking deviation calculation on the second position data and the second feature data, obtain vision tracking deviation data, generate a position adjustment parameter and an angle adjustment parameter of the pan-tilt according to the vision tracking deviation data, and control the pan-tilt to perform vision tracking on the target object.
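Purely as a structural illustration, the Python skeleton below mirrors the six modules 201-206 listed above; the class and method names are hypothetical and the bodies are placeholders, not the logic of this application.

# Hypothetical skeleton mirroring modules 201-206; method names are illustrative only.
class ImageProcessingVisualTracker:
    def acquire(self, pan_tilt):
        """Module 201: capture the environment image sequence and preprocess it."""
        raise NotImplementedError

    def detect(self, preprocessed_frames):
        """Module 202: target detection and feature extraction -> first position/feature data."""
        raise NotImplementedError

    def build_model(self, first_position, first_features):
        """Module 203: graph convolution feature aggregation and tracking-parameter selection."""
        raise NotImplementedError

    def predict(self, tracking_model):
        """Module 204: Kalman filtering analysis and motion state prediction."""
        raise NotImplementedError

    def match(self, motion_prediction):
        """Module 205: feature matching -> second position/feature data."""
        raise NotImplementedError

    def control(self, second_position, second_features, pan_tilt):
        """Module 206: deviation calculation and pan-tilt adjustment commands."""
        raise NotImplementedError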
Through the cooperative operation of the above components, the collected environment image sequence is comprehensively optimized by image preprocessing: noise is effectively removed, the contrast and edge information of the image are enhanced, and the image quality is improved, so that the target object in the image sequence can be accurately detected and its features extracted. The multi-level feature extraction improves the accuracy of target detection and retains rich information such as the shape, color and texture of the target object. A preset graph convolution network performs feature aggregation on the first position data and the first feature data, and tracking parameter analysis is carried out through a membership model; the multi-layer convolution operations and nonlinear activation functions achieve effective feature aggregation, and the membership analysis accurately computes the relation between each position node and the predefined tracking parameters. The motion state of the target object is predicted through Kalman filtering and particle filtering, and a region search model and a non-homogeneous Poisson process are introduced, so that the update of undiscovered targets is more accurate; the posterior Cramér–Rao lower bound (PCRLB) calculation and the optimized objective function ensure the accuracy and stability of the prediction data. Feature matching and template updating of the target object are completed through local search, so that the target position and features can still be accurately matched under rapid target motion and environmental change, effectively avoiding tracking drift and false tracking. By calculating the visual tracking deviation data and generating the position adjustment parameters and angle adjustment parameters of the pan-tilt, the target object is kept within the field of view of the camera, motion changes of the target can be responded to quickly, and the accuracy of the pan-tilt's visual tracking is improved.
The present application also provides an image processing-based visual tracking apparatus, which includes a memory and a processor, wherein the memory stores computer-readable instructions that, when executed by the processor, cause the processor to execute the steps of the image processing-based visual tracking method in the above embodiments.
The present application also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and may also be a volatile computer readable storage medium, in which instructions are stored which, when executed on a computer, cause the computer to perform the steps of the image processing-based visual tracking method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may, in essence, or in whole or in part, be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. An image processing-based visual tracking method, characterized in that the image processing-based visual tracking method comprises the following steps:
collecting a continuous environment image sequence of a target object through a holder and carrying out image preprocessing to obtain a preprocessed environment image sequence;
Performing target detection analysis on a target object in the preprocessing environment image sequence to obtain a target detection result, and performing feature extraction on the target detection result to obtain first position data and first feature data;
Inputting the first position data and the first characteristic data into a preset graph convolution network to perform characteristic aggregation and tracking parameter membership analysis to obtain an optimal tracking parameter combination, and constructing a visual tracking model according to the optimal tracking parameter combination;
Carrying out Kalman filtering analysis and motion state prediction on the visual tracking model to obtain motion state prediction data of the target object;
performing feature matching on the motion state prediction data of the target object to obtain second position data and second feature data;
And performing vision tracking deviation calculation on the second position data and the second characteristic data to obtain vision tracking deviation data, generating position adjustment parameters and angle adjustment parameters of the holder according to the vision tracking deviation data, and controlling the holder to perform vision tracking on the target object.
2. The visual tracking method based on image processing according to claim 1, wherein the capturing a continuous environmental image sequence of a target object by a pan-tilt and performing image preprocessing to obtain a preprocessed environmental image sequence comprises:
Collecting a continuous environment image sequence of a target object through a holder;
performing median filtering on the continuous environment image sequence to obtain a denoising image sequence;
Performing mean filtering on the denoising image sequence to obtain an initial smooth image sequence;
Carrying out Gaussian filtering on the initial smooth image sequence to obtain a Gaussian smooth image sequence;
performing histogram equalization on the Gaussian smooth image sequence to obtain a contrast enhancement image sequence;
Performing gamma correction on the contrast enhancement image sequence to obtain a brightness correction image sequence;
Performing edge detection on the brightness correction image sequence to obtain an edge enhancement image sequence;
performing binarization processing on the edge enhanced image sequence to obtain a binarized image sequence;
performing morphological processing on the binarized image sequence to obtain a morphological processed image sequence;
And performing image clipping on the image sequence subjected to morphological processing to obtain a preprocessing environment image sequence.
3. The image processing-based visual tracking method according to claim 1, wherein the performing target detection analysis on the target object in the preprocessing environment image sequence to obtain a target detection result, and performing feature extraction on the target detection result to obtain first position data and first feature data includes:
performing sliding window processing on the preprocessing environment image sequence to generate a plurality of candidate region image sequences;
Carrying out Haar feature extraction on target objects in the candidate region image sequences to obtain Haar feature data;
Performing AdaBoost classification on the Haar characteristic data to obtain a target detection result of the target object;
Performing ORB feature extraction on the target detection result to obtain ORB feature data;
And carrying out position and feature analysis on the ORB feature data to obtain first position data and first feature data.
4. The image processing-based visual tracking method according to claim 1, wherein inputting the first position data and the first feature data into a preset graph convolution network to perform feature aggregation and tracking parameter membership analysis, obtaining an optimal tracking parameter combination, and constructing a visual tracking model according to the optimal tracking parameter combination, comprises:
performing node conversion on the first position data to obtain a plurality of position nodes, and performing node attribute conversion on the first characteristic data according to the plurality of position nodes to obtain node attribute data of each position node;
Constructing a graph structure of the plurality of position nodes and node attribute data of each position node to obtain a position graph structure;
Carrying out convolution operation and multi-layer convolution feature aggregation on the position graph structure through a preset graph convolution network to obtain a feature aggregation result, wherein the graph convolution network comprises a plurality of graph convolution layers, and a nonlinear activation function is arranged behind each graph convolution layer;
Calculating the membership of each position node and a predefined tracking parameter according to the feature aggregation result through a membership model to obtain an optimal tracking parameter combination;
And selecting and configuring the initial state and the noise covariance of the Kalman filter and the target speed and the acceleration according to the optimal tracking parameter combination, and constructing a visual tracking model.
5. The image processing-based visual tracking method according to claim 1, wherein the performing Kalman filtering analysis and motion state prediction on the visual tracking model to obtain motion state prediction data of the target object includes:
Predicting the position of the target object at the next moment through the visual tracking model to obtain target position prediction data;
carrying out Kalman filtering analysis on the target position prediction data to obtain Kalman filtering analysis results;
Carrying out particle filtering prediction on the Kalman filtering analysis result to obtain particle position prediction data;
Performing multi-sensor data fusion on the particle position prediction data to obtain fusion position prediction data;
constructing a corresponding area search model according to the fusion position prediction data, and generating undiscovered target update data through the area search model;
modeling the non-homogeneous Poisson process of the undiscovered target update data to obtain a non-homogeneous Poisson process model;
Performing posterior Cramér–Rao lower bound (PCRLB) calculation on the non-homogeneous Poisson process model to obtain PCRLB quantized data, and constructing an optimization objective function based on the PCRLB quantized data;
Carrying out chaotic mapping-multi-objective collaborative differential evolution algorithm solution through the optimized objective function to obtain an optimal sensor scheduling scheme;
And generating motion state prediction data of the target object through the optimal sensor scheduling scheme and the visual tracking model.
6. The image processing-based visual tracking method according to claim 1, wherein the performing feature matching on the motion state prediction data of the target object to obtain second position data and second feature data includes:
Carrying out local search on the motion state prediction data of the target object to obtain local search data;
Extracting feature points of the local search data to obtain feature point data, and generating feature descriptors of the feature point data to obtain feature descriptor data;
performing fast approximate nearest neighbor search on the feature descriptor data to obtain nearest neighbor matching data, and performing random sampling consistency analysis on the nearest neighbor matching data to obtain interior point matching data;
carrying out affine transformation estimation on the interior point matching data to obtain an affine transformation model, and carrying out transformation application on the affine transformation model to obtain transformed target position data;
Performing dense correspondence matching on the transformed target position data to obtain dense matching data, and performing template updating on the dense matching data to obtain updated template data;
And generating second position data and second characteristic data corresponding to the target object according to the updated template data.
7. The image processing-based visual tracking method according to claim 1, wherein the performing a visual tracking deviation calculation on the second position data and the second feature data to obtain visual tracking deviation data, and generating a position adjustment parameter and an angle adjustment parameter of the pan-tilt according to the visual tracking deviation data, and controlling the pan-tilt to perform visual tracking on the target object includes:
Performing vision tracking deviation calculation on the second position data and the second characteristic data to obtain vision tracking deviation data;
Performing pan-tilt camera translation control through the visual tracking deviation data to obtain position adjustment parameters, and performing pan-tilt camera rotation control through the visual tracking deviation data to obtain angle adjustment parameters;
Performing visual feedback signal generation according to the position adjustment parameter and the angle adjustment parameter to obtain a visual feedback signal, and generating a pan-tilt camera control command according to the visual feedback signal;
And controlling the cradle head to carry out visual tracking on the target object according to the cradle head camera control command.
8. An image processing-based visual tracking apparatus, comprising:
the acquisition module is used for acquiring a continuous environment image sequence of the target object through the cradle head and carrying out image preprocessing to obtain a preprocessed environment image sequence;
The detection module is used for carrying out target detection analysis on a target object in the preprocessing environment image sequence to obtain a target detection result, and carrying out feature extraction on the target detection result to obtain first position data and first feature data;
The construction module is used for inputting the first position data and the first characteristic data into a preset graph convolution network to perform characteristic aggregation and tracking parameter membership analysis to obtain an optimal tracking parameter combination, and constructing a visual tracking model according to the optimal tracking parameter combination;
The prediction module is used for carrying out Kalman filtering analysis and motion state prediction on the visual tracking model to obtain motion state prediction data of the target object;
The matching module is used for carrying out feature matching on the motion state prediction data of the target object to obtain second position data and second feature data;
The control module is used for carrying out vision tracking deviation calculation on the second position data and the second characteristic data to obtain vision tracking deviation data, generating position adjustment parameters and angle adjustment parameters of the holder according to the vision tracking deviation data, and controlling the holder to carry out vision tracking on the target object.
9. An image processing-based visual tracking apparatus, characterized in that the image processing-based visual tracking apparatus comprises: a memory and at least one processor, the memory having instructions stored therein;
The at least one processor invokes the instructions in the memory to cause the image processing based visual tracking apparatus to perform the image processing based visual tracking method of any of claims 1-7.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the image processing based visual tracking method of any of claims 1-7.
CN202410853855.2A 2024-06-28 2024-06-28 Visual tracking method and device based on image processing Active CN118429388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410853855.2A CN118429388B (en) 2024-06-28 2024-06-28 Visual tracking method and device based on image processing

Publications (2)

Publication Number Publication Date
CN118429388A true CN118429388A (en) 2024-08-02
CN118429388B CN118429388B (en) 2024-09-13

Family

ID=92321617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410853855.2A Active CN118429388B (en) 2024-06-28 2024-06-28 Visual tracking method and device based on image processing

Country Status (1)

Country Link
CN (1) CN118429388B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180143637A1 (en) * 2017-01-09 2018-05-24 Shanghai Hang Seng Electronic Technology, Co., Ltd Visual tracking method and device, unmanned aerial vehicle and terminal device
CN113850848A (en) * 2021-09-26 2021-12-28 大连海事大学 Unmanned boat-mounted marine radar and visual image cooperative marine multi-target long-term detection and tracking method
CN116977434A (en) * 2023-08-17 2023-10-31 深圳优立全息科技有限公司 Target behavior tracking method and system based on tracking camera
WO2024011852A1 (en) * 2022-07-12 2024-01-18 天翼云科技有限公司 Object tracking method and apparatus, and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG Junfeng; LEI Bin: "Application of Kalman Prediction in Automatic Tracking Pan-Tilt", Foreign Electronic Measurement Technology, no. 10, 20 October 2006 (2006-10-20), pages 1-3 *

Similar Documents

Publication Publication Date Title
CN103325112B (en) Moving target method for quick in dynamic scene
CN113592845A (en) Defect detection method and device for battery coating and storage medium
JP5213486B2 (en) Object tracking device and object tracking method
CN107909081B (en) Method for quickly acquiring and quickly calibrating image data set in deep learning
CN110097596B (en) Object detection system based on opencv
CN110021024B (en) Image segmentation method based on LBP and chain code technology
US6757571B1 (en) System and process for bootstrap initialization of vision-based tracking systems
CN109460735B (en) Document binarization processing method, system and device based on graph semi-supervised learning
CN109961506A (en) A kind of fusion improves the local scene three-dimensional reconstruction method of Census figure
CN109086724B (en) Accelerated human face detection method and storage medium
CN113449606B (en) Target object identification method and device, computer equipment and storage medium
CN113379789B (en) Moving target tracking method in complex environment
Venkatesan et al. Face recognition system with genetic algorithm and ANT colony optimization
CN113888461A (en) Method, system and equipment for detecting defects of hardware parts based on deep learning
WO2015181179A1 (en) Method and apparatus for object tracking and segmentation via background tracking
CN117437406A (en) Multi-target detection method and device
CN111028263A (en) Moving object segmentation method and system based on optical flow color clustering
CN113689365B (en) Target tracking and positioning method based on Azure Kinect
KR101690050B1 (en) Intelligent video security system
CN113920168A (en) Image tracking method in audio and video control equipment
CN113516680A (en) Moving target tracking and detecting method under moving background
Zhang et al. A coarse-to-fine leaf detection approach based on leaf skeleton identification and joint segmentation
CN107704864B (en) Salient object detection method based on image object semantic detection
CN118429388B (en) Visual tracking method and device based on image processing
Chen et al. Illumination-invariant video cut-out using octagon sensitive optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant