CN106203350A - Moving target cross-scale tracking method and device - Google Patents

Moving target cross-scale tracking method and device

Info

Publication number
CN106203350A
CN106203350A (application CN201610548086.0A)
Authority
CN
China
Prior art keywords
neural network
video frame
weight
layer
formula
Prior art date
Legal status
Granted
Application number
CN201610548086.0A
Other languages
Chinese (zh)
Other versions
CN106203350B (en)
Inventor
杜军平
朱素果
任楠
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201610548086.0A priority Critical patent/CN106203350B/en
Publication of CN106203350A publication Critical patent/CN106203350A/en
Application granted granted Critical
Publication of CN106203350B publication Critical patent/CN106203350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a moving target cross-scale tracking method and device. A neural network is built and trained layer by layer to obtain initialized weights; the true value of the first video frame is input into the established neural network to update the initial weights; starting from the second video frame, a bias term is calculated for each input video frame; according to the obtained bias term, the output value of the video frame after it is input into the neural network is calculated; it is judged whether the confidence output by the neural network for the video frame is smaller than a preset threshold, and if it is smaller, the weights of the neural network are updated and the moving target of the video frame is estimated according to the neural network with the updated weights; if it is larger, the moving target of the video frame is estimated directly. The moving target cross-scale tracking method and device can therefore achieve accurate feature extraction of the moving target, predict the position and size of the moving target, and obtain an optimal moving target tracking result.

Description

Moving target cross-scale tracking method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a cross-scale tracking method and a cross-scale tracking device for a moving target.
Background
Traditional tracking methods combine different complex features with a deep neural network but do not consider the scale differences between different sampling slices, which increases the complexity of the method to a certain extent. The Deep Learning Tracking (DLT) approach learns the appearance of the target during visual tracking with a stacked denoising autoencoder (SDAE) and tracks the moving target by combining the deep network with particle filtering; its frame structure is simple, considering only a four-layer neural network framework, and it tracks simple moving targets well, but it is not suitable for moving target tracking in complex environments.
A tracking method that combines self-learning with a suspension constraint principle initializes the network through offline learning and updates the network weights online, with motion estimation based on logistic-regression probability estimation. Although the accuracy of this method is improved to some extent, it is still not robust to more challenging problems such as occlusion or partial occlusion. Hash methods can distinguish different image blocks using simple information; they are sensitive to the specific contents of an image and insensitive to its integrity. A moving target can be tracked with three basic hash methods (perceptual hashing, average hashing and difference hashing); this approach considers the characteristics of all three hash methods but also increases the complexity of the method. To reduce the complexity of the method and improve its efficiency, a two-dimensional combined hash method has been proposed as a feature-extraction method, combined with Bayesian motion estimation to complete the tracking of a moving target, but it performs well on only part of the video sequences.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for tracking a moving target across scales, which can implement accurate feature extraction of the moving target, predict the position and size of the moving target, and obtain an optimal tracking result of the moving target.
Based on the above purpose, the invention provides a cross-scale tracking method for a moving target, which comprises the following steps:
building a neural network, and carrying out weight training layer by layer to obtain initialized weights;
inputting the true value of the first video frame into the established neural network, and updating the initial weight;
calculating a bias term for the input video frame starting from the second video frame;
calculating an output value of the video frame after the video frame is input into the neural network according to the obtained bias term of the video frame;
judging whether the confidence output by the neural network for the video frame is smaller than a preset threshold value, if so, updating the weights of the neural network, and estimating the moving target of the video frame according to the neural network with the updated weights; and if it is larger than the preset threshold, directly estimating the moving target of the video frame.
In some embodiments of the present invention, a stacked denoising autoencoder is used to train the weights layer by layer, so as to obtain the initialized weights.
In some embodiments of the present invention, the constructing the neural network and performing weight training layer by layer to obtain the initialized weight includes:
let k denote the number of training samples, i = 1, 2, …, k, and let the training sample set be {x_1, …, x_i, …, x_k}; W' and W respectively denote the weights of the hidden layer and the weights of the output layer, and b' and b denote the bias terms of the different hidden layers. From an input sample x_i, a hidden-layer representation h_i and a reconstruction x̂_i of the input can be derived, as shown in formula (1) and formula (2):

h_i = f(W'x_i + b')    (1)

x̂_i = sigm(Wh_i + b)    (2)

where f(·) denotes a nonlinear excitation function and sigm(·) denotes the excitation function of the neural network, as shown in formula (3):

sigm(y) = 1 / (1 + exp(-y))    (3)

The denoising autoencoder is obtained through learning, as shown in formula (4):

min_{W, W', b, b'} Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 + γ(||W||_F^2 + ||W'||_F^2)    (4)

where Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 is the reconstruction loss of the neural network and the second term is a weight penalty term; the parameter γ balances the reconstruction error against the weight penalty term, so that the relationship between the two is fully considered. Denoting the objective of formula (4) by E(w), where w collects the weights, and taking the partial derivative of E with respect to w_{ji} gives formula (5):

∂E(w)/∂w_{ji} = ∂E_0(w)/∂w_{ji} + 2γw_{ji} = -δ_j x_{ji} + 2γw_{ji}    (5)

Letting w̃_{ji} = w_{ji} - η·∂E(w)/∂w_{ji}, formula (6) and formula (7) are obtained:

w̃_{ji} = w_{ji} + η(δ_j x_{ji} - 2γw_{ji})    (6)

w̃_{ji} = (1 - 2ηγ)w_{ji} + ηδ_j x_{ji}    (7)

where η is the learning rate, δ_j is the error, and x_{ji} and w_{ji} denote the data and the weights from the i-th layer to the j-th layer, respectively.
In some embodiments of the invention, said calculating the bias term for the input video frame comprises:
sampling the video frame to obtain an initialization sample set S = {s_1, s_2, …, s_N}, i = 1, 2, …, N;

calculating the hash value l of the tracking result of the previous frame I_{t-1};

for each sample s_i, calculating the average hash value l_{s_i} of s_i;

calculating the Hamming distance between l and each sample s_i using formula (8), where N denotes the number of samples, S = {s_1, s_2, …, s_N}, i = 1, 2, …, N, and dis(l, s_i) denotes the Hamming distance between the hash value l of the tracking result of video frame I_{t-1} and the hash value l_{s_i} of the current i-th sample s_i, as shown in formula (8):

dis(l, s_i) = Σ_{i=1}^{N} l ⊗ l_{s_i}    (8)

obtaining the bias term corresponding to sample s_i from the obtained Hamming distance according to formula (9):

b_i = 1 - dis(l, s_i) / Σ_{t=1}^{N} dis(l, s_t)    (9)

where b_i denotes the bias term of the i-th sample.
In some embodiments of the present invention, said estimating the moving object of the video frame according to the neural network with the updated weight value includes:
calculating the updated confidence of each particle according to the neural network with the updated weight;
and selecting, among the calculated updated confidences, the particle with the maximum updated confidence as the moving object of the video frame.
In another aspect, the present invention further provides a moving object cross-scale tracking apparatus, including:
the neural network construction unit is used for constructing a neural network and carrying out weight training layer by layer to obtain initialized weights;
the weight updating unit is used for inputting the true value of the first video frame into the established neural network and updating the initial weights; calculating a bias term for the input video frame starting from the second video frame; calculating an output value of the video frame after the video frame is input into the neural network according to the obtained bias term of the video frame; and judging whether the confidence output by the neural network for the video frame is smaller than a preset threshold value, updating the weights of the neural network if so, and performing no processing if not;
and the moving object estimation unit is used for estimating the moving object of the video frame.
In some embodiments of the present invention, the neural network constructing unit performs the layer-by-layer weight training by using a stacked denoising autoencoder to obtain the initialized weights.
In some embodiments of the present invention, the obtaining the initialized weight value by the neural network constructing unit includes:
let k denote the number of training samples, i = 1, 2, …, k, and let the training sample set be {x_1, …, x_i, …, x_k}; W' and W respectively denote the weights of the hidden layer and the weights of the output layer, and b' and b denote the bias terms of the different hidden layers. From an input sample x_i, a hidden-layer representation h_i and a reconstruction x̂_i of the input can be derived, as shown in formula (1) and formula (2):

h_i = f(W'x_i + b')    (1)

x̂_i = sigm(Wh_i + b)    (2)

where f(·) denotes a nonlinear excitation function and sigm(·) denotes the excitation function of the neural network, as shown in formula (3):

sigm(y) = 1 / (1 + exp(-y))    (3)

The denoising autoencoder is obtained through learning, as shown in formula (4):

min_{W, W', b, b'} Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 + γ(||W||_F^2 + ||W'||_F^2)    (4)

where Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 is the reconstruction loss of the neural network and the second term is a weight penalty term; the parameter γ balances the reconstruction error against the weight penalty term, so that the relationship between the two is fully considered. Denoting the objective of formula (4) by E(w), where w collects the weights, and taking the partial derivative of E with respect to w_{ji} gives formula (5):

∂E(w)/∂w_{ji} = ∂E_0(w)/∂w_{ji} + 2γw_{ji} = -δ_j x_{ji} + 2γw_{ji}    (5)

Letting w̃_{ji} = w_{ji} - η·∂E(w)/∂w_{ji}, formula (6) and formula (7) are obtained:

w̃_{ji} = w_{ji} + η(δ_j x_{ji} - 2γw_{ji})    (6)

w̃_{ji} = (1 - 2ηγ)w_{ji} + ηδ_j x_{ji}    (7)

where η is the learning rate, δ_j is the error, and x_{ji} and w_{ji} denote the data and the weights from the i-th layer to the j-th layer, respectively.
In some embodiments of the present invention, the calculating, by the weight updating unit, of the bias term for the input video frame includes:

sampling the video frame to obtain an initialization sample set S = {s_1, s_2, …, s_N}, i = 1, 2, …, N;

calculating the hash value l of the tracking result of the previous frame I_{t-1};

for each sample s_i, calculating the average hash value l_{s_i} of s_i;

calculating the Hamming distance between l and each sample s_i using formula (8), where N denotes the number of samples, S = {s_1, s_2, …, s_N}, i = 1, 2, …, N, and dis(l, s_i) denotes the Hamming distance between the hash value l of the tracking result of video frame I_{t-1} and the hash value l_{s_i} of the current i-th sample s_i, as shown in formula (8):

dis(l, s_i) = Σ_{i=1}^{N} l ⊗ l_{s_i}    (8)

obtaining the bias term corresponding to sample s_i from the obtained Hamming distance according to formula (9):

b_i = 1 - dis(l, s_i) / Σ_{t=1}^{N} dis(l, s_t)    (9)

where b_i denotes the bias term of the i-th sample.
In some embodiments of the present invention, the estimating, by the moving object estimating unit, the moving object of the video frame according to the neural network with the updated weight value includes:
calculating the updated confidence of each particle according to the neural network with the updated weight;
and selecting, among the calculated updated confidences, the particle with the maximum updated confidence as the moving object of the video frame.
From the above, it can be seen that the moving target cross-scale tracking method and device provided by the invention correct the bias terms of the neural network by calculating the hash feature values of different sampling slices and taking the similarity between these hash feature values and the template as the scale feature, and estimate the probability of each particle with a particle-filter moving target tracking method, thereby completing the tracking of the moving target.
Drawings
FIG. 1 is a schematic flow chart of a cross-scale tracking method for a moving object according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a moving object cross-scale tracking device in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all expressions using "first" and "second" in the embodiments of the present invention are used to distinguish two entities or parameters that share the same name but are not identical; "first" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention, and this is not repeated in the following embodiments.
As an embodiment, referring to fig. 1, the method for tracking a moving object across scales may adopt the following steps:
step 101, building a neural network, and performing weight training layer by layer to obtain initialized weights and bias items.
Preferably, a stacked denoising autoencoder is used to train the weights layer by layer to obtain the initialized weights.
In this embodiment, let k denote the number of training samples, i = 1, 2, …, k, and let the training sample set be {x_1, …, x_i, …, x_k}; W' and W respectively denote the weights of the hidden layer and the weights of the output layer, and b' and b denote the bias terms of the different hidden layers. From an input sample x_i, a hidden-layer representation h_i and a reconstruction x̂_i of the input can be derived, as shown in formula (1) and formula (2):

h_i = f(W'x_i + b')    (1)

x̂_i = sigm(Wh_i + b)    (2)

where f(·) denotes a nonlinear excitation function and sigm(·) denotes the excitation function of the neural network, as shown in formula (3):

sigm(y) = 1 / (1 + exp(-y))    (3)

The denoising autoencoder is obtained through learning, as shown in formula (4):

min_{W, W', b, b'} Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 + γ(||W||_F^2 + ||W'||_F^2)    (4)

where Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 is the reconstruction loss of the neural network and the second term is a weight penalty term; the search for smaller weights is carried out by gradient descent, which reduces the possibility of overfitting. The parameter γ balances the reconstruction error against the weight penalty term, so that the relationship between the two is fully considered. Denoting the objective of formula (4) by E(w), where w collects the weights, and taking the partial derivative of E with respect to w_{ji} gives formula (5):

∂E(w)/∂w_{ji} = ∂E_0(w)/∂w_{ji} + 2γw_{ji} = -δ_j x_{ji} + 2γw_{ji}    (5)

Letting w̃_{ji} = w_{ji} - η·∂E(w)/∂w_{ji}, formula (6) and formula (7) are obtained:

w̃_{ji} = w_{ji} + η(δ_j x_{ji} - 2γw_{ji})    (6)

w̃_{ji} = (1 - 2ηγ)w_{ji} + ηδ_j x_{ji}    (7)

where η is the learning rate, δ_j is the error, and x_{ji} and w_{ji} denote the data and the weights from the i-th layer to the j-th layer, respectively.
Preferably, an eight-layer neural network structure may be established in step 101.
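As a purely illustrative, non-limiting sketch of how one such layer could be pretrained (assuming Python with NumPy; all function and variable names below are illustrative and are not part of the claimed method), the denoising autoencoder of formulas (1) to (4) with the update rule of formula (7) might be written as:

```python
import numpy as np

def sigm(y):
    """Excitation function of formula (3)."""
    return 1.0 / (1.0 + np.exp(-y))

def pretrain_layer(X, n_hidden, eta=0.01, gamma=1e-4, noise=0.1, epochs=50, seed=0):
    """Illustrative layer-wise pretraining of one denoising autoencoder layer.

    X is a k x d matrix of training samples. The reconstruction objective of
    formula (4) is minimized with the gradient-descent update of formula (7).
    Returns the hidden representation (input of the next layer) and the
    trained encoder parameters.
    """
    rng = np.random.default_rng(seed)
    k, d = X.shape
    W_enc = rng.normal(0.0, 0.01, (n_hidden, d))   # W' of formula (1)
    W_dec = rng.normal(0.0, 0.01, (d, n_hidden))   # W  of formula (2)
    b_enc = np.zeros(n_hidden)                     # b' of formula (1)
    b_dec = np.zeros(d)                            # b  of formula (2)

    for _ in range(epochs):
        for x in X:
            x_noisy = x + noise * rng.standard_normal(d)   # corrupted ("noised") input
            h = sigm(W_enc @ x_noisy + b_enc)              # formula (1), with f = sigm
            x_hat = sigm(W_dec @ h + b_dec)                # formula (2)

            # Error terms delta_j of formula (5): output layer, then hidden layer
            delta_dec = (x - x_hat) * x_hat * (1.0 - x_hat)
            delta_enc = (W_dec.T @ delta_dec) * h * (1.0 - h)

            # Update of formula (7): w_new = (1 - 2*eta*gamma) * w + eta * delta * x
            W_dec = (1.0 - 2.0 * eta * gamma) * W_dec + eta * np.outer(delta_dec, h)
            W_enc = (1.0 - 2.0 * eta * gamma) * W_enc + eta * np.outer(delta_enc, x_noisy)
            b_dec += eta * delta_dec
            b_enc += eta * delta_enc

    return sigm(X @ W_enc.T + b_enc), (W_enc, b_enc)
```

Stacking several such layers, feeding the returned hidden representation into the next call, yields the layer-wise initialized weights of step 101; the learning rate, penalty weight and noise level chosen above are illustrative only.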
Step 102, inputting the true value of the first video frame into the established neural network, and updating the initial weights.

Specifically, the true value of the first video frame can be substituted into formula (7) to obtain the updated weights.
Step 103, starting from the second video frame, calculating the bias term for the input video frame, i.e. updating the initialized bias term. The specific implementation process comprises the following steps:
the method comprises the following steps: sampling a video frame to obtain an initialization sample S ═ S1,s2,…,sN},i=1,2,…,N
Step two: calculating the previous frame It-1The hash value l of the result is tracked.
Step three: for each sample siCalculating siAverage hash value of
Step four: calculate l and each sample s using equation (8)iHamming distance between.
In one embodiment, assume that the current video frame is ItThe previous video frame It-1As a comparison object, calculating It-1And tracking the average hash value of the result, and calculating the Hamming distance between the average hash value of the result and the average hash values of all current samples. The greater the hamming distance between each two hash values, the less similarity between the two, and vice versa.
Let N denote the number of samples, S ═ S1,s2,…,sN},i=1,2,…,N,dis(l,si) Representing video frames It-1Tracking the hash value l of the result and the current ith sample siHash ofValue ofThe hamming distance therebetween can be expressed as shown in equation (8):
d i s ( l , s i ) = Σ i = 1 N l ⊗ l s i - - - ( 8 )
wherein,indicating an exclusive or operation.
Step five: obtaining a sample s according to the obtained Hamming distanceiThe corresponding bias term.
Wherein, the sample siThe corresponding new bias term is shown as equation (9):
b i = 1 - d i s ( l , s i ) Σ t = 1 N d i s ( l , s t ) - - - ( 9 )
wherein, biThe bias term for the ith sample is indicated. Easy to find, new bias term biTo some extent, the importance of the current sample in all samples.
It is also worth noting that, in neural networks, the bias term is usually initialized to a fixed value, such as 1. Here the bias term is used as an input term of the deep network and represents, to a certain degree, the relative importance of the different samples. The inherent low-frequency information of the image is used to capture the structural features of the target, and the bias term is corrected so that different samples correspond to different bias values. These structural features reflect, to a certain extent, the different scales of the sampling slices and are a representation of the image scale features.
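As an illustrative, non-limiting sketch (assuming Python with NumPy and grayscale image patches; the function names are hypothetical), the average hash of a sampling slice and the bias terms of formulas (8) and (9) could be computed as follows, with the Hamming distance taken over the 64 bits of the hash:

```python
import numpy as np

def average_hash(patch):
    """Average hash of a grayscale patch: down-sample to 8 x 8 (64 pixels),
    then set each bit to 1 where the pixel is >= the mean gray level."""
    h, w = patch.shape
    cropped = patch[:h - h % 8, :w - w % 8]
    small = cropped.reshape(8, cropped.shape[0] // 8, 8, cropped.shape[1] // 8).mean(axis=(1, 3))
    return (small >= small.mean()).astype(np.uint8).ravel()   # 64-bit hash

def bias_terms(prev_result_patch, sample_patches):
    """Bias term b_i of formula (9) for every sampling slice s_i.

    The distance dis(l, s_i) is taken as the Hamming distance (bitwise XOR
    count) between the hash l of the previous tracking result and the hash
    of slice s_i, in the spirit of formula (8)."""
    l = average_hash(prev_result_patch)
    dist = np.array([np.count_nonzero(l ^ average_hash(s)) for s in sample_patches],
                    dtype=float)
    return 1.0 - dist / (dist.sum() + 1e-12)    # formula (9), guarded against all-zero distances
```

The fixed 8 x 8 down-sampling above matches the 64-pixel average-hash description given later in this text; the exact sampling size is an implementation choice.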
Step 104, calculating the output confidence of the video frame after it is input into the neural network, according to the obtained bias term of the video frame.

As a preferred embodiment, for the deep network the computation starts from the second layer, i.e. runs over layers 2 to 8: the output of each layer of the neural network is calculated layer by layer using formula (1) and formula (2), and the computation ends when the last layer of the neural network has been calculated.
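A minimal forward-pass sketch of this layer-by-layer computation follows (illustrative only; it assumes per-layer weights such as those produced by the pretraining sketch above, a single-unit output layer, and that the per-slice bias term of formula (9) enters at the first hidden layer, which is an assumption of this sketch rather than a statement of the claimed method):

```python
import numpy as np

def sigm(y):
    return 1.0 / (1.0 + np.exp(-y))

def confidence(layers, sample_features, sample_bias):
    """Propagate one sampling slice through the network, layer by layer.

    `layers` is a list of (W, b) pairs for layers 2..8; `sample_bias` is the
    slice-specific bias term b_i of formula (9), applied here at the first
    hidden layer so that different slices carry different scale information
    (an illustrative choice)."""
    a = np.asarray(sample_features, dtype=float)
    for depth, (W, b) in enumerate(layers):
        bias = b + sample_bias if depth == 0 else b   # corrected bias at the first hidden layer
        a = sigm(W @ a + bias)                        # formulas (1)-(2) applied layer by layer
    return a.item()                                   # scalar confidence of the output layer
```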
Step 105, judging whether the confidence output by the neural network for the video frame is smaller than a preset threshold value; if it is smaller, updating the weights of the neural network and then executing step 106; if it is not smaller, proceeding directly to step 106.

Preferably, the preset threshold may be 0.8. In an embodiment, the confidence is calculated by the neural network, that is, the sampling slice is input into the neural network and the corresponding confidence is output. When the confidence is smaller than the preset threshold (for example, smaller than 0.8), this indicates that the current neural network is no longer suited to the dynamically changing moving target, and the weights of the neural network need to be updated using the positive and negative samples of the last several frames. If the confidence is larger than the preset threshold, the current neural network can still distinguish the moving target from the background well, so the weights of the sampled particles can continue to be calculated with the current neural network without any other processing.
Further, the weights of the neural network may be updated using formula (7) described above. For example, for each sample, the input x_{ji} of the output layer is obtained by forward propagation through the last hidden layer, and the estimated value a is then obtained through the output layer; with the label y (the label of a positive sample being 1 and that of a negative sample being 0), the error term used in formula (7) is δ_j = (a - y) × sigm'(x_{ji}). The weights of each layer are then updated by back-propagation from the output layer to the input layer; for the l-th layer, the error is obtained from that of the (l+1)-th layer in the same way and substituted into formula (7).
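As an illustrative, non-limiting sketch of this on-line update (assuming Python with NumPy, sigmoid layers and a single-unit output layer; the sign convention follows ordinary gradient descent on the loss, which matches formulas (6) and (7) up to the sign of δ_j):

```python
import numpy as np

def sigm(y):
    return 1.0 / (1.0 + np.exp(-y))

def online_update(layers, samples, labels, eta=0.01, gamma=1e-4):
    """One pass of the update of formula (7) over recent positive (label 1)
    and negative (label 0) samples, back-propagating from the output layer
    to the input layer. `layers` is a list of (W, b) pairs."""
    for x, y in zip(samples, labels):
        # Forward pass, keeping every layer's activation
        activations = [np.asarray(x, dtype=float)]
        for W, b in layers:
            activations.append(sigm(W @ activations[-1] + b))

        # Output-layer error term (delta_j of formula (5), up to sign convention)
        a = activations[-1]
        delta = (y - a) * a * (1.0 - a)

        # Back-propagate and apply formula (7): w_new = (1 - 2*eta*gamma)*w + eta*delta*x
        for idx in range(len(layers) - 1, -1, -1):
            W, b = layers[idx]
            h_prev = activations[idx]
            W_new = (1.0 - 2.0 * eta * gamma) * W + eta * np.outer(delta, h_prev)
            b_new = b + eta * delta
            delta = (W.T @ delta) * h_prev * (1.0 - h_prev)   # error for the layer below
            layers[idx] = (W_new, b_new)
    return layers
```

With such a helper, step 105 amounts to: if the largest confidence among the sampling slices falls below 0.8, run the update with the positive and negative samples of the last few frames and recompute the confidences before estimating the target.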
step 106, estimating the moving object of the video frame.
In an embodiment, the updated confidence of each particle may be calculated according to the neural network with the updated weights, and, among the calculated updated confidences, the particle with the maximum updated confidence is selected as the moving object of the video frame. Of course, if the confidence in step 105 is greater than the preset threshold, the particle with the highest confidence may be selected directly as the moving object of the video frame. If the position of the target changes little between adjacent frames, it can be considered that the target in the current frame is more likely to lie around the particles that had high confidence in the previous frame.
In a preferred embodiment, before each particle is substituted into the neural network with the updated weights, the particles need to be filtered. The particle filtering process is an iterative process. Let X_t denote the state variable of the moving object at time t. In the initial step, N particles are first collected from the preceding particle set X_{t-1} in proportion to their distribution, so as to obtain a new state. However, the newly generated particles are usually affected by the probabilities, which leads to particle degeneracy, so that most of the particles concentrate around the particles with larger weights; the posterior probability p(x_t | z_{1:t-1}) is then approximated by a finite set of N particles with importance weights, as shown in formula (10):

p(x_t | z_{1:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | z_{1:t-1}) dx_{t-1}    (10)

If a particle x_t^i is the predicted tracking result, the background information contained in its corresponding rectangular box (the tracking result is represented by the position of the particle and the size of the target, which together form a rectangular box; conversely, the rectangular box visually represents the position and size of the target on the image) is less than the background information in the rectangular boxes corresponding to the other particles, and the weight of that particle is larger. The relationship between the weights and the posterior probability is then given by formula (11):

w_{t,j}^i = w_{t-1}^i · p_j(z_t | x_t^i) p_j(x_t^i | x_{t-1}^i) / q_j(x_t | x_{1:t-1}, z_{1:t}),  i = 1, 2, …, N, j = 1, 2    (11)

where x_{1:t-1} denotes the random samples forming the posterior probability distribution, N denotes the number of samples, z_{1:t} denotes the observations from the start time to time t, x_t^i denotes the i-th sample at time t, w_{t-1}^i denotes the i-th weight at time t-1 (that is, in the previous frame), and q(x_t | x_{1:t-1}, z_{1:t}) = p(x_t | x_{t-1}) is the importance distribution.
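A compact particle-filter sketch of the prediction and weight update of formulas (10) and (11) follows (illustrative only; a Gaussian random-walk motion model over the rectangle (x, y, w, h) is assumed here, which the text does not prescribe, and the observation likelihood is taken as the network confidence):

```python
import numpy as np

def particle_filter_step(particles, weights, confidence_fn, motion_std=4.0, rng=None):
    """One iteration: resample in proportion to the previous weights, propagate
    each particle with a random-walk motion model (approximating the integral
    of formula (10)), and re-weight by the observation likelihood as in
    formula (11). `particles` is an (N, 4) array of (x, y, w, h) rectangles."""
    rng = rng or np.random.default_rng()
    n = len(particles)

    # Resample N particles in proportion to their previous weights
    idx = rng.choice(n, size=n, p=weights / weights.sum())
    particles = particles[idx]

    # Predict: p(x_t | x_{t-1}) modelled as a Gaussian random walk
    particles = particles + rng.normal(0.0, motion_std, particles.shape)

    # Update: with q = p(x_t | x_{t-1}), the weight of formula (11) reduces to
    # the observation likelihood, taken here as the confidence of the network
    weights = np.array([confidence_fn(p) for p in particles])
    weights = weights / weights.sum()

    # The tracking result of step 106 is the particle with the largest weight
    return particles, weights, particles[int(np.argmax(weights))]
```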
In another aspect of the present invention, a moving object cross-scale tracking apparatus is further provided. As shown in fig. 2, the moving object cross-scale tracking apparatus includes a neural network constructing unit 201, a weight updating unit 202 and a moving object estimation unit 203, which are connected in sequence. The neural network constructing unit 201 constructs a neural network and performs weight training layer by layer to obtain initialized weights. Preferably, the neural network constructing unit 201 performs the layer-by-layer weight training with a stacked denoising autoencoder to obtain the initialized weights. The weight updating unit 202 inputs the true value of the first video frame into the established neural network and updates the initial weights; calculates a bias term for the input video frame, starting from the second video frame; calculates the output value of the video frame after it is input into the neural network, according to the obtained bias term of the video frame; and judges whether the confidence output by the neural network for the video frame is smaller than a preset threshold. If it is smaller, the weight updating unit 202 updates the weights of the neural network, and the moving object estimation unit 203 estimates the moving object of the video frame according to the neural network with the updated weights; if the confidence is larger than the threshold, the moving object estimation unit 203 directly estimates the moving object of the video frame.
As an embodiment, in the process of initializing the weights, the neural network constructing unit 201 may let k denote the number of training samples, i = 1, 2, …, k, and let the training sample set be {x_1, …, x_i, …, x_k}; W' and W respectively represent the weights of the hidden layer and the weights of the output layer, and b' and b represent the bias terms of the different hidden layers. From an input sample x_i, a hidden-layer representation h_i and a reconstruction x̂_i of the input can be derived, as shown in formula (1) and formula (2):

h_i = f(W'x_i + b')    (1)

x̂_i = sigm(Wh_i + b)    (2)

where f(·) denotes a nonlinear excitation function and sigm(·) denotes the excitation function of the neural network, as shown in formula (3):

sigm(y) = 1 / (1 + exp(-y))    (3)

The denoising autoencoder is obtained through learning, as shown in formula (4):

min_{W, W', b, b'} Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 + γ(||W||_F^2 + ||W'||_F^2)    (4)

where Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 is the reconstruction loss of the neural network and the second term is a weight penalty term; the parameter γ balances the reconstruction error against the weight penalty term, so that the relationship between the two is fully considered. Denoting the objective of formula (4) by E(w), where w collects the weights, and taking the partial derivative of E with respect to w_{ji} gives formula (5):

∂E(w)/∂w_{ji} = ∂E_0(w)/∂w_{ji} + 2γw_{ji} = -δ_j x_{ji} + 2γw_{ji}    (5)

Letting w̃_{ji} = w_{ji} - η·∂E(w)/∂w_{ji}, formula (6) and formula (7) are obtained:

w̃_{ji} = w_{ji} + η(δ_j x_{ji} - 2γw_{ji})    (6)

w̃_{ji} = (1 - 2ηγ)w_{ji} + ηδ_j x_{ji}    (7)

where η is the learning rate, δ_j is the error, and x_{ji} and w_{ji} denote the data and the weights from the i-th layer to the j-th layer, respectively.
Further, when the weight updating unit 202 calculates the bias term for the input video frame, it can be assumed that the current video frame is I_t and that the previous video frame I_{t-1} is taken as the comparison object: the average hash value of the tracking result of I_{t-1} is calculated, and the Hamming distance between it and the average hash values of all current samples is then calculated. The greater the Hamming distance between two hash values, the smaller the similarity between them, and vice versa. Specifically, the video frame is sampled to obtain an initialization sample set S = {s_1, s_2, …, s_N}, i = 1, 2, …, N. Then the hash value l of the tracking result of the previous frame I_{t-1} is calculated, and for each sample s_i the average hash value l_{s_i} of s_i is calculated. The Hamming distance between l and each sample s_i is calculated using formula (8) below:

dis(l, s_i) = Σ_{i=1}^{N} l ⊗ l_{s_i}    (8)

Finally, the bias term corresponding to sample s_i is obtained from the obtained Hamming distance according to formula (9):

b_i = 1 - dis(l, s_i) / Σ_{t=1}^{N} dis(l, s_t)    (9)

where b_i denotes the bias term of the i-th sample.
In another embodiment, the moving object estimation unit 203 may calculate the updated confidence of each particle according to the neural network with the updated weights, and select, among the calculated updated confidences, the particle with the maximum updated confidence as the moving object of the video frame. Of course, if the confidence is greater than the preset threshold, the particle with the highest confidence may be selected directly as the moving object of the video frame. In a preferred embodiment, each particle is filtered before it is substituted into the neural network with the updated weights.
It should be noted that, in the implementation of the moving object cross-scale tracking apparatus according to the present invention, the details of the moving object cross-scale tracking method described above have been described in detail, and therefore, the repeated contents are not described again.
In summary, before tracking, the moving target cross-scale tracking method and device provided by the invention creatively train and learn the parameters of the network with the stacked denoising autoencoder; during tracking, the hash value of each sampling slice is calculated with the average hash method, the Hamming distance between each sampling slice and the tracking result of the previous frame is obtained through similarity calculation, and this distance is used to correct the bias terms of the network; through the feature-extraction process of the network, the scale information and detail information of the moving target can be obtained, and particle-filter motion estimation is used to predict the position and size of the moving target, so that the tracking result of the moving target is finally obtained.
Moreover, the neural network is constructed from the denoising autoencoder obtained through learning, smaller weights are sought by gradient descent, and the possibility of overfitting is reduced. Meanwhile, the invention captures the structural features of the target by using the inherent low-frequency information of the image and corrects the bias term so that different samples correspond to different bias values: each sample is down-sampled into a new sample of a different size, the high-frequency part of the image is removed, an image containing 64 pixels is obtained, and the gray-level average of this new image is calculated; each pixel value of the new image is then compared with the average, being set to 1 if it is greater than or equal to the average and to 0 otherwise, which finally yields the hash value of the sample. The Hamming distance between each sample of the current frame and the hash value of the template is calculated, and the smaller the distance, the higher the similarity to the template. Bias terms of the neural network are established for the different sampling slices through the average hash value, so that different sampling slices correspond to different scales. The invention can therefore have wide and important popularization significance. Finally, the whole moving target cross-scale tracking method and device are compact and easy to control.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, also features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variances that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A cross-scale tracking method for a moving target is characterized by comprising the following steps:
building a neural network, and carrying out weight training layer by layer to obtain initialized weights;
inputting the true value of the first video frame into the established neural network, and updating the initial weight;
calculating a bias term for the input video frame starting from the second video frame;
calculating an output value of the video frame after the video frame is input into the neural network according to the obtained bias term of the video frame;
judging whether the confidence output by the neural network for the video frame is smaller than a preset threshold value, if so, updating the weights of the neural network, and estimating the moving target of the video frame according to the neural network with the updated weights; and if it is larger than the preset threshold, directly estimating the moving target of the video frame.
2. The method of claim 1, wherein the weights are trained layer by layer using a stacked denoising autoencoder to obtain the initialized weights.
3. The method of claim 2, wherein the constructing the neural network and performing weight training layer by layer to obtain initialized weights comprises:
let k denote the number of training samples, i = 1, 2, …, k, and let the training sample set be {x_1, …, x_i, …, x_k}; W' and W respectively denote the weights of the hidden layer and the weights of the output layer, and b' and b denote the bias terms of the different hidden layers; from an input sample x_i, a hidden-layer representation h_i and a reconstruction x̂_i of the input can be derived, as shown in formula (1) and formula (2):

h_i = f(W'x_i + b')    (1)

x̂_i = sigm(Wh_i + b)    (2)

where f(·) denotes a nonlinear excitation function and sigm(·) denotes the excitation function of the neural network, as shown in formula (3):

sigm(y) = 1 / (1 + exp(-y))    (3)

the denoising autoencoder is obtained through learning, as shown in formula (4):

min_{W, W', b, b'} Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 + γ(||W||_F^2 + ||W'||_F^2)    (4)

where Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 is the reconstruction loss of the neural network and the second term is a weight penalty term; the parameter γ balances the reconstruction error against the weight penalty term, so that the relationship between the two is fully considered; denoting the objective of formula (4) by E(w), where w collects the weights, and taking the partial derivative of E with respect to w_{ji} gives formula (5):

∂E(w)/∂w_{ji} = ∂E_0(w)/∂w_{ji} + 2γw_{ji} = -δ_j x_{ji} + 2γw_{ji}    (5)

letting w̃_{ji} = w_{ji} - η·∂E(w)/∂w_{ji}, formula (6) and formula (7) are obtained:

w̃_{ji} = w_{ji} + η(δ_j x_{ji} - 2γw_{ji})    (6)

w̃_{ji} = (1 - 2ηγ)w_{ji} + ηδ_j x_{ji}    (7)

where η is the learning rate, δ_j is the error, and x_{ji} and w_{ji} denote the data and the weights from the i-th layer to the j-th layer, respectively.
4. The method of claim 3, wherein calculating the bias term for the input video frame comprises:
sampling the video frame to obtain an initialization sample set S = {s_1, s_2, …, s_N}, i = 1, 2, …, N;

calculating the hash value l of the tracking result of the previous frame I_{t-1};

for each sample s_i, calculating the average hash value l_{s_i} of s_i;

calculating the Hamming distance between l and each sample s_i using formula (8), where N denotes the number of samples, S = {s_1, s_2, …, s_N}, i = 1, 2, …, N, and dis(l, s_i) denotes the Hamming distance between the hash value l of the tracking result of video frame I_{t-1} and the hash value l_{s_i} of the current i-th sample s_i, as shown in formula (8):

dis(l, s_i) = Σ_{i=1}^{N} l ⊗ l_{s_i}    (8)

obtaining the bias term corresponding to sample s_i from the obtained Hamming distance according to formula (9):

b_i = 1 - dis(l, s_i) / Σ_{t=1}^{N} dis(l, s_t)    (9)

where b_i denotes the bias term of the i-th sample.
5. The method according to any one of claims 1 to 4, wherein the estimating the moving object of the video frame according to the neural network with the updated weight value comprises:
calculating the updated confidence of each particle according to the neural network with the updated weight;
and selecting, among the calculated updated confidences, the particle with the maximum updated confidence as the moving object of the video frame.
6. A moving object cross-scale tracking device is characterized by comprising:
the neural network construction unit is used for constructing a neural network and carrying out weight training layer by layer to obtain initialized weights;
the weight updating unit is used for inputting the true value of the first video frame into the established neural network and updating the initial weights; calculating a bias term for the input video frame starting from the second video frame; calculating an output value of the video frame after the video frame is input into the neural network according to the obtained bias term of the video frame; and judging whether the confidence output by the neural network for the video frame is smaller than a preset threshold value, updating the weights of the neural network if so, and performing no processing if not;
and the moving object estimation unit is used for estimating the moving object of the video frame.
7. The apparatus of claim 6, wherein the neural network constructing unit performs the layer-by-layer weight training by using a stacked denoising autoencoder to obtain the initialized weights.
8. The apparatus of claim 7, wherein the neural network building unit obtaining initialized weights comprises:
let k denote the number of training samples, i = 1, 2, …, k, and let the training sample set be {x_1, …, x_i, …, x_k}; W' and W respectively denote the weights of the hidden layer and the weights of the output layer, and b' and b denote the bias terms of the different hidden layers; from an input sample x_i, a hidden-layer representation h_i and a reconstruction x̂_i of the input can be derived, as shown in formula (1) and formula (2):

h_i = f(W'x_i + b')    (1)

x̂_i = sigm(Wh_i + b)    (2)

where f(·) denotes a nonlinear excitation function and sigm(·) denotes the excitation function of the neural network, as shown in formula (3):

sigm(y) = 1 / (1 + exp(-y))    (3)

the denoising autoencoder is obtained through learning, as shown in formula (4):

min_{W, W', b, b'} Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 + γ(||W||_F^2 + ||W'||_F^2)    (4)

where Σ_{i=1}^{k} ||x_i - x̂_i||_2^2 is the reconstruction loss of the neural network and the second term is a weight penalty term; the parameter γ balances the reconstruction error against the weight penalty term, so that the relationship between the two is fully considered; denoting the objective of formula (4) by E(w), where w collects the weights, and taking the partial derivative of E with respect to w_{ji} gives formula (5):

∂E(w)/∂w_{ji} = ∂E_0(w)/∂w_{ji} + 2γw_{ji} = -δ_j x_{ji} + 2γw_{ji}    (5)

letting w̃_{ji} = w_{ji} - η·∂E(w)/∂w_{ji}, formula (6) and formula (7) are obtained:

w̃_{ji} = w_{ji} + η(δ_j x_{ji} - 2γw_{ji})    (6)

w̃_{ji} = (1 - 2ηγ)w_{ji} + ηδ_j x_{ji}    (7)

where η is the learning rate, δ_j is the error, and x_{ji} and w_{ji} denote the data and the weights from the i-th layer to the j-th layer, respectively.
9. The apparatus of claim 8, wherein the weight update unit calculates the bias term for the input video frame comprises:
sampling the video frame to obtain an initialization sample set S = {s_1, s_2, …, s_N}, i = 1, 2, …, N;

calculating the hash value l of the tracking result of the previous frame I_{t-1};

for each sample s_i, calculating the average hash value l_{s_i} of s_i;

calculating the Hamming distance between l and each sample s_i using formula (8), where N denotes the number of samples, S = {s_1, s_2, …, s_N}, i = 1, 2, …, N, and dis(l, s_i) denotes the Hamming distance between the hash value l of the tracking result of video frame I_{t-1} and the hash value l_{s_i} of the current i-th sample s_i, as shown in formula (8):

dis(l, s_i) = Σ_{i=1}^{N} l ⊗ l_{s_i}    (8)

obtaining the bias term corresponding to sample s_i from the obtained Hamming distance according to formula (9):

b_i = 1 - dis(l, s_i) / Σ_{t=1}^{N} dis(l, s_t)    (9)

where b_i denotes the bias term of the i-th sample.
10. The apparatus according to any one of claims 6 to 9, wherein the estimating the moving object of the video frame according to the neural network with updated weights comprises:
calculating the updated confidence of each particle according to the neural network with the updated weight;
and selecting, among the calculated updated confidences, the particle with the maximum updated confidence as the moving object of the video frame.
CN201610548086.0A 2016-07-12 2016-07-12 Moving target cross-scale tracking method and device Active CN106203350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610548086.0A CN106203350B (en) 2016-07-12 2016-07-12 Moving target cross-scale tracking method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610548086.0A CN106203350B (en) 2016-07-12 2016-07-12 Moving target cross-scale tracking method and device

Publications (2)

Publication Number Publication Date
CN106203350A true CN106203350A (en) 2016-12-07
CN106203350B CN106203350B (en) 2019-10-11

Family

ID=57477769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610548086.0A Active CN106203350B (en) Moving target cross-scale tracking method and device

Country Status (1)

Country Link
CN (1) CN106203350B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169117A (en) * 2017-05-25 2017-09-15 西安工业大学 A kind of manual draw human motion search method based on autocoder and DTW
CN107292914A (en) * 2017-06-15 2017-10-24 国家新闻出版广电总局广播科学研究院 Visual target tracking method based on small-sized single branch convolutional neural networks
CN109559329A (en) * 2018-11-28 2019-04-02 陕西师范大学 A kind of particle filter tracking method based on depth denoising autocoder
CN110136173A (en) * 2019-05-21 2019-08-16 浙江大华技术股份有限公司 A kind of target location processing method and device
CN110298404A (en) * 2019-07-02 2019-10-01 西南交通大学 A kind of method for tracking target based on triple twin Hash e-learnings
CN112634188A (en) * 2021-02-02 2021-04-09 深圳市爱培科技术股份有限公司 Vehicle far and near scene combined imaging method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5548662A (en) * 1993-02-08 1996-08-20 Lg Electronics Inc. Edge extracting method and apparatus using diffusion neural network
CN101089917A (en) * 2007-06-01 2007-12-19 清华大学 Quick identification method for object vehicle lane changing
CN104036323A (en) * 2014-06-26 2014-09-10 叶茂 Vehicle detection method based on convolutional neural network
CN105654509A (en) * 2015-12-25 2016-06-08 燕山大学 Motion tracking method based on composite deep neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5548662A (en) * 1993-02-08 1996-08-20 Lg Electronics Inc. Edge extracting method and apparatus using diffusion neural network
CN101089917A (en) * 2007-06-01 2007-12-19 清华大学 Quick identification method for object vehicle lane changing
CN104036323A (en) * 2014-06-26 2014-09-10 叶茂 Vehicle detection method based on convolutional neural network
CN105654509A (en) * 2015-12-25 2016-06-08 燕山大学 Motion tracking method based on composite deep neural network

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169117A (en) * 2017-05-25 2017-09-15 西安工业大学 A kind of manual draw human motion search method based on autocoder and DTW
CN107169117B (en) * 2017-05-25 2020-11-10 西安工业大学 Hand-drawn human motion retrieval method based on automatic encoder and DTW
CN107292914A (en) * 2017-06-15 2017-10-24 国家新闻出版广电总局广播科学研究院 Visual target tracking method based on small-sized single branch convolutional neural networks
CN109559329A (en) * 2018-11-28 2019-04-02 陕西师范大学 A kind of particle filter tracking method based on depth denoising autocoder
CN110136173A (en) * 2019-05-21 2019-08-16 浙江大华技术股份有限公司 A kind of target location processing method and device
CN110298404A (en) * 2019-07-02 2019-10-01 西南交通大学 A kind of method for tracking target based on triple twin Hash e-learnings
CN112634188A (en) * 2021-02-02 2021-04-09 深圳市爱培科技术股份有限公司 Vehicle far and near scene combined imaging method and device

Also Published As

Publication number Publication date
CN106203350B (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN106203350B (en) Moving target cross-scale tracking method and device
CN110232394B (en) Multi-scale image semantic segmentation method
CN107358293B (en) Neural network training method and device
CN107369166B (en) Target tracking method and system based on multi-resolution neural network
CN105938559B (en) Use the Digital Image Processing of convolutional neural networks
CN108062562B (en) Object re-recognition method and device
Saputra et al. Learning monocular visual odometry through geometry-aware curriculum learning
CN110120020A (en) A kind of SAR image denoising method based on multiple dimensioned empty residual error attention network
CN107330357A (en) Vision SLAM closed loop detection methods based on deep neural network
CN109754078A (en) Method for optimization neural network
CN113657560B (en) Weak supervision image semantic segmentation method and system based on node classification
CN107154024A (en) Dimension self-adaption method for tracking target based on depth characteristic core correlation filter
CN110349185B (en) RGBT target tracking model training method and device
CN107885322A (en) Artificial neural network for mankind's activity identification
CN104881029B (en) Mobile Robotics Navigation method based on a point RANSAC and FAST algorithms
CN109948741A (en) A kind of transfer learning method and device
CN105678284A (en) Fixed-position human behavior analysis method
CN111126278B (en) Method for optimizing and accelerating target detection model for few-class scene
CN105981050A (en) Method and system for exacting face features from data of face images
CN104156919B (en) A kind of based on wavelet transformation with the method for restoring motion blurred image of Hopfield neutral net
CN114186672A (en) Efficient high-precision training algorithm for impulse neural network
CN112446888A (en) Processing method and processing device for image segmentation model
CN114897728A (en) Image enhancement method and device, terminal equipment and storage medium
CN114283320A (en) Target detection method based on full convolution and without branch structure
CN110827319B (en) Improved Staple target tracking method based on local sensitive histogram

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant