Disclosure of Invention
The invention provides a novel space-time Harris corner detection method and a novel space-time Harris corner detection device, and aims to provide a detection algorithm that fully reflects the space-time domain correlation of a video and can therefore extract space-time Harris corners that contain apparent information in the space domain and reflect clear motion change in the time domain.
The invention provides a novel space-time Harris corner detection method, which comprises the following steps:
obtaining a three-dimensional geometric algebraic space of a video based on spatial information and time domain information of a video image contained in the video, and constructing a motion vector of a pixel point of the three-dimensional geometric algebraic space;
obtaining an apparent-motion vector of each pixel point by using the motion vector of the pixel point and a preset apparent-motion vector algorithm;
combining the appearance-motion vectors of the pixel points to construct a space-time second-order matrix, and constructing a space-time Harris corner response function according to the space-time second-order matrix;
and calculating, according to the space-time Harris corner response function, the space-time Harris corner response function values of a pixel point p and of all pixel points in the neighborhood of p, wherein if the space-time Harris corner response function value of the pixel point p is larger than those of all pixel points in its neighborhood, the point p is a space-time Harris corner of the video.
Further, the constructing the motion vector of the pixel point in the three-dimensional geometric algebraic space includes:
constructing, in the three-dimensional geometric algebraic space, the motion vector v_p that points from the pixel point at p in the current frame video image to the pixel point at p_r in the next frame video image, the motion vector v_p being:
v_p = p_r - p;
wherein p is the pixel point at p in the current frame video image of the three-dimensional geometric algebraic space, and p_r is the pixel point, within the neighborhood centered at p' in the next frame video image, whose pixel value differs least from that of the pixel point at p in the current frame video image; p' is the pixel point in the next frame video image at the same position as the pixel point at p in the current frame video image.
Further, the preset apparent-motion vector algorithm is:
f'(p) = f(p) + v_p;
wherein v_p represents the motion vector of the pixel point at p in the video image, f(p) represents the pixel value of the pixel point at p in the video image, and f'(p) represents the apparent-motion vector of the pixel point at p in the video image.
Further, the constructing a spatio-temporal second-order matrix by combining the apparent-motion vectors of the pixel points includes:
calculating the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of each pixel point in the x, y and t directions;
constructing, from the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of the pixel point in the x, y and t directions, a second-order gradient matrix N, wherein the second-order gradient matrix N is:

N = [ f'_x²      f'_x f'_y  f'_x f'_t ]
    [ f'_x f'_y  f'_y²      f'_y f'_t ]
    [ f'_x f'_t  f'_y f'_t  f'_t²     ]

performing a convolution of the matrix N with a Gaussian weighting function ω(p) to obtain a space-time second-order matrix M(p), wherein the space-time second-order matrix M(p) is:

M(p) = ω(p) ⊗ N = [ A  D  E ]
                  [ D  B  F ]
                  [ E  F  C ]

where ω denotes the Gaussian weighting function ω(p), ⊗ denotes the convolution operator, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
Further, the spatio-temporal Harris corner response function is:
R = det(M) - k(trace(M))³ = (ABC + 2DEF - BE² - AF² - CD²) - k(A + B + C)³
wherein R is the space-time Harris corner response function value and k is an empirical constant; det(M) denotes the determinant of the matrix M(p) and trace(M) denotes the trace of the matrix M(p), expressed as follows:
det(M) = λ_1 λ_2 λ_3 = ABC + 2DEF - BE² - AF² - CD²
trace(M) = λ_1 + λ_2 + λ_3 = A + B + C
wherein λ_1, λ_2 and λ_3 are the eigenvalues of the matrix M, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
The invention also provides a novel space-time Harris corner detection device, which comprises:
the preprocessing module is used for obtaining a three-dimensional geometric algebraic space of the video based on spatial information and time domain information of a video image contained in the video;
the motion vector construction module is used for constructing motion vectors of pixel points of the three-dimensional geometric algebraic space;
the appearance-motion vector construction module is used for obtaining an appearance-motion vector of each pixel point by utilizing the motion vector of the pixel point and a preset appearance-motion vector algorithm;
the space-time second-order matrix construction module is used for constructing a space-time second-order matrix by combining the apparent-motion vectors of the pixel points;
the response function constructing module is used for constructing a space-time Harris corner response function according to the space-time second-order matrix;
and the space-time Harris corner acquisition module is used for calculating, according to the space-time Harris corner response function, the space-time Harris corner response function values of a pixel point p and of all pixel points in the neighborhood of p, wherein if the space-time Harris corner response function value of the pixel point p is larger than those of all pixel points in its neighborhood, the point p is a space-time Harris corner of the video.
Further, the motion vector construction module is specifically configured to construct, according to the following formula, the motion vector v_p of the three-dimensional geometric algebraic space that points from the pixel point at p in the current frame video image to the pixel point at p_r in the next frame video image, the motion vector v_p being:
v_p = p_r - p;
wherein p is the pixel point at p in the current frame video image of the three-dimensional geometric algebraic space, and p_r is the pixel point, within the neighborhood centered at p' in the next frame video image, whose pixel value differs least from that of the pixel point at p in the current frame video image; p' is the pixel point in the next frame video image at the same position as the pixel point at p in the current frame video image.
Further, the preset apparent-motion vector algorithm is:
f'(p) = f(p) + v_p;
wherein v_p represents the motion vector of the pixel point at p in the video image, f(p) represents the pixel value of the pixel point at p in the video image, and f'(p) represents the apparent-motion vector of the pixel point at p in the video image.
Further, the spatio-temporal second-order matrix constructing module includes:
a gradient calculation module, used for calculating the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of each pixel point in the x, y and t directions;
a second-order gradient matrix construction module, used for constructing, from the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of the pixel point in the x, y and t directions, a second-order gradient matrix N, wherein the second-order gradient matrix N is:

N = [ f'_x²      f'_x f'_y  f'_x f'_t ]
    [ f'_x f'_y  f'_y²      f'_y f'_t ]
    [ f'_x f'_t  f'_y f'_t  f'_t²     ]

a space-time second-order matrix obtaining module, used for performing a convolution of the matrix N with a Gaussian weighting function ω(p) to obtain a space-time second-order matrix M(p), wherein the space-time second-order matrix M(p) is:

M(p) = ω(p) ⊗ N = [ A  D  E ]
                  [ D  B  F ]
                  [ E  F  C ]

where ω denotes the Gaussian weighting function ω(p), ⊗ denotes the convolution operator, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
Further, the spatio-temporal Harris corner response function is:
R = det(M) - k(trace(M))³ = (ABC + 2DEF - BE² - AF² - CD²) - k(A + B + C)³
wherein R is the space-time Harris corner response function value and k is an empirical constant; det(M) denotes the determinant of the matrix M(p) and trace(M) denotes the trace of the matrix M(p), expressed as follows:
det(M) = λ_1 λ_2 λ_3 = ABC + 2DEF - BE² - AF² - CD²
trace(M) = λ_1 + λ_2 + λ_3 = A + B + C
wherein λ_1, λ_2 and λ_3 are the eigenvalues of the matrix M, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a novel space-time Harris corner detection method and device: a three-dimensional geometric algebraic space of the video is obtained from the spatial information and time domain information of the video images contained in the video; motion vectors of the pixel points of this space are constructed and combined into apparent-motion vectors, yielding a unified model of video appearance and motion information; a space-time second-order matrix is constructed on the basis of this unified model, and a space-time Harris corner response function is constructed from the space-time second-order matrix; the response function values of a pixel point p and of all pixel points in its neighborhood are then calculated and compared to judge whether the point p is a space-time Harris corner of the video. Compared with the prior art, the invention treats the video as a three-dimensional structure, uses geometric algebra to establish a unified model of video appearance and motion information, modeling the geometric bodies of different dimensions of the appearance and motion information and the relations between them, and proposes a space-time Harris corner detection algorithm based on this model, so that the detection algorithm fully reflects the time-space domain correlation of the video and the extracted space-time interest points contain unique appearance information in the space domain while representing clear motion change in the time domain.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the prior art, the space-time domain correlation of a video cannot be fully reflected, so that space-time interest points (Harris corner points) that contain unique apparent information in the space domain and represent clear motion change in the time domain cannot truly be extracted.
In order to solve the above technical problem, the invention provides a novel space-time Harris corner detection method, which treats the video as a three-dimensional structure, uses geometric algebra to establish a unified model of video appearance and motion information (modeling the geometric bodies of different dimensions of the appearance and motion information and the relations between them), and proposes a novel space-time Harris corner detection algorithm based on this model, so that the detection algorithm fully reflects the time-space domain correlation of the video and the extracted space-time interest points contain unique appearance information in the space domain while representing clear motion change in the time domain.
Referring to fig. 1, a new spatio-temporal Harris corner detection method according to an embodiment of the present invention includes:
step S1, obtaining a three-dimensional geometric algebraic space of a video based on spatial domain information and time domain information of a video image contained in the video, and constructing motion vectors of pixel points of the three-dimensional geometric algebraic space;
In the embodiment of the present invention, the video is modeled as follows. The sequence of video images (video frames) contained in the video can be represented as a three-dimensional stereo structure comprising spatial information and temporal information; it should be noted that both the appearance information and the motion information of the video can be represented by geometric algebra within this three-dimensional structure. In view of the simplicity with which geometric (Clifford) algebra operates on vector data and geometric information, the video is modeled herein using three-dimensional geometric algebra as the mathematical framework; the representation model of a video image sequence under the geometric algebra framework is explained below.
Let R³ be the three-dimensional Euclidean space formed by the spatial information and time domain information of the video image sequence contained in a video, with orthonormal basis {e_1, e_2, e_3}. The geometric algebra spanned over R³ by this orthonormal basis is the three-dimensional geometric algebraic space of the video; in the embodiment of the present invention, this geometric algebraic space is denoted G_3 for short. One set of canonical bases of G_3 is:
{1, e_1, e_2, e_3, e_1∧e_2, e_2∧e_3, e_1∧e_3, e_1∧e_2∧e_3} (1)
wherein ∧ denotes the outer product of geometric algebra; e_1∧e_2, e_2∧e_3 and e_1∧e_3 are the three independent double outer products (bivectors) formed from the three orthonormal basis vectors e_1, e_2 and e_3, each geometrically representing the plane spanned by its two vectors in G_3; e_1∧e_2∧e_3 is the triple outer product, e_1∧e_2∧e_3 = (e_1∧e_2)e_3, whose geometric interpretation is the directed geometric body obtained by moving the double outer product e_1∧e_2 along the vector e_3. {e_1, e_2, e_3} can be regarded as the basis vectors {x, y, t} of the three-dimensional vector subspace of G_3.
The video sequence F can be represented as:
F = f(p) (2)
wherein p ∈ G_3 and p = x e_1 + y e_2 + t e_3; x and y represent the spatial coordinates, with 0 < x < S_x and 0 < y < S_y; t represents the time domain coordinate, with 0 < t < S_t. f(p) represents the apparent value of the pixel at p in the video F.
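For illustration only, a minimal Python sketch (the array name, sizes and the random stand-in frames are hypothetical, not part of the invention) of holding the video sequence F = f(p) as a three-dimensional array indexed by the coordinates (x, y, t):

```python
import numpy as np

# A grayscale video as a 3-D array: axis 0 = x, axis 1 = y, axis 2 = t,
# so video[x, y, t] plays the role of f(p) with p = x e_1 + y e_2 + t e_3.
Sx, Sy, St = 64, 48, 30                      # illustrative sizes
rng = np.random.default_rng(0)
video = rng.random((Sx, Sy, St))             # stand-in for real frames

def f(video, x, y, t):
    """Apparent (pixel) value of the point p = x e_1 + y e_2 + t e_3."""
    return video[x, y, t]

print(f(video, 10, 20, 5))
```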
Let p_1, p_2 ∈ G_3, with p_1 = x_1 e_1 + y_1 e_2 + t_1 e_3 and p_2 = x_2 e_1 + y_2 e_2 + t_2 e_3. Then their geometric product can be expressed as:
p_1 p_2 = p_1·p_2 + p_1∧p_2 (3)
wherein · denotes the inner product and ∧ denotes the outer product; that is, the geometric product of two vectors is the sum of their inner product (p_1·p_2) and their outer product (p_1∧p_2).
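As a worked example of formula (3), a small sketch (the function name is hypothetical) that evaluates the inner product and the three bivector coefficients of the outer product for two vectors of G_3:

```python
def geometric_product(p1, p2):
    """Geometric product of two G_3 vectors p = (x, y, t): returns the
    scalar inner product and the outer-product coefficients on the
    bivector basis (e_1^e_2, e_2^e_3, e_1^e_3)."""
    x1, y1, t1 = p1
    x2, y2, t2 = p2
    inner = x1 * x2 + y1 * y2 + t1 * t2      # p1 . p2
    outer = (x1 * y2 - y1 * x2,              # e_1 ^ e_2 coefficient
             y1 * t2 - t1 * y2,              # e_2 ^ e_3 coefficient
             x1 * t2 - t1 * x2)              # e_1 ^ e_3 coefficient
    return inner, outer

inner, outer = geometric_product((1.0, 2.0, 3.0), (4.0, 5.0, 6.0))
print(inner, outer)                          # p1 p2 = p1 . p2 + p1 ^ p2
```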
In G_3, the difference between p_1 and p_2 can be represented by Δp, i.e.:
Δp = p_1 - p_2 = (x_1 - x_2)e_1 + (y_1 - y_2)e_2 + (t_1 - t_2)e_3 (4)
Δp is the vector pointing from p_2 to p_1; it is not only a measure of the distance between two pixel points, but also reflects the motion of pixel points in the video sequence.
The above is an introduction of the three-dimensional geometric algebraic space of the video in the embodiment of the present invention.
In the embodiment of the present invention, constructing the motion vector of a pixel point of the three-dimensional geometric algebraic space specifically means constructing, in G_3, the motion vector v_p that points from the pixel point at p in the current frame video image to the pixel point at p_r in the next frame video image, the motion vector v_p being:
v_p = p_r - p; (5)
wherein p, p' ∈ G_3, p = x_i e_1 + y_j e_2 + t_k e_3, p' = x_i e_1 + y_j e_2 + (t_k + 1)e_3, and S is the set of points of the l × l neighborhood centered at p' on the plane t = t_k + 1.
Here p is the pixel point at p in the current frame video image of the three-dimensional geometric algebraic space G_3, and p_r ∈ S is the pixel point, within the neighborhood centered at p' in the next frame video image, whose pixel value differs least from that of the pixel point at p in the current frame video image; p' is the pixel point in the next frame video image at the same position as the pixel point at p in the current frame video image. v_p reflects the motion change of the pixel point at p, including the change of motion direction and the change of motion speed, and the modulus of v_p can be used to express the magnitude of the motion change. In general, the larger the change of the motion direction or motion speed of the pixel point at p, the larger the modulus of v_p, and vice versa.
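A minimal block-matching sketch of formula (5), assuming the matching criterion is the absolute pixel difference |f(q) - f(p)| over the l × l search window S; the function name and the zero-motion handling at the last frame are illustrative assumptions:

```python
import numpy as np

def motion_vector(video, x, y, t, l=5):
    """v_p = p_r - p: p_r is the point of the l x l neighborhood of p'
    (same (x, y), frame t+1) whose pixel value differs least from f(p)."""
    Sx, Sy, St = video.shape
    if t + 1 >= St:
        return np.zeros(3)                   # no next frame: zero motion
    r = l // 2
    fp = video[x, y, t]
    best_diff, best = np.inf, (0, 0)
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            qx, qy = x + dx, y + dy
            if 0 <= qx < Sx and 0 <= qy < Sy:
                diff = abs(video[qx, qy, t + 1] - fp)
                if diff < best_diff:
                    best_diff, best = diff, (dx, dy)
    # v_p = p_r - p: spatial displacement plus one frame along e_3
    return np.array([best[0], best[1], 1.0])
```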
Step S2, obtaining the appearance-motion vector of each pixel point by using the motion vector of the pixel point and a preset appearance-motion vector algorithm;
the preset apparent-motion vector algorithm is as follows:
f'(p) = f(p) + v_p (6)
where f'(p) represents the apparent-motion vector of the pixel point at p in the video image, v_p represents the motion vector of the pixel point at p in the video image, and f(p) represents the apparent value of the pixel point at p in the video image; in the embodiment of the present invention, f(p) is the pixel value of the pixel point at p in the video image.
The newly defined f'(p) is a multivector containing both scalar information and vector information; it reflects not only the appearance information but also the changes of motion direction and motion speed.
Based on the above definition, a unified model of video appearance and motion, UMAM (Unified Model of Appearance and Motion), abbreviated F', is constructed as follows:
F' = f'(p) (7)
wherein f'(p), the apparent-motion vector (AMV), is a function taking the pixel point p ∈ G_3 as its argument. As can be seen from the above analysis, the UMAM not only contains the apparent information of the video but also reflects the local motion information of the video, including motion direction and speed; the spatio-temporal Harris corners mentioned in the embodiments of the present invention refer to UMAM-Harris corners.
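A sketch of formula (6), storing the AMV f'(p) = f(p) + v_p as the four coefficients [scalar, e_1, e_2, e_3] of a G_3 multivector; this array representation is an illustrative choice, not prescribed by the text:

```python
import numpy as np

def apparent_motion_vector(fp, vp):
    """f'(p) = f(p) + v_p: a multivector with a scalar part (the pixel
    value) and a vector part (the motion), as [scalar, e_1, e_2, e_3]."""
    return np.array([fp, vp[0], vp[1], vp[2]])

# Example: pixel value 0.7 moving one pixel along x, one frame along t.
amv = apparent_motion_vector(0.7, np.array([1.0, 0.0, 1.0]))
print(amv)        # [0.7 1.  0.  1. ] -> apparent plus motion information
```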
Step S3, constructing a space-time second-order matrix by combining the apparent-motion vectors of the pixel points, and constructing a space-time Harris corner response function according to the space-time second-order matrix;
In order to better understand the technical solution of the embodiment of the present invention, the weighted correlation function of the AMV on the three-dimensional geometric algebraic space G_3 of the video is described below.
Let p ∈ G_3, and let p' be a point in the neighborhood of p with coordinates (p + Δp). Then the weighted correlation function of f'(p) and f'(p') is defined as follows:

c(p, Δp) = ω(p) ⊗ ||f'(p + Δp) - f'(p)||² (8)

wherein ω(p) is a three-dimensional Gaussian kernel function G(p, σ), ⊗ represents the convolution operation, ω(p) is a Gaussian window function of size l × l × l centered at p, and σ is the scale factor of the Gaussian kernel function.
Let p ∈ G_3. The embodiment of the invention gives a Gaussian function G(p, σ) in G_3 (formula (9)), wherein σ is the scale factor of the Gaussian function G(p, σ) in G_3, ∧ represents the outer product, · represents the inner product, and the size of the Gaussian window function is l × l × l, where l = 6σ + 1.
In order to better understand the technical solution in the embodiment of the present invention, it is proved below that the above Gaussian function G(p, σ) is a valid Gaussian function in the three-dimensional geometric algebraic space G_3 of the video.
Proof: |p·σ|⁻² can be further expanded, σ∧σ can likewise be further expanded, and equation (9) can then be rewritten accordingly. From this derivation it can be seen that G(p, σ), rewritten in the form over G_3, is consistent with the general three-dimensional Gaussian function; therefore, the Gaussian function G(p, σ) given in the embodiment of the present invention is a valid Gaussian function in G_3.
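Since the proof reduces G(p, σ) to the general three-dimensional Gaussian form, the window ω(p) can be sketched as follows (the exact geometric-algebraic form of formula (9) is not reproduced; l = 6σ + 1 as stated above):

```python
import numpy as np

def gaussian_window(sigma=1.0):
    """Normalized 3-D Gaussian window of size l x l x l, l = 6*sigma + 1."""
    l = int(6 * sigma + 1)
    r = l // 2
    ax = np.arange(-r, r + 1)
    x, y, t = np.meshgrid(ax, ax, ax, indexing="ij")
    g = np.exp(-(x**2 + y**2 + t**2) / (2.0 * sigma**2))
    return g / g.sum()                       # weights omega(p)

print(gaussian_window(1.0).shape)            # (7, 7, 7)
```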
Then, a Taylor series expansion of f'(p + Δp) is taken to first order:

f'(p + Δp) ≈ f'(p) + Δx f'_x + Δy f'_y + Δt f'_t (13)

wherein Δx, Δy and Δt are the distances between p and p' in the x, y and t directions, and f'_x, f'_y and f'_t are the gradients of f'(p) in the x, y and t directions, i.e.:

f'_x = ∂f'(p)/∂x (14)
f'_y = ∂f'(p)/∂y (15)
f'_t = ∂f'(p)/∂t (16)

The results in equations (14), (15) and (16) above are all vectors.
Thus, equation (8) can be approximated as:

c(p, Δp) ≈ [Δx Δy Δt] M(p) [Δx Δy Δt]^T (17)

wherein:

M(p) = ω(p) ⊗ N,  N = [ f'_x²      f'_x f'_y  f'_x f'_t ]
                      [ f'_x f'_y  f'_y²      f'_y f'_t ]
                      [ f'_x f'_t  f'_y f'_t  f'_t²     ],

M(p) = [ A  D  E ]
       [ D  B  F ]
       [ E  F  C ] (18)

A = ω ⊗ f'_x², B = ω ⊗ f'_y², C = ω ⊗ f'_t², D = ω ⊗ (f'_x f'_y), E = ω ⊗ (f'_x f'_t), F = ω ⊗ (f'_y f'_t) (19)

In formula (19), f'_x, f'_y and f'_t respectively denote the gradients of the AMV in the x, y and t directions, ω is the Gaussian weighting function ω(p) in equation (8), ⊗ is the convolution operator, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
Based on the weighted correlation function, the refinement of step S3 in the embodiment of the present invention is described below; as shown in fig. 2, step S3 includes:
S31, calculating, by combining equations (14), (15) and (16), the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of each pixel point in the x, y and t directions;
S32, constructing, from the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of the pixel point in the x, y and t directions, a second-order gradient matrix N, wherein the second-order gradient matrix N is:

N = [ f'_x²      f'_x f'_y  f'_x f'_t ]
    [ f'_x f'_y  f'_y²      f'_y f'_t ]
    [ f'_x f'_t  f'_y f'_t  f'_t²     ]
Specifically, the acquisition of the second-order gradient matrix N is shown in equation (18).
S33, performing a convolution of the matrix N with the Gaussian weighting function ω(p) to obtain the space-time second-order matrix M(p), where the space-time second-order matrix M(p) is:

M(p) = ω(p) ⊗ N = [ A  D  E ]
                  [ D  B  F ]
                  [ E  F  C ]

where ω is the Gaussian weighting function ω(p), with the scale factor of the Gaussian function taken as σ = 1, ⊗ is the convolution operator, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
Specifically, the acquisition of the spatio-temporal second-order matrix M(p) is shown in equations (18) and (19).
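A sketch of steps S31 to S33 under a simplifying assumption: the AMV field is reduced to one scalar channel per pixel (the method itself differentiates the multivector-valued f'(p)); np.gradient and scipy.ndimage.convolve stand in for the derivatives and the ⊗ operation, and the element layout follows the matrix M(p) above:

```python
import numpy as np
from scipy.ndimage import convolve

def structure_tensor(F_prime, omega):
    """S31-S33: gradients of the AMV field in x, y, t and the Gaussian
    smoothing M(p) = omega (*) N; returns the six distinct elements of
    the symmetric matrix M = [[A, D, E], [D, B, F], [E, F, C]]."""
    fx, fy, ft = np.gradient(F_prime)        # f'_x, f'_y, f'_t  (S31)
    A = convolve(fx * fx, omega)             # omega (*) f'_x^2
    B = convolve(fy * fy, omega)             # omega (*) f'_y^2
    C = convolve(ft * ft, omega)             # omega (*) f'_t^2
    D = convolve(fx * fy, omega)             # omega (*) f'_x f'_y
    E = convolve(fx * ft, omega)             # omega (*) f'_x f'_t
    F = convolve(fy * ft, omega)             # omega (*) f'_y f'_t
    return A, B, C, D, E, F

# Usage with the gaussian_window sketched earlier (sigma = 1):
# A, B, C, D, E, F = structure_tensor(F_prime, gaussian_window(1.0))
```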
S34, constructing a space-time Harris corner response function according to the space-time second-order matrix;
the spatio-temporal Harris corner response function is:
R = det(M) - k(trace(M))³ = (ABC + 2DEF - BE² - AF² - CD²) - k(A + B + C)³
wherein R is the space-time Harris corner response function value and k is an empirical constant, taken as k = 0.04 in the embodiment of the invention; det(M) denotes the determinant of the matrix M(p) and trace(M) denotes the trace of the matrix M(p), expressed as follows:
det(M) = λ_1 λ_2 λ_3 = ABC + 2DEF - BE² - AF² - CD²
trace(M) = λ_1 + λ_2 + λ_3 = A + B + C
wherein λ_1, λ_2 and λ_3 are the eigenvalues of the matrix M, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
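A per-pixel sketch of the response function, following the det(M) and trace(M) expressions above, with k = 0.04 as in the embodiment:

```python
def harris_response(A, B, C, D, E, F, k=0.04):
    """R = det(M) - k * trace(M)^3 for the symmetric 3x3 matrix
    M = [[A, D, E], [D, B, F], [E, F, C]], evaluated elementwise."""
    det = A * B * C + 2 * D * E * F - B * E**2 - A * F**2 - C * D**2
    trace = A + B + C
    return det - k * trace**3
```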
Step S4, calculating, according to the space-time Harris corner response function, the space-time Harris corner response function values of a pixel point p and of all pixel points in its neighborhood; if the response function value of the pixel point p is larger than those of all pixel points in its neighborhood, the point p is a space-time Harris corner of the video.
Specifically, the UMAM-Harris corner response function values of the point p and of the other pixel points in its h × h × h neighborhood are compared; if R(p) is the maximum within the h × h × h neighborhood, the point p is a UMAM-Harris corner of the video.
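A sketch of this local-maximum test; scipy.ndimage.maximum_filter and the optional response threshold are implementation assumptions of this illustration, not named in the text:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def spatiotemporal_corners(R, h=5, threshold=0.0):
    """p is a UMAM-Harris corner if R(p) is the maximum of R over its
    h x h x h neighborhood (a threshold suppresses flat regions)."""
    local_max = maximum_filter(R, size=h)
    mask = (R == local_max) & (R > threshold)
    return np.argwhere(mask)                 # (x, y, t) corner positions
```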
Referring to fig. 3, a schematic structural diagram of a new spatio-temporal Harris corner detection apparatus in an embodiment of the present invention is shown, where the detection apparatus includes:
the video processing device comprises a preprocessing module 1, a video processing module and a video processing module, wherein the preprocessing module is used for obtaining a three-dimensional geometric algebraic space of a video based on spatial information and time domain information of a video image contained in the video;
the motion vector construction module 2 is used for constructing motion vectors of pixel points of the three-dimensional geometric algebraic space;
Specifically, the motion vector construction module is used for constructing, according to the following formula, the motion vector v_p of the three-dimensional geometric algebraic space that points from the pixel point at p in the current frame video image to the pixel point at p_r in the next frame video image, the motion vector v_p being:
v_p = p_r - p;
wherein p is the pixel point at p in the current frame video image of the three-dimensional geometric algebraic space, and p_r is the pixel point, within the neighborhood centered at p' in the next frame video image, whose pixel value differs least from that of the pixel point at p in the current frame video image; p' is the pixel point in the next frame video image at the same position as the pixel point at p in the current frame video image.
and the appearance-motion vector construction module 3, used for obtaining the appearance-motion vector of each pixel point from the motion vector of the pixel point and a preset appearance-motion vector algorithm; the appearance-motion vectors of all pixel points of the three-dimensional geometric algebraic space constitute the established unified model of video appearance and motion information.
Specifically, the preset apparent-motion vector algorithm is:
f'(p) = f(p) + v_p;
wherein v_p represents the motion vector of the pixel point at p in the video image, f(p) represents the pixel value of the pixel point at p in the video image, and f'(p) represents the apparent-motion vector of the pixel point at p in the video image.
The space-time second-order matrix construction module 4 is used for constructing a space-time second-order matrix by combining the apparent-motion vectors of the pixel points;
Specifically, as shown in fig. 4, the spatio-temporal second-order matrix construction module includes:
a gradient calculation module 41, used for calculating the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of each pixel point in the x, y and t directions;
a second-order gradient matrix construction module 42, used for constructing, from the gradients f'_x, f'_y and f'_t of the apparent-motion vector f'(p) of the pixel point in the x, y and t directions, a second-order gradient matrix N, wherein the second-order gradient matrix N is:

N = [ f'_x²      f'_x f'_y  f'_x f'_t ]
    [ f'_x f'_y  f'_y²      f'_y f'_t ]
    [ f'_x f'_t  f'_y f'_t  f'_t²     ]

a space-time second-order matrix obtaining module 43, used for performing a convolution of the matrix N with the Gaussian weighting function ω(p) to obtain the space-time second-order matrix M(p), where the space-time second-order matrix M(p) is:

M(p) = ω(p) ⊗ N = [ A  D  E ]
                  [ D  B  F ]
                  [ E  F  C ]

where ω denotes the Gaussian weighting function ω(p), ⊗ denotes the convolution operator, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
A response function constructing module 5, configured to construct a spatio-temporal Harris corner response function according to the spatio-temporal second order matrix;
specifically, the spatio-temporal Harris corner response function is:
R = det(M) - k(trace(M))³ = (ABC + 2DEF - BE² - AF² - CD²) - k(A + B + C)³
wherein R is the space-time Harris corner response function value and k is an empirical constant, taken as k = 0.04 in the embodiment of the invention; det(M) denotes the determinant of the matrix M(p) and trace(M) denotes the trace of the matrix M(p), expressed as follows:
det(M) = λ_1 λ_2 λ_3 = ABC + 2DEF - BE² - AF² - CD²
trace(M) = λ_1 + λ_2 + λ_3 = A + B + C
wherein λ_1, λ_2 and λ_3 are the eigenvalues of the matrix M, and A, B, C, D, E and F correspond to the elements of the matrix M(p), respectively.
and the space-time Harris corner acquisition module 6, used for calculating, according to the space-time Harris corner response function, the space-time Harris corner response function values of a pixel point p and of all pixel points in the neighborhood of p; if the response function value of the pixel point p is larger than those of all pixel points in its neighborhood, the point p is a space-time Harris corner of the video.
Specifically, the UMAM-Harris corner response function values of the point p and of the other pixel points in its h × h × h neighborhood are compared; if R(p) is the maximum within the h × h × h neighborhood, the point p is a UMAM-Harris corner of the video.
In the embodiment of the invention, a three-dimensional geometric algebraic space of the video is obtained from the spatial information and time domain information of the video images contained in the video; motion vectors of the pixel points of this space are constructed and combined into apparent-motion vectors, yielding a unified model of video appearance and motion information; a space-time second-order matrix is constructed on the basis of this unified model, and a space-time Harris corner response function is constructed from the space-time second-order matrix; the space-time Harris corner response function values of a pixel point p and of all pixel points in its neighborhood are calculated according to the response function, and the value at p is compared with those of the other pixel points in its neighborhood to judge whether the point p is a space-time Harris corner of the video. Compared with the prior art, the invention treats the video as a three-dimensional structure, uses geometric algebra to establish a unified model of video appearance and motion information, modeling the geometric bodies of different dimensions of the appearance and motion information and the relations between them, and proposes a space-time Harris corner detection algorithm based on this model, so that the detection algorithm fully reflects the time-space domain correlation of the video and the extracted space-time interest points contain unique appearance information in the space domain while representing clear motion change in the time domain.
In order to verify the effectiveness of the new spatio-temporal Harris corner detection method in the embodiment of the present invention, the proposed UMAM-Harris detection algorithm is evaluated on the popular video behavior recognition dataset UCF101. UCF101 is one of the largest current real-scene video behavior recognition datasets; it comprises 101 categories of behaviors and 13320 video clips in total, each category consisting of 25 groups of 4 to 7 videos, where the videos within one group are performed by the same person in the same scene with the same background and shooting angle. The 101 categories fall into 5 major groups according to the moving object: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, and Sports. UCF101 covers varied conditions, including camera motion, complex backgrounds, occlusion, changing illumination and low resolution, making it a challenging video behavior recognition dataset.
The spatio-temporal interest points used in the experiments comprise the UMAM-Harris corners extracted by the UMAM-Harris detection algorithm, the Harris3D features extracted by the Harris3D detection algorithm, and the SIFT3D features extracted by the SIFT3D detection algorithm. For each video, its spatio-temporal interest points are extracted and the ST-SIFT descriptor of each extracted interest point is computed. The spatio-temporal interest points of the training videos are subsampled; PCA is then used to reduce the dimensionality of the descriptors, a pre-trained Gaussian mixture model is used to compute Fisher vectors, and an SVM model is finally trained. For testing, PCA dimensionality reduction is applied to the videos under test, the Fisher vectors of their spatio-temporal interest points are computed, and classification is performed with the pre-trained SVM model. To keep the training set disjoint from the test set, the test set contains 7 groups of each type of behavior and the remaining 18 groups are used for training.
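A condensed sketch of this recognition pipeline under stated simplifications: scikit-learn is assumed, descriptor shapes and counts are illustrative, and a soft-assignment pooling stands in for the full Fisher vector encoding used in the text:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.svm import LinearSVC

def encode(descriptors, pca, gmm):
    """PCA-reduce the interest-point descriptors of one video, then pool
    the GMM posteriors into one fixed-length vector (a simplified
    stand-in for the Fisher vector)."""
    z = pca.transform(descriptors)
    return gmm.predict_proba(z).mean(axis=0)

# Illustrative data: one 128-D descriptor set per training video.
rng = np.random.default_rng(0)
train_desc = [rng.random((200, 128)) for _ in range(10)]
train_labels = np.array([0, 1] * 5)

stacked = np.vstack(train_desc)
pca = PCA(n_components=32).fit(stacked)
gmm = GaussianMixture(n_components=8, random_state=0).fit(pca.transform(stacked))

X = np.array([encode(d, pca, gmm) for d in train_desc])
svm = LinearSVC().fit(X, train_labels)       # train; reuse for test videos
print(svm.predict(X[:3]))
```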
In order to test and evaluate the technical solution in the embodiment of the present invention, the proposed algorithm is evaluated on the UCF101 human behavior dataset. In the experimental part, the UMAM-Harris detection algorithm is first used to extract spatio-temporal interest points from a specific video; video behaviors on the UCF101 dataset are then classified using the UMAM-Harris corner based algorithm and compared with existing methods.
Under the same experimental setup, spatio-temporal interest points are extracted from the table tennis video named "v_TableTennisShot_g01_c01" in the UCF101 database using the proposed UMAM-Harris detection algorithm, the Harris3D detection algorithm and the SIFT3D detection algorithm respectively; the extraction results are shown in fig. 5, fig. 6 and fig. 7.
In fig. 5, fig. 6 and fig. 7, panels a, b, c and d correspond to video frames 42, 47, 94 and 113 and show the distributions of the spatio-temporal interest points extracted by the Harris3D, SIFT3D and UMAM-Harris detection algorithms; the dots on the video frames mark the positions of the extracted spatio-temporal interest points. Fig. 5 shows the distribution of the Harris3D features, fig. 6 the distribution of the SIFT3D features, and fig. 7 the distribution of the UMAM-Harris corners.
From the above experimental results, it can be seen that the Harris3D features detected by the Harris3D detection algorithm are mainly distributed on the athlete's body, so the joint movements of the athlete can be detected, although a small amount of noise distributed over the video background is also detected. The SIFT3D features extracted by the SIFT3D detection algorithm are very rich but contain more background noise points; these pseudo spatio-temporal interest points interfere with subsequent feature description and classification. Compared with the extraction results of the Harris3D and SIFT3D detection algorithms, most of the UMAM-Harris corners extracted by the UMAM-Harris algorithm are distributed on the athlete's body and contain few background noise points. This is because the UMAM-Harris detection algorithm considers not only the obvious gray-level changes of the moving object in the spatial and temporal domains but also information such as its motion speed and motion direction, so that UMAM-Harris corners containing motion information can be extracted from the video while background noise points are suppressed. The UMAM-Harris corners can therefore accurately locate the moving object and better represent the significantly changing behavior in the video. From this we can conclude that the UMAM-Harris detection algorithm truly extracts spatio-temporal interest points containing unique apparent information and rich motion information, while retaining the robustness and effectiveness of the features detected by the Harris3D detection algorithm.
The embodiment of the invention regards the video as a three-dimensional structure, establishes a unified model of video appearance and motion information using Clifford algebra, and on this basis develops a new space-time Harris corner detection algorithm, the UMAM-Harris corner detection algorithm. Experimental results show that the UMAM-Harris detection algorithm truly extracts spatio-temporal interest points that reflect the unique appearance information of the video in the space domain and the motion change in the time domain, and that it can effectively improve the classification accuracy of video behavior recognition.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.