CN107180224B - Finger motion detection and positioning method based on space-time filtering and joint space Kmeans


Info

Publication number
CN107180224B
Authority
CN
China
Prior art keywords
space
background
kmeans
class
moving target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710231824.3A
Other languages
Chinese (zh)
Other versions
CN107180224A (en)
Inventor
韦岗
梁舒
马碧云
李增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710231824.3A priority Critical patent/CN107180224B/en
Publication of CN107180224A publication Critical patent/CN107180224A/en
Application granted granted Critical
Publication of CN107180224B publication Critical patent/CN107180224B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a finger motion detection and positioning method based on space-time filtering and joint space Kmeans. Ten labels of different colors (excluding black and white) are first attached to the fingers of a player, and a video of the player playing a keyboard instrument is shot. A finger moving target is then detected in each input video frame by space-time filtering, and the spatial filtering result is fed back to guide dynamic background updating. The finger moving target is positioned with joint space Kmeans, in which the cluster number and the initial class centers are determined adaptively from the statistical characteristics of the R, G, B histograms. The method thereby realizes fingering recognition and recording with low computational complexity, fast convergence, high positioning accuracy and good real-time performance.

Description

Finger motion detection and positioning method based on space-time filtering and joint space Kmeans
Technical Field
The invention relates to the technical fields of visual monitoring, digital image processing and the like, in particular to a finger motion detection and positioning method based on space-time filtering and joint space Kmeans.
Background
Proper fingering is critical for a piano (or other keyboard instrument) player to play flexibly and interpret music. Good fingering reflects the player's understanding of the style of a piece and the content of the work; it also saves energy and time and improves playing efficiency. Although playing fingering follows general rules, the fingering of different pieces is not fixed, which makes it harder for beginners to practice fingering and to imitate the fingering of accomplished musicians. Recording fingering manually not only demands considerable musical training but is also time-consuming and labor-intensive. Automatic, intelligent fingering recognition by machine is therefore a necessary trend for fingering research and learning.
The key to fingering recognition is the organic combination of moving target detection and moving target positioning.
Commonly used moving object detection methods include: background modeling, frame differencing, and optical flow.
1) Background modeling: a static scene without intruding objects is assumed to exhibit regular characteristics that can be described by a statistical model, for example a weighted mixture of component models. Once the background model is known, an intruding object can be detected by marking the portions of the scene image that do not conform to the model. Common background modeling methods include the single Gaussian model, the Gaussian mixture model, and kernel density estimation. Although these methods obtain relatively accurate moving target regions, the amount of computation is large, the speed is slow, and they are sensitive to illumination and background changes.
2) Frame differencing: motion regions in the image are extracted from the temporal differences between adjacent frames. The frame difference method is fast and stable, but when the fingers move slowly, the overlapping part of the moving target pixels in two adjacent frames cannot be detected.
3) Optical flow: motion detection is performed using the time-varying optical flow characteristics of a moving object. No background modeling is required, and an independent moving object can be detected even when no prior information about the scene is available. However, the computation is complex, special hardware is usually required, real-time requirements are difficult to meet, and motion boundaries, motion occlusion and multiple motions (including transparent and semi-transparent motion) remain bottlenecks of the optical flow method.
Moving target positioning methods, meanwhile, are typically based on edge detection. Edge detection replaces simplified positioning information with an accurate representation of the target contour, but it loses a large amount of information when the fingering is complicated or when the labels of two or more fingers overlap, and may even merge two moving targets into one. Because edge detection can only localize and cannot classify, fingers cannot be matched correctly to the detected contours. Edge detection is also strongly affected by the background and, lacking any filtering capability, may detect noise points that interfere with finger positioning.
In the application scenario of playing fingering recognition, existing moving target detection and positioning methods therefore suffer from various problems: sensitivity to illumination and background changes, missed detection of slow-moving targets, high computational cost, and positioning errors under label overlap and noise. The invention provides a finger motion detection and positioning method based on space-time filtering and joint space Kmeans, which realizes fingering identification by analyzing videos of players playing the piano (or another keyboard instrument). Moving target detection adopts space-time filtering, which overcomes the influence of illumination and background changes and effectively avoids missing slow-moving targets; moving target positioning uses joint space Kmeans, which fully exploits the statistical characteristics of the images for adaptive decisions and improves positioning and clustering accuracy.
Disclosure of Invention
The invention aims to overcome the defects of existing moving target detection and positioning methods when applied to the playing fingering recognition scene, and provides a finger motion detection and positioning method based on space-time filtering and joint space Kmeans.
To achieve this purpose, the finger motion detection and positioning method based on space-time filtering and joint space Kmeans comprises three modules: labeling and video shooting, moving object detection, and moving object positioning.
The labeling and video shooting module generates the video file processed by the subsequent modules: ten labels of different colors (excluding black and white) are first attached to the fingers of the player, and the playing process is shot as a video while the player plays the piano normally.
The moving object detection module is used for detecting the moving target and adopts a space-time filtering method. Spatial filtering is first applied to the input video frame to obtain an accurate moving target region. The spatial filtering result is then fed back to guide the spatial recombination of the temporal band-pass filtering result and the temporal low-pass filtering result at the foreground (finger movement) positions and the background positions, completing the dynamic background update; this overcomes the influences of illumination change, camera shake and background change, and effectively avoids missing the moving target when the fingers move slowly. The finger moving target detection result is then converted from RGB space to YCrCb and HSV space for band-pass filtering, skin color and shadow are removed, and the labels are extracted by foreground thresholding.
The specific implementation steps of moving object detection are shown in fig. 2.
Step 1: spatial filtering, comprising the steps of:
1.1 Search for the moving target region. The current input video frame and the background image are compared pixel by pixel in the spatial domain to find the moving target region.
1.2 Determine foreground and background. Pixels in the moving target region keep the values of the corresponding positions in the current input frame, and pixels in the background region are set to white ((255, 255, 255) in RGB space).
1.3 Feed back foreground and background. The foreground (moving target region) and background are fed back for the background update of the next frame, as illustrated by the sketch after this list.
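A minimal Python sketch of steps 1.1-1.3, assuming a summed absolute channel difference with threshold `thresh` as the pixel-wise comparison; the patent only specifies a pixel-by-pixel spatial-domain comparison against the background image, so the threshold value is an assumption.

```python
import numpy as np

def spatial_filter(frame, background, thresh=30):
    """Pixel-wise spatial filtering (steps 1.1-1.3): keep the input pixel
    wherever the frame deviates from the background (assumed moving target),
    paint everything else white. `thresh` is an assumed value."""
    # Per-pixel absolute difference, summed over the R, G, B channels
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32)).sum(axis=2)
    foreground_mask = diff > thresh                  # True where motion is assumed
    result = np.full_like(frame, 255)                # background -> white (255, 255, 255)
    result[foreground_mask] = frame[foreground_mask] # foreground keeps frame values
    return result, foreground_mask
```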
Step 2: dynamic background update, comprising the steps of:
2.1 Spatial filtering result feedback. The previous spatial filtering result is fed back to guide the dynamic background update. Judge whether the current input video frame is the 2nd frame: if so, the background is not updated and the first frame image is used directly as the background; otherwise, proceed to the next operation.
2.2 Spatial domain recombination. The temporal band-pass filtering result and the temporal low-pass filtering result are recombined in the spatial domain at the foreground (finger movement) positions and the background positions to complete the background update; a sketch follows.
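The recombination of step 2.2 might look as follows, assuming a running average with mixing rate `alpha` as the temporal low-pass filter and retention of the previous background at foreground positions; the patent does not give the filter coefficients, so both choices are assumptions.

```python
import numpy as np

def update_background(background, frame, foreground_mask, alpha=0.05):
    """Dynamic background update by spatial-domain recombination (step 2.2).

    Assumptions: the temporal low-pass is a running average with rate `alpha`;
    at foreground positions the previous background is kept so the moving
    fingers never leak into the model."""
    low_pass = ((1.0 - alpha) * background.astype(np.float32)
                + alpha * frame.astype(np.float32))  # temporal low-pass result
    new_bg = background.astype(np.float32)           # start from previous background
    bg_mask = ~foreground_mask
    new_bg[bg_mask] = low_pass[bg_mask]              # update only at background positions
    return new_bg.astype(np.uint8)
```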
Step 3: label extraction, comprising the steps of:
3.1 Remove skin color. Convert RGB space to YCrCb space and judge whether the coordinates (Cr, Cb) fall inside the skin color ellipse model. Any pixel inside the ellipse is set to white.
3.2 Remove shadow. Convert RGB space to HSV space and apply band-pass filtering to the V component histogram.
3.3 Judge the label. In HSV space, compute the foreground average threshold of the S component, and set to white those pixels of the extracted moving target whose S component is below the foreground average saturation threshold. A sketch of the three steps follows.
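Steps 3.1-3.3 could be sketched with OpenCV as below; the ellipse parameters (cr0, cb0, a, b), the V cut-off standing in for the V-histogram band-pass, and the saturation scale are illustrative assumptions, since the patent specifies only the colour spaces and the decision rules.

```python
import cv2
import numpy as np

def extract_labels(target_rgb, cr0=150.0, cb0=115.0, a=20.0, b=15.0,
                   v_low=60, s_scale=1.0):
    """Label extraction sketch (steps 3.1-3.3); all numeric constants are
    assumed, not taken from the patent."""
    img = target_rgb.copy()

    # 3.1 remove skin: whiten pixels whose (Cr, Cb) falls inside the ellipse
    ycrcb = cv2.cvtColor(img, cv2.COLOR_RGB2YCrCb).astype(np.float32)
    cr, cb = ycrcb[..., 1], ycrcb[..., 2]
    skin = ((cr - cr0) / a) ** 2 + ((cb - cb0) / b) ** 2 <= 1.0
    img[skin] = 255

    # 3.2 remove shadow: whiten the darkest pixels (lowest V values)
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
    img[hsv[..., 2] < v_low] = 255

    # 3.3 keep labels: whiten foreground pixels below the mean foreground saturation
    hsv = cv2.cvtColor(img, cv2.COLOR_RGB2HSV)
    fg = np.any(img != 255, axis=2)
    if fg.any():
        s_mean = hsv[..., 1][fg].mean()
        img[(hsv[..., 1] < s_scale * s_mean) & fg] = 255
    return img
```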
The moving target positioning module is used for positioning the moving target and adopts a joint space Kmeans method. Joint space Kmeans not only positions but also classifies, so that different fingers are matched correctly to label classes, and positioning errors caused by color overlap, fingering complexity and noise interference are effectively avoided. First, the peaks of the low-pass-filtered R, G, B component histograms are detected and the cluster number K is determined adaptively, making classification more accurate and intelligent. The clusters are then initialized adaptively from the histogram statistics, which avoids falling into local optima, accelerates iterative convergence, and improves the efficiency and accuracy of the algorithm. Joining the color space (R, G, B) and the geometric space (x, y) into a 5-dimensional Kmeans fully exploits the prior knowledge that pixels of the same color lie at similar positions, improving clustering and positioning accuracy. Random perturbation and simulated annealing of the cluster centers improve the stability of the algorithm while avoiding local optima. Finally, the clustering result is classified and positioned, and the position of the fingers on the keyboard is determined for each frame, thereby obtaining the player's fingering.
The specific implementation steps of moving target positioning are shown in fig. 3.
Step 1: joint spatial adaptive Kmeans comprising the steps of:
1.1 statistics R, G, B histogram characteristics. And (4) performing low-pass filtering on the three component histograms of the moving object detection result R, G, B, and judging the peak of the histogram in a self-adaptive manner.
1.2 adaptively determining the clustering number K. The maximum number of peaks of the R, G, B histogram is taken as the cluster number of the joint space Kmeans.
1.3 adaptive clustering initialization. The cluster center is initialized with the R, G, B histogram peak locations.
1.4 iterate until convergence. The following operations are repeated until convergence: (a) class centers for the K classes are calculated, respectively. The class center of the kth (K is more than or equal to 1 and less than or equal to K) is the mean vector of the 5-dimensional observation (R, G, B, x, y) vectors in the kth class. (b) Each observation is assigned to the class in which the closest class center is located (euclidean distance is used to define "closest").
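Assuming the moving target pixels are the non-white pixels of the detection result, steps 1.1-1.4 might be implemented as in this sketch; the smoothing window and peak-prominence settings stand in for the unspecified low-pass filter and peak decision.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.signal import find_peaks

def adaptive_kmeans(target_rgb, max_iter=100):
    """Joint-space adaptive Kmeans sketch (steps 1.1-1.4)."""
    fg = np.any(target_rgb != 255, axis=2)        # non-white = moving target
    ys, xs = np.nonzero(fg)
    rgb = target_rgb[fg].astype(np.float64)
    obs = np.column_stack([rgb, xs, ys])          # 5-D (R, G, B, x, y) observations

    # 1.1 low-pass filter each channel histogram and detect its peaks
    peaks_by_channel = []
    for c in range(3):
        hist, _ = np.histogram(rgb[:, c], bins=256, range=(0, 256))
        smooth = uniform_filter1d(hist.astype(float), size=9)      # assumed window
        pk, _ = find_peaks(smooth, prominence=smooth.max() * 0.05) # assumed prominence
        peaks_by_channel.append(pk)

    # 1.2 K = maximum peak count among the R, G, B histograms
    ch = int(np.argmax([len(p) for p in peaks_by_channel]))
    peaks = peaks_by_channel[ch]
    K = len(peaks)

    # 1.3 initialise each class centre from the observation nearest a peak
    centers = np.array([obs[np.argmin(np.abs(rgb[:, ch] - p))] for p in peaks])

    # 1.4 Lloyd iterations until the centres are stable (Euclidean distance)
    for _ in range(max_iter):
        d = ((obs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([obs[assign == k].mean(axis=0)
                                if np.any(assign == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign, obs
```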
Step 2: random perturbation and simulated annealing, comprising the steps of:
2.1 Compute the 5-dimensional perturbation radius of each class. The distance from each class center to the farthest point of that class is taken as the perturbation radius r_k (a five-dimensional vector; K is the number of clusters).
2.2 Random perturbation. Take a random number random_0 between -1 and 1 and perturb the class centers by r_k * random_0. Take the perturbed class centers as new initial class centers and run joint space adaptive Kmeans again. Compute the difference ΔJ = J' - J between the new objective function J' and the current objective function J. If ΔJ < 0, accept the new solution as the current solution and update the perturbation radius. The objective function is the Kmeans objective function.
2.3 Simulated annealing. Modify the random number participating in the perturbation to random_0 * a^(-t), where a is the annealing rate (a > 1) and t is the number of annealing steps, then continue with the operations of 2.1 and 2.2. A sketch of this step follows.
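A sketch of steps 2.1-2.3, continuing from the previous sketch. Here `run_kmeans(obs, init_centers)` is a hypothetical helper that reruns the Lloyd iterations of step 1.4 from the given initial centres; the annealing rate `a`, the iteration cap, and the accept-only-on-improvement rule are assumptions consistent with the text.

```python
import numpy as np

def kmeans_objective(obs, centers, assign):
    """Kmeans objective J: sum of squared distances to the assigned centres."""
    return float(((obs - centers[assign]) ** 2).sum())

def perturb_and_anneal(obs, centers, assign, run_kmeans,
                       a=1.5, max_anneals=20, seed=0):
    """Random perturbation + simulated annealing sketch (steps 2.1-2.3)."""
    rng = np.random.default_rng(seed)
    J = kmeans_objective(obs, centers, assign)
    for t in range(max_anneals):
        # 2.1 per-class 5-D perturbation radius: farthest member deviation
        radius = np.array([np.abs(obs[assign == k] - centers[k]).max(axis=0)
                           if np.any(assign == k) else np.zeros(obs.shape[1])
                           for k in range(len(centers))])
        # 2.2 perturb by r_k * random_0, random_0 in [-1, 1],
        # damped by a**(-t) as annealing proceeds (2.3)
        random_0 = rng.uniform(-1.0, 1.0, size=centers.shape)
        new_centers, new_assign = run_kmeans(obs, centers + radius * random_0 * a ** (-t))
        J_new = kmeans_objective(obs, new_centers, new_assign)
        if J_new - J < 0:                        # dJ = J' - J < 0: accept new solution
            centers, assign, J = new_centers, new_assign, J_new
    return centers, assign
```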
Step 3: fingering identification, comprising the steps of:
3.1 Locate the moving targets. Determine the position of each finger on the keyboard in every video frame from the coordinates of the joint space adaptive Kmeans cluster centers, thereby obtaining the fingering.
3.2 Fingering output. The fingering of every video frame is stored uniformly in a csv file for subsequent fingering learning and research, for example as sketched below.
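Storing the result could be as simple as this sketch; the column layout is an assumption, as the patent only states that the fingering of each video frame is saved uniformly to csv.

```python
import csv

def save_fingering(rows, path="fingering.csv"):
    """Store the per-frame fingering uniformly in a csv file (step 3.2).

    `rows` is an iterable of (frame_index, finger_label, key_position)
    tuples; the column names are assumed, not from the patent."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame", "finger", "key"])
        writer.writerows(rows)
```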
Compared with the prior art, the invention has the following advantages and technical effects:
1) Compared with common moving target detection techniques, the method overcomes the influences of illumination change, camera shake and background change, effectively avoids background degradation and missed detection of the moving target during slow finger movement, and facilitates detection and extraction of the moving target.
2) The invention adopts spatial filtering in moving target detection, determining the moving target by comparing the input video frame with the background image pixel by pixel in the spatial domain, which yields a more accurate moving target region than common moving target detection techniques.
3) The invention uses joint space adaptive Kmeans in moving target positioning, combining positioning with adaptive clustering, so that different fingers are matched correctly to label classes and positioning errors caused by color overlap, fingering complexity and noise interference are effectively avoided. Joining the color space (R, G, B) and the geometric space (x, y) into a 5-dimensional Kmeans fully exploits the prior knowledge that pixels of the same color lie at similar positions, improving clustering and positioning accuracy.
4) In the joint space Kmeans positioning, the cluster number K is determined adaptively as the maximum number of peaks of the low-pass-filtered R, G, B component histograms, making classification more accurate and intelligent. Initializing the clusters adaptively from the histogram statistics avoids falling into local optima, accelerates iterative convergence, and improves the efficiency and accuracy of the algorithm.
In conclusion, the method overcomes the defects of existing moving target detection and positioning methods in the playing fingering recognition scene, is insensitive to illumination and background changes, has low computational complexity, fast convergence, high positioning accuracy and good real-time performance, and, with appropriate modification, can be widely applied to gesture recognition and other fields.
Drawings
FIG. 1 is a general flow chart of a finger motion detection and localization method based on spatio-temporal filtering and joint space Kmeans according to the present invention;
FIG. 2 is a flow chart of a moving object detection module according to the present invention;
FIG. 3 is a flow chart of the moving object locating module according to the present invention.
Detailed Description
The invention first attaches ten labels of different colors (excluding black and white) to the fingers of the player and shoots a video of the player playing the piano.
Moving object detection is then performed on the video. First, the finger motion region is determined from the input video frames by space-time filtering and the labels are extracted. Spatial filtering detects the moving target by comparing the input video frame with the background image pixel by pixel in the spatial domain, yielding an accurate moving target region. The spatial filtering result is then fed back to guide the dynamic background update, so that the updated background stays closest to the background of the next input frame; this effectively avoids background degradation and facilitates detection and extraction of the moving target. In YCrCb space, the projection of skin information onto the CrCb plane follows an approximately elliptical distribution, so skin pixels on the finger moving target are removed by judging whether the coordinates (Cr, Cb) fall inside the skin color ellipse model. In HSV space, the value V represents the brightness of a color: the darker the color, the smaller V. Shadows have the lowest brightness relative to the other parts of the finger moving target and can be removed by band-pass filtering the V component histogram. S represents the saturation of a color: the purer and more vivid the color, the higher S. The labels have the highest saturation relative to the other parts of the finger moving target and can be extracted by thresholding against the foreground average saturation.
Finally, motion positioning is performed on the video. The finger moving targets are classified and positioned with joint space Kmeans, realizing fingering identification and recording. Kmeans clusters around K centers in the space, assigning each object to its closest center; the cluster centers are updated iteratively, the error decreases continually, and the algorithm converges when the error no longer changes. In this moving target positioning task, the optimal solution lies near the R, G, B values of the 10 label colors. Therefore the clusters are initialized adaptively from the peaks of the low-pass-filtered R, G, B component histograms, so that the initial Kmeans centers are close to the optimal solution, which accelerates iterative convergence and improves algorithm efficiency; unlike randomly initialized Kmeans, which may converge to a local rather than global optimum, this adaptive initialization avoids local optima. Meanwhile, joining the color space (R, G, B) and the geometric space (x, y) into a 5-dimensional Kmeans fully exploits the prior knowledge that pixels of the same color lie at similar positions, improving clustering and positioning accuracy. The simulated annealing Kmeans algorithm is a heuristic, asymptotically convergent iterative algorithm that has been proven in theory to converge to the global optimum with probability 1. Random perturbation and simulated annealing of the cluster centers therefore improve the stability of the algorithm while avoiding local optima.
The invention organically combines machine learning, digital signal processing and related methods to realize finger motion detection and positioning based on space-time filtering and joint space Kmeans. The invention is described in further detail below with reference to the detailed description and the accompanying drawings, but the embodiments of the invention are not limited thereto.
Fig. 1 shows a specific embodiment of the invention, which mainly comprises three modules: labeling and video shooting, moving object detection, and moving object positioning. Ten labels of different colors (excluding black and white) are first attached to the fingers of the player, and a video of the player playing the piano is shot. A finger moving target region is then detected in the input video frames by space-time filtering and the labels are extracted; the finger moving targets are classified and positioned with joint space Kmeans, realizing fingering identification and recording.
The labeling and video shooting module generates the video file processed by the subsequent modules: ten labels of different colors (excluding black and white) are first attached to the fingers of the player, and the playing process is shot as a video while the player plays the piano normally.
The moving object detection module is used for detecting the moving target and adopts a space-time filtering method. Spatial filtering is first applied to the input video frame to obtain an accurate moving target region. The spatial filtering result is then fed back to guide the spatial recombination of the temporal band-pass filtering result and the temporal low-pass filtering result at the foreground (finger movement) positions and the background positions, completing the dynamic background update; this overcomes the influences of illumination change, camera shake and background change, and effectively avoids missing the moving target when the fingers move slowly. The finger moving target detection result is then converted from RGB space to YCrCb and HSV space for band-pass filtering, skin color and shadow are removed, and the labels are extracted by foreground thresholding.
The specific implementation steps of moving object detection are shown in fig. 2.
Step 1: spatial filtering, comprising the steps of:
1.1 Search for the moving target region. The current input video frame and the background image are compared pixel by pixel in the spatial domain to find the moving target region.
1.2 Determine foreground and background. Pixels in the moving target region keep the values of the corresponding positions in the current input frame, and pixels in the background region are set to white.
1.3 Feed back foreground and background. The foreground (moving target region) and background are fed back for the background update of the next frame.
Step 2: dynamic background update, comprising the steps of:
2.1 Spatial filtering result feedback. The previous spatial filtering result is fed back to guide the dynamic background update. Judge whether the current input video frame is the 2nd frame: if so, the background is not updated and the first frame image is used directly as the background; otherwise, proceed to the next operation.
2.2 Spatial domain recombination. The temporal band-pass filtering result and the temporal low-pass filtering result are recombined in the spatial domain at the foreground (finger movement) positions and the background positions to complete the background update.
Step 3: label extraction, comprising the steps of:
3.1 Remove skin color. Convert RGB space to YCrCb space and judge whether the coordinates (Cr, Cb) fall inside the skin color ellipse model. Any pixel inside the ellipse is set to white.
3.2 Remove shadow. Convert RGB space to HSV space and apply band-pass filtering to the V component histogram.
3.3 Judge the label. In HSV space, compute the foreground average threshold of the S component, and set to white those pixels of the extracted moving target whose S component is below the foreground average saturation threshold.
The moving target positioning module is used for positioning the moving target and adopts a joint space Kmeans method. Joint space Kmeans not only positions but also classifies, so that different fingers are matched correctly to label classes, and positioning errors caused by color overlap, fingering complexity and noise interference are effectively avoided. First, the peaks of the low-pass-filtered R, G, B component histograms are detected and the cluster number K is determined adaptively, making classification more accurate and intelligent. The clusters are then initialized adaptively from the histogram statistics, which avoids falling into local optima, accelerates iterative convergence, and improves the efficiency and accuracy of the algorithm. Joining the color space (R, G, B) and the geometric space (x, y) into a 5-dimensional Kmeans fully exploits the prior knowledge that pixels of the same color lie at similar positions, improving clustering and positioning accuracy. Random perturbation and simulated annealing of the cluster centers improve the stability of the algorithm while avoiding local optima. Finally, the clustering result is classified and positioned, and the position of the fingers on the keyboard is determined for each frame, thereby obtaining the player's fingering.
The specific implementation steps of moving target positioning are shown in fig. 3.
Step 1: joint spatial adaptive Kmeans comprising the steps of:
1.1 statistics R, G, B histogram characteristics. And (4) performing low-pass filtering on the three component histograms of the moving object detection result R, G, B, and judging the peak of the histogram in a self-adaptive manner.
1.2 adaptively determining the clustering number K. The maximum number of peaks in the R, G, B histogram is taken as the cluster number of the joint space Kmeans.
1.3 adaptive clustering initialization. The cluster center is initialized with the R, G, B histogram peak locations.
1.4 iterate until convergence. The following operations are repeated until convergence: (a) class centers for the K classes are calculated, respectively. The class center of the kth (K is more than or equal to 1 and less than or equal to K) is the mean vector of the 5-dimensional observation (R, G, B, x, y) vectors in the kth class. (b) Each observation is assigned to the class in which the closest class center is located.
Step 2: random perturbation and simulated annealing, comprising the steps of:
2.1 Compute the 5-dimensional perturbation radius of each class. The distance from each class center to the farthest point of that class is taken as the perturbation radius r_k (a five-dimensional vector; K is the number of clusters).
2.2 Random perturbation. Take a random number random_0 between -1 and 1 and perturb the class centers by r_k * random_0. Take the perturbed class centers as new initial class centers and run joint space adaptive Kmeans again. Compute the difference ΔJ = J' - J between the new objective function J' and the current objective function J. If ΔJ < 0, accept the new solution as the current solution and update the perturbation radius.
2.3 Simulated annealing. Modify the random number participating in the perturbation to random_0 * a^(-t), where a is the annealing rate (a > 1) and t is the number of annealing steps, then carry out the operations of 2.1 and 2.2 again.
Step 3: fingering identification, comprising the steps of:
3.1 Locate the moving targets. Determine the position of each finger on the keyboard in every video frame from the coordinates of the joint space adaptive Kmeans cluster centers, thereby obtaining the fingering.
3.2 Fingering output. The fingering of every video frame is stored uniformly in a csv file for subsequent fingering learning and research.
The invention can thus be realized with the described effects, and with appropriate modification the embodiments can be widely applied to gesture recognition and other fields.

Claims (1)

1. A finger motion detection and positioning method based on space-time filtering and joint space adaptive Kmeans, characterized in that the method is realized by a labeling and video shooting module, a moving object detection module and a moving object positioning module;
the labeling and video shooting module is used for generating the video file processed by the subsequent modules, and comprises: first attaching ten labels of different colors, except black and white, to the fingers of a player, and shooting the playing process as a video while the player plays the piano normally;
the moving object detection module is used for detecting a moving object, and the specific implementation steps comprise:
step 1, spatial filtering, comprising the following steps:
1.1 searching for the moving target region: comparing the current input video frame with the background image pixel by pixel in the spatial domain to find the moving target region;
1.2 determining foreground and background: keeping, for pixels in the moving target area, the values of the corresponding positions of the current input video frame, and setting pixels of the background area to white, wherein white is (255, 255, 255) in RGB space;
1.3 feeding back foreground and background: the foreground, namely the moving target area, and the background are fed back for the background update of the next frame;
step 2, dynamic background updating, comprising the following steps:
2.1 spatial filtering result feedback: the previous spatial filtering result is fed back to guide the dynamic background update, and whether the current input video frame is the 2nd frame image is judged; if the current input frame is the 2nd frame, the background is not updated and the first frame image is used directly as the background; if the current input frame is not the 2nd frame, the next operation is performed;
2.2 spatial domain recombination: performing spatial-domain recombination of the temporal band-pass filtering result and the temporal low-pass filtering result at the foreground, namely finger movement, positions and at the background positions, so as to complete the background update;
step 3: extracting the label, comprising the steps of:
3.1 removing skin color: converting RGB space to YCrCb space and judging whether the coordinates (Cr, Cb) fall inside the skin color ellipse model; if a pixel falls inside the skin color ellipse model, that pixel is set to white;
3.2 removing shadow: converting RGB space to HSV space and applying band-pass filtering to the V component histogram;
3.3 judging the label: in HSV space, computing the foreground average threshold of the S component, and setting to white those pixels of the extracted moving target whose S component is below the foreground average saturation threshold;
the moving target positioning module is used for positioning the moving target and is realized with joint space adaptive Kmeans, the specific implementation steps comprising:
step 1, joint space adaptive Kmeans, comprising the steps of:
1.1 computing R, G, B histogram statistics: performing low-pass filtering on the three component histograms (R, G, B) of the moving target detection result, and adaptively detecting the histogram peaks;
1.2 adaptively determining the cluster number K: taking the maximum number of peaks among the R, G, B histograms as the cluster number of the joint space adaptive Kmeans;
1.3 adaptive cluster initialization: initializing the cluster centers from the R, G, B histogram peak positions;
1.4 iterating until convergence: repeating the following operations until convergence: (a) respectively computing the class centers of the K classes, the center of the k-th class being the mean vector of the 5-dimensional observation vectors (R, G, B, x, y) in the k-th class, with 1 ≤ k ≤ K; (b) assigning each observation to the class whose center is closest, distance being measured by the Euclidean distance;
step 2: random perturbation and simulated annealing, comprising the steps of:
2.1 computing the 5-dimensional perturbation radius of each class: taking the distance from the class center of each class to the farthest point of that class as the perturbation radius r_k, r_k being a five-dimensional vector, K being the number of clusters;
2.2 random perturbation: taking a random number random_0 between -1 and 1 and perturbing the class centers by r_k * random_0; taking the perturbed class centers as new initial class centers, performing joint space adaptive Kmeans again, and computing the difference ΔJ = J' - J between the new objective function J' and the current objective function J; if ΔJ < 0, accepting the new solution as the current solution, updating the perturbation radius, and entering step 3; otherwise entering step 2.3;
2.3 simulated annealing: modifying the random number participating in the perturbation to random_0 * a^(-t), where a is the annealing rate, a > 1, and t is the number of annealing steps, and continuing with the operations of 2.1 and 2.2;
step 3: fingering identification, comprising the steps of:
3.1 locating the moving target: determining the position of each finger on the keyboard in every video frame from the coordinates of the joint space adaptive Kmeans cluster centers, so as to obtain the fingering;
3.2 fingering output: storing the fingering of every video frame uniformly in a csv file for subsequent fingering learning and research.
CN201710231824.3A 2017-04-10 2017-04-10 Finger motion detection and positioning method based on space-time filtering and joint space Kmeans Expired - Fee Related CN107180224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710231824.3A CN107180224B (en) 2017-04-10 2017-04-10 Finger motion detection and positioning method based on space-time filtering and joint space Kmeans

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710231824.3A CN107180224B (en) 2017-04-10 2017-04-10 Finger motion detection and positioning method based on space-time filtering and joint space Kmeans

Publications (2)

Publication Number Publication Date
CN107180224A CN107180224A (en) 2017-09-19
CN107180224B 2020-06-19

Family

ID=59830915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710231824.3A Expired - Fee Related CN107180224B (en) 2017-04-10 2017-04-10 Finger motion detection and positioning method based on space-time filtering and joint space Kmeans

Country Status (1)

Country Link
CN (1) CN107180224B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG11201909139TA (en) 2017-12-22 2019-10-30 Beijing Sensetime Technology Development Co Ltd Methods and apparatuses for recognizing dynamic gesture, and control methods and apparatuses using gesture interaction
CN109960980B (en) * 2017-12-22 2022-03-15 北京市商汤科技开发有限公司 Dynamic gesture recognition method and device
CN109063781B (en) * 2018-08-14 2021-12-03 浙江理工大学 Design method of fuzzy image fabric imitating natural color function and form
CN109451634B (en) * 2018-10-19 2020-11-03 厦门理工学院 Gesture-based electric lamp control method and intelligent electric lamp system thereof
CN111105398A (en) * 2019-12-19 2020-05-05 昆明能讯科技有限责任公司 Transmission line component crack detection method based on visible light image data
WO2022052941A1 (en) * 2020-09-09 2022-03-17 桂林智神信息技术股份有限公司 Intelligent identification method and system for giving assistance with piano teaching, and intelligent piano training method and system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102368290A (en) * 2011-09-02 2012-03-07 华南理工大学 Hand gesture identification method based on finger advanced characteristic
CN105335711A (en) * 2015-10-22 2016-02-17 华南理工大学 Fingertip detection method in complex environment

Also Published As

Publication number Publication date
CN107180224A (en) 2017-09-19


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200619