CN112329656A - Feature extraction method for human action key frame in video stream - Google Patents

Feature extraction method for human action key frame in video stream

Info

Publication number
CN112329656A
CN112329656A (application CN202011246020.9A)
Authority
CN
China
Prior art keywords
frame
motion
calculation
video stream
mhi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011246020.9A
Other languages
Chinese (zh)
Other versions
CN112329656B (en)
Inventor
宋玲
夏智敏
陈燕
叶进
石森煌
王立颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University filed Critical Guangxi University
Priority to CN202011246020.9A priority Critical patent/CN112329656B/en
Publication of CN112329656A publication Critical patent/CN112329656A/en
Application granted granted Critical
Publication of CN112329656B publication Critical patent/CN112329656B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/28 - Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/50 - Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis

Abstract

The invention discloses a method for extracting features of human action key frames in a video stream. The method processes video stream data with an improved motion history image (MHI) based on combining a Gaussian kernel function with equal-interval frame-distance sampling, so that not every frame of the stream needs to be analyzed; the improvement effectively smooths the gray-value changes in the motion history image MHI and makes it more robust. Image features are extracted through the histogram of oriented gradients (HOG), and a nearest-neighbor (NN) classifier then detects whether the action state label has changed and extracts the action key frames according to that change. The invention can smoothly extract action key frames from a video stream while meeting the action classification accuracy.

Description

Feature extraction method for human action key frame in video stream
Technical Field
The invention relates to the technical field of target identification, in particular to a method for extracting characteristics of human action key frames in video streams.
Background
Extracting key frames that capture changes in human body motion from video streams has been a research focus in recent years. Video stream data are characterized by a large amount of information, a complex data structure and strict time-sequence properties; how to analyze the change of the human action state from a video file or from real-time camera data and obtain the corresponding action key frame data is the most critical problem in human key point detection.
Disclosure of Invention
The invention provides a feature extraction method of human action key frames in video streams, which can quickly and accurately extract the action key frames in the video streams.
In order to solve the problems, the invention is realized by the following technical scheme:
a method for extracting features of human action key frames in video stream includes the following steps:
step 1: acquiring a calculation frame from video stream data by using an equal interval sampling method;
step 2: generating a motion history image and performing motion segmentation on the calculation frame by using an improved motion history image algorithm based on a Gaussian kernel function, and separating the human motion foreground from the background to obtain the motion history image;
on the basis of the traditional motion history image algorithm, the improved motion history image algorithm based on the Gaussian kernel function increases or decreases the gray value of the calculation frame at the current moment by comparing the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence, namely:
if the difference between the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence is greater than or equal to the set gray threshold, the gray value of the pixel point of the calculation frame at the current moment is increased by a Gaussian-kernel weighted amount:
H(x,y,t) = H(x,y,t-Δt) + 255·exp(-Δt²/(2ω²))/(√(2π)·ω)
wherein ω represents the set frame influence factor, t represents the current time, and Δt represents the time difference between the calculation frame at the current time and the comparison frame in the time sequence;
if the difference between the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence is smaller than the set gray threshold, the gray value of the pixel point of the calculation frame at the current moment is reduced by the rated attenuation coefficient σ;
step 3: describing the motion information of the contour edges of the motion history image by using the histogram of oriented gradients feature, and extracting the image features of the calculation frame;
step 4: carrying out motion recognition on the image features by using the NN classifier, and outputting the calculation frame at the current moment as an action key frame when the motion state labels of the calculation frame at the current moment and the comparison frame in the time sequence differ.
Before step 1, the method further comprises: when the video stream data are collected, preprocessing the video stream data with a median filter to eliminate noise.
The comparison frames in the above time sequence are obtained from the video stream data by the equal-interval sampling method.
The sampling interval of the comparison frames in the above time sequence is equal to or greater than the sampling interval of the calculation frames.
In step2, the set grayscale threshold is 127.
In step2, the attenuation coefficient σ is 30.
Compared with the prior art, the invention provides an algorithm (GMHKE) for extracting video stream key frames based on the Gaussian-kernel-improved MHI and HOG features. The algorithm processes video stream data with an MHI improved by combining a Gaussian kernel function with equal-interval frame-distance sampling, so that not every frame of the stream needs to be analyzed; it effectively smooths the gray-value changes in the motion history image MHI and makes the MHI more robust, extracts image features through HOG, and uses an NN classifier to detect whether the action state label has changed and to extract action key frames according to that change. The invention can smoothly extract action key frames from a video stream while meeting the action classification accuracy.
Drawings
Fig. 1 is a flowchart of a method for extracting features of a human motion key frame in a video stream.
Fig. 2 shows the selection of attenuation coefficients in GMHI.
FIG. 3 is a diagram of action samples, (a) Walk behavior, (b) Run behavior, and (c) Collapse behavior.
Fig. 4 shows the experimental results of test set 1, (a) original video, (b) MHI, and (c) HOG features.
Fig. 5 shows the results of the test set 2 experiment, (a) raw video, (b) MHI, and (c) HOG features.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to specific examples.
Referring to fig. 1, a method for extracting features of human motion key frames (GMHKE) in a video stream includes the following steps:
step1, preprocessing the video stream data to eliminate noise by using a median filter, and then acquiring a calculation frame from the video stream data by using an equal-interval sampling method.
Step 2, generating a motion history image and performing motion segmentation on the calculation frame by using an improved motion history image algorithm based on a Gaussian kernel function, separating the human motion foreground from the background to obtain the motion history image.
On the basis of the traditional motion history image algorithm, the improved motion history image algorithm based on the Gaussian kernel function increases or decreases the gray value of the calculation frame at the current moment by comparing the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence, namely:
When the difference between the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence is greater than or equal to the set gray threshold, the gray value of the pixel point of the calculation frame at the current moment is increased by a Gaussian-kernel weighted amount:
H(x,y,t) = H(x,y,t-Δt) + 255·exp(-Δt²/(2ω²))/(√(2π)·ω)
where ω represents the set frame influence factor, t represents the current time, and Δt represents the time difference between the calculation frame and the comparison frame in the time sequence.
When the difference between the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence is smaller than the set gray threshold, the gray value of the pixel point of the calculation frame at the current moment is reduced by the rated attenuation coefficient σ.
The comparison frames in the time sequence are obtained from the video stream data by the equal-interval sampling method; that is, the comparison frame in the time sequence changes over time, and the time interval between every two adjacent comparison frames is the same. The sampling interval of the comparison frames is equal to or greater than the sampling interval of the calculation frames. When the two sampling intervals are equal, the comparison frame of the calculation frame at the current moment is the calculation frame at the previous moment. When the sampling interval of the comparison frames is larger, the comparison frame of the calculation frame at the current moment is a calculation frame several moments earlier.
Step 3, describing the motion information of the contour edges of the motion history image by using HOG (histogram of oriented gradients) features, and extracting the image features of the calculation frame.
Step 4, performing motion recognition on the image features by using the NN classifier, and outputting the calculation frame at the current moment as an action key frame when the motion state labels of the calculation frame at the current moment and the comparison frame in the time sequence differ.
The following is a description of the related art to which the present invention relates:
(1) data pre-processing
During the acquisition of video data, changes in illumination intensity and angle and the presence of environmental noise introduce interference into the processing and recognition of the images, so the images need to be preprocessed to eliminate noise and improve image quality. A good preprocessing result improves the accuracy and speed of the subsequent operations.
Median filtering is a nonlinear method: the image is filtered with a sliding window whose side length is an odd number of pixels, e.g. (2n+1) × (2n+1), the gray values of the pixels inside the window are sorted, and the gray value at the window center is replaced by their median. The method smooths impulse noise while protecting the integrity of image edges and performs well in repairing salt-and-pepper noise.
A median filter with a 15 × 15 window is used to remove salt-and-pepper noise. Let the coordinates of a pixel of the original image be (x, y), its gray value f(x, y) and its gray value after median filtering g(x, y); the relation is given by formula (1):
g(x, y) = median{ f(x-k, y-j), (k, j) ∈ W }   (1)
where W is the filter window.
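As an illustration only, the preprocessing of formula (1) can be sketched in Python with OpenCV; the 15 × 15 window follows the description, while the file name and the use of cv2.medianBlur are assumptions of this sketch, not part of the original disclosure.

import cv2

# Read one frame of the video stream as a gray image (the path is a placeholder).
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Median filtering with a 15 x 15 sliding window: each pixel is replaced by the
# median gray value inside the window, which smooths salt-and-pepper noise while
# preserving the image edges, as described by formula (1).
denoised = cv2.medianBlur(frame, 15)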
(2) Extracting calculation frames by the equal-interval frame-distance method
Because the pixel overlap between two adjacent frames of video stream data is high, including every frame before a given moment in the MHI calculation makes the pixel-point changes dense, greatly increases the amount of gray-value computation and affects the subsequent feature extraction. Since the MHI describes the relative intensity change of pixel points in the data stream, the time base of that change needs to be fixed; extracting video frames with methods such as clustering is usually adaptive, so the sampling time interval generally varies with the computation result. The calculation frames are therefore extracted with an equal frame distance, i.e. within the rated frame range one calculation frame is sampled every 5 frames (a 1/5 sampling rate).
An MHI sampled at an equal-interval frame distance contains clearer motion information, because it avoids the extraction of redundant data, reduces the influence of pixel overlap between adjacent frames and sparsifies the human action, which makes the motion features more obvious.
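A minimal sketch of the equal-interval frame-distance sampling described above, assuming OpenCV, a sampling step of 5 frames as in the 1/5 sampling rate mentioned in the description, and a placeholder video path:

import cv2

SAMPLING_STEP = 5  # one calculation frame every 5 frames (equal frame distance)

cap = cv2.VideoCapture("input_video.mp4")
calc_frames = []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % SAMPLING_STEP == 0:
        # keep only the sampled frames, converted to gray for the MHI calculation
        calc_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
    index += 1
cap.release()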
(3) Improved motion history image (GMHI) based on the Gaussian kernel function
A change of the human motion state can only be judged after accumulating a certain amount of time-series data frames. Human actions form a space-time shape in the space-time volume of an action video, and the motion sequence of a person can be described by a single MHI image. The motion history image (MHI) observes the time-series data of the video and represents the motion of each object in the video stream by brightness attenuation, encoding how long ago the motion occurred, so the MHI can be used to judge changes of the motion state. In the original MHI algorithm the value of a pixel, once changed, decays directly from the full-brightness value; this is sensitive to environmental noise and other external interference, and flicker, winged insects and similar conditions in the camera lens or the video stream data greatly disturb the correct formation of the MHI.
The Gaussian function replaces the value of a pixel with the weighted mean of its neighborhood, and the weight of each neighboring pixel point decreases monotonically with its distance from the center point, so that the value of the pixel point can be smoothed effectively, as shown in formula (2):
g(x) = exp(-x²/(2ω²)) / (√(2π)·ω)   (2)
wherein 1/(√(2π)·ω) represents the height of the curve of the Gaussian function and 2ω² represents the coordinate range of the Gaussian function.
The traditional Gaussian function is applied in the spatial domain of pixels, whereas the invention applies it in the time-sequence domain: in time-ordered video stream data, pixel values closer to the current frame have a higher influence weight on that frame, and vice versa. The GMHI algorithm is obtained by using the Gaussian function as the kernel coefficient to improve the update function of the MHI; only pixel points whose values change frequently receive the gradual-increase operation, which avoids the influence of small-range mutation and flicker on the formation of the MHI and gives higher robustness and accuracy.
GMHI provides motion foreground information by recording and encoding the changes at pixel points, and the update formula of the gray value H(x, y, t) of the pixel (x, y) of the t-th calculation frame is as follows:
H(x,y,t) = Ψ(x,y,t)                  if D(x,y,t) ≥ gray threshold
H(x,y,t) = max(0, H(x,y,t-Δt) - σ)   if D(x,y,t) < gray threshold   (3)
wherein Ψ(x, y, t) represents the update function of the pixel (x, y) of the t-th calculation frame, and σ represents a given attenuation coefficient.
Different parameter values affect the expression of the motion history image: the closer to the current moment, the brighter the pixel brightness change, which is convenient for feature extraction, and increasing the brightness of pixel values in an accumulative way gives good noise suppression. The update function Ψ(x, y, t) of the pixel (x, y) of the t-th calculation frame is:
Ψ(x,y,t) = H(x,y,t-Δt) + 255·exp(-Δt²/(2ω²))/(√(2π)·ω)   (4)
where D (x, y, t) is the difference between the gray-level values of the corresponding pixels (x, y) of the tth frame calculation frame and the comparison frame in time series (i.e., the tth- Δ t frame calculation frame):
D(x,y,t)=|H(x,y,t)-H(x,y,t-Δt)| (5)
wherein, H (x, y, t) represents the gray scale value of the pixel (x, y) of the t-th frame calculation frame, and H (x, y, t- Δ t) represents the gray scale value of the pixel (x, y) of the comparison frame in time sequence of the t-th frame calculation frame.
A Gaussian kernel function is added to smooth the calculation: when the change amplitude D(x, y, t) of the gray value of a pixel point exceeds the gray threshold 127, the Gaussian kernel, weighted over the time-sequence neighborhood of the pixel, is used to increase the gray value gradually, the influence weight of the gray value being determined by the time-sequence distance Δt. Compared with the linear change in the original MHI function, this preserves the contour of the human motion track better and reduces the influence of noise and scene interference factors.
After the parameters of the Gaussian kernel function are determined through experiments, the calculation of the whole MHI only requires selecting the attenuation coefficient σ, whose choice determines how fast the gray value of a pixel decays. As shown in fig. 2, when the attenuation coefficient is too large, the MHI image can only record motions with a large amplitude, and when it is too small, the direction of the motion becomes difficult to judge. Attenuation coefficients of 10, 20, 30, 40 and 50 were compared, the length of the recorded motion trail varying accordingly; the experiments show that the MHI image reaches a relatively optimal state when the attenuation coefficient is chosen as 30.
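A minimal NumPy sketch of one GMHI update step in the spirit of formulas (2)-(5); the value of the frame influence factor ω and the scaling of the Gaussian increment to the 0-255 gray range are assumptions of this sketch rather than values fixed by the description.

import numpy as np

GRAY_THRESHOLD = 127  # set gray threshold from the description
SIGMA = 30            # rated attenuation coefficient chosen in the experiments
OMEGA = 2.0           # frame influence factor (assumed value for illustration)

def gmhi_update(mhi_prev, frame_t, frame_prev, delta_t):
    """One GMHI update: mhi_prev is the previous motion history image,
    frame_t and frame_prev are the calculation frame at the current moment and
    its comparison frame in the time sequence (float arrays in [0, 255])."""
    # Formula (5): gray-value difference between the two frames.
    d = np.abs(frame_t - frame_prev)
    # Gaussian kernel in the time-sequence domain (formula (2)).
    gauss = np.exp(-(delta_t ** 2) / (2.0 * OMEGA ** 2)) / (np.sqrt(2.0 * np.pi) * OMEGA)
    # Formulas (3)-(4): Gaussian-weighted increase where the change reaches the
    # threshold, decay by the rated attenuation coefficient elsewhere.
    increased = np.clip(mhi_prev + 255.0 * gauss, 0.0, 255.0)
    decayed = np.clip(mhi_prev - SIGMA, 0.0, 255.0)
    return np.where(d >= GRAY_THRESHOLD, increased, decayed)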
The basic steps of the GMHI algorithm are described as follows:
inputting: video stream data
And (3) outputting: exercise history map
step1, obtaining a calculation frame from video stream data by using an equal-interval sampling method;
step2. Update the calculation frame using formula (3): when the gray-value change between a pixel point of the calculation frame and the comparison frame in the time sequence reaches the gray threshold, accumulate the gray value of the pixel point; when the change is smaller than the gray threshold, decrease the gray value by the rated attenuation coefficient; the MHI image is obtained accordingly.
step3. repeat step1 to step2 until the video stream data stops being input;
step4. the algorithm ends.
The calculated MHI represents the foreground sequence in a compact manner: the sequence of contours belonging to an action is compressed into a single gray-scale image in which the most recent motion is represented by the lighter gray-value pixels, preserving the main motion information. The MHI is centered on the centroid of the detected foreground and scaled to a fixed size so as to obtain a scale- and location-invariant representation. An illumination- and contrast-invariant representation is obtained by dividing each pixel of MHIτ by the sum of all its pixels so that the result sums to one, as shown in formula (6):
MHI'τ(x, y) = MHIτ(x, y) / ( Σ_{i=1}^{M} Σ_{j=1}^{N} MHIτ(i, j) )   (6)
where M is the total number of rows of pixels in the MHI image and N is the total number of columns.
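A possible NumPy reading of the normalization in formula (6), assuming mhi holds the M × N motion history image as a float array:

import numpy as np

def normalize_mhi(mhi):
    # Divide every pixel by the sum over all M x N pixels so that the result
    # sums to one, giving the illumination- and contrast-invariant representation.
    total = mhi.sum()
    return mhi / total if total > 0 else mhi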
(4) Description of HOG features
The HOG feature descriptor is widely used for object detection and human action recognition in computer vision and describes the motion information of the contour edges in an MHI well. The gradients of the MHI are computed with the common one-dimensional operator [-1, 0, 1] and its transpose. According to the motion foreground obtained by segmentation, each MHI is scaled to a gray image of 48 × 104 pixels; for HOG feature extraction the cell size is therefore chosen as 4 × 4 pixels and the block size as 8 × 8 pixels. The gradient orientations of a cell are divided over 360° into 8 direction bins of 45° each, and L2 normalization is performed according to the gradient information and directions. Each block contains 4 cells with 8 direction bins each, i.e. a 32-dimensional feature vector; with 8 blocks in the horizontal direction and 6 blocks in the vertical direction, and the block as the scanning step, each image yields a 1536-dimensional feature vector.
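With the parameters listed above (48 × 104 gray image, 4 × 4-pixel cells, 8 × 8-pixel blocks, 8 orientation bins, L2 normalization), a feature vector of this kind could be computed with scikit-image as sketched below. Note that skimage's hog uses unsigned gradients and a one-cell block stride, so the 360° binning and the 1536-dimensional length reported in the description are only approximated; this is an illustrative sketch, not the inventors' implementation.

import cv2
import numpy as np
from skimage.feature import hog

def extract_hog(mhi):
    # Scale the segmented motion foreground to a 48 x 104 gray image.
    resized = cv2.resize(mhi.astype(np.uint8), (48, 104))
    # 8 orientation bins, 4 x 4-pixel cells, 2 x 2 cells (8 x 8 pixels) per block,
    # L2 block normalization.
    return hog(resized,
               orientations=8,
               pixels_per_cell=(4, 4),
               cells_per_block=(2, 2),
               block_norm='L2')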
(5) Change of motion state
A simple and efficient nearest-neighbor (NN) classifier is used to obtain the closest classification label; because of its simplicity it can be applied to a large number of classification problems and effectively classifies large action classes. During training and testing on the data set, the Euclidean distance between the HOG feature values obtained from the MHI images is calculated as in formula (7), the action label of the closest training sample is assigned to the test sample, and finally the matching success rate between action labels and samples is calculated. When the action label changes, the current frame is saved as a picture file and exported to the target folder.
d(p, q) = √( Σ_{i=1}^{n} (p_i - q_i)² )   (7)
Where p and q are HOG vectors for test and training samples and n is the length of the vector.
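A minimal nearest-neighbor step matching formula (7), assuming train_features and train_labels hold the HOG vectors and action labels of the training samples:

import numpy as np

def classify_nn(test_feature, train_features, train_labels):
    # Euclidean distance (formula (7)) between the test HOG vector and every
    # training HOG vector; the action label of the closest sample is returned.
    distances = np.linalg.norm(train_features - test_feature, axis=1)
    return train_labels[int(np.argmin(distances))]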
(6) Action key frame extraction based on the Gaussian-kernel-improved MHI and HOG features
The steps of the GMHKE algorithm are described as follows (an illustrative code sketch follows the step list):
inputting: video stream data
And (3) outputting: action key frame in picture form
step1, obtaining a calculation frame from video stream data by using an equal-interval sampling method;
step2, generating a historical motion map and performing motion segmentation on the calculation frame by using GMHI of a Gaussian kernel function, and separating a human motion foreground from a background;
step3, using HOG feature extraction to obtain image features in the calculation frame;
step4, using an NN classifier to perform motion recognition on the features, and if the motion state label is changed, exporting the current frame as a picture to be output as a motion key frame;
step5. Repeat steps 1 to 4 until no new video stream data is input.
step6. the algorithm ends.
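Putting the pieces together, a driver loop in the spirit of step1-step6 might look as follows; gmhi_update, extract_hog and classify_nn are the illustrative helpers sketched earlier, train_features and train_labels are assumed to hold the HOG vectors and labels of the training MHIs, and the video path and output file names are placeholders.

import cv2
import numpy as np

SAMPLING_STEP = 5
cap = cv2.VideoCapture("input_video.mp4")

mhi = None
prev_frame = None
prev_label = None
key_frame_id = 0
index = 0
while True:
    ok, frame = cap.read()
    if not ok:                        # step5/step6: stop when the stream ends
        break
    if index % SAMPLING_STEP == 0:    # step1: equal-interval calculation frame
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev_frame is not None:
            # step2: update the Gaussian-kernel motion history image
            if mhi is None:
                mhi = np.zeros_like(gray)
            mhi = gmhi_update(mhi, gray, prev_frame, SAMPLING_STEP)
            # step3: HOG features of the motion history image
            features = extract_hog(mhi)
            # step4: nearest-neighbor action label; export a key frame whenever
            # the motion state label changes
            label = classify_nn(features, train_features, train_labels)
            if prev_label is not None and label != prev_label:
                cv2.imwrite("keyframe_%d.png" % key_frame_id, frame)
                key_frame_id += 1
            prev_label = label
        prev_frame = gray
    index += 1
cap.release()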
Suppose a motion history image is calculated with an interval of M video-stream frames, N is the number of samples taken by the Gaussian kernel function combined with equal-interval frame-distance sampling, C is the dimension of the feature vector extracted by the HOG features, and n is the total number of actions in the data set. The complexity of the proposed GMHKE algorithm is O(2MNC²/n), while the complexity of action key frame extraction using the original MHI is O(2MNC²); the complexity analysis shows that the GMHKE algorithm is n times less complex than key frame extraction with the original MHI.
2. Experiment and analysis of results
(1) Human motion recognition database
The experiments use the MuHAVi database, a multi-view data set widely applied in the field of human motion recognition. The MuHAVi database uses 8 cameras at angles of 0° and 45° to record seven testers performing 14 human motion behaviors in different scenes, capturing 136 video files in total. The behaviors comprise Collapse left, Collapse right, Guard to Kick, Guard to Punch, Kick right, Punch right, Run left to right, Run right to left, Stand up left, Stand up right, Turn back left, Turn back right, Walk left to right and Walk right to left. These motion behaviors can be further grouped into eight classes, Collapse, Guard, Kick, Punch, Run, Stand up, Turn Back and Walk; for example, Run left to right and Run right to left are grouped into the Run action class. Samples of the motion behaviors in the video database are shown in fig. 3.
The experiments are based on the MuHAVi video data set and verify whether the GMHKE algorithm can effectively extract human action key frames. To show the performance of the algorithm, cross-validation is adopted: the video data of one tester are selected as the validation set and the video data of the remaining six testers are used as the training set to test the classifier.
(2) Experimental platform configuration
The server configuration used in this experiment was as follows:
Hardware platform: CPU, Intel(R) Core(TM) i5-8250U @ 1.60 GHz (1.80 GHz); GPU, NVIDIA GeForce GTX 1050 with Max-Q Design, 4 GB; solid-state disk, 256 GB; cloud platform, Titan XP, E5-1620 8-core, 32 GB, 2 TB hard disk.
Software platform: system environment, Windows 10 Home Chinese edition; CUDA 9.0; Python 3.5; TensorFlow 1.12.0; Keras 2.0.8; Jupyter Notebook.
data set: a MuHAVi video data set; youku network video data set.
(3) Simulation experiment results and analysis
1) Classification accuracy and efficiency of GMHKE algorithm
The GMHKE algorithm and the original MHI method are trained and validated by cross-validation on the 136 video sequences of the MuHAVi video data set. The accuracy and time consumption of GMHKE and of the original MHI are shown in Table 1. The GMHKE algorithm reaches an accuracy of 93.1%, against 91.9% for the original MHI, an overall improvement of 1.2 percentage points. In terms of time, training and testing the eight action classes of the 136 video sequences with the GMHKE algorithm uses 80873 frames in total and produces 2696 MHI images in 3197 s, i.e. 1.18 s per MHI image, whereas the original MHI has to process every frame, consumes 15175 s, and most of the MHI images it obtains are redundant data.
TABLE 1 accuracy and time consumption of GMHKE and original MHI
2) Recognition rate of the GMHKE algorithm for different behaviors
Table 2 shows the recognition rates of the 8 behavior classes in the 136 video sequences of the MuHAVi video data set obtained by the GMHKE algorithm and by the original MHI with the same HOG feature extraction. The recognition rate P(N) and the misrecognition rate P(N|M) of a behavior class are given by formulas (8) and (9):
P(N) = R(N) / Q(N)   (8)
P(N|M) = R(N|M) / Q(N)   (9)
where R(N) represents the number of class-N behaviors correctly recognized by the GMHKE algorithm, Q(N) is the total number of class-N behaviors labeled in the validation set, and R(N|M) is the number of class-N behaviors incorrectly recognized as class M.
As can be seen from the data in Table 2, the GMHKE algorithm improves the recognition rate most obviously for Collapse, Walk, Run and Kick, and less for Stand up, Guard, Punch and Turn Back. The reason is that extracting calculation frames at an equal frame distance makes the motion features more obvious, whereas in Stand up and Guard the human body is in a similar, low-motion, partly static state, so the two motion states are easily confused and the improvement is smaller; the Punch recognition rate is already good and leaves little room for improvement; and the Turn Back action is mostly small-scale self-motion, so the equal-interval frame-distance method may lose part of the motion information, making its recognition rate slightly lower than with the original MHI.
TABLE 2 Behavior recognition rates of the GMHKE algorithm and the original MHI
Tables 3 and 4 are the behavior recognition confusion matrices of the original MHI and of GMHKE. The diagonal of the matrix gives the probability that the m-th behavior is correctly recognized, and the entry in the n-th row of the m-th column gives the probability that the m-th behavior is recognized as the n-th behavior; when m is not equal to n this is a misrecognition. For example, the value 0.922 in the first row of the first column means that the Collapse behavior is correctly recognized as Collapse with an accuracy of 92.2%, and the value 0.070 in the third row of the first column means that the Collapse behavior is incorrectly recognized as Kick in 7% of the cases.
Table 3 confusion matrix of original MHI algorithm
TABLE 4 confusion matrix for GMHKE algorithm
3) Accuracy of the GMHKE algorithm in extracting action key frames
Each time, five video segments are randomly extracted from the MuHAVi validation set and the accuracy of action key frame extraction is averaged over three repetitions; five groups of experiments are completed in total. Table 5 shows the extraction rate obtained by comparing the number of action key frames extracted by the GMHKE algorithm with the number of action-change frames labeled in the data set. The extraction rule for an action key frame is: if the motion classification label of the current frame is detected to have changed with respect to the last action key frame, the current frame is taken as the latest action key frame and exported as a picture; otherwise the next round of detection continues.
TABLE 5 Key frame extraction accuracy of the GMHKE algorithm
4) Performance of the GMHKE algorithm on Internet test sets
Because the training set and the verification set belong to the same database, data sets from the Internet (Youku) are selected for testing to avoid over-fitting. Test set 1 is derived from a demonstration video of fitness movements containing 12 key movements, and test set 2 is derived from a demonstration video of the Keep fitness software containing 20 key movements. Tables 6-7 and figs. 4-5 show the results on the test sets. With the proposed method, the key frames at which the motion state changes can be extracted with high accuracy from the input video sequences of the test sets.
TABLE 6 accuracy of GMHKE algorithm in test set 1
TABLE 7 accuracy of GMHKE algorithm in test set 2
The invention provides the GMHKE algorithm, which processes video stream data with a GMHI improved by combining a Gaussian kernel function with equal-interval frame-distance sampling, extracts image features through HOG, uses an NN classifier to detect whether the action state label has changed, and extracts motion key frames according to that change. Experimental simulation verifies the practicality of the GMHKE algorithm, and a relatively ideal effect is obtained in action classification. Calculating with the Gaussian kernel function on calculation frames extracted at an equal frame distance makes the gray-value intensity of the motion history image change more smoothly and more robustly, the HOG features extracted from the gradient information of the MHI describe the change of the human motion state clearly, and, combined with the fast NN classifier for motion classification, this shows that the proposed method can smoothly extract action key frames from a video stream while meeting the action classification accuracy.
It should be noted that, although the above-mentioned embodiments of the present invention are illustrative, the present invention is not limited thereto, and thus the present invention is not limited to the above-mentioned embodiments. Other embodiments, which can be made by those skilled in the art in light of the teachings of the present invention, are considered to be within the scope of the present invention without departing from its principles.

Claims (6)

1. A method for extracting features of human action key frames in video stream is characterized by comprising the following steps:
step 1: acquiring a calculation frame from video stream data by using an equal interval sampling method;
step 2: generating a motion history image and performing motion segmentation on the calculation frame by using an improved motion history image algorithm based on a Gaussian kernel function, and separating the human motion foreground from the background to obtain the motion history image;
on the basis of the traditional motion history image algorithm, the improved motion history image algorithm based on the Gaussian kernel function increases or decreases the gray value of the calculation frame at the current moment by comparing the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence, namely:
if the difference between the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence is greater than or equal to the set gray threshold, the gray value of the pixel point of the calculation frame at the current moment is increased by a Gaussian-kernel weighted amount:
H(x,y,t) = H(x,y,t-Δt) + 255·exp(-Δt²/(2ω²))/(√(2π)·ω)
wherein ω represents a set frame influence factor, t represents the current time, and Δt represents the time difference between the calculation frame at the current time and the comparison frame in the time sequence;
if the difference between the gray values of the corresponding pixel points of the calculation frame at the current moment and the comparison frame in the time sequence is smaller than the set gray threshold, the gray value of the pixel point of the calculation frame at the current moment is reduced by the rated attenuation coefficient σ;
step 3: describing the motion information of the contour edges of the motion history image by using the histogram of oriented gradients feature, and extracting the image features of the calculation frame;
step 4: carrying out motion recognition on the image features by using the NN classifier, and outputting the calculation frame at the current moment as an action key frame when the motion state labels of the calculation frame at the current moment and the comparison frame in the time sequence differ.
2. The method as claimed in claim 1, wherein the method further comprises, before step 1: when the video stream data is collected, the median filter is used for carrying out preprocessing for eliminating noise on the video stream data.
3. The method as claimed in claim 1, wherein the comparison frames in time series are obtained from the video stream data by using an equal-interval sampling method.
4. The method as claimed in claim 3, wherein the sampling interval of the comparison frames in the time sequence is equal to or greater than the sampling interval of the calculation frames.
5. The method according to claim 1, wherein the threshold value of the gray scale level set in step2 is 127.
6. The method according to claim 1, wherein in step2, the attenuation coefficient σ is 30.
CN202011246020.9A 2020-11-10 2020-11-10 Feature extraction method for human action key frame in video stream Active CN112329656B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011246020.9A CN112329656B (en) 2020-11-10 2020-11-10 Feature extraction method for human action key frame in video stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011246020.9A CN112329656B (en) 2020-11-10 2020-11-10 Feature extraction method for human action key frame in video stream

Publications (2)

Publication Number Publication Date
CN112329656A (en) 2021-02-05
CN112329656B CN112329656B (en) 2022-05-10

Family

ID=74317337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011246020.9A Active CN112329656B (en) 2020-11-10 2020-11-10 Feature extraction method for human action key frame in video stream

Country Status (1)

Country Link
CN (1) CN112329656B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362324A (en) * 2021-07-21 2021-09-07 上海脊合医疗科技有限公司 Bone health detection method and system based on video image
CN113762114A (en) * 2021-08-27 2021-12-07 四川智胜慧旅科技有限公司 Personnel searching method and system based on outdoor video identification
CN113918769A (en) * 2021-10-11 2022-01-11 平安国际智慧城市科技股份有限公司 Method, device and equipment for marking key actions in video and storage medium
CN116805433A (en) * 2023-06-27 2023-09-26 北京奥康达体育科技有限公司 Human motion trail data analysis system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120314064A1 (en) * 2011-06-13 2012-12-13 Sony Corporation Abnormal behavior detecting apparatus and method thereof, and video monitoring system
CN106485245A (en) * 2015-08-24 2017-03-08 南京理工大学 A kind of round-the-clock object real-time tracking method based on visible ray and infrared image
CN110516609A (en) * 2019-08-28 2019-11-29 南京邮电大学 A kind of fire video detection and method for early warning based on image multiple features fusion
CN110555368A (en) * 2019-06-28 2019-12-10 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN110705412A (en) * 2019-09-24 2020-01-17 北京工商大学 Video target detection method based on motion history image
CN110781723A (en) * 2019-09-05 2020-02-11 杭州视鑫科技有限公司 Group abnormal behavior identification method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120314064A1 (en) * 2011-06-13 2012-12-13 Sony Corporation Abnormal behavior detecting apparatus and method thereof, and video monitoring system
CN106485245A (en) * 2015-08-24 2017-03-08 南京理工大学 A kind of round-the-clock object real-time tracking method based on visible ray and infrared image
CN110555368A (en) * 2019-06-28 2019-12-10 西安理工大学 Fall-down behavior identification method based on three-dimensional convolutional neural network
CN110516609A (en) * 2019-08-28 2019-11-29 南京邮电大学 A kind of fire video detection and method for early warning based on image multiple features fusion
CN110781723A (en) * 2019-09-05 2020-02-11 杭州视鑫科技有限公司 Group abnormal behavior identification method
CN110705412A (en) * 2019-09-24 2020-01-17 北京工商大学 Video target detection method based on motion history image

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
FIZA MURTAZA等: "Multi-view Human Action Recognition using 2D Motion Templates based on MHIs and their HOG Description", 《IET COMPUTER VISION》 *
JUNFENG SUN等: "Human Actions Recognition Using Improved MHI and 2-D Gabor Filter Based on Energy Blocks", 《PROCEEDINGS OF THE 2018 2ND INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE: TECHNOLOGIES AND APPLICATIONS (ICAITA 2018)》 *
YU LE: "Gesture Recognition Based on Motion History Image and Support Vector Machine", 《ELECTRONIC TECHNOLOGY & SOFTWARE ENGINEERING》 *
JI CHONGXIAO: "Research on Classroom Behavior Recognition Methods Based on Digital Image Processing", 《CHINA MASTERS' THESES FULL-TEXT DATABASE (SOCIAL SCIENCES II)》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113362324A (en) * 2021-07-21 2021-09-07 上海脊合医疗科技有限公司 Bone health detection method and system based on video image
CN113362324B (en) * 2021-07-21 2023-02-24 上海脊合医疗科技有限公司 Bone health detection method and system based on video image
CN113762114A (en) * 2021-08-27 2021-12-07 四川智胜慧旅科技有限公司 Personnel searching method and system based on outdoor video identification
CN113918769A (en) * 2021-10-11 2022-01-11 平安国际智慧城市科技股份有限公司 Method, device and equipment for marking key actions in video and storage medium
CN116805433A (en) * 2023-06-27 2023-09-26 北京奥康达体育科技有限公司 Human motion trail data analysis system
CN116805433B (en) * 2023-06-27 2024-02-13 北京奥康达体育科技有限公司 Human motion trail data analysis system

Also Published As

Publication number Publication date
CN112329656B (en) 2022-05-10

Similar Documents

Publication Publication Date Title
CN112329656B (en) Feature extraction method for human action key frame in video stream
Goldman et al. Precise detection in densely packed scenes
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN106778712B (en) Multi-target detection and tracking method
JP5604256B2 (en) Human motion detection device and program thereof
WO2009109127A1 (en) Real-time body segmentation system
CN111260738A (en) Multi-scale target tracking method based on relevant filtering and self-adaptive feature fusion
CN110298297A (en) Flame identification method and device
CN106295532B (en) A kind of human motion recognition method in video image
Zhang et al. License plate localization in unconstrained scenes using a two-stage CNN-RNN
Ahmed et al. Human detection using HOG-SVM, mixture of Gaussian and background contours subtraction
Wang et al. Fully convolutional network based skeletonization for handwritten chinese characters
CN110827265A (en) Image anomaly detection method based on deep learning
Rong et al. Scene text recognition in multiple frames based on text tracking
CN108961262B (en) Bar code positioning method in complex scene
CN111401308A (en) Fish behavior video identification method based on optical flow effect
Perreault et al. Centerpoly: Real-time instance segmentation using bounding polygons
CN106446832B (en) Video-based pedestrian real-time detection method
Gui et al. A fast caption detection method for low quality video images
Piérard et al. A probabilistic pixel-based approach to detect humans in video streams
CN108573217B (en) Compression tracking method combined with local structured information
Mizher et al. Action key frames extraction using l1-norm and accumulative optical flow for compact video shot summarisation
Li et al. Research on hybrid information recognition algorithm and quality of golf swing
Jaiswal et al. Survey paper on various techniques of recognition and tracking
CN113470073A (en) Animal center tracking method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant