CN112464856A - Video streaming detection method based on human skeleton key points - Google Patents
Video streaming detection method based on human skeleton key points
- Publication number
- CN112464856A (application CN202011431363.2A)
- Authority
- CN
- China
- Prior art keywords
- data
- key points
- skeleton
- action
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/23—Recognition of whole body movements, e.g. for sport training
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/103—Static body considered as a whole, e.g. static pedestrian or occupant recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Human Computer Interaction (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Psychiatry (AREA)
- Social Psychology (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a video stream action detection method based on human skeleton key points. An m-second sliding window captures m seconds of video at n frames per second; human skeleton key points are identified in each of the m×n frames, and the top K skeletons are kept per frame. The inter-frame skeleton data is then split into multiple skeleton sequences according to Euclidean distance, one sequence per person. The method targets action detection and recognition in videos of variable length, and runs at 1× real-time speed on a 2080 Ti-class GPU, making video action detection and recognition practical.
Description
Technical Field
The invention relates to the field of video recognition, and in particular to a video stream action detection method based on human skeleton key points.
Background
Action detection is mainly based on a human body posture model and recognizes actions in frames captured from video. For example, Chinese patent publication CN107194344A discloses a human behavior recognition method that adapts to the skeleton center, addressing the low recognition accuracy of the prior art. Its steps are: 1) acquire a three-dimensional skeleton sequence from a skeleton sequence dataset and preprocess it into a coordinate matrix; 2) select feature parameters from the coordinate matrix, adaptively choose a coordinate center, and re-normalize the action to obtain an action coordinate matrix; 3) denoise the action coordinate matrix with DTW, reduce its temporal misalignment and noise with FTP, and classify it with an SVM. Compared with existing behavior recognition methods it improves recognition accuracy, and it can be applied to surveillance, video games and human-computer interaction. However, such techniques target action recognition in short videos, their main application scenarios being access control and security recognition systems, and their recognition of long videos is mediocre. The prior art performs well on short-video action classification, i.e. a short video is input and its action class is output; related techniques include C3D, ST-GCN and 2S-AGCN. These methods are ineffective for action detection in long videos or video streams, and their hardware requirements are too high to be practical.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a video stream action detection method based on human skeleton key points, targeting action detection and recognition in videos of variable length. Detection and recognition run at 1× real-time speed on a 2080 Ti-class GPU, making the method practical.
The purpose of the invention is achieved through the following technical scheme:
A video stream action detection method based on human skeleton key points comprises the following steps:
1) capture m seconds of video at n frames per second each time with an m-second sliding window, obtaining m×n frames;
2) identify human skeleton key points in each of the m×n frames, and take the top K skeletons per frame; "top K" means that when several people appear in one image, K of them are kept according to some rule, e.g. the K with the highest confidence or the K with the largest area;
3) split the inter-frame skeleton data into multiple skeleton sequences according to Euclidean distance, one sequence per person (a minimal sketch follows this list);
4) feed each skeleton sequence into the deep learning network model to obtain its prediction result.
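As a concrete illustration of step 3), here is a minimal Python sketch of linking per-frame skeletons into per-person sequences by Euclidean distance. The greedy nearest-neighbour strategy, the max_dist threshold and the (67, 2) keypoint layout are illustrative assumptions, not the patent's prescribed matching rule:

```python
import numpy as np

def match_skeletons(frames, max_dist=100.0):
    """Greedily link per-frame skeletons into per-person sequences.

    frames: list over time; each entry holds the top-K skeletons of one
    frame as (67, 2) keypoint arrays. Linking uses the mean Euclidean
    distance between corresponding keypoints; max_dist is an assumed
    threshold for "same person".
    """
    sequences = []  # each: {"last": latest pose, "frames": list of poses}
    for skeletons in frames:
        unclaimed = list(skeletons)
        for seq in sequences:
            if not unclaimed:
                break
            dists = [np.linalg.norm(s - seq["last"], axis=1).mean()
                     for s in unclaimed]
            best = int(np.argmin(dists))
            if dists[best] < max_dist:   # nearest skeleton continues this person
                seq["last"] = unclaimed.pop(best)
                seq["frames"].append(seq["last"])
        for s in unclaimed:              # unmatched skeletons start new sequences
            sequences.append({"last": s, "frames": [s]})
    return [np.stack(seq["frames"]) for seq in sequences]
```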
Further, step 3) also includes a bone data normalization method, comprising:
11) scale the coordinate data to a height of 1080, adapting the width proportionally;
12) translate the entire bone data so that the bone center is the origin, making the bone data independent of the image resolution, and multiply the bone data by s0 = 1.0;
13) compute the displacement of each key point between the next frame and the previous frame (the first frame is 0), and multiply the displacement data by s1 = 4.0, where s0 adjusts the distribution range of the spatial part of the normalized feature data and s1 adjusts the distribution range of its motion part;
14) concatenate the skeleton key points and the displacement data to form the input data for training and prediction, finally obtaining the corresponding training data.
Further, the bone center is the midpoint of the two hips.
Furthermore, the skeleton data is normalized to between -0.5 and 0.5, within the maximum-gradient range of the tanh activation function, which helps the deep learning network model converge during training. Based on a large amount of statistics, the normalized data is observed to be approximately distributed in the range [-0.5, 0.5].
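A minimal Python sketch of normalization steps 11)-14), under stated assumptions: keypoints arrive as a (T, 67, 2) array, and the bone center is taken as keypoint index 8, matching the mid-hip point of the worked example in the detailed description; the function name and array layout are illustrative:

```python
import numpy as np

def normalize_sequence(coords, img_h, s0=1.0, s1=4.0, center_idx=8):
    """Normalize a (T, 67, 2) keypoint sequence per steps 11)-14).

    center_idx=8 assumes the mid-hip keypoint of the worked example's
    layout; adjust it to the pose estimator actually used.
    """
    # 11) scale so the image height becomes 1080; width scales by the same factor
    scaled = coords * (1080.0 / img_h)
    # 12) translate so the bone center is the origin, divide by 1080,
    #     multiply by s0 -> spatial feature part
    spatial = (scaled - scaled[:, center_idx:center_idx + 1, :]) / 1080.0 * s0
    # 13) per-keypoint displacement between consecutive frames
    #     (first frame is 0), scaled by s1 -> motion feature part
    motion = np.zeros_like(scaled)
    motion[1:] = (scaled[1:] - scaled[:-1]) / 1080.0 * s1
    # 14) concatenate spatial and motion parts: (T, 134 + 134) = (T, 268)
    T = coords.shape[0]
    return np.concatenate([spatial.reshape(T, -1), motion.reshape(T, -1)], axis=1)
```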
Further, the deep learning network model prediction method inputs a bone sequence [x0, x1, x2, …] into a bidirectional recurrent deep learning network model and predicts a label for each frame. An example output is [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and non-o marks an action frame; in this example t is jump, z is turn, b_ is the start of an action and i_ is its continuation.
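The per-frame labels can be collapsed into action segments afterwards. The following decoder is an illustrative addition, not part of the patent; it assumes tag strings follow the b_/i_/o convention of the example above:

```python
def decode_segments(tags):
    """Turn a per-frame tag sequence like [o, o, b_t, i_t, o, ...] into
    (action, start_frame, end_frame) segments. 'b_' opens an action,
    'i_' continues it, 'o' marks no action."""
    segments, current = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("b_"):
            if current:
                segments.append(current)
            current = [tag[2:], i, i]        # action label, start, end
        elif tag.startswith("i_") and current and tag[2:] == current[0]:
            current[2] = i                   # extend the running action
        else:                                # 'o' (or a mismatched i_) closes it
            if current:
                segments.append(current)
            current = None
    if current:
        segments.append(current)
    return [tuple(s) for s in segments]

# e.g. decode_segments(["o", "b_t", "i_t", "i_t", "o", "b_z", "i_z"])
# -> [("t", 1, 3), ("z", 5, 6)]
```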
Further, the training data set is made as follows (a sketch of the final steps follows this list):
111) extract frames from the video to be labelled at 10 frames per second;
112) extract 10 groups, ordered from high to low image quality;
113) manually label one group of data, i.e. put each action sequence into the directory of the corresponding action; frame numbers must not be continuous between two action sequences;
114) automatically group the remaining groups of data according to the manually labelled data;
115) extract skeleton key points frame by frame;
116) normalize the skeleton key point data as described above;
117) randomly combine the training data into sequences of 30-70 frames containing both action sequences and non-action sequences;
118) split the training data into a frame-number sequence and a label sequence stored in separate files; the feature data corresponding to the frame numbers is also stored in a separate feature file.
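A minimal sketch of steps 117) and 118), assuming labelled clips are already in memory as (frame_numbers, labels) pairs; the JSON file names are illustrative stand-ins for the patent's separate frame-number and label files:

```python
import json
import random

def build_training_sequence(action_clips, idle_clips, min_len=30, max_len=70):
    """Splice labelled clips into one training sequence of 30-70 frames,
    always placing a no-action span between two action spans.
    Each clip is a (frame_numbers, labels) pair of equal-length lists."""
    target = random.randint(min_len, max_len)
    frames, labels = [], []
    while len(frames) < target:
        idle, action = random.choice(idle_clips), random.choice(action_clips)
        frames.extend(idle[0] + action[0])
        labels.extend(idle[1] + action[1])
    frames, labels = frames[:target], labels[:target]
    # frame-number sequence and label sequence are stored in separate files
    with open("frames.json", "w") as f:
        json.dump(frames, f)
    with open("labels.json", "w") as f:
        json.dump(labels, f)
    return frames, labels
```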
Further, a detailed description of the single-stream model:
the input data is normalized bone key point data, one vector per frame, with 1-n input frames supported; the input shape is (batch_size, seq_len, flat_num);
a linear transform and Tanh activation are applied;
the data is fed into a multi-layer bidirectional LSTM deep learning network model;
a CRF layer enforces the label transition relations of the sequence;
B_ denotes the start of an action, I_ its continuation, and O no action.
The tag after O may be O or B_, not I_;
the tag after B_ may be I_, not B_ or O;
the tag after I_ may be I_, O or B_ (a sketch of these constraints follows).
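These three rules can be encoded as a transition mask for the CRF layer. The sketch below uses a small illustrative tag set with the two example actions t and z, and additionally assumes that an I_ tag must continue the same action that its B_ tag opened:

```python
import numpy as np

# Illustrative tag set for the two example actions t (jump) and z (turn).
TAGS = ["O", "B_t", "I_t", "B_z", "I_z"]

def allowed(prev, nxt):
    """Encode the rules above: O -> {O, B_}; B_x -> {I_x};
    I_x -> {I_x, O, B_}. Anything else is forbidden."""
    if prev == "O":
        return not nxt.startswith("I_")
    if prev.startswith("B_"):
        return nxt == "I_" + prev[2:]      # continuation of the same action
    # prev is I_x: continue the same action, stop, or start a new one
    return nxt == prev or nxt == "O" or nxt.startswith("B_")

# -inf entries in this matrix can be used to mask a CRF's transition scores
mask = np.array([[0.0 if allowed(p, n) else -np.inf for n in TAGS]
                 for p in TAGS])
```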
The invention has the beneficial effects that the method can extract features from long videos with high recognition accuracy, and is suitable for feature extraction from long-duration streaming media.
Drawings
FIG. 1 is a normalized data distribution diagram (spatial feature part);
FIG. 2 is a normalized data distribution diagram (motion feature part);
FIG. 3 is a schematic diagram of the single-stream model;
FIG. 4 is a schematic diagram of the three-stream fusion model;
FIG. 5 is a schematic diagram of the linear transform and tanh nonlinear activation applied separately to the three streams.
Detailed Description
The technical solution of the present invention is described in further detail below with reference to specific examples, but the scope of the present invention is not limited to them.
Using an m-second sliding window, m seconds of video are captured each time at n frames per second. Human skeleton key points are identified in each of the m×n frames, and the top K skeletons are taken per frame; "top K" means that when several people appear in one picture, K of them are kept by some rule, e.g. the K with the highest confidence or the K with the largest area. The inter-frame skeleton data is then split into multiple skeleton sequences according to Euclidean distance, one sequence per person.
The original bone data are coordinates in the image, which is not conducive to training and prediction of the deep learning network model, so the bone data are normalized by this method. The specific normalization is as follows.
The coordinate data is scaled to a height of 1080, with the width adapted proportionally.
The entire bone data is translated so that the bone center (the midpoint of the two hips) is the origin, making the bone data independent of the image resolution, and the bone data is multiplied by s0 = 1.0. The data distribution is shown in FIG. 1.
The displacement of each key point between the next frame and the previous frame is computed (the first frame is 0) and multiplied by s1 = 4.0; the data distribution is shown in FIG. 2. Here s0 adjusts the distribution range of the spatial part of the normalized feature data, and s1 adjusts the distribution range of its motion part.
The bone key points and the displacement data are concatenated to form the input data for training and prediction. The data is normalized to between -0.5 and 0.5, within the maximum-gradient range of the tanh activation function, which helps the deep learning network model converge during training.
For example:
1. Skeleton key points are extracted from the original image: 67 key points in total, comprising 25 body key points, 21 left-hand key points and 21 right-hand key points. Each key point consists of (abscissa, ordinate), and the coordinate origin is the top-left corner of the image.
Input image example (resolution 544×960):
Output example s1out: a 134-dimensional array in which every two values form one key point.
[315, 368, 302, 428, 263, 428, 242, 502, 242, 562, 342, 428, 397, 399, 386, 349, 302, 557, 271, 560, 260, 660, 250, 743, 326, 557, 336, 659, 342, 746, 305, 360, 323, 363, 286, 368, 331, 371, 352, 788, 365, 783, 328, 757, 252, 773, 239, 767, 255, 746, 382, 348, 375, 348, 365, 343, 357, 337, 352, 333, 365, 325, 358, 315, 353, 310, 350, 305, 372, 322, 366, 310, 364, 301, 362, 293, 378, 321, 374, 308, 372, 301, 371, 293, 385, 321, 386, 313, 387, 308, 388, 305, 256, 569, 244, 568, 241, 583, 240, 592, 240, 599, 243, 578, 242, 589, 240, 595, 241, 601, 244, 577, 241, 586, 242, 592, 242, 600, 245, 579, 240, 584, 242, 590, 242, 601, 244, 580, 241, 597, 240, 597, 241, 601]
2. The ordinate is scaled to a height of 1080 and the abscissa is scaled by the same factor.
In this example:
y_scale = 1080/960 = 1.125; all 67×2 = 134 values from the first step are multiplied by y_scale to yield s2out:
[354.375, 414.0, 339.75, 481.5, 295.875, 481.5, 272.25, 564.75, 272.25, 632.25, 384.75, 481.5, 446.625, 448.875, 434.25, 392.625, 339.75, 626.625, 304.875, 630.0, 292.5, 742.5, 281.25, 835.875, 366.75, 626.625, 378.0, 741.375, 384.75, 839.25, 343.125, 405.0, 363.375, 408.375, 321.75, 414.0, 372.375, 417.375, 396.0, 886.5, 410.625, 880.875, 369.0, 851.625, 283.5, 869.625, 268.875, 862.875, 286.875, 839.25, 429.75, 391.5, 421.875, 391.5, 410.625, 385.875, 401.625, 379.125, 396.0, 374.625, 410.625, 365.625, 402.75, 354.375, 397.125, 348.75, 393.75, 343.125, 418.5, 362.25, 411.75, 348.75, 409.5, 338.625, 407.25, 329.625, 425.25, 361.125, 420.75, 346.5, 418.5, 338.625, 417.375, 329.625, 433.125, 361.125, 434.25, 352.125, 435.375, 346.5, 436.5, 343.125, 288.0, 640.125, 274.5, 639.0, 271.125, 655.875, 270.0, 666.0, 270.0, 673.875, 273.375, 650.25, 272.25, 662.625, 270.0, 669.375, 271.125, 676.125, 274.5, 649.125, 271.125, 659.25, 272.25, 666.0, 272.25, 675.0, 275.625, 651.375, 270.0, 657.0, 272.25, 663.75, 272.25, 676.125, 274.5, 652.5, 271.125, 671.625, 270.0, 671.625, 271.125, 676.125]
3. The coordinate points are normalized relative to the center point of the human body.
Referring to the body key point map, the 8th body key point (marked red in s2out) is taken as the center point, i.e. the two values (s2out[16], s2out[17]) of the second step's output. The relative positions of the 67 key points are computed, i.e. s2out[16] is subtracted from every abscissa and s2out[17] from every ordinate.
Taking the first point (354.375, 414.0) as an example, after transformation:
(354.375, 414.0) - (s2out[16], s2out[17]) = (354.375, 414.0) - (339.75, 626.625) = (354.375 - 339.75, 414.0 - 626.625) = (14.625, -212.625). The relative coordinates are then divided by 1080 to obtain (0.01354, -0.19687) and multiplied by s0 = 1.0 to adjust the distribution range of the output values; the default value 1.0 is equivalent to no adjustment.
All 67 key points are transformed to obtain s3out:
[0.01354, -0.19687, 0.0, -0.13437, -0.04063, -0.13437, -0.0625, -0.05729, -0.0625, 0.00521, 0.04167, -0.13437, 0.09896, -0.16458, 0.0875, -0.21667, 0.0, 0.0, -0.03229, 0.00313, -0.04375, 0.10729, -0.05417, 0.19375, 0.025, 0.0, 0.03542, 0.10625, 0.04167, 0.19687, 0.00313, -0.20521, 0.02187, -0.20208, -0.01667, -0.19687, 0.03021, -0.19375, 0.05208, 0.24063, 0.06563, 0.23542, 0.02708, 0.20833, -0.05208, 0.225, -0.06563, 0.21875, -0.04896, 0.19687, 0.08333, -0.21771, 0.07604, -0.21771, 0.06563, -0.22292, 0.05729, -0.22917, 0.05208, -0.23333, 0.06563, -0.24167, 0.05833, -0.25208, 0.05312, -0.25729, 0.05, -0.2625, 0.07292, -0.24479, 0.06667, -0.25729, 0.06458, -0.26667, 0.0625, -0.275, 0.07917, -0.24583, 0.075, -0.25938, 0.07292, -0.26667, 0.07187, -0.275, 0.08646, -0.24583, 0.0875, -0.25417, 0.08854, -0.25938, 0.08958, -0.2625, -0.04792, 0.0125, -0.06042, 0.01146, -0.06354, 0.02708, -0.06458, 0.03646, -0.06458, 0.04375, -0.06146, 0.02187, -0.0625, 0.03333, -0.06458, 0.03958, -0.06354, 0.04583, -0.06042, 0.02083, -0.06354, 0.03021, -0.0625, 0.03646, -0.0625, 0.04479, -0.05937, 0.02292, -0.06458, 0.02813, -0.0625, 0.03438, -0.0625, 0.04583, -0.06042, 0.02396, -0.06354, 0.04167, -0.06458, 0.04167, -0.06354, 0.04583]
A large amount of data was aggregated into a distribution graph whose abscissa is the normalized value and whose ordinate is the count of each value in millions (1e6). Observing FIG. 1, most values fall within (-0.5, 0.5), i.e. the data is roughly normalized to (-0.5, 0.5). If the data needs to be normalized to (-1, 1) instead, only the parameter s0 = 2.0 needs to be set.
4. Steps 1, 2 and 3 yield the first half of the final feature data, the spatial position feature part. The motion feature information still needs to be acquired; this process is comparatively simple.
Assume two adjacent frames f0 and f1 are transformed by steps 1 and 2 into s2out0 and s2out1 respectively; the motion data is then (s2out1 - s2out0) / 1080 × s1. The parameter s1 adjusts the value range of the motion feature part so that it is close to that of the spatial feature part, which benefits subsequent training of the deep learning network model. With s1 = 4.0 the motion feature data is distributed as in FIG. 2.
A large amount of motion feature data was aggregated into a distribution graph whose abscissa is the normalized value and whose ordinate is the count of each value in millions (1e6).
5. The spatial features and motion features are concatenated end to end to form a 268-dimensional feature vector (134 spatial + 134 motion values).
The bone sequence [x0, x1, x2, …] is input into a bidirectional recurrent deep learning network model, which predicts a label for each frame. An example output is [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o marks a no-action frame and non-o an action frame. In this example t is jump, z is turn, b_ is the start of an action and i_ its continuation.
The bone data is normalized as described above.
A bidirectional recurrent deep learning network model, Bi-LSTM plus a conditional random field (CRF), is adopted.
The training data is randomly combined over the entire sequence, and no-action sequences must be included between action sequences.
Production of the training data set (training video data may contain only a single person):
Frames are extracted from the video to be labelled at 10 frames per second.
10 groups are extracted, ordered from high to low image quality.
One group of data is labelled manually, i.e. each action sequence is put into the directory of the corresponding action. Frame numbers must not be continuous between two action sequences.
The remaining groups of data are grouped automatically according to the manually labelled data.
Skeleton key points are extracted frame by frame.
The skeleton key point data is normalized in the manner described above.
The training data is randomly combined into sequences of 30-70 frames containing both action sequences and non-action sequences.
The training data is split into a frame-number sequence and a label sequence stored in separate files. The feature data corresponding to the frame numbers is also stored in a separate feature file.
Deep learning network model: this embodiment uses an end-to-end single-stream model, whose principle is illustrated in FIG. 3.
The three-stream fusion model is shown in FIG. 4:
Stream 1: bone data, i.e. the lengths between associated bone key points, for example wrist-elbow and elbow-shoulder; this data can be computed from the spatial features.
Stream 2: joint data, i.e. the spatial part of the normalized feature data described above.
Stream 3: motion data.
Each of the three streams undergoes a linear transform and tanh nonlinear activation, as illustrated in FIG. 5.
Referring to FIG. 3, a detailed description of the single-stream model follows.
The input data is normalized bone key point data, one vector per frame, with 1-n input frames supported. The input shape is (batch_size, seq_len, flat_num).
A linear transform and Tanh activation are applied.
The data is fed into a multi-layer bidirectional LSTM deep learning network model.
A CRF layer enforces the label transition relations of the sequence.
An action sequence must start with B_.
The tag after B_ cannot be O;
the tag after B_ is the action continuation I_ (a sketch of the model follows).
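A sketch of this single-stream model in PyTorch, under assumptions: the 268-dimensional features of the detailed example, illustrative hidden sizes and tag count, and a plain tag-scoring head standing in for the CRF layer (the transition mask from the earlier snippet could be applied on top of these scores):

```python
import torch
import torch.nn as nn

class SingleStreamTagger(nn.Module):
    """Linear + Tanh embedding, multi-layer bidirectional LSTM, then
    per-frame tag scores. hidden, num_tags and layers are illustrative."""
    def __init__(self, flat_num=268, hidden=256, num_tags=5, layers=2):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(flat_num, hidden), nn.Tanh())
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers,
                            bidirectional=True, batch_first=True)
        self.to_tags = nn.Linear(2 * hidden, num_tags)

    def forward(self, x):            # x: (batch_size, seq_len, flat_num)
        h, _ = self.lstm(self.embed(x))
        return self.to_tags(h)       # (batch_size, seq_len, num_tags)

# e.g. scores = SingleStreamTagger()(torch.randn(2, 50, 268))
```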
Video stream detection and recognition:
The video is traversed in time order with an m-second sliding window and a step of s seconds.
Each window extracts x frames of pictures.
Skeleton key points are extracted for every person in each frame.
The skeleton key points are combined into multiple skeleton sequences according to Euclidean distance.
Each skeleton sequence is fed into the deep learning network model to obtain its prediction result (a sketch of this loop follows).
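Putting the pieces together, a hedged sketch of the streaming loop. Here video.frame_at, extract_keypoints and predict_tags are hypothetical helpers (the patent does not name its pose estimator or inference API), m, s and n are illustrative values, and the remaining functions are the sketches defined earlier:

```python
def detect_stream(video, model, m=4, s=1, n=10):
    """Slide an m-second window in steps of s seconds over the video,
    sampling n frames per second, and print detected action segments.
    video.frame_at, extract_keypoints and predict_tags are hypothetical."""
    t = 0.0
    while t + m <= video.duration:
        frames = [video.frame_at(t + i / n) for i in range(m * n)]
        per_frame = [extract_keypoints(f) for f in frames]  # top-K skeletons per frame
        for seq in match_skeletons(per_frame):               # one sequence per person
            feats = normalize_sequence(seq, img_h=video.height)
            print(t, decode_segments(predict_tags(model, feats)))
        t += s
```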
The foregoing is illustrative of the preferred embodiments of this invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. A video stream detection method based on human skeleton key points, characterized by comprising the following steps:
1) capturing m seconds of video at n frames per second each time with an m-second sliding window, obtaining m×n frames;
2) identifying human skeleton key points in each of the m×n frames, and taking the top K skeletons per frame;
3) splitting the inter-frame skeleton data into multiple skeleton sequences according to Euclidean distance, one sequence per person;
4) feeding each skeleton sequence into the deep learning network model to obtain its prediction result.
2. The video stream detection method based on human skeleton key points according to claim 1, wherein step 3) further comprises a bone data normalization method, comprising:
11) scaling the coordinate data to a height of 1080, adapting the width proportionally;
12) translating the entire bone data so that the bone center is the origin, making the bone data independent of the image resolution, and multiplying the bone data by s0;
13) computing the displacement of each key point between the next frame and the previous frame (the first frame being 0) and multiplying the displacement data by s1, where s0 adjusts the distribution range of the spatial information of the normalized feature data and s1 adjusts the distribution range of its motion information;
14) concatenating the skeleton key points and the displacement data to form the input data for training and prediction, finally obtaining the corresponding training data.
3. The video stream detection method based on human skeleton key points according to claim 1, wherein the bone center is the midpoint of the two hips.
4. The video stream detection method based on human skeleton key points according to claim 1, wherein the skeleton data is normalized to between -0.5 and 0.5, within the maximum-gradient range of the tanh activation function, which benefits the training convergence of the deep learning network model.
5. The video stream detection method based on human skeleton key points according to claim 1, wherein the deep learning network model prediction method inputs the bone sequence [x0, x1, x2, …] into a bidirectional recurrent deep learning network model to predict a label for each frame; an example output is [o, o, o, o, o, b_t, i_t, i_t, i_t, o, o, o, o, b_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, i_z, o, o, o], where o is a no-action frame and non-o an action frame; in this example t is jump, z is turn, b_ is the start of an action and i_ its continuation.
6. The video stream detection method based on human skeleton key points according to claim 1, further comprising a training data set generation method comprising:
111) extracting frames from the video to be labelled at 10 frames per second;
112) extracting 10 groups, ordered from high to low image quality;
113) manually labelling one group of data, i.e. putting each action sequence into the directory of the corresponding action, where frame numbers must not be continuous between two action sequences;
114) automatically grouping the remaining groups of data according to the manually labelled data;
115) extracting skeleton key points frame by frame;
116) normalizing the skeleton key point data in the manner described above;
117) randomly combining the training data into sequences of 30-70 frames containing both action sequences and non-action sequences;
118) splitting the training data into a frame-number sequence and a label sequence stored in separate files;
the feature data corresponding to the frame numbers is also stored in a separate feature file.
7. The video stream detection method based on human skeleton key points according to claim 1, wherein the deep learning network model is a single-stream model described as follows:
the input data is normalized bone key point data, one vector per frame, with 1-n input frames supported; the input shape is (batch_size, seq_len, flat_num);
a linear transform and Tanh activation are applied;
the data is fed into a multi-layer bidirectional LSTM deep learning network model;
a CRF layer enforces the label transition relations of the sequence;
B_ denotes the start of an action, I_ its continuation, and O no action;
the tag after O may be O or B_, not I_;
the tag after B_ may be I_, not B_ or O;
the tag after I_ may be I_, O or B_.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011431363.2A CN112464856B (en) | 2020-12-09 | 2020-12-09 | Video streaming detection method based on key points of human bones |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112464856A true CN112464856A (en) | 2021-03-09 |
CN112464856B CN112464856B (en) | 2023-06-13 |
Family
ID=74801107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011431363.2A Active CN112464856B (en) | 2020-12-09 | 2020-12-09 | Video streaming detection method based on key points of human bones |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112464856B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190332866A1 (en) * | 2018-04-26 | 2019-10-31 | Fyusion, Inc. | Method and apparatus for 3-d auto tagging |
CN109710802A (en) * | 2018-12-20 | 2019-05-03 | 百度在线网络技术(北京)有限公司 | Video classification methods and its device |
CN111383421A (en) * | 2018-12-30 | 2020-07-07 | 奥瞳系统科技有限公司 | Privacy protection fall detection method and system |
CN110263666A (en) * | 2019-05-29 | 2019-09-20 | 西安交通大学 | A kind of motion detection method based on asymmetric multithread |
CN110348321A (en) * | 2019-06-18 | 2019-10-18 | 杭州电子科技大学 | Human motion recognition method based on bone space-time characteristic and long memory network in short-term |
CN110991274A (en) * | 2019-11-18 | 2020-04-10 | 杭州电子科技大学 | Pedestrian tumbling detection method based on Gaussian mixture model and neural network |
CN111680562A (en) * | 2020-05-09 | 2020-09-18 | 北京中广上洋科技股份有限公司 | Human body posture identification method and device based on skeleton key points, storage medium and terminal |
CN111680613A (en) * | 2020-06-03 | 2020-09-18 | 安徽大学 | Method for detecting falling behavior of escalator passengers in real time |
Non-Patent Citations (4)
Title |
---|
ROMERO MORAIS et al.: "Learning Regularity in Skeleton Trajectories for Anomaly Detection in Videos", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11988-11996 *
ZHEN QIN et al.: "Learning Local Part Motion Representation for Skeleton-based Action Recognition", Proceedings of the 2019 11th International Conference on Education Technology and Computers, pages 295-299 *
SHI Jun: "Research and Implementation of a GCN-based Human Behavior Recognition System", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 138-1346 *
XU Zheng: "Human Skeleton Keypoint Detection Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology, no. 1, pages 138-1603 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118509542A (en) * | 2024-07-18 | 2024-08-16 | 圆周率科技(常州)有限公司 | Video generation method, device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112464856B (en) | 2023-06-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |