CN116824641A - Gesture classification method, device, equipment and computer storage medium - Google Patents

Gesture classification method, device, equipment and computer storage medium

Info

Publication number
CN116824641A
Authority
CN
China
Prior art keywords
video
processed
features
target
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311091309.1A
Other languages
Chinese (zh)
Other versions
CN116824641B (en)
Inventor
曹伟
王晓利
段玉涛
孟祥秀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canos Digital Technology Beijing Co ltd
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Original Assignee
Canos Digital Technology Beijing Co ltd
Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canos Digital Technology Beijing Co ltd, Cosmoplat Industrial Intelligent Research Institute Qingdao Co Ltd filed Critical Canos Digital Technology Beijing Co ltd
Priority to CN202311091309.1A priority Critical patent/CN116824641B/en
Publication of CN116824641A publication Critical patent/CN116824641A/en
Application granted granted Critical
Publication of CN116824641B publication Critical patent/CN116824641B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Image Analysis (AREA)

Abstract

The application belongs to the technical field of gesture recognition, and particularly relates to a gesture classification method, device, equipment and computer storage medium. The method comprises: obtaining a video to be processed, extracting a plurality of key frames, and performing feature extraction on the key frames to obtain a plurality of target features; sampling the video to be processed through different video pipelines to obtain a plurality of sub-videos to be processed; and processing and feature-fusing the plurality of target features and the plurality of sub-videos to be processed to determine the gesture classification result of the video to be processed. The method can be applied to scenes in which the gestures that actually need to be recognized, for example detecting whether staff at the nodes of the industrial Internet of Things are sleeping on duty, differ greatly from the gestures in a pre-trained model. It preserves the connection among the parts of each image and accurately captures the temporal connection between preceding and following frames of the video, thereby improving the efficiency and accuracy of gesture recognition and classification.

Description

Gesture classification method, device, equipment and computer storage medium
Technical Field
The application belongs to the technical field of gesture recognition, and particularly relates to a gesture classification method, a gesture classification device, gesture classification equipment and a computer storage medium.
Background
With the rapid development of computer vision technology and related hardware facilities, human body posture classification technology has been gradually applied to various fields of society, and has exerted important economic and social values. Human body posture is one of important biological characteristics of a human body, and human body posture classification technology has many application scenes, such as: gait analysis, video surveillance, augmented reality, human-machine interaction, finance, entertainment and gaming, sports science, and the like.
When gesture recognition technology is applied to video monitoring, the different gestures of workers in the video can be recognized to determine whether anyone is sleeping on duty, so that production efficiency is not affected. A gesture classification algorithm can therefore be used to detect staff sleeping on duty in central control rooms, duty rooms and the like across industries.
A pre-trained gesture classification model based on public data sets has strong video classification capability, but in specific industrial scenes the gestures that actually need to be recognized differ greatly from the actions in the pre-trained model, so the recognition effect is poor and the gesture recognition accuracy is low.
Disclosure of Invention
The application provides a gesture classification method, a gesture classification device, gesture classification equipment and a computer storage medium, which are used to address the inaccuracy of results obtained when recognition is performed with a pre-trained model.
In a first aspect, the present application provides a gesture classification method, including:
acquiring a video to be processed, and acquiring an image corresponding to each frame in the video to be processed according to the video to be processed;
determining a plurality of key frames of the video to be processed according to the image;
respectively extracting the characteristics of the images corresponding to the key frames to obtain a plurality of target characteristics;
extracting the video to be processed through a plurality of video pipelines to obtain a plurality of sub-videos to be processed, wherein the frame numbers and the picture sizes of the sampled videos acquired by different video pipelines are different;
and determining an attitude classification result of the video to be processed according to the target features and the sub-videos to be processed.
Optionally, the determining, according to the image, a plurality of key frames of the video to be processed includes:
acquiring multi-classification tasks of gesture classification;
classifying the multi-classification tasks to obtain a plurality of classification tasks, and obtaining a dimension reduction mapping matrix of each classification task;
according to the plurality of dimension reduction mapping matrixes and the images corresponding to each frame in the video to be processed, performing saliency judgment processing on the images corresponding to each frame to obtain a saliency judgment result of the images corresponding to each frame;
And determining the plurality of key frames according to the significance judging result.
Optionally, the performing saliency discrimination processing on the image corresponding to each frame according to the plurality of dimension-reduction mapping matrices includes:
the saliency discrimination processing is performed on the image corresponding to each frame by using the following formulas:

W_c = LDA(X_c^+, X_c^-)

s_{m,i} = ||W_c^T · f_{m,i}||

wherein X_c^+ is the set of feature representations of each frame corresponding to the target gesture in the c-th classification task; X_c^- is the set of feature representations of each frame corresponding to all other classes of video in the c-th classification task; W_c is the dimension-reduction mapping matrix corresponding to the c-th classification task; s_{m,i} is the score of the i-th frame of the m-th video; and f_{m,i} is the feature of the i-th frame of the m-th video.
Optionally, the target features include target detection features, and the respectively performing feature extraction on the images corresponding to the plurality of key frames to obtain a plurality of target features includes:
determining a target area of the image corresponding to each key frame through a target detection algorithm according to the images corresponding to the key frames;
according to each target area, determining candidate detection characteristics corresponding to each key frame, wherein the detection characteristics are used for indicating the height and the width of the corresponding target area and the probability of existence of a person;
And screening the first N candidate detection features from the plurality of candidate detection features to serve as target detection features, wherein the probability corresponding to the target detection features is larger than the probability corresponding to other detection features, and N is an integer larger than or equal to 1.
Optionally, the target features further include two-dimensional image features, and after the determining, according to the images corresponding to the plurality of key frames, the target area of the image corresponding to each key frame through a target detection algorithm, the method further includes:
preprocessing a plurality of target areas, and extracting corresponding target image blocks from each target area;
determining the two-dimensional image features according to a plurality of target image blocks;
the determining the gesture classification result of the video to be processed according to the target features and the sub-videos to be processed includes:
and determining an attitude classification result of the video to be processed according to the target detection features, the two-dimensional image features and the sub-videos to be processed.
Optionally, the determining the pose classification result of the video to be processed according to the target detection features, the two-dimensional image features and the sub-videos to be processed includes:
Determining three-dimensional video characteristics according to the plurality of sub-videos to be processed;
performing fusion processing on the target detection features, the two-dimensional image features and the three-dimensional video features to obtain fusion features;
and carrying out normalization processing on the fusion characteristics to obtain the gesture classification result.
Optionally, before the fusing the plurality of target detection features, the two-dimensional image features and the three-dimensional video features to obtain the fused features, the method further includes:
determining position coding information according to the two-dimensional image characteristics and the three-dimensional video characteristics, wherein the position coding information is used for indicating the positions of the target areas in the key frame images and the positions of the sub-videos to be processed in the videos to be processed;
and determining the original position of the gesture classification result in the video to be processed according to the position coding information.
In a second aspect, the present application provides a gesture classification apparatus comprising:
the acquisition module is used for acquiring the video to be processed and acquiring an image corresponding to each frame in the video to be processed according to the video to be processed.
And the determining module is used for determining a plurality of key frames of the video to be processed according to the image.
And the processing module is used for respectively extracting the characteristics of the images corresponding to the key frames to obtain a plurality of target characteristics.
The processing module is further configured to extract the video to be processed through a plurality of video pipelines to obtain a plurality of sub-videos to be processed, where the frame numbers and the picture sizes of the sampled videos acquired by different video pipelines are different.
The determining module is further configured to determine an pose classification result of the video to be processed according to the multiple target features and the multiple sub-videos to be processed.
Optionally, the acquiring module is further configured to acquire a multi-classification task of gesture classification.
The processing module is further used for classifying the multi-classification tasks to obtain a plurality of classification tasks.
The acquisition module is further used for acquiring a dimension-reduction mapping matrix of each classification task.
The processing module is specifically configured to perform saliency discrimination processing on an image corresponding to each frame according to the plurality of dimension-reduction mapping matrices and the image corresponding to each frame in the video to be processed, so as to obtain a saliency discrimination result of the image corresponding to each frame.
The determining module is specifically configured to determine the plurality of key frames according to the significance discrimination result.
Optionally, the processing module is further configured to perform the saliency discrimination processing on the image corresponding to each frame by using the following formulas:

W_c = LDA(X_c^+, X_c^-)

s_{m,i} = ||W_c^T · f_{m,i}||

wherein X_c^+ is the set of feature representations of each frame corresponding to the target gesture in the c-th classification task; X_c^- is the set of feature representations of each frame corresponding to all other classes of video in the c-th classification task; W_c is the dimension-reduction mapping matrix corresponding to the c-th classification task; s_{m,i} is the score of the i-th frame of the m-th video; and f_{m,i} is the feature of the i-th frame of the m-th video.
Optionally, the determining module is further configured to determine, according to the images corresponding to the plurality of key frames, a target area of the image corresponding to each key frame through a target detection algorithm.
The determining module is specifically configured to determine, according to each target area, a candidate detection feature corresponding to each key frame, where the detection feature is used to indicate a height and a width of the corresponding target area and a size of a probability of existence of a person.
The processing module is specifically configured to screen out first N candidate detection features from a plurality of candidate detection features, as target detection features, where a probability corresponding to the target detection features is greater than a probability corresponding to other detection features, and N is an integer greater than or equal to 1.
Optionally, the processing module is further configured to pre-process the multiple target areas, and extract a corresponding target image block from each target area.
The determining module is specifically configured to determine the two-dimensional image feature according to a plurality of target image blocks.
The determining module is specifically configured to determine an pose classification result of the video to be processed according to the plurality of target detection features, the two-dimensional image features and the plurality of sub-videos to be processed.
Optionally, the determining module is specifically configured to determine a three-dimensional video feature according to the plurality of sub-videos to be processed.
The processing module is further configured to perform fusion processing on the multiple target detection features, the two-dimensional image features and the three-dimensional video features, so as to obtain fusion features.
The processing module is specifically configured to perform normalization processing on the fusion feature to obtain the gesture classification result.
Optionally, the determining module is further configured to determine, according to the two-dimensional image feature and the three-dimensional video feature, position coding information, where the position coding information is used to indicate positions of the multiple target areas in the key frame image and positions of the sub-video to be processed in the video to be processed.
And the determining module is also used for determining the original position of the gesture classification result in the video to be processed according to the position coding information.
In a third aspect, the present application provides a gesture classification apparatus comprising:
a memory;
a processor;
wherein the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the gesture classification method as described in the first aspect and the various possible implementations of the first aspect.
In a fourth aspect, the present application provides a computer storage medium having stored thereon computer-executable instructions that are executed by a processor to implement the gesture classification method as described in the first aspect and the various possible implementations of the first aspect.
According to the gesture classification method provided by the application, the video to be processed is obtained, a plurality of key frames are extracted, and feature extraction is performed on the key frames to obtain a plurality of target features; the video to be processed is sampled through different video pipelines to obtain a plurality of sub-videos to be processed; and the plurality of target features and the plurality of sub-videos to be processed are processed and feature-fused to determine the gesture classification result of the video to be processed. The method preserves the connection among the parts of each image and accurately captures the temporal connection between preceding and following frames of the video, thereby improving the efficiency and accuracy of gesture recognition and classification.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart of a gesture classification method provided by the present application;
FIG. 2 is a second flowchart of the gesture classification method provided by the present application;
FIG. 3 is a schematic structural diagram of the gesture classification device provided by the present application;
FIG. 4 is a schematic structural diagram of the gesture classification equipment provided by the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein.
In embodiments of the application, words such as "exemplary" or "such as" are used to mean examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) related to the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
With the rapid development of computer vision technology and related hardware facilities, human body posture classification technology has been gradually applied to various fields of society, and has exerted important economic and social values. Human body posture is one of important biological characteristics of a human body, and human body posture classification technology has many application scenes, such as: gait analysis, video surveillance, augmented reality, human-machine interaction, finance, entertainment and gaming, sports science, and the like.
When gesture recognition technology is applied to video monitoring, the different gestures of workers in the video can be recognized to determine whether anyone is sleeping on duty, so that production efficiency is not affected. When the gestures of staff in a video are recognized and classified, a gesture classification algorithm is mainly used for the processing, and such an algorithm can be used to detect staff sleeping on duty in central control rooms, duty rooms and the like across industries.
However, although a pre-trained gesture classification model based on public data sets has strong video classification capability, in specific industrial scenes, such as recognizing whether central control room staff are sleeping on duty, the gestures that actually need to be recognized differ greatly from the sleeping actions in the pre-trained model, so the recognition effect is poor, the gesture recognition accuracy is low, and recognition is difficult.
First, the terms related to the present application will be explained.
Multilayer perceptron (Multilayer Perceptron, MLP for short): a feedforward artificial neural network composed of several fully connected layers, commonly used for tasks such as classification and regression. The fundamental component of an MLP is the neuron, the basic unit of artificial neural networks. In an MLP, each neuron accepts a set of inputs, applies linear and nonlinear transformations to them, and passes the result to the neurons of the next layer. In each neuron, the linear transformation is implemented by a weight matrix and a bias vector, and the nonlinear transformation is implemented by an activation function. The fully connected layers of the MLP map the input feature vector to the hidden layers and the output layer, whose numbers of neurons can be set according to the task requirements. After each fully connected layer, operations such as activation functions and batch normalization layers can be added to enhance the expressive power and stability of the model.
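For illustration only, a minimal MLP of the kind described above might be sketched in PyTorch as follows; the layer sizes, activation and number of classes are assumptions chosen for the sketch, not values taken from the application.

```python
import torch
import torch.nn as nn

# Minimal MLP sketch: fully connected layers with a nonlinear activation in between.
class MLP(nn.Module):
    def __init__(self, in_dim=768, hidden_dim=256, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # linear transform: weight matrix + bias vector
            nn.ReLU(),                       # nonlinear activation
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

features = torch.randn(8, 768)   # a batch of 8 illustrative feature vectors
logits = MLP()(features)         # shape: (8, 4)
```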
Linear Discriminant Analysis (LDA for short): a classical linear classification method whose basic idea is to map high-dimensional data into a low-dimensional space and then set a threshold to separate the classes.
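As a hedged illustration of LDA-style dimension reduction (the synthetic data, feature dimension and use of scikit-learn are assumptions, not part of the application):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two synthetic classes of 64-dimensional frame features (illustrative data only).
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=1.0, size=(100, 64))   # frames of the target pose
X_neg = rng.normal(loc=0.0, size=(100, 64))   # frames of all other poses
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 100 + [0] * 100)

# LDA learns a projection that separates the two classes while reducing the dimension.
lda = LinearDiscriminantAnalysis(n_components=1)
X_low = lda.fit_transform(X, y)               # shape: (200, 1)
```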
Target detection algorithm: the target detection algorithm is an algorithm for extracting and storing image features of a target object, detecting detected images with the same image features in other input images through the stored image features, and outputting the position, the size and the similarity of the detected images and the target object. The target detection is a core task of computer vision and can be applied to practical fields of robot navigation, intelligent monitoring, industrial detection and the like. The task is to find all the objects of interest in the image and determine their category and location.
RGB channels: the channels that hold image color information. R, G and B stand for red, green and blue, the three primary colors. In RGB color mode, an image is a composite of the R, G and B channels. Each image has one or more color channels; the default number of color channels depends on the image's color mode, i.e., the color mode of an image determines its number of color channels.
softmax function: a function that converts a vector of real-valued scores into a probability distribution; it is in effect the gradient log-normalization of a finite discrete probability distribution. The softmax function is therefore widely used in probability-based multi-classification methods, including multinomial logistic regression, multi-class linear discriminant analysis, naive Bayes classifiers and artificial neural networks.
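A small numeric sketch of the softmax function (the input scores are arbitrary illustrative values):

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, exponentiate, then normalize.
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # approx. [0.659, 0.242, 0.099]; sums to 1
```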
In view of the above, the present application provides a gesture classification method.
Next, an implementation scenario according to the present application will be described.
The existing technical solutions for recognizing sleeping-on-duty gestures mainly adopt pose classification models based on convolutional neural networks and pose classification models based on Transformers.
1. A gesture classification model based on a convolutional neural network, such as the SlowFast model, is adopted. A slow pathway runs at a low frame rate to capture spatial semantics, while a fast pathway runs at a high frame rate to capture motion at fine temporal resolution. The fast pathway can be made very lightweight by reducing its channel capacity, yet it still learns temporal information that is useful for video recognition. This model achieves good performance in video action classification and detection.
2. A Transformer-based pose classification model, such as the TimeSformer model, is adopted. Conventional video classification models use 3D convolution kernels to extract features, whereas TimeSformer uses a Transformer-based self-attention mechanism that acquires the semantics of each image block by comparing it with the other image blocks in the video, so that local dependencies between neighboring image blocks and global dependencies between distant image blocks can be captured simultaneously, enabling it to capture spatio-temporal dependencies across the entire video. This model treats the input video as a spatio-temporal sequence of image blocks extracted from the individual frames, so that a Transformer can be applied to video.
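The following sketch illustrates the general idea of treating a video as a sequence of image-block tokens and applying self-attention over them; it is a simplified illustration, not the TimeSformer model itself, and the patch size, embedding width and number of heads are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative only: turn a short video into patch tokens and apply self-attention.
B, T, C, H, W = 1, 8, 3, 224, 224
patch = 16
video = torch.randn(B, T, C, H, W)

# Split every frame into non-overlapping 16x16 patches and embed each patch.
patches = video.unfold(3, patch, patch).unfold(4, patch, patch)          # B,T,C,14,14,16,16
tokens = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, -1, C * patch * patch)
embedded = nn.Linear(C * patch * patch, 768)(tokens)                      # B, 1568, 768

# Self-attention lets every patch token attend to every other patch token
# across both space and time, capturing local and global dependencies.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
out, _ = attn(embedded, embedded, embedded)                               # B, 1568, 768
```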
Gesture classification is mostly used to detect staff sleeping on duty in central control rooms, duty rooms and the like across industries. For example, a server configured with a gesture classification algorithm can recognize and classify the gestures of workers in a surveillance video to determine whether any worker in the current video is sleeping on duty.
According to the gesture classification method provided by the application, the video to be processed, which requires gesture classification, is processed to obtain a plurality of video frame images; features are extracted from these video frame images to obtain person detection features and person image features; the video to be processed is also processed with different video pipelines, and features are extracted from the processed videos to obtain video features; the obtained person detection features, person image features and video features are then fused and normalized to determine the current gesture classification result. The method preserves the connection among the parts of each image, accurately captures the temporal connection between preceding and following frames of the video, improves the efficiency and accuracy of gesture recognition and classification, avoids information loss, and reduces the complexity of data processing.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of gesture classification according to an embodiment of the present application. The execution body of the embodiment may be, for example, a server. As shown in fig. 1, the gesture classification method provided in this embodiment includes:
s101: and acquiring a video to be processed, and acquiring an image corresponding to each frame in the video to be processed according to the video to be processed.
The video to be processed is used for indicating the activity record of the staff in the specific area received by the server.
It is understood that a video is made up of still pictures, which are referred to as frames. Because of the special physiological structure of the human eye, pictures viewed at a frame rate higher than 16 frames per second are perceived as continuous, a phenomenon called persistence of vision.
And acquiring the activity record video of the staff in the specific area, and extracting the image corresponding to each frame in the currently acquired video.
For example, if the duration of the current acquired video is 1 minute and the frame rate of the current video is 24 frames/second, the current video is extracted to obtain 1440 frame images.
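A sketch of how the per-frame images might be extracted; the use of OpenCV and the file path are assumptions made for illustration, not part of the application.

```python
import cv2

# Read every frame of the video to be processed (the path is hypothetical).
cap = cv2.VideoCapture("workstation_monitoring.mp4")
frames = []
while True:
    ok, frame = cap.read()        # returns False once the video ends
    if not ok:
        break
    frames.append(frame)          # one BGR image per frame
cap.release()

# For a 1-minute clip at 24 frames/second this yields about 1440 images.
print(len(frames))
```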
S102: and determining a plurality of key frames of the video to be processed according to the image.
The key frames are the frames that maximize the minimum classification gap among the currently acquired video frame images, and there can be multiple key frames.
It will be appreciated that video carries more information than a single image, but a sequence contains a great deal of redundant information, which can be compressed by extracting key frames. Industrial surveillance video differs from video shot under ordinary conditions: in ordinary shooting both background and foreground change continuously, whereas in surveillance video the environment changes little and the main changes come from people's activities. Key frames are commonly extracted as frames where the scene changes or frames with large changes in person activity, but this criterion is not suitable for gesture recognition. Taking sleeping-on-duty recognition as an example, a frame with a large motion change may be exactly the wrong frame. For gesture recognition, therefore, a key frame is defined as a selected video frame image that is distinguishable from the video frame images having other gestures.
According to the image corresponding to each frame extracted from the video to be processed, distinguishing each frame image from other video frame images; and determining a plurality of key frames of the video to be processed according to the distinguishing processing result.
For example, if the 1st, 2nd, 3rd and 4th frame images among the currently acquired video frame images are sleeping images, discrimination processing is performed on these video frame images; if the 2nd frame is the most distinguishable from the other frame images, i.e., it maximizes the minimum classification gap with respect to them, the 2nd frame image is determined to be a key frame. Other key frames are determined from the other acquired video frame images in the same way.
S103: and respectively extracting the characteristics of the images corresponding to the key frames to obtain a plurality of target characteristics.
The target features are used for indicating human body space semantic information extracted from the video to be processed.
After a plurality of key frames are determined, extracting features of the image corresponding to each key frame to obtain a plurality of target features, wherein the extracted features can reflect the human information in the image corresponding to the key frames.
It will be appreciated that the purpose of gesture classification is to determine the gesture of a worker in a video to be processed, and thus, the extraction of the target features is primarily directed to the worker present in the video to be processed.
For example, if the video frame image corresponding to the current key frame is the 2 nd frame image, feature extraction is performed on the 2 nd frame image, and the obtained feature information is the target feature.
S104: and extracting the video to be processed through a plurality of video pipelines to obtain a plurality of sub-videos to be processed, wherein the frame numbers and the picture sizes of the sampled videos acquired by different video pipelines are different.
The video pipeline is used for indicating a video processing tool for sampling the video to be processed.
It can be understood that a video pipeline is mainly used to extract video from a certain area of the video stream. The number of frames and the image area that a video pipeline can collect are represented by its kernel shape (T×H×W), where T is the number of collectable frames (i.e., the time dimension), H is the height of the collectable image area, and W is the width of the collectable image area. Different video pipelines have different kernel shapes. The frequency with which a video pipeline samples the video to be processed is determined by the spatio-temporal strides (Ts, Hs, Ws) of the kernel, which indicate how the collectable frame range and image area move during sampling.
Extracting the video to be processed through a plurality of different video pipelines to obtain a plurality of sub-videos to be processed; the sub-video to be processed is a video stream of a certain area in the video to be processed, and the frame number of the extracted video stream is consistent with the time dimension of the currently selected video pipeline.
The method for extracting the sampling video in the 3D space by the video pipeline is as follows:
the video pipeline has a kernel shape (T×H×W), a kernel space-time stride (Ts, hs, ws), and a convolution start point offset (x, y, z).
For example, the following 4 video acquisition pipelines may be used, each configured as follows:
(1) kernel 8×8×8, stride (16, 32, 32);
(2) kernel 16×4×4, stride (6, 32, 32), with the offset automatically adjusted;
(3) kernel 4×12×12, stride (16, 32, 32), offset (0, 16, 16);
(4) kernel 1×16×16, stride (32, 16, 16).
For input data of size 32×224×224, this processing yields only 559 tokens, far fewer than the number of tokens used by the TimeSformer model. Taking the first pipeline as an example, its T×H×W is 8×8×8, sampling restarts every 16 frames, and the sampling interval in the H and W directions is 32.
Compared with the other three pipelines, the second pipeline samples more densely in the time dimension, can better identify small pixel targets and can better identify the gesture of the human body, so the second pipeline is used to identify the pixels near the human body over the full time range; by automatically adjusting the offset of the second pipeline, its sampling area is kept as close to the human body area as possible.
The offset can be automatically adjusted according to the person image corresponding to the extracted key frame.
From the kernel shape 16×4×4 and the spatio-temporal stride 6×32×32 of the second video pipeline, the available starting points have offsets x ranging from 0 to 15 and y and z ranging from 0 to 3, with y = z.
Target region coordinates bb_1 to bb_n whose person probability is greater than 0.5, together with the corresponding probabilities p_1 to p_n, are extracted from the key frame of the person image. The optimal offsets of y and z are calculated by the following formula:

(1)

where w = 1 when (y, z) ∈ S(x) and (y, z) ∈ S_bb(n); w = 0 when (y, z) ∈ S(x) and (y, z) ∉ S_bb(n); S(x) is the set of points passed through after applying an offset of x with a step size of (4×4), and S_bb(n) is the point set of the n-th target area.
The optimal offset of x should be such that the key frame of the person image falls within the sampling range of the second video pipeline, which gives the following calculation formula:

(2)

where i is an integer greater than 0, x is an integer satisfying formula (2), and k is the sequence number of the key frame of the person image.
For example, if the current video to be processed has 1440 frames and a picture size of 640×480, and the sampling process uses a video pipeline with a kernel shape of 8×8×8, one of the resulting video streams may consist of frames 1 to 8 over the image area (0, 0) to (7, 7).
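As an illustrative sketch only, a video pipeline of this kind can be viewed as strided slicing of the video tensor; the helper below is hypothetical and uses the first pipeline configuration listed above.

```python
import torch

def sample_tubes(video, kernel, stride, offset=(0, 0, 0)):
    """Hypothetical tube sampler: video is (T, H, W, C); kernel/stride/offset are (t, h, w)."""
    T, H, W, _ = video.shape
    kt, kh, kw = kernel
    st, sh, sw = stride
    ot, oh, ow = offset
    tubes = []
    for t0 in range(ot, T - kt + 1, st):
        for h0 in range(oh, H - kh + 1, sh):
            for w0 in range(ow, W - kw + 1, sw):
                tubes.append(video[t0:t0 + kt, h0:h0 + kh, w0:w0 + kw])
    return tubes

video = torch.randn(32, 224, 224, 3)
# First pipeline listed above: kernel 8x8x8, stride (16, 32, 32), no offset.
tubes = sample_tubes(video, kernel=(8, 8, 8), stride=(16, 32, 32))
print(len(tubes), tubes[0].shape)   # 98 tubes, each of shape (8, 8, 8, 3)
```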
S105: and determining an attitude classification result of the video to be processed according to the target features and the sub-videos to be processed.
The gesture classification result is used for indicating probability distribution of the gesture of the person in the video to be processed.
Extracting corresponding target features from the sub-videos to be processed according to the plurality of sub-videos to be processed; processing and calculating the target characteristics extracted from the sub-video to be processed according to the target characteristics obtained by utilizing the key frame image, and fusing the target characteristics, so that the accuracy of a processing result is improved; and determining an attitude classification result of the video to be processed according to the processing result.
For example, the multiple target features of the key frame image and the target features extracted from the sub-video to be processed are jointly calculated and fused, and according to the final result, the corresponding possibility of multiple gestures recognized by the currently adopted gesture classification method can be obtained, the possibility can be represented by probability, and the sum of the probabilities of all the gestures is 1, wherein the target features extracted from the sub-video to be processed can be vectors with length of 768, for example.
According to the gesture classification method provided by this embodiment, the video to be processed is obtained, a plurality of key frames are extracted, and feature extraction is performed on the key frames to obtain a plurality of target features; the video to be processed is sampled through different video pipelines to obtain a plurality of sub-videos to be processed; and the plurality of target features and the plurality of sub-videos to be processed are processed and feature-fused to determine the gesture classification result of the video to be processed. The method preserves the connection among the parts of each image and accurately captures the temporal connection between preceding and following frames of the video, thereby improving the efficiency and accuracy of gesture recognition and classification.
Fig. 2 is a flowchart second of a gesture classification method according to an embodiment of the present application. As shown in fig. 2, the present embodiment describes in detail a gesture classification method based on the embodiment of fig. 1, and the gesture classification method shown in the present embodiment includes:
s201: and acquiring a video to be processed, and acquiring an image corresponding to each frame in the video to be processed according to the video to be processed.
Step S201 is similar to step S101 described above, and will not be described again.
S202: and acquiring multi-classification tasks of gesture classification.
The multi-classification task is used for indicating the category of performing gesture classification on the video to be processed currently, for example, the multi-classification task set by the current gesture recognition method may be: sleep, sit, talk, walk.
And acquiring the gestures existing in the current application scene from the historical data in the scene, and determining the acquired multiple gestures as multi-classification tasks for classifying the gestures of the current video to be processed.
It will be appreciated that the multi-classification task is a goal of performing gesture classification on the current video to be processed, and the purpose of gesture classification is to determine which classification the video to be processed belongs to. According to the historical data acquired in the current application scene, the historical data is subjected to gesture classification, according to the historical data subjected to gesture classification, the possible gestures in the current application scene are determined, and the determined multiple gestures are used as multi-classification tasks of the current gesture classification.
S203: and carrying out classification processing on the multi-classification tasks to obtain a plurality of classification tasks, and obtaining a dimension reduction mapping matrix of each classification task.
Here, a classification task indicates that one gesture of the multi-classification task is taken as the target class while all the other gestures are grouped together as the other class; the dimension-reduction mapping matrix indicates the result of processing the feature information corresponding to a classification task with a linear discriminant algorithm.
Decomposing the multi-classification task means taking each gesture in turn as the target gesture and grouping the classification tasks of all the other gestures together, thereby obtaining a plurality of classification tasks; each of the obtained classification tasks is then processed with the LDA algorithm to obtain its dimension-reduction mapping matrix.
It can be appreciated that the dimension-reduction mapping matrix can be obtained by pre-training the historical data acquired in the current application scene; the dimension reduction mapping matrix is not influenced by the currently acquired video to be processed and is only related to corresponding action characteristics; and the characteristic information of different postures is different, the corresponding dimension-reducing mapping matrixes are also different, the characteristic information corresponding to the same posture is the same, and the corresponding dimension-reducing mapping matrixes are also the same. The dimension reduction mapping matrix corresponding to each gesture in the multi-classification task can be determined by acquiring classified video data in a historical database, acquiring a corresponding feature information set of each classification according to the historical classification data, and respectively processing the acquired feature information sets. In order to determine the difference between each frame in the video, we use LDA technique to distinguish between videos from different categories, since LDA can not only perform dimension reduction, but also retain as much category discrimination information as possible; the LDA can process the classified data and can also process the dimension reduction operation of the multi-classified data.
For example, if the multi-classification task set by the current gesture recognition method is [sleep, sit, talk, walk], the classification tasks obtained by the decomposition are [sleep, other], [sit, other], [talk, other] and [walk, other], where the "other" category of each classification task refers to the set of all gesture classification tasks except the classification task of the target gesture; linear discriminant analysis is performed on the "other" class of each classification task to obtain the corresponding dimension-reduction mapping matrix, and the dimension-reduction mapping matrix of the task corresponding to the target gesture can be obtained directly.
S204: and carrying out significance discrimination processing on the images corresponding to each frame according to the plurality of dimension reduction mapping matrixes and the images corresponding to each frame in the video to be processed, and obtaining a significance discrimination result of the images corresponding to each frame.
Wherein the saliency discrimination is used to determine a frame image that maximizes the minimum classification gap.
According to the image corresponding to each frame in the video to be processed, a corresponding feature set of each frame can be determined, after the dimension reduction mapping matrixes corresponding to the classification tasks of the target gestures and the classification tasks of other gestures in each classification task are obtained, the dimension reduction mapping matrixes corresponding to each classification task and the feature set of each frame which is currently obtained are respectively substituted into the scoring function by using the scoring function of the frame to calculate, and the obtained calculation result is the significance discrimination result of the image corresponding to each frame.
It can be understood that the target gesture is the recognition target of the corresponding classification task. For example, the current classification task may be [sleeping on duty, other], in which case the target gesture of the current classification task is "sleeping on duty".
Specifically, according to the plurality of dimension-reduction mapping matrices, performing saliency discrimination processing on the image corresponding to each frame, including:
the saliency discrimination processing is performed on the image corresponding to each frame by using the following formulas:

W_c = LDA(X_c^+, X_c^-)    (3)

s_{m,i} = ||W_c^T · f_{m,i}||    (4)

wherein X_c^+ is the set of feature representations of each frame corresponding to the target gesture in the c-th classification task; X_c^- is the set of feature representations of each frame corresponding to all other classes of video in the c-th classification task; W_c is the dimension-reduction mapping matrix corresponding to the c-th classification task; s_{m,i} is the score of the i-th frame of the m-th video; and f_{m,i} is the feature of the i-th frame of the m-th video.
It can be appreciated that if the number of classification poses of the current multi-classification task is c, then X_1^+, …, X_c^+ are the sets of feature representations of each frame corresponding to the target gesture of each of the c classification tasks, and X_1^-, …, X_c^- are the corresponding sets for all the other classes; the c dimension-reduction mapping matrices W_1, …, W_c respectively corresponding to the c classification tasks are obtained by linear discriminant analysis, i.e., calculated using formula (3).
Each frame is then scored using the scoring function: the c dimension-reduction mapping matrices obtained above are taken one by one and substituted into formula (4) to calculate the score of each frame, and saliency is judged from the score; the greater the score, the stronger the saliency.
S205: and determining a plurality of key frames according to the significance judging result.
And respectively calculating the scoring value corresponding to each frame in each classification task according to the scoring function of the frame, screening out a plurality of video frames with higher scoring values according to the scoring values, and determining the video frames as a plurality of key frames.
It will be appreciated that each class has its own projection vector, and scoring each frame by its projected distance yields the frames that maximize the minimum classification gap. From the viewpoint of dimension-reduction projection, the farther a point lies along the projection, the more likely it is to be a point with distinct classification characteristics.
For example, if the current classification task is [sleeping on duty, other], X_c^+ is the set of feature representations corresponding to each frame of the "sleeping on duty" videos and X_c^- is the set of feature representations corresponding to each frame of the "other" videos. Substituting X_c^+ and X_c^- into formula (3) for linear discriminant analysis yields the corresponding dimension-reduction mapping matrix W_c; substituting W_c into formula (4) scores each frame under the current two-class task. If the score corresponding to the 2nd frame image is the highest, i.e., s_{m,2} is the largest for the current classification task, the 2nd frame may be determined to be a key frame.
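A sketch of the key-frame scoring idea described by formulas (3) and (4); the use of scikit-learn's LDA, the synthetic features and scoring by projection magnitude are assumptions made for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def score_frames(frame_features, X_pos, X_neg):
    """Score each frame of one video by the magnitude of its LDA projection (assumed scoring)."""
    X = np.vstack([X_pos, X_neg])
    y = np.array([1] * len(X_pos) + [0] * len(X_neg))
    lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)   # dimension-reduction matrix, as in formula (3)
    return np.abs(lda.transform(frame_features)).ravel()         # projected distance per frame, as in formula (4)

rng = np.random.default_rng(1)
X_pos = rng.normal(1.0, size=(200, 64))        # frames labelled with the target gesture
X_neg = rng.normal(0.0, size=(200, 64))        # frames of all other gestures
frames = rng.normal(0.5, size=(1440, 64))      # features of each frame of the video to be processed
scores = score_frames(frames, X_pos, X_neg)
key_frame = int(np.argmax(scores))             # frame with the strongest saliency
```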
S206: and determining a target area of the image corresponding to each key frame through a target detection algorithm according to the images corresponding to the key frames.
The target area indicates an area delimited on the image corresponding to a key frame that contains a person.
After the plurality of key frames are determined, the images corresponding to them are acquired, and a plurality of target areas are delimited on these images by the target detection algorithm, where the probability that the image information in each target area belongs to a person is greater than 0.65.
It can be understood that if the person occupies a large proportion of the image corresponding to the current key frame, the acquired target area also occupies a large proportion of the whole image, and vice versa.
For example, the size of the image corresponding to the currently acquired key frame is 640×480, and the target detection algorithm is applied to each of the plurality of key frames; if the 2nd frame is a key frame, detection is performed on the image corresponding to the 2nd frame, and the extracted target areas may be (0, 0) to (16, 16) and (61, 51) to (630, 470).
S207: and determining candidate detection features corresponding to each key frame according to each target area, wherein the detection features are used for indicating the height and width of the corresponding target area and the size of the probability of existence of people.
The candidate detection features indicate the feature information extracted from all of the currently acquired target areas; the detection features are related to the candidate detection features in that each detection feature belongs to the candidate detection features.
And extracting data from the obtained multiple target areas to obtain detection features of the target areas corresponding to each key frame, namely obtaining width features and height features of the corresponding target areas and the probability of existence of people, and determining the width features and the height features as candidate detection features.
For example, if the currently acquired target areas are (0, 0) to (16, 16) and (61,51) to (630,470), the target areas are (0, 0) to (16, 16) and the corresponding width feature is 17 pixels, the height feature is 17 pixels and the probability of existence of a person is 0.8; the target areas (61,51) to (630,470) have a width of 570 pixels, a height of 420 pixels, and a probability of a person being present of 0.85.
S208: the first N candidate detection features are screened out from the candidate detection features and serve as target detection features.
The target detection features are used for indicating detection features for carrying out gesture classification calculation, the probability corresponding to the target detection features is larger than the probability corresponding to other detection features, and N is an integer larger than or equal to 1.
And screening out the first N candidate detection features which are arranged from large to small according to the probability of the existence of the person in the candidate detection features corresponding to the target areas, and taking the first N candidate detection features as target detection features.
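A sketch of turning detection boxes into candidate detection features (height, width, person probability) and keeping the top N by probability; the box values are the illustrative ones from the example above, and the helper code itself is hypothetical.

```python
# Each detection: (x1, y1, x2, y2, person_probability) -- values from the example above.
detections = [
    (0, 0, 16, 16, 0.80),
    (61, 51, 630, 470, 0.85),
]

# Candidate detection feature = (height, width, probability that a person is present).
candidates = [(y2 - y1 + 1, x2 - x1 + 1, p) for x1, y1, x2, y2, p in detections]

# Keep the N candidates with the highest person probability as target detection features.
N = 1
target_detection_features = sorted(candidates, key=lambda f: f[2], reverse=True)[:N]
print(target_detection_features)    # [(420, 570, 0.85)]
```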
It can be understood that screening candidate detection features is to screen target areas, and the screening process can be divided into the following two types according to the number of target areas:
if the number of the currently acquired target areas is greater than N, N target areas are sequentially selected according to the sequence of the sequence from large to small according to the probability of people in the areas.
If the number of the currently acquired target areas is smaller than N, screening is carried out from the character images corresponding to the secondary key frames, wherein the specific screening method is as follows:
(5)

(6)

where i ∈ (1, 2, …, N), d is the video frame interval, F is the total number of video frames of the video to be processed, k is the number of the key frame of the person image, and the N−1 secondary key frames are obtained from this calculation.
In this way, the N target areas whose image content has the highest probability of being a person can be extracted from the plurality of person images corresponding to the plurality of key frames and the person images corresponding to the secondary key frames.
S209: preprocessing a plurality of target areas, and extracting corresponding target image blocks from each target area.
The target image block is used for indicating a new image obtained after the target area is extracted from the corresponding image.
Preprocessing the N target areas according to the screened N target areas to ensure that the sizes of the current N target areas are consistent, and converting the current N target areas into RGB three-channel images; and dividing the N preprocessed images to obtain a plurality of target image blocks with the same size.
For example, using the torch.reshape function, the N images are converted into RGB three-channel images with a length and width of 112×112; each image is then divided into image blocks of size 16×16, so a 112×112 image can be divided into 112×112/(16×16), i.e., 49 image blocks.
S210: two-dimensional image features are determined from a plurality of target image blocks.
Wherein the two-dimensional image features are used to indicate character features extracted for the target region.
Based on the obtained plurality of target image blocks, the feature of each target image block is extracted and mapped into a vector of the same dimension as the number of image pixels through one Linear Projection layer.
For example, a 112×112 image gives 112×112/(16×16), i.e., 49 image blocks, and each image block has 16×16×3 = 768 values; the image block data in the form of a [16, 16, 3] tensor is then mapped by the linear projection into a vector of length 768.
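An illustrative sketch of the preprocessing and linear projection described above, consistent with the 112×112 image and 16×16 block example; the specific tensor operations and layer are assumptions.

```python
import torch
import torch.nn as nn

# One target region resized to a 112x112 RGB image (illustrative tensor).
region = torch.randn(3, 112, 112)

# Split the image into 49 non-overlapping 16x16 blocks.
blocks = region.unfold(1, 16, 16).unfold(2, 16, 16)                  # 3, 7, 7, 16, 16
blocks = blocks.permute(1, 2, 0, 3, 4).reshape(49, 3 * 16 * 16)      # 49 blocks of 768 values

# The linear projection maps every 768-value block to a 768-dimensional vector.
projection = nn.Linear(768, 768)
two_d_features = projection(blocks)                                   # shape: (49, 768)
```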
S211: and extracting the video to be processed through a plurality of video pipelines to obtain a plurality of sub-videos to be processed, wherein the frame numbers and the picture sizes of the sampled videos acquired by different video pipelines are different.
Step S211 is similar to step S104 described above, and will not be described again.
S212: and determining an attitude classification result of the video to be processed according to the target detection features, the two-dimensional image features and the sub-videos to be processed.
Extracting corresponding target features from the sub-videos to be processed according to the plurality of sub-videos to be processed; processing and calculating the target detection features and the two-dimensional image features which are acquired by utilizing the key frame images and the target features extracted from the sub-video to be processed, and fusing the target features to each other to improve the accuracy of a processing result; and determining an attitude classification result of the video to be processed according to the processing result.
Specifically, determining an attitude classification result of the video to be processed according to the plurality of target detection features, the two-dimensional image features and the plurality of sub-videos to be processed includes:
determining three-dimensional video characteristics according to a plurality of sub-videos to be processed; performing fusion processing on the multiple target detection features, the two-dimensional image features and the three-dimensional video features to obtain fusion features; and carrying out normalization processing on the fusion characteristics to obtain an attitude classification result.
The three-dimensional video features indicate the human-body spatial semantic information obtained from the sub-videos extracted by the video pipelines, and the fusion features indicate the result of processing the plurality of target detection features, the two-dimensional image features and the three-dimensional video features through fully connected layers.
The fusion processing is performed on the multiple target detection features, the two-dimensional image features and the three-dimensional video features to obtain fusion features, which may be, for example:
1. The plurality of target detection features take the form of an N×3 matrix, in which each row corresponds to one detected person and contains the height H of the person target area, the width W of the target area, and the probability that a person is present in the target area.
First, the matrix is reshaped to 1×N×3 using a flatten operation, and the output x is then obtained by passing it through a fully connected layer with 1 neuron.
2. The two-dimensional image feature and the three-dimensional video feature are each vectors of length 768; the outputs y and z are obtained by passing them through fully connected layers of 384 neurons respectively. Then x, y and z are concatenated into a vector of length 768+N, which is passed through a fully connected layer of c neurons and output through a softmax function.
Finally, the gestures are classified and output using the following softmax formula:
p_i = exp(o_i) / Σ_{k=1}^{c} exp(o_k)    (7)
where o_i is the i-th output of the fully connected layer of c neurons. Using this formula, the probability of each gesture in the video to be processed for each class of the multi-classification task (sleeping on duty, sitting, talking and walking) is obtained.
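The following PyTorch sketch illustrates the fusion and classification head described in points 1 and 2 above. The class name FusionHead and the exact layer wiring are illustrative assumptions, chosen to be consistent with the stated layer sizes (a 1-neuron layer over the detection matrix, 384-neuron layers for the image and video features, and a c-neuron output layer with softmax).

```python
import torch
from torch import nn

# Hedged sketch (assumption): fuse the N x 3 detection matrix, the 768-d
# two-dimensional image feature and the 768-d three-dimensional video
# feature, then classify into c pose classes with a softmax.
class FusionHead(nn.Module):
    def __init__(self, n: int, c: int):
        super().__init__()
        self.det_fc = nn.Linear(3, 1)      # 1 neuron applied to each (H, W, prob) row -> x
        self.img_fc = nn.Linear(768, 384)  # y
        self.vid_fc = nn.Linear(768, 384)  # z
        self.cls = nn.Linear(768 + n, c)   # classifier over the spliced vector

    def forward(self, det, img_feat, vid_feat):
        x = self.det_fc(det).flatten(start_dim=-2)    # [N, 1] -> [N]
        y = self.img_fc(img_feat)                      # [384]
        z = self.vid_fc(vid_feat)                       # [384]
        fused = torch.cat([x, y, z], dim=-1)           # [768 + N]
        return torch.softmax(self.cls(fused), dim=-1)  # pose probabilities

head = FusionHead(n=5, c=4)                            # e.g. 4 classes as listed above
probs = head(torch.rand(5, 3), torch.rand(768), torch.rand(768))
```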
Optionally, before the fusing the plurality of target detection features, the two-dimensional image features and the three-dimensional video features to obtain the fused features, the method further includes:
determining position coding information according to the two-dimensional image characteristics and the three-dimensional video characteristics, wherein the position coding information is used for indicating the positions of the target areas in the images corresponding to the key frames and the positions of the sub-videos to be processed in the videos to be processed; and determining the original position of the gesture classification result in the video to be processed according to the position coding information.
When the two-dimensional image features and the three-dimensional video features are fused, adding the position coding information avoids the information loss that would be caused by simply combining the image and video information, reduces the complexity of data processing, and leaves the output dimension of the fused features unchanged.
The specific way of determining the position-coding information is as follows:
for two-dimensional image blocks, the position-coding information is determined by two factors:
1. the position of the person block in the source image, i.e., the patch sequence number G_s of the center of the person target area within the image corresponding to the key frame;
2. the patch sequence number L_s of the image block within the person target area.
The whole position coding information is generated by splicing the two pieces of information, namely:
(8)
(9)
(10)
(11)
where t is a constant hyperparameter, here set to 10000; j takes values from 0 to d/4, and d is the feature dimension.
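Since formulas (8)-(11) did not survive extraction, the following LaTeX sketch gives a standard transformer-style sinusoidal encoding that is consistent with the stated parameters (t = 10000, j up to d/4, and the two spliced indices G_s and L_s); it is an assumption, not necessarily the patent's exact formulas.

```latex
% Hedged sketch of formulas (8)-(11): sinusoidal encoding of G_s and L_s,
% with t = 10000 and j = 0, 1, ..., d/4 - 1 (d = feature dimension).
\begin{aligned}
PE(G_s, 4j)   &= \sin\!\left(G_s / t^{\,4j/d}\right), &
PE(G_s, 4j+1) &= \cos\!\left(G_s / t^{\,4j/d}\right), \\
PE(L_s, 4j+2) &= \sin\!\left(L_s / t^{\,4j/d}\right), &
PE(L_s, 4j+3) &= \cos\!\left(L_s / t^{\,4j/d}\right).
\end{aligned}
```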
For sampled video, the position-coding information is determined by three factors:
1. the offset L of the video pipeline's center point, along the length dimension, relative to the video starting point;
2. the offset W of the video pipeline's center point, along the width dimension, relative to the video starting point;
3. the offset T of the video pipeline's center point, in frames, relative to the starting frame of the video.
The whole position coding information is generated by splicing three parts, namely:
(12)
(13)
(14)
(15)
(16)
where t is a constant hyperparameter, here set to 10000; j takes values from 0 to d/6, and d is the feature dimension.
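Similarly, a hedged Python sketch of the pipeline position code, splicing sinusoidal terms for the three offsets L, W and T with t = 10000 and j up to d/6; the exact form of formulas (12)-(16) is not recoverable here, so this is an illustrative assumption rather than the patent's formulas.

```python
import numpy as np

# Hedged sketch (assumption): position code for one video pipeline, built from
# sinusoidal terms of its length offset L, width offset W and frame offset T,
# spliced into a vector of roughly dimension d (t = 10000, j up to d/6).
def pipeline_position_code(L: float, W: float, T: float, d: int, t: float = 10000.0):
    parts = []
    for offset in (L, W, T):
        j = np.arange(d // 6)
        angle = offset / t ** (6 * j / d)
        parts.append(np.concatenate([np.sin(angle), np.cos(angle)]))
    return np.concatenate(parts)   # length 6 * (d // 6)
```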
According to the gesture classification method provided by the embodiment, the video to be processed is obtained, and a plurality of key frames are extracted; extracting a plurality of key frames based on a target detection algorithm to obtain a plurality of target areas; extracting features of the multiple target areas to obtain multiple target detection features; preprocessing a plurality of target areas, and extracting a plurality of target image blocks; extracting features of a plurality of target image blocks to obtain two-dimensional image features; extracting the video to be processed through different video pipelines to obtain a plurality of sub-videos to be processed, and extracting the characteristics of the obtained sub-videos to be processed to obtain three-dimensional video characteristics; and carrying out fusion processing on the plurality of target detection features, the two-dimensional image features and the three-dimensional video features to obtain fusion features, and carrying out normalization processing on the fusion features to obtain an attitude classification result. The method ensures the connection among the parts of the image, ensures the accurate identification of the connection among frames before and after the video time, improves the efficiency and accuracy of gesture identification and classification, avoids information loss and reduces the complexity of data processing.
Fig. 3 is a schematic structural diagram of an attitude classification device according to the present application. As shown in fig. 3, the present application provides a posture classification device 300 including:
the obtaining module 301 is configured to obtain a video to be processed, and obtain an image corresponding to each frame in the video to be processed according to the video to be processed.
A determining module 302, configured to determine a plurality of key frames of the video to be processed according to the image.
And the processing module 303 is configured to perform feature extraction on the images corresponding to the plurality of key frames, so as to obtain a plurality of target features.
The processing module 303 is further configured to extract the video to be processed through a plurality of video pipelines to obtain a plurality of sub-videos to be processed, where the frame numbers and the picture sizes of the sampled videos acquired by different video pipelines are different.
The determining module 302 is further configured to determine an pose classification result of the video to be processed according to the multiple target features and the multiple sub-videos to be processed.
Optionally, the acquiring module 301 is further configured to acquire a multi-classification task of gesture classification.
The processing module 303 is further configured to perform classification processing on the multi-classification task to obtain a plurality of classification tasks.
The obtaining module 301 is further configured to obtain a dimension-reduction mapping matrix for each task classified.
The processing module 303 is specifically configured to perform saliency discrimination processing on an image corresponding to each frame according to the plurality of dimension-reduction mapping matrices and the image corresponding to each frame in the video to be processed, so as to obtain a saliency discrimination result of the image corresponding to each frame.
The determining module 302 is specifically configured to determine the plurality of keyframes according to the significance determination result.
Optionally, the processing module 303 is further configured to perform saliency discrimination processing on the image corresponding to each frame by using the following formula:
wherein the symbols in the formula denote, respectively: the set of feature representations of each frame corresponding to the target gesture in the c-th classification task; the set of feature representations of each frame corresponding to all other classes of videos in the c-th classification task; the dimension-reduction mapping matrix corresponding to the c-th classification task; the score of the i-th frame of the m-th video; and the feature of the i-th frame of the m-th video.
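Because the saliency formula itself is not legible in this text, the sketch below shows one plausible scoring rule consistent with the quantities listed above: project the frame feature with the task's dimension-reduction matrix and compare its distance to the target-gesture features against its distance to the other-class features. The function name and the distance-based rule are assumptions, not the patent's exact formula.

```python
import numpy as np

# Hedged sketch (assumption): saliency score of frame i of video m in
# classification task c. A larger score means the frame is more salient
# for the target gesture; this is an illustration, not the patent's formula.
def frame_saliency(f_mi, W_c, target_feats, other_feats):
    z = W_c @ f_mi
    d_target = np.mean([np.linalg.norm(z - W_c @ g) for g in target_feats])
    d_other = np.mean([np.linalg.norm(z - W_c @ g) for g in other_feats])
    return d_other - d_target
```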
Optionally, the determining module 302 is further configured to determine, according to the images corresponding to the plurality of key frames, a target area of the image corresponding to each key frame through a target detection algorithm.
The determining module 302 is specifically configured to determine, according to each target area, a candidate detection feature corresponding to each key frame, where the detection feature indicates the height and width of the corresponding target area and the probability that a person is present.
The processing module 303 is specifically configured to screen out the first N candidate detection features from the plurality of candidate detection features as the target detection features, where the probability corresponding to each target detection feature is greater than the probabilities corresponding to the other detection features, and N is an integer greater than or equal to 1.
Optionally, the processing module 303 is further configured to pre-process the multiple target areas, and extract a corresponding target image block from each target area.
The determining module 302 is specifically configured to determine the two-dimensional image feature according to a plurality of target image blocks.
The determining module 302 is specifically configured to determine an pose classification result of the video to be processed according to the plurality of target detection features, the two-dimensional image features, and the plurality of sub-videos to be processed.
Optionally, the determining module 302 is specifically configured to determine three-dimensional video features according to the plurality of sub-videos to be processed.
The processing module 303 is further configured to perform fusion processing on the plurality of target detection features, the two-dimensional image feature, and the three-dimensional video feature, so as to obtain a fusion feature.
The processing module 303 is specifically configured to perform normalization processing on the fusion feature to obtain the pose classification result.
Optionally, the determining module 302 is further configured to determine, according to the two-dimensional image feature and the three-dimensional video feature, position coding information, where the position coding information is used to indicate positions of the plurality of target areas in the key frame image and positions of the sub-video to be processed in the video to be processed.
The determining module 302 is further configured to determine an original position of the pose classification result in the video to be processed according to the position encoding information.
Fig. 4 is a schematic structural diagram of an attitude classification apparatus according to the present application. As shown in fig. 4, the present application provides a posture classification device 400 including: a receiver 401, a transmitter 402, a processor 403 and a memory 404.
A receiver 401 for receiving instructions and data;
a transmitter 402 for transmitting instructions and data;
memory 404 for storing computer-executable instructions;
a processor 403, configured to execute computer-executable instructions stored in the memory 404, to implement the steps executed by the gesture classification method in the above embodiment. Reference may be made in particular to the relevant description of the embodiments of the gesture classification method described above.
Alternatively, the memory 404 may be separate or integrated with the processor 403.
When the memory 404 is provided separately, the electronic device further comprises a bus for connecting the memory 404 and the processor 403.
The application also provides a computer readable storage medium, wherein the computer readable storage medium stores computer execution instructions, and when the processor executes the computer execution instructions, the gesture classification method executed by the gesture classification device is realized.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, functional modules/units in the apparatus, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed cooperatively by several physical components. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as known to those skilled in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
While the present application has been described with reference to the preferred embodiments shown in the drawings, it will be readily understood by those skilled in the art that the scope of the application is not limited to those specific embodiments, and the above examples are only for illustrating the technical solution of the application, not for limiting it; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.

Claims (10)

1. A method of gesture classification, the method comprising:
acquiring a video to be processed, and acquiring an image corresponding to each frame in the video to be processed according to the video to be processed;
determining a plurality of key frames of the video to be processed according to the image;
respectively extracting the characteristics of the images corresponding to the key frames to obtain a plurality of target characteristics;
extracting the video to be processed through a plurality of video pipelines to obtain a plurality of sub-videos to be processed, wherein the frame numbers and the picture sizes of the sampled videos acquired by different video pipelines are different;
And determining an attitude classification result of the video to be processed according to the target features and the sub-videos to be processed.
2. The method of claim 1, wherein the determining a plurality of keyframes of the video to be processed from the image comprises:
acquiring multi-classification tasks of gesture classification;
classifying the multi-classification tasks to obtain a plurality of classification tasks, and obtaining a dimension reduction mapping matrix of each classification task;
according to the plurality of dimension reduction mapping matrixes and the images corresponding to each frame in the video to be processed, performing saliency judgment processing on the images corresponding to each frame to obtain a saliency judgment result of the images corresponding to each frame;
and determining the plurality of key frames according to the significance judging result.
3. The method according to claim 2, wherein the performing saliency discrimination processing on the image corresponding to each frame according to the plurality of dimension-reduction mapping matrices includes:
the significance discrimination processing is carried out on the image corresponding to each frame by adopting the following formula:
wherein the symbols in the formula denote, respectively: the set of feature representations of each frame corresponding to the target gesture in the c-th classification task; the set of feature representations of each frame corresponding to all other classes of videos in the c-th classification task; the dimension-reduction mapping matrix corresponding to the c-th classification task; the score of the i-th frame of the m-th video; and the feature of the i-th frame of the m-th video.
4. The method of claim 1, wherein the target features comprise target detection features, and the respectively extracting features of the images corresponding to the plurality of key frames to obtain a plurality of target features comprises:
determining a target area of the image corresponding to each key frame through a target detection algorithm according to the images corresponding to the key frames;
according to each target area, determining candidate detection characteristics corresponding to each key frame, wherein the detection characteristics are used for indicating the height and the width of the corresponding target area and the probability of existence of a person;
and screening the first N candidate detection features from the plurality of candidate detection features to serve as target detection features, wherein the probability corresponding to the target detection features is larger than the probability corresponding to other detection features, and N is an integer larger than or equal to 1.
5. The method of claim 4, wherein the target features further comprise two-dimensional image features, and after determining the target area of the image corresponding to each key frame through a target detection algorithm according to the images corresponding to the plurality of key frames, the method further comprises:
Preprocessing a plurality of target areas, and extracting corresponding target image blocks from each target area;
determining the two-dimensional image features according to a plurality of target image blocks;
the determining the gesture classification result of the video to be processed according to the target features and the sub-videos to be processed includes:
and determining an attitude classification result of the video to be processed according to the target detection features, the two-dimensional image features and the sub-videos to be processed.
6. The method of claim 5, wherein determining the pose classification result for the video to be processed based on the plurality of object detection features, the two-dimensional image features, and the plurality of sub-videos to be processed comprises:
determining three-dimensional video characteristics according to the plurality of sub-videos to be processed;
performing fusion processing on the target detection features, the two-dimensional image features and the three-dimensional video features to obtain fusion features;
and carrying out normalization processing on the fusion characteristics to obtain the gesture classification result.
7. The method of claim 6, wherein the fusing the plurality of object detection features, the two-dimensional image features, and the three-dimensional video features, prior to obtaining the fused features, further comprises:
Determining position coding information according to the two-dimensional image characteristics and the three-dimensional video characteristics, wherein the position coding information is used for indicating the positions of the target areas in the key frame images and the positions of the sub-videos to be processed in the videos to be processed;
and determining the original position of the gesture classification result in the video to be processed according to the position coding information.
8. An attitude classification apparatus, characterized by comprising:
the acquisition module is used for acquiring a video to be processed and acquiring an image corresponding to each frame in the video to be processed according to the video to be processed;
the determining module is used for determining a plurality of key frames of the video to be processed according to the image;
the processing module is used for extracting the characteristics of the images corresponding to the key frames respectively to obtain a plurality of target characteristics;
the processing module is further used for extracting the video to be processed through a plurality of video pipelines to obtain a plurality of sub-videos to be processed, wherein the frame numbers and the picture sizes of the sampled videos acquired by different video pipelines are different;
the determining module is further configured to determine an pose classification result of the video to be processed according to the multiple target features and the multiple sub-videos to be processed.
9. A gesture classification apparatus, comprising:
a memory;
a processor;
wherein the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the pose classification method according to any of claims 1-7.
10. A computer storage medium having stored therein computer executable instructions which when executed by a processor are adapted to implement the pose classification method according to any of claims 1-7.
CN202311091309.1A 2023-08-29 2023-08-29 Gesture classification method, device, equipment and computer storage medium Active CN116824641B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311091309.1A CN116824641B (en) 2023-08-29 2023-08-29 Gesture classification method, device, equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311091309.1A CN116824641B (en) 2023-08-29 2023-08-29 Gesture classification method, device, equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN116824641A true CN116824641A (en) 2023-09-29
CN116824641B CN116824641B (en) 2024-01-09

Family

ID=88113083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311091309.1A Active CN116824641B (en) 2023-08-29 2023-08-29 Gesture classification method, device, equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN116824641B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018107910A1 (en) * 2016-12-16 2018-06-21 杭州海康威视数字技术股份有限公司 Method and device for fusing panoramic video images
CN110347873A (en) * 2019-06-26 2019-10-18 Oppo广东移动通信有限公司 Video classification methods, device, electronic equipment and storage medium
CN110427825A (en) * 2019-07-01 2019-11-08 上海宝钢工业技术服务有限公司 The video flame recognition methods merged based on key frame with quick support vector machines
WO2021012564A1 (en) * 2019-07-19 2021-01-28 浙江商汤科技开发有限公司 Video processing method and device, electronic equipment and storage medium
CN111401205A (en) * 2020-03-11 2020-07-10 深圳市商汤科技有限公司 Action recognition method and device, electronic equipment and computer readable storage medium
US20210334524A1 (en) * 2020-04-22 2021-10-28 Ubtech Robotics Corp Ltd Gesture recognition method and terminal device and computer readable storage medium using the same
WO2021248859A1 (en) * 2020-06-11 2021-12-16 中国科学院深圳先进技术研究院 Video classification method and apparatus, and device, and computer readable storage medium
CN112560796A (en) * 2020-12-29 2021-03-26 平安银行股份有限公司 Human body posture real-time detection method and device, computer equipment and storage medium
CN112836602A (en) * 2021-01-21 2021-05-25 深圳市信义科技有限公司 Behavior recognition method, device, equipment and medium based on space-time feature fusion

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIANG YAN et al.: "Deep Keyframe Detection in Human Action Videos", arXiv, pages 1-10 *
SHI Nianfeng; HOU Xiaojing; ZHANG Ping; SUN Ximing: "Key frame extraction for motion video combining pose estimation and tracking" (姿态估计和跟踪结合的运动视频关键帧提取), Video Engineering (电视技术), no. 1 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333904A (en) * 2023-10-18 2024-01-02 杭州锐颖科技有限公司 Pedestrian tracking method based on multi-feature fusion
CN117333904B (en) * 2023-10-18 2024-04-23 杭州锐颖科技有限公司 Pedestrian tracking method based on multi-feature fusion

Also Published As

Publication number Publication date
CN116824641B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
Xiong et al. Spatiotemporal modeling for crowd counting in videos
WO2019218824A1 (en) Method for acquiring motion track and device thereof, storage medium, and terminal
Ding et al. Crowd density estimation using fusion of multi-layer features
AU2014240213B2 (en) System and Method for object re-identification
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
CN110781838A (en) Multi-modal trajectory prediction method for pedestrian in complex scene
CN111539370A (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN111199556A (en) Indoor pedestrian detection and tracking method based on camera
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
Tyagi et al. A review of deep learning techniques for crowd behavior analysis
CN113537107A (en) Face recognition and tracking method, device and equipment based on deep learning
CN113158905A (en) Pedestrian re-identification method based on attention mechanism
Dahirou et al. Motion Detection and Object Detection: Yolo (You Only Look Once)
CN114519863A (en) Human body weight recognition method, human body weight recognition apparatus, computer device, and medium
Delibasoglu et al. Motion detection in moving camera videos using background modeling and FlowNet
Liang et al. Coarse-to-fine foreground segmentation based on co-occurrence pixel-block and spatio-temporal attention model
Xia et al. Abnormal event detection method in surveillance video based on temporal CNN and sparse optical flow
Wang et al. Sture: Spatial–temporal mutual representation learning for robust data association in online multi-object tracking
CN114038067B (en) Coal mine personnel behavior detection method, equipment and storage medium
Bibi et al. Human interaction anticipation by combining deep features and transformed optical flow components
Al Najjar et al. A hybrid adaptive scheme based on selective Gaussian modeling for real-time object detection
CN115115976A (en) Video processing method and device, electronic equipment and storage medium
Sellami et al. Video semantic segmentation using deep multi-view representation learning
Li et al. Intelligent terminal face spoofing detection algorithm based on deep belief network
Bai et al. Pedestrian Tracking and Trajectory Analysis for Security Monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant