WO2020119527A1 - Human motion recognition method and apparatus, terminal device, and storage medium - Google Patents

Human motion recognition method and apparatus, terminal device, and storage medium

Info

Publication number
WO2020119527A1
WO2020119527A1 · PCT/CN2019/122746
Authority
WO
WIPO (PCT)
Prior art keywords
image sequence
target
depth image
training
direction vector
Prior art date
Application number
PCT/CN2019/122746
Other languages
English (en)
French (fr)
Inventor
程俊 (Cheng Jun)
姬晓鹏 (Ji Xiaopeng)
赵青松 (Zhao Qingsong)
Original Assignee
中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院 (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
Publication of WO2020119527A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • The present application belongs to the technical field of pattern recognition, and in particular relates to a human motion recognition method and apparatus, a terminal device, and a computer-readable storage medium.
  • Human motion recognition methods that combine depth image sequences and convolutional neural networks can be divided into two types: those based on two-dimensional convolutional neural networks and those based on three-dimensional convolutional neural networks.
  • In the recognition methods based on two-dimensional convolutional neural networks, the time-series information of the depth image sequence is first compressed, and a two-dimensional convolutional neural network is then used for feature learning and classification of motion trajectory images to obtain the recognition result.
  • However, such methods do not provide a strong spatiotemporal description of the human body's appearance and motion information in the depth image sequence, and they depend heavily on careful temporal preprocessing before the data is input to the network, so their recognition efficiency and accuracy are low.
  • In the recognition methods based on three-dimensional convolutional neural networks, the raw depth data is used as the network input.
  • Although this strengthens the description of spatiotemporal information to a certain extent, the ability to describe local spatiotemporal motion cues remains limited.
  • In other words, existing human motion recognition methods based on image sequences and convolutional neural networks suffer from weak spatiotemporal information description capability and low recognition performance.
  • In view of this, the embodiments of the present application provide a human motion recognition method and apparatus, a terminal device, and a computer-readable storage medium, to solve the problems that existing human motion recognition methods describe spatiotemporal information poorly and have low recognition performance.
  • A first aspect of the embodiments of the present application provides a human motion recognition method, including: acquiring a depth image sequence of a human motion; dividing the depth image sequence at equal intervals into a preset number of image sequence segments; performing time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence; extracting a gradient direction vector of each target image sequence; and performing human motion recognition according to the gradient direction vectors and a pre-trained three-dimensional convolutional neural network model.
  • In a feasible implementation, the time-series sparse sampling of each image sequence segment to obtain a corresponding target image sequence includes: extracting a first target depth image, a second target depth image, and a third target depth image from each image sequence segment, their temporal relative positions in the depth image sequence forming an arithmetic sequence; and obtaining the corresponding target image sequence based on these three target depth images.
  • In a feasible implementation, extracting the gradient direction vector of each target image sequence includes: separately calculating the gradient components of each target image sequence; and normalizing the gradient components of each target image sequence by the L2 norm to obtain the gradient direction vector of each target image sequence.
  • In a feasible implementation, before the gradient direction vectors are extracted, the method further includes performing a data enhancement operation on each target image sequence.
  • In a feasible implementation, before the depth image sequence of the human motion is acquired, the method further includes training the pre-established three-dimensional convolutional neural network model.
  • In a feasible implementation, after the model is trained, the trained three-dimensional convolutional neural network is tested.
  • A second aspect of the embodiments of the present application provides a human motion recognition apparatus, including:
  • a depth image sequence acquisition module, used to acquire a depth image sequence of a human motion;
  • a first division module, configured to divide the depth image sequence at equal intervals into a preset number of image sequence segments;
  • a first time-series sparse sampling module, used to perform time-series sparse sampling on each of the image sequence segments to obtain a corresponding target image sequence;
  • an extraction module, for extracting the gradient direction vector of each target image sequence;
  • a recognition module, used for performing human motion recognition according to the gradient direction vectors and the pre-trained three-dimensional convolutional neural network model.
  • The first time-series sparse sampling module includes:
  • an extraction unit, for extracting a first target depth image, a second target depth image, and a third target depth image from each of the image sequence segments, where the temporal relative positions of the first, second, and third target depth images in the depth image sequence form an arithmetic sequence;
  • a forming unit, configured to obtain the corresponding target image sequence based on the first, second, and third target depth images of each image sequence segment.
  • the extraction module includes:
  • a component calculation unit for calculating the gradient component of each of the target image sequences separately;
  • the normalization unit is configured to normalize the gradient component of each target image sequence by an L2 norm to obtain the gradient direction vector of each target image sequence.
  • In a feasible implementation, the apparatus further includes:
  • the data enhancement module is used to perform data enhancement operations on each of the target image sequences.
  • the data enhancement module includes:
  • a cropping unit configured to crop a preset area of each depth image to obtain a corresponding first target area of a first preset size
  • the target size selection unit is used to randomly select the target size from the preset candidate sizes
  • a random cropping unit configured to randomly crop each of the first target areas according to the target size to obtain a corresponding second target area
  • a scaling unit is used to scale each of the second target areas to a second preset size.
  • a training depth image sequence acquisition module, used to acquire a training depth image sequence;
  • a second dividing module configured to divide the training depth image sequence into the preset number of training image sequence segments
  • a second time-series sparse sampling module configured to sample each of the training image sequence fragments in a first preset time-series sparse sampling mode to obtain a corresponding target training image sequence
  • the training module is used for training the pre-established three-dimensional convolutional neural network model according to each target training image sequence.
  • In a feasible implementation, the apparatus further includes:
  • a test depth image sequence acquisition module, for acquiring a test depth image sequence;
  • a third dividing module configured to divide the test depth image sequence into the preset number of test image sequence fragments
  • a third time-series sparse sampling module configured to sample each of the test image sequence fragments in a second preset time-series sparse sampling mode to obtain a corresponding target test image sequence
  • the test module is used to test the trained three-dimensional convolutional neural network according to each target test image sequence.
  • A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to any one of the first aspect when executing the computer program.
  • A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method according to any one of the first aspect.
  • By acquiring the gradient direction vectors of the depth images and performing human motion recognition according to the gradient direction vectors and the three-dimensional convolutional neural network model, i.e., using the gradient direction vectors as the model input, the computation is simpler and the recognition efficiency is improved. The gradient direction vectors together with the three-dimensional convolutional neural network model the spatiotemporal information of the image sequence well, improving the spatiotemporal description capability; in addition, the organic combination of time-series sparse sampling and three-dimensional convolution further improves this capability and thus the recognition accuracy.
  • FIG. 1 is a schematic block diagram of a flow of a method for human body motion recognition according to an embodiment of the present application
  • FIG. 2 is a schematic block diagram of a data enhancement operation process provided by an embodiment of the present application.
  • FIG. 3 is a schematic block diagram of a flow of a training process provided by an embodiment of this application.
  • FIG. 4 is a schematic block diagram of a flow of a test process provided by an embodiment of this application.
  • FIG. 5 is a schematic structural block diagram of a human motion recognition device according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a terminal device provided by an embodiment of the present application.
  • FIG. 1 is a schematic block diagram of a flow of a method for human action recognition provided by an embodiment of the present application.
  • the method may include the following steps:
  • Step S101 Acquire a depth image sequence of human actions.
  • Step S102 Divide the depth image sequence into a preset number of image sequence segments at equal intervals.
  • Specifically, a depth image sequence of length N is denoted {D(t) | 1 ≤ t ≤ N}, where N is a positive integer greater than 0.
  • The depth image sequence is divided at equal intervals into K image sequence segments S(k), 1 ≤ k ≤ K, where K is a positive integer greater than 0 and S(k) denotes the k-th segment after division.
  • Each image sequence segment includes a certain number of depth images.
  • Step S103 Perform time series sparse sampling on each image sequence segment to obtain a corresponding target image sequence.
  • Time-series sparse sampling refers to extracting several items from a data set, i.e., extracting several depth images from each image sequence segment.
  • After sampling, the target image sequence of each image sequence segment is obtained; that is, each segment corresponds to one target image sequence, composed of the extracted depth images.
  • Generally, the same number of images is extracted from every segment, and that number may be 2 or 3; i.e., two or three depth images may be extracted from each image sequence segment.
  • Human motion recognition requires multiple temporally consecutive frames, so the extracted images must follow certain temporal rules.
  • The specific process of performing time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence may include: extracting a first target depth image, a second target depth image, and a third target depth image from each image sequence segment, where the temporal relative positions of the three images in the depth image sequence form an arithmetic sequence; and obtaining the corresponding target image sequence based on the first, second, and third target depth images of each image sequence segment.
  • For convenience of presentation, the first, second, and third target depth images are denoted D(m_k−d), D(m_k), and D(m_k+d) respectively, where m_k is the relative position, within the depth image sequence of length N, of the depth image extracted from the k-th segment, d is a positive integer greater than zero, and 1 < m_k < N.
  • The temporal relative positions m_k−d, m_k, and m_k+d of the three extracted depth images are distributed as an arithmetic sequence.
  • To preserve the continuity of the motion, the interval between two target depth images should not be too large and should lie within a reasonable range; that is, the value of d should not be too large.
  • The value of d may be 1 or 2; for example, for d = 1, the three consecutive target depth images D(m_k−1), D(m_k), D(m_k+1) are extracted to form the target image sequence {D(m_k−1), D(m_k), D(m_k+1)} of the corresponding segment, as in the sketch below.
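  • The following is a minimal NumPy sketch of this segment-and-sample step, assuming the depth sequence is stored as an (N, H, W) array; the function name sparse_sample and its arguments are illustrative, not taken from the patent, and degenerate cases (segments shorter than 2d+1) are not handled.

```python
import numpy as np

def sparse_sample(depth_seq, K=8, d=1, rng=None):
    """Split a depth sequence (N, H, W) into K equal-interval segments and
    pick D(m_k-d), D(m_k), D(m_k+d) from each segment, so the three frame
    indices form an arithmetic sequence."""
    rng = rng or np.random.default_rng()
    N = depth_seq.shape[0]
    bounds = np.linspace(0, N, K + 1, dtype=int)   # equal-interval segment borders
    targets = []
    for k in range(K):
        lo, hi = bounds[k], bounds[k + 1]
        # keep m_k at least d away from the sequence ends so all three frames exist
        m_k = int(rng.integers(max(lo, d), min(hi, N - d)))
        targets.append(depth_seq[[m_k - d, m_k, m_k + d]])  # (3, H, W)
    return np.stack(targets)                                # (K, 3, H, W)
```

  • For a 64-frame sequence with K = 8 and d = 1, this returns an (8, 3, H, W) array: one three-frame target image sequence per segment.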
  • Step S104 Extract the gradient direction vector of each target image sequence.
  • the gradient direction vector of each segment can be calculated based on each target image sequence.
  • the target image sequence may be subjected to data enhancement operations after sparse sampling in time series and before the gradient direction vector is extracted.
  • data enhancement operations include image scaling, cropping and other operations.
  • the above method may further include: performing a data enhancement operation on each target image sequence.
  • In a specific application, the target image sequences may first be combined into a single image sequence, and the data enhancement operation is then applied to each frame of that sequence.
  • For example, with K segments and 3 depth images extracted per segment, a depth image sequence containing 3K images is formed; alternatively, the corresponding data enhancement operation may be applied directly to each image of each target image sequence.
  • the process of performing the data enhancement operation on each target image sequence may specifically include:
  • Step S201 Crop the preset area of each depth image to obtain a corresponding first target area of a first preset size.
  • The preset area may be a pre-selected area; the same position is cropped in each image, and the cropped size is the first preset size.
  • The first preset size can be set according to actual needs. For example, in an original 512×424-pixel depth image, the pixels from 90 to 410 in the x direction and from 90 to 410 in the y direction are selected, giving a first preset size of 320×320 pixels.
  • Step S202 Randomly select the target size from the preset candidate sizes.
  • The preset candidate sizes may include multiple sizes, one of which is randomly selected as the target size.
  • For example, the candidate sizes may include 320×320, 288×288, 256×256, and 224×224, with 256×256 randomly selected as the target size.
  • Step S203 Randomly crop each first target area according to the target size to obtain a corresponding second target area.
  • After a target size has been randomly selected, each first target area is randomly cropped to that size; that is, an area of the target size is randomly cropped out of the first target area as the second target area.
  • For example, if the size of the first target area is 320×320 and the target size is 256×256, a 256×256-pixel area is randomly selected from the 320×320-pixel area.
  • Step S204 Scale each second target area to a second preset size.
  • The second preset size may be set according to actual needs, for example 224×224; in that case the 256×256-pixel area randomly selected from the 320×320-pixel area is scaled to 224×224. A sketch of these four steps follows.
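  • A compact sketch of steps S201–S204, assuming a single-channel 512×424 depth image stored as a NumPy array and OpenCV available for resizing; the function and constant names are illustrative, not from the patent.

```python
import numpy as np
import cv2  # OpenCV, assumed available for resizing

CANDIDATE_SIZES = (320, 288, 256, 224)  # candidate crop sizes from the example

def augment(depth_img, rng=None, out_size=224):
    """Steps S201-S204 for one 512x424 depth image: fixed crop to 320x320,
    random candidate-size crop, then resize to out_size x out_size."""
    rng = rng or np.random.default_rng()
    first = depth_img[90:410, 90:410]                  # S201: fixed 320x320 area
    s = int(rng.choice(CANDIDATE_SIZES))               # S202: random target size
    y = int(rng.integers(0, first.shape[0] - s + 1))   # S203: random crop offset
    x = int(rng.integers(0, first.shape[1] - s + 1))
    second = first[y:y + s, x:x + s]
    return cv2.resize(second, (out_size, out_size))    # S204: scale to 224x224
```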
  • The extraction of the gradient direction vector varies with the concrete form of the time-series sparse sampling.
  • When each target image sequence contains 3 depth images, the central difference method and L2-norm normalization can be used to compute the gradient direction vector; when it contains 2, the forward/backward difference method and L2-norm normalization can be used.
  • The specific process of extracting the gradient direction vector of each target image sequence may include: separately calculating the gradient components of each target image sequence; and normalizing the gradient components of each target image sequence by the L2 norm to obtain the gradient direction vector of each target image sequence.
  • The gradient components in the three directions are computed from each target image sequence, and the gradient direction vector of the corresponding segment is then obtained by L2-norm normalization.
  • Specifically, for the three-dimensional image sequence D(x, y, t), ignoring boundary points, its gradient components D_x, D_y, and D_t in the x, y, and t directions can be approximated by central differences:
  • D_x(x, y, t) ≈ [D(x+1, y, t) − D(x−1, y, t)] / 2, D_y(x, y, t) ≈ [D(x, y+1, t) − D(x, y−1, t)] / 2, D_t(x, y, t) ≈ [D(x, y, t+1) − D(x, y, t−1)] / 2.
  • The gradient components D_x, D_y, and D_t are then L2-normalized and expressed as a unit vector in the Euclidean space spanned by the x, y, t coordinates: G = (D_x, D_y, D_t) / √(D_x² + D_y² + D_t² + eps),
  • where eps denotes an infinitesimal quantity (1×10⁻⁶ in this method) and G is the gradient direction vector.
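  • The computation above can be sketched in a few lines of NumPy. The wrap-around handling of spatial borders via np.roll is an implementation convenience of this sketch, not part of the patent, which simply ignores boundary points.

```python
import numpy as np

EPS = 1e-6  # the infinitesimal eps from the text

def gradient_direction_vector(seq):
    """Central-difference gradients of a (3, H, W) target image sequence
    {D(m_k-1), D(m_k), D(m_k+1)}, L2-normalized per pixel into unit vectors."""
    d_prev, d_mid, d_next = seq.astype(np.float32)
    # spatial gradients on the middle frame, temporal gradient across frames
    d_x = (np.roll(d_mid, -1, axis=1) - np.roll(d_mid, 1, axis=1)) / 2.0
    d_y = (np.roll(d_mid, -1, axis=0) - np.roll(d_mid, 1, axis=0)) / 2.0
    d_t = (d_next - d_prev) / 2.0
    g = np.stack([d_x, d_y, d_t])                       # (3, H, W)
    norm = np.sqrt((g ** 2).sum(axis=0, keepdims=True) + EPS)
    return g / norm                                     # unit vectors G
```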
  • Step S105 Perform human body motion recognition based on the gradient direction vector and the pre-trained three-dimensional convolutional neural network model.
  • After time-series sparse sampling and gradient direction vector extraction, the gradient direction vectors {G(k) | 1 ≤ k ≤ K} of the entire depth image sequence are obtained.
  • The gradient direction vectors are input to the trained three-dimensional convolutional neural network model for human motion recognition, yielding the recognition result.
  • the above three-dimensional convolutional neural network model may be specifically a three-dimensional residual convolutional neural network model, or may be a three-dimensional convolutional neural network model in other network forms, which is not limited herein.
  • the model is pre-trained with training sample data.
  • For example, the dimension of the gradient direction vectors is C×K×H×W, where C = 3 denotes the gradient components in the three directions,
  • H and W denote the height and width of the depth images,
  • and K denotes the number of segments used in the time-series sparse sampling. Specifically choosing H = W = 224 and K = 8 gives an input data dimension of 3×8×224×224.
  • a 34-layer residual network can be selected as the basic network, the original two-dimensional convolution kernel is replaced with a three-dimensional convolution kernel, and the network structure is adjusted to obtain an improved three-dimensional residual convolution neural network.
  • the layer groups of the improved 3D residual convolutional neural network are described as follows:
  • Conv1 uses 64 three-dimensional convolution kernels of size 7×7×7, with stride 2 in the H and W dimensions and stride 1 in the K dimension. Specifically, for input data of dimension 3×8×224×224, the Conv1 operation yields a feature map of dimension 64×8×112×112.
  • Conv2_x first applies max pooling with a 3×3×3 filter window, with stride 2 in the H and W dimensions and stride 1 in the K dimension; for an input feature map of dimension 64×8×112×112, max pooling yields a 64×8×56×56 feature map. Then 3 groups of 2 layers are applied in turn, each layer convolving the feature map with 64 three-dimensional 3×3×3 kernels with stride 1 in the H, W, and K dimensions; an input feature map of 64×8×56×56 remains 64×8×56×56 after these convolutions.
  • Conv3_x uses 4 groups of 2 layers, each layer convolving the feature map with 128 three-dimensional 3×3×3 kernels; the first convolution layer uses stride 2 in the H, W, and K dimensions, and the remaining layers use stride 1. Specifically, for an input feature map of dimension 64×8×56×56, the first convolution layer yields a 128×4×28×28 feature map, whose dimension the remaining layers leave at 128×4×28×28.
  • Conv4_x uses 6 groups of 2 layers, each layer convolving the feature map with 256 three-dimensional 3×3×3 kernels; the first convolution layer uses stride 2 in the H, W, and K dimensions, and the remaining layers use stride 1. Specifically, for an input feature map of dimension 128×4×28×28, the first convolution layer yields a 256×2×14×14 feature map, whose dimension the remaining layers leave at 256×2×14×14.
  • Conv5_x uses 3 groups of 2 layers, each layer convolving the feature map with 512 three-dimensional 3×3×3 kernels; the first convolution layer uses stride 2 in the H, W, and K dimensions, and the remaining layers use stride 1. Specifically, for an input feature map of dimension 256×2×14×14, the first convolution layer yields a 512×1×7×7 feature map, whose dimension the remaining layers leave at 512×1×7×7.
  • Fc first applies mean pooling over the H, W, and K dimensions, with the filter window sized to cover the remaining extent of those dimensions, and then uses a 512×N_c fully connected layer to output to the corresponding number of action categories. Specifically, for an input feature map of dimension 512×1×7×7, mean pooling with a 1×7×7 filter window gives a 512×1×1×1 feature vector; taking 60 human action classes as an example, a fully connected layer with weight dimensions 512×60 yields a 1×60 feature vector.
  • Compared with the two-dimensional residual network, the improved three-dimensional residual convolutional neural network does not reduce the time dimension in Conv1 and Conv2_x, and reduces the spatial and temporal dimensions synchronously from Conv3_x to Conv5_x.
  • Finally, mean pooling outputs a 512-dimensional feature vector, which is fully connected to the N_c output categories. From input to output, the time dimension is reduced by a factor of 1/8, while the spatial dimensions are reduced by 1/32, consistent with the two-dimensional residual network.
  • The three-dimensional residual convolutional neural network shown above is merely an exemplary structure; a compact sketch of it is given below.
  • The specific network structure and number of layers of the three-dimensional neural network can be set according to the needs of computing resource consumption and recognition performance, and are not limited here.
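  • For illustration only, a minimal PyTorch sketch of a 34-layer 3D residual network matching the layer-group strides and feature-map dimensions described above; this is an assumed reconstruction, not the authors' code.

```python
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    """Two 3x3x3 conv layers with a residual shortcut: one 'group of 2 layers'."""
    def __init__(self, cin, cout, stride):
        super().__init__()
        self.conv1 = nn.Conv3d(cin, cout, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(cout)
        self.conv2 = nn.Conv3d(cout, cout, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(cout)
        self.down = None
        if stride != 1 or cin != cout:   # 1x1x1 projection so the shortcut matches
            self.down = nn.Sequential(
                nn.Conv3d(cin, cout, 1, stride=stride, bias=False),
                nn.BatchNorm3d(cout))
    def forward(self, x):
        idn = self.down(x) if self.down else x
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + idn)

class ResNet34_3D(nn.Module):
    """Strides (K, H, W) = (1, 2, 2) in Conv1 and the pooling of Conv2_x,
    (2, 2, 2) at the first block of Conv3_x..Conv5_x.
    Input: gradient vectors of shape (B, 3, 8, 224, 224)."""
    def __init__(self, num_classes=60):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv3d(3, 64, 7, stride=(1, 2, 2), padding=3, bias=False),
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.MaxPool3d(3, stride=(1, 2, 2), padding=1))
        def stage(cin, cout, blocks, first_stride):
            layers = [BasicBlock3D(cin, cout, first_stride)]
            layers += [BasicBlock3D(cout, cout, 1) for _ in range(blocks - 1)]
            return nn.Sequential(*layers)
        self.conv2_x = stage(64, 64, 3, 1)      # 64 x 8 x 56 x 56
        self.conv3_x = stage(64, 128, 4, 2)     # 128 x 4 x 28 x 28
        self.conv4_x = stage(128, 256, 6, 2)    # 256 x 2 x 14 x 14
        self.conv5_x = stage(256, 512, 3, 2)    # 512 x 1 x 7 x 7
        self.pool = nn.AdaptiveAvgPool3d(1)     # mean pooling over K, H, W
        self.fc = nn.Linear(512, num_classes)   # 512 x N_c
    def forward(self, x):
        x = self.conv5_x(self.conv4_x(self.conv3_x(self.conv2_x(self.conv1(x)))))
        return self.fc(self.pool(x).flatten(1))
```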
  • In summary, this embodiment performs time-series sparse sampling on the depth image sequence, extracts the gradient direction vectors as the local spatiotemporal motion input, and then uses the three-dimensional neural network to learn global appearance and motion features from that input to obtain the action category label.
  • Only the gradient direction vectors need to be computed, so the computation is very simple, while the spatiotemporal description capability is strong and the recognition performance high.
  • In the experimental comparison on the NTU RGB+D dataset (Table 1 of the original publication), C denotes visible-light (RGB) images, D denotes depth images, and S denotes skeleton joint points.
  • this embodiment will introduce the training process and the testing process of the three-dimensional convolutional neural network model.
  • the method may further include:
  • Step S301 Acquire a training depth image sequence.
  • Step S302 Divide the training depth image sequence into a preset number of training image sequence segments.
  • Step S303 Sampling each training image sequence segment by the first preset time series sparse sampling method to obtain a corresponding target training image sequence.
  • The first preset time-series sparse sampling mode may specifically be: randomly extracting a corresponding number of depth images from each training image sequence segment to form the corresponding target training image sequence.
  • Each segment corresponds to one target training image sequence. For example, when 3 depth images need to be extracted from each training image sequence segment, 3 depth images are randomly selected from each segment and formed into the target training image sequence of the corresponding segment.
  • a data enhancement operation may be performed.
  • the data enhancement operation may include cropping, scaling, and other operations.
  • the process may be similar to the data enhancement process mentioned above, and details are not described here.
  • Step S304 Train the pre-established three-dimensional convolutional neural network model according to each target training image sequence.
  • In the specific training process, the cross-entropy loss can be used as the criterion function, and mini-batch stochastic gradient descent can be used for model training.
  • Pre-trained parameters are not used during model initialization; instead, the Kaiming initialization method is used to initialize the convolution parameters.
  • For the hyperparameters, the default configuration may specifically be: a batch size of 64, an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 1×10⁻⁵; a total of 100 epochs are run, with the learning rate decaying to 0.1 of its previous value every 20 epochs. A sketch of this configuration follows.
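  • A sketch of this training configuration in PyTorch, reusing the ResNet34_3D sketch above; train_loader is an assumed DataLoader yielding (gradient-vector, label) batches of size 64, and everything here is illustrative rather than the authors' exact setup.

```python
import torch
import torch.nn as nn

model = ResNet34_3D(num_classes=60)
for m in model.modules():                     # Kaiming-initialize the convolutions
    if isinstance(m, nn.Conv3d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")

criterion = nn.CrossEntropyLoss()             # cross-entropy criterion function
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(100):                      # 100 epochs in total
    for grads, labels in train_loader:        # assumed DataLoader, batch size 64
        optimizer.zero_grad()
        loss = criterion(model(grads), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                          # lr x0.1 every 20 epochs
```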
  • the trained network model needs to be tested to determine whether the model meets the usage standards.
  • Step S401 Obtain a test depth image sequence.
  • Step S402 Divide the test depth image sequence into a preset number of test image sequence segments.
  • Step S403 Sample each test image sequence segment by a second preset time-series sparse sampling mode to obtain a corresponding target test image sequence.
  • The second preset time-series sparse sampling mode may specifically be: extracting from each segment the depth images at the middle of the segment. For example, when a segment contains 11 frames and 3 images need to be extracted from each segment, the 6th frame of the segment and the two adjacent depth images are extracted, as in the sketch below.
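  • A minimal sketch of this middle-frame rule; segment is assumed to be an (L, H, W) array and the function name is illustrative.

```python
def center_sample(segment, n=3):
    """Take the middle frame of the segment and its neighbours, e.g. frames
    5, 6, 7 (1-based) of an 11-frame segment for n = 3."""
    mid = segment.shape[0] // 2
    half = n // 2
    return segment[mid - half: mid + half + 1]
```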
  • This time-series sparse sampling mode differs from the one used in the training process; the difference between the training-time and test-time sampling modes makes the features learned by the network generalize better.
  • Of course, the time-series sparse sampling modes of the training and testing processes may also be the same, which can still achieve the purpose of the embodiments of the present application.
  • a data enhancement operation may be performed, and the data enhancement operation may include operations such as cropping and scaling.
  • In the testing stage, after the fixed area of each depth image is cropped out, it can be scaled directly to a fixed size. For example, in an original 512×424-pixel depth image, the pixels from 90 to 410 in the x direction and from 90 to 410 in the y direction are selected to give a 320×320-pixel area, which is then scaled directly to 224×224 pixels, as sketched below.
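  • The test-stage preprocessing can be sketched as follows, assuming OpenCV; unlike the training augmentation above, there is no random crop, and the function name is illustrative.

```python
import cv2

def test_preprocess(depth_img):
    """Fixed 320x320 crop of the 512x424 depth image, then a direct
    resize to 224x224 with no random cropping."""
    fixed = depth_img[90:410, 90:410]          # same fixed region as training
    return cv2.resize(fixed, (224, 224))
```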
  • Step S404 Test the trained three-dimensional convolutional neural network according to each target test image sequence.
  • the difference between the sparse sampling methods in the training process and the test process can make the characteristics of the network learning more generalized.
  • FIG. 5 is a schematic structural block diagram of a human motion recognition device according to an embodiment of the present application.
  • the device may include:
  • the depth image sequence obtaining module 51 is used to obtain a depth image sequence of human actions
  • the first dividing module 52 is used to divide the depth image sequence into a preset number of image sequence segments at equal intervals;
  • the first time series sparse sampling module 53 is used to perform time series sparse sampling on each image sequence segment to obtain the corresponding target image sequence;
  • the extraction module 54 is used to extract the gradient direction vector of each target image sequence
  • the recognition module 55 is used for performing human body motion recognition based on the gradient direction vector and the pre-trained three-dimensional convolutional neural network model.
  • The foregoing first time-series sparse sampling module includes:
  • an extraction unit, for extracting a first target depth image, a second target depth image, and a third target depth image from each image sequence segment, where the temporal relative positions of the first, second, and third target depth images in the depth image sequence form an arithmetic sequence;
  • a forming unit, configured to obtain the corresponding target image sequence based on the first, second, and third target depth images of each image sequence segment.
  • the above extraction module includes:
  • the component calculation unit is used to calculate the gradient component of each target image sequence separately;
  • the normalization unit is used to normalize the gradient component of each target image sequence by the L2 norm to obtain the gradient direction vector of each target image sequence.
  • the above device further includes:
  • the data enhancement module is used to perform data enhancement operations on each target image sequence.
  • the foregoing data enhancement module includes:
  • a cropping unit configured to crop a preset area of each depth image to obtain a corresponding first target area of a first preset size
  • the target size selection unit is used to randomly select the target size from the preset candidate sizes
  • the random cropping unit is used to randomly crop each first target area according to the target size to obtain the corresponding second target area;
  • the scaling unit is used to scale each second target area to a second preset size.
  • the above device further includes:
  • a training depth image sequence acquisition module, used to acquire a training depth image sequence;
  • the second dividing module is used to divide the training depth image sequence into a preset number of training image sequence fragments
  • the second time-series sparse sampling module is used to sample each training image sequence segment through the first preset time-series sparse sampling method to obtain a corresponding target training image sequence;
  • the training module is used to train the pre-established three-dimensional convolutional neural network model according to each target training image sequence.
  • the above device further includes:
  • a test depth image sequence acquisition module, for acquiring a test depth image sequence;
  • the third dividing module is used to divide the test depth image sequence into a preset number of test image sequence fragments
  • the third time-series sparse sampling module is used to sample each test image sequence segment through the second preset time-series sparse sampling method to obtain a corresponding target test image sequence;
  • the test module is used to test the trained three-dimensional convolutional neural network according to each target test image sequence.
  • the terminal device 6 of this embodiment includes: a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the processor 60.
  • When the processor 60 executes the computer program 62, the steps in the above embodiments of the human motion recognition method are implemented, for example steps S101 to S105 shown in FIG. 1.
  • Alternatively, when the processor 60 executes the computer program 62, the functions of each module or unit in the foregoing apparatus embodiments are realized, for example the functions of modules 51 to 55 shown in FIG. 5.
  • Exemplarily, the computer program 62 may be divided into one or more modules or units, which are stored in the memory 61 and executed by the processor 60 to complete the present application.
  • The one or more modules or units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 62 in the terminal device 6.
  • the computer program 62 may be divided into a depth image sequence acquisition module, a first division module, a first time series sparse sampling module, an extraction module, and an identification module.
  • the specific functions of each module are as follows:
  • The depth image sequence acquisition module is used to acquire a depth image sequence of a human motion; the first division module is used to divide the depth image sequence at equal intervals into a preset number of image sequence segments; the first time-series sparse sampling module is used to perform time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence; the extraction module is used to extract the gradient direction vector of each target image sequence; and the recognition module is used to perform human motion recognition according to the gradient direction vectors and the pre-trained three-dimensional convolutional neural network model.
  • the terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer and a cloud server.
  • the terminal device may include, but is not limited to, the processor 60 and the memory 61.
  • FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on it; the terminal device may include more or fewer components than illustrated, combine certain components, or use different components.
  • the terminal device may further include an input and output device, a network access device, a bus, and the like.
  • The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or internal memory of the terminal device 6.
  • The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 6. Further, the memory 61 may include both an internal storage unit of the terminal device 6 and an external storage device.
  • the memory 61 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 61 can also be used to temporarily store data that has been or will be output.
  • Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is used only as an example for illustration.
  • In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above.
  • The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
  • The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
  • In addition, the specific names of the functional units and modules are only for the purpose of distinguishing them from one another, and are not used to limit the protection scope of the present application.
  • the disclosed device, terminal device, and method may be implemented in other ways.
  • The apparatus and terminal device embodiments described above are merely illustrative.
  • For example, the division into modules or units is only a logical functional division; in actual implementation there may be other division modes, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or software functional unit.
  • the integrated module or unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • The present application can implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program.
  • The computer program can be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of the foregoing method embodiments.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file, or some intermediate form.
  • The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
  • It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.

Abstract

The embodiments of the present application are applicable to the technical field of pattern recognition and disclose a human motion recognition method and apparatus, a terminal device, and a computer-readable storage medium. The method includes: acquiring a depth image sequence of a human motion; dividing the depth image sequence at equal intervals into a preset number of image sequence segments; performing time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence; extracting the gradient direction vector of each target image sequence; and performing human motion recognition according to the gradient direction vectors and a pre-trained three-dimensional convolutional neural network model. The embodiments of the present application provide strong spatiotemporal information description capability, high recognition performance, and relatively simple computation.

Description

Human motion recognition method and apparatus, terminal device, and storage medium

Technical Field

The present application belongs to the technical field of pattern recognition, and in particular relates to a human motion recognition method and apparatus, a terminal device, and a computer-readable storage medium.

Background

With the continuous development of deep convolutional neural network technology, it has become feasible to use deep neural networks to solve action recognition and behavior modeling problems based on image sequences.

At present, human motion recognition methods that combine depth image sequences and convolutional neural networks can be divided into two types: those based on two-dimensional convolutional neural networks and those based on three-dimensional convolutional neural networks. In the recognition methods based on two-dimensional convolutional neural networks, the time-series information of the depth image sequence is first compressed, and a two-dimensional convolutional neural network is then used for feature learning and classification of motion trajectory images to obtain the recognition result. However, such methods do not provide a strong spatiotemporal description of the human body's appearance and motion information in the depth image sequence, and they depend heavily on careful temporal preprocessing before the data is input to the network, so their recognition efficiency and accuracy are low. In the recognition methods based on three-dimensional convolutional neural networks, raw depth data is used as the network input; although this strengthens the description of spatiotemporal information to a certain extent, the ability to describe local spatiotemporal motion cues is limited.

In other words, existing human motion recognition methods based on image sequences and convolutional neural networks suffer from weak spatiotemporal information description capability and low recognition performance.

Technical Problem

In view of this, the embodiments of the present application provide a human motion recognition method and apparatus, a terminal device, and a computer-readable storage medium, to solve the problems that existing human motion recognition methods describe spatiotemporal information poorly and have low recognition performance.

Technical Solution
A first aspect of the embodiments of the present application provides a human motion recognition method, including:
acquiring a depth image sequence of a human motion;
dividing the depth image sequence at equal intervals into a preset number of image sequence segments;
performing time-series sparse sampling on each of the image sequence segments to obtain a corresponding target image sequence;
extracting a gradient direction vector of each target image sequence; and
performing human motion recognition according to the gradient direction vectors and a pre-trained three-dimensional convolutional neural network model.
With reference to the first aspect, in a feasible implementation, performing time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence includes:
extracting a first target depth image, a second target depth image, and a third target depth image from each image sequence segment, where the temporal relative positions of the first, second, and third target depth images in the depth image sequence form an arithmetic sequence; and
obtaining the corresponding target image sequence based on the first, second, and third target depth images of each image sequence segment.
With reference to the first aspect, in a feasible implementation, extracting the gradient direction vector of each target image sequence includes:
separately calculating the gradient components of each target image sequence; and
normalizing the gradient components of each target image sequence by the L2 norm to obtain the gradient direction vector of each target image sequence.
With reference to the first aspect, in a feasible implementation, before extracting the gradient direction vector of each target image sequence, the method further includes:
performing a data enhancement operation on each target image sequence.
With reference to the first aspect, in a feasible implementation, performing the data enhancement operation on each target image sequence includes:
cropping a preset area of each depth image to obtain a corresponding first target area of a first preset size;
randomly selecting a target size from preset candidate sizes;
randomly cropping each first target area according to the target size to obtain a corresponding second target area; and
scaling each second target area to a second preset size.
With reference to the first aspect, in a feasible implementation, before acquiring the depth image sequence of the human motion, the method further includes:
acquiring a training depth image sequence;
dividing the training depth image sequence into the preset number of training image sequence segments;
sampling each training image sequence segment by a first preset time-series sparse sampling mode to obtain a corresponding target training image sequence; and
training a pre-established three-dimensional convolutional neural network model according to each target training image sequence.
With reference to the first aspect, in a feasible implementation, after training the pre-established three-dimensional convolutional neural network model according to the target training image sequences, the method further includes:
acquiring a test depth image sequence;
dividing the test depth image sequence into the preset number of test image sequence segments;
sampling each test image sequence segment by a second preset time-series sparse sampling mode to obtain a corresponding target test image sequence; and
testing the trained three-dimensional convolutional neural network according to each target test image sequence.
A second aspect of the embodiments of the present application provides a human motion recognition apparatus, including:
a depth image sequence acquisition module, used to acquire a depth image sequence of a human motion;
a first division module, used to divide the depth image sequence at equal intervals into a preset number of image sequence segments;
a first time-series sparse sampling module, used to perform time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence;
an extraction module, used to extract the gradient direction vector of each target image sequence; and
a recognition module, used to perform human motion recognition according to the gradient direction vectors and a pre-trained three-dimensional convolutional neural network model.
With reference to the second aspect, in a feasible implementation, the first time-series sparse sampling module includes:
an extraction unit, used to extract a first target depth image, a second target depth image, and a third target depth image from each image sequence segment, where the temporal relative positions of the first, second, and third target depth images in the depth image sequence form an arithmetic sequence; and
a forming unit, used to obtain the corresponding target image sequence based on the first, second, and third target depth images of each image sequence segment.
With reference to the second aspect, in a feasible implementation, the extraction module includes:
a component calculation unit, used to separately calculate the gradient components of each target image sequence; and
a normalization unit, used to normalize the gradient components of each target image sequence by the L2 norm to obtain the gradient direction vector of each target image sequence.
With reference to the second aspect, in a feasible implementation, the apparatus further includes:
a data enhancement module, used to perform a data enhancement operation on each target image sequence.
With reference to the second aspect, in a feasible implementation, the data enhancement module includes:
a cropping unit, used to crop a preset area of each depth image to obtain a corresponding first target area of a first preset size;
a target size selection unit, used to randomly select a target size from preset candidate sizes;
a random cropping unit, used to randomly crop each first target area according to the target size to obtain a corresponding second target area; and
a scaling unit, used to scale each second target area to a second preset size.
With reference to the second aspect, the apparatus further includes:
a training depth image sequence acquisition module, used to acquire a training depth image sequence;
a second division module, used to divide the training depth image sequence into the preset number of training image sequence segments;
a second time-series sparse sampling module, used to sample each training image sequence segment by a first preset time-series sparse sampling mode to obtain a corresponding target training image sequence; and
a training module, used to train a pre-established three-dimensional convolutional neural network model according to each target training image sequence.
With reference to the second aspect, in a feasible implementation, the apparatus further includes:
a test depth image sequence acquisition module, used to acquire a test depth image sequence;
a third division module, used to divide the test depth image sequence into the preset number of test image sequence segments;
a third time-series sparse sampling module, used to sample each test image sequence segment by a second preset time-series sparse sampling mode to obtain a corresponding target test image sequence; and
a test module, used to test the trained three-dimensional convolutional neural network according to each target test image sequence.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method according to any one of the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the method according to any one of the first aspect.
Beneficial Effects

Compared with the prior art, the embodiments of the present application have the following beneficial effects:

The embodiments of the present application acquire the gradient direction vectors of the depth images and perform human motion recognition according to the gradient direction vectors and a three-dimensional convolutional neural network model; that is, the gradient direction vectors are used as the input of the model, which makes the computation simpler and improves recognition efficiency. The gradient direction vectors together with the three-dimensional convolutional neural network model the spatiotemporal information of the image sequence well, improving the spatiotemporal description capability; in addition, the organic combination of time-series sparse sampling and three-dimensional convolution further improves this capability and thus the recognition accuracy.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required in the description of the embodiments or the prior art. Obviously, the drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a schematic flow diagram of a human motion recognition method according to an embodiment of the present application;
FIG. 2 is a schematic flow diagram of a data enhancement operation according to an embodiment of the present application;
FIG. 3 is a schematic flow diagram of a training process according to an embodiment of the present application;
FIG. 4 is a schematic flow diagram of a testing process according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a human motion recognition apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a terminal device according to an embodiment of the present application.

Embodiments of the Invention

In the following description, specific details such as particular system architectures and technologies are set forth for the purpose of illustration rather than limitation, in order to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary detail does not obscure the description of the present application.

To illustrate the technical solutions of the present application, specific embodiments are described below.
Embodiment 1

Referring to FIG. 1, a schematic flow diagram of a human motion recognition method according to an embodiment of the present application, the method may include the following steps:

Step S101: Acquire a depth image sequence of a human motion.

Step S102: Divide the depth image sequence at equal intervals into a preset number of image sequence segments.

It can be understood that the above preset number may be determined according to the needs of the actual application. Specifically, a depth image sequence of length N is denoted {D(t) | 1 ≤ t ≤ N}, where N is a positive integer greater than 0. The depth image sequence is divided at equal intervals into K image sequence segments S(k), 1 ≤ k ≤ K, where K is a positive integer greater than 0 and S(k) denotes the k-th segment after division. Each image sequence segment contains a certain number of depth images.

Step S103: Perform time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence.

It can be understood that time-series sparse sampling means selecting several items from a data set, i.e., extracting several depth images from each image sequence segment. After sampling, a target image sequence is obtained for each segment; that is, each image sequence segment corresponds to one target image sequence, composed of the extracted depth images.

Generally, the same number of images is extracted from every segment, and that number may be 2 or 3; i.e., two or three depth images may be extracted from each segment. Human motion recognition requires multiple temporally consecutive frames, so the extracted images should follow certain temporal rules.

In some embodiments, the specific process of performing time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence may include: extracting a first target depth image, a second target depth image, and a third target depth image from each image sequence segment, where the temporal relative positions of the three images in the depth image sequence form an arithmetic sequence; and obtaining the corresponding target image sequence based on the first, second, and third target depth images of each segment.

It should be noted that, for convenience of presentation, the first, second, and third target depth images are denoted D(m_k−d), D(m_k), and D(m_k+d) respectively, where m_k is the relative position, within the depth image sequence of length N, of the depth image extracted from the k-th segment, d is a positive integer greater than zero, and 1 < m_k < N. The temporal relative positions m_k−d, m_k, and m_k+d of the three extracted depth images are distributed as an arithmetic sequence.

To preserve the continuity of the human motion images, the interval between two target depth images should not be too large and should lie within a reasonable range; that is, d should not be too large. Preferably, d may be 1 or 2; for example, the three consecutive target depth images D(m_k−1), D(m_k), D(m_k+1) are extracted to form the target image sequence {D(m_k−1), D(m_k), D(m_k+1)} of the corresponding segment. Alternatively, D(m_k−2), D(m_k), D(m_k+2) may be extracted to form {D(m_k−2), D(m_k), D(m_k+2)}.

In other embodiments, two consecutive or contiguous depth images may be extracted from each image sequence segment to form its target image sequence. Of course, the number of images extracted from each segment can be chosen according to actual needs.

Step S104: Extract the gradient direction vector of each target image sequence.

It should be noted that after the target image sequence of each segment is extracted, the gradient direction vector of each segment can be computed from each target image sequence.

In some embodiments, to further improve the accuracy and efficiency of human motion recognition, a data enhancement operation may be performed on the target image sequences after time-series sparse sampling and before the gradient direction vectors are extracted. Data enhancement includes operations such as image scaling and cropping.

Optionally, before extracting the gradient direction vector of each target image sequence, the method may further include: performing a data enhancement operation on each target image sequence.

In a specific application, the target image sequences may first be combined into a single image sequence, and the data enhancement operation is then applied to each frame of that sequence; for example, with K segments and 3 depth images extracted per segment, a depth image sequence containing 3K images is formed. Alternatively, the corresponding data enhancement operation may be applied directly to each image of each target image sequence.
Further, referring to the schematic flow diagram of the data enhancement operation shown in FIG. 2, performing the data enhancement operation on each target image sequence may specifically include:

Step S201: Crop a preset area of each depth image to obtain a corresponding first target area of a first preset size.

It should be noted that the preset area may be a pre-selected area; the same position is cropped in each image, and the cropped size is the first preset size, which can be set according to actual needs. For example, in an original 512×424-pixel depth image, the pixels from 90 to 410 in the x direction and from 90 to 410 in the y direction are selected, giving a first preset size of 320×320 pixels.

Step S202: Randomly select a target size from preset candidate sizes.

It should be noted that the preset candidate sizes may include multiple sizes, one of which is randomly selected as the target size. For example, the candidate sizes may include 320×320, 288×288, 256×256, and 224×224, with 256×256 randomly selected as the target size.

Step S203: Randomly crop each first target area according to the target size to obtain a corresponding second target area.

After a target size has been randomly selected, each first target area is randomly cropped to that size; that is, an area of the target size is randomly cropped out of the first target area as the second target area. For example, if the first target area is 320×320 and the target size is 256×256, a 256×256-pixel area is randomly selected from the 320×320-pixel area.

Step S204: Scale each second target area to a second preset size.

It should be noted that the second preset size may be set according to actual needs, for example 224×224; in that case the 256×256-pixel area randomly selected from the 320×320-pixel area is scaled to 224×224.

Of course, the specific data enhancement operations are not limited to those mentioned above.
The extraction of the gradient direction vector varies with the concrete form of the time-series sparse sampling. When a target image sequence contains 3 depth images, the central difference method and L2-norm normalization can be used to compute the gradient direction vector; when it contains 2, the forward/backward difference method and L2-norm normalization can be used.

In some embodiments, when each target image sequence contains 3 depth images, extracting the gradient direction vector of each target image sequence may specifically include: separately calculating the gradient components of each target image sequence; and normalizing the gradient components of each target image sequence by the L2 norm to obtain the gradient direction vector of each target image sequence.

Here, the gradient components in the three directions are computed from each target image sequence, and the gradient direction vector of the corresponding segment is then obtained by L2-norm normalization.

Specifically, when the target image sequence is {D(m_k−1), D(m_k), D(m_k+1)}, for the three-dimensional image sequence D(x, y, t), ignoring boundary points, its gradient components D_x, D_y, and D_t in the x, y, and t directions can be approximated by central differences:

D_x(x, y, t) ≈ [D(x+1, y, t) − D(x−1, y, t)] / 2
D_y(x, y, t) ≈ [D(x, y+1, t) − D(x, y−1, t)] / 2
D_t(x, y, t) ≈ [D(x, y, t+1) − D(x, y, t−1)] / 2

The gradient components D_x, D_y, and D_t are then L2-normalized and expressed as a unit vector in the Euclidean space spanned by the x, y, t coordinates:

G = (D_x, D_y, D_t) / √(D_x² + D_y² + D_t² + eps)

where eps denotes an infinitesimal quantity; in this method, eps = 1×10⁻⁶. G is the gradient direction vector.

It should be noted that when the target image sequence is {D(m_k−2), D(m_k), D(m_k+2)} or another form, the computation is similar and is not repeated here.
Step S105: Perform human motion recognition according to the gradient direction vectors and the pre-trained three-dimensional convolutional neural network model.

After time-series sparse sampling and gradient direction vector extraction, the gradient direction vectors {G(k) | 1 ≤ k ≤ K} of the entire depth image sequence are obtained. The gradient direction vectors are input to the trained three-dimensional convolutional neural network model for human motion recognition, yielding the recognition result.

It should be noted that the three-dimensional convolutional neural network model may specifically be a three-dimensional residual convolutional neural network model, or a three-dimensional convolutional neural network of another network form, which is not limited here. The model is trained in advance with training sample data.

For example, the dimension of the gradient direction vectors is C×K×H×W, where C = 3 denotes the gradient components in the three directions, H and W denote the height and width of the depth images, and K denotes the number of segments used in the time-series sparse sampling. Specifically choosing H = W = 224 and K = 8, the data dimension of the gradient direction vectors before input to the network is 3×8×224×224.

Specifically, a 34-layer residual network may be selected as the base network, the original two-dimensional convolution kernels replaced with three-dimensional ones, and the network structure adjusted to obtain an improved three-dimensional residual convolutional neural network. The layer groups of the improved network are described as follows:
Conv1: uses 64 three-dimensional convolution kernels of size 7×7×7, with stride 2 in the H and W dimensions and stride 1 in the K dimension. Specifically, for input data of dimension 3×8×224×224, the Conv1 operation yields a feature map of dimension 64×8×112×112.

Conv2_x: first applies max pooling with a 3×3×3 filter window, with stride 2 in the H and W dimensions and stride 1 in the K dimension; for an input feature map of dimension 64×8×112×112, max pooling yields a 64×8×56×56 feature map. Then 3 groups of 2 layers are applied in turn, each layer convolving the feature map with 64 three-dimensional 3×3×3 kernels with stride 1 in the H, W, and K dimensions; an input feature map of 64×8×56×56 remains 64×8×56×56 after these convolutions.

Conv3_x: uses 4 groups of 2 layers, each layer convolving the feature map with 128 three-dimensional 3×3×3 kernels; the first convolution layer uses stride 2 in the H, W, and K dimensions, and the remaining layers use stride 1. Specifically, for an input feature map of dimension 64×8×56×56, the first convolution layer yields a 128×4×28×28 feature map, whose dimension the remaining layers leave at 128×4×28×28.

Conv4_x: uses 6 groups of 2 layers, each layer convolving the feature map with 256 three-dimensional 3×3×3 kernels; the first convolution layer uses stride 2 in the H, W, and K dimensions, and the remaining layers use stride 1. Specifically, for an input feature map of dimension 128×4×28×28, the first convolution layer yields a 256×2×14×14 feature map, whose dimension the remaining layers leave at 256×2×14×14.

Conv5_x: uses 3 groups of 2 layers, each layer convolving the feature map with 512 three-dimensional 3×3×3 kernels; the first convolution layer uses stride 2 in the H, W, and K dimensions, and the remaining layers use stride 1. Specifically, for an input feature map of dimension 256×2×14×14, the first convolution layer yields a 512×1×7×7 feature map, whose dimension the remaining layers leave at 512×1×7×7.

Fc: first applies mean pooling over the H, W, and K dimensions, with the filter window sized to cover the remaining extent of those dimensions, and then uses a 512×N_c fully connected layer to output to the corresponding number of action categories. Specifically, for an input feature map of dimension 512×1×7×7, mean pooling with a 1×7×7 filter window gives a 512×1×1×1 feature vector; taking 60 human action classes as an example, a fully connected layer with weight dimensions 512×60 yields a 1×60 feature vector.

Compared with the two-dimensional residual convolutional neural network, the improved three-dimensional residual network does not reduce the time dimension in Conv1 and Conv2_x, and reduces the spatial and temporal dimensions synchronously from Conv3_x to Conv5_x. Finally, mean pooling outputs a 512-dimensional feature vector, which is fully connected to the N_c output categories. From input to output, the time dimension is reduced by a factor of 1/8, while the spatial dimensions are reduced by 1/32, consistent with the two-dimensional residual network.
It can be understood that the three-dimensional residual convolutional neural network shown above is merely an exemplary structure; the specific network structure, number of layers, and so on can be set according to the needs of computing resource consumption and recognition performance, and are not limited here.

It can be seen that this embodiment performs time-series sparse sampling on the depth image sequence, extracts the gradient direction vectors as the local spatiotemporal motion input, and then uses the three-dimensional neural network to learn global appearance and motion features from that input to obtain the action category label. Only the gradient direction vectors need to be computed, so the computation is very simple, while the spatiotemporal description capability is strong and the recognition performance high.

To verify the effect of the human motion recognition method provided by this embodiment, experiments were carried out on the NTU RGB+D dataset, the largest dataset for this target task. The experiments used the two test protocols of cross-subject validation and cross-view validation, and compared using raw depth data with using gradient direction vectors. Table 1 compares the recognition rates of the method provided in this embodiment with other published methods.

Table 1: Comparison of recognition rates with other methods on the NTU RGB+D dataset

[Table 1 is rendered as an image in the original publication; it lists each compared method, its input modality, and its recognition rates under the cross-subject and cross-view protocols.]

Note: C denotes visible-light (RGB) images, D denotes depth images, and S denotes skeleton joint points.

As can be seen from Table 1, the currently better-performing methods all use skeleton joint points or visible-light data as input, and the improvement is more pronounced when multiple modalities are fused. The present method, using only depth image data, reaches the current state of the art under both test protocols and already surpasses several multi-modality fusion methods.

In this embodiment, the gradient direction vectors of the depth images are acquired and human motion recognition is performed according to the gradient direction vectors and the three-dimensional convolutional neural network model; that is, the gradient direction vectors serve as the model input, which makes the computation simpler and improves recognition efficiency. The gradient direction vectors together with the three-dimensional convolutional neural network model the spatiotemporal information of the image sequence well, improving the spatiotemporal description capability; moreover, the organic combination of time-series sparse sampling and three-dimensional convolution further improves this capability and thus the recognition accuracy.
Embodiment 2

Based on Embodiment 1 above, this embodiment introduces the training process and the testing process of the three-dimensional convolutional neural network model.

Referring to the schematic flow diagram of the training process shown in FIG. 3, based on Embodiment 1, before the depth image sequence of the human motion is acquired, the method may further include:

Step S301: Acquire a training depth image sequence.

Step S302: Divide the training depth image sequence into a preset number of training image sequence segments.

Step S303: Sample each training image sequence segment by a first preset time-series sparse sampling mode to obtain a corresponding target training image sequence.

It should be noted that the first preset time-series sparse sampling mode may specifically be: randomly extracting a corresponding number of depth images from each training image sequence segment to form the corresponding target training image sequence. Each segment corresponds to one target training image sequence. For example, when 3 depth images need to be extracted from each training image sequence segment, 3 depth images are randomly selected from each segment and formed into the target training image sequence of the corresponding segment.

After the target training image sequence of each segment is obtained, a data enhancement operation may be performed; it may include operations such as cropping and scaling, and its process may be similar to the data enhancement process mentioned above, which is not repeated here.

Step S304: Train the pre-established three-dimensional convolutional neural network model according to each target training image sequence.

It can be understood that the specific introduction of the three-dimensional convolutional neural network model can be found in the corresponding content above and is not repeated here.

In the specific training process, the cross-entropy loss can be used as the criterion function, and mini-batch stochastic gradient descent can be used for model training. Pre-trained parameters are not used during model initialization; instead, the Kaiming initialization method is used to initialize the convolution parameters.

For the hyperparameter settings, the default configuration may specifically be: a batch size of 64, an initial learning rate of 0.01, a momentum of 0.9, and a weight decay of 1×10⁻⁵; a total of 100 epochs are run, with the learning rate decaying to 0.1 of its previous value every 20 epochs.

After training is completed, the trained network model needs to be tested to determine whether it meets the usage standards.

Therefore, in some embodiments, referring to the schematic flow diagram of the testing process shown in FIG. 4, after the pre-established three-dimensional convolutional neural network model is trained according to the target training image sequences, the method may further include:

Step S401: Acquire a test depth image sequence.

Step S402: Divide the test depth image sequence into a preset number of test image sequence segments.

Step S403: Sample each test image sequence segment by a second preset time-series sparse sampling mode to obtain a corresponding target test image sequence.

It should be noted that the second preset time-series sparse sampling mode may specifically be: extracting from each segment the depth images at the middle of the segment. For example, when a segment contains 11 frames and 3 images need to be extracted from each segment, the 6th frame of the segment and the two adjacent depth images are extracted.

It can be seen that this time-series sparse sampling mode differs from the one used in the training process; this difference makes the features learned by the network generalize better. Of course, the time-series sparse sampling modes of the training and testing processes may also be the same, which can still achieve the purpose of the embodiments of the present application.

After the target test image sequence of each segment is obtained, a data enhancement operation may be performed, including operations such as cropping and scaling. In the testing stage, after the fixed area of each depth image is cropped out, it can be scaled directly to a fixed size. For example, in an original 512×424-pixel depth image, the pixels from 90 to 410 in the x direction and from 90 to 410 in the y direction are selected to give a 320×320-pixel area, which is then scaled directly to 224×224 pixels.

It can be seen that the data enhancement operation in the testing stage differs from that in the training process, which makes the features learned by the network generalize better.

Step S404: Test the trained three-dimensional convolutional neural network according to each target test image sequence.

It should be noted that processes in the training and testing procedures that are similar to the recognition process of Embodiment 1 can be cross-referenced and are not repeated here.

In this embodiment, the difference between the time-series sparse sampling modes of the training and testing processes makes the features learned by the network generalize better.
Embodiment 3

Referring to FIG. 5, a schematic structural diagram of a human motion recognition apparatus according to an embodiment of the present application, the apparatus may include:

a depth image sequence acquisition module 51, used to acquire a depth image sequence of a human motion;
a first division module 52, used to divide the depth image sequence at equal intervals into a preset number of image sequence segments;
a first time-series sparse sampling module 53, used to perform time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence;
an extraction module 54, used to extract the gradient direction vector of each target image sequence; and
a recognition module 55, used to perform human motion recognition according to the gradient direction vectors and a pre-trained three-dimensional convolutional neural network model.

In a feasible implementation, the first time-series sparse sampling module includes:
an extraction unit, used to extract a first target depth image, a second target depth image, and a third target depth image from each image sequence segment, where the temporal relative positions of the first, second, and third target depth images in the depth image sequence form an arithmetic sequence; and
a forming unit, used to obtain the corresponding target image sequence based on the first, second, and third target depth images of each image sequence segment.

In a feasible implementation, the extraction module includes:
a component calculation unit, used to separately calculate the gradient components of each target image sequence; and
a normalization unit, used to normalize the gradient components of each target image sequence by the L2 norm to obtain the gradient direction vector of each target image sequence.

In a feasible implementation, the apparatus further includes:
a data enhancement module, used to perform a data enhancement operation on each target image sequence.

In a feasible implementation, the data enhancement module includes:
a cropping unit, used to crop a preset area of each depth image to obtain a corresponding first target area of a first preset size;
a target size selection unit, used to randomly select a target size from preset candidate sizes;
a random cropping unit, used to randomly crop each first target area according to the target size to obtain a corresponding second target area; and
a scaling unit, used to scale each second target area to a second preset size.

In a feasible implementation, the apparatus further includes:
a training depth image sequence acquisition module, used to acquire a training depth image sequence;
a second division module, used to divide the training depth image sequence into a preset number of training image sequence segments;
a second time-series sparse sampling module, used to sample each training image sequence segment by a first preset time-series sparse sampling mode to obtain a corresponding target training image sequence; and
a training module, used to train a pre-established three-dimensional convolutional neural network model according to each target training image sequence.

In a feasible implementation, the apparatus further includes:
a test depth image sequence acquisition module, used to acquire a test depth image sequence;
a third division module, used to divide the test depth image sequence into a preset number of test image sequence segments;
a third time-series sparse sampling module, used to sample each test image sequence segment by a second preset time-series sparse sampling mode to obtain a corresponding target test image sequence; and
a test module, used to test the trained three-dimensional convolutional neural network according to each target test image sequence.

In this embodiment, the gradient direction vectors of the depth images are acquired and human motion recognition is performed according to the gradient direction vectors and the three-dimensional convolutional neural network model; that is, the gradient direction vectors serve as the input of the model, which makes the computation simpler and improves recognition efficiency. The gradient direction vectors together with the three-dimensional convolutional neural network model the spatiotemporal information of the image sequence well, improving the spatiotemporal description capability; in addition, the organic combination of time-series sparse sampling and three-dimensional convolution further improves this capability and thus the recognition accuracy.

It should be understood that the step numbers in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
Embodiment 4

FIG. 6 is a schematic diagram of a terminal device according to an embodiment of the present application. As shown in FIG. 6, the terminal device 6 of this embodiment includes a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the processor 60. When the processor 60 executes the computer program 62, the steps in the above embodiments of the human motion recognition method are implemented, for example steps S101 to S105 shown in FIG. 1. Alternatively, when the processor 60 executes the computer program 62, the functions of each module or unit in the above apparatus embodiments are realized, for example the functions of modules 51 to 55 shown in FIG. 5.

Exemplarily, the computer program 62 may be divided into one or more modules or units, which are stored in the memory 61 and executed by the processor 60 to complete the present application. The one or more modules or units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 62 in the terminal device 6. For example, the computer program 62 may be divided into a depth image sequence acquisition module, a first division module, a first time-series sparse sampling module, an extraction module, and a recognition module, whose specific functions are as follows:

the depth image sequence acquisition module is used to acquire a depth image sequence of a human motion; the first division module is used to divide the depth image sequence at equal intervals into a preset number of image sequence segments; the first time-series sparse sampling module is used to perform time-series sparse sampling on each image sequence segment to obtain a corresponding target image sequence; the extraction module is used to extract the gradient direction vector of each target image sequence; and the recognition module is used to perform human motion recognition according to the gradient direction vectors and a pre-trained three-dimensional convolutional neural network model.

The terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on it; the terminal device may include more or fewer components than illustrated, combine certain components, or use different components. For example, the terminal device may further include input and output devices, network access devices, buses, and the like.

The processor 60 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 61 may be an internal storage unit of the terminal device 6, such as a hard disk or internal memory of the terminal device 6. The memory 61 may also be an external storage device of the terminal device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the terminal device 6. Further, the memory 61 may include both an internal storage unit of the terminal device 6 and an external storage device. The memory 61 is used to store the computer program and other programs and data required by the terminal device. The memory 61 may also be used to temporarily store data that has been or will be output.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is used only as an example for illustration. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the purpose of distinguishing them from one another, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.

In the above embodiments, each embodiment has its own emphasis; for parts not detailed or described in a certain embodiment, reference may be made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus, terminal device, and method may be implemented in other ways. For example, the apparatus and terminal device embodiments described above are merely illustrative; for example, the division into modules or units is only a logical functional division, and in actual implementation there may be other division modes, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated module or unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application can implement all or part of the processes in the methods of the above embodiments by instructing relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, executable file, or some intermediate form. The computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the relevant jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals.

The above embodiments are only used to illustrate the technical solutions of the present application rather than to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the protection scope of the present application.

Claims (10)

  1. 一种人体动作识别方法,其特征在于,包括:
    获取人体动作的深度图像序列;
    将所述深度图像序列等间隔划分为预设数量个图像序列片段;
    对每个所述图像序列片段进行时序稀疏采样,得到对应的目标图像序列;
    提取每个所述目标图像序列的梯度方向向量;
    根据所述梯度方向向量和预训练的三维卷积神经网络模型,进行人体动作识别。
  2. 根据权利要求1所述的人体动作识别方法,其特征在于,所述对每个所述图像序列片段进行时序稀疏采样,得到对应的目标图像序列,包括:
    从每个所述图像序列片段中抽取第一目标深度图像、第二目标深度图像以及第三目标深度图像,其中,所述第一目标深度图像、所述第二目标深度图像以及所述第三目标深度图像在所述深度图像序列中的时序相对位置呈等差数列;
    基于每个所述图像序列片段的所述第一目标深度图像、所述第二目标深度图像以及所述第三目标深度图像,得到对应的所述目标图像序列。
  3. 根据权利要求1所述的人体动作识别方法,其特征在于,所述提取每个所述目标图像序列的梯度方向向量,包括:
    分别计算每个所述目标图像序列的梯度分量;
    将每个所述目标图像序列的所述梯度分量进行L2范数归一化,得到每个所述目标图像序列的所述梯度方向向量。
  4. 根据权利要求1所述的人体动作识别方法,其特征在于,在所述提取每个所述目标图像序列的梯度方向向量之前,还包括:
    对每个所述目标图像序列进行数据增强操作。
  5. 根据权利要求4所述的人体动作识别方法,其特征在于,所述对每个所述目标图像序列进行数据增强操作,包括:
    对各个深度图像的预设区域进行裁剪,得到相应的第一预设尺寸的第一目标区域;
    从预设备选尺寸中随机选取目标尺寸;
    根据所述目标尺寸,对各个所述第一目标区域进行随机裁剪,得到相应的第二目标区域;
    将各个所述第二目标区域的缩放至第二预设尺寸。
  6. 根据权利要求1至5任一项所述的人体动作识别方法,其特征在于,在所述获取人体动作的深度图像序列之前,还包括:
    获取训练深度图像序列;
    将所述训练深度图像序列划分为所述预设数量个训练图像序列片段;
    通过第一预设时序稀疏采样方式对每个所述训练图像序列片段进行采样,得到对应的目标训练图像序列;
    根据各个所述目标训练图像序列,对预建立的三维卷积神经网络模型进行训练。
  7. The human motion recognition method according to claim 6, wherein after the training the pre-established three-dimensional convolutional neural network model according to the target training image sequences, the method further comprises:
    acquiring a test depth image sequence;
    dividing the test depth image sequence into the preset number of test image sequence segments;
    sampling each of the test image sequence segments in a second preset temporal sparse sampling manner to obtain a corresponding target test image sequence; and
    testing the trained three-dimensional convolutional neural network according to each of the target test image sequences.
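A matching test-phase sketch, under the same PyTorch assumption:

    import torch

    @torch.no_grad()
    def evaluate(model, test_batches):
        # test_batches yields (batch, labels) pairs built from the
        # target test image sequences.
        model.eval()
        correct = total = 0
        for batch, labels in test_batches:
            pred = model(batch).argmax(dim=1)
            correct += (pred == labels).sum().item()
            total += labels.numel()
        return correct / total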
  8. A human motion recognition apparatus, comprising:
    a depth image sequence acquisition module, configured to acquire a depth image sequence of a human motion;
    a first division module, configured to divide the depth image sequence at equal intervals into a preset number of image sequence segments;
    a first temporal sparse sampling module, configured to perform temporal sparse sampling on each of the image sequence segments to obtain a corresponding target image sequence;
    an extraction module, configured to extract a gradient direction vector of each of the target image sequences; and
    a recognition module, configured to perform human motion recognition according to the gradient direction vectors and a pre-trained three-dimensional convolutional neural network model.
  9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
PCT/CN2019/122746 2018-12-11 2019-12-03 Human motion recognition method, apparatus, terminal device and storage medium WO2020119527A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811509445.7 2018-12-11
CN201811509445.7A CN109522874B (zh) Human motion recognition method, apparatus, terminal device and storage medium

Publications (1)

Publication Number Publication Date
WO2020119527A1 true WO2020119527A1 (zh) 2020-06-18

Family

ID=65795275

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122746 WO2020119527A1 (zh) 2018-12-11 2019-12-03 Human motion recognition method, apparatus, terminal device and storage medium

Country Status (2)

Country Link
CN (1) CN109522874B (zh)
WO (1) WO2020119527A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110321761B (zh) * 2018-03-29 2022-02-11 中国科学院深圳先进技术研究院 Behavior recognition method, terminal device, and computer-readable storage medium
CN109522874B (zh) * 2018-12-11 2020-08-21 中国科学院深圳先进技术研究院 Human motion recognition method, apparatus, terminal device and storage medium
CN112434604A (zh) * 2020-11-24 2021-03-02 中国科学院深圳先进技术研究院 Video-feature-based action period localization method and computer device
CN112396637A (zh) * 2021-01-19 2021-02-23 南京野果信息技术有限公司 Dynamic behavior recognition method and system based on a 3D neural network
CN113743387B (zh) * 2021-11-05 2022-03-22 中电科新型智慧城市研究院有限公司 Video pedestrian re-identification method and apparatus, electronic device, and readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8934675B2 (en) * 2012-06-25 2015-01-13 Aquifi, Inc. Systems and methods for tracking human hands by performing parts based template matching using images from multiple viewpoints
CN105740823B (zh) * 2016-02-01 2019-03-29 北京高科中天技术股份有限公司 Dynamic gesture trajectory recognition method based on a deep convolutional neural network
CN107704799A (zh) * 2017-08-10 2018-02-16 深圳市金立通信设备有限公司 Human motion recognition method and device, and computer-readable storage medium
CN107609501A (zh) * 2017-09-05 2018-01-19 东软集团股份有限公司 Method and apparatus for recognizing similar human motions, storage medium, and electronic device
CN107506756A (zh) * 2017-09-26 2017-12-22 北京航空航天大学 Human motion recognition method based on a Gabor-filter three-dimensional convolutional neural network model
CN108197580B (zh) * 2018-01-09 2019-07-23 吉林大学 Gesture recognition method based on a 3D convolutional neural network
CN108830252B (zh) * 2018-06-26 2021-09-10 哈尔滨工业大学 Convolutional neural network human motion recognition method fusing global spatiotemporal features

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN108288016A (zh) * 2017-01-10 2018-07-17 武汉大学 Action recognition method and system based on gradient boundary graphs and multi-modal convolution fusion
CN107103277A (zh) * 2017-02-28 2017-08-29 中科唯实科技(北京)有限公司 Gait recognition method based on a depth camera and a 3D convolutional neural network
CN109522874A (zh) * 2018-12-11 2019-03-26 中国科学院深圳先进技术研究院 Human motion recognition method, apparatus, terminal device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHUANXU WANG ET AL: "Abnormal Crowded Behavior Detection Algorithm Based on Spatial Temporal Interesting Points", JOURNAL OF DATA ACQUISITION AND PROCESSING, vol. 27, no. 4, 31 July 2012 (2012-07-31), pages 422 - 428, XP009521507, DOI: 10.16337/j.1004-9037.2012.04.01 *
TIANMING YANG ET AL: "Spatio-temporal Dual-stream Human Motion Recognition Model Based on Video Deep Learning", JOURNAL OF COMPUTER APPLICATIONS, vol. 38, no. 3, 10 March 2018 (2018-03-10), pages 895 - 899,915, XP009521506, DOI: 10.11772/j.issn.1001-9081.2017071740 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783713A (zh) * Weakly supervised temporal action localization method and apparatus based on a relation prototype network
CN111783713B (zh) * 2020-07-09 2022-12-02 中国科学院自动化研究所 Weakly supervised temporal action localization method and apparatus based on a relation prototype network
CN111881794A (zh) * Video behavior recognition method and system
CN111881794B (zh) * 2020-07-20 2023-10-10 元神科技(杭州)有限公司 Video behavior recognition method and system
CN112102235A (zh) * Human body part recognition method, computer device, and storage medium
CN112102235B (zh) * 2020-08-07 2023-10-27 上海联影智能医疗科技有限公司 Human body part recognition method, computer device, and storage medium
CN112085063B (zh) * 2020-08-10 2023-10-13 深圳市优必选科技股份有限公司 Target recognition method and apparatus, terminal device, and storage medium
CN112085063A (зh) * Target recognition method and apparatus, terminal device, and storage medium
CN111914798B (zh) * 2020-08-17 2022-06-07 四川大学 Human behavior recognition method based on skeletal joint point data
CN111914798A (zh) * Human behavior recognition method based on skeletal joint point data
CN112587129B (zh) * 2020-12-01 2024-02-02 上海影谱科技有限公司 Human motion recognition method and apparatus
CN112587129A (zh) * Human motion recognition method and apparatus
CN112749625A (zh) * Temporal action detection method, temporal action detection apparatus, and terminal device
CN112749625B (zh) * 2020-12-10 2023-12-15 深圳市优必选科技股份有限公司 Temporal action detection method, temporal action detection apparatus, and terminal device
CN112560875A (zh) * Depth information completion model training method, apparatus, device, and storage medium
CN112560875B (zh) * 2020-12-25 2023-07-28 北京百度网讯科技有限公司 Depth information completion model training method, apparatus, device, and storage medium
CN112580577B (zh) * 2020-12-28 2023-06-30 出门问问(苏州)信息科技有限公司 Training method and apparatus for generating speaker images based on facial key points
CN112834764A (zh) * Sampling control method and apparatus for a robotic arm, and sampling system
CN112580577A (zh) * Training method and apparatus for generating speaker images based on facial key points
CN113177450A (zh) * Behavior recognition method and apparatus, electronic device, and storage medium
CN113392743B (zh) * 2021-06-04 2023-04-07 北京格灵深瞳信息技术股份有限公司 Abnormal action detection method and apparatus, electronic device, and computer storage medium
CN113392743A (zh) * Abnormal action detection method and apparatus, electronic device, and computer storage medium
CN113887419B (zh) * 2021-09-30 2023-05-12 四川大学 Human behavior recognition method and system based on extracting spatiotemporal information from video
CN113887419A (zh) * Human behavior recognition method and system based on extracting spatiotemporal information from video
CN115687674A (zh) * Big data demand analysis method and system serving a smart cloud service platform

Also Published As

Publication number Publication date
CN109522874A (zh) 2019-03-26
CN109522874B (zh) 2020-08-21

Similar Documents

Publication Publication Date Title
WO2020119527A1 (zh) Human motion recognition method, apparatus, terminal device and storage medium
WO2020199931A1 (zh) Face key point detection method and apparatus, storage medium, and electronic device
JP7110493B2 (ja) Deep model training method and apparatus therefor, electronic device, and storage medium
CN111860398B (zh) Remote sensing image target detection method and system, and terminal device
CN110765860A (zh) Fall determination method and apparatus, computer device, and storage medium
EP3803803A1 (en) Lighting estimation
WO2021027692A1 (zh) Visual feature library construction method, visual positioning method, apparatus, and storage medium
TW202205215A (zh) Three-dimensional mesh model reconstruction method, electronic device, and computer-readable storage medium
CN112183541B (zh) Contour extraction method and apparatus, electronic device, and storage medium
CN112308866A (zh) Image processing method and apparatus, electronic device, and storage medium
CN111383232A (zh) Image matting method and apparatus, terminal device, and computer-readable storage medium
CN112529068A (zh) Multi-view image classification method and system, computer device, and storage medium
CN110163095B (zh) Loop closure detection method, loop closure detection apparatus, and terminal device
CN111488810A (zh) Face recognition method and apparatus, terminal device, and computer-readable medium
CN111161348B (zh) Monocular-camera-based object pose estimation method, apparatus, and device
CN114549765A (zh) Three-dimensional reconstruction method and apparatus, and computer-storable medium
WO2021115061A1 (zh) Image segmentation method and apparatus, and server
CN110633630B (zh) Behavior recognition method and apparatus, and terminal device
WO2023109086A1 (zh) Character recognition method, apparatus, device, and storage medium
CN107622498B (zh) Scene-segmentation-based image traversal processing method, apparatus, and computing device
WO2022236802A1 (zh) Object model reconstruction method and apparatus, terminal device, and storage medium
CN113724176A (зh) Multi-camera motion capture seamless-connection method, apparatus, terminal, and medium
CN115147434A (зh) Image processing method and apparatus, terminal device, and computer-readable storage medium
CN114821216A (зh) Image descreening neural network model construction and use method, and related device
CN113033256A (зh) Fingertip detection model training method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19895295

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 05.11.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19895295

Country of ref document: EP

Kind code of ref document: A1