WO2023010758A1 - Action detection method and apparatus, terminal device, and storage medium - Google Patents

Action detection method and apparatus, terminal device, and storage medium

Info

Publication number
WO2023010758A1
WO2023010758A1 (PCT/CN2021/138566)
Authority
WO
WIPO (PCT)
Prior art keywords
image
sequence
target
pixel
frame
Prior art date
Application number
PCT/CN2021/138566
Other languages
English (en)
French (fr)
Inventor
任子良
程俊
张锲石
高向阳
康宇航
Original Assignee
中国科学院深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2023010758A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Definitions

  • the present application relates to the technical field of image processing, and in particular to an action detection method, device, terminal device and storage medium.
  • Action detection refers to the recognition and tracking of a target (such as a human body) in a video clip to determine the action category of the target.
  • an action detection method based on RGB images is usually used, which implements action detection by analyzing the pixel features of an RGB image sequence.
  • RGB images as detection samples are easily disturbed by environmental factors such as illumination changes, resulting in low accuracy of action detection.
  • embodiments of the present application provide a motion detection method, device, terminal device, and storage medium, which can improve the accuracy of motion detection.
  • the first aspect of the embodiments of the present application provides an action detection method, including:
  • the pixel feature image including the features of each frame image of the target pixel frame sequence
  • the depth feature image including the features of each frame image of the target depth map sequence
  • in the embodiments of the present application, first, the target video sequence containing the target action is acquired, the target video sequence including a one-to-one corresponding pixel frame sequence and depth map sequence; then, a pixel feature image containing the image features of each frame image is generated according to the pixel frame sequence, and a depth feature image containing the image features of each frame image is generated according to the depth map sequence; finally, the pixel feature image and the depth feature image are input into a trained deep neural network for image feature extraction and fusion processing, so as to determine the category of the target action.
  • the above process fuses the pixel features and depth features of the video image, and utilizes the complementarity of pixel information and depth information to reduce the interference of environmental factors on the detection samples to a certain extent, thereby improving the accuracy of motion detection.
  • in an implementation of the present application, the deep neural network includes a feature extraction module and a feature fusion module, and inputting the pixel feature image and the depth feature image into the trained deep neural network to perform image feature extraction and fusion processing so as to determine the category of the target action may include:
  • the category of the target action is determined based on the fused image features.
  • the generating the pixel feature image according to the target pixel frame sequence may include:
  • the generating a depth feature image according to the target depth map sequence may include:
  • the frame images included in the second image sequence are fused to obtain the depth feature image.
  • each frame image included in the first image sequence to obtain the pixel feature image may include:
  • the merging of each frame image included in the second image sequence to obtain the depth feature image may include:
  • the image features are superimposed, averaged and rounded according to the corresponding pixel points of each frame of the grayscale image to obtain the depth feature image.
  • the reference video sequence includes the normalized target action
  • the number of image frames included in the third image sequence is the same as the number of image frames included in the first image sequence;
  • the target object is an object that performs the target action
  • the normalization degree of the target action included in the target video sequence is determined according to the curve error.
  • the calculating and obtaining the curve error according to the first motion trajectory curve and the second motion trajectory curve may include:
  • the acquisition of the target video sequence containing the target action may include:
  • an original video sequence comprising a plurality of actions, the original video sequence including a one-to-one correspondence of an original pixel frame sequence and an original depth map sequence;
  • the second aspect of the embodiments of the present application provides an action detection device, including:
  • a video sequence acquiring module configured to acquire a target video sequence comprising a target action, the target video sequence comprising a one-to-one corresponding target pixel frame sequence and a target depth map sequence;
  • a pixel feature generating module configured to generate a pixel feature image according to the target pixel frame sequence, the pixel feature image including the features of each frame image of the target pixel frame sequence;
  • a depth feature generation module configured to generate a depth feature image according to the target depth map sequence, the depth feature image including the features of each frame image of the target depth map sequence;
  • An action detection module configured to input the pixel feature image and the depth feature image into a trained deep neural network to perform image feature extraction and fusion processing, so as to determine the category of the target action.
  • the third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the action detection method provided by the first aspect of the embodiments of the present application.
  • the fourth aspect of the embodiments of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, it implements the action detection method provided by the first aspect of the embodiments of the present application.
  • a fifth aspect of the embodiments of the present application provides a computer program product, which enables the terminal device to execute the action detection method described in the first aspect of the embodiments of the present application when the computer program product is run on the terminal device.
  • FIG. 1 is a flow chart of an action detection method provided in an embodiment of the present application
  • Fig. 2 is a schematic diagram of respectively performing spatio-temporal information representation on a pixel frame sequence and a depth map sequence to obtain corresponding pixel feature images and depth feature images;
  • Fig. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of the present application.
  • Fig. 4 is a schematic structural diagram of a feature interaction module provided by an embodiment of the present application.
  • Fig. 5 is a schematic diagram of using a visual saliency algorithm to segment and mark a human body into multiple specified parts
  • FIG. 6 is a schematic flowchart of a method for human motion detection and motion standardization evaluation provided by an embodiment of the present application
  • FIG. 7 is a structural diagram of an action detection device provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
  • This application proposes an action detection method based on computer vision.
  • the complementarity between pixel information and depth information can be effectively used, and the overall action recognition efficiency and the anti-interference ability of the model can be greatly improved.
  • the normativeness of the action can be further evaluated.
  • FIG. 1 shows an action detection method provided by an embodiment of the present application, including:
  • a target video sequence including a target action, where the target video sequence includes a one-to-one correspondence between a target pixel frame sequence and a target depth map sequence;
  • the target action is an action to be recognized, which can be any type of action performed by any target object (such as a human body, an animal, or a robot, etc.), such as sitting down, bowling, and push-ups of a human body.
  • target object such as a human body, an animal, or a robot, etc.
  • the target video sequence contains two image sequences: the first part is the target pixel frame sequence containing pixel features (specifically, it can be an RGB image sequence or a grayscale image sequence, etc.), and the second part is the target depth map sequence containing depth features
  • the two image sequences are in one-to-one correspondence, that is, the first frame image of the target pixel frame sequence corresponds to the first frame image of the target depth map sequence, the second frame image of the target pixel frame sequence corresponds to the second frame image of the target depth map sequence, and so on.
  • in practice, certain specified types of cameras (such as Microsoft's Kinect camera) can be used to capture an RGB image sequence and its corresponding depth image sequence, with each frame of the two image sequences in one-to-one correspondence.
  • the acquisition of the target video sequence containing the target action may include:
  • in some application scenarios, what is acquired is usually an original video sequence containing multiple different actions, for example, an activity video of a person over a certain period of time, which may contain multiple different actions such as walking, sitting down, standing, and running.
  • for these scenarios, an action segmentation method can be used to divide the original video sequence into video sequence segments each containing a single action, and then the action contained in each video sequence segment can be recognized by the action detection method proposed in this application, thereby recognizing all the actions contained in the entire original video sequence.
  • specifically, the acquired original video sequence likewise includes a one-to-one corresponding original pixel frame sequence and original depth map sequence, and video action segmentation processing can be performed on the original pixel frame sequence and the original depth map sequence respectively, obtaining multiple pixel frame sequence segments each containing a single action and multiple depth map sequence segments each containing a single action. Then, one pixel frame sequence segment (the segment containing the target action that currently needs to be recognized) is selected from the multiple pixel frame sequence segments as the target pixel frame sequence, and the depth map sequence segment corresponding to the target pixel frame sequence (that is, the segment containing the target action that currently needs to be recognized) is selected from the multiple depth map sequence segments as the target depth map sequence.
  • in practice, an action segmentation method based on quantity of movement (QOM) can be used: each frame image has movement information relative to its adjacent frame images and the first frame image, and the start frame and end frame of each action can be detected according to the corresponding quantity of movement, thereby realizing action segmentation.
  • Threshold_QOM is a preset parameter, which can generally be set to 60 according to empirical values.
  • a parameter Threshold_inter is set as the threshold for intra-action segmentation, which can be iteratively updated through a sliding-window method. Assuming that the average length of an action is L frames, the average QOM value of the first 12.5% and the last 12.5% of the frame images can be used as candidate values of Threshold_inter. Then, by comparison, the frame corresponding to the minimum QOM value within an image length of L frames is selected as the action boundary frame (start frame and end frame), completing the action segmentation.
  • the action segmentation method based on QOM mainly realizes the action detection and segmentation of the video sequence through the combination of the change of the motion amount and the time scale, and finally obtains a video sequence segment containing only a single action.
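  • As a concrete illustration of the QOM-based segmentation just described, the following Python sketch assumes a common form of QOM (the count of pixels whose change relative to the first frame exceeds Threshold_QOM; the exact formulas are given only as images in the publication) and a simplified sliding-window boundary search; the function names are illustrative, not the patent's code.

```python
import numpy as np

def qom(frame_t: np.ndarray, frame_ref: np.ndarray, threshold_qom: float = 60.0) -> float:
    """Quantity of movement of frame t relative to a reference frame, assumed here
    to be the number of pixels whose grayscale change exceeds Threshold_QOM."""
    diff = np.abs(frame_t.astype(np.float32) - frame_ref.astype(np.float32))
    return float(np.count_nonzero(diff > threshold_qom))

def segment_actions(gray_frames: list, avg_action_len: int) -> list:
    """Split a grayscale frame list into single-action (start, end) segments by
    picking, inside each window of roughly one action length, the frame with the
    minimum QOM as an action boundary frame."""
    ref = gray_frames[0]
    qoms = np.array([qom(f, ref) for f in gray_frames])

    boundaries = [0]
    pos = 0
    while pos + avg_action_len < len(gray_frames):
        window = qoms[pos + 1 : pos + avg_action_len + 1]
        boundary = pos + 1 + int(np.argmin(window))  # frame with the least motion
        boundaries.append(boundary)
        pos = boundary
    if boundaries[-1] != len(gray_frames) - 1:
        boundaries.append(len(gray_frames) - 1)

    # consecutive boundary pairs delimit segments that each contain one action
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]
```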
  • a pixel feature image can be generated based on each frame image of the target pixel frame sequence.
  • the pixel feature image includes the features of each frame image of the target pixel frame sequence, and can be used to characterize the overall pixel feature of the target pixel frame sequence. For example, assuming that the target pixel frame sequence is N frames of RGB images, an RGB-like pixel feature image can be generated based on the N frames of RGB images to represent the overall pixel features of the N frames of RGB images.
  • the pixel feature image obtained in this process can be called the spatio-temporal information representation of the target pixel frame sequence.
  • the generating the pixel feature image according to the target pixel frame sequence may include:
  • the pixel feature image can be obtained by fusing the frames of images contained in the target pixel frame sequence.
  • since the target pixel frame sequence contains many frame images, fusing all of them would incur a large amount of computation and affect the running speed of the algorithm; therefore, sparse sampling processing can first be performed on the target pixel frame sequence in the time dimension, so as to remove redundant information between frames and reduce the amount of computation.
  • the sparse sampling can adopt the method of average sampling to avoid the problem of uneven action representation and loss of spatial dimension information.
  • for example, assuming the target pixel frame sequence contains 100 frames, the sparse average sampling method can be used to extract the 5th frame, the 15th frame, the 25th frame, ..., the 95th frame, a total of 10 frames of images, as the obtained first image sequence; that is, the 10 frames of images can be used to represent the entire target action, as sketched below.
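  • A minimal sketch of that average sampling, assuming the bin-middle indexing implied by the 5/15/.../95 example (helper name and binning choice are illustrative):

```python
def sparse_average_sample(frames: list, num_samples: int = 10) -> list:
    """Evenly (sparsely) sample num_samples frames over the whole sequence,
    taking the middle frame of each equal-length bin, e.g. frames 5, 15, ..., 95
    out of 100 when num_samples is 10."""
    total = len(frames)
    bin_size = total / num_samples
    indices = [int(i * bin_size + bin_size / 2) for i in range(num_samples)]
    return [frames[min(idx, total - 1)] for idx in indices]
```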
  • each frame image included in the first image sequence may be fused by means of image superposition or the like, so as to obtain a corresponding pixel feature image.
  • the merging the frames of images contained in the first image sequence to obtain the pixel feature image may include:
  • assuming the first image sequence is an RGB image sequence, this process can treat the three RGB channels as a vector matrix; that is, each RGB frame in the RGB image sequence has a corresponding vector matrix. After the pixel features contained in these vector matrices are superimposed, averaged, and rounded, a final vector matrix is obtained, which is the vector matrix corresponding to the pixel feature image and can also be called the spatio-temporal information representation sample of the RGB image sequence.
  • for example, assuming the first image sequence is ⟨I_1, I_2, I_3, ..., I_T⟩, that is, it contains T frames in total, its corresponding pixel feature image M can be expressed (per pixel and per channel) as M = round((I_1 + I_2 + ... + I_T) / T); the formula itself is given as an image in the original publication.
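  • A minimal sketch of that per-pixel superpose/average/round fusion, under the M = round(mean) reading above (function name is illustrative); the same routine can be reused for the grayscale depth frames in the depth branch:

```python
import numpy as np

def fuse_frames(frames: np.ndarray) -> np.ndarray:
    """Fuse T frames into one feature image by per-pixel superposition, averaging
    and rounding; `frames` has shape (T, H, W, C) for RGB input or (T, H, W) for
    single-channel (grayscale) input."""
    mean = frames.astype(np.float64).mean(axis=0)   # superpose and average over time
    return np.rint(mean).astype(np.uint8)           # round back to integer pixel values
```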
  • a depth feature image may also be generated based on each frame image of the target depth map sequence.
  • the depth feature image contains the features of each frame image of the target depth map sequence, and can be used to characterize the overall depth feature of the target depth map sequence. For example, assuming that the target depth image sequence is N frames of depth images, an approximate depth image may be generated based on the N frames of depth images to characterize the overall depth features of the N frames of depth images.
  • the depth feature image obtained in this process can be called the spatio-temporal information representation of the target depth image sequence.
  • the generating a depth feature image according to the target depth map sequence may include:
  • sparse sampling processing can also be performed on the target depth map sequence in the time dimension, thereby removing redundant information between frames and reducing the amount of calculation.
  • the sparse average sampling method can be used to extract the 5th frame, the 15th frame, the 25th frame, ..., the 95th frame (the frame indices corresponding one-to-one to those of the first image sequence in step 102), a total of 10 depth images, as the obtained second image sequence; that is, the 10 depth images can be used to represent the entire target action.
  • the depth images of the frames included in the second image sequence may be fused by means of image superposition and the like, so as to obtain corresponding depth feature images.
  • the merging each frame image included in the second image sequence to obtain the depth feature image may include:
  • since the depth images of the second image sequence represent distance information, each of them first needs to be converted, for example by scaling, into a grayscale image with grayscale values of 0-255, which can then be treated as a single-channel vector matrix; that is, each depth image in the second image sequence has a corresponding single-channel vector matrix.
  • after the pixel features contained in these vector matrices are superimposed, averaged, and rounded, a final vector matrix is obtained, which is the vector matrix corresponding to the depth feature image and can also be called the spatio-temporal information representation sample of the second image sequence.
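  • A sketch of the depth-to-grayscale conversion, assuming a simple min-max scaling to 0-255 (the patent only says "by scaling", so the exact mapping is an assumption):

```python
import numpy as np

def depth_to_gray(depth: np.ndarray) -> np.ndarray:
    """Scale a raw depth map (distance values) into a 0-255 grayscale image."""
    d = depth.astype(np.float64)
    span = d.max() - d.min()
    if span == 0:
        return np.zeros_like(d, dtype=np.uint8)
    return np.rint((d - d.min()) / span * 255).astype(np.uint8)

# the grayscale depth frames can then be fused exactly like the RGB frames, e.g.:
# depth_feature_image = fuse_frames(np.stack([depth_to_gray(d) for d in second_image_sequence]))
```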
  • as shown in FIG. 2, which is a schematic diagram of performing spatio-temporal information representation on the pixel frame sequence and the depth map sequence respectively to obtain the corresponding pixel feature image and depth feature image, it can be seen that the obtained pixel feature image contains the overall pixel features of the pixel frame sequence, and the obtained depth feature image contains the overall depth features of the depth map sequence.
  • the two frames of feature images can be input into a pre-trained deep neural network for processing.
  • the deep neural network can jointly learn the pixel features and deep features in the feature image by performing image feature extraction and fusion, and finally output a category label to determine the category of the target action.
  • the deep neural network can use mature network model architectures such as Resnet, inception, and VGG, and this application does not limit the type and structure of the deep neural network.
  • in an implementation of the present application, the deep neural network includes a feature extraction module and a feature fusion module, and inputting the pixel feature image and the depth feature image into the trained deep neural network to perform image feature extraction and fusion processing so as to determine the category of the target action may include:
  • the feature extraction module can be constructed by using the structure of multi-level convolutional layer and pooling layer, which is used to extract the image semantic features of the pixel feature image and depth feature image;
  • then, the image semantic features are fused by a feature fusion module containing convolutional neural units (the fusion is mainly realized by element-wise multiplication of features, weighted summation, or taking the maximum value) to obtain the fused image features.
  • the fused image feature is a distinguishing feature between the target action and other actions, so action recognition can be realized based on this feature, that is, the category of the target action can be determined.
  • a schematic structural diagram of the deep neural network is shown in FIG. 3.
  • the deep neural network consists of two parts: a feature extraction module and a feature fusion module.
  • the feature extraction module is mainly composed of multi-level cascaded convolutional layers, a feature interaction module, and a fully connected layer, and the feature fusion module is mainly composed of multi-level cascaded convolutional neural units (indicated by the circled 3 in the figure).
  • a schematic structural diagram of the feature interaction module is shown in FIG. 4. This module mainly contains two convolutional layers with 1×1 convolution kernels; after the image features of the two modalities (pixel features and depth features) are input into the module, complementary learning of mid-level semantic features and high-level semantic features can be performed on the image features (mid-level semantic features generally refer to the features learned during network model parameter learning, and high-level semantic features generally refer to the features output after network model learning is completed that can classify samples).
  • assuming the pixel feature image is a three-channel RGB image and the depth feature image is a single-channel depth image, to adapt them to the deep neural network, the feature images of the two modalities can be concatenated to obtain a four-channel sample.
  • an RGB image can be expressed as a three-channel vector matrix
  • a depth image can be expressed as a single-channel vector matrix.
  • two vector matrices can be spliced together to obtain a four-channel vector matrix.
  • the number of input channels of the deep neural network is also four.
  • after the four-channel sample is input into the deep neural network and processed by the multi-level convolutional layers, pooling layers, and the aforementioned feature interaction module, the mid-level semantic features and high-level semantic features of the sample are learned; after that, the high-level semantic features are fused through the feature fusion module to obtain the distinguishing features of the target action, and finally the action classification is completed based on the distinguishing features.
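  • The patent describes the network only at a high level (FIG. 3 and FIG. 4 are not reproduced in the text), so the PyTorch sketch below is one plausible arrangement rather than the patent's exact architecture: a four-channel input split back into its RGB and depth channels, conv/pool feature extraction, a 1×1-convolution interaction module, element-wise-product fusion, and a fully connected classifier. All layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class FeatureInteraction(nn.Module):
    """Two 1x1 convolutions that let the pixel and depth feature maps complement
    each other's semantics (in the spirit of the feature interaction module of FIG. 4)."""
    def __init__(self, channels: int):
        super().__init__()
        self.pix_to_dep = nn.Conv2d(channels, channels, kernel_size=1)
        self.dep_to_pix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, pix, dep):
        return pix + self.dep_to_pix(dep), dep + self.pix_to_dep(pix)

class ActionNet(nn.Module):
    """Feature extraction (convolution + pooling + interaction) followed by fusion
    (element-wise product here) and a fully connected classifier; the input is the
    four-channel sample obtained by concatenating the two feature images."""
    def __init__(self, num_classes: int):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True), nn.MaxPool2d(2))
        self.pix_branch = nn.Sequential(block(3, 32), block(32, 64))   # RGB channels
        self.dep_branch = nn.Sequential(block(1, 32), block(32, 64))   # depth channel
        self.interact = FeatureInteraction(64)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                      # x: (B, 4, H, W) four-channel sample
        p = self.pix_branch(x[:, :3])          # pixel-modality semantic features
        d = self.dep_branch(x[:, 3:])          # depth-modality semantic features
        p, d = self.interact(p, d)             # complementary (interactive) learning
        fused = self.pool(p * d).flatten(1)    # fuse by element-wise multiplication
        return self.classifier(fused)          # action category logits
```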
  • the target object is an object that performs the target action
  • a standard action video library can be constructed in a designated storage area (such as a database) in advance, and the video sequences of various types of normalized actions (such as video sequences of human standard running actions, video sequences of human standard push-up actions, etc.) can be stored. into the normative action video library.
  • the reference video sequence corresponding to the type of the target action can be searched from the standard action video library, and the reference video sequence contains the normalized described target action. For example, if the target action is running, the reference video sequence is a video sequence of a standard running action of a human body.
  • then, sparse sampling processing is performed on the reference video sequence in the time dimension to obtain a third image sequence; note that the number of image frames contained in the third image sequence is the same as the number of image frames contained in the first image sequence mentioned above.
  • for example, assuming the first image sequence is ⟨I_1, I_2, I_3, ..., I_T⟩ containing T frames of images, the third image sequence can be expressed as ⟨N_1, N_2, N_3, ..., N_T⟩, also containing T frames.
  • the target object is the object that executes the target action
  • the number of the specified parts can be one or multiple (in order to improve the accuracy of the subsequent curve error calculation, it is generally necessary to set multiple).
  • the visual saliency algorithm can be used to segment and mark the human body into six designated parts, including head, torso, left hand, right hand, left foot, and right foot, as shown in Figure 5.
  • the first motion trajectory curves corresponding to the designated parts are constructed based on the time dimension.
  • for example, for the six designated parts of the human body, the coordinates of the center point of the human head in each frame image contained in the first image sequence can be connected to obtain the first motion trajectory curve corresponding to the human head (a normalization operation can be used to represent the human head as a single center point; specifically, the average of the coordinates of all coordinate points contained in the head can be calculated as the center point, and the same operation can be applied to the other designated parts); the coordinates of the center point of the human torso in each frame image contained in the first image sequence are connected to obtain the first motion trajectory curve corresponding to the human torso, and so on, constructing the first motion trajectory curves corresponding to the six designated parts.
  • the second motion trajectory curve corresponding to each designated part can be constructed based on the time dimension. That is, for the six designated parts of the human body, the coordinates of the center point of the human head in each frame image contained in the third image sequence can be connected to obtain the second motion trajectory curve corresponding to the human head; the third image sequence The coordinates of the center points of the human torso in the included frames of images are connected to obtain the second motion trajectory curves corresponding to the human torso, and so on to construct the second motion trajectory curves corresponding to the six specified parts.
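  • A sketch of building one trajectory curve from per-frame saliency labels, assuming each designated part is available as a binary mask per frame (the saliency-detection step itself is not shown, and the helper names are illustrative):

```python
import numpy as np

def part_center(part_mask: np.ndarray) -> np.ndarray:
    """Center point of one designated part in one frame: the average coordinate
    of all pixels labeled as belonging to that part."""
    ys, xs = np.nonzero(part_mask)
    return np.array([xs.mean(), ys.mean()])

def trajectory_curve(part_masks: list) -> np.ndarray:
    """Connect the per-frame center points of one designated part (e.g. the head)
    over the T sampled frames into a motion trajectory curve of shape (T, 2)."""
    return np.stack([part_center(m) for m in part_masks])
```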
  • the curve error can be calculated. Assuming that there is only one designated part, that is, there is only one first motion trajectory curve and one second motion trajectory curve, then the curve error can be the difference between the first motion trajectory curve and the second motion trajectory curve. Specifically, the distance between each target position point in the first motion trajectory curve and its corresponding position point in the second motion trajectory curve can be calculated respectively to obtain the error of each target position point, and then each The errors of the target position points are superimposed to obtain the curve error.
  • the target position points here can be nodes connected by motion trajectory curves. For example, assuming that the specified part is a human head, each target position point in the first motion trajectory curve can be the human body in each frame image contained in the first image sequence. The center point of the head, and the corresponding position points of each target position point in the second motion trajectory curve may be the center point of the human head in each frame image included in the third image sequence.
  • the curve error can be calculated using the following formula (rendered as an image in the original publication): err = Σ_{t=1}^{T} d_t, where err represents the curve error, t = 1, 2, 3, ..., T, T represents the number of image frames of the first image sequence and the third image sequence, and d_t represents the distance between the t-th target position point in the first motion trajectory curve and its corresponding position point in the second motion trajectory curve.
  • assuming there are multiple designated parts, that is, there are multiple first and second motion trajectory curves, the curve error component of each designated part can be calculated by the above method and the components are then added to obtain the total curve error; for example, for the six designated parts of the human body: err_total = err_head + err_torso + err_left_hand + err_right_hand + err_left_foot + err_right_foot.
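  • A minimal sketch of the curve error, assuming Euclidean distance between corresponding trajectory points (the distance measure itself is shown only as an image in the publication):

```python
import numpy as np

def curve_error(curve_test: np.ndarray, curve_ref: np.ndarray) -> float:
    """err: sum over t of the distance between the t-th point of the test trajectory
    and the t-th point of the reference trajectory (both of shape (T, 2))."""
    return float(np.linalg.norm(curve_test - curve_ref, axis=1).sum())

def total_curve_error(test_curves: dict, ref_curves: dict) -> dict:
    """Per-part curve error components, e.g. head, torso, left/right hand, left/right foot."""
    return {part: curve_error(test_curves[part], ref_curves[part]) for part in test_curves}
```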
  • the degree of normalization of the target action contained in the target video sequence can be determined according to the curve error.
  • the curve error can be used to characterize the degree of deviation between the target action contained in the target video sequence and the normalized target action (standard action) contained in the reference video sequence, that is, if the curve error is smaller, it means that the target video sequence The smaller the deviation between the included target action and the standard action, that is, the higher the degree of normalization of the target action.
  • the curve error can be normalized first, and here the method of calculating the reciprocal of the error can be used to deal with it, namely:
  • 1/err_total = 1/err_head + 1/err_torso + 1/err_left_hand + 1/err_right_hand + 1/err_left_foot + 1/err_right_foot
  • the normalized curve error can be converted into a probability value through the softmax function.
  • the larger the probability value, the closer the sample to be tested (i.e., the target video sequence) is to the normalized action sample (i.e., the reference video sequence), that is, the higher the degree of normalization of the sample to be tested; conversely, the larger the deviation between the sample to be tested and the normalized action sample, the lower the degree of normalization of the sample to be tested.
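  • The patent only states that the reciprocal-normalized error is converted to a probability by softmax, without giving the inputs to the softmax; the sketch below therefore assumes a two-way softmax against a fixed zero reference (equivalent to a sigmoid), purely for illustration:

```python
import numpy as np

def normalization_score(part_errors: dict) -> float:
    """Turn per-part curve errors into a normativeness probability: take reciprocals
    of the errors (the normalization step described above), then map the summed
    reciprocal through a two-way softmax against a fixed zero reference so that
    smaller errors give scores closer to 1. The two-way construction is an assumption."""
    inv_total = sum(1.0 / (e + 1e-8) for e in part_errors.values())  # 1/err_total
    logits = np.array([inv_total, 0.0])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[0])  # larger value = closer to the standard action
```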
  • in the embodiments of the present application, first, the target video sequence containing the target action is acquired, the target video sequence including a one-to-one corresponding pixel frame sequence and depth map sequence; then, a pixel feature image containing the image features of each frame image is generated according to the pixel frame sequence, and a depth feature image containing the image features of each frame image is generated according to the depth map sequence; finally, the pixel feature image and the depth feature image are input into a trained deep neural network for image feature extraction and fusion processing, so as to determine the category of the target action.
  • the above process fuses the pixel features and depth features of the video image, and utilizes the complementarity of pixel information and depth information to reduce the interference of environmental factors on the detection samples to a certain extent, thereby improving the accuracy of motion detection.
  • referring to FIG. 6, it is a schematic flowchart of a human action detection and action normalization evaluation method proposed in an embodiment of the present application.
  • in FIG. 6, the original video sequence is first input, and QOM action segmentation processing is performed on it to obtain a target video sequence containing a single human action; next, sparse average sampling processing is performed on the target video sequence to obtain the first image sequence; then, the spatio-temporal information representation of the first image sequence is obtained to get the pixel feature image, which is input together with the corresponding depth feature image into the deep neural network for recognition to obtain the action category; after that, the reference video sequence corresponding to the action category is looked up in the standard action video library.
  • sparse average sampling processing is performed on the reference video sequence to obtain the third image sequence; then, saliency detection is used to mark the positions of the six designated parts of the human body (head, torso, left hand, right hand, left foot, and right foot) in each frame image contained in the first image sequence and to construct the corresponding first motion trajectory curves, and to mark the positions of the six designated parts in each frame image contained in the third image sequence and to construct the corresponding second motion trajectory curves; next, the corresponding curve error is calculated from the two sets of motion trajectory curves, the curve error is finally normalized, the normalized curve error is converted into a probability value through the softmax function, and the degree of normalization of the human action contained in the target video sequence is evaluated according to the magnitude of the probability value.
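  • A high-level orchestration sketch tying the FIG. 6 steps together, reusing the illustrative helpers from the earlier sketches (`segment_actions`, `sparse_average_sample`, `fuse_frames`, `depth_to_gray`, `trajectory_curve`, `total_curve_error`, `normalization_score`); `saliency_parts`, `standard_library`, `category_names`, and the chosen average action length are hypothetical stand-ins, not the patent's code.

```python
import numpy as np
import torch

def evaluate_action(rgb_frames, depth_frames, model, category_names, standard_library):
    """End-to-end flow of FIG. 6 (illustrative only): QOM segmentation, sparse
    average sampling, spatio-temporal fusion, classification, then normativeness
    scoring against the standard action video library."""
    # 1. QOM action segmentation on a grayscale view of the RGB stream
    start, end = segment_actions([f.mean(axis=-1) for f in rgb_frames], avg_action_len=40)[0]

    # 2. sparse average sampling + fusion into the two feature images
    sampled_rgb = sparse_average_sample(rgb_frames[start:end])
    pixel_img = fuse_frames(np.stack(sampled_rgb))
    depth_img = fuse_frames(np.stack([depth_to_gray(d)
                                      for d in sparse_average_sample(depth_frames[start:end])]))

    # 3. concatenate into a four-channel sample and classify
    sample = np.concatenate([pixel_img, depth_img[..., None]], axis=-1)         # (H, W, 4)
    x = torch.from_numpy(sample).permute(2, 0, 1).float().unsqueeze(0) / 255.0  # (1, 4, H, W)
    with torch.no_grad():
        category = category_names[int(model(x).argmax(dim=1))]

    # 4. trajectory comparison against the standard action of that category
    test_curves = {part: trajectory_curve(masks)
                   for part, masks in saliency_parts(sampled_rgb).items()}   # hypothetical helper
    ref_curves = standard_library[category]   # assumed: precomputed reference trajectory curves
    return category, normalization_score(total_curve_error(test_curves, ref_curves))
```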
  • the embodiment of the present application can effectively utilize the complementarity of pixel information and depth information by fusing the pixel features and depth features of video images, and greatly improve the overall action recognition efficiency and the anti-interference ability of the model. Moreover, by performing vectorized error comparison between the motion of the sample to be tested and the motion of the normalized motion sample, the evaluation of the normalization degree of the motion can be realized, which has great application value in fields such as physical training.
  • sequence numbers of the steps in the above embodiments do not mean the order of execution, and the execution order of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application .
  • a motion detection method is mainly described above, and a motion detection device will be described below.
  • an embodiment of an action detection device in the embodiment of the present application includes:
  • a video sequence acquisition module 701 configured to acquire a target video sequence including a target action, the target video sequence including a one-to-one corresponding target pixel frame sequence and target depth map sequence;
  • a pixel feature generating module 702 configured to generate a pixel feature image according to the target pixel frame sequence, the pixel feature image including the features of each frame image of the target pixel frame sequence;
  • a depth feature generation module 703, configured to generate a depth feature image according to the target depth map sequence, the depth feature image including the features of each frame image of the target depth map sequence;
  • the action detection module 704 is configured to input the pixel feature image and the depth feature image into the trained deep neural network to perform image feature extraction and fusion processing, so as to determine the category of the target action.
  • the deep neural network includes a feature extraction module and a feature fusion module
  • the action detection module may include:
  • a feature extraction unit configured to input the pixel feature image and the depth feature image into the feature extraction module for processing to obtain image semantic features
  • a feature fusion unit configured to input the semantic features of the image into the feature fusion module for processing to obtain fused image features
  • An action detection unit configured to determine the category of the target action based on the fused image features.
  • the pixel feature generation module may include:
  • a first sparse sampling processing unit configured to perform sparse sampling processing on the target pixel frame sequence in the time dimension to obtain a first image sequence
  • a first image fusion unit configured to fuse each frame of images included in the first image sequence to obtain the pixel feature image
  • the deep feature generation module may include:
  • a second sparse sampling processing unit configured to perform sparse sampling processing on the target depth map sequence in the time dimension to obtain a second image sequence
  • the second image fusion unit is configured to fuse the frames of images included in the second image sequence to obtain the depth feature image.
  • the first image fusion unit may include:
  • a pixel feature processing subunit configured to perform image feature superposition, averaging and rounding operations on each frame image contained in the first image sequence according to corresponding pixel points, to obtain the pixel feature image;
  • the second image fusion unit may include:
  • a grayscale conversion subunit configured to convert each frame of image included in the second image sequence into each frame of grayscale image
  • the depth feature processing subunit is configured to perform image feature superposition, averaging and rounding operations on the grayscale images of each frame according to corresponding pixel points to obtain the depth feature image.
  • the motion detection device may further include:
  • a reference video search module configured to search for a reference video sequence corresponding to the category of the target action from a preset standard action video library, the reference video sequence including the normalized target action;
  • a sparse sampling processing module configured to perform sparse sampling processing on the reference video sequence in the time dimension to obtain a third image sequence, the number of image frames contained in the third image sequence and the image frames contained in the first image sequence the same number;
  • a saliency labeling module configured to respectively mark the position of the designated part of the target object in each frame image included in the first image sequence, and the position of the designated part of the target object in each frame image included in the third image sequence, the target object being an object that performs the target action;
  • a first curve construction module configured to construct a first motion trajectory curve corresponding to the specified part according to the position of the specified part of the target object in each frame image included in the first image sequence
  • the second curve construction module is configured to construct a second motion trajectory curve corresponding to the specified part according to the position of the specified part of the target object in each frame image included in the third image sequence;
  • a curve error calculation module configured to calculate a curve error according to the first motion trajectory curve and the second motion trajectory curve
  • a normalization evaluation module configured to determine the normalization degree of the target action included in the target video sequence according to the curve error.
  • the curve error calculation module may include:
  • a position point error calculation unit configured to separately calculate the distance between each target position point in the first motion trajectory curve and its corresponding position point in the second motion trajectory curve, to obtain each target The error of the position point;
  • the error superposition unit is configured to superimpose the errors of each target position point to obtain the curve error.
  • the video sequence acquisition module may include:
  • An original video sequence acquiring unit configured to acquire an original video sequence comprising a plurality of actions, the original video sequence including a one-to-one corresponding original pixel frame sequence and an original depth map sequence;
  • the first action segmentation processing unit is configured to perform video action segmentation processing on the original pixel frame sequence, obtain multiple pixel frame sequence segments containing a single action, and select one pixel frame sequence segment from the multiple pixel frame sequence segments as the target pixel frame sequence;
  • the second action segmentation processing unit is configured to perform video action segmentation processing on the original depth map sequence, obtain a plurality of depth map sequence segments containing a single action, and select from the plurality of depth map sequence segments and The depth map sequence segment corresponding to the target pixel frame sequence is used as the target depth map sequence.
  • the embodiment of the present application also provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, any action detection method as shown in FIG. 1 is implemented.
  • the embodiment of the present application further provides a computer program product, which, when the computer program product is run on a terminal device, enables the terminal device to implement any action detection method as shown in FIG. 1 .
  • Fig. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 8 of this embodiment includes: a processor 80 , a memory 81 , and a computer program 82 stored in the memory 81 and operable on the processor 80 .
  • the processor 80 executes the computer program 82, it implements the steps in the above-mentioned embodiments of the various motion detection methods, such as steps 101 to 104 shown in FIG. 1 .
  • the processor 80 executes the computer program 82, it realizes the functions of the modules/units in the above-mentioned device embodiments, for example, the functions of the modules 701 to 704 shown in FIG. 7 .
  • the computer program 82 can be divided into one or more modules/units, and the one or more modules/units are stored in the memory 81 and executed by the processor 80 to complete the present application.
  • the one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program 82 in the terminal device 8 .
  • the so-called processor 80 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the storage 81 may be an internal storage unit of the terminal device 8 , such as a hard disk or memory of the terminal device 8 .
  • the memory 81 can also be an external storage device of the terminal device 8, such as a plug-in hard disk equipped on the terminal device 8, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc. Further, the memory 81 may also include both an internal storage unit of the terminal device 8 and an external storage device.
  • the memory 81 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 81 can also be used to temporarily store data that has been output or will be output.
  • the disclosed devices and methods may be implemented in other ways.
  • the system embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division; in actual implementation, there may be other division manners, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by the processor, the steps in the above-mentioned various method embodiments can be realized.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media exclude electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An action detection method and apparatus, a terminal device, and a storage medium. The method includes: acquiring a target video sequence containing a target action, the target video sequence including a one-to-one corresponding target pixel frame sequence and target depth map sequence (101); generating a pixel feature image according to the target pixel frame sequence, the pixel feature image containing the features of each frame image of the target pixel frame sequence (102); generating a depth feature image according to the target depth map sequence, the depth feature image containing the features of each frame image of the target depth map sequence (103); and inputting the pixel feature image and the depth feature image into a trained deep neural network to perform image feature extraction and fusion processing, so as to determine the category of the target action (104). The method can reduce, to a certain extent, the interference of environmental factors on the detection samples, thereby improving the accuracy of action detection.

Description

Action detection method and apparatus, terminal device, and storage medium
Technical Field
The present application relates to the technical field of image processing, and in particular to an action detection method and apparatus, a terminal device, and a storage medium.
Background
As a research branch of pattern recognition, action detection is widely used in fields such as video security surveillance, video retrieval, and healthcare. Action detection refers to recognizing and tracking a target (for example, a human body) in a video clip to determine the action category of the target.
At present, an action detection method based on RGB images is usually adopted, which realizes action detection by analyzing the pixel features of an RGB image sequence. However, RGB images used as detection samples are easily disturbed by environmental factors such as illumination changes, resulting in low accuracy of action detection.
Summary
In view of this, embodiments of the present application provide an action detection method and apparatus, a terminal device, and a storage medium, which can improve the accuracy of action detection.
A first aspect of the embodiments of the present application provides an action detection method, including:
acquiring a target video sequence containing a target action, the target video sequence including a one-to-one corresponding target pixel frame sequence and target depth map sequence;
generating a pixel feature image according to the target pixel frame sequence, the pixel feature image containing the features of each frame image of the target pixel frame sequence;
generating a depth feature image according to the target depth map sequence, the depth feature image containing the features of each frame image of the target depth map sequence; and
inputting the pixel feature image and the depth feature image into a trained deep neural network to perform image feature extraction and fusion processing, so as to determine the category of the target action.
In the embodiments of the present application, first, a target video sequence containing a target action is acquired, the target video sequence including a one-to-one corresponding pixel frame sequence and depth map sequence; then, a pixel feature image containing the image features of each frame image is generated according to the pixel frame sequence, and a depth feature image containing the image features of each frame image is generated according to the depth map sequence; finally, the pixel feature image and the depth feature image are input into a trained deep neural network for image feature extraction and fusion processing, so as to determine the category of the target action. The above process fuses the pixel features and depth features of the video images and utilizes the complementarity of pixel information and depth information, which can reduce, to a certain extent, the interference of environmental factors on the detection samples, thereby improving the accuracy of action detection.
In an implementation of the present application, the deep neural network includes a feature extraction module and a feature fusion module, and inputting the pixel feature image and the depth feature image into the trained deep neural network to perform image feature extraction and fusion processing so as to determine the category of the target action may include:
inputting the pixel feature image and the depth feature image into the feature extraction module for processing to obtain image semantic features;
inputting the image semantic features into the feature fusion module for processing to obtain fused image features; and
determining the category of the target action based on the fused image features.
In an implementation of the present application, generating the pixel feature image according to the target pixel frame sequence may include:
performing sparse sampling processing on the target pixel frame sequence in the time dimension to obtain a first image sequence; and
fusing the frame images contained in the first image sequence to obtain the pixel feature image.
Generating the depth feature image according to the target depth map sequence may include:
performing sparse sampling processing on the target depth map sequence in the time dimension to obtain a second image sequence; and
fusing the frame images contained in the second image sequence to obtain the depth feature image.
Further, fusing the frame images contained in the first image sequence to obtain the pixel feature image may include:
performing superposition, averaging, and rounding of image features on the frame images contained in the first image sequence according to pixel points at corresponding positions, to obtain the pixel feature image.
Fusing the frame images contained in the second image sequence to obtain the depth feature image may include:
converting the frame images contained in the second image sequence into grayscale images; and
performing superposition, averaging, and rounding of image features on the grayscale images according to pixel points at corresponding positions, to obtain the depth feature image.
In an implementation of the present application, after determining the category of the target action, the method may further include:
searching a preset standard action video library for a reference video sequence corresponding to the category of the target action, the reference video sequence containing the normalized target action;
performing sparse sampling processing on the reference video sequence in the time dimension to obtain a third image sequence, the number of image frames contained in the third image sequence being the same as the number of image frames contained in the first image sequence;
respectively marking the position of a designated part of a target object in each frame image contained in the first image sequence and the position of the designated part of the target object in each frame image contained in the third image sequence, the target object being the object that performs the target action;
constructing a first motion trajectory curve corresponding to the designated part according to the position of the designated part of the target object in each frame image contained in the first image sequence;
constructing a second motion trajectory curve corresponding to the designated part according to the position of the designated part of the target object in each frame image contained in the third image sequence;
calculating a curve error according to the first motion trajectory curve and the second motion trajectory curve; and
determining the degree of normalization of the target action contained in the target video sequence according to the curve error.
Further, calculating the curve error according to the first motion trajectory curve and the second motion trajectory curve may include:
separately calculating the distance between each target position point in the first motion trajectory curve and its corresponding position point in the second motion trajectory curve to obtain the error of each target position point; and
superimposing the errors of the target position points to obtain the curve error.
In an implementation of the present application, acquiring the target video sequence containing the target action may include:
acquiring an original video sequence containing multiple actions, the original video sequence including a one-to-one corresponding original pixel frame sequence and original depth map sequence;
performing video action segmentation processing on the original pixel frame sequence to obtain multiple pixel frame sequence segments each containing a single action, and selecting one pixel frame sequence segment from the multiple pixel frame sequence segments as the target pixel frame sequence; and
performing video action segmentation processing on the original depth map sequence to obtain multiple depth map sequence segments each containing a single action, and selecting, from the multiple depth map sequence segments, the depth map sequence segment corresponding to the target pixel frame sequence as the target depth map sequence.
A second aspect of the embodiments of the present application provides an action detection apparatus, including:
a video sequence acquisition module, configured to acquire a target video sequence containing a target action, the target video sequence including a one-to-one corresponding target pixel frame sequence and target depth map sequence;
a pixel feature generation module, configured to generate a pixel feature image according to the target pixel frame sequence, the pixel feature image containing the features of each frame image of the target pixel frame sequence;
a depth feature generation module, configured to generate a depth feature image according to the target depth map sequence, the depth feature image containing the features of each frame image of the target depth map sequence; and
an action detection module, configured to input the pixel feature image and the depth feature image into a trained deep neural network to perform image feature extraction and fusion processing, so as to determine the category of the target action.
A third aspect of the embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the action detection method provided in the first aspect of the embodiments of the present application when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer-readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the action detection method provided in the first aspect of the embodiments of the present application.
A fifth aspect of the embodiments of the present application provides a computer program product, which, when run on a terminal device, causes the terminal device to execute the action detection method described in the first aspect of the embodiments of the present application.
It can be understood that, for the beneficial effects of the second to fifth aspects, reference may be made to the relevant description of the first aspect, and details are not repeated here.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required in the description of the embodiments or the prior art. Obviously, the drawings described below are merely some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flowchart of an action detection method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of performing spatio-temporal information representation on a pixel frame sequence and a depth map sequence respectively to obtain the corresponding pixel feature image and depth feature image;
FIG. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a feature interaction module provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of segmenting and marking a human body into multiple designated parts using a visual saliency algorithm;
FIG. 6 is a schematic flowchart of a human action detection and action normalization evaluation method provided by an embodiment of the present application;
FIG. 7 is a structural diagram of an action detection apparatus provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a terminal device provided by an embodiment of the present application.
Detailed Description
In the following description, for the purpose of illustration rather than limitation, specific details such as particular system structures and technologies are set forth in order to provide a thorough understanding of the embodiments of the present application. However, it should be clear to those skilled in the art that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, apparatuses, circuits, and methods are omitted so that unnecessary details do not obscure the description of the present application. In addition, in the description of the specification and the appended claims of the present application, the terms "first", "second", "third", and the like are used only to distinguish descriptions and cannot be understood as indicating or implying relative importance.
The present application proposes an action detection method based on computer vision. By fusing the pixel features and depth features of video images, the complementarity of pixel information and depth information can be effectively utilized, greatly improving the overall action recognition efficiency and the anti-interference capability of the model. In addition, after an action is recognized, the normativeness of the action can be further evaluated. For more specific technical implementation details of the present application, refer to the method embodiments described below.
It should be understood that the execution subject of each method embodiment of the present application is various types of terminal devices or servers, such as mobile phones, tablet computers, laptop computers, desktop computers, and various wearable devices.
Referring to FIG. 1, an action detection method provided by an embodiment of the present application includes:
101. Acquire a target video sequence containing a target action, the target video sequence including a one-to-one corresponding target pixel frame sequence and target depth map sequence.
First, a target video sequence containing a target action is acquired. The target action is the action to be recognized, which may be any type of action performed by any target object (such as a human body, an animal, or a robot), for example, a human body sitting down, bowling, or doing push-ups. The target video sequence contains two image sequences: the first part is a target pixel frame sequence containing pixel features (specifically, an RGB image sequence or a grayscale image sequence, etc.), and the second part is a target depth map sequence containing depth features. The two image sequences correspond one to one, that is, the first frame image of the target pixel frame sequence corresponds to the first frame image of the target depth map sequence, the second frame image of the target pixel frame sequence corresponds to the second frame image of the target depth map sequence, and so on. In practice, certain specified types of cameras (such as Microsoft's Kinect camera) can be used to capture an RGB image sequence and its corresponding depth image sequence, with each frame of the two image sequences in one-to-one correspondence.
In an implementation of the present application, acquiring the target video sequence containing the target action may include:
(1) acquiring an original video sequence containing multiple actions, the original video sequence including a one-to-one corresponding original pixel frame sequence and original depth map sequence;
(2) performing video action segmentation processing on the original pixel frame sequence to obtain multiple pixel frame sequence segments each containing a single action, and selecting one pixel frame sequence segment from the multiple pixel frame sequence segments as the target pixel frame sequence;
(3) performing video action segmentation processing on the original depth map sequence to obtain multiple depth map sequence segments each containing a single action, and selecting, from the multiple depth map sequence segments, the depth map sequence segment corresponding to the target pixel frame sequence as the target depth map sequence.
In some application scenarios, what is acquired is usually an original video sequence containing multiple different actions, for example, an activity video of a person over a certain period of time, which may contain multiple different actions such as walking, sitting down, standing, and running. For these scenarios, an action segmentation method can be used to divide the original video sequence into video sequence segments each containing a single action, and then the action contained in each video sequence segment can be recognized by the action detection method proposed in this application, thereby recognizing all the actions contained in the entire original video sequence.
Specifically, the acquired original video sequence likewise includes a one-to-one corresponding original pixel frame sequence and original depth map sequence. Video action segmentation processing can be performed on the original pixel frame sequence and the original depth map sequence respectively, obtaining multiple pixel frame sequence segments each containing a single action and multiple depth map sequence segments each containing a single action. Then, one pixel frame sequence segment (the segment containing the target action that currently needs to be recognized) is selected from the multiple pixel frame sequence segments as the target pixel frame sequence, and the depth map sequence segment corresponding to the target pixel frame sequence (that is, the segment containing the target action that currently needs to be recognized) is selected from the multiple depth map sequence segments as the target depth map sequence.
In practice, an action segmentation method based on quantity of movement (QOM) can be used. For an original video sequence containing multiple different actions, each frame image has movement information relative to its adjacent frame images and the first frame image, and the start frame and end frame of each action can be detected according to the corresponding quantity of movement, thereby realizing action segmentation. For example, assuming an original video sequence I, the QOM of its t-th frame image can be defined by a formula (given as an image in the original publication), where (m, n) denotes the pixel coordinates in an image and Ψ(x, y) is defined by a further formula (also given as an image). Threshold_QOM is a preset parameter, which can generally be set to 60 based on empirical values. In addition, a parameter Threshold_inter is set as the threshold for intra-action segmentation, which can be iteratively updated by a sliding-window method. Assuming that the average length of an action is L frames, the average QOM value of the first 12.5% and the last 12.5% of the frame images can be used as candidate values of Threshold_inter. Then, the frame corresponding to the minimum QOM value within a length of L frames is selected by comparison as the action boundary frame (start frame and end frame), thereby completing the action segmentation. In general, the QOM-based action segmentation method realizes action detection and segmentation of a video sequence mainly through the combination of changes in the quantity of movement and the time scale, and finally obtains video sequence segments each containing only a single action.
102. Generate a pixel feature image according to the target pixel frame sequence, the pixel feature image containing the features of each frame image of the target pixel frame sequence.
After the target pixel frame sequence is obtained, a pixel feature image can be generated based on the frame images of the target pixel frame sequence. The pixel feature image contains the features of every frame image of the target pixel frame sequence and can be used to characterize the overall pixel features of the target pixel frame sequence. For example, assuming the target pixel frame sequence consists of N frames of RGB images, an RGB-like pixel feature image can be generated based on the N frames of RGB images to characterize their overall pixel features. The pixel feature image obtained in this process can be called the spatio-temporal information representation of the target pixel frame sequence.
In an implementation of the present application, generating the pixel feature image according to the target pixel frame sequence may include:
(1) performing sparse sampling processing on the target pixel frame sequence in the time dimension to obtain a first image sequence;
(2) fusing the frame images contained in the first image sequence to obtain the pixel feature image.
The pixel feature image can be obtained by fusing the frame images contained in the target pixel frame sequence. However, since the target pixel frame sequence contains many frame images, fusing all of them would incur a large amount of computation and affect the running speed of the algorithm. Therefore, sparse sampling processing can first be performed on the target pixel frame sequence in the time dimension to remove redundant information between frames and reduce the amount of computation. In addition, the sparse sampling can adopt average sampling to avoid uneven action representation and loss of spatial-dimension information. For example, assuming the target pixel frame sequence contains 100 frames, sparse average sampling can be used to extract the 5th frame, the 15th frame, the 25th frame, ..., the 95th frame, a total of 10 frames, as the first image sequence; that is, these 10 frames can be used to represent the entire target action. Then, the frame images contained in the first image sequence can be fused by means such as image superposition to obtain the corresponding pixel feature image.
Specifically, fusing the frame images contained in the first image sequence to obtain the pixel feature image may include:
performing superposition, averaging, and rounding of image features on the frame images contained in the first image sequence according to pixel points at corresponding positions, to obtain the pixel feature image.
Assuming the first image sequence is an RGB image sequence, this process can treat the three RGB channels as a vector matrix; that is, each RGB frame in the RGB image sequence has a corresponding vector matrix. After the pixel features contained in these vector matrices are superimposed, averaged, and rounded, a final vector matrix is obtained, which is the vector matrix corresponding to the pixel feature image and can also be called the spatio-temporal information representation sample of the RGB image sequence. For example, assuming the first image sequence is ⟨I_1, I_2, I_3, ..., I_T⟩, that is, it contains T frames in total, its corresponding pixel feature image M can be expressed by a formula (given as an image in the original publication), i.e., the per-pixel rounded average of the T frames.
103. Generate a depth feature image according to the target depth map sequence, the depth feature image containing the features of each frame image of the target depth map sequence.
Similarly to step 102, after the target depth map sequence is obtained, a depth feature image can be generated based on the frame images of the target depth map sequence. The depth feature image contains the features of every frame image of the target depth map sequence and can be used to characterize the overall depth features of the target depth map sequence. For example, assuming the target depth map sequence consists of N frames of depth images, an approximate depth image can be generated based on the N frames of depth images to characterize their overall depth features. The depth feature image obtained in this process can be called the spatio-temporal information representation of the target depth map sequence.
In an implementation of the present application, generating the depth feature image according to the target depth map sequence may include:
(1) performing sparse sampling processing on the target depth map sequence in the time dimension to obtain a second image sequence;
(2) fusing the frame images contained in the second image sequence to obtain the depth feature image.
Similarly to the method of generating the pixel feature image from the target pixel frame sequence, sparse sampling processing can also be performed on the target depth map sequence in the time dimension to remove redundant information between frames and reduce the amount of computation. For example, assuming the target depth map sequence contains 100 depth images, sparse average sampling can be used to extract the 5th frame, the 15th frame, the 25th frame, ..., the 95th frame (the frame indices corresponding one-to-one to those of the first image sequence in step 102), a total of 10 depth images, as the second image sequence; that is, these 10 depth images can be used to represent the entire target action. Then, the depth images contained in the second image sequence can be fused by means such as image superposition to obtain the corresponding depth feature image.
Specifically, fusing the frame images contained in the second image sequence to obtain the depth feature image may include:
(1) converting the frame images contained in the second image sequence into grayscale images;
(2) performing superposition, averaging, and rounding of image features on the grayscale images according to pixel points at corresponding positions, to obtain the depth feature image.
Since the second image sequence is composed of depth images that represent distance information, each depth image it contains first needs to be converted, for example by scaling, into a grayscale image with grayscale values of 0-255, which can then be treated as a single-channel vector matrix so that image fusion can be conveniently realized. That is, each depth image in the second image sequence has a corresponding single-channel vector matrix; after the pixel features contained in these vector matrices are superimposed, averaged, and rounded, a final vector matrix is obtained, which is the vector matrix corresponding to the depth feature image and can also be called the spatio-temporal information representation sample of the second image sequence.
As shown in FIG. 2, which is a schematic diagram of performing spatio-temporal information representation on the pixel frame sequence and the depth map sequence respectively to obtain the corresponding pixel feature image and depth feature image, it can be seen that the obtained pixel feature image contains the overall pixel features of the pixel frame sequence, and the obtained depth feature image contains the overall depth features of the depth map sequence.
104. Input the pixel feature image and the depth feature image into a trained deep neural network to perform image feature extraction and fusion processing, so as to determine the category of the target action.
After the pixel feature image is obtained in step 102 and the depth feature image is obtained in step 103, the two feature images can be input into a pre-trained deep neural network for processing. By performing image feature extraction and fusion, the deep neural network can jointly learn the pixel features and depth features in the feature images and finally output a category label, thereby determining the category of the target action. In practice, the deep neural network can use mature network model architectures such as ResNet, Inception, and VGG; the present application does not limit the type and structure of the deep neural network.
In an implementation of the present application, the deep neural network includes a feature extraction module and a feature fusion module, and inputting the pixel feature image and the depth feature image into the trained deep neural network to perform image feature extraction and fusion processing so as to determine the category of the target action may include:
(1) inputting the pixel feature image and the depth feature image into the feature extraction module for processing to obtain image semantic features;
(2) inputting the image semantic features into the feature fusion module for processing to obtain fused image features;
(3) determining the category of the target action based on the fused image features.
The feature extraction module can be constructed with a structure of multi-level convolutional layers and pooling layers and is used to extract the image semantic features of the pixel feature image and the depth feature image; then, the image semantic features are fused by a feature fusion module containing convolutional neural units (the fusion is mainly realized by element-wise multiplication of features, weighted summation, or taking the maximum value) to obtain the fused image features. The fused image features are distinguishing features between the target action and other actions, so action recognition can be realized based on these features, that is, the category of the target action can be determined.
A schematic structural diagram of the deep neural network is shown in FIG. 3. In FIG. 3, the deep neural network consists of two parts, a feature extraction module and a feature fusion module, where the feature extraction module is mainly composed of multi-level cascaded convolutional layers, a feature interaction module, and a fully connected layer, and the feature fusion module is mainly composed of multi-level cascaded convolutional neural units (indicated by the circled 3 in the figure). A schematic structural diagram of the feature interaction module is shown in FIG. 4. This module mainly contains two convolutional layers with 1×1 convolution kernels; after the image features of the two modalities (pixel features and depth features) are input into this module, complementary learning of mid-level semantic features and high-level semantic features can be performed on the image features (mid-level semantic features generally refer to the features learned during network model parameter learning, and high-level semantic features generally refer to the features output after network model learning is completed that can classify samples).
Assuming the pixel feature image is a three-channel RGB image and the depth feature image is a single-channel depth image, to adapt them to the deep neural network, the feature images of the two modalities can be concatenated to obtain a four-channel sample. For example, an RGB image can be expressed as a three-channel vector matrix and a depth image as a single-channel vector matrix; splicing the two vector matrices yields a four-channel vector matrix. Correspondingly, the number of input channels of the deep neural network is also four. After the four-channel sample is input into the deep neural network and processed by the multi-level convolutional layers, pooling layers, and the aforementioned interaction module, the mid-level and high-level semantic features of the sample are learned; after that, the high-level semantic features are fused by the feature fusion module to obtain the distinguishing features of the target action, and finally action classification is completed based on the distinguishing features.
在某些应用场景中,除识别出用户的动作类型之外,还需要进一步检测该动作是否规范,并给出相应的规范性评价结果,以便纠正用户的错误动作。有鉴于此,在本申请的一种实现方式中,在确定所述目标动作的类别之后,还可以包括:
(1)从预设的规范动作视频库中查找与所述目标动作的类别对应的基准视频序列,所述基准视频序列包含规范化的所述目标动作;
(2)在时间维度上对所述基准视频序列执行稀疏采样处理,得到第三图像序列,所述第三图像序列包含的图像帧数和所述第一图像序列包含的图像帧数相同;
(3)分别标注出所述第一图像序列包含的各帧图像中目标物体具有的指定部位的位置,以及所述第三图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,所述目标物体为执行所述目标动作的物体;
(4)根据所述第一图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,构建得到所述指定部位对应的第一运动轨迹曲线;
(5)根据所述第三图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,构建得到所述指定部位对应的第二运动轨迹曲线;
(6)根据所述第一运动轨迹曲线和所述第二运动轨迹曲线,计算得到曲线误差;
(7)根据所述曲线误差确定所述目标视频序列包含的所述目标动作的规范化程度。
可以预先在指定的存储区域(例如某个数据库)构建一个规范动作视频库,将规范化的各种类型动作的视频序列(例如人体标准跑步动作的视频序列、人体标准俯卧撑动作的视频序列等)存入该规范动作视频库。在确定目标动作的类别之后,可以从该规范动作视频库中查找与该目标动作的类型对应的基准视 频序列,该基准视频序列包含规范化的所述目标动作。例如,若目标动作为跑步,则基准视频序列为人体标准跑步动作的视频序列。
然后,在时间维度上对基准视频序列执行稀疏采样处理,得到第三图像序列,需要注意第三图像序列包含的图像帧数和前文所述的第一图像序列包含的图像帧数是相同的。例如,假设第一图像序列为包含T帧图像的〈I 1,I 2,I 3,…,I T〉,则第三图像序列可以表示为包含T帧图像的〈N 1,N 2,N 3,…,N T〉。
接着,分别标注出第一图像序列包含的各帧图像中目标物体具有的指定部位的位置,以及第三图像序列包含的各帧图像中目标物体具有的该指定部位的位置;其中,该目标物体是执行目标动作的物体,该指定部位的数量可以是一个,也可以是多个(为了提高后续曲线误差计算的准确率,一般需要设置多个)。例如,若目标物体是人体,则可以采用视觉显著性算法等方式将人体分割标注成头部、躯干、左手、右手、左脚和右脚等6个指定部位,如图5所示。
之后,根据第一图像序列包含的各帧图像中该目标物体具有的各个指定部位的位置,基于时间维度构建出各个指定部位分别对应的第一运动轨迹曲线。例如,针对人体的6个指定部位,可以将第一图像序列包含的各帧图像中人体头部中心点(可以采用归一化的操作,将人体头部表示为一个中心点,具体可以计算人体头部包含的所有坐标点的平均坐标值作为中心点,针对其它指定部位均可以采用相同的操作方式)的坐标连接起来,得到人体头部对应的第一运动轨迹曲线;将第一图像序列包含的各帧图像中人体躯干中心点的坐标连接起来,得到人体躯干对应的第一运动轨迹曲线,以此类推,分别构建出6个指定部位对应的第一运动轨迹曲线。
同样的,可以根据第三图像序列包含的各帧图像中该目标物体具有的各个指定部位的位置,基于时间维度构建得到各个指定部位分别对应的第二运动轨迹曲线。也即,针对人体的6个指定部位,可以将第三图像序列包含的各帧图像中人体头部中心点的坐标连接起来,得到人体头部对应的第二运动轨迹曲线;将第三图像序列包含的各帧图像中人体躯干中心点的坐标连接起来,得到人体躯干对应的第二运动轨迹曲线,以此类推,分别构建出6个指定部位对应的第二运动轨迹曲线。
接下来,根据构建的第一运动轨迹曲线和第二运动轨迹曲线,可以计算得到曲线误差。假设指定部位只有一个,即第一运动轨迹曲线和第二运动轨迹曲线都只有一条,那么曲线误差可以是该第一运动轨迹曲线和该第二运动轨迹曲线之差。具体的,可以分别计算第一运动轨迹曲线中的每个目标位置点和其在第二运动轨迹曲线中的对应位置点之间的距离,得到每个目标位置点的误差,然后再将每个目标位置点的误差叠加,得到曲线误差。这里的目标位置点可以是运动轨迹曲线连接的节点,例如,假设指定部位为人体头部,则第一运动轨迹曲线中的各个目标位置点可以为第一图像序列包含的各帧图像中的人体头部中心点,各个目标位置点在第二运动轨迹曲线中的对应位置点可以为第三图像序列包含的各帧图像中的人体头部中心点。该曲线误差可以采用以下公式计算:
err = d_1 + d_2 + … + d_T = ∑_{t=1}^{T} d_t
其中,err表示曲线误差,t=1,2,3…T,T表示第一图像序列和第三图像序列的图像帧数,d_t表示第一运动轨迹曲线中的第t个目标位置点和其在第二运动轨迹曲线中的对应位置点之间的距离。
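按照上述公式，曲线误差的计算可以用如下示意性代码表示（此处以欧氏距离作为两点间的距离，这一取法为示例性假设），多个指定部位的情形也一并给出，对应下文的误差分量相加：

```python
import numpy as np

def curve_error(traj_a, traj_b):
    """逐点计算两条运动轨迹曲线对应位置点之间的距离并叠加，得到曲线误差err。"""
    traj_a, traj_b = np.asarray(traj_a), np.asarray(traj_b)   # 形状均为 (T, 2)
    return float(np.linalg.norm(traj_a - traj_b, axis=1).sum())

def total_curve_error(trajs_a, trajs_b):
    """指定部位有多个时，将各部位的曲线误差分量相加，得到总的曲线误差。"""
    return sum(curve_error(a, b) for a, b in zip(trajs_a, trajs_b))
```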
假设指定部位有多个,即第一运动轨迹曲线和第二运动轨迹曲线都有多条,则可以采用以上方法分别计算得到各个指定部位的曲线误差分量,最后再将各个指定部位的曲线误差分量相加,得到总的曲线误差。例如,针对人体的6个指定部位,可以分别计算得到6个曲线误差分量:err_头部、err_躯干、err_左手、err_右手、err_左脚和err_右脚,则总的曲线误差err_总可以表示为:
err_总 = err_头部 + err_躯干 + err_左手 + err_右手 + err_左脚 + err_右脚
在计算得到曲线误差之后,可以根据该曲线误差确定目标视频序列包含的目标动作的规范化程度。具体的,该曲线误差可以用于表征目标视频序列包含的目标动作和基准视频序列包含的规范化目标动作(标准动作)之间的偏差程度,也即若该曲线误差越小,则表示目标视频序列包含的目标动作和标准动作之间的偏差越小,即该目标动作的规范化程度越高。在实际操作中,可以先对该曲线误差执行归一化处理,这里可以采用计算误差倒数的方法来处理,即:
1/err_总 = 1/err_头部 + 1/err_躯干 + 1/err_左手 + 1/err_右手 + 1/err_左脚 + 1/err_右脚
然后,可以通过softmax函数将归一化后的曲线误差转换为概率值,该概率值越大则表示待测样本(即目标视频序列)和规范化动作样本(即基准视频序列)越接近,即待测样本的规范化程度越高。反之,则表示待测样本和规范化动作样本的偏差越大,即待测样本的规范化程度越低。
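对"误差取倒数后经softmax转换为概率值"的过程，下面给出一段示意性代码；由于本申请未限定softmax的具体输入集合，此处"与一个固定参考分值一起归一化"的做法仅为示例性假设：

```python
import numpy as np

def softmax(x):
    """数值稳定的softmax，把一组分值转换为概率分布。"""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(x - x.max())
    return e / e.sum()

err_parts = [12.3, 8.7, 5.1, 6.4, 9.0, 7.2]             # 各部位曲线误差分量（数值仅为示例）
inv_err_sum = sum(1.0 / (e + 1e-6) for e in err_parts)  # 误差倒数求和，作为归一化后的分值

probs = softmax([inv_err_sum, 1.0])                     # 假设与固定参考分值1.0一起做softmax
print(round(float(probs[0]), 4))  # 概率值越大，表示待测样本与规范化动作样本越接近
```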
在本申请实施例中,首先,获取包含目标动作的目标视频序列,该目标视频序列包括一一对应的像素帧序列和深度图序列;然后,根据像素帧序列生成包含其中各帧图像的图像特征的像素特征图像,以及根据深度图序列生成包含其中各帧图像的图像特征的深度特征图像;最后,将像素特征图像和深度特征图像输入一个已训练的深度神经网络进行图像特征的提取与融合处理,从而确定目标动作的类别。上述过程将视频图像的像素特征和深度特征融合,利用像素信息和深度信息的互补性,能够在一定程度上减弱环境因素对检测样本的干扰,从而提高动作检测的准确率。
为便于理解本申请实施例提出的动作检测与动作规范化评价方法,以下列举一个实际的应用场景。如图6所示,是本申请实施例提出的一种人体动作检测与动作规范化评价方法的流程示意图。
在图6中,首先输入原始视频序列,然后对该原始视频序列执行QOM动作分割处理,得到包含单一人体动作的目标视频序列;接着,对目标视频序列执行稀疏平均采样处理,得到第一图像序列;然后,获取第一图像序列的时空信息表示得到像素特征图像,将其和对应的深度特征图像输入深度神经网络进行识别,得到动作类别;之后,从规范动作视频库中查找与该动作类别对应的基准视频序列,对基准视频序列执行稀疏平均采样处理,得到第三图像序列;然后,采用显著性检测的方式,分别标注出第一图像序列包含的各帧图像中人体头部、躯干、左手、右手、左脚和右脚6个指定部位的位置,并据此构建得到对应的第一运动轨迹曲线,以及标注出第三图像序列包含的各帧图像中该6个指定部位的位置,并据此构建得到对应的第二运动轨迹曲线;接下来,根据两部分运动轨迹曲线计算得到对应的曲线误差,最后对曲线误差执行归一化处理,通过softmax函数将归一化后的曲线误差转换为概率值,并根据该概率值的大小评估目标视频序列包含的人体动作的规范化程度。
综上所述,本申请实施例通过将视频图像的像素特征和深度特征融合,能够有效利用像素信息和深度信息的互补性,极大地提高整体动作识别效率和模型的抗干扰能力。而且,通过将待测样本的动作和规范化动作样本的动作进行向量化的误差比对,能够实现动作的规范化程度评价,在体育锻炼等领域中具有重大的应用价值。
应理解,上述各个实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
上面主要描述了一种动作检测方法,下面将对一种动作检测装置进行描述。
请参阅图7,本申请实施例中一种动作检测装置的一个实施例包括:
视频序列获取模块701,用于获取包含目标动作的目标视频序列,所述目标视频序列包括一一对应的目标像素帧序列和目标深度图序列;
像素特征生成模块702,用于根据所述目标像素帧序列生成像素特征图像,所述像素特征图像包含所述目标像素帧序列具有的各帧图像的特征;
深度特征生成模块703,用于根据所述目标深度图序列生成深度特征图像,所述深度特征图像包含所述目标深度图序列具有的各帧图像的特征;
动作检测模块704,用于将所述像素特征图像和所述深度特征图像输入已训练的深度神经网络执行图像特征的提取与融合处理,以确定所述目标动作的类别。
在本申请的一种实现方式中,所述深度神经网络包括特征提取模块和特征融合模块,所述动作检测模块可以包括:
特征提取单元,用于将所述像素特征图像和所述深度特征图像输入所述特征提取模块进行处理,得到图像语义特征;
特征融合单元,用于将所述图像语义特征输入所述特征融合模块进行处理,得到融合后的图像特征;
动作检测单元,用于基于所述融合后的图像特征确定所述目标动作的类别。
在本申请的一种实现方式中,所述像素特征生成模块可以包括:
第一稀疏采样处理单元,用于在时间维度上对所述目标像素帧序列执行稀疏采样处理,得到第一图像序列;
第一图像融合单元,用于将所述第一图像序列包含的各帧图像融合,得到所述像素特征图像;
所述深度特征生成模块可以包括:
第二稀疏采样处理单元,用于在时间维度上对所述目标深度图序列执行稀疏采样处理,得到第二图像序列;
第二图像融合单元,用于将所述第二图像序列包含的各帧图像融合,得到所述深度特征图像。
进一步的,所述第一图像融合单元可以包括:
像素特征处理子单元,用于对所述第一图像序列包含的各帧图像按照对应位置像素点执行图像特征的叠加、求平均和取整操作,得到所述像素特征图像;
所述第二图像融合单元可以包括:
灰度转换子单元,用于将所述第二图像序列包含的各帧图像分别转换成各帧灰度图像;
深度特征处理子单元,用于对所述各帧灰度图像按照对应位置像素点执行图像特征的叠加、求平均和取整操作,得到所述深度特征图像。
在本申请的一种实现方式中,所述动作检测装置还可以包括:
基准视频查找模块,用于从预设的规范动作视频库中查找与所述目标动作的类别对应的基准视频序列,所述基准视频序列包含规范化的所述目标动作;
稀疏采样处理模块,用于在时间维度上对所述基准视频序列执行稀疏采样处理,得到第三图像序列,所述第三图像序列包含的图像帧数和所述第一图像序列包含的图像帧数相同;
显著性标注模块,用于分别标注出所述第一图像序列包含的各帧图像中目标物体具有的指定部位的位置,以及所述第三图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,所述目标物体为执行所述目标动作的物体;
第一曲线构建模块,用于根据所述第一图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,构建得到所述指定部位对应的第一运动轨迹曲线;
第二曲线构建模块,用于根据所述第三图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,构建得到所述指定部位对应的第二运动轨迹曲线;
曲线误差计算模块,用于根据所述第一运动轨迹曲线和所述第二运动轨迹曲线,计算得到曲线误差;
规范化评价模块,用于根据所述曲线误差确定所述目标视频序列包含的所述目标动作的规范化程度。
进一步的,所述曲线误差计算模块可以包括:
位置点误差计算单元,用于分别计算所述第一运动轨迹曲线中的每个目标位置点和其在所述第二运动轨迹曲线中的对应位置点之间的距离,得到每个所述目标位置点的误差;
误差叠加单元,用于将每个所述目标位置点的误差叠加,得到所述曲线误差。
在本申请的一种实现方式中,所述视频序列获取模块可以包括:
原始视频序列获取单元,用于获取包含多个动作的原始视频序列,所述原始视频序列包括一一对应的原始像素帧序列和原始深度图序列;
第一动作分割处理单元,用于对所述原始像素帧序列执行视频的动作分割处理,得到多个包含单一动作的像素帧序列分段,并从所述多个像素帧序列分段中选取一个像素帧序列分段,作为所述目标像素帧序列;
第二动作分割处理单元,用于对所述原始深度图序列执行视频的动作分割处理,得到多个包含单一动作的深度图序列分段,并从所述多个深度图序列分段中选取与所述目标像素帧序列对应的深度图序列分段,作为所述目标深度图序列。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现如图1所示的任意一种动作检测方法。
本申请实施例还提供一种计算机程序产品,当该计算机程序产品在终端设备上运行时,使得终端设备执行如图1所示的任意一种动作检测方法。
图8是本申请一实施例提供的终端设备的示意图。如图8所示,该实施例的终端设备8包括:处理器80、存储器81以及存储在所述存储器81中并可在所述处理器80上运行的计算机程序82。所述处理器80执行所述计算机程序82时实现上述各个动作检测方法的实施例中的步骤,例如图1所示的步骤101至104。或者,所述处理器80执行所述计算机程序82时实现上述各装置实施例中各模块/单元的功能,例如图7所示模块701至704的功能。
所述计算机程序82可以被分割成一个或多个模块/单元,所述一个或者多个模块/单元被存储在所述存储器81中,并由所述处理器80执行,以完成本申请。所述一个或多个模块/单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述所述计算机程序82在所述终端设备8中的执行过程。
所称处理器80可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
所述存储器81可以是所述终端设备8的内部存储单元,例如终端设备8的硬盘或内存。所述存储器81也可以是所述终端设备8的外部存储设备,例如所述终端设备8上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。进一步地,所述存储器81还可以既包括所述终端设备8的内部存储单元也包括外部存储设备。所述存储器81用于存储所述计算机程序以及所述终端设备所需的其他程序和数据。所述存储器81还可以用于暂时地存储已经输出或者将要输出的数据。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。上述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的实施例中,应该理解到,所揭露的装置和方法,可以通过其它的方式实现。例如,以上所描述的系统实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,也可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的步骤。其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质可以包括:能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质等。需要说明的是,所述计算机可读介质包含的内容可以根据司法管辖区内立法和专利实践的要求进行适当的增减,例如在某些司法管辖区,根据立法和专利实践,计算机可读介质不包括电载波信号和电信信号。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (10)

  1. 一种动作检测方法,其特征在于,包括:
    获取包含目标动作的目标视频序列,所述目标视频序列包括一一对应的目标像素帧序列和目标深度图序列;
    根据所述目标像素帧序列生成像素特征图像,所述像素特征图像包含所述目标像素帧序列具有的各帧图像的特征;
    根据所述目标深度图序列生成深度特征图像,所述深度特征图像包含所述目标深度图序列具有的各帧图像的特征;
    将所述像素特征图像和所述深度特征图像输入已训练的深度神经网络执行图像特征的提取与融合处理,以确定所述目标动作的类别。
  2. 如权利要求1所述的方法,其特征在于,所述深度神经网络包括特征提取模块和特征融合模块,所述将所述像素特征图像和所述深度特征图像输入已训练的深度神经网络执行图像特征的提取与融合处理,以确定所述目标动作的类别,包括:
    将所述像素特征图像和所述深度特征图像输入所述特征提取模块进行处理,得到图像语义特征;
    将所述图像语义特征输入所述特征融合模块进行处理,得到融合后的图像特征;
    基于所述融合后的图像特征确定所述目标动作的类别。
  3. 如权利要求1所述的方法,其特征在于,所述根据所述目标像素帧序列生成像素特征图像,包括:
    在时间维度上对所述目标像素帧序列执行稀疏采样处理,得到第一图像序列;
    将所述第一图像序列包含的各帧图像融合,得到所述像素特征图像;
    所述根据所述目标深度图序列生成深度特征图像,包括:
    在时间维度上对所述目标深度图序列执行稀疏采样处理,得到第二图像序列;
    将所述第二图像序列包含的各帧图像融合,得到所述深度特征图像。
  4. 如权利要求3所述的方法,其特征在于,所述将所述第一图像序列包含的各帧图像融合,得到所述像素特征图像,包括:
    对所述第一图像序列包含的各帧图像按照对应位置像素点执行图像特征的叠加、求平均和取整操作,得到所述像素特征图像;
    所述将所述第二图像序列包含的各帧图像融合,得到所述深度特征图像,包括:
    将所述第二图像序列包含的各帧图像分别转换成各帧灰度图像;
    对所述各帧灰度图像按照对应位置像素点执行图像特征的叠加、求平均和取整操作,得到所述深度特征图像。
  5. 如权利要求3所述的方法,其特征在于,在确定所述目标动作的类别之后,还包括:
    从预设的规范动作视频库中查找与所述目标动作的类别对应的基准视频序列,所述基准视频序列包含规范化的所述目标动作;
    在时间维度上对所述基准视频序列执行稀疏采样处理,得到第三图像序列,所述第三图像序列包含的图像帧数和所述第一图像序列包含的图像帧数相同;
    分别标注出所述第一图像序列包含的各帧图像中目标物体具有的指定部位的位置,以及所述第三图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,所述目标物体为执行所述目标动作的物体;
    根据所述第一图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,构建得到所述指定部位对应的第一运动轨迹曲线;
    根据所述第三图像序列包含的各帧图像中所述目标物体具有的所述指定部位的位置,构建得到所述指定部位对应的第二运动轨迹曲线;
    根据所述第一运动轨迹曲线和所述第二运动轨迹曲线,计算得到曲线误差;
    根据所述曲线误差确定所述目标视频序列包含的所述目标动作的规范化程度。
  6. 如权利要求5所述的方法,其特征在于,所述根据所述第一运动轨迹曲线和所述第二运动轨迹曲线,计算得到曲线误差,包括:
    分别计算所述第一运动轨迹曲线中的每个目标位置点和其在所述第二运动轨迹曲线中的对应位置点之间的距离,得到每个所述目标位置点的误差;
    将每个所述目标位置点的误差叠加,得到所述曲线误差。
  7. 如权利要求1至6中任一项所述的方法,其特征在于,所述获取包含目标动作的目标视频序列,包括:
    获取包含多个动作的原始视频序列,所述原始视频序列包括一一对应的原始像素帧序列和原始深度图序列;
    对所述原始像素帧序列执行视频的动作分割处理,得到多个包含单一动作的像素帧序列分段,并从所述多个包含单一动作的像素帧序列分段中选取一个像素帧序列分段,作为所述目标像素帧序列;
    对所述原始深度图序列执行视频的动作分割处理,得到多个包含单一动作的深度图序列分段,并从所述多个包含单一动作的深度图序列分段中选取与所述目标像素帧序列对应的深度图序列分段,作为所述目标深度图序列。
  8. 一种动作检测装置,其特征在于,包括:
    视频序列获取模块,用于获取包含目标动作的目标视频序列,所述目标视频序列包括一一对应的目标像素帧序列和目标深度图序列;
    像素特征生成模块,用于根据所述目标像素帧序列生成像素特征图像,所述像素特征图像包含所述目标像素帧序列具有的各帧图像的特征;
    深度特征生成模块,用于根据所述目标深度图序列生成深度特征图像,所述深度特征图像包含所述目标深度图序列具有的各帧图像的特征;
    动作检测模块,用于将所述像素特征图像和所述深度特征图像输入已训练的深度神经网络执行图像特征的提取与融合处理,以确定所述目标动作的类别。
  9. 一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至7中任一项所述的动作检测方法。
  10. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1至7中任一项所述的动作检测方法。
PCT/CN2021/138566 2021-08-04 2021-12-15 一种动作检测方法、装置、终端设备和存储介质 WO2023010758A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110889116.5 2021-08-04
CN202110889116.5A CN113326835B (zh) 2021-08-04 2021-08-04 一种动作检测方法、装置、终端设备和存储介质

Publications (1)

Publication Number Publication Date
WO2023010758A1 true WO2023010758A1 (zh) 2023-02-09

Family

ID=77427065

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138566 WO2023010758A1 (zh) 2021-08-04 2021-12-15 一种动作检测方法、装置、终端设备和存储介质

Country Status (2)

Country Link
CN (1) CN113326835B (zh)
WO (1) WO2023010758A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113326835B (zh) * 2021-08-04 2021-10-29 中国科学院深圳先进技术研究院 一种动作检测方法、装置、终端设备和存储介质
CN114067442B (zh) * 2022-01-18 2022-04-19 深圳市海清视讯科技有限公司 洗手动作检测方法、模型训练方法、装置及电子设备
CN114495015A (zh) * 2022-03-30 2022-05-13 行为科技(北京)有限公司 人体姿态检测方法和装置
CN115170934B (zh) * 2022-09-05 2022-12-23 粤港澳大湾区数字经济研究院(福田) 一种图像分割方法、系统、设备及存储介质
CN116110131B (zh) * 2023-04-11 2023-06-30 深圳未来立体教育科技有限公司 一种身体交互行为识别方法及vr系统

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183355B (zh) * 2020-09-28 2022-12-27 北京理工大学 基于双目视觉和深度学习的出水高度检测系统及其方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108699A (zh) * 2017-12-25 2018-06-01 重庆邮电大学 融合深度神经网络模型和二进制哈希的人体动作识别方法
US20200265567A1 (en) * 2019-02-18 2020-08-20 Samsung Electronics Co., Ltd. Techniques for convolutional neural network-based multi-exposure fusion of multiple image frames and for deblurring multiple image frames
CN112131928A (zh) * 2020-08-04 2020-12-25 浙江工业大学 一种rgb-d图像特征融合的人体姿态实时估计方法
CN112257526A (zh) * 2020-10-10 2021-01-22 中国科学院深圳先进技术研究院 一种基于特征交互学习的动作识别方法及终端设备
CN113326835A (zh) * 2021-08-04 2021-08-31 中国科学院深圳先进技术研究院 一种动作检测方法、装置、终端设备和存储介质

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880616A (zh) * 2023-03-08 2023-03-31 城云科技(中国)有限公司 大型工程车辆清洗过程规范判定方法、装置及其应用
CN116434335A (zh) * 2023-03-30 2023-07-14 东莞理工学院 动作序列识别和意图推断方法、装置、设备及存储介质
CN116434335B (zh) * 2023-03-30 2024-04-30 东莞理工学院 动作序列识别和意图推断方法、装置、设备及存储介质
CN116311004A (zh) * 2023-05-23 2023-06-23 南京信息工程大学 基于稀疏光流提取的视频运动目标检测方法
CN116311004B (zh) * 2023-05-23 2023-08-15 南京信息工程大学 基于稀疏光流提取的视频运动目标检测方法

Also Published As

Publication number Publication date
CN113326835A (zh) 2021-08-31
CN113326835B (zh) 2021-10-29

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21952621

Country of ref document: EP

Kind code of ref document: A1