CN115937975A - Action recognition method and system based on multi-modal sequence fusion
- Publication number
- CN115937975A CN115937975A CN202211568552.3A CN202211568552A CN115937975A CN 115937975 A CN115937975 A CN 115937975A CN 202211568552 A CN202211568552 A CN 202211568552A CN 115937975 A CN115937975 A CN 115937975A
- Authority
- CN
- China
- Prior art keywords
- action
- video
- time
- modal
- fusion
- Prior art date
- Legal status: Pending
Abstract
The invention discloses an action recognition method based on multi-modal sequence fusion. The method comprises: acquiring a human action video to be recognized and performing action labeling on the action video to obtain video frames; acquiring the spatial position corresponding to the human action and detecting the spatial candidate box coordinates and action category of each frame of action; performing temporal action detection by using the correlation between consecutive frames to locate the time period in which the action occurs; preprocessing the video frames to obtain a data set; performing feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavior features and their corresponding temporal features; and constructing a network model from the behavior features and the temporal features, then inputting a plurality of modal information into the network model for feature fusion and classification to complete the recognition of human actions. Recognizing human actions with multi-modal fusion enhances the accuracy and robustness of the model in real scenes and improves the stability of operation and the recognition accuracy.
Description
Technical Field
The invention belongs to the technical field of motion recognition, and particularly relates to a motion recognition method and system based on multi-modal sequence fusion.
Background
Human behavior recognition has long been a hot topic in the field of human-computer interaction. The technology has broad application prospects and can bring good economic benefits, and many practical scenarios are closely related to it, such as recognizing dangerous activities in video behavior monitoring systems and perceiving human behavior in automatic navigation systems, which makes safe operation easier to achieve. Although action recognition has already been widely applied in many aspects of society, the daily behaviors of human beings are complex and diverse: small changes in motion may produce completely different behaviors and vary with the environment. Many problems in this field still need to be solved in the real world, such as viewpoint changes and differences in motion scale; at the same time, how to quickly and effectively capture the intrinsic relations within multi-modal action information and model them efficiently remains a challenging problem.
Disclosure of Invention
In view of this, the present invention provides a method and a system for recognizing actions based on multi-modal sequence fusion, which can improve the accuracy of action recognition, implement multi-modal data fusion, and effectively control the number of network parameters, so as to solve the above-mentioned technical problems.
In a first aspect, the invention provides a motion recognition method based on multi-modal sequence fusion, which comprises the following steps:
acquiring a human body action video to be identified, and performing action labeling on the action video to obtain a video frame, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of actions;
acquiring a spatial position corresponding to the human body action, detecting the spatial candidate box coordinates and action category division of each frame of action, performing temporal action detection by using the correlation between consecutive frames to locate the time period in which the action occurs, linking the detection result of each frame into the formed spatio-temporal action channel, and preprocessing the video frames to obtain a data set corresponding to the human body action;
performing feature extraction on the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior features and time domain features corresponding to the behavior features;
and constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action recognition of the human body.
As a further improvement of the above technical solution, inputting a plurality of modal information into the network model for feature fusion and classification to complete the motion recognition of the human body comprises the following steps:
taking the interactive information among the multi-modal data as the common features contained in the multi-modal data, and performing an outer product operation on the feature vectors corresponding to the multi-modal data by adopting a tensor fusion algorithm to obtain an information association tensor of the multi-modal feature elements;
when computing the outer product, a dimension with constant value 1 is appended to each feature vector so that the unimodal input features are retained in the network model, expressed as z = [a; 1] ⊗ [g; 1] ⊗ [v; 1], where a, g and v are the modal feature vectors and ⊗ denotes the outer product operation; the outer product of the three feature vectors a, g and v is (a ⊗ g ⊗ v)_{ijk} = a_i g_j v_k;
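For illustration only, a minimal sketch of this tensor-fusion step (PyTorch and the toy feature dimensions are assumptions, not part of the disclosure): a constant 1 is appended to each modality vector before the outer product, so the unimodal and bimodal terms survive inside the fused tensor.

```python
import torch

def tensor_fusion(a: torch.Tensor, g: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Outer-product fusion of three modality vectors of shape (batch, dim)."""
    append_one = lambda x: torch.cat([x, torch.ones(x.size(0), 1)], dim=1)
    a1, g1, v1 = append_one(a), append_one(g), append_one(v)
    # z[b, i, j, k] = a1[b, i] * g1[b, j] * v1[b, k]
    return torch.einsum('bi,bj,bk->bijk', a1, g1, v1)

# toy usage with assumed dimensions
z = tensor_fusion(torch.randn(2, 8), torch.randn(2, 6), torch.randn(2, 10))
print(z.shape)  # torch.Size([2, 9, 7, 11])
```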
The original tensor is compressed by Tucker decomposition: the weight tensor is expressed as the product of four orthogonal matrices and one core tensor, so the four-way weight tensor τ is decomposed as τ = ((τ_c ×_1 W_a) ×_2 W_g) ×_3 W_v ×_4 W_0. The decomposed core tensor τ_c models the interaction of the three modalities; its dimensions constrain the number of parameters and thereby control the complexity of the whole fusion, while preserving the mapping from all feature vectors to the fused feature. The trilinear model is then expressed as z = ((τ_c ×_1 ã) ×_2 g̃) ×_3 ṽ, where ã = W_a a, g̃ = W_g g and ṽ = W_v v respectively denote the projections of the feature vectors into their low-dimensional spaces; the larger t_a, t_g and t_v are, the more parameters need to be trained and the higher the complexity of the model, and finally W_0 controls the dimension of the fused feature vector.
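A non-authoritative sketch of the Tucker-style trilinear fusion described above (PyTorch is assumed; the low-rank dimensions t_a, t_g, t_v, t_0 are free hyperparameters chosen here for illustration): each modality vector is projected into a low-dimensional space and the projections are combined through a learnable core tensor whose last dimension t_0 fixes the size of the fused feature.

```python
import torch
import torch.nn as nn

class TuckerTrilinearFusion(nn.Module):
    """Fuse three modality vectors through a learnable core tensor."""
    def __init__(self, d_a, d_g, d_v, t_a, t_g, t_v, t_0):
        super().__init__()
        self.W_a = nn.Linear(d_a, t_a, bias=False)   # projection into a low-dim space
        self.W_g = nn.Linear(d_g, t_g, bias=False)
        self.W_v = nn.Linear(d_v, t_v, bias=False)
        self.core = nn.Parameter(torch.randn(t_a, t_g, t_v, t_0) * 0.01)

    def forward(self, a, g, v):
        # z[b, o] = sum_{i,j,k} core[i,j,k,o] * (W_a a)[b,i] * (W_g g)[b,j] * (W_v v)[b,k]
        return torch.einsum('ijko,bi,bj,bk->bo',
                            self.core, self.W_a(a), self.W_g(g), self.W_v(v))

fusion = TuckerTrilinearFusion(d_a=8, d_g=6, d_v=10, t_a=4, t_g=4, t_v=4, t_0=16)
out = fusion(torch.randn(2, 8), torch.randn(2, 6), torch.randn(2, 10))
print(out.shape)  # torch.Size([2, 16])
```

Larger t_a, t_g and t_v enlarge the core tensor and hence the number of trainable parameters, which mirrors the complexity trade-off stated above.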
As a further improvement of the above technical solution, performing an outer product operation on the feature vectors corresponding to a plurality of different modal data by adopting a tensor fusion algorithm comprises:
after linear mapping of the n-modality fusion feature tensor, an (n+1)-way weight tensor W is obtained; applying Tucker decomposition to it yields n+1 orthogonal mapping matrices and a core tensor, expressed as W = ((τ_c ×_1 W_1) ×_2 W_2 ···) ×_{n+1} W_0; letting z denote the fused feature and y = z^T W_0, a rank constraint is further introduced to carry out the decomposition, z = ⊙_{i=1}^{n} (Σ_{r=1}^{R} W_i^{(r)} x_i), where ⊙ denotes the Hadamard product of a series of tensors and W_i^{(r)} denotes the rank factorization factor of the i-th mode.
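A hedged sketch of the rank-constrained variant (PyTorch assumed; the rank and dimensions are illustrative, and the factorized form follows the generic low-rank fusion idea rather than the exact parameterization of the disclosure): the core tensor is replaced by R rank-1 factors per mode, so the fusion reduces to a Hadamard product of per-modality projections summed over the rank.

```python
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    """Rank-constrained fusion: the weight tensor is factorized into R rank-1 terms."""
    def __init__(self, dims, out_dim, rank):
        super().__init__()
        # one factor of shape (rank, d_m, out_dim) per modality
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d, out_dim) * 0.01) for d in dims])

    def forward(self, xs):
        # project each modality with its rank-1 factors, take the Hadamard
        # product across modalities, then sum over the rank dimension
        fused = None
        for x, f in zip(xs, self.factors):
            proj = torch.einsum('bd,rdo->bro', x, f)
            fused = proj if fused is None else fused * proj
        return fused.sum(dim=1)

lrf = LowRankFusion(dims=[8, 6, 10], out_dim=16, rank=4)
y = lrf([torch.randn(2, 8), torch.randn(2, 6), torch.randn(2, 10)])
print(y.shape)  # torch.Size([2, 16])
```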
As a further improvement of the above technical solution, constructing a network model according to the behavior characteristics and the time domain characteristics includes:
obtaining a multi-modal semantic feature representation f_i through multi-modal feature fusion, and then feeding it into a neural network layer to obtain the final cross-modal semantic feature representation h_i; presetting a series of multi-scale temporal candidate segments {(s_i^j, e_i^j)}_{j=1}^{K}, where (s_i^j, e_i^j) denotes the start and end time boundaries of the j-th candidate segment at time point i, W_j denotes the preset time width of the j-th segment, and K denotes the total number of candidate segments;
the confidence scores of the candidate segments are evaluated with a sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), where cs_i contains the score of each of the K candidate segments at time point i and represents the similarity between the video segment and the text description; the corresponding predicted temporal boundary offset is computed for each candidate segment as δ̂_i = Conv1d(h_i), where δ̂_i = (δ̂_i^s, δ̂_i^e) denotes the predicted start and end offsets at time point i, and finally the predicted segment j at time point i is expressed as (s_i^j + δ̂_i^s, e_i^j + δ̂_i^e).
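A minimal sketch of such a candidate-scoring head (PyTorch assumed; the kernel size and channel counts are illustrative): a 1-D convolution over the cross-modal feature sequence h produces, at every time point, one confidence score and one (start, end) offset pair per preset candidate width.

```python
import torch
import torch.nn as nn

class CandidateHead(nn.Module):
    """Per-time-point scoring of K multi-scale candidate segments."""
    def __init__(self, feat_dim, num_widths):
        super().__init__()
        # one confidence channel and two offset channels (start, end) per width
        self.score = nn.Conv1d(feat_dim, num_widths, kernel_size=3, padding=1)
        self.offset = nn.Conv1d(feat_dim, num_widths * 2, kernel_size=3, padding=1)

    def forward(self, h):                      # h: (batch, feat_dim, T)
        cs = torch.sigmoid(self.score(h))      # (batch, K, T) confidence scores
        off = self.offset(h)                   # (batch, 2K, T) boundary offsets
        B, _, T = off.shape
        return cs, off.view(B, -1, 2, T)       # offsets reshaped to (batch, K, 2, T)

head = CandidateHead(feat_dim=256, num_widths=4)
cs, off = head(torch.randn(2, 256, 50))
print(cs.shape, off.shape)  # torch.Size([2, 4, 50]) torch.Size([2, 4, 2, 50])
```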
As a further improvement of the above technical solution, the loss function consists of a loss for the matching score between the video segment and the text description and a loss for the temporal boundary offsets; for each candidate temporal segment (s_i^j, e_i^j), the temporal overlap IoU with the target segment (s, e) is computed: if it is smaller than a preset threshold λ, the IoU is set to 0; if it is larger than the threshold λ, the candidate segment is determined to be a positive sample, and otherwise a negative sample. The matching loss is expressed as L_match = -(1/N_pos) Σ_{pos} log(cs) - (1/N_neg) Σ_{neg} log(1 - cs), where N_pos denotes the number of positive candidate temporal segments and N_neg denotes the number of negative samples;
the offset of the temporal localization is adjusted by a boundary regression strategy: the IoU of each candidate segment with the target segment is computed and the set C_h of candidate temporal segments whose IoU is larger than a set threshold γ is selected; the temporal boundary offsets of these candidate temporal segments are computed as δ^s = s - s_c and δ^e = e - e_c, where (s, e) denotes the start and end time points of the given text description and (s_c, e_c) denotes the start and end time points of the corresponding candidate temporal video segments in the set C_h;
using δ = [δ^s, δ^e] to denote the true time-alignment offset and δ̂ = [δ̂^s, δ̂^e] to denote the predicted time-alignment offset, the temporal boundaries of the candidate segments are adaptively adjusted based on the true time-alignment offset with the regression loss L_reg = (1/N) Σ_{C_h} SL_1(δ - δ̂), where SL_1 denotes the smooth L_1 norm and N denotes the size of the set C_h.
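The two-part loss can be sketched as follows (PyTorch assumed; the binary cross-entropy form of the matching term is an inference from the positive/negative-sample description, and the thresholds are illustrative):

```python
import torch
import torch.nn.functional as F

def matching_and_boundary_loss(scores, ious, pred_offsets, true_offsets,
                               lam=0.5, gamma=0.7):
    """Matching loss over positive/negative candidates plus a smooth-L1
    boundary-regression loss over the high-IoU candidate set C_h."""
    pos, neg = ious > lam, ious <= lam
    bce = F.binary_cross_entropy(scores, pos.float(), reduction='none')
    l_match = ((bce[pos].mean() if pos.any() else 0.0)
               + (bce[neg].mean() if neg.any() else 0.0))
    keep = ious > gamma                        # candidates forming the set C_h
    if keep.any():
        l_reg = F.smooth_l1_loss(pred_offsets[keep], true_offsets[keep])
    else:
        l_reg = scores.new_zeros(())
    return l_match + l_reg

scores = torch.rand(8)        # predicted confidence per candidate segment
ious = torch.rand(8)          # temporal IoU of each candidate with the target
pred = torch.randn(8, 2)      # predicted (start, end) offsets
true = torch.randn(8, 2)      # ground-truth (start, end) offsets
print(matching_and_boundary_loss(scores, ious, pred, true))
```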
As a further improvement of the above technical solution, performing feature extraction on a data set by using a convolutional neural network and a long-short term memory network to obtain a behavior feature and a time domain feature corresponding to the behavior feature, including:
the raw inertial sensor data are treated as a picture of size time × channel; a long short-term memory network LSTM and a convolutional neural network CNN are adopted, a one-dimensional convolution operation captures the temporal signal structure within the convolution kernel window, and the key behavior features of the inertial sensor signal are obtained through the convolutional neural network; the one-dimensional convolution is computed as y_{i_0} = f(Σ_{d_0=1}^{D} Σ_{n=1}^{N} w_n^{d_0} x_{i_0+n-1}^{d_0} + b), where N denotes the length of the convolution kernel, D denotes the depth of the sensor data and of the convolution kernel, w_n^{d_0} denotes the n-th weight of the one-dimensional convolution kernel at depth d_0, x_{i_0}^{d_0} denotes the i_0-th element of the sensor signal at depth d_0, y_{i_0} denotes the i_0-th feature obtained by the convolution operation, and f(·) denotes the activation function;
the feature size obtained after pooling is l_out = (l_{i_0} - N + 2P)/S + 1, where l_{i_0} denotes the feature length of the current layer i_0, P denotes the padding size and S denotes the stride; low-dimensional high-level features along the time sequence are generated through three rounds of convolution and pooling, and the CNN-processed behavior features are input in temporal order into the long short-term memory network, which contains two unidirectionally connected LSTM layers with 128 hidden units each, converting the behavior features obtained in the previous part into 128-dimensional temporal features and dynamically modeling the temporal information.
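A sketch of this CNN-plus-LSTM extractor (PyTorch assumed; the channel counts and kernel sizes are illustrative, while the overall structure of three convolution-and-pooling stages followed by a two-layer unidirectional LSTM with 128 hidden units follows the description):

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Inertial-signal encoder: three Conv1d+MaxPool stages, then a 2-layer LSTM."""
    def __init__(self, in_channels, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        # two unidirectional LSTM layers with 128 hidden units each
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden, num_layers=2,
                            batch_first=True)

    def forward(self, x):                 # x: (batch, channels, time)
        feats = self.cnn(x)               # (batch, 128, time / 8)
        feats = feats.transpose(1, 2)     # (batch, time / 8, 128) for the LSTM
        out, _ = self.lstm(feats)
        return out                        # 128-dimensional temporal features

enc = CNNLSTM(in_channels=6)              # e.g. 3-axis accelerometer + gyroscope
print(enc(torch.randn(2, 6, 128)).shape)  # torch.Size([2, 16, 128])
```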
As a further improvement of the above technical solution, linking the detection result of each frame into the formed spatio-temporal action channel and preprocessing the video frames to obtain a data set corresponding to the human motion includes:
an undivided original video is denoted X = {x_n}_{n=1}^{w}, where x_n denotes the n-th frame image of video X and w denotes the number of frames in video X; all actions contained in this video are described by a set of instances Ψ_g = {ψ_{g,i} = (t_{s,i}, t_{e,i})}_{i=1}^{N_g}, where N_g denotes the number of real action instances in video X, t_{s,i} and t_{e,i} respectively denote the start and end points of action instance ψ_{g,i}, and ψ_g denotes an action instance;
the method comprises the steps of obtaining incoming videos and texts to construct serialization characteristics, generating multi-scale candidate time sequence video clips at each time point of the video sequence characteristics according to a preset time length, carrying out characteristic interaction and fusion on the candidate time sequence clips and the text sequence characteristics by adopting a time sequence cooperative attention interaction network, and obtaining multi-mode fusion data embedded in the same characteristic space to obtain a corresponding data set.
As a further improvement of the above technical solution, acquiring the spatial position corresponding to the human body motion, detecting the spatial candidate box coordinates and motion category division of each frame of motion, and performing temporal motion detection by using the correlation between consecutive frames comprises:
presetting an undivided video sequence V, first sampling the video signal at equal intervals according to a fixed frame rate to obtain an image frame sequence, and dividing the image frame sequence into M equal-length, non-overlapping video segment units {v_1, ..., v_j, ..., v_M}; a position coding function adds additional temporal position information to the video image, and the video unit features are expressed as f_j = W_u v_j + PE(j), where W_u ∈ R^{d×d_v}, and d and d_v denote the video coding feature dimension and the extracted video unit feature dimension;
a unit-level cooperative attention interaction layer is adopted to construct the interactive information between the video and the text; let the input video and text description features be V_in and S_in, and transform the d-dimensional feature vectors by linear mapping into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), with the corresponding expressions Q_f = V_in W_fq, K_f = V_in W_fk, V_f = V_in W_fv and Q_s = S_in W_sq, K_s = S_in W_sk, V_s = S_in W_sv, where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices; the Q_f features of the video modality serve as query vectors while the K_s and V_s features of the text modality serve as the key vector and the value vector respectively, the similarity weight matrix between them is computed to obtain the weighted video feature V_f' = softmax(Q_f K_s^T / √d) V_s, and the context information of the text and the video features is integrated into the feature vector of the current position to obtain the corresponding time-domain features.
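A hedged sketch of such a cooperative attention interaction layer (PyTorch assumed; the residual integration of the attended context is one plausible reading of "integrating the context information into the feature vector of the current position"):

```python
import math
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Video-to-text co-attention: video queries attend over text keys/values."""
    def __init__(self, d):
        super().__init__()
        self.W_fq = nn.Linear(d, d, bias=False)   # video -> query
        self.W_sk = nn.Linear(d, d, bias=False)   # text  -> key
        self.W_sv = nn.Linear(d, d, bias=False)   # text  -> value

    def forward(self, video, text):               # (B, M, d), (B, L, d)
        q, k, v = self.W_fq(video), self.W_sk(text), self.W_sv(text)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(q.size(-1)), dim=-1)
        context = attn @ v                         # text context for each video unit
        return video + context                     # fold the context into the video features

layer = CoAttentionLayer(d=256)
fused = layer(torch.randn(2, 20, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 20, 256])
```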
As a further improvement of the above technical solution, acquiring a human motion video to be recognized, and performing motion annotation on the motion video to obtain a video frame, includes:
the motion characteristics of the body part are acquired with an inertial sensor; each time an unannotated video is read in, the camera view is switched for labeling; after an action has been labeled, whether it is correct is checked through screenshots taken at the start and end of the action, and if a mislabeled place appears, it is fine-tuned or re-labeled.
In a second aspect, the present invention provides a motion recognition system based on multi-modal sequence fusion, comprising:
the first acquisition module is used for acquiring a human body action video to be recognized and performing action labeling on the action video to obtain video frames, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of the actions;
the second acquisition module is used for acquiring the spatial position corresponding to the human body action, detecting the spatial candidate box coordinates and action category division of each frame of action, performing temporal action detection by using the correlation between consecutive frames to locate the time period in which the action occurs, linking the detection result of each frame into the formed spatio-temporal action channel, and preprocessing the video frames to obtain a data set corresponding to the human body action;
the characteristic extraction module is used for extracting the characteristics of the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior characteristics and time domain characteristics corresponding to the behavior characteristics;
and the identification module is used for constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to complete the action identification of the human body.
The invention provides an action recognition method and system based on multi-modal sequence fusion. A human action video to be recognized is acquired and the action video is labeled to obtain video frames; the spatial position corresponding to the human action is acquired, and the spatial candidate box coordinates and action category of each frame of action are detected; the correlation between consecutive frames is used to perform temporal action detection and locate the time period in which the action occurs; the detection results of each frame are linked into the formed spatio-temporal action channel, and the video frames are preprocessed to obtain a data set corresponding to the human action; a convolutional neural network and a long short-term memory network are used to perform feature extraction on the data set to obtain behavior features and their corresponding temporal features; a network model is constructed from the behavior features and the temporal features, and a plurality of modal information is input into the network model for feature fusion and classification to complete the recognition of human actions. Recognizing human actions with multi-modal fusion can enhance the accuracy and robustness of the model in real scenes, effectively reduce the problem of semantic loss in the later stage of feature vector operations, and give full play to the complementary information of different modalities, thereby further improving the stability of operation and the accuracy of multi-modal recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a method for recognizing actions based on multi-modal sequence fusion according to the present invention;
fig. 2 is a block diagram of a multi-modal sequence fusion-based motion recognition system provided in the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Referring to fig. 1, the invention provides a method for recognizing actions based on multi-modal sequence fusion, comprising the following steps:
s1: acquiring a human body action video to be identified, and performing action labeling on the action video to obtain a video frame, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of an action;
s2: acquiring a spatial position corresponding to the human body action, detecting the spatial candidate box coordinates and action category division of each frame of action, performing temporal action detection by using the correlation between consecutive frames to locate the time period in which the action occurs, linking the detection result of each frame into the formed spatio-temporal action channel, and preprocessing the video frames to obtain a data set corresponding to the human body action;
s3: performing feature extraction on a data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior features and time domain features corresponding to the behavior features;
s4: and constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action recognition of the human body.
In this embodiment, inputting a plurality of modal information into the network model for feature fusion and classification to complete the motion recognition of the human body comprises the following steps: taking the interactive information among the multi-modal data as the common features contained in the multi-modal data, and performing an outer product operation on the feature vectors corresponding to the multi-modal data with a tensor fusion algorithm to obtain an information association tensor of the multi-modal feature elements; when computing the outer product, a dimension with constant value 1 is appended to each feature vector to retain the unimodal input features in the network model, expressed as z = [a; 1] ⊗ [g; 1] ⊗ [v; 1], where a, g and v are the modal feature vectors and ⊗ denotes the outer product operation, the outer product of the three feature vectors a, g and v being (a ⊗ g ⊗ v)_{ijk} = a_i g_j v_k. The original tensor is compressed by Tucker decomposition: the weight tensor is expressed as the product of four orthogonal matrices and one core tensor, so the four-way weight tensor τ is decomposed as τ = ((τ_c ×_1 W_a) ×_2 W_g) ×_3 W_v ×_4 W_0. The decomposed core tensor τ_c models the interaction of the three modalities; its dimensions constrain the number of parameters and thereby control the complexity of the whole fusion, while preserving the mapping from all feature vectors to the fused feature. The trilinear model is then expressed as z = ((τ_c ×_1 ã) ×_2 g̃) ×_3 ṽ, where ã = W_a a, g̃ = W_g g and ṽ = W_v v respectively denote the projections of the feature vectors into their low-dimensional spaces; the larger t_a, t_g and t_v are, the more parameters need to be trained and the higher the complexity of the model, and finally W_0 controls the dimension of the fused feature vector.
It should be noted that, acquiring a human motion video to be recognized, and performing motion annotation on the motion video to obtain a video frame includes: the motion characteristics of the body part are obtained by adopting an inertial sensor, the camera is switched to mark when the video which is not marked is read in each time, after a certain action is marked, whether the video is correct or not is judged through the screenshot when the action starts and stops, and if the place with the wrong mark appears, the camera is finely adjusted or marked again. The action recognition comprises offline action recognition and online action recognition, the offline action recognition needs to judge the human action types in the videos after the whole video sequence is observed, and the task allocates an action type label to each video according to a preset action list. On-line action recognition is oriented to the actual scene requirements and requires real-time processing of on-line video streams. The motion detection comprises time sequence motion detection and space-time motion detection, for an uncut video sequence, the time sequence motion detection task is to position the starting time point and the ending time point of a target motion and the corresponding motion type, and the space-time motion detection also needs to predict the spatial position of the motion on the basis. The time sequence action detection comprises the steps of generating candidate action segments with accurate time boundaries and dividing action categories for the time sequence candidate action segments, and the general flow of a space-time action detection task is as follows: the method comprises the steps of capturing the space position of human body motion, namely detecting the space candidate frame coordinates and motion category scores of each frame of motion, then adopting the correlation between continuous frames to carry out motion time domain detection, positioning the time period of motion occurrence, and finally connecting the detection results of each frame to form a motion space-time channel.
It should be understood that short-term action prediction predicts, at the beginning of action execution, the action class occurring in the video from the observed local video segment, while long-term action prediction predicts the action class likely to occur in the future by observing the human motion at the current moment in the video. For the temporal action detection task, the mean average precision mAP is usually used to measure the performance of an algorithm: the average precision AP of the detection results is computed for each action category and then averaged to obtain the mean average precision mAP. A prediction is considered a correct detection result only when two conditions are met: the temporal intersection-over-union IoU between a predicted temporal action segment and the corresponding ground-truth segment is greater than a set threshold, and the action class prediction is correct. The temporal IoU is the intersection-over-union of the predicted temporal action segment and the corresponding ground-truth segment, computed as IoU = ζ(T_p ∩ T_g) / ζ(T_p ∪ T_g), where T_p and T_g denote the temporal interval predicted by the algorithm and the ground-truth interval, and the ζ(·) function computes the length of an interval. By acquiring human actions, detecting actions and predicting actions, the accuracy of action recognition can be improved, and the convenience of human-computer interaction is also improved.
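A minimal sketch of the temporal IoU computation defined above (plain Python; the example segments are illustrative):

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# a predicted segment [2.0 s, 7.0 s] against a ground-truth segment [3.0 s, 8.0 s]
print(temporal_iou((2.0, 7.0), (3.0, 8.0)))  # 0.666...
```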
Optionally, performing an outer product operation on the feature vectors corresponding to the multiple different modality data with a tensor fusion algorithm includes:
after linear mapping of the n-modality fusion feature tensor, an (n+1)-way weight tensor W is obtained; applying Tucker decomposition to it yields n+1 orthogonal mapping matrices and a core tensor, expressed as W = ((τ_c ×_1 W_1) ×_2 W_2 ···) ×_{n+1} W_0; letting z denote the fused feature and y = z^T W_0, a rank constraint is further introduced to carry out the decomposition, z = ⊙_{i=1}^{n} (Σ_{r=1}^{R} W_i^{(r)} x_i), where ⊙ denotes the Hadamard product of a series of tensors and W_i^{(r)} denotes the rank factorization factor of the i-th modality.
In this embodiment, constructing a network model according to the behavior features and the time-domain features includes: obtaining a multi-modal semantic feature representation f_i through multi-modal feature fusion, and then feeding it into a neural network layer to obtain the final cross-modal semantic feature representation h_i; presetting a series of multi-scale temporal candidate segments {(s_i^j, e_i^j)}_{j=1}^{K}, where (s_i^j, e_i^j) denotes the start and end time boundaries of the j-th candidate segment at time point i, W_j denotes the preset time width of the j-th segment, and K denotes the total number of candidate segments; the confidence scores of the candidate segments are evaluated with a sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), where cs_i contains the score of each of the K candidate segments at time point i and represents the similarity between the video segment and the text description; the corresponding predicted temporal boundary offset is computed for each candidate segment as δ̂_i = Conv1d(h_i), where δ̂_i = (δ̂_i^s, δ̂_i^e) denotes the predicted start and end offsets at time point i, and finally the predicted segment j at time point i is expressed as (s_i^j + δ̂_i^s, e_i^j + δ̂_i^e).
It should be noted that the loss function consists of a loss for the matching score between the video segment and the text description and a loss for the temporal boundary offsets. For each candidate temporal segment (s_i^j, e_i^j), the temporal overlap IoU with the target segment (s, e) is computed: if it is smaller than a preset threshold λ, the IoU is set to 0; if it is larger than the threshold λ, the candidate segment is determined to be a positive sample, and otherwise a negative sample. The matching loss is expressed as L_match = -(1/N_pos) Σ_{pos} log(cs) - (1/N_neg) Σ_{neg} log(1 - cs), where N_pos denotes the number of positive candidate temporal segments and N_neg denotes the number of negative samples. The offset of the temporal localization is adjusted by a boundary regression strategy: the IoU of each candidate segment with the target segment is computed and the set C_h of candidate temporal segments whose IoU is larger than a set threshold γ is selected; the temporal boundary offsets of these candidate temporal segments are computed as δ^s = s - s_c and δ^e = e - e_c, where (s, e) denotes the start and end time points of the given text description and (s_c, e_c) denotes the start and end time points of the corresponding candidate temporal video segments in the set C_h. Using δ = [δ^s, δ^e] to denote the true time-alignment offset and δ̂ = [δ̂^s, δ̂^e] to denote the predicted time-alignment offset, the temporal boundaries of the candidate segments are adaptively adjusted based on the true time-alignment offset with the regression loss L_reg = (1/N) Σ_{C_h} SL_1(δ - δ̂), where SL_1 denotes the smooth L_1 norm and N denotes the size of the set C_h.
It will be appreciated that the Tucker decomposition is a multilinear form of principal component analysis: each tensor can be expressed, non-uniquely, as a core tensor, i.e. the principal component factors, multiplied by a factor matrix over every mode. The advantages of using the Tucker decomposition are: compared with CP decomposition, which requires estimating the rank to approximate the initial tensor, the Tucker decomposition yields a more accurate tensor decomposition result, and adjusting the dimensions of the core tensor achieves feature selection for each modality feature vector. To further reduce the computational complexity of the fusion model and balance the complexity and expressiveness of the interactive fusion modeling, a structured sparsity constraint is introduced based on the sparsity of the core tensor, the weight core tensor is decomposed into several factors, and the rank constraint is used as regularization during training to prevent overfitting, so that the mapping of the input data can be adjusted flexibly.
Optionally, performing feature extraction on the data set by using a convolutional neural network and a long-short term memory network to obtain a behavior feature and a time domain feature corresponding to the behavior feature, where the feature extraction includes:
The raw inertial sensor data are treated as a picture of size time × channel; a long short-term memory network (LSTM) and a convolutional neural network (CNN) are adopted, a one-dimensional convolution operation captures the temporal signal structure within the convolution kernel window, and the key behavior features of the inertial sensor signal are obtained through the convolutional neural network; the one-dimensional convolution is computed as y_{i_0} = f(Σ_{d_0=1}^{D} Σ_{n=1}^{N} w_n^{d_0} x_{i_0+n-1}^{d_0} + b), where N denotes the length of the convolution kernel, D denotes the depth of the sensor data and of the convolution kernel, w_n^{d_0} denotes the n-th weight of the one-dimensional convolution kernel at depth d_0, x_{i_0}^{d_0} denotes the i_0-th element of the sensor signal at depth d_0, y_{i_0} denotes the i_0-th feature obtained by the convolution operation, and f(·) denotes the activation function;
the feature size obtained after pooling is l_out = (l_{i_0} - N + 2P)/S + 1, where l_{i_0} denotes the feature length of the current layer i_0, P denotes the padding size and S denotes the stride; low-dimensional high-level features along the time sequence are generated through three rounds of convolution and pooling, and the CNN-processed behavior features are input in temporal order into the long short-term memory network, which contains two unidirectionally connected LSTM layers with the number of hidden units set to 128, converting the behavior features obtained in the previous part into 128-dimensional temporal features and dynamically modeling the temporal information.
In the embodiment, the original inertial sensor data of the convolutional neural network is taken as a picture, the image is subjected to convolution operation by utilizing a shared convolution kernel according to the inherent local mode in the sensor image, partial features are extracted from the image, time sequence information hidden in the image is not further processed, the continuity of human behavior is ignored, the original inertial sensor signal is directly input into the LSTM by the long-short term memory network, integration of the sensor data is lacked, the operation speed of the algorithm is slower, and although the problem of gradient disappearance of the recurrent neural network can be solved to a certain extent by adopting a gating mechanism in the long-short term memory network, longer time sequence information cannot be processed. Time domain information of context correlation between different signal frames is obtained through a double-layer LSTM, behavior information obtained from input CNN extraction features is selectively reserved through a gating mechanism, so that time sequence excitation is better performed on signal features of an inertial sensor, space-time features related to behavior recognition are obtained, and space-time behavior feature learning is achieved.
Optionally, linking the detection result of each frame into the formed spatio-temporal action channel and preprocessing the video frames to obtain a data set corresponding to the human body action includes:
an undivided original video is denoted X = {x_n}_{n=1}^{w}, where x_n denotes the n-th frame image of video X and w denotes the number of frames in video X; all actions contained in the video are described by a set of instances Ψ_g = {ψ_{g,i} = (t_{s,i}, t_{e,i})}_{i=1}^{N_g}, where N_g denotes the number of real action instances in video X, t_{s,i} and t_{e,i} respectively denote the start and end points of action instance ψ_{g,i}, and ψ_g denotes an action instance;
the method comprises the steps of obtaining incoming videos and texts to construct serialization characteristics, generating multi-scale candidate time sequence video clips at each time point of the video sequence characteristics according to a preset time length, carrying out characteristic interaction and fusion on the candidate time sequence clips and the text sequence characteristics by adopting a time sequence cooperative attention interaction network, and obtaining multi-mode fusion data embedded in the same characteristic space to obtain a corresponding data set.
In this embodiment, acquiring the spatial position corresponding to the human body action, detecting the spatial candidate box coordinates and action category division of each frame of action, and performing temporal action detection by using the correlation between consecutive frames includes: presetting an undivided video sequence V, first sampling the video signal at equal intervals according to a fixed frame rate to obtain an image frame sequence, and dividing the image frame sequence into M equal-length, non-overlapping video segment units {v_1, ..., v_j, ..., v_M}; a position coding function adds additional temporal position information to the video image, and the video unit features are expressed as f_j = W_u v_j + PE(j), where W_u ∈ R^{d×d_v}, and d and d_v denote the video coding feature dimension and the extracted video unit feature dimension. A unit-level cooperative attention interaction layer is adopted to construct the interactive information between the video and the text; let the input video and text description features be V_in and S_in, and transform the d-dimensional feature vectors by linear mapping into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), with the corresponding expressions Q_f = V_in W_fq, K_f = V_in W_fk, V_f = V_in W_fv and Q_s = S_in W_sq, K_s = S_in W_sk, V_s = S_in W_sv, where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices; the Q_f features of the video modality serve as query vectors while the K_s and V_s features of the text modality serve as the key vector and the value vector respectively, the similarity weight matrix between them is computed to obtain the weighted video feature V_f' = softmax(Q_f K_s^T / √d) V_s, and the context information of the text and the video features is integrated into the feature vector of the current position to obtain the corresponding time-domain features.
In this embodiment, a modality refers to a way of representing information, including the various sensory channels of an object, and multi-modality refers to a combination of two or more modalities. Multi-modal fusion is performed because different modalities treat a problem with different representations and from different angles; multi-modal data contain various information intersections and complement one another, so the multi-modal effect is better than that of a single modality. In the field of human behavior recognition, acceleration, angular velocity and RGB video image data are heterogeneous data with their own characteristics: an inertial sensor can only obtain the motion characteristics of a body part and cannot accurately recognize fine motions such as hand-motion details, while RGB video is affected by occluding objects and illumination, and when the human body is occluded, recognition can only rely on the inertial sensor. Common deep-learning multi-modal feature-layer fusion methods are concatenation (cascade) fusion and addition fusion; concatenation fusion splices multiple modality feature vectors together, which increases the dimensionality of the overall feature vector.
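For contrast with the tensor-based fusion above, the feature-layer baselines mentioned here can be sketched as follows (PyTorch assumed; dimensions are illustrative): concatenation fusion splices the modality vectors and therefore grows the fused dimension, while addition fusion keeps the dimension but first projects every modality to a common size.

```python
import torch
import torch.nn as nn

a, g, v = torch.randn(2, 8), torch.randn(2, 6), torch.randn(2, 10)

# concatenation (cascade) fusion: dimensionality grows to 8 + 6 + 10 = 24
concat_fused = torch.cat([a, g, v], dim=1)

# addition fusion: project every modality to a shared 16-dim space, then sum
proj_a, proj_g, proj_v = nn.Linear(8, 16), nn.Linear(6, 16), nn.Linear(10, 16)
add_fused = proj_a(a) + proj_g(g) + proj_v(v)

print(concat_fused.shape, add_fused.shape)  # torch.Size([2, 24]) torch.Size([2, 16])
```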
Referring to fig. 2, the present invention provides a motion recognition system based on multi-modal sequence fusion, comprising:
the first acquisition module is used for acquiring a human body action video to be recognized and performing action labeling on the action video to obtain video frames, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of the actions;
the second acquisition module is used for acquiring the spatial position corresponding to the human body action, detecting the spatial candidate box coordinates and action category division of each frame of action, performing temporal action detection by using the correlation between consecutive frames to locate the time period in which the action occurs, linking the detection result of each frame into the formed spatio-temporal action channel, and preprocessing the video frames to obtain a data set corresponding to the human body action;
the characteristic extraction module is used for extracting the characteristics of the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior characteristics and time domain characteristics corresponding to the behavior characteristics;
and the identification module is used for constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action identification of the human body.
In this embodiment, human behavior recognition means classifying and recognizing the collected user motion information and data by certain means and methods, so as to determine the user activity state or detect user behavior. Behavior data are time series with spatio-temporal and temporal structure, and if the temporal features are not fully utilized, this is a great loss for behavior recognition models. The system acquires a human action video to be recognized and labels the action video to obtain video frames; acquires the spatial position corresponding to the human action and detects the spatial candidate box coordinates and action category of each frame of action; uses the correlation between consecutive frames to perform temporal action detection and locate the time period in which the action occurs; links the detection results of each frame into the formed spatio-temporal action channel and preprocesses the video frames to obtain a data set corresponding to the human action; performs feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavior features and their corresponding temporal features; and constructs a network model according to the behavior features and the temporal features, inputting a plurality of modal information into the network model for feature fusion and classification to complete the human action recognition. Recognizing and fusing human actions across multiple modalities enhances the accuracy and robustness of the model in real scenes, effectively reduces the problem of semantic loss in the later stage of feature vector operations, and gives full play to the complementary information of different modalities, thereby improving the stability of operation and the recognition accuracy.
In all examples shown and described herein, any particular value should be construed as exemplary only and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above examples are merely illustrative of several embodiments of the present invention, and the description thereof is more specific and detailed, but not to be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.
Claims (10)
1. A motion recognition method based on multi-modal sequence fusion is characterized by comprising the following steps:
acquiring a human body action video to be identified, and performing action labeling on the action video to obtain a video frame, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of actions;
acquiring a spatial position corresponding to the human body action, detecting the spatial candidate box coordinates and action category division of each frame of action, performing temporal action detection by using the correlation between consecutive frames to locate the time period in which the action occurs, linking the detection result of each frame into the formed spatio-temporal action channel, and preprocessing the video frames to obtain a data set corresponding to the human body action;
performing feature extraction on the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior features and time domain features corresponding to the behavior features;
and constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action recognition of the human body.
2. The method according to claim 1, wherein inputting a plurality of modal information into the network model for feature fusion and classification to complete the motion recognition of the human body comprises the following steps:
taking the interactive information among the multi-modal data as the common features contained in the multi-modal data, and performing an outer product operation on the feature vectors corresponding to the multi-modal data by adopting a tensor fusion algorithm to obtain an information association tensor of the multi-modal feature elements;
when computing the outer product, a dimension with constant value 1 is appended to each feature vector so that the unimodal input features are retained in the network model, expressed as z = [a; 1] ⊗ [g; 1] ⊗ [v; 1], where a, g and v are the modal feature vectors and ⊗ denotes the outer product operation; the outer product of the three feature vectors a, g and v is (a ⊗ g ⊗ v)_{ijk} = a_i g_j v_k;
the original tensor is compressed by Tucker decomposition, the weight tensor being expressed as the product of four orthogonal matrices and one core tensor, so that the four-way weight tensor τ is decomposed as τ = ((τ_c ×_1 W_a) ×_2 W_g) ×_3 W_v ×_4 W_0; the decomposed core tensor τ_c models the interaction of the three modalities, its dimensions constrain the number of parameters and thereby control the complexity of the whole fusion while preserving the mapping from all feature vectors to the fused feature, and the trilinear model is then expressed as z = ((τ_c ×_1 ã) ×_2 g̃) ×_3 ṽ, where ã = W_a a, g̃ = W_g g and ṽ = W_v v respectively denote the projections of the feature vectors into their low-dimensional spaces; the larger t_a, t_g and t_v are, the more parameters need to be trained and the higher the complexity of the model, and finally W_0 controls the dimension of the fused feature vector.
3. The method for recognizing motion based on multi-modal sequence fusion according to claim 2, wherein performing an outer product operation on the feature vectors corresponding to a plurality of different modal data by using a tensor fusion algorithm comprises:
after linear mapping of the n-modality fusion feature tensor, an (n+1)-way weight tensor W is obtained; applying Tucker decomposition to it yields n+1 orthogonal mapping matrices and a core tensor, expressed as W = ((τ_c ×_1 W_1) ×_2 W_2 ···) ×_{n+1} W_0; letting z denote the fused feature and y = z^T W_0, a rank constraint is further introduced to carry out the decomposition, z = ⊙_{i=1}^{n} (Σ_{r=1}^{R} W_i^{(r)} x_i), where ⊙ denotes the Hadamard product of a series of tensors and W_i^{(r)} denotes the rank factorization factor of the i-th mode.
4. The method for motion recognition based on multi-modal sequence fusion of claim 1, wherein constructing a network model from the behavior features and the time domain features comprises:
obtaining a multi-modal semantic feature representation f_i through multi-modal feature fusion, and then feeding it into a neural network layer to obtain the final cross-modal semantic feature representation h_i; presetting a series of multi-scale temporal candidate segments {(s_i^j, e_i^j)}_{j=1}^{K}, where (s_i^j, e_i^j) denotes the start and end time boundaries of the j-th candidate segment at time point i, W_j denotes the preset time width of the j-th segment, and K denotes the total number of candidate segments;
the confidence scores of the candidate segments are evaluated with a sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), where cs_i contains the score of each of the K candidate segments at time point i and represents the similarity between the video segment and the text description; the corresponding predicted temporal boundary offset is computed for each candidate segment as δ̂_i = Conv1d(h_i), where δ̂_i = (δ̂_i^s, δ̂_i^e) denotes the predicted start and end offsets at time point i, and finally the predicted segment j at time point i is expressed as (s_i^j + δ̂_i^s, e_i^j + δ̂_i^e).
5. The method of claim 4, wherein the loss function consists of a loss for the matching score between the video segment and the text description and a loss for the temporal boundary offsets; for each candidate temporal segment (s_i^j, e_i^j), the temporal overlap IoU with the target segment (s, e) is computed: if it is smaller than a preset threshold λ, the IoU is set to 0; if it is larger than the threshold λ, the candidate segment is determined to be a positive sample, and otherwise a negative sample; the matching loss is expressed as L_match = -(1/N_pos) Σ_{pos} log(cs) - (1/N_neg) Σ_{neg} log(1 - cs), where N_pos denotes the number of positive candidate temporal segments and N_neg denotes the number of negative samples;
the offset of the temporal localization is adjusted by adopting a boundary regression strategy: the IoU of each candidate segment with the target segment is computed and the set C_h of candidate temporal segments whose IoU is larger than a set threshold γ is selected; the temporal boundary offsets of these candidate temporal segments are computed as δ^s = s - s_c and δ^e = e - e_c, where (s, e) denotes the start and end time points of the given text description and (s_c, e_c) denotes the start and end time points of the corresponding candidate temporal video segments in the set C_h;
using δ = [δ^s, δ^e] to denote the true time-alignment offset and δ̂ = [δ̂^s, δ̂^e] to denote the predicted time-alignment offset, the temporal boundaries of the candidate segments are adaptively adjusted based on the true time-alignment offset with the regression loss L_reg = (1/N) Σ_{C_h} SL_1(δ - δ̂), where SL_1 denotes the smooth L_1 norm and N denotes the size of the set C_h.
6. The method for recognizing the action based on the multi-modal sequence fusion as claimed in claim 1, wherein the step of extracting the features of the data set by using a convolutional neural network and a long-short term memory network to obtain the behavior features and the time domain features corresponding to the behavior features comprises the following steps:
the raw inertial sensor data are treated as a picture of size time × channel; a long short-term memory network (LSTM) and a convolutional neural network (CNN) are adopted, a one-dimensional convolution operation captures the temporal signal structure within the convolution kernel window, and the key behavior features of the inertial sensor signal are obtained through the convolutional neural network; the one-dimensional convolution is computed as y_{i_0} = f(Σ_{d_0=1}^{D} Σ_{n=1}^{N} w_n^{d_0} x_{i_0+n-1}^{d_0} + b), where N denotes the length of the convolution kernel, D denotes the depth of the sensor data and of the convolution kernel, w_n^{d_0} denotes the n-th weight of the one-dimensional convolution kernel at depth d_0, x_{i_0}^{d_0} denotes the i_0-th element of the sensor signal at depth d_0, y_{i_0} denotes the i_0-th feature obtained by the convolution operation, and f(·) denotes the activation function;
the feature size obtained after pooling is l_out = (l_{i_0} - N + 2P)/S + 1, where l_{i_0} denotes the feature length of the current layer i_0, P denotes the padding size and S denotes the stride; low-dimensional high-level features along the time sequence are generated through three rounds of convolution and pooling, and the CNN-processed behavior features are input in temporal order into the long short-term memory network, which contains two unidirectionally connected LSTM layers with the number of hidden units set to 128, converting the behavior features obtained in the previous part into 128-dimensional temporal features and dynamically modeling the temporal information.
7. The action recognition method based on multi-modal sequence fusion as claimed in claim 1, wherein the step of connecting the detection result of each frame into the spatio-temporal channel of the formed action and preprocessing the video frames to obtain the data set corresponding to the human action comprises:
representing an undivided original video as X = {x_n}_{n=1}^{w}, where x_n denotes the nth frame image of video X and w denotes the number of frames in video X; all action instances contained in the video are represented by the instance set {(t_{s,i}, t_{e,i}, g_i)}_{i=1}^{N_g}, where N_g denotes the number of real action instances in video X, t_{s,i} and t_{e,i} respectively denote the start and end time points of action instance i, and g_i denotes the corresponding action instance;
the method comprises the steps of obtaining incoming videos and texts to construct serialization characteristics, generating multi-scale candidate time sequence video clips at each time point of the video sequence characteristics according to a preset time length, carrying out characteristic interaction and fusion on the candidate time sequence clips and the text sequence characteristics by adopting a time sequence cooperative attention interaction network, and obtaining multi-mode fusion data embedded in the same characteristic space to obtain a corresponding data set.
8. The action recognition method based on multi-modal sequence fusion as claimed in claim 1, wherein the step of acquiring the spatial position corresponding to the human action, detecting the spatial candidate box coordinates and action category of each frame, and performing temporal detection of the action by using the correlation between consecutive frames comprises:
presetting an undivided video sequence V, first sampling the video signal at equal intervals at a fixed frame rate to obtain an image frame sequence, and dividing the image frame sequence into M equal-length, non-overlapping video segment units {v_1, ..., v_j, ..., v_M}; a position encoding function is used to add additional temporal position information to the video images, and the video unit features with position information are obtained by linearly mapping the extracted d_v-dimensional unit features into the d-dimensional encoding space and adding the positional encoding, where d and d_v denote the video encoding feature dimension and the extracted video unit feature dimension, respectively;
constructing interactive information between the video and the text by using unit co-attention interaction layers; the input video features and text description features are preset as V_in and S_in, and the d-dimensional feature vectors are transformed by linear mappings into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), namely Q_f = V_in W_fq, K_f = V_in W_fk, V_f = V_in W_fv, Q_s = S_in W_sq, K_s = S_in W_sk and V_s = S_in W_sv, where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices; taking Q_f of the video modality as the query vector and K_s and V_s of the text modality as the key vector and the value vector respectively, the similarity weight matrix between the features is calculated to obtain the weighted video feature Ṽ_f = softmax(Q_f K_s^T / √d) V_s, and the context information of the text and video features is integrated into the feature vector at the current position to obtain the corresponding time-domain features.
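A rough, non-claimed sketch of the unit co-attention interaction of claim 8 is given below; treating the similarity weighting as scaled dot-product attention, as well as the module and parameter names, are assumptions.

```python
# Hypothetical video-to-text co-attention layer (claim 8 sketch).
import math
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.d = d
        # learnable mappings W_fq, W_sk, W_sv (the symmetric text-to-video half is analogous)
        self.w_fq = nn.Linear(d, d, bias=False)
        self.w_sk = nn.Linear(d, d, bias=False)
        self.w_sv = nn.Linear(d, d, bias=False)

    def forward(self, v_in, s_in):                   # v_in: (B, M, d)   s_in: (B, L, d)
        q_f = self.w_fq(v_in)                        # video queries
        k_s, v_s = self.w_sk(s_in), self.w_sv(s_in)  # text keys and values
        attn = torch.softmax(q_f @ k_s.transpose(1, 2) / math.sqrt(self.d), dim=-1)
        return attn @ v_s                            # text-conditioned video features

video = torch.randn(2, 32, 256)   # 32 video units
text = torch.randn(2, 12, 256)    # 12 word tokens
print(CoAttention(256)(video, text).shape)   # torch.Size([2, 32, 256])
```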
9. The action recognition method based on multi-modal sequence fusion as claimed in claim 1, wherein the step of acquiring a human action video to be recognized and performing action labelling on the action video to obtain video frames comprises:
acquiring the motion characteristics of the body parts by using inertial sensors; the camera view is switched for labelling each time an unlabelled video is read in; after a certain action is labelled, whether the label is correct is judged through screenshots at the start and end of the action, and the label is fine-tuned or re-annotated if an incorrectly labelled segment is found.
10. An action recognition system based on multi-modal sequence fusion, which applies the action recognition method based on multi-modal sequence fusion according to any one of claims 1 to 8, the system comprising:
the device comprises a first acquisition module, a second acquisition module and a display module, wherein the first acquisition module is used for acquiring a human body action video to be identified and carrying out action marking on the action video to obtain a video frame, and the action marking comprises a semantic segmentation and a timeline segmentation label of an action;
the second acquisition module is used for acquiring a spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame action, positioning the action occurring time period by adopting correlation progress action time domain detection between continuous frames, connecting the detection result of each frame with a space-time channel with formed action, and preprocessing the video frame to obtain a data set corresponding to the human body action;
the characteristic extraction module is used for extracting the characteristics of the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior characteristics and time domain characteristics corresponding to the behavior characteristics;
a recognition module, configured to construct a network model according to the behavior features and the time-domain features, and input a plurality of kinds of modal information into the network model for feature fusion and classification, so as to complete the action recognition of the human body.
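As an illustrative, non-claimed sketch of how the four modules of the system could be wired together (all class and method names below are assumptions, since the claims do not prescribe an interface):

```python
# Hypothetical wiring of the four system modules of claim 10 (interfaces assumed).
class ActionRecognitionSystem:
    def __init__(self, acquisition, detection, extractor, recognizer):
        self.acquisition = acquisition   # first acquisition module: video -> labelled frames
        self.detection = detection       # second acquisition module: frames -> data set
        self.extractor = extractor       # feature extraction module: CNN + LSTM features
        self.recognizer = recognizer     # recognition module: multi-modal fusion + classification

    def run(self, raw_video):
        frames = self.acquisition(raw_video)
        dataset = self.detection(frames)
        behavior_feats, temporal_feats = self.extractor(dataset)
        return self.recognizer(behavior_feats, temporal_feats)
```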
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211568552.3A CN115937975A (en) | 2022-12-08 | 2022-12-08 | Action recognition method and system based on multi-modal sequence fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115937975A true CN115937975A (en) | 2023-04-07 |
Family
ID=86550146
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211568552.3A Pending CN115937975A (en) | 2022-12-08 | 2022-12-08 | Action recognition method and system based on multi-modal sequence fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115937975A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116503957A (en) * | 2023-06-26 | 2023-07-28 | 成都千嘉科技股份有限公司 | Gas household operation behavior identification method |
CN116503957B (en) * | 2023-06-26 | 2023-09-15 | 成都千嘉科技股份有限公司 | Gas household operation behavior identification method |
CN117409538A (en) * | 2023-12-13 | 2024-01-16 | 吉林大学 | Wireless fall-prevention alarm system and method for nursing |
CN117953543A (en) * | 2024-03-26 | 2024-04-30 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Character interaction detection method based on multiple texts, terminal and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||