CN115937975A - Action recognition method and system based on multi-modal sequence fusion - Google Patents

Action recognition method and system based on multi-modal sequence fusion

Info

Publication number
CN115937975A
CN115937975A (application CN202211568552.3A)
Authority
CN
China
Prior art keywords: action, video, time, modal, fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211568552.3A
Other languages
Chinese (zh)
Inventor
曾国坤
刘予川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211568552.3A priority Critical patent/CN115937975A/en
Publication of CN115937975A publication Critical patent/CN115937975A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method based on multi-modal sequence fusion. The method comprises: acquiring a human body action video to be recognized and performing action annotation on the action video to obtain video frames; acquiring the spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame of action, performing action temporal detection by using the correlation between consecutive frames, and locating the time period in which the action occurs; preprocessing the video frames to obtain a data set; performing feature extraction on the data set by using a convolutional neural network and a long short-term memory network to obtain behavior features and the time domain features corresponding to the behavior features; and constructing a network model according to the behavior features and the time domain features, and inputting a plurality of modal information into the network model for feature fusion and classification so as to complete the action recognition of the human body. By recognizing human body actions with multi-modal fusion, the accuracy and robustness of the model in real scenes are enhanced, and the stability of the work and the recognition accuracy are improved.

Description

Action recognition method and system based on multi-modal sequence fusion
Technical Field
The invention belongs to the technical field of motion recognition, and particularly relates to a motion recognition method and system based on multi-modal sequence fusion.
Background
Human behavior recognition has always been a hot technology in the field of human-computer interaction. It has broad application prospects and can bring good economic benefits, and many practical scenarios are closely related to it, such as recognizing dangerous activities in video behavior monitoring systems and perceiving human behavior in automatic navigation systems to facilitate safe operation. Although action recognition has been widely applied to various aspects of society, the daily behaviors of human beings are complex and diversified: small changes in motion may produce completely different behaviors and vary with the environment, and many problems in this field still need to be solved in the real world, such as viewpoint changes and differences in action scale. At the same time, how to quickly and effectively obtain the internal relations within multi-modal action information and model them efficiently is a challenging problem.
Disclosure of Invention
In view of this, the present invention provides a method and a system for recognizing actions based on multi-modal sequence fusion, which can improve the accuracy of action recognition, implement multi-modal data fusion, and effectively control the number of network parameters, so as to solve the above-mentioned technical problems.
In a first aspect, the invention provides a motion recognition method based on multi-modal sequence fusion, which comprises the following steps:
acquiring a human body action video to be identified, and performing action labeling on the action video to obtain a video frame, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of actions;
acquiring a spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame of action, performing action temporal detection by using the correlation between consecutive frames to locate the time period in which the action occurs, connecting the detection result of each frame into a spatio-temporal channel of the formed action, and preprocessing the video frames to obtain a data set corresponding to the human body action;
performing feature extraction on the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior features and time domain features corresponding to the behavior features;
and constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action recognition of the human body.
As a further improvement of the above technical solution, inputting a plurality of modal information into the network model for feature fusion and classification so as to complete the action recognition of the human body comprises the following steps:
taking the interactive information among the multi-modal data as common features contained in the multi-modal data, and performing an outer product operation on the feature vectors corresponding to the multi-modal data by adopting a tensor fusion algorithm to obtain an information association tensor of the multi-modal feature elements;
when calculating the outer product, appending one dimension with value 1 to each feature vector so as to retain the single-modal input features in the network model, the fused tensor being expressed as Z = [a; 1] ⊗ [g; 1] ⊗ [v; 1], wherein the modal features satisfy a ∈ R^{t_a}, g ∈ R^{t_g}, v ∈ R^{t_v}, and ⊗ represents the outer product operation over the three feature vectors a, g and v;
compressing the original tensor by Tucker decomposition, so that the weight tensor is expressed as the product of four orthogonal matrices and one core tensor and the four-dimensional weight tensor τ is decomposed as τ = τ_c ×_1 W_a ×_2 W_g ×_3 W_v ×_4 W_o, wherein the decomposed core tensor τ_c encodes the interaction of the three modalities and its dimensions constrain the number of parameters, thereby controlling the complexity of the whole fusion manner while storing the mapping from all the feature vectors to the fused features; the trilinear model is then expressed as y = τ_c ×_1 (W_a a) ×_2 (W_g g) ×_3 (W_v v) ×_4 W_o, wherein W_a, W_g and W_v respectively represent the projection of the corresponding feature vectors into their low-dimensional spaces; the larger t_a, t_g and t_v are, the more parameters need to be trained and the higher the complexity of the model, and finally the dimension of the fused feature vector is controlled by W_o.
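As an illustration of the augmented outer-product fusion described above, the following minimal sketch builds the three-way association tensor; it is written in Python/NumPy purely for explanation, and the variable names and feature dimensions are assumptions rather than part of the original disclosure.

```python
import numpy as np

# Sketch of the augmented outer-product (tensor) fusion described above.
# Feature dimensions are illustrative assumptions.
t_a, t_g, t_v = 8, 16, 32            # per-modality feature dimensions
a = np.random.randn(t_a)             # e.g. acceleration features
g = np.random.randn(t_g)             # e.g. angular-velocity features
v = np.random.randn(t_v)             # e.g. RGB video features

# Append a constant 1 to each vector so that single-modal and bi-modal
# interaction terms are retained inside the fused tensor.
a1 = np.concatenate([a, [1.0]])
g1 = np.concatenate([g, [1.0]])
v1 = np.concatenate([v, [1.0]])

# Three-way outer product: Z[i, j, k] = a1[i] * g1[j] * v1[k]
Z = np.einsum('i,j,k->ijk', a1, g1, v1)
print(Z.shape)                       # (t_a + 1, t_g + 1, t_v + 1)
```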
As a further improvement of the above technical solution, performing the outer product operation on the feature vectors corresponding to a plurality of different modal data by using a tensor fusion algorithm comprises:
after the n-modality fused feature tensor is subjected to a linear mapping, obtaining an (n + 1)-dimensional weight tensor, and performing Tucker decomposition on it to obtain n + 1 orthogonal mapping matrices and a core tensor, i.e. W = W_c ×_1 W_1 ×_2 W_2 ⋯ ×_n W_n ×_{n+1} W_o;
letting z denote the intermediate fused representation so that y = z^T W_o, a rank constraint is then introduced for further decomposition, in which the core tensor is factorized as the Hadamard product of a series of tensors and the i-th factor is the rank factorization of the i-th modality.
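The rank-constrained decomposition above can be read as a low-rank fusion in which the large weight tensor is never materialized; the sketch below shows this under assumed dimensions and rank, and is an interpretation rather than the exact formulation of the patent.

```python
import numpy as np

# Rank-constrained fusion sketch: each modality is projected R times and the
# rank-one contributions are combined by element-wise (Hadamard) products,
# then summed over the rank index. Dimensions and rank are assumptions.
rng = np.random.default_rng(0)
dims = {'a': 9, 'g': 17, 'v': 33}    # augmented feature dimensions
d_out, R = 64, 4                     # fused dimension and rank

feats = {m: rng.standard_normal(d) for m, d in dims.items()}
factors = {m: 0.01 * rng.standard_normal((R, d, d_out)) for m, d in dims.items()}

fused = np.zeros(d_out)
for r in range(R):
    contrib = np.ones(d_out)
    for m in dims:                   # Hadamard product across modalities
        contrib *= feats[m] @ factors[m][r]
    fused += contrib                 # sum over the rank-one terms
print(fused.shape)                   # (64,)
```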
As a further improvement of the above technical solution, constructing a network model according to the behavior features and the time domain features comprises:
obtaining a multi-modal semantic feature representation through multi-modal feature fusion, and then sending it to a neural network layer to obtain the final cross-modal semantic feature representation h_i; presetting a series of multi-scale temporal candidate segments {(t^s_{i,j}, t^e_{i,j})}, j = 1, ..., n_w, wherein t^s_{i,j} and t^e_{i,j} denote the start and end time boundaries of the j-th candidate segment at time point i, W_j represents the preset time width of the j-th segment, and n_w represents the total number of candidate segments;
the confidence scores of the candidate segments are evaluated by the sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), wherein cs_i ∈ R^{n_w} contains the score of each of the n_w candidate segments at time point i and represents the similarity between the video segment and the text description; for each candidate segment the corresponding predicted temporal boundary offset (δ̂^s_i, δ̂^e_i) is calculated, wherein δ̂^s_i and δ̂^e_i represent the predicted start and end offsets at time i, and finally the predicted segment j at time i is represented as (t^s_{i,j} + δ̂^s_i, t^e_{i,j} + δ̂^e_i).
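To make the candidate-segment scoring concrete, the following hedged sketch generates multi-scale candidates at every time point and predicts their confidence scores and boundary offsets with 1-D convolution heads; the widths, feature dimension and module layout are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

T, d = 100, 256                       # number of time points, feature dimension
widths = [8, 16, 32, 64]              # preset candidate widths W_j (assumed)
h = torch.randn(1, d, T)              # cross-modal features h_i: (batch, d, T)

# Candidate (start, end) boundaries for every time point i and width W_j.
candidates = [[(max(0, i - w // 2), min(T, i + w // 2)) for w in widths]
              for i in range(T)]

score_head = nn.Conv1d(d, len(widths), kernel_size=1)       # produces cs_i
offset_head = nn.Conv1d(d, 2 * len(widths), kernel_size=1)  # produces offsets

cs = torch.sigmoid(score_head(h))     # (1, n_w, T): confidence per candidate
offsets = offset_head(h)              # (1, 2*n_w, T): start/end boundary offsets
print(cs.shape, offsets.shape)
```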
As a further improvement of the above technical solution, the loss function is composed of a loss function for calculating the matching score between the video segment and the text description and a loss function for calculating the temporal boundary offset; the temporal overlap IoU between each candidate temporal segment (t^s_{i,j}, t^e_{i,j}) and the target segment (s, e) is calculated, and if it is smaller than a preset threshold λ the IoU is set to 0, if it is larger than the threshold λ the candidate segment is determined to be a positive sample, and otherwise it is a negative sample; the matching loss function is a cross-entropy over the confidence scores of the positive and negative candidate segments, normalized respectively by N_pos, the number of positive samples among the candidate temporal segments, and N_neg, the number of negative samples;
the offset of the temporal localization is adjusted by a boundary regression strategy: the IoU between the candidate segments and the target segment is calculated, and the candidate temporal segment set C_h whose IoU is larger than a set threshold γ is selected; the temporal boundary offsets of these candidate temporal segments are computed with respect to (s, e), the start and end time points of the segment described by the given text, and (ŝ, ê), the start and end time points of the corresponding candidate temporal video segments in C_h;
using δ = [δ_s, δ_e] to denote the true temporal offset and δ̂ = [δ̂_s, δ̂_e] to denote the predicted temporal offset, the temporal boundaries of the candidate segments are adaptively adjusted by a regression loss based on the smooth L_1 norm SL_1 averaged over the set, wherein N represents the size of the set C_h.
As a further improvement of the above technical solution, performing feature extraction on the data set by using a convolutional neural network and a long short-term memory network to obtain behavior features and the time domain features corresponding to the behavior features comprises:
treating the raw inertial sensor data as a picture, namely time × channel, adopting a long short-term memory network LSTM and a convolutional neural network CNN, and using a one-dimensional convolution operation to capture the temporal signal structure within the convolution kernel window, so that the key behavior features of the inertial sensor signal are obtained through the convolutional neural network; the one-dimensional convolution is computed as y_{i_0} = f( Σ_{d_0=1}^{D} Σ_{n=1}^{N} w^{d_0}_n · x^{d_0}_{i_0+n-1} ), wherein N represents the length of the convolution kernel, D represents the depth of the sensor data and of the convolution kernel, w^{d_0}_n represents the n-th weight of the one-dimensional convolution kernel at depth d_0, x^{d_0}_{i_0+n-1} represents the corresponding element of the sensor signal at depth d_0, y_{i_0} denotes the i_0-th feature obtained by the convolution operation, and f(·) represents the activation function;
the feature size obtained after pooling is l_{out} = (l_{i_0} + 2P − N)/S + 1, wherein l_{i_0} represents the feature length of the current layer, P represents the padding size and S represents the stride; low-dimensional high-level features on the time sequence are generated through three rounds of convolution and pooling, and the behavior features processed by the CNN are input into the long short-term memory network in time order, wherein the long short-term memory network comprises two LSTM layers, each LSTM layer is connected unidirectionally with 128 hidden units, so that the behavior features obtained in the previous part are converted into 128-dimensional time-series features and the temporal information is dynamically modeled.
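A hedged sketch of this CNN-plus-LSTM extractor is given below; only the three convolution/pooling stages and the two-layer, unidirectional, 128-unit LSTM follow the text, while the channel counts and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class CnnLstmExtractor(nn.Module):
    """Three 1-D conv + pooling stages followed by a two-layer LSTM (128 units)."""
    def __init__(self, in_channels=6, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=hidden,
                            num_layers=2, batch_first=True)   # unidirectional

    def forward(self, x):              # x: (batch, channels, time)
        feats = self.cnn(x)            # (batch, 128, time/8)
        feats = feats.transpose(1, 2)  # (batch, time/8, 128) for the LSTM
        out, _ = self.lstm(feats)      # 128-dimensional time-series features
        return out

x = torch.randn(4, 6, 256)             # e.g. 6 inertial channels, 256 samples
print(CnnLstmExtractor()(x).shape)      # torch.Size([4, 32, 128])
```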
As a further improvement of the above technical solution, connecting the detection result of each frame into a spatio-temporal channel of the formed action, so that the video frames are preprocessed to obtain the data set corresponding to the human body action, comprises:
representing an undivided original video as X = {x_n}_{n=1}^{w}, wherein x_n represents the n-th frame image of video X and w represents the number of frames in video X; all action instances contained in this video can be represented by the instance set Ψ_g = {ψ_i = (t_{s,i}, t_{e,i})}_{i=1}^{N_g}, wherein N_g represents the number of real action instances in video X, and t_{s,i} and t_{e,i} respectively represent the start and end points of the action instance ψ_i;
constructing serialized features from the incoming video and text, generating multi-scale candidate temporal video segments at each time point of the video sequence features according to preset time lengths, and performing feature interaction and fusion between the candidate temporal segments and the text sequence features by adopting a temporal co-attention interaction network, so as to obtain multi-modal fused data embedded in the same feature space and thereby obtain the corresponding data set.
As a further improvement of the above technical solution, acquiring the spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame of action, and performing action temporal detection by using the correlation between consecutive frames comprises:
presetting an undivided video sequence V, first sampling the video signal at equal intervals at a fixed frame rate to obtain an image frame sequence, dividing the image frame sequence into M equal-length, non-overlapping video segment units {v_1, ..., v_j, ..., v_M}, and using a position coding function to add additional temporal position information to the video image, so that the video unit features are encoded into a feature sequence whose dimensions d and d_v represent the video coding feature dimension and the extracted video unit feature dimension respectively;
constructing the interaction information between the video and the text by adopting a unit co-attention interaction layer: presetting the input video and text description features as V_in and S_in, and transforming the d-dimensional feature vectors into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s) by linear mapping, i.e. Q_f = V_in W_fq, K_f = V_in W_fk, V_f = V_in W_fv and Q_s = S_in W_sq, K_s = S_in W_sk, V_s = S_in W_sv, wherein W_sk, W_fk, W_sq, W_fq, W_sv and W_fv represent learnable weight matrices; taking the features Q_f of the video modality as the query vector and K_s and V_s from the text modality as the key vector and value vector respectively, the similarity weight matrix between them is calculated to obtain the weighted video features, and the context information of the text and the video features is integrated into the feature vector of the current position to obtain the corresponding time domain features.
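The co-attention step can be illustrated with the short sketch below, in which the video features provide the queries and the text features provide the keys and values; the dimension d, the sequence lengths and the scaled-dot-product form are assumptions made for the example.

```python
import torch
import torch.nn as nn

d, M, L = 256, 20, 12                  # feature dim, video units, text tokens
V_in = torch.randn(1, M, d)            # video unit features
S_in = torch.randn(1, L, d)            # text description features

W_fq, W_sk, W_sv = (nn.Linear(d, d, bias=False) for _ in range(3))

Q_f = W_fq(V_in)                       # queries from the video modality
K_s, V_s = W_sk(S_in), W_sv(S_in)      # keys and values from the text modality

attn = torch.softmax(Q_f @ K_s.transpose(1, 2) / d ** 0.5, dim=-1)  # (1, M, L)
V_weighted = attn @ V_s                # text context aggregated per video unit
print(V_weighted.shape)                # torch.Size([1, 20, 256])
```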
As a further improvement of the above technical solution, acquiring the human body action video to be recognized and performing action annotation on the action video to obtain video frames comprises:
obtaining the motion characteristics of the body part by using an inertial sensor; each time an unannotated video is read in, switching the camera view for annotation; after a certain action is annotated, checking through the screenshots at the start and end of the action whether the annotation is correct, and fine-tuning or re-annotating it if a wrong annotation is found.
In a second aspect, the present invention provides a motion recognition system based on multi-modal sequence fusion, comprising:
the device comprises a first acquisition module, a second acquisition module and a display module, wherein the first acquisition module is used for acquiring a human body action video to be identified and performing action marking on the action video to obtain a video frame, and the action marking comprises semantic segmentation and timeline segmentation labels of actions;
the second acquisition module is used for acquiring a spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame action, positioning the action occurring time period by adopting correlation progress action time domain detection between continuous frames, connecting the detection result of each frame with a space-time channel with formed action, and preprocessing the video frame to obtain a data set corresponding to the human body action;
the characteristic extraction module is used for extracting the characteristics of the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior characteristics and time domain characteristics corresponding to the behavior characteristics;
and the identification module is used for constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to complete the action identification of the human body.
The invention provides an action recognition method and system based on multi-modal sequence fusion. A human body action video to be recognized is acquired and action annotation is performed on the action video to obtain video frames; the spatial position corresponding to the human body action is acquired, the spatial candidate frame coordinates and action category division of each frame of action are detected, action temporal detection is performed by using the correlation between consecutive frames to locate the time period in which the action occurs, the detection result of each frame is connected into a spatio-temporal channel of the formed action, and the video frames are preprocessed to obtain a data set corresponding to the human body action; feature extraction is performed on the data set by using a convolutional neural network and a long short-term memory network to obtain behavior features and the time domain features corresponding to the behavior features; a network model is constructed according to the behavior features and the time domain features, and a plurality of modal information is input into the network model for feature fusion and classification so as to complete the action recognition of the human body. By recognizing human body actions with multi-modal fusion, the accuracy and robustness of the model in real scenes can be enhanced, the problem of semantic loss in the later-stage feature vector operation process can be effectively reduced, and the complementary information of different modalities can be fully exploited, thereby improving the stability of the work and the recognition accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
FIG. 1 is a flow chart of a method for recognizing actions based on multi-modal sequence fusion according to the present invention;
fig. 2 is a block diagram of a multi-modal sequence fusion-based motion recognition system provided in the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
Referring to fig. 1, the invention provides a method for recognizing actions based on multi-modal sequence fusion, comprising the following steps:
s1: acquiring a human body action video to be identified, and performing action labeling on the action video to obtain a video frame, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of an action;
s2: acquiring a spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame of action, performing action temporal detection by using the correlation between consecutive frames to locate the time period in which the action occurs, connecting the detection result of each frame into a spatio-temporal channel of the formed action, and preprocessing the video frames to obtain a data set corresponding to the human body action;
s3: performing feature extraction on a data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior features and time domain features corresponding to the behavior features;
s4: and constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action recognition of the human body.
In this embodiment, inputting a plurality of modal information into the network model for feature fusion and classification so as to complete the action recognition of the human body comprises the following steps: taking the interactive information among the multi-modal data as common features contained in the multi-modal data, and performing an outer product operation on the feature vectors corresponding to the multi-modal data by adopting a tensor fusion algorithm to obtain an information association tensor of the multi-modal feature elements; when calculating the outer product, one dimension with value 1 is appended to each feature vector to retain the single-modal input features in the network model, the fused tensor being expressed as Z = [a; 1] ⊗ [g; 1] ⊗ [v; 1], wherein the modal features satisfy a ∈ R^{t_a}, g ∈ R^{t_g}, v ∈ R^{t_v}, and ⊗ represents the outer product operation over the three feature vectors a, g and v; the original tensor is compressed by Tucker decomposition, the weight tensor being expressed as the product of four orthogonal matrices and one core tensor, so that the four-dimensional weight tensor τ is decomposed as τ = τ_c ×_1 W_a ×_2 W_g ×_3 W_v ×_4 W_o, wherein the decomposed core tensor τ_c encodes the interaction of the three modalities and its dimensions constrain the number of parameters, thereby controlling the complexity of the whole fusion manner while storing the mapping from all the feature vectors to the fused features; the trilinear model is then expressed as y = τ_c ×_1 (W_a a) ×_2 (W_g g) ×_3 (W_v v) ×_4 W_o, wherein W_a, W_g and W_v respectively represent the projection of the corresponding feature vectors into their low-dimensional spaces; the larger t_a, t_g and t_v are, the more parameters need to be trained and the higher the complexity of the model, and finally the dimension of the fused feature vector is controlled by W_o.
It should be noted that acquiring the human body action video to be recognized and performing action annotation on the action video to obtain video frames comprises: obtaining the motion characteristics of the body part by using an inertial sensor; each time an unannotated video is read in, switching the camera view for annotation; after a certain action is annotated, checking through the screenshots at the start and end of the action whether the annotation is correct, and fine-tuning or re-annotating it if a wrong annotation is found. Action recognition includes offline action recognition and online action recognition. Offline action recognition judges the human action categories in a video after observing the whole video sequence, and this task assigns an action category label to each video according to a preset action list; online action recognition is oriented to actual scene requirements and requires real-time processing of online video streams. Action detection includes temporal action detection and spatio-temporal action detection: for an untrimmed video sequence, the temporal action detection task is to locate the start and end time points of the target action and the corresponding action category, while spatio-temporal action detection additionally needs to predict the spatial position of the action. Temporal action detection consists of generating candidate action segments with accurate temporal boundaries and classifying the action categories of these temporal candidate segments, and the general flow of the spatio-temporal action detection task is as follows: capture the spatial position of the human body action, that is, detect the spatial candidate frame coordinates and action category scores of each frame of action; then perform action temporal detection using the correlation between consecutive frames to locate the time period in which the action occurs; and finally connect the detection results of each frame to form a spatio-temporal action channel.
It should be understood that short-term action prediction predicts, at the beginning of the action execution, the action class occurring in the video from the partially observed local video segment, while long-term action prediction predicts the action classes likely to occur in the future by observing the human action at the current time in the video. For the temporal action detection task, the mean average precision mAP is usually used to measure the performance of the algorithm: the average precision AP of the detection results is calculated for each action category and then averaged to obtain the mean average precision mAP. A prediction is considered a correct detection result only when both of the following conditions are satisfied: the temporal intersection-over-union IoU between the predicted temporal action segment and the corresponding ground-truth segment is greater than the set threshold, and the action category prediction is correct. The temporal IoU is the intersection-over-union of the predicted temporal action segment and the corresponding ground-truth segment, calculated as IoU = ζ(T_p ∩ T_g) / ζ(T_p ∪ T_g), wherein T_p and T_g denote the temporal interval predicted by the algorithm and the ground-truth interval, and the ζ(·) function computes the length of an interval. By acquiring human body actions, detecting the actions and predicting the actions, the accuracy of action recognition can be improved, and the convenience of human-computer interaction is also improved.
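The evaluation criterion just described can be sketched as follows; the mAP computation is a simplified illustration under assumed data structures, not the exact benchmark protocol.

```python
import numpy as np

def is_correct_detection(pred_seg, pred_cls, gt_seg, gt_cls, thr=0.5):
    """Correct only if the temporal IoU exceeds the threshold AND the class matches."""
    inter = max(0.0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]))
    union = (pred_seg[1] - pred_seg[0]) + (gt_seg[1] - gt_seg[0]) - inter
    iou = inter / union if union > 0 else 0.0
    return iou > thr and pred_cls == gt_cls

def average_precision(hits, scores):
    """AP for one class from per-prediction hit flags, sorted by confidence."""
    order = np.argsort(scores)[::-1]
    hits = np.asarray(hits, dtype=float)[order]
    precision = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision * hits).sum() / max(hits.sum(), 1.0))

# mAP is then the mean of the per-class AP values.
```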
Optionally, performing the outer product operation on the feature vectors corresponding to a plurality of different modal data by using a tensor fusion algorithm comprises:
after the n-modality fused feature tensor is subjected to a linear mapping, obtaining an (n + 1)-dimensional weight tensor, and performing Tucker decomposition on it to obtain n + 1 orthogonal mapping matrices and a core tensor, i.e. W = W_c ×_1 W_1 ×_2 W_2 ⋯ ×_n W_n ×_{n+1} W_o; letting z denote the intermediate fused representation so that y = z^T W_o, a rank constraint is then introduced for further decomposition, in which the core tensor is factorized as the Hadamard product of a series of tensors and the i-th factor is the rank factorization of the i-th modality.
In this embodiment, constructing a network model according to the behavior features and the time domain features comprises: obtaining a multi-modal semantic feature representation through multi-modal feature fusion, and then sending it to a neural network layer to obtain the final cross-modal semantic feature representation h_i; presetting a series of multi-scale temporal candidate segments {(t^s_{i,j}, t^e_{i,j})}, j = 1, ..., n_w, wherein t^s_{i,j} and t^e_{i,j} denote the start and end time boundaries of the j-th candidate segment at time point i, W_j represents the preset time width of the j-th segment, and n_w represents the total number of candidate segments; the confidence scores of the candidate segments are evaluated by the sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), wherein cs_i ∈ R^{n_w} contains the score of each of the n_w candidate segments at time point i and represents the similarity between the video segment and the text description; for each candidate segment the corresponding predicted temporal boundary offset (δ̂^s_i, δ̂^e_i) is calculated, wherein δ̂^s_i and δ̂^e_i represent the predicted start and end offsets at time i, and finally the predicted segment j at time i is represented as (t^s_{i,j} + δ̂^s_i, t^e_{i,j} + δ̂^e_i).
It should be noted that the loss function is composed of a loss function for calculating the matching score between the video segment and the text description and a loss function for calculating the temporal boundary offset: the temporal overlap IoU between each candidate temporal segment (t^s_{i,j}, t^e_{i,j}) and the target segment (s, e) is calculated, and if it is smaller than the preset threshold λ the IoU is set to 0, if it is larger than the threshold λ the candidate segment is determined to be a positive sample, and otherwise it is a negative sample; the matching loss function is a cross-entropy over the confidence scores of the positive and negative candidate segments, normalized respectively by N_pos, the number of positive samples among the candidate temporal segments, and N_neg, the number of negative samples; the offset of the temporal localization is adjusted by a boundary regression strategy: the IoU between the candidate segments and the target segment is calculated, the candidate temporal segment set C_h whose IoU is larger than the set threshold γ is selected, and the temporal boundary offsets of these candidate temporal segments are computed with respect to (s, e), the start and end time points of the segment described by the given text, and (ŝ, ê), the start and end time points of the corresponding candidate temporal video segments in C_h; using δ = [δ_s, δ_e] to denote the true temporal offset and δ̂ = [δ̂_s, δ̂_e] to denote the predicted temporal offset, the temporal boundaries of the candidate segments are adaptively adjusted by a regression loss based on the smooth L_1 norm SL_1 averaged over the set, wherein N represents the size of the set C_h.
It will be appreciated that Tucker decomposition is a multi-linear form of principal component analysis: each tensor can be expressed, non-uniquely, as a core tensor multiplied by a factor matrix over every order, the core tensor playing the role of the principal component factors. The advantages of using Tucker decomposition are that, compared with CP decomposition, which approximates the initial tensor and requires the rank to be estimated, Tucker decomposition gives a more accurate tensor decomposition result, and adjusting the dimensions of the core tensor achieves feature selection for each modal feature vector. In order to further reduce the computational complexity of the fusion model and balance the complexity and expressiveness of the interactive fusion modeling, a structured sparsity constraint is introduced according to the sparsity of the core tensor, the weight core tensor is decomposed into several factors, and the rank constraint is used as regularization during training to prevent overfitting, so that the mapping of the input data can be adjusted flexibly.
Optionally, performing feature extraction on the data set by using a convolutional neural network and a long short-term memory network to obtain behavior features and the time domain features corresponding to the behavior features comprises:
treating the raw inertial sensor data as a picture, namely time × channel, adopting a long short-term memory network (LSTM) and a convolutional neural network (CNN), and using a one-dimensional convolution operation to capture the temporal signal structure within the convolution kernel window, so that the key behavior features of the inertial sensor signal are obtained through the convolutional neural network; the one-dimensional convolution is computed as y_{i_0} = f( Σ_{d_0=1}^{D} Σ_{n=1}^{N} w^{d_0}_n · x^{d_0}_{i_0+n-1} ), wherein N represents the length of the convolution kernel, D represents the depth of the sensor data and of the convolution kernel, w^{d_0}_n represents the n-th weight of the one-dimensional convolution kernel at depth d_0, x^{d_0}_{i_0+n-1} represents the corresponding element of the sensor signal at depth d_0, y_{i_0} denotes the i_0-th feature obtained by the convolution operation, and f(·) represents the activation function;
the feature size obtained after pooling is l_{out} = (l_{i_0} + 2P − N)/S + 1, wherein l_{i_0} represents the feature length of the current layer, P represents the padding size and S represents the stride; low-dimensional high-level features on the time sequence are generated through three rounds of convolution and pooling, and the behavior features processed by the CNN are input into the long short-term memory network in time order, wherein the long short-term memory network comprises two LSTM layers, each LSTM layer is connected unidirectionally with 128 hidden units, so that the behavior features obtained in the previous part are converted into 128-dimensional time-series features and the temporal information is dynamically modeled.
In this embodiment, when the convolutional neural network takes the raw inertial sensor data as a picture, it convolves the image with shared convolution kernels according to the local patterns inherent in the sensor image and extracts partial features, but it does not further process the temporal information hidden in the image and ignores the continuity of human behavior; a long short-term memory network that feeds the raw inertial sensor signal directly into the LSTM lacks integration of the sensor data and runs more slowly, and although the gating mechanism of the long short-term memory network can alleviate the vanishing-gradient problem of recurrent neural networks to a certain extent, it cannot handle very long temporal information. Therefore, context-dependent time domain information between different signal frames is obtained through a two-layer LSTM, and the behavior information extracted by the CNN from the input is selectively retained through the gating mechanism, so that the inertial sensor signal features are better excited along the time sequence, the spatio-temporal features related to behavior recognition are obtained, and spatio-temporal behavior feature learning is achieved.
Optionally, connecting the detection result of each frame into a spatio-temporal channel of the formed action, so that the video frames are preprocessed to obtain the data set corresponding to the human body action, comprises:
representing an undivided original video as X = {x_n}_{n=1}^{w}, wherein x_n represents the n-th frame image of video X and w represents the number of frames in video X; all action instances contained in the video can be represented by the instance set Ψ_g = {ψ_i = (t_{s,i}, t_{e,i})}_{i=1}^{N_g}, wherein N_g represents the number of real action instances in video X, and t_{s,i} and t_{e,i} respectively represent the start and end points of the action instance ψ_i;
constructing serialized features from the incoming video and text, generating multi-scale candidate temporal video segments at each time point of the video sequence features according to preset time lengths, and performing feature interaction and fusion between the candidate temporal segments and the text sequence features by adopting a temporal co-attention interaction network, so as to obtain multi-modal fused data embedded in the same feature space and thereby obtain the corresponding data set.
In this embodiment, acquiring the spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame of action, and performing action temporal detection by using the correlation between consecutive frames comprises: presetting an undivided video sequence V, first sampling the video signal at equal intervals at a fixed frame rate to obtain an image frame sequence, dividing the image frame sequence into M equal-length, non-overlapping video segment units {v_1, ..., v_j, ..., v_M}, and using a position coding function to add additional temporal position information to the video image, so that the video unit features are encoded into a feature sequence whose dimensions d and d_v represent the video coding feature dimension and the extracted video unit feature dimension respectively; the interaction information between the video and the text is constructed by adopting a unit co-attention interaction layer: the input video and text description features are preset as V_in and S_in, and the d-dimensional feature vectors are transformed by linear mapping into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), i.e. Q_f = V_in W_fq, K_f = V_in W_fk, V_f = V_in W_fv and Q_s = S_in W_sq, K_s = S_in W_sk, V_s = S_in W_sv, wherein W_sk, W_fk, W_sq, W_fq, W_sv and W_fv represent learnable weight matrices; taking the features Q_f of the video modality as the query vector and K_s and V_s from the text modality as the key vector and value vector respectively, the similarity weight matrix between them is calculated to obtain the weighted video features, and the context information of the text and the video features is integrated into the feature vector of the current position to obtain the corresponding time domain features.
In this embodiment, a modality refers to a particular way of representing information, covering the various sensory channels of an object, and multi-modality refers to the combination of two or more modalities. The reason for performing multi-modal fusion is that different modalities view the same problem with different expressions and from different angles; multi-modal data contains various information intersections and complementarities, so the multi-modal effect is better than that of a single modality. In the field of human behavior recognition, acceleration, angular velocity and RGB video image data are heterogeneous data with their own characteristics: an inertial sensor can only obtain the motion characteristics of a body part and cannot accurately recognize fine motions such as hand movement details, while RGB video is affected by occluding objects and illumination, and when the human body is occluded, recognition can only rely on the inertial sensor. Common deep-learning multi-modal feature-layer fusion methods are cascade (concatenation) fusion and addition fusion, in which several modal feature vectors are spliced together, which increases the dimensionality of the overall feature vector.
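To contrast the feature-layer fusion methods just mentioned, the short sketch below shows concatenation (cascade) fusion versus addition fusion; the feature dimensions are assumptions for illustration only.

```python
import torch

f_imu = torch.randn(4, 128)    # inertial-sensor (CNN + LSTM) features
f_rgb = torch.randn(4, 128)    # RGB video features

concat_fused = torch.cat([f_imu, f_rgb], dim=-1)  # cascade fusion: dim grows to 256
add_fused = f_imu + f_rgb                         # addition fusion: dim stays 128
print(concat_fused.shape, add_fused.shape)
```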
Referring to fig. 2, the present invention provides a motion recognition system based on multi-modal sequence fusion, comprising:
the device comprises a first acquisition module, a second acquisition module and a display module, wherein the first acquisition module is used for acquiring a human body action video to be identified and carrying out action marking on the action video to obtain a video frame, and the action marking comprises a semantic segmentation and a timeline segmentation label of an action;
the second acquisition module is used for acquiring a spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame action, positioning the action occurring time period by adopting correlation progress action time domain detection between continuous frames, connecting the detection result of each frame with a space-time channel with formed action, and preprocessing the video frame to obtain a data set corresponding to the human body action;
the characteristic extraction module is used for extracting the characteristics of the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior characteristics and time domain characteristics corresponding to the behavior characteristics;
and the identification module is used for constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action identification of the human body.
In this embodiment, human behavior recognition is to classify and recognize the collected user motion information and data by certain means and methods, so as to determine the user activity state or detect the user behavior. Behavior data are time-series data with spatio-temporal and temporal structure, and if the temporal features are not fully utilized, this is a great loss for behavior recognition models. A human body action video to be recognized is acquired and the action video is annotated to obtain video frames; the spatial position corresponding to the human body action is acquired, the spatial candidate frame coordinates and action category division of each frame of action are detected, action temporal detection is performed by using the correlation between consecutive frames to locate the time period in which the action occurs, the detection result of each frame is connected into a spatio-temporal channel of the formed action, and the video frames are preprocessed to obtain a data set corresponding to the human body action; feature extraction is performed on the data set by using a convolutional neural network and a long short-term memory network to obtain behavior features and the time domain features corresponding to the behavior features; a network model is constructed according to the behavior features and the time domain features, and a plurality of modal information is input into the network model for feature fusion and classification to complete the action recognition of the human body. By recognizing human body actions with multi-modal fusion, the accuracy and robustness of the model in real scenes are enhanced, the problem of semantic loss in the later-stage feature vector operation process is effectively reduced, and the complementary information of different modalities is fully exploited, thereby improving the stability of the work and the recognition accuracy.
In all examples shown and described herein, any particular value should be construed as exemplary only and not as a limitation, and thus other examples of example embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The above examples are merely illustrative of several embodiments of the present invention, and the description thereof is more specific and detailed, but not to be construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention.

Claims (10)

1. A motion recognition method based on multi-modal sequence fusion is characterized by comprising the following steps:
acquiring a human body action video to be identified, and performing action labeling on the action video to obtain a video frame, wherein the action labeling comprises semantic segmentation and timeline segmentation labels of actions;
acquiring a spatial position corresponding to the human body action, detecting the spatial candidate frame coordinates and action category division of each frame of action, performing action temporal detection by using the correlation between consecutive frames to locate the time period in which the action occurs, connecting the detection result of each frame into a spatio-temporal channel of the formed action, and preprocessing the video frames to obtain a data set corresponding to the human body action;
performing feature extraction on the data set by adopting a convolutional neural network and a long-short term memory network to obtain behavior features and time domain features corresponding to the behavior features;
and constructing a network model according to the behavior characteristics and the time domain characteristics, inputting a plurality of modal information into the network model for characteristic fusion and classification so as to finish the action recognition of the human body.
2. The method according to claim 1, wherein inputting a plurality of modal information into the network model for feature fusion and classification so as to complete the action recognition of the human body comprises the following steps:
taking the interactive information among the multi-modal data as common features contained in the multi-modal data, and performing an outer product operation on the feature vectors corresponding to the multi-modal data by adopting a tensor fusion algorithm to obtain an information association tensor of the multi-modal feature elements;
when calculating the outer product, appending one dimension with value 1 to each feature vector so as to retain the single-modal input features in the network model, the fused tensor being expressed as Z = [a; 1] ⊗ [g; 1] ⊗ [v; 1], wherein the modal features satisfy a ∈ R^{t_a}, g ∈ R^{t_g}, v ∈ R^{t_v}, and ⊗ represents the outer product operation over the three feature vectors a, g and v;
compressing the original tensor by Tucker decomposition, so that the weight tensor is expressed as the product of four orthogonal matrices and one core tensor and the four-dimensional weight tensor τ is decomposed as τ = τ_c ×_1 W_a ×_2 W_g ×_3 W_v ×_4 W_o, wherein the decomposed core tensor τ_c encodes the interaction of the three modalities and its dimensions constrain the number of parameters, thereby controlling the complexity of the whole fusion manner while storing the mapping from all the feature vectors to the fused features; the trilinear model is then expressed as y = τ_c ×_1 (W_a a) ×_2 (W_g g) ×_3 (W_v v) ×_4 W_o, wherein W_a, W_g and W_v respectively represent the projection of the corresponding feature vectors into their low-dimensional spaces; the larger t_a, t_g and t_v are, the more parameters need to be trained and the higher the complexity of the model, and finally the dimension of the fused feature vector is controlled by W_o.
3. The action recognition method based on multi-modal sequence fusion according to claim 2, wherein performing the outer product operation on the feature vectors corresponding to a plurality of different modal data by using a tensor fusion algorithm comprises:
after the n-modality fused feature tensor is subjected to a linear mapping, obtaining an (n + 1)-dimensional weight tensor, and performing Tucker decomposition on it to obtain n + 1 orthogonal mapping matrices and a core tensor, i.e. W = W_c ×_1 W_1 ×_2 W_2 ⋯ ×_n W_n ×_{n+1} W_o;
letting z denote the intermediate fused representation so that y = z^T W_o, and then further introducing a rank constraint to carry out the decomposition, in which the core tensor is factorized as the Hadamard product of a series of tensors and the i-th factor is the rank factorization of the i-th modality.
4. The action recognition method based on multi-modal sequence fusion according to claim 1, wherein constructing a network model according to the behavior features and the time domain features comprises:
obtaining a multi-modal semantic feature representation through multi-modal feature fusion, and then sending it to a neural network layer to obtain the final cross-modal semantic feature representation h_i; presetting a series of multi-scale temporal candidate segments {(t^s_{i,j}, t^e_{i,j})}, j = 1, ..., n_w, wherein t^s_{i,j} and t^e_{i,j} denote the start and end time boundaries of the j-th candidate segment at time point i, W_j represents the preset time width of the j-th segment, and n_w represents the total number of candidate segments;
the confidence scores of the candidate segments are evaluated by the sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), wherein cs_i ∈ R^{n_w} contains the score of each of the n_w candidate segments at time point i and represents the similarity between the video segment and the text description; for each candidate segment the corresponding predicted temporal boundary offset (δ̂^s_i, δ̂^e_i) is calculated, wherein δ̂^s_i and δ̂^e_i represent the predicted start and end offsets at time i, and finally the predicted segment j at time i is represented as (t^s_{i,j} + δ̂^s_i, t^e_{i,j} + δ̂^e_i).
5. The action recognition method based on multi-modal sequence fusion according to claim 4, wherein the loss function is composed of a loss function for calculating the matching score between the video segment and the text description and a loss function for calculating the temporal boundary offset; the temporal overlap IoU between each candidate temporal segment (t^s_{i,j}, t^e_{i,j}) and the target segment (s, e) is calculated, and if it is smaller than a preset threshold λ the IoU is set to 0, if it is larger than the threshold λ the candidate segment is determined to be a positive sample, and otherwise it is a negative sample; the matching loss function is a cross-entropy over the confidence scores of the positive and negative candidate segments, normalized respectively by N_pos, the number of positive samples among the candidate temporal segments, and N_neg, the number of negative samples;
the offset of the temporal localization is adjusted by a boundary regression strategy: the IoU between the candidate segments and the target segment is calculated, and the candidate temporal segment set C_h whose IoU is larger than a set threshold γ is selected; the temporal boundary offsets of these candidate temporal segments are computed with respect to (s, e), the start and end time points of the segment described by the given text, and (ŝ, ê), the start and end time points of the corresponding candidate temporal video segments in C_h;
using δ = [δ_s, δ_e] to denote the true temporal offset and δ̂ = [δ̂_s, δ̂_e] to denote the predicted temporal offset, the temporal boundaries of the candidate segments are adaptively adjusted by a regression loss based on the smooth L_1 norm SL_1 averaged over the set, wherein N represents the size of the set C_h.
6. The action recognition method based on multi-modal sequence fusion as claimed in claim 1, wherein extracting features of the data set by using a convolutional neural network and a long short-term memory network to obtain the behavior features and the corresponding time domain features comprises:
the original inertial sensor data are treated as an image, namely time × channel; a long short-term memory network (LSTM) and a convolutional neural network (CNN) are adopted, a one-dimensional convolution operation is used to capture the temporal signal structure within the convolution kernel window, and the key behavior features of the inertial sensor signal are obtained through the convolutional neural network, wherein the one-dimensional convolution is calculated by the expression
Figure FDA0003987109550000041
where N denotes the length of the convolution kernel, D denotes the depth of the sensor data and of the convolution kernel,
Figure FDA0003987109550000042
denotes the n-th weight of the one-dimensional convolution kernel at depth d_0,
Figure FDA0003987109550000043
denotes the i_0-th element of the sensor signal at depth d_0,
Figure FDA0003987109550000044
denotes the i_0-th feature obtained from the sensor signal by the convolution operation, and f(·) denotes the activation function;
the size of the features obtained by pooling is given by the expression
Figure FDA0003987109550000045
where
Figure FDA0003987109550000046
denotes the feature length of the current layer i_0, P denotes the padding size, and S denotes the stride; low-dimensional high-level features along the time sequence are generated through three rounds of convolution and pooling, and the behavior features processed by the CNN are fed into the long short-term memory network in temporal order; the long short-term memory network comprises two unidirectionally connected LSTM layers with 128 hidden units each, which convert the preceding behavior features into 128-dimensional time-series features and dynamically model the temporal information.
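A minimal PyTorch-style sketch of the described extractor is shown below, with three 1-D convolution + max-pooling stages followed by a two-layer unidirectional LSTM with 128 hidden units. The channel counts, kernel size, pooling size and the sensor channel count are illustrative assumptions, not values given by the patent.

import torch
import torch.nn as nn

class SensorCNNLSTM(nn.Module):
    """1-D CNN over (time × channel) inertial data, followed by a 2-layer LSTM with 128 hidden units."""
    def __init__(self, in_channels: int = 6):   # e.g. 3-axis accelerometer + 3-axis gyroscope (assumed)
        super().__init__()
        def block(c_in, c_out):
            # each stage keeps length via padding, then halves it by pooling;
            # per the claim, L_out = (L_in - N + 2P) / S + 1 for the convolution itself
            return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=5, padding=2),
                                 nn.ReLU(),
                                 nn.MaxPool1d(kernel_size=2))
        self.cnn = nn.Sequential(block(in_channels, 32), block(32, 64), block(64, 128))
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=2, batch_first=True)

    def forward(self, x):
        # x: (batch, channels, time) raw inertial sensor signal
        feats = self.cnn(x)                 # (batch, 128, time / 8) high-level temporal features
        feats = feats.transpose(1, 2)       # (batch, time / 8, 128), fed to the LSTM in temporal order
        out, _ = self.lstm(feats)           # 128-dimensional time-series features
        return out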
7. The action recognition method based on multi-modal sequence fusion as claimed in claim 1, wherein connecting the detection result of each frame into the spatio-temporal channel of the formed action so as to preprocess the video frames and obtain the data set corresponding to the human body action comprises:
an undivided original video is represented as
Figure FDA0003987109550000047
where x_n denotes the n-th frame image of video X and w denotes the number of frames in video X; all action instances contained in the video are denoted by the instance set
Figure FDA0003987109550000048
where N_g denotes the number of real action instances in video X, and t_s,i and t_e,i respectively denote the start and end points of the action instance
Figure FDA0003987109550000049
each element indexed by g representing one action instance;
the incoming video and text are acquired to construct serialized features; multi-scale candidate temporal video segments are generated at every time point of the video sequence features according to preset temporal lengths; a temporal co-attention interaction network is adopted to perform feature interaction and fusion between the candidate temporal segments and the text sequence features, and the multi-modal fusion data embedded in the same feature space are obtained to form the corresponding data set.
8. The action recognition method based on multi-modal sequence fusion as claimed in claim 1, wherein obtaining the spatial position corresponding to the human action, detecting the spatial candidate box coordinates and action category of each frame, and performing temporal action detection by using the correlation between consecutive frames comprises:
an undivided video sequence V is preset; the video signal is first sampled at equal intervals at a fixed frame rate to obtain an image frame sequence, which is divided into M equal-length, non-overlapping video segment units {v_1, ..., v_j, ..., v_M}; a positional encoding function is used to add additional temporal position information to the video images, and the video unit feature
Figure FDA0003987109550000051
is expressed as
Figure FDA0003987109550000052
Figure FDA0003987109550000053
where
Figure FDA0003987109550000054
d and d_v denote the video encoding feature dimension and the extracted video unit feature dimension respectively;
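A small sketch of how the video unit features might be combined with temporal position information, assuming a standard sinusoidal positional encoding and a linear projection from d_v to d; the patent's exact encoding is given only as formula images, so this form and the dimensions are assumptions.

import math
import torch
import torch.nn as nn

def sinusoidal_position_encoding(M, d):
    """Standard sinusoidal positional encoding for M video units and feature dimension d (assumed)."""
    pos = torch.arange(M, dtype=torch.float32).unsqueeze(1)                                   # (M, 1)
    div = torch.exp(torch.arange(0, d, 2, dtype=torch.float32) * (-math.log(10000.0) / d))
    pe = torch.zeros(M, d)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

# project d_v-dimensional unit features to d dimensions and add position information (assumed pipeline)
d_v, d, M = 1024, 512, 128
proj = nn.Linear(d_v, d)
unit_feats = torch.randn(M, d_v)                      # extracted video unit features
video_feats = proj(unit_feats) + sinusoidal_position_encoding(M, d)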
a unit co-attention interaction layer is adopted to construct the interaction information between the video and the text; the input video and text description features are preset as V_in and S_in, and the d-dimensional feature vectors are transformed by linear mapping into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s); the corresponding expression is
Figure FDA0003987109550000055
where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices; the feature Q_f from the video modality serves as the query vector, while the features K_s and V_s from the text modality serve as the key vector and value vector respectively; the similarity weight matrix between the features is calculated to obtain the weighted video feature
Figure FDA0003987109550000056
and the contextual information of the text and video features is integrated into the feature vector at the current position to obtain the corresponding time domain features.
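A minimal PyTorch-style sketch of one direction of this co-attention (video queries attending over text keys and values) is shown below; the scaling, the residual connection and the feature dimension are illustrative assumptions, and the symmetric text-to-video direction using W_sq, W_fk and W_fv is omitted.

import math
import torch
import torch.nn as nn

class UnitCoAttention(nn.Module):
    """Video-to-text co-attention: video features as queries, text features as keys/values."""
    def __init__(self, d: int = 512):
        super().__init__()
        self.W_fq = nn.Linear(d, d)   # query projection for video features
        self.W_sk = nn.Linear(d, d)   # key projection for text features
        self.W_sv = nn.Linear(d, d)   # value projection for text features

    def forward(self, V_in, S_in):
        # V_in: (batch, M, d) video unit features; S_in: (batch, L, d) text features
        Q_f, K_s, V_s = self.W_fq(V_in), self.W_sk(S_in), self.W_sv(S_in)
        attn = torch.softmax(Q_f @ K_s.transpose(-2, -1) / math.sqrt(Q_f.size(-1)), dim=-1)
        weighted_video = attn @ V_s            # text context aggregated at each video position
        return V_in + weighted_video           # integrate context into the current-position features (residual assumed)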
9. The action recognition method based on multi-modal sequence fusion as claimed in claim 1, wherein obtaining a human body action video to be recognized and performing action labeling on the action video to obtain video frames comprises:
the motion characteristics of the body parts are acquired with an inertial sensor; each time an unlabeled video is read in, the camera view is switched for labeling; after an action has been labeled, its correctness is verified against screenshots taken at the start and end of the action, and if a wrongly labeled segment is found, it is fine-tuned or re-labeled.
10. An action recognition system based on multi-modal sequence fusion, implementing the action recognition method based on multi-modal sequence fusion according to any one of claims 1 to 8, comprising:
a first acquisition module, which is used for acquiring a human body action video to be recognized and performing action labeling on the action video to obtain video frames, the action labeling comprising semantic segmentation and timeline segmentation labels of the action;
a second acquisition module, which is used for acquiring the spatial position corresponding to the human body action, detecting the spatial candidate box coordinates and action category of each frame, performing temporal action detection by using the correlation between consecutive frames to locate the time period in which the action occurs, connecting the detection result of each frame into the spatio-temporal channel of the formed action, and preprocessing the video frames to obtain the data set corresponding to the human body action;
a feature extraction module, which is used for extracting features of the data set by using a convolutional neural network and a long short-term memory network to obtain behavior features and the corresponding time domain features;
and a recognition module, which is used for constructing a network model from the behavior features and the time domain features, and inputting multiple modalities of information into the network model for feature fusion and classification to complete human body action recognition.
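A hedged structural sketch of how the four modules of the system might be composed in code; every class and method name here is an illustrative placeholder, not an interface defined by the patent.

class MultiModalActionRecognitionSystem:
    """Composes the four modules described in the system claim (names are illustrative)."""

    def __init__(self, first_acq, second_acq, feature_extractor, recognizer):
        self.first_acq = first_acq                    # video acquisition + action labeling
        self.second_acq = second_acq                  # spatial detection, temporal detection, preprocessing
        self.feature_extractor = feature_extractor    # CNN + LSTM feature extraction
        self.recognizer = recognizer                  # multi-modal fusion and classification

    def run(self, raw_video, sensor_data, text_query):
        frames = self.first_acq(raw_video)                          # labeled video frames
        dataset = self.second_acq(frames)                           # preprocessed data set
        behavior_feats, temporal_feats = self.feature_extractor(dataset, sensor_data)
        return self.recognizer(behavior_feats, temporal_feats, text_query)   # predicted action label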
CN202211568552.3A 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion Pending CN115937975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211568552.3A CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211568552.3A CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Publications (1)

Publication Number Publication Date
CN115937975A true CN115937975A (en) 2023-04-07

Family

ID=86550146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211568552.3A Pending CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Country Status (1)

Country Link
CN (1) CN115937975A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503957A (en) * 2023-06-26 2023-07-28 成都千嘉科技股份有限公司 Gas household operation behavior identification method
CN116503957B (en) * 2023-06-26 2023-09-15 成都千嘉科技股份有限公司 Gas household operation behavior identification method
CN117409538A (en) * 2023-12-13 2024-01-16 吉林大学 Wireless fall-prevention alarm system and method for nursing
CN117953543A (en) * 2024-03-26 2024-04-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Character interaction detection method based on multiple texts, terminal and readable storage medium

Similar Documents

Publication Publication Date Title
Ullah et al. Activity recognition using temporal optical flow convolutional features and multilayer LSTM
Subetha et al. A survey on human activity recognition from videos
Hu et al. Video facial emotion recognition based on local enhanced motion history image and CNN-CTSLSTM networks
Avola et al. 2-D skeleton-based action recognition via two-branch stacked LSTM-RNNs
Devanne et al. 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold
Ibraheem et al. Survey on various gesture recognition technologies and techniques
CN115937975A (en) Action recognition method and system based on multi-modal sequence fusion
CN110472531A (en) Method for processing video frequency, device, electronic equipment and storage medium
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
Caputo et al. SHREC 2021: Skeleton-based hand gesture recognition in the wild
Alazrai et al. Anatomical-plane-based representation for human–human interactions analysis
Lee et al. Real-time gesture recognition in the view of repeating characteristics of sign languages
CN111444488A (en) Identity authentication method based on dynamic gesture
Colaco et al. Facial keypoint detection with convolutional neural networks
de Araujo Zeni et al. Real-time gender detection in the wild using deep neural networks
Chen et al. A multi-scale fusion convolutional neural network for face detection
Alhersh et al. Learning human activity from visual data using deep learning
Zerrouki et al. Deep Learning for Hand Gesture Recognition in Virtual Museum Using Wearable Vision Sensors
Nasir et al. Recognition of human emotion transition from video sequence using triangulation induced various centre pairs distance signatures
Kumar et al. Predictive analytics on gender classification using machine learning
Ghosh et al. Deep learning-based multi-view 3D-human action recognition using skeleton and depth data
Zerrouki et al. Exploiting deep learning-based LSTM classification for improving hand gesture recognition to enhance visitors’ museum experiences
Latha et al. Human action recognition using deep learning methods (CNN-LSTM) without sensors
CN113869189A (en) Human behavior recognition method, system, device and medium
Wang Deeply-learned and spatial–temporal feature engineering for human action understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination