CN109460707A - Multi-modal action recognition method based on a deep neural network - Google Patents

Multi-modal action recognition method based on a deep neural network

Info

Publication number
CN109460707A
CN109460707A
Authority
CN
China
Prior art keywords
layer
video
neural network
deep neural
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811165862.4A
Other languages
Chinese (zh)
Inventor
许泽珊
余卫宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201811165862.4A priority Critical patent/CN109460707A/en
Publication of CN109460707A publication Critical patent/CN109460707A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G06V40/25 Recognition of walking or running movements, e.g. gait recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Abstract

The invention discloses a multi-modal action recognition method based on a deep neural network. The method jointly exploits multi-modal information such as video images, optical flow maps and human skeletons. The specific steps are as follows: first, the video is preprocessed and compressed; optical flow maps are obtained from consecutive video frames; using a pose-estimation algorithm, the human skeleton is extracted frame by frame from the video, and the path integral features of the frame sequence are computed; the resulting optical flow maps, skeleton path integral features and original video images are fed into a deep neural network with a multi-branch structure, enabling it to learn an abstract spatio-temporal representation of human actions and correctly predict the action class. In addition, a pooling layer based on an attention mechanism is inserted into the video-image branch, strengthening the abstract features most relevant to the final classification result and reducing irrelevant interference. By jointly exploiting multi-modal information, the invention achieves strong robustness and a high recognition rate.

Description

Multi-modal action recognition method based on a deep neural network
Technical field
The present invention relates to the technical field of image processing, and in particular to a multi-modal action recognition method based on a deep neural network.
Background art
Action recognition has recently become a very popular research direction. By recognizing human actions in video, it can serve as a new kind of interactive input to processing devices, and it can be widely applied in everyday fields such as gaming, film and television. The action recognition task involves identifying different actions from video clips, possibly spanning an entire video. It is a natural extension of the image classification task to multiple frames: image recognition is performed on each frame, and the per-frame results are then aggregated into a final action prediction.
Traditional video action recognition tends to rely on hand-designed feature extractors to capture the spatio-temporal features of an action. With the rise of deep learning, such hand-crafted feature extractors have been replaced by deep convolutional neural networks.
Although deep learning frameworks have been very successful in image classification (ImageNet), progress on architectures for video classification and representation learning has been slow, mainly because of the enormous computational cost: a simple two-dimensional convolutional network for 101-way classification has only about 5M parameters, whereas expanding the same architecture to a three-dimensional structure increases this to about 33M parameters. Training a three-dimensional convolutional network (3D ConvNet) on UCF101 takes 3 to 4 days, and about 2 months on Sports-1M, which makes architecture exploration difficult and prone to overfitting.
Action recognition involves capturing spatio-temporal context across frames. In addition, the captured spatial information generally has to be compensated for camera motion, and even strong spatial object detection cannot by itself satisfy the demands of action recognition, because the finer details carried by the motion information have not yet been mined. Better prediction requires capturing the motion information of the local context in the video as well as that of the global context.
Today video action recognition relies entirely on deep learning, and the classic work is the two-stream convolutional neural network. The two-stream design in fact borrows from the two pathways along which the brain's visual system processes information: the ventral ("what") pathway processes spatial information such as object shape and colour, while the dorsal ("where") pathway processes motion- and position-related information. Although this method clearly improves on single-stream approaches by capturing local temporal motion, the video-level prediction is obtained by averaging the prediction scores of sampled clips, so medium- and long-term temporal information is still lost from the learned features. There is therefore still considerable room to improve two-stream video recognition methods.
Summary of the invention
The purpose of the present invention is to overcome the above drawbacks of the prior art by providing a multi-modal action recognition method based on a deep neural network. The method adds the human skeleton as an additional modality on top of the two-stream convolutional network. Human pose estimation is comparatively tractable (the skeleton keypoints are strongly correlated, so bottom-up and top-down cues can be combined for localization), and mature open frameworks such as AlphaPose exist; introducing it into action recognition on the one hand eliminates the interference of irrelevant background, and on the other hand lets the frame sequence finely depict the positional changes of each keypoint during human motion, which benefits recognition. The method performs multi-modal action recognition with a deep neural network having a multi-branch structure: the image branch processes spatial information such as object shape and colour; the optical-flow branch processes motion- and position-related information; the skeleton branch finely depicts the action through the path integral features of the frame sequence. In addition, the invention introduces a pooling method based on an attention mechanism into the image branch, so that the image branch automatically places its attention on the regions of interest most relevant to the action class, further improving the accuracy of the action recognition method.
The purpose of the present invention can be achieved by the following technical solution:
A multi-modal action recognition method based on a deep neural network, comprising the following steps:
S1. Collect public databases and convert every frame of the video data into a set of RGB pictures, named according to the rule video name + time + action id, which is used as the filename to separate the data; the data here are split into a training set and a test set at a ratio of 3:1, where the action id covers the following six basic actions: walking, running, waving, bending, jumping, standing.
S2. Unify the resolution of the data set obtained in step S1.
S3. Compress the image data set processed in step S2 to reduce computation, i.e. compress the pixel information of every video frame with the image discrete cosine transform.
S4. Along the time dimension, delete from the video data processed in step S3 the video frames whose time interval is within the interval threshold or whose picture similarity exceeds the similarity threshold.
S5. Extract the optical flow information of N consecutive video frames from the data processed in step S4, where N is a positive integer greater than or equal to 10.
S6. Using an open-source pose-estimation algorithm such as AlphaPose, extract the human skeleton frame by frame from the video, thereby obtaining a frame sequence, and compute the path integral features of that sequence.
S7. Take the optical flow information extracted in step S5, the skeleton path integral features extracted in step S6 and the video images processed in step S4 as the input of the deep neural network. At the lower layers the deep neural network has three branches: a convolutional neural network for extracting temporal features, a convolutional neural network for extracting spatial features, and a fully connected network for processing the skeleton path integral features. At the higher layers the three branches of the lower layers are merged into one by feature fusion, and the class id of the video action is predicted through a softmax activation function.
Further, the databases collected in step S1 mainly include the KTH human behaviour database and the UCF Sports database.
Further, step S2 unifies the video image resolution to 120×90.
Further, step S3 applies a discrete cosine transform to every frame of the video data, thresholds the transformed DCT coefficients by zeroing those below a certain threshold so that the compression ratio is 10:1, and then applies the inverse DCT to obtain the single compressed frame of the video data.
Further, along the time dimension step S4 deletes video frames within a 500 ms interval whose similarity exceeds 70%, reducing redundancy. The interval threshold for the time interval ranges from 400 ms to 1000 ms, with 500 ms as a typical value; the similarity threshold for the picture similarity ranges from 0.5 to 0.9, with 0.7 as a typical value.
Further, step S5 extracts the optical flow information of 10 consecutive video frames, mainly with the Lucas-Kanade algorithm, which solves the basic optical flow equation over all pixels in a neighbourhood by the least-squares principle, finally yielding the required optical flow information.
Further, step S6 uses an open-source pose-estimation algorithm such as AlphaPose to extract the human skeleton frame by frame from the video, thereby obtaining a frame sequence, and computes the path integral features of that sequence.
Further, step S7 feeds the optical flow information extracted in step S5, the skeleton path integral features extracted in step S6 and the video images processed in step S4 into the deep neural network, whose network structure is as follows:
At the lower layers the deep neural network has three branches, namely the image branch, the optical-flow branch and the skeleton branch, corresponding to the inputs of the three modalities; at the higher layers the three branches of the lower layers are merged into one by feature fusion. The image branch is a convolutional neural network connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, attention pooling layer attention pooling, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8, data fusion layer fusion, loss function layer loss;
The optical-flow branch is a convolutional neural network connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pooling5, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8, data fusion layer fusion, loss function layer loss;
The skeleton branch is a fully connected network, consisting in sequence from input layer to output layer of fully connected layer fc1, fully connected layer fc2, data fusion layer fusion, loss function layer loss.
Further, the data fusion layer fusion classifies the input video action through a softmax activation function and optimizes the network parameters by minimizing the classification loss function.
Further, the attention pooling layer introduced into the image branch in step S7 constructs two learned weight vectors from the post-convolution features: a bottom-up saliency weight vector b and a top-down attention weight vector a. Matrix operations then apply bottom-up saliency weighting and top-down attention weighting to the feature projection, and the two responses are finally fused to give the result. Let the feature to be pooled be X with X ∈ R^{n×f} and a, b ∈ R^{f×1}, where n is the spatial size of the feature projection and f its number of channels. X_b = Xb represents the projection of X after bottom-up saliency weighting; this projection is independent of the specific class and has thickness 1. X_a represents the projection of X after top-down attention weighting; since different classes should have different attention weight vectors a, with K classes the attention weights of all classes form a matrix A ∈ R^{f×K}, and the top-down attention projection is X_a = XA. The final class-specific attention projection X_a is first multiplied element-wise with the saliency projection X_b, and the products are then summed, yielding the attention-weighted feature matrix for that class.
Further, the data fusion layer fusion classifies the input video action through a softmax activation function and optimizes the network parameters by minimizing the classification loss function. The training of the network model is not restricted to a specific training framework; the Caffe framework, MXNet framework, Torch framework, TensorFlow framework and others may be used.
Compared with the prior art, the present invention has the following advantages and effects:
(1) The multi-modal action recognition method based on a deep neural network disclosed by the invention uses a deep neural network with a multi-branch structure: the image branch processes spatial information such as object shape and colour; the optical-flow branch processes motion- and position-related information; the skeleton branch finely depicts the action through the path integral features of the frame sequence.
(2) The multi-modal action recognition method based on a deep neural network disclosed by the invention first reduces the computation of the network by preprocessing, substantially reducing running time, and jointly exploits multi-modal information such as video images, optical flow maps and human skeletons, significantly improving video action recognition accuracy.
(3) The multi-modal action recognition method based on a deep neural network disclosed by the invention introduces an attention-weighted pooling operation into the pooling layer of the image branch, which can learn a weight for each pooling unit during training: units with larger weights correspond to the abstract features closely tied to the action, while units with smaller weights correspond to other features that should be ignored or that interfere with action recognition. After passing through the attention-based pooling structure, features unrelated to the action class are ignored and features closely tied to the action are "amplified", improving the accuracy and precision of action recognition.
Brief description of the drawings
Fig. 1 is a schematic diagram of the model of the multi-modal action recognition method based on a deep neural network disclosed in the present invention;
Fig. 2 is a schematic diagram of the computation of the attention-based pooling structure proposed by the present invention;
Fig. 3 is a flowchart of the multi-modal action recognition method based on a deep neural network disclosed in the present invention.
Specific embodiment
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment
As shown in Fig. 1, this embodiment discloses a multi-modal action recognition method based on a deep neural network.
At the lower layers the deep neural network used in this embodiment has three branches: a convolutional neural network for extracting temporal features, a convolutional neural network for extracting spatial features, and a fully connected network for processing the skeleton path integral features. At the higher layers the three branches are merged into one by feature fusion, and the class id of the video action is predicted through a softmax activation function. In the image branch a pooling structure based on an attention mechanism is introduced; without changing the existing network infrastructure, it helps the network focus on the features conducive to recognizing the action, reducing the interference of irrelevant features, improving the performance of the existing network, and allowing the video human-action recognition system to be applied more effectively in engineering.
For an embodiment of the present invention, complete training data improve the training precision of the model; in addition, preprocessing and compressing the data further reduce the interference of redundant and irrelevant information and the computation of the model, thereby shortening the model training time and improving training precision. The multi-modal action recognition method based on a deep neural network of this embodiment is therefore as follows:
S1. Collection of training data
Public databases are collected, mainly including the following: the KTH human behaviour database and the UCF Sports database. Every frame of the video data is converted into a set of RGB pictures, named according to the rule video name + time + action id, which is used as the filename to separate the data; the data here are split into a training set and a test set at a ratio of 3:1, where the action id covers six basic actions: walking, running, waving, bending, jumping, standing.
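Purely as an illustration, a minimal Python sketch of this collection-and-split step follows; the directory layout, file extension and helper names are assumptions, not part of the patent, and a real split would normally group frames by source video:

    import random
    from pathlib import Path

    # The six basic action ids of step S1 (English short names are illustrative).
    ACTIONS = ["walk", "run", "wave", "bend", "jump", "stand"]

    def build_splits(frame_dir, train_ratio=0.75, seed=0):
        """Collect frames named '<video>_<time>_<actionid>.jpg' and split
        them 3:1 into a training set and a test set."""
        frames = sorted(Path(frame_dir).glob("*.jpg"))
        random.Random(seed).shuffle(frames)
        cut = int(len(frames) * train_ratio)
        return frames[:cut], frames[cut:]

    train_set, test_set = build_splits("KTH_frames")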
S2. The data set obtained in step S1 is normalized, i.e. unified in resolution: the picture specification of every frame is compressed to a uniform resolution of 120×90, which reduces the computation of the convolutional neural network model and improves recognition speed while preserving the information content of the image as far as possible.
S3. The image data set processed in step S2 is compressed to reduce computation: the pixel information of every video frame is compressed with the image discrete cosine transform (DCT) at a compression ratio of 10:1, reducing the amount of information handled during initialization. A discrete cosine transform is applied to the original image, the transformed DCT coefficients are thresholded by zeroing coefficients below the threshold, compressing and quantizing the image, and the inverse DCT is then applied to obtain the final compressed image.
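A minimal sketch of this DCT compression for one greyscale frame, assuming OpenCV is available; the quantile-based threshold used here is one simple way, not necessarily the patent's, of reaching roughly a 10:1 ratio:

    import cv2
    import numpy as np

    def dct_compress(gray, keep_ratio=0.1):
        """Zero out the smallest DCT coefficients and invert the transform.
        Works on the 120x90 frames of step S2 (cv2.dct needs even sizes)."""
        coeffs = cv2.dct(np.float32(gray))
        # Keep only the largest `keep_ratio` of coefficients by magnitude.
        thresh = np.quantile(np.abs(coeffs), 1.0 - keep_ratio)
        coeffs[np.abs(coeffs) < thresh] = 0.0
        return np.uint8(np.clip(cv2.idct(coeffs), 0, 255))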
S4. Along the time dimension, video frames of the data processed in step S3 whose time interval is within 500 ms or whose picture similarity exceeds 0.7 are deleted, reducing redundancy.
The picture similarity is computed by the following steps (a minimal sketch follows the list):
S41. Scale the picture: compress it to the standard size of 8×8, i.e. 64 pixel values;
S42. Simplify the colours by converting to a greyscale image;
S43. Compute the mean: the average pixel value over all pixels of the greyscale image;
S44. Compare pixel grey values: traverse every pixel of the greyscale image against the mean computed in the previous step, recording 1 if it exceeds the mean and 0 otherwise;
S45. Obtain the 64-bit image fingerprint;
S46. Compute the Hamming distance between the image fingerprints of the two pictures and use it as the picture similarity.
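A minimal sketch of steps S41-S46 with OpenCV; normalising the Hamming distance into [0, 1] is an assumption made here so that the 0.7 threshold of step S4 can be applied directly:

    import cv2
    import numpy as np

    def ahash(img, hash_size=8):
        """S41-S45: shrink a BGR picture to 8x8, convert it to greyscale and
        threshold each pixel against the mean, giving a 64-bit fingerprint."""
        small = cv2.resize(img, (hash_size, hash_size))
        grey = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY)
        return (grey > grey.mean()).flatten()

    def similarity(img1, img2):
        """S46: Hamming distance between the two fingerprints, normalised so
        that identical pictures score 1.0."""
        h1, h2 = ahash(img1), ahash(img2)
        return 1.0 - np.count_nonzero(h1 != h2) / h1.size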
S5. Optical flow methods use the temporal variation of pixels in an image sequence and the correlation between consecutive frames to find the correspondence between the previous frame and the current frame, and from it compute the motion of objects between consecutive frames. The Lucas-Kanade method is a widely used differential optical-flow estimator that solves the basic optical flow equation over all pixels in a neighbourhood by the least-squares principle; compared with common point-wise methods, the Lucas-Kanade algorithm is less sensitive to image noise. The bidirectional optical flow of 10 consecutive video frames is therefore extracted from the video frame data processed in step S4 with the Lucas-Kanade algorithm. The algorithm is the method of Lucas B. and Kanade T., "An Iterative Image Registration Technique with an Application to Stereo Vision", Proc. of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pp. 674-679; it is implemented in OpenCV, so this implementation extracts the optical flow information with the Lucas-Kanade implementation in OpenCV.
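A minimal sketch of the flow extraction using OpenCV's pyramidal Lucas-Kanade implementation (cv2.calcOpticalFlowPyrLK); the corner-detection parameters are illustrative, and the bidirectional flow of step S5 would be obtained by also calling the function with the two frames swapped:

    import cv2

    def lk_flow(prev_gray, next_gray, max_corners=200):
        """Track corner points from prev_gray into next_gray with pyramidal
        Lucas-Kanade; returns the matched point pairs."""
        p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                     qualityLevel=0.01, minDistance=7)
        p1, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray,
                                                    p0, None)
        good = status.ravel() == 1
        return p0[good].reshape(-1, 2), p1[good].reshape(-1, 2)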
S6. Path integral (signature) features, obtained through iterated integrals of a path, extract rich dynamic information such as the displacement and curvature that characterize the path. Using a pose-estimation algorithm such as AlphaPose, the human skeleton is extracted frame by frame from the video data processed in step S4, giving a skeleton time series. Let the number of video frames be N and the number of keypoints be K (here 15); each keypoint has two coordinates, so the frame sequence is a path of dimension d = 2K and length N, written P_d = {X_1, X_2, ..., X_N}, where each X_i is a 2K-dimensional vector. The discrete path P_d is a sampling of the underlying continuous keypoint path P_t : [0, T] → R^d. For P_t, the k-th order iterated path integral is defined as

I_k = ∫_{0 < t_1 < ... < t_k < T} dP_{t_1} ⊗ dP_{t_2} ⊗ ... ⊗ dP_{t_k}

The path integral feature is then the set of iterated path integrals of all orders, an infinite-dimensional vector; the 0th-order path integral is defined as 1. In engineering practice the iterated integrals of the first m orders generally depict the dynamic characteristics of the path well enough, so the first m orders of the path integral feature are taken as

S(X)|_m = {1, I_1, I_2, ..., I_m}

In practice only the discrete path P_d is available rather than P_t; the path integrals can then be computed by tensor algebra over the path increments.
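A minimal numpy sketch of the truncated signature for m = 2, built over the path increments with Chen's identity (for higher orders, existing packages such as iisignature or esig compute the same quantity):

    import numpy as np

    def signature_level2(path):
        """Truncated path signature S(X)|_2 = {1, I1, I2} of a
        piecewise-linear path of shape (N, d)."""
        d = path.shape[1]
        I1 = np.zeros(d)         # first-order iterated integral
        I2 = np.zeros((d, d))    # second-order iterated integral
        for dx in np.diff(path, axis=0):
            # Chen's identity for appending one linear segment.
            I2 += np.outer(I1, dx) + 0.5 * np.outer(dx, dx)
            I1 += dx
        return np.concatenate([[1.0], I1, I2.ravel()])

    # A skeleton sequence of N frames with K = 15 keypoints gives d = 30.
    feat = signature_level2(np.random.rand(40, 30))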
After the data have been constructed by downloading and preprocessing in the above steps and divided at a ratio of 3:1 into a training data set and a test set, the neural network model for action recognition is built as follows.
S7. The optical flow information extracted in step S5, the skeleton path integral features extracted in step S6 and the video images processed in step S4 are fed into the deep neural network. At the lower layers the network has three branches, namely the image branch, the optical-flow branch and the skeleton branch, corresponding to the inputs of the three modalities; at the higher layers the three branches are merged into one by feature fusion.
The image branch is a convolutional neural network connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, attention pooling layer attention pooling, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8, data fusion layer fusion, loss function layer loss;
The optical-flow branch is a convolutional neural network connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pooling5, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8, data fusion layer fusion, loss function layer loss;
The skeleton branch is a fully connected network, consisting in sequence from input layer to output layer of fully connected layer fc1, fully connected layer fc2, data fusion layer fusion, loss function layer loss.
The overall network structure is shown in Fig. 1. In the multi-branch neural network, the image branch captures the spatial dependencies in the video, the optical-flow branch captures the recurring motion present at each spatial position in the video, and the skeleton branch finely depicts the spatio-temporal changes of the human keypoint positions during the action. The three branches pass through their respective feature-learning networks and are merged at the data fusion layer fusion, giving the final abstract spatio-temporal features relevant to action recognition. The fused features pass through a softmax activation to predict the action class. A minimal sketch of the three-branch structure follows.
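The following PyTorch sketch shows the shape of the three-branch structure only; the layer widths, kernel sizes, 10-frame flow stack and signature dimension are assumptions for illustration, and the patent's image branch uses the attention pooling described next rather than plain average pooling:

    import torch
    import torch.nn as nn

    class MultiModalNet(nn.Module):
        """Three-branch network: image and flow CNNs plus a fully connected
        skeleton branch, fused before the classifier (sizes illustrative)."""
        def __init__(self, num_classes=6, sig_dim=931):
            super().__init__()
            def cnn(in_ch):
                # Stands in for conv1-conv5, pooling and fc6-fc8 of one branch.
                return nn.Sequential(
                    nn.Conv2d(in_ch, 96, 7, stride=2), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(96, 256, 5, stride=2), nn.ReLU(), nn.MaxPool2d(2),
                    nn.Conv2d(256, 512, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(512, 512, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(512, 256))
            self.image_branch = cnn(3)        # one RGB frame
            self.flow_branch = cnn(20)        # 10 stacked 2-channel flow maps
            self.skeleton_branch = nn.Sequential(  # fc1, fc2 on signatures
                nn.Linear(sig_dim, 256), nn.ReLU(), nn.Linear(256, 256))
            self.fusion = nn.Linear(3 * 256, num_classes)

        def forward(self, rgb, flow, sig):
            fused = torch.cat([self.image_branch(rgb),
                               self.flow_branch(flow),
                               self.skeleton_branch(sig)], dim=1)
            # softmax is applied inside the cross-entropy loss during training.
            return self.fusion(fused)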
In the image branch, attention pooling is a pooling computation structure based on an attention mechanism, as shown in Fig. 2. Two learned weight vectors are constructed from the post-convolution features: a bottom-up saliency weight vector b and a top-down attention weight vector a. Matrix operations then apply bottom-up saliency weighting and top-down attention weighting to the feature projection, and the two responses are finally fused to give the result. Let the feature to be pooled be X with X ∈ R^{n×f} and a, b ∈ R^{f×1}, where n is the spatial size of the feature projection and f its number of channels. X_b = Xb represents the projection of X after bottom-up saliency weighting; this projection is independent of the specific class and has thickness 1. As shown in Fig. 2, X_a represents the projection of X after top-down attention weighting; since different classes should have different attention weight vectors a, with K classes the attention weights of all classes form a matrix A ∈ R^{f×K}, and the top-down attention projection is X_a = XA. As Fig. 2 shows, the final class-specific attention projection X_a is first multiplied element-wise with the saliency projection X_b, and the products are then summed, yielding the attention-weighted feature matrix for that class. A minimal sketch of this computation follows.
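A minimal PyTorch sketch of the pooling computation (the tensor shapes follow the text above; in a real layer A and b would be learned nn.Parameter tensors):

    import torch

    def attention_pool(X, A, b):
        """X: (n, f) features to pool; b: (f, 1) bottom-up saliency vector;
        A: (f, K) top-down attention vectors, one column per class.
        Returns one attention-pooled score per class."""
        Xb = X @ b                   # (n, 1) class-agnostic saliency projection
        Xa = X @ A                   # (n, K) class-specific attention projections
        return (Xa * Xb).sum(dim=0)  # element-wise product, then sum over n

    scores = attention_pool(torch.rand(36, 512),
                            torch.rand(512, 6), torch.rand(512, 1))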
By combining multi-modal fusion with the attention mechanism, the multi-modal action recognition method based on a deep neural network of the present invention captures the spatio-temporal features relevant to action recognition, thereby improving the recognition accuracy of the network.
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention is an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A multi-modal action recognition method based on a deep neural network, characterized in that the action recognition method comprises:
S1. Collection of training data: collect public databases, construct a training set and a test set at a certain ratio, and convert every frame of the video data into a set of RGB pictures;
S2. Unify the resolution of the data set obtained in step S1, compressing the picture specification of every frame of the video data;
S3. Compress the image data set processed in step S2;
S4. Along the time dimension, delete from the video data processed in step S3 the video frames whose time interval is within the interval threshold or whose picture similarity exceeds the similarity threshold;
S5. Extract the bidirectional optical flow information of N consecutive video frames from the data processed in step S4, where N is a positive integer greater than or equal to 10;
S6. Extract the human skeleton frame by frame from the data processed in step S4 and compute the path integral features of the frame sequence;
S7. Take the optical flow information extracted in step S5, the skeleton path integral features extracted in step S6 and the video images processed in step S4 as the input of the deep neural network, wherein at the lower layers the deep neural network has three branches corresponding respectively to the inputs of the three modalities, at the higher layers the three branches of the lower layers are merged into one by feature fusion, and the class id of the input video action is predicted through a softmax activation function.
2. The multi-modal action recognition method based on a deep neural network according to claim 1, characterized in that in step S1 every frame image is named according to the rule video name + time + action id, which is used as the filename to separate the data, wherein the action id serves as the class id of the video action and covers the following basic actions: walking, running, waving, bending, jumping, standing.
3. The multi-modal action recognition method based on a deep neural network according to claim 1, characterized in that in step S3 the pixel information of every video frame is compressed with the image discrete cosine transform, as follows:
A discrete cosine transform is applied to every frame of the video data, the transformed DCT coefficients are thresholded by zeroing coefficients below a certain threshold, and the inverse DCT is then applied to obtain the single compressed frame of the video data.
4. The multi-modal action recognition method based on a deep neural network according to claim 1, characterized in that in step S4 video frames whose time interval is within 500 ms or whose picture similarity is 0.7 or more are deleted.
5. The multi-modal action recognition method based on a deep neural network according to claim 1, characterized in that the picture similarity is computed by the following steps:
S41. Scale the picture: compress it to a size of W*W, where W is the number of pixels per side;
S42. Simplify the colours by converting to a greyscale image;
S43. Compute the mean: the average pixel value over all pixels of the greyscale image;
S44. Compare pixel grey values: traverse every pixel of the greyscale image against the above mean, recording 1 if it exceeds the mean and 0 otherwise;
S45. Obtain the W²-bit image fingerprint;
S46. Compute the Hamming distance between the image fingerprints of the two pictures and use it as the picture similarity.
6. The multi-modal action recognition method based on a deep neural network according to claim 1, characterized in that in step S5 the Lucas-Kanade algorithm is used to solve the basic optical flow equation over all pixels in a neighbourhood by the least-squares principle, finally extracting the bidirectional optical flow information of N consecutive video frames.
7. The multi-modal action recognition method based on a deep neural network according to claim 1, characterized in that step S6 uses an open-source pose-estimation algorithm to extract the human skeleton frame by frame from the video, thereby obtaining a frame sequence, and computes the path integral features of that sequence.
8. The multi-modal action recognition method based on a deep neural network according to claim 1, characterized in that the network structure of the deep neural network in step S7 is as follows:
At the lower layers the deep neural network has three branches, namely the image branch, the optical-flow branch and the skeleton branch, corresponding to the inputs of the three modalities; at the higher layers the three branches of the lower layers are merged into one by feature fusion. The image branch is a convolutional neural network connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, attention pooling layer attention pooling, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8, data fusion layer fusion, loss function layer loss;
The optical-flow branch is a convolutional neural network connected in sequence from input layer to output layer as: convolutional layer conv1, pooling layer pooling1, convolutional layer conv2, pooling layer pooling2, convolutional layer conv3, convolutional layer conv4, convolutional layer conv5, pooling layer pooling5, fully connected layer fc6, fully connected layer fc7, fully connected layer fc8, data fusion layer fusion, loss function layer loss;
The skeleton branch is a fully connected network, consisting in sequence from input layer to output layer of fully connected layer fc1, fully connected layer fc2, data fusion layer fusion, loss function layer loss.
9. The multi-modal action recognition method based on a deep neural network according to claim 8, characterized in that the data fusion layer fusion classifies the input video action through a softmax activation function and optimizes the network parameters by minimizing the classification loss function.
10. The multi-modal action recognition method based on a deep neural network according to claim 8, characterized in that the attention pooling layer of the image branch in step S7 constructs two learned weight vectors from the post-convolution features: a bottom-up saliency weight vector b and a top-down attention weight vector a; matrix operations then apply bottom-up saliency weighting and top-down attention weighting to the feature projection, and the two responses are finally fused to give the result. Let the feature to be pooled be X with X ∈ R^{n×f} and a, b ∈ R^{f×1}, where n is the spatial size of the feature projection and f its number of channels. X_b = Xb represents the projection of X after bottom-up saliency weighting; this projection is independent of the specific class and has thickness 1. X_a represents the projection of X after top-down attention weighting; since different classes should have different attention weight vectors a, with K classes the attention weights of all classes form a matrix A ∈ R^{f×K}, and the top-down attention projection is X_a = XA. The final class-specific attention projection X_a is first multiplied element-wise with the saliency projection X_b, and the products are then summed, yielding the attention-weighted feature matrix for that class.
CN201811165862.4A 2018-10-08 2018-10-08 Multi-modal action recognition method based on a deep neural network Pending CN109460707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811165862.4A CN109460707A (en) Multi-modal action recognition method based on a deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811165862.4A CN109460707A (en) Multi-modal action recognition method based on a deep neural network

Publications (1)

Publication Number Publication Date
CN109460707A true CN109460707A (en) 2019-03-12

Family

ID=65607315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811165862.4A Pending CN109460707A (en) Multi-modal action recognition method based on a deep neural network

Country Status (1)

Country Link
CN (1) CN109460707A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160321498A1 (en) * 2007-01-12 2016-11-03 International Business Machines Corporation Warning a user about adverse behaviors of others within an environment based on a 3d captured image stream
WO2014155215A1 (en) * 2013-03-29 2014-10-02 Università Degli Studi Dell'aquila Method and apparatus for monitoring the personal exposure to static or quasi- static magnetic fields
CN104156693A (en) * 2014-07-15 2014-11-19 天津大学 Motion recognition method based on multi-model sequence fusion
CN108288035A (en) * 2018-01-11 2018-07-17 华南理工大学 The human motion recognition method of multichannel image Fusion Features based on deep learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chunhui Liu et al., "PKU-MMD: A Large Scale Benchmark for Continuous Multi-Modal Human Action Understanding", https://arxiv.org/abs/1703.07475 *
Rohit Girdhar et al., "Attentional Pooling for Action Recognition", https://arxiv.org/pdf/1711.01467v3.pdf *

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583334A (en) * 2018-11-16 2019-04-05 中山大学 A kind of action identification method and its system based on space time correlation neural network
CN109948528B (en) * 2019-03-18 2023-04-07 南京砺剑光电技术研究院有限公司 Robot behavior identification method based on video classification
CN109948528A (en) * 2019-03-18 2019-06-28 南京砺剑光电技术研究院有限公司 A kind of robot behavior recognition methods based on visual classification
CN110096968A (en) * 2019-04-10 2019-08-06 西安电子科技大学 A kind of ultrahigh speed static gesture identification method based on depth model optimization
CN110096968B (en) * 2019-04-10 2023-02-07 西安电子科技大学 Ultra-high-speed static gesture recognition method based on depth model optimization
CN110197116A (en) * 2019-04-15 2019-09-03 深圳大学 A kind of Human bodys' response method, apparatus and computer readable storage medium
CN110059620A (en) * 2019-04-17 2019-07-26 安徽艾睿思智能科技有限公司 Bone Activity recognition method based on space-time attention
CN110059620B (en) * 2019-04-17 2021-09-03 安徽艾睿思智能科技有限公司 Skeletal behavior identification method based on space-time attention
CN110135304A (en) * 2019-04-30 2019-08-16 北京地平线机器人技术研发有限公司 Human body method for recognizing position and attitude and device
CN110135386A (en) * 2019-05-24 2019-08-16 长沙学院 A kind of human motion recognition method and system based on deep learning
CN110175266A (en) * 2019-05-28 2019-08-27 复旦大学 A method of it is retrieved for multistage video cross-module state
CN110175266B (en) * 2019-05-28 2020-10-30 复旦大学 Cross-modal retrieval method for multi-segment video
CN110263666A (en) * 2019-05-29 2019-09-20 西安交通大学 A kind of motion detection method based on asymmetric multithread
CN110263666B (en) * 2019-05-29 2021-01-19 西安交通大学 Action detection method based on asymmetric multi-stream
CN110232412A (en) * 2019-05-30 2019-09-13 清华大学 A kind of body gait prediction technique based on multi-modal deep learning
CN112131908A (en) * 2019-06-24 2020-12-25 北京眼神智能科技有限公司 Action identification method and device based on double-flow network, storage medium and equipment
CN110084228A (en) * 2019-06-25 2019-08-02 江苏德劭信息科技有限公司 A kind of hazardous act automatic identifying method based on double-current convolutional neural networks
CN110298332A (en) * 2019-07-05 2019-10-01 海南大学 Method, system, computer equipment and the storage medium of Activity recognition
CN110491479A (en) * 2019-07-16 2019-11-22 北京邮电大学 A kind of construction method of sclerotin status assessment model neural network based
CN110458038A (en) * 2019-07-19 2019-11-15 天津理工大学 The cross-domain action identification method of small data based on double-strand depth binary-flow network
CN110472532A (en) * 2019-07-30 2019-11-19 中国科学院深圳先进技术研究院 A kind of the video object Activity recognition method and apparatus
CN110472532B (en) * 2019-07-30 2022-02-25 中国科学院深圳先进技术研究院 Video object behavior identification method and device
CN110398369A (en) * 2019-08-15 2019-11-01 贵州大学 A kind of Fault Diagnosis of Roller Bearings merged based on 1-DCNN and LSTM
WO2021035807A1 (en) * 2019-08-23 2021-03-04 深圳大学 Target tracking method and device fusing optical flow information and siamese framework
CN110516595A (en) * 2019-08-27 2019-11-29 中国民航大学 Finger multi-modal fusion recognition methods based on convolutional neural networks
CN110516595B (en) * 2019-08-27 2023-04-07 中国民航大学 Finger multi-mode feature fusion recognition method based on convolutional neural network
CN111046227B (en) * 2019-11-29 2023-04-07 腾讯科技(深圳)有限公司 Video duplicate checking method and device
CN111046227A (en) * 2019-11-29 2020-04-21 腾讯科技(深圳)有限公司 Video duplicate checking method and device
CN111027472A (en) * 2019-12-09 2020-04-17 北京邮电大学 Video identification method based on fusion of video optical flow and image space feature weight
CN111274998A (en) * 2020-02-17 2020-06-12 上海交通大学 Parkinson's disease finger knocking action identification method and system, storage medium and terminal
CN111274998B (en) * 2020-02-17 2023-04-28 上海交通大学 Parkinson's disease finger knocking action recognition method and system, storage medium and terminal
CN111310707A (en) * 2020-02-28 2020-06-19 山东大学 Skeleton-based method and system for recognizing attention network actions
CN111310707B (en) * 2020-02-28 2023-06-20 山东大学 Bone-based graph annotation meaning network action recognition method and system
CN113761975A (en) * 2020-06-04 2021-12-07 南京大学 Human skeleton action recognition method based on multi-mode feature fusion
CN113761975B (en) * 2020-06-04 2023-12-15 南京大学 Human skeleton action recognition method based on multi-mode feature fusion
CN111695523B (en) * 2020-06-15 2023-09-26 浙江理工大学 Double-flow convolutional neural network action recognition method based on skeleton space-time and dynamic information
CN111695523A (en) * 2020-06-15 2020-09-22 浙江理工大学 Double-current convolutional neural network action identification method based on skeleton space-time and dynamic information
CN111754620A (en) * 2020-06-29 2020-10-09 武汉市东旅科技有限公司 Human body space motion conversion method, conversion device, electronic equipment and storage medium
CN111754620B (en) * 2020-06-29 2024-04-26 武汉市东旅科技有限公司 Human body space motion conversion method, conversion device, electronic equipment and storage medium
CN111931602B (en) * 2020-07-22 2023-08-08 北方工业大学 Attention mechanism-based multi-flow segmented network human body action recognition method and system
CN111931602A (en) * 2020-07-22 2020-11-13 北方工业大学 Multi-stream segmented network human body action identification method and system based on attention mechanism
CN112183240B (en) * 2020-09-11 2022-07-22 山东大学 Double-current convolution behavior identification method based on 3D time stream and parallel space stream
CN112183240A (en) * 2020-09-11 2021-01-05 山东大学 Double-current convolution behavior identification method based on 3D time stream and parallel space stream
CN112396018B (en) * 2020-11-27 2023-06-06 广东工业大学 Badminton player foul action recognition method combining multi-mode feature analysis and neural network
CN112396018A (en) * 2020-11-27 2021-02-23 广东工业大学 Badminton player foul action recognition method combining multi-modal feature analysis and neural network
CN112686193A (en) * 2021-01-06 2021-04-20 东北大学 Action recognition method and device based on compressed video and computer equipment
CN112686193B (en) * 2021-01-06 2024-02-06 东北大学 Action recognition method and device based on compressed video and computer equipment
CN113065451A (en) * 2021-03-29 2021-07-02 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113065451B (en) * 2021-03-29 2022-08-09 四川翼飞视科技有限公司 Multi-mode fused action recognition device and method and storage medium
CN113033430B (en) * 2021-03-30 2023-10-03 中山大学 Artificial intelligence method, system and medium for multi-mode information processing based on bilinear
CN113033430A (en) * 2021-03-30 2021-06-25 中山大学 Bilinear-based artificial intelligence method, system and medium for multi-modal information processing
CN113902995A (en) * 2021-11-10 2022-01-07 中国科学技术大学 Multi-mode human behavior recognition method and related equipment
CN113902995B (en) * 2021-11-10 2024-04-02 中国科学技术大学 Multi-mode human behavior recognition method and related equipment
CN114611584A (en) * 2022-02-21 2022-06-10 上海市胸科医院 CP-EBUS elastic mode video processing method, device, equipment and medium
CN114821206B (en) * 2022-06-30 2022-09-13 山东建筑大学 Multi-modal image fusion classification method and system based on confrontation complementary features
CN114821206A (en) * 2022-06-30 2022-07-29 山东建筑大学 Multi-modal image fusion classification method and system based on confrontation complementary features

Similar Documents

Publication Publication Date Title
CN109460707A (en) Multi-modal action recognition method based on a deep neural network
CN110516620A (en) Method for tracking target, device, storage medium and electronic equipment
Fan et al. Point spatio-temporal transformer networks for point cloud video modeling
CN111985343A (en) Method for constructing behavior recognition deep network model and behavior recognition method
CN114220176A (en) Human behavior recognition method based on deep learning
CN112446342B (en) Key frame recognition model training method, recognition method and device
Fan et al. Deep hierarchical representation of point cloud videos via spatio-temporal decomposition
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
CN110390294B (en) Target tracking method based on bidirectional long-short term memory neural network
CN111444370A (en) Image retrieval method, device, equipment and storage medium thereof
CN113111842A (en) Action recognition method, device, equipment and computer readable storage medium
CN114445430A (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN110942037A (en) Action recognition method for video analysis
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
CN114333002A (en) Micro-expression recognition method based on deep learning of image and three-dimensional reconstruction of human face
CN116311525A (en) Video behavior recognition method based on cross-modal fusion
Cha et al. Learning 3D skeletal representation from transformer for action recognition
CN111626212B (en) Method and device for identifying object in picture, storage medium and electronic device
Huang et al. A detection method of individual fare evasion behaviours on metros based on skeleton sequence and time series
CN112001313A (en) Image identification method and device based on attribution key points
CN115359550A (en) Gait emotion recognition method and device based on Transformer, electronic device and storage medium
CN115063717A (en) Video target detection and tracking method based on key area live-action modeling
Verma et al. Intensifying security with smart video surveillance
Gomes et al. Real time vision for robotics using a moving fovea approach with multi resolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20190312