CN111898442B - Human body action recognition method and device based on multi-modal feature fusion

Human body action recognition method and device based on multi-modal feature fusion

Info

Publication number
CN111898442B
CN111898442B (application CN202010607674.3A; application publication CN111898442A)
Authority
CN
China
Prior art keywords: modality, samples, class, video, mapping matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010607674.3A
Other languages
Chinese (zh)
Other versions
CN111898442A
Inventor
郭军
石梅
常晓军
汤战勇
刘宝英
朱省吾
黄位
贺怡
许鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY
Priority to CN202010607674.3A
Publication of CN111898442A
Application granted
Publication of CN111898442B

Classifications

    • G06V 40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06V 20/41: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y02D 30/70: Reducing energy consumption in wireless communication networks


Abstract

The application provides a human body action recognition method and device based on multi-modal feature fusion. The method uses the WiFi signal, the most widely commercialized wireless signal, and fuses the CSI features of the WiFi signal with video features through a multi-modal feature fusion method, which maps the two different kinds of features onto the same common space and classifies them there, finally identifying the human action category. Experimental results show that when WiFi signals are added and the multi-modal feature fusion method is used, the accuracy of human body motion recognition is significantly improved.

Description

Human body action recognition method and device based on multi-modal feature fusion
Technical Field
The application belongs to the technical field of motion recognition, and particularly relates to a human motion recognition method and device based on multi-modal feature fusion.
Background
Human motion recognition algorithms play a vital role in many areas of computer vision. For video-based motion recognition, the most popular approaches are based on spatio-temporal and optical information analysis. However, the results of these methods are not ideal, owing to poor data-frame quality and ambient light in natural environments.
Existing multi-modal models are divided into unsupervised and supervised algorithms. Unsupervised multi-modal algorithms lack label information, so they cannot obtain a discriminative common space, which leads to poor results. The commonly used supervised multi-modal algorithms are GMA (generalized multi-view analysis) and MvDA (multi-view discriminant analysis), both of which map multi-modal samples onto a common space by seeking a mapping matrix and then classify them there. However, GMA only considers the discriminant information within each modality and ignores the discriminant information between modalities; MvDA accounts for both and thus obtains a discriminative common space, but it has the defect that the final mapping matrix is solved only by generalized eigenvalue decomposition, so the solved mapping matrix is an approximation rather than the globally optimal solution, which reduces the final accuracy.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the application provides a human body action recognition method and device based on multi-modal feature fusion, in which wireless WiFi signals are used to assist the recognition of video features: the two kinds of features are fused and subjected to discriminant analysis by a multi-modal feature fusion scheme to obtain the final human action recognition result. This overcomes the defect that existing human motion recognition schemes, which rely on video features alone, yield unsatisfactory results under optical limitations and similar influences.
In order to achieve the above purpose, the application adopts the following technical scheme:
the method utilizes a multi-modal feature fusion method to fuse the CSI features of WiFi signals with video features, maps the two kinds of features onto a common space through a multi-modal feature fusion model for discriminant analysis, and finally identifies the human action category; the method comprises the following steps:
step 1, preprocessing the data set: the Vi-Wi15 data set comprises video information and the CSI information of the corresponding WiFi signals; a convolutional neural network is adopted to extract the video features in the Vi-Wi15 data set, and the CSI features of the WiFi signals in the Vi-Wi15 data set are extracted with standard statistical measures;
defining the Vi-Wi15 dataset as
X = \{ x_{ijk} \in \mathbb{R}^{D_j} \mid i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij} \}
wherein X is the Vi-Wi15 dataset, x_{ijk} is the k-th sample of the j-th modality in the i-th class, i is the class (each action performed in the video is defined as a class), c is the number of classes, j indexes the modalities, D_j is the dimension of the samples of the j-th modality, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 2, taking the video features obtained in step 1 and the CSI features of the WiFi signals as two modalities respectively, establishing a multi-modal feature fusion model and defining an objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i = \sum_j n_{ij} is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
step 3, solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
step 4: taking the global optimal solution of the mapping matrix obtained in step 3 and mapping the samples onto the common space y through the formula:
Y_{ijk} = v_j^T x_{ijk}, \quad i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij}
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space; v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
finally, the samples mapped onto the common space are classified with a linear SVM, and the human action category is identified.
The application also comprises the following technical characteristics:
specifically, step 3 includes:
converting the matrices S and D of formulas (6) and (7) into positive semi-definite matrices:
knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR (trace ratio) value.
Specifically, the optimal TR value is calculated by using a Newton-Raphson iteration method: initializing: t=0, λ 0 =0
(1) Calculation ofIs a characteristic value of (2);
(2) at an initial value lambda t Next, an iterative strategy is used to solve equation (12) and a first order Taylor expansion is used to approximate λ t Nearby eigenvalues:
where k=1, …, m;
at this time, using Taylor expansion, we approximate the trace difference function f (λ) asIt is for->Summation of the first d larger values:
wherein Is->Is the first i maximum eigenvalues of (a);
(3) by solving the problems thatUpdating lambda t+1
(4) Calculating |lambda t+1t I, when less than the threshold epsilon (epsilon=10) -4 ) The cycle is terminated when the optimum lambda is obtained * =λ t+1 Then calculate the optimal mapping matrix V by using the formula (13) *
Specifically, step 4 includes:
step 4.1: passing the global optimal solution of the mapping matrix obtained in step 3 through the formula Y_{ijk} = v_j^T x_{ijk} to map the samples onto the common space y, testing the classification accuracy with different kernel functions, and selecting the kernel function with the best performance; wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 4.2: cross-validating the kernel function selected in step 4.1, and searching for the optimal parameters of the current kernel function by a parameter search method;
step 4.3: with the kernel function of step 4.1 and the parameters of step 4.2 selected, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category.
A human motion recognition device based on multi-modal feature fusion, comprising:
the data set preprocessing unit is used for extracting video features in the Vi-Wi15 data set by adopting a convolutional neural network and extracting CSI features of WiFi signals in the Vi-Wi15 data set according to a standard statistical algorithm;
the construction unit of the multi-modal feature fusion model, for taking the obtained video features and the CSI features of the WiFi signals as two modalities respectively, establishing the multi-modal feature fusion model and defining the objective function for solving the mapping matrix;
the mapping matrix global optimal solution solving unit, for solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
the action recognition unit, for mapping the multi-modal samples onto the common space y by passing the obtained global optimal solution of the mapping matrix through the formula Y_{ijk} = v_j^T x_{ijk},
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality; and for classifying the samples mapped onto the common space with a linear SVM, finally identifying the human action category.
Compared with the prior art, the application has the following beneficial technical effects:
the application provides a novel method that fuses video and WiFi signals through multi-modal feature fusion and then identifies human actions. The method uses the human action information carried by the WiFi signal to compensate for the information loss caused by environmental factors affecting the actions in the video. The classification task for video actions is completed under feature fusion, which effectively compensates for the partial loss of information and improves classification accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present application.
Detailed Description
Because a moving object reflects wireless signals and changes their amplitude and phase, thereby providing discriminative information, wireless signals can be widely applied to the recognition of moving objects; moreover, wireless signals such as WiFi, RFID, radar and Bluetooth have the advantage of not being affected by optical factors. Accordingly, human motion recognition research based on wireless signals has received increasing attention in recent years. However, a great challenge faced by wireless-signal-based human motion recognition is multipath effects and unavoidable noise interference, which reduce recognition performance. Using wireless signals alone currently gives unsatisfactory results, and the best way to improve human motion recognition performance is undoubtedly to exploit the characteristics of video and wireless signals together. Inspired by the recent success of combining video and radio signals for human gesture recognition, the present scheme fuses WiFi signals into video-based HAR (human action recognition) to improve recognition performance. WiFi is selected here because: 1) WiFi requires no additional device to be carried by the person; 2) as a widely used commercial wireless signal, WiFi-based communication services are established around the world, which means WiFi signals can easily be collected at very low cost.
Feature fusion technology: in the fields of machine learning and computer vision, fusing data features of different modalities is a significant challenge, and in recent years feature fusion techniques have gained increasing attention in multi-modal data analysis. Existing feature fusion techniques fall into three types: 1) early fusion, based on features; 2) late fusion, based on decisions; 3) hybrid fusion, mixing the two. Early fusion fuses the multi-modal features right after feature extraction (typically by simply combining the features, as in the sketch below), but this approach ignores the important correlations between the different modal features and increases computation and storage costs. Late fusion is performed after decisions (classification or regression) have been made from the individual modality features. Hybrid fusion combines the advantages of early and late fusion, but late fusion and hybrid fusion are more complex to implement than early fusion; an efficient multi-modal model is therefore explored to solve this problem.
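For concreteness, the early-fusion baseline mentioned above can be sketched as plain feature concatenation; the array names and the random placeholder data below are illustrative assumptions, not part of the original scheme:

```python
import numpy as np

# Hypothetical per-sample feature matrices; the shapes follow the Vi-Wi15 setup
# described later (4096-dim CNN video features, 635-dim CSI statistics).
video_feat = np.random.rand(2760, 4096)   # one row per sample
wifi_feat = np.random.rand(2760, 635)

# Early fusion: simply concatenate the per-sample modality features.
# This ignores cross-modal correlations, which is the drawback noted above.
fused = np.concatenate([video_feat, wifi_feat], axis=1)   # shape (2760, 4731)
```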
Multi-modal feature fusion: a common approach in multi-modal learning is common-space projection, which projects multi-modal high-dimensional data into a common space to achieve better predictive performance. Generally, multi-modal learning methods are classified into two categories, unsupervised or supervised, depending on whether label information is used. Since unsupervised schemes lack label information, their learned spaces are less discriminative; this scheme therefore builds a supervised model that considers both inter-view and intra-view information, so that the learned common space is more discriminative. On this basis, the application provides a multi-modal feature fusion method that fuses video and WiFi signals to identify human actions.
For the three datasets used in this scheme, WiFi devices were set up on the two sides of the subject to collect WiFi signal data, and two cameras were set up in front of and to the side of the subject to collect video data. To meet the needs of the experimental design, video was recorded at different angles (front and side), while various forms of occlusion (random occlusion and stripe occlusion) were added to the targets in the video during recording. The dataset covers 92 subjects and contains 15 action categories.
Vi-Wi15 dataset: comprises the video information and the CSI features of the corresponding WiFi signals;
Vi-Wi15 (video) dataset: contains only the video information of the Vi-Wi15 dataset;
Vi-Wi15 (WiFi) dataset: contains only the WiFi-signal CSI information of the Vi-Wi15 dataset.
The following specific embodiments of the present application are provided, and it should be noted that the present application is not limited to the following specific embodiments, and all equivalent changes made on the basis of the technical scheme of the present application fall within the protection scope of the present application.
Example 1:
the embodiment provides a human motion recognition method based on multi-modal feature fusion, which fuses the CSI features of WiFi signals with video features, maps the two kinds of features onto the same common space for discriminant analysis, and finally recognizes the human action category. As shown in fig. 1, the multi-modal dataset is processed into numerical features by a data preprocessing module; the objective function of the multi-modal feature fusion model for solving the mapping matrix is then constructed; the globally optimal mapping matrix is solved for the model; finally the input multi-modal samples are mapped onto the common space with the mapping matrix and classified by SVM to obtain the final classification result. The method comprises the following steps:
step 1, preprocessing the dataset: the Vi-Wi15 dataset comprises video information and the CSI information of the corresponding WiFi signals; a convolutional neural network is adopted to extract the video features (4096 dimensions per video) of the Vi-Wi15 dataset, the CSI features of the WiFi signals (635 dimensions) are extracted with standard statistical measures, and Principal Component Analysis (PCA) retaining 95% of the energy is applied to remove redundant information and simplify the data;
defining the Vi-Wi15 dataset as
X = \{ x_{ijk} \in \mathbb{R}^{D_j} \mid i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij} \}
wherein X is the Vi-Wi15 dataset, x_{ijk} is the k-th sample of the j-th modality in the i-th class, i is the class (each action performed in the video is defined as a class), c is the number of classes, j indexes the modalities, D_j is the dimension of the samples of the j-th modality, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
through this step, a video action dataset with WiFi signals that meets the needs of the experiments is obtained; by setting up the experimental scenes, the resulting dataset can simulate the degradation of video quality caused by external environmental influences on surveillance video in real environments, and is used for the experimental demonstration of the new recognition method in this work.
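A minimal preprocessing sketch consistent with step 1 is given below; it assumes the 4096-dimensional CNN video features and 635-dimensional CSI statistics have already been extracted, and uses scikit-learn's PCA to retain 95% of the energy (the function and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(features: np.ndarray) -> np.ndarray:
    """Keep the principal components that retain 95% of the energy (variance)."""
    pca = PCA(n_components=0.95)   # a float in (0, 1) selects components by explained variance
    return pca.fit_transform(features)

# video_x = preprocess(video_feat)   # video_feat: (n_samples, 4096) CNN features
# wifi_x = preprocess(wifi_feat)     # wifi_feat:  (n_samples, 635) CSI statistics
```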
Step 2, taking the video features obtained in step 1 and the CSI features of the WiFi signals as two modalities respectively, establishing a multi-modal feature fusion model and defining an objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, defined in equations (6) and (7) below;
in this embodiment, specifically, the process of defining the objective function is as follows. The goal is to maximize the ratio of inter-class to intra-class scatter in the common space:
\max \frac{\mathrm{Tr}(S_b)}{\mathrm{Tr}(S_w)} \quad (1)
wherein S_b is the inter-class scatter matrix, the subscript b meaning between-class, and S_w is the intra-class scatter matrix, the subscript w meaning within-class;
the inter-class scatter matrix S_b and the intra-class scatter matrix S_w are defined as follows:
S_b = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T \quad (2)
S_w = \sum_{i=1}^{c} \sum_{j=1}^{2} \sum_{k=1}^{n_{ij}} (Y_{ijk} - \mu_i)(Y_{ijk} - \mu_i)^T \quad (3)
wherein \mu_i = \frac{1}{n_i} \sum_{j=1}^{2} \sum_{k=1}^{n_{ij}} Y_{ijk} is the mean of the samples of the i-th class over all modalities, n_i is the number of samples of the i-th class in all modalities, \mu = \frac{1}{n} \sum_{i,j,k} Y_{ijk} is the mean of all samples in the common space, n is the number of all samples, and Y_{ijk} is the sample value in the common space corresponding to x_{ijk};
defining V = [v_1; v_2] \in \mathbb{R}^{m \times d}, where \mathbb{R}^{m \times d} indicates the size of this matrix, v_1 and v_2 are the blocks of V corresponding to the two modalities, and v_l is the l-th column of the matrix V, \mathrm{Tr}(S_b) and \mathrm{Tr}(S_w) can be expressed as follows:
\mathrm{Tr}(S_b) = \mathrm{Tr}(V^T S V) \quad (4)
\mathrm{Tr}(S_w) = \mathrm{Tr}(V^T D V) \quad (5)
the construction of D and S is: D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2); they are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
thus, the objective function of equation (1) can be expressed as:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality, v_2^* is the optimal mapping matrix of the WiFi modality; v_1 and v_2 are the mapping matrices of the video and WiFi modalities and the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices, V = \{v_1, v_2\} is the set of mapping matrices, and D and S are the block matrices with respect to X defined in (6) and (7).
The significance of this step: the video features and the WiFi features are taken as two modalities and mapped onto a common space through the multi-modal feature fusion model for discriminant analysis. In this way the relation between the two modalities is fully used while the original characteristics are maintained; compared with early fusion, this works better than simply adding the two kinds of features, so the final human motion recognition effect is better. A sketch of the construction of S and D follows.
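The NumPy sketch below shows how the block matrices S and D of equations (6) and (7) can be assembled from the class means of the two modalities. It follows the reconstructed form of the equations given above and assumes, as in Vi-Wi15, that both modalities record the same samples; the function name and data layout are illustrative:

```python
import numpy as np

def build_S_D(Xs, labels, n_classes):
    """Assemble the block matrices S = [S^{jr}] and D = [D^{jr}] of equations (6)-(7).

    Xs     : list of two arrays; Xs[j] has shape (D_j, n) with one sample per column,
             column k of both modalities carrying the same class label labels[k].
    labels : integer array of length n with class indices 0 .. n_classes-1.
    """
    dims = [X.shape[0] for X in Xs]
    off = np.concatenate(([0], np.cumsum(dims)))
    n_total = sum(X.shape[1] for X in Xs)          # n: samples over all modalities
    S = np.zeros((off[-1], off[-1]))
    D = np.zeros((off[-1], off[-1]))
    for j, Xj in enumerate(Xs):
        for r, Xr in enumerate(Xs):
            blk = (slice(off[j], off[j + 1]), slice(off[r], off[r + 1]))
            if j == r:                              # delta_jr * sum_{i,k} x x^T
                D[blk] += Xj @ Xj.T
            for i in range(n_classes):
                idx = labels == i
                n_ij = n_ir = idx.sum()             # per-modality counts, assumed aligned
                n_i = n_ij + n_ir                   # class-i samples over both modalities
                m_ij = Xj[:, idx].mean(axis=1, keepdims=True)
                m_ir = Xr[:, idx].mean(axis=1, keepdims=True)
                cross = (n_ij * n_ir / n_i) * (m_ij @ m_ir.T)
                D[blk] -= cross                     # second term of (6)
                S[blk] += cross                     # first term of (7)
            s_j = Xj.sum(axis=1, keepdims=True)
            s_r = Xr.sum(axis=1, keepdims=True)
            S[blk] -= (s_j @ s_r.T) / n_total       # global-mean term of (7)
    return S, D
```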
Step 3, solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
since step 2 by itself does not yield the global optimal solution of the multi-modal feature fusion method, an iterative algorithm based on the Newton-Raphson method is used to solve the trace ratio problem. However, equation (8) cannot be solved directly by the Newton-Raphson method, because it is difficult to determine from equations (6) and (7) whether the matrices S and D are positive semi-definite. Therefore, a strategy for resolving this dilemma is proposed first.
Step 3.1: converting the matrices S and D of formulas (6) and (7) into positive semi-definite matrices:
knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR value.
Calculating the optimal TR value by the Newton-Raphson iterative method. Initialization: t = 0, \lambda_0 = 0.
(1) compute the eigenvalues \sigma_k and eigenvectors u_k of S - \lambda_t D;
(2) at the current value \lambda_t, an iterative strategy is used to solve equation (12), and a first-order Taylor expansion is used to approximate the eigenvalues near \lambda_t:
\tilde{\sigma}_k(\lambda) = \sigma_k - (\lambda - \lambda_t)\, u_k^T D u_k, \quad k = 1, \dots, m;
at this time, using the Taylor expansion, the trace difference function f(\lambda) is approximated by \tilde{f}(\lambda), the sum of the first d larger of the \tilde{\sigma}_k(\lambda):
\tilde{f}(\lambda) = \sum_{k=1}^{d} \tilde{\sigma}_{(k)}(\lambda)
wherein \tilde{\sigma}_{(k)}(\lambda) is the k-th largest of the approximated eigenvalues \tilde{\sigma}_1(\lambda), \dots, \tilde{\sigma}_m(\lambda);
(3) update \lambda_{t+1} by solving \tilde{f}(\lambda_{t+1}) = 0;
(4) compute |\lambda_{t+1} - \lambda_t|; when it is less than the threshold \varepsilon (\varepsilon = 10^{-4}), the loop is terminated and the optimum \lambda^* = \lambda_{t+1} is obtained; the optimal mapping matrix V^* is then calculated using formula (13).
Through these steps the trace ratio problem is converted into a trace difference problem, so that equation (12) can be solved directly by the Newton-Raphson method, obtaining the globally optimal solution of the trace ratio (TR) problem.
The significance of this step: the trace ratio (TR) problem of the multi-modal feature fusion model proposed in step 2 is conventionally solved with generalized eigenvalues, but that processing cannot obtain the global optimal solution, and the resulting approximate solution deviates from the true result. Step 3 therefore solves the TR problem globally with the Newton-Raphson method; since this method requires the matrices S and D to be positive semi-definite, a strategy is proposed to convert the two matrices into positive semi-definite matrices, as in the sketch below. In this way the global optimal solution of the model can be obtained, yielding the best human action recognition result.
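A compact NumPy sketch of step 3 follows. It applies the shifts of equations (9)-(10) and iterates the Newton-Raphson update on the trace difference function; with the top-d eigenspace held fixed, the root of the first-order Taylor model of (12) reduces to the ratio Tr(V^T S V)/Tr(V^T D V), which is the update used below. The function name, the default e_1, e_2, and the subspace dimension d are illustrative assumptions:

```python
import numpy as np

def solve_trace_ratio(S, D, d, e1=1e-6, e2=1e-6, eps=1e-4, max_iter=100):
    """Global solution of max Tr(V^T S V) / Tr(V^T D V) s.t. V^T V = I (equations (8)-(13))."""
    m = S.shape[0]
    D = D + e1 * np.eye(m)          # equation (9): shift D toward positive semi-definiteness
    S = S + e2 * np.eye(m)          # equation (10): shift S likewise (e1, e2 chosen large enough)
    lam = 0.0                       # lambda_0 = 0
    for _ in range(max_iter):
        w, U = np.linalg.eigh(S - lam * D)          # step (1): eigen-decomposition
        V = U[:, np.argsort(w)[::-1][:d]]           # top-d eigenvectors maximize f(lambda_t)
        # Newton step on f(lambda): with the top-d subspace fixed, the root of the
        # first-order Taylor model of (12) is Tr(V^T S V) / Tr(V^T D V).
        lam_new = np.trace(V.T @ S @ V) / np.trace(V.T @ D @ V)
        if abs(lam_new - lam) < eps:                # step (4): |lambda_{t+1} - lambda_t| < eps
            lam = lam_new
            break
        lam = lam_new
    w, U = np.linalg.eigh(S - lam * D)              # equation (13): V* at lambda*
    return U[:, np.argsort(w)[::-1][:d]], lam
```

The learned V then splits by rows into v_1 (the first D_1 rows, video) and v_2 (the remaining rows, WiFi), and each sample is projected as Y_ijk = v_j^T x_ijk before SVM classification.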
Step 4: taking the global optimal solution of the mapping matrix obtained in step 3 and mapping the samples onto the common space y through the formula:
Y_{ijk} = v_j^T x_{ijk}, \quad i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij}
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space; v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
finally, the samples mapped onto the common space are classified with a linear SVM, and the human action category is identified.
Step 4 specifically comprises the following steps:
step 4.1: passing the global optimal solution of the mapping matrix obtained in step 3 through the formula Y_{ijk} = v_j^T x_{ijk} to map the samples onto the common space y, testing the classification accuracy with different kernel functions, and selecting the kernel function with the best performance; wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 4.2: cross-validating the kernel function selected in step 4.1, and searching for the optimal parameters of the current kernel function by a parameter search method;
step 4.3: with the kernel function of step 4.1 and the parameters of step 4.2 selected, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category.
Specifically, for step 4.1, Libsvm is used for classification; Libsvm offers several kernel functions, and the Vi-Wi15 dataset is used to test them and select the optimal kernel function;
for step 4.2, a parameter search is performed on the best kernel function selected in step 4.1, and the best parameters of the kernel function are selected by grid search (a coarse search over a wide range of parameter values first, followed by a detailed search once the promising range has been determined), as sketched below.
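Steps 4.1 and 4.2 can be sketched with scikit-learn's libsvm-backed SVC; the candidate kernel list and the coarse C grid below mirror Table 2 but are otherwise illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

def select_kernel_and_C(Y, labels):
    """Y: (n_samples, d) samples projected onto the common space; labels: (n_samples,)."""
    # Step 4.1: compare candidate kernels by cross-validated accuracy.
    kernels = ["linear", "rbf", "poly", "sigmoid"]
    best_kernel = max(kernels,
                      key=lambda k: cross_val_score(SVC(kernel=k), Y, labels, cv=5).mean())
    # Step 4.2: coarse grid over the penalty factor C (Table 2 settles on C = 0.1
    # for the linear kernel), to be refined around the best value afterwards.
    grid = GridSearchCV(SVC(kernel=best_kernel),
                        param_grid={"C": [1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100, 1000]},
                        cv=5)
    grid.fit(Y, labels)
    return best_kernel, grid.best_params_
```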
Table 1 shows the results of different SVM kernel selections on two datasets, the Vi-Wi15 (video) dataset (sample dimension 4096, 2760 samples, 15 categories) and the Vi-Wi15 dataset (sample dimension video 4096 + WiFi 635, 2760 samples, 15 categories). Table 1 shows that the linear kernel is best, and that the ACC of the dataset incorporating the WiFi signal is higher than that of the dataset with only video features, which also demonstrates that adding the WiFi signal can assist the analysis of human motion recognition.
Table 1 results of selection of different kernel functions of SVM
Table 2 shows, for the front no-occlusion case, the ACC values corresponding to different SVM penalty factors. The ACC result is highest and stable when the penalty factor is greater than 0.01; a value of 0.1 is therefore chosen as the penalty factor C. It can again be seen that human motion recognition works better with the assistance of WiFi.
TABLE 2 results of selection of different penalty factors C for SVM
Data set        | C = 0.0001 | 0.001  | 0.01   | 0.1    | 1      | 10     | 100    | 1000
Vi-Wi15 (video) | 47.61%     | 62.72% | 65.51% | 65.43% | 65.43% | 65.43% | 65.43% | 65.43%
Vi-Wi15         | 58.70%     | 75.07% | 76.12% | 76.05% | 76.05% | 76.05% | 76.05% | 76.05%
For step 4.3, after the best kernel function and parameters are selected, classification is performed with cross-validation to obtain the final classification accuracy.
The significance of this step: selecting the best kernel function and its parameters ensures the reliability and accuracy of the experiment and prevents unsatisfactory experimental results caused by inappropriate kernel functions or parameters; a proper selection improves the classification effect. Classification uses cross-validation, so the experimental results are not affected by errors induced by the ordering of the samples in the dataset.
Example 2:
the embodiment provides a human motion recognition device based on multi-modal feature fusion, comprising:
the data set preprocessing unit is used for extracting video features in the Vi-Wi15 data set by adopting a convolutional neural network and extracting CSI features of WiFi signals in the Vi-Wi15 data set according to a standard statistical algorithm;
the construction unit of the multi-modal feature fusion model, for taking the obtained video features and the CSI features of the WiFi signals as two modalities respectively, establishing the multi-modal feature fusion model and defining the objective function for solving the mapping matrix;
the mapping matrix global optimal solution solving unit, for solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
the action recognition unit, for mapping the multi-modal samples onto the common space y by passing the obtained global optimal solution of the mapping matrix through the formula Y_{ijk} = v_j^T x_{ijk},
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality; and for classifying the samples mapped onto the common space with a linear SVM, finally identifying the human action category.
Performance analysis:
(1) Table 3 gives the specific information of the three datasets:
Table 3 Specific information of the three datasets

Data set        | Sample dimension      | Number of samples | Number of classes
Vi-Wi15         | video 4096 + WiFi 635 | 2760              | 15
Vi-Wi15 (video) | 4096                  | 2760              | 15
Vi-Wi15 (WiFi)  | 635                   | 2760              | 15
(2) Evaluation criteria: the scheme completes the action recognition classification task by carrying out the steps above. Accuracy (ACC), with the cluster-to-label mapping commonly used in clustering evaluation, is used as the evaluation criterion of classification performance. For the i-th sample in the dataset, if g_i is defined as the finally obtained cluster label and h_i as the real label, the calculation formula of ACC is:
\mathrm{ACC} = \frac{\sum_{i=1}^{N} \delta\big(h_i, \mathrm{map}(g_i)\big)}{N}
where N is the number of samples in the training set, map(g_i) is a mapping function that maps the obtained cluster labels onto the real labels, and \delta is a matching function with \delta(x, y) = 1 if x = y and 0 otherwise.
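The mapping map(·) in the ACC formula can be computed with the Hungarian algorithm; the sketch below uses scipy's linear_sum_assignment and assumes integer labels (the function name is illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def acc(h, g):
    """ACC = (1/N) * sum_i delta(h_i, map(g_i)); map(.) found by optimal assignment."""
    h, g = np.asarray(h), np.asarray(g)
    n_cls = int(max(h.max(), g.max())) + 1
    count = np.zeros((n_cls, n_cls), dtype=int)   # count[gi, hi]: co-occurrence counts
    for hi, gi in zip(h, g):
        count[gi, hi] += 1
    rows, cols = linear_sum_assignment(-count)    # maximize the total matched counts
    mapping = dict(zip(rows, cols))               # map(.): obtained label -> real label
    return float(np.mean([mapping[gi] == hi for hi, gi in zip(h, g)]))
```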
(3) Analysis of results:
First, in order to examine the effect of the shooting angle, two different shooting angles (front and side) are applied to the original video frames, and the three datasets of table 3 are tested.
Table 4 results of three dataset performance evaluations
Table 4 shows a comparison between videos at different viewing angles, namely the front and side views without occlusion. It is easy to see that the side view performs better than the front view. This suggests that the side view is better suited to human motion recognition on our dataset: it contains most of the information, and the information loss is not severe. Furthermore, the multi-modal feature fusion scheme is about 10% higher than the early fusion method, achieving an excellent human motion recognition effect.
Table 5 results of three dataset performance evaluations
Second, to simulate real-world environmental constraints, two occlusion modes (stripe occlusion and block occlusion) are applied to the original video frames, and experiments are performed on the three datasets respectively. The experimental results are shown in table 5. From table 5 we can see that when the video is occluded by stripes or blocks, the final performance drops by more than 10% compared with the accuracy of the front no-occlusion case; with the help of the WiFi features, the performance of the classifier improves markedly, and multi-modal feature fusion still gives the best effect.
TABLE 6 Classification accuracy results for four different multimodal algorithms
Scene                   | GMLDA  | GMMFA  | MvDA   | MvDAvc | The scheme of the application
Front, no occlusion     | 76.23% | 83.43% | 82.50% | 82.86% | 82.86%
Front, stripe occlusion | 57.75% | 63.44% | 62.43% | 64.35% | 75.72%
Front, block occlusion  | 61.63% | 67.14% | 68.26% | 69.42% | 80.29%
Side, no occlusion      | 78.91% | 83.77% | 83.80% | 84.39% | 90.40%
Finally, table 6 shows the classification accuracy of four existing multi-modal algorithms and the present scheme on the Vi-Wi15 dataset with video and WiFi under 4 different scenes (front no occlusion, front stripe occlusion, front block occlusion and side no occlusion). As the table shows, the method of the present application obtains the highest accuracy in the two occlusion scenes and the side-view scene, a marked improvement among the multi-modal methods. GMMFA, MvDA and MvDAvc have similar performance, with MvDAvc slightly higher than MvDA by 0.3%-2%. Notably, GMLDA is about 6% lower than the other algorithms, indicating that GMLDA works poorly on the Vi-Wi15 dataset.

Claims (4)

1. A human body action recognition method based on multi-modal feature fusion, characterized in that the method fuses the CSI features of WiFi signals with video features by a multi-modal feature fusion method, maps the two kinds of features onto a common space through a multi-modal feature fusion model for discriminant analysis, and finally recognizes the human action category; the method comprises the following steps:
step 1, preprocessing the data set: the Vi-Wi15 data set comprises video information and the CSI information of the corresponding WiFi signals; a convolutional neural network is adopted to extract the video features in the Vi-Wi15 data set, and the CSI features of the WiFi signals in the Vi-Wi15 data set are extracted with standard statistical measures;
defining the Vi-Wi15 dataset as
X = \{ x_{ijk} \in \mathbb{R}^{D_j} \mid i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij} \}
wherein X is the Vi-Wi15 dataset, x_{ijk} is the k-th sample of the j-th modality in the i-th class, i is the class (each action performed in the video is defined as a class), c is the number of classes, j indexes the modalities, D_j is the dimension of the samples of the j-th modality, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 2, taking the video features obtained in step 1 and the CSI features of the WiFi signals as two modalities respectively, establishing a multi-modal feature fusion model and defining an objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i = \sum_j n_{ij} is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
step 3, solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
step 4: taking the global optimal solution of the mapping matrix obtained in step 3 and mapping the samples onto the common space y through the formula:
Y_{ijk} = v_j^T x_{ijk}, \quad i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij}
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space; v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
finally, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category;
step 3 specifically includes:
converting the matrices S and D of formulas (6) and (7) into positive semi-definite matrices:
knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR (trace ratio) value.
2. The human motion recognition method based on multi-modal feature fusion of claim 1, wherein the optimal TR value is calculated by the Newton-Raphson iterative method. Initialization: t = 0, \lambda_0 = 0.
(1) compute the eigenvalues \sigma_k and eigenvectors u_k of S - \lambda_t D;
(2) at the current value \lambda_t, an iterative strategy is used to solve equation (12), and a first-order Taylor expansion is used to approximate the eigenvalues near \lambda_t:
\tilde{\sigma}_k(\lambda) = \sigma_k - (\lambda - \lambda_t)\, u_k^T D u_k, \quad k = 1, \dots, m;
at this time, using the Taylor expansion, the trace difference function f(\lambda) is approximated by \tilde{f}(\lambda), the sum of the first d larger of the \tilde{\sigma}_k(\lambda):
\tilde{f}(\lambda) = \sum_{k=1}^{d} \tilde{\sigma}_{(k)}(\lambda)
wherein \tilde{\sigma}_{(k)}(\lambda) is the k-th largest of the approximated eigenvalues \tilde{\sigma}_1(\lambda), \dots, \tilde{\sigma}_m(\lambda);
(3) update \lambda_{t+1} by solving \tilde{f}(\lambda_{t+1}) = 0;
(4) compute |\lambda_{t+1} - \lambda_t|; when it is less than the threshold \varepsilon (\varepsilon = 10^{-4}), the loop is terminated and the optimum \lambda^* = \lambda_{t+1} is obtained; the optimal mapping matrix V^* is then calculated using formula (13).
3. The human motion recognition method based on multi-modal feature fusion of claim 1, wherein step 4 specifically includes:
step 4.1: passing the global optimal solution of the mapping matrix obtained in step 3 through the formula Y_{ijk} = v_j^T x_{ijk} to map the samples onto the common space y, testing the classification accuracy with different kernel functions, and selecting the kernel function with the best performance; wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 4.2: cross-validating the kernel function selected in step 4.1, and searching for the optimal parameters of the current kernel function by a parameter search method;
step 4.3: with the kernel function of step 4.1 and the parameters of step 4.2 selected, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category.
4. A human motion recognition device based on multi-modal feature fusion, comprising:
the data set preprocessing unit is used for extracting video features in the Vi-Wi15 data set by adopting a convolutional neural network and extracting CSI features of WiFi signals in the Vi-Wi15 data set according to a standard statistical algorithm;
the construction unit of the multi-modal feature fusion model, for taking the obtained video features and the CSI features of the WiFi signals as two modalities respectively, establishing the multi-modal feature fusion model and defining the objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i = \sum_j n_{ij} is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
the mapping matrix global optimal solution solving unit, for solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model:
the matrices S and D of formulas (6) and (7) are converted into positive semi-definite matrices; knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR (trace ratio) value;
the action recognition unit, for mapping the multi-modal samples onto the common space y by passing the obtained global optimal solution of the mapping matrix through the formula Y_{ijk} = v_j^T x_{ijk},
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality; and for classifying the samples mapped onto the common space with a linear SVM, finally identifying the human action category.
CN202010607674.3A (filed 2020-06-29, priority 2020-06-29): Human body action recognition method and device based on multi-modal feature fusion; granted as CN111898442B; status: Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010607674.3A (CN111898442B) | priority 2020-06-29 | filed 2020-06-29 | Human body action recognition method and device based on multi-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010607674.3A (CN111898442B) | priority 2020-06-29 | filed 2020-06-29 | Human body action recognition method and device based on multi-modal feature fusion

Publications (2)

Publication Number Publication Date
CN111898442A CN111898442A (en) 2020-11-06
CN111898442B (en) 2023-08-11

Family

ID=73207221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010607674.3A (CN111898442B, Active) | priority 2020-06-29 | filed 2020-06-29 | Human body action recognition method and device based on multi-modal feature fusion

Country Status (1)

Country Link
CN (1) CN111898442B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033351B (en) * 2021-03-11 2023-04-07 西北大学 CSI sensing identification method based on video analysis
CN113111778B (en) * 2021-04-12 2022-11-15 内蒙古大学 Large-scale crowd analysis method with video and wireless integration
CN113435603A (en) * 2021-06-01 2021-09-24 浙江师范大学 Agent graph improvement-based late-stage fusion multi-core clustering machine learning method and system
CN116579967B (en) * 2023-07-12 2023-09-12 天津亿科科技有限公司 Three-dimensional point cloud image fusion system based on computer vision

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109059895A (en) * 2018-03-28 2018-12-21 南京航空航天大学 A kind of multi-modal indoor ranging and localization method based on mobile phone camera and sensor
WO2019090878A1 (en) * 2017-11-09 2019-05-16 合肥工业大学 Analog circuit fault diagnosis method based on vector-valued regularized kernel function approximation
EP3492945A1 (en) * 2017-12-01 2019-06-05 Origin Wireless, Inc. Method, apparatus, and system for periodic motion detection and monitoring
CN110892408A (en) * 2017-02-07 2020-03-17 迈恩德玛泽控股股份有限公司 Systems, methods, and apparatus for stereo vision and tracking

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892408A (en) * 2017-02-07 2020-03-17 迈恩德玛泽控股股份有限公司 Systems, methods, and apparatus for stereo vision and tracking
WO2019090878A1 (en) * 2017-11-09 2019-05-16 合肥工业大学 Analog circuit fault diagnosis method based on vector-valued regularized kernel function approximation
EP3492945A1 (en) * 2017-12-01 2019-06-05 Origin Wireless, Inc. Method, apparatus, and system for periodic motion detection and monitoring
CN109059895A (en) * 2018-03-28 2018-12-21 南京航空航天大学 A kind of multi-modal indoor ranging and localization method based on mobile phone camera and sensor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multimodal face and fingerprint recognition based on generalized canonical correlation analysis fusion and robust probabilistic collaborative representation; 张静, 刘欢喜, 丁德锐, 肖建力; Journal of University of Shanghai for Science and Technology (02); full text *

Also Published As

Publication number Publication date
CN111898442A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111898442B (en) Human body action recognition method and device based on multi-mode feature fusion
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN111814584B (en) Vehicle re-identification method based on multi-center measurement loss under multi-view environment
Tsai et al. Image co-saliency detection and co-segmentation via progressive joint optimization
CN113408492B (en) Pedestrian re-identification method based on global-local feature dynamic alignment
Poursaeed et al. Deep fundamental matrix estimation without correspondences
Nuevo et al. RSMAT: Robust simultaneous modeling and tracking
Dong Optimal Visual Representation Engineering and Learning for Computer Vision
Zhang et al. Second-and high-order graph matching for correspondence problems
Lu et al. Improving 3d vulnerable road user detection with point augmentation
Feizi Hierarchical detection of abnormal behaviors in video surveillance through modeling normal behaviors based on AUC maximization
Zhang et al. Capturing the grouping and compactness of high-level semantic feature for saliency detection
Takezoe et al. Deep active learning for computer vision: Past and future
CN110751005B (en) Pedestrian detection method integrating depth perception features and kernel extreme learning machine
Liao et al. Multi-scale saliency features fusion model for person re-identification
Cilla et al. Human action recognition with sparse classification and multiple‐view learning
Zhu et al. Human pose estimation with multiple mixture parts model based on upper body categories
Sajid et al. Facial asymmetry-based feature extraction for different applications: a review complemented by new advances
Zhou et al. Retrieval and localization with observation constraints
Li et al. Action recognition with spatio-temporal augmented descriptor and fusion method
Li et al. Spatial and temporal information fusion for human action recognition via Center Boundary Balancing Multimodal Classifier
Ying et al. Dynamic random regression forests for real-time head pose estimation
Keyvanpour et al. Detection of individual activities in video sequences based on fast interference discovery and semi-supervised method
Kim et al. Scalable representation for 3D object recognition using feature sharing and view clustering
Deng et al. Abnormal Occupancy Grid Map Recognition using Attention Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant