CN111898442B - Human body action recognition method and device based on multi-modal feature fusion

Human body action recognition method and device based on multi-modal feature fusion

Info

Publication number
CN111898442B
CN111898442B (application CN202010607674.3A; application publication CN111898442A)
Authority
CN
China
Prior art keywords: modality, samples, class, video, mapping matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010607674.3A
Other languages
Chinese (zh)
Other versions
CN111898442A
Inventor
郭军
石梅
常晓军
汤战勇
刘宝英
朱省吾
黄位
贺怡
许鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NORTHWEST UNIVERSITY
Original Assignee
NORTHWEST UNIVERSITY
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NORTHWEST UNIVERSITY
Priority to CN202010607674.3A
Publication of CN111898442A
Application granted
Publication of CN111898442B

Classifications

    • G06V 40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 18/253: Pattern recognition; analysing; fusion techniques of extracted features
    • G06V 20/41: Scenes; scene-specific elements in video content; higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y02D 30/70: Reducing energy consumption in wireless communication networks


Abstract

The application provides a human body action recognition method and device based on multi-modal feature fusion. The method uses the WiFi signal, the most widely commercialized wireless signal, and fuses the CSI features of the WiFi signal with video features through a multi-modal feature fusion method, which maps the two different kinds of features onto the same common space and classifies them there, finally identifying the human action category. Experimental results show that when WiFi signals are added and the multi-modal feature fusion method is used, the accuracy of human body motion recognition is significantly improved.

Description

Human body action recognition method and device based on multi-modal feature fusion
Technical Field
The application belongs to the technical field of motion recognition, and particularly relates to a human motion recognition method and device based on multi-modal feature fusion.
Background
Human motion recognition algorithms play a vital role in many areas of computer vision. For video-based motion recognition, the most popular approaches are based on spatio-temporal and optical information analysis. However, the results of these methods are not ideal, owing to poor data-frame quality and ambient light in natural environments.
Existing multi-modal models are divided into unsupervised and supervised algorithms. Unsupervised multi-modal algorithms lack label information, so they cannot obtain a discriminative common space, which leads to poor results. The commonly used supervised multi-modal algorithms are GMA (generalized multi-view analysis) and MvDA (multi-view discriminant analysis), both of which map multi-modal samples onto a common space by seeking a mapping matrix and then classify them there. However, GMA only considers the discriminant information within each modality and ignores the discriminant information between modalities; MvDA accounts for both and thus obtains a discriminative common space, but it has the defect that the final mapping matrix is solved only by generalized eigenvalue decomposition, so the solved mapping matrix is an approximation rather than the globally optimal solution, which reduces the final accuracy.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the application provides a human body action recognition method and device based on multi-modal feature fusion, in which wireless WiFi signals are used to assist the recognition of video features: the two kinds of features are fused and subjected to discriminant analysis by a multi-modal feature fusion scheme to obtain the final human action recognition result. This overcomes the defect that existing human motion recognition schemes, which rely on video features alone, yield unsatisfactory results under optical limitations and similar influences.
In order to achieve the above purpose, the application adopts the following technical scheme:
the method utilizes a multi-modal feature fusion method to fuse the CSI features of WiFi signals with video features, maps the two kinds of features onto a common space through a multi-modal feature fusion model for discriminant analysis, and finally identifies the human action category; the method comprises the following steps:
step 1, preprocessing the data set: the Vi-Wi15 data set comprises video information and the CSI information of the corresponding WiFi signals; a convolutional neural network is adopted to extract the video features in the Vi-Wi15 data set, and the CSI features of the WiFi signals in the Vi-Wi15 data set are extracted with standard statistical measures;
defining the Vi-Wi15 dataset as
X = \{ x_{ijk} \in \mathbb{R}^{D_j} \mid i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij} \}
wherein X is the Vi-Wi15 dataset, x_{ijk} is the k-th sample of the j-th modality in the i-th class, i is the class (each action performed in the video is defined as a class), c is the number of classes, j indexes the modalities, D_j is the dimension of the samples of the j-th modality, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 2, taking the video features obtained in step 1 and the CSI features of the WiFi signals as two modalities respectively, establishing a multi-modal feature fusion model and defining an objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i = \sum_j n_{ij} is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
step 3, solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
step 4: taking the global optimal solution of the mapping matrix obtained in step 3 and mapping the samples onto the common space y through the formula:
Y_{ijk} = v_j^T x_{ijk}, \quad i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij}
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space; v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
finally, the samples mapped onto the common space are classified with a linear SVM, and the human action category is identified.
The application also comprises the following technical characteristics:
specifically, step 3 includes:
converting the matrices S and D of formulas (6) and (7) into positive semi-definite matrices:
knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR (trace ratio) value.
Specifically, the optimal TR value is calculated by using a Newton-Raphson iteration method: initializing: t=0, λ 0 =0
(1) Calculation ofIs a characteristic value of (2);
(2) at an initial value lambda t Next, an iterative strategy is used to solve equation (12) and a first order Taylor expansion is used to approximate λ t Nearby eigenvalues:
where k=1, …, m;
at this time, using Taylor expansion, we approximate the trace difference function f (λ) asIt is for->Summation of the first d larger values:
wherein Is->Is the first i maximum eigenvalues of (a);
(3) by solving the problems thatUpdating lambda t+1
(4) Calculating |lambda t+1t I, when less than the threshold epsilon (epsilon=10) -4 ) The cycle is terminated when the optimum lambda is obtained * =λ t+1 Then calculate the optimal mapping matrix V by using the formula (13) *
Specifically, step 4 includes:
step 4.1: passing the global optimal solution of the mapping matrix obtained in step 3 through the formula Y_{ijk} = v_j^T x_{ijk} to map the samples onto the common space y, testing the classification accuracy with different kernel functions, and selecting the kernel function with the best performance; wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 4.2: cross-validating the kernel function selected in step 4.1, and searching for the optimal parameters of the current kernel function by a parameter search method;
step 4.3: with the kernel function of step 4.1 and the parameters of step 4.2 selected, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category.
A human motion recognition device based on multi-modal feature fusion, comprising:
the data set preprocessing unit is used for extracting video features in the Vi-Wi15 data set by adopting a convolutional neural network and extracting CSI features of WiFi signals in the Vi-Wi15 data set according to a standard statistical algorithm;
the construction unit of the multi-modal feature fusion model, for taking the obtained video features and the CSI features of the WiFi signals as two modalities respectively, establishing the multi-modal feature fusion model and defining the objective function for solving the mapping matrix;
the mapping matrix global optimal solution solving unit, for solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
the action recognition unit, for mapping the multi-modal samples onto the common space y by passing the obtained global optimal solution of the mapping matrix through the formula Y_{ijk} = v_j^T x_{ijk},
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality; and for classifying the samples mapped onto the common space with a linear SVM, finally identifying the human action category.
Compared with the prior art, the application has the following beneficial technical effects:
the application provides a novel method that fuses video and WiFi signals through multi-modal feature fusion and then identifies human actions. The method uses the human action information carried by the WiFi signal to compensate for the information loss caused by environmental factors affecting the actions in the video. The classification task for video actions is completed under feature fusion, which effectively compensates for the partial loss of information and improves classification accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present application.
Detailed Description
Because a moving object reflects wireless signals and changes their amplitude and phase, thereby providing discriminative information, wireless signals can be widely applied to the recognition of moving objects; moreover, wireless signals such as WiFi, RFID, radar and Bluetooth have the advantage of not being affected by optical factors. Accordingly, human motion recognition research based on wireless signals has received increasing attention in recent years. However, a great challenge faced by wireless-signal-based human motion recognition is multipath effects and unavoidable noise interference, which reduce recognition performance. Using wireless signals alone currently gives unsatisfactory results, and the best way to improve human motion recognition performance is undoubtedly to exploit the characteristics of video and wireless signals together. Inspired by the recent success of combining video and radio signals for human gesture recognition, the present scheme fuses WiFi signals into video-based HAR (human action recognition) to improve recognition performance. WiFi is selected here because: 1) WiFi requires no additional device to be carried by the person; 2) as a widely used commercial wireless signal, WiFi-based communication services are established around the world, which means WiFi signals can easily be collected at very low cost.
Feature fusion technology: in the fields of machine learning and computer vision, fusing data features of different modalities is a significant challenge, and in recent years feature fusion techniques have gained increasing attention in multi-modal data analysis. Existing feature fusion techniques fall into three types: 1) early fusion, based on features; 2) late fusion, based on decisions; 3) hybrid fusion, mixing the two. Early fusion fuses the multi-modal features right after feature extraction (typically by simply combining the features, as in the sketch below), but this approach ignores the important correlations between the different modal features and increases computation and storage costs. Late fusion is performed after decisions (classification or regression) have been made from the individual modality features. Hybrid fusion combines the advantages of early and late fusion, but late fusion and hybrid fusion are more complex to implement than early fusion; an efficient multi-modal model is therefore explored to solve this problem.
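For concreteness, the early-fusion baseline mentioned above can be sketched as plain feature concatenation; the array names and the random placeholder data below are illustrative assumptions, not part of the original scheme:

```python
import numpy as np

# Hypothetical per-sample feature matrices; the shapes follow the Vi-Wi15 setup
# described later (4096-dim CNN video features, 635-dim CSI statistics).
video_feat = np.random.rand(2760, 4096)   # one row per sample
wifi_feat = np.random.rand(2760, 635)

# Early fusion: simply concatenate the per-sample modality features.
# This ignores cross-modal correlations, which is the drawback noted above.
fused = np.concatenate([video_feat, wifi_feat], axis=1)   # shape (2760, 4731)
```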
Multi-modal feature fusion: a common approach in multi-modal learning is common-space projection, which projects multi-modal high-dimensional data into a common space to achieve better predictive performance. Generally, multi-modal learning methods are classified into two categories, unsupervised or supervised, depending on whether label information is used. Since unsupervised schemes lack label information, their learned spaces are less discriminative; this scheme therefore builds a supervised model that considers both inter-view and intra-view information, so that the learned common space is more discriminative. On this basis, the application provides a multi-modal feature fusion method that fuses video and WiFi signals to identify human actions.
For the three datasets used in this scheme, WiFi devices were set up on the two sides of the subject to collect WiFi signal data, and two cameras were set up in front of and to the side of the subject to collect video data. To meet the needs of the experimental design, video was recorded at different angles (front and side), while various forms of occlusion (random occlusion and stripe occlusion) were added to the targets in the video during recording. The dataset covers 92 subjects and contains 15 action categories.
Vi-Wi15 dataset: comprises the video information and the CSI features of the corresponding WiFi signals;
Vi-Wi15 (video) dataset: contains only the video information of the Vi-Wi15 dataset;
Vi-Wi15 (WiFi) dataset: contains only the WiFi-signal CSI information of the Vi-Wi15 dataset.
The following specific embodiments of the present application are provided, and it should be noted that the present application is not limited to the following specific embodiments, and all equivalent changes made on the basis of the technical scheme of the present application fall within the protection scope of the present application.
Example 1:
the embodiment provides a human motion recognition method based on multi-modal feature fusion, which fuses the CSI features of WiFi signals with video features, maps the two kinds of features onto the same common space for discriminant analysis, and finally recognizes the human action category. As shown in fig. 1, the multi-modal dataset is processed into numerical features by a data preprocessing module; the objective function of the multi-modal feature fusion model for solving the mapping matrix is then constructed; the globally optimal mapping matrix is solved for the model; finally the input multi-modal samples are mapped onto the common space with the mapping matrix and classified by SVM to obtain the final classification result. The method comprises the following steps:
step 1, preprocessing the dataset: the Vi-Wi15 dataset comprises video information and the CSI information of the corresponding WiFi signals; a convolutional neural network is adopted to extract the video features (4096 dimensions per video) of the Vi-Wi15 dataset, the CSI features of the WiFi signals (635 dimensions) are extracted with standard statistical measures, and Principal Component Analysis (PCA) retaining 95% of the energy is applied to remove redundant information and simplify the data;
defining the Vi-Wi15 dataset as
X = \{ x_{ijk} \in \mathbb{R}^{D_j} \mid i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij} \}
wherein X is the Vi-Wi15 dataset, x_{ijk} is the k-th sample of the j-th modality in the i-th class, i is the class (each action performed in the video is defined as a class), c is the number of classes, j indexes the modalities, D_j is the dimension of the samples of the j-th modality, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
through this step, a video action dataset with WiFi signals that meets the needs of the experiments is obtained; by setting up the experimental scenes, the resulting dataset can simulate the degradation of video quality caused by external environmental influences on surveillance video in real environments, and is used for the experimental demonstration of the new recognition method in this work.
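A minimal preprocessing sketch consistent with step 1 is given below; it assumes the 4096-dimensional CNN video features and 635-dimensional CSI statistics have already been extracted, and uses scikit-learn's PCA to retain 95% of the energy (the function and variable names are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

def preprocess(features: np.ndarray) -> np.ndarray:
    """Keep the principal components that retain 95% of the energy (variance)."""
    pca = PCA(n_components=0.95)   # a float in (0, 1) selects components by explained variance
    return pca.fit_transform(features)

# video_x = preprocess(video_feat)   # video_feat: (n_samples, 4096) CNN features
# wifi_x = preprocess(wifi_feat)     # wifi_feat:  (n_samples, 635) CSI statistics
```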
Step 2, taking the video features obtained in step 1 and the CSI features of the WiFi signals as two modalities respectively, establishing a multi-modal feature fusion model and defining an objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, defined in equations (6) and (7) below;
in this embodiment, specifically, the process of defining the objective function is as follows. The goal is to maximize the ratio of inter-class to intra-class scatter in the common space:
\max \frac{\mathrm{Tr}(S_b)}{\mathrm{Tr}(S_w)} \quad (1)
wherein S_b is the inter-class scatter matrix, the subscript b meaning between-class, and S_w is the intra-class scatter matrix, the subscript w meaning within-class;
the inter-class scatter matrix S_b and the intra-class scatter matrix S_w are defined as follows:
S_b = \sum_{i=1}^{c} n_i (\mu_i - \mu)(\mu_i - \mu)^T \quad (2)
S_w = \sum_{i=1}^{c} \sum_{j=1}^{2} \sum_{k=1}^{n_{ij}} (Y_{ijk} - \mu_i)(Y_{ijk} - \mu_i)^T \quad (3)
wherein \mu_i = \frac{1}{n_i} \sum_{j=1}^{2} \sum_{k=1}^{n_{ij}} Y_{ijk} is the mean of the samples of the i-th class over all modalities, n_i is the number of samples of the i-th class in all modalities, \mu = \frac{1}{n} \sum_{i,j,k} Y_{ijk} is the mean of all samples in the common space, n is the number of all samples, and Y_{ijk} is the sample value in the common space corresponding to x_{ijk};
defining V = [v_1; v_2] \in \mathbb{R}^{m \times d}, where \mathbb{R}^{m \times d} indicates the size of this matrix, v_1 and v_2 are the blocks of V corresponding to the two modalities, and v_l is the l-th column of the matrix V, \mathrm{Tr}(S_b) and \mathrm{Tr}(S_w) can be expressed as follows:
\mathrm{Tr}(S_b) = \mathrm{Tr}(V^T S V) \quad (4)
\mathrm{Tr}(S_w) = \mathrm{Tr}(V^T D V) \quad (5)
the construction of D and S is: D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2); they are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
thus, the objective function of equation (1) can be expressed as:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality, v_2^* is the optimal mapping matrix of the WiFi modality; v_1 and v_2 are the mapping matrices of the video and WiFi modalities and the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices, V = \{v_1, v_2\} is the set of mapping matrices, and D and S are the block matrices with respect to X defined in (6) and (7).
The significance of this step: the video features and the WiFi features are taken as two modalities and mapped onto a common space through the multi-modal feature fusion model for discriminant analysis. In this way the relation between the two modalities is fully used while the original characteristics are maintained; compared with early fusion, this works better than simply adding the two kinds of features, so the final human motion recognition effect is better. A sketch of the construction of S and D follows.
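The NumPy sketch below shows how the block matrices S and D of equations (6) and (7) can be assembled from the class means of the two modalities. It follows the reconstructed form of the equations given above and assumes, as in Vi-Wi15, that both modalities record the same samples; the function name and data layout are illustrative:

```python
import numpy as np

def build_S_D(Xs, labels, n_classes):
    """Assemble the block matrices S = [S^{jr}] and D = [D^{jr}] of equations (6)-(7).

    Xs     : list of two arrays; Xs[j] has shape (D_j, n) with one sample per column,
             column k of both modalities carrying the same class label labels[k].
    labels : integer array of length n with class indices 0 .. n_classes-1.
    """
    dims = [X.shape[0] for X in Xs]
    off = np.concatenate(([0], np.cumsum(dims)))
    n_total = sum(X.shape[1] for X in Xs)          # n: samples over all modalities
    S = np.zeros((off[-1], off[-1]))
    D = np.zeros((off[-1], off[-1]))
    for j, Xj in enumerate(Xs):
        for r, Xr in enumerate(Xs):
            blk = (slice(off[j], off[j + 1]), slice(off[r], off[r + 1]))
            if j == r:                              # delta_jr * sum_{i,k} x x^T
                D[blk] += Xj @ Xj.T
            for i in range(n_classes):
                idx = labels == i
                n_ij = n_ir = idx.sum()             # per-modality counts, assumed aligned
                n_i = n_ij + n_ir                   # class-i samples over both modalities
                m_ij = Xj[:, idx].mean(axis=1, keepdims=True)
                m_ir = Xr[:, idx].mean(axis=1, keepdims=True)
                cross = (n_ij * n_ir / n_i) * (m_ij @ m_ir.T)
                D[blk] -= cross                     # second term of (6)
                S[blk] += cross                     # first term of (7)
            s_j = Xj.sum(axis=1, keepdims=True)
            s_r = Xr.sum(axis=1, keepdims=True)
            S[blk] -= (s_j @ s_r.T) / n_total       # global-mean term of (7)
    return S, D
```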
Step 3, solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
since step 2 by itself does not yield the global optimal solution of the multi-modal feature fusion method, an iterative algorithm based on the Newton-Raphson method is used to solve the trace ratio problem. However, equation (8) cannot be solved directly by the Newton-Raphson method, because it is difficult to determine from equations (6) and (7) whether the matrices S and D are positive semi-definite. Therefore, a strategy for resolving this dilemma is proposed first.
Step 3.1: converting the matrices S and D of formulas (6) and (7) into positive semi-definite matrices:
knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR value.
Calculating the optimal TR value by the Newton-Raphson iterative method. Initialization: t = 0, \lambda_0 = 0.
(1) compute the eigenvalues \sigma_k and eigenvectors u_k of S - \lambda_t D;
(2) at the current value \lambda_t, an iterative strategy is used to solve equation (12), and a first-order Taylor expansion is used to approximate the eigenvalues near \lambda_t:
\tilde{\sigma}_k(\lambda) = \sigma_k - (\lambda - \lambda_t)\, u_k^T D u_k, \quad k = 1, \dots, m;
at this time, using the Taylor expansion, the trace difference function f(\lambda) is approximated by \tilde{f}(\lambda), the sum of the first d larger of the \tilde{\sigma}_k(\lambda):
\tilde{f}(\lambda) = \sum_{k=1}^{d} \tilde{\sigma}_{(k)}(\lambda)
wherein \tilde{\sigma}_{(k)}(\lambda) is the k-th largest of the approximated eigenvalues \tilde{\sigma}_1(\lambda), \dots, \tilde{\sigma}_m(\lambda);
(3) update \lambda_{t+1} by solving \tilde{f}(\lambda_{t+1}) = 0;
(4) compute |\lambda_{t+1} - \lambda_t|; when it is less than the threshold \varepsilon (\varepsilon = 10^{-4}), the loop is terminated and the optimum \lambda^* = \lambda_{t+1} is obtained; the optimal mapping matrix V^* is then calculated using formula (13).
Through these steps the trace ratio problem is converted into a trace difference problem, so that equation (12) can be solved directly by the Newton-Raphson method, obtaining the globally optimal solution of the trace ratio (TR) problem.
The significance of this step: the trace ratio (TR) problem of the multi-modal feature fusion model proposed in step 2 is conventionally solved with generalized eigenvalues, but that processing cannot obtain the global optimal solution, and the resulting approximate solution deviates from the true result. Step 3 therefore solves the TR problem globally with the Newton-Raphson method; since this method requires the matrices S and D to be positive semi-definite, a strategy is proposed to convert the two matrices into positive semi-definite matrices, as in the sketch below. In this way the global optimal solution of the model can be obtained, yielding the best human action recognition result.
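A compact NumPy sketch of step 3 follows. It applies the shifts of equations (9)-(10) and iterates the Newton-Raphson update on the trace difference function; with the top-d eigenspace held fixed, the root of the first-order Taylor model of (12) reduces to the ratio Tr(V^T S V)/Tr(V^T D V), which is the update used below. The function name, the default e_1, e_2, and the subspace dimension d are illustrative assumptions:

```python
import numpy as np

def solve_trace_ratio(S, D, d, e1=1e-6, e2=1e-6, eps=1e-4, max_iter=100):
    """Global solution of max Tr(V^T S V) / Tr(V^T D V) s.t. V^T V = I (equations (8)-(13))."""
    m = S.shape[0]
    D = D + e1 * np.eye(m)          # equation (9): shift D toward positive semi-definiteness
    S = S + e2 * np.eye(m)          # equation (10): shift S likewise (e1, e2 chosen large enough)
    lam = 0.0                       # lambda_0 = 0
    for _ in range(max_iter):
        w, U = np.linalg.eigh(S - lam * D)          # step (1): eigen-decomposition
        V = U[:, np.argsort(w)[::-1][:d]]           # top-d eigenvectors maximize f(lambda_t)
        # Newton step on f(lambda): with the top-d subspace fixed, the root of the
        # first-order Taylor model of (12) is Tr(V^T S V) / Tr(V^T D V).
        lam_new = np.trace(V.T @ S @ V) / np.trace(V.T @ D @ V)
        if abs(lam_new - lam) < eps:                # step (4): |lambda_{t+1} - lambda_t| < eps
            lam = lam_new
            break
        lam = lam_new
    w, U = np.linalg.eigh(S - lam * D)              # equation (13): V* at lambda*
    return U[:, np.argsort(w)[::-1][:d]], lam
```

The learned V then splits by rows into v_1 (the first D_1 rows, video) and v_2 (the remaining rows, WiFi), and each sample is projected as Y_ijk = v_j^T x_ijk before SVM classification.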
Step 4: taking the global optimal solution of the mapping matrix obtained in step 3 and mapping the samples onto the common space y through the formula:
Y_{ijk} = v_j^T x_{ijk}, \quad i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij}
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space; v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
finally, the samples mapped onto the common space are classified with a linear SVM, and the human action category is identified.
Step 4 specifically comprises the following steps:
step 4.1: passing the global optimal solution of the mapping matrix obtained in step 3 through the formula Y_{ijk} = v_j^T x_{ijk} to map the samples onto the common space y, testing the classification accuracy with different kernel functions, and selecting the kernel function with the best performance; wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 4.2: cross-validating the kernel function selected in step 4.1, and searching for the optimal parameters of the current kernel function by a parameter search method;
step 4.3: with the kernel function of step 4.1 and the parameters of step 4.2 selected, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category.
Specifically, for step 4.1, Libsvm is used for classification; Libsvm offers several kernel functions, and the Vi-Wi15 dataset is used to test them and select the optimal kernel function;
for step 4.2, a parameter search is performed on the best kernel function selected in step 4.1, and the best parameters of the kernel function are selected by grid search (a coarse search over a wide range of parameter values first, followed by a detailed search once the promising range has been determined), as sketched below.
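Steps 4.1 and 4.2 can be sketched with scikit-learn's libsvm-backed SVC; the candidate kernel list and the coarse C grid below mirror Table 2 but are otherwise illustrative assumptions:

```python
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

def select_kernel_and_C(Y, labels):
    """Y: (n_samples, d) samples projected onto the common space; labels: (n_samples,)."""
    # Step 4.1: compare candidate kernels by cross-validated accuracy.
    kernels = ["linear", "rbf", "poly", "sigmoid"]
    best_kernel = max(kernels,
                      key=lambda k: cross_val_score(SVC(kernel=k), Y, labels, cv=5).mean())
    # Step 4.2: coarse grid over the penalty factor C (Table 2 settles on C = 0.1
    # for the linear kernel), to be refined around the best value afterwards.
    grid = GridSearchCV(SVC(kernel=best_kernel),
                        param_grid={"C": [1e-4, 1e-3, 1e-2, 0.1, 1, 10, 100, 1000]},
                        cv=5)
    grid.fit(Y, labels)
    return best_kernel, grid.best_params_
```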
Table 1 shows the results of different SVM kernel selections on two datasets, the Vi-Wi15 (video) dataset (sample dimension 4096, 2760 samples, 15 categories) and the Vi-Wi15 dataset (sample dimension video 4096 + WiFi 635, 2760 samples, 15 categories). Table 1 shows that the linear kernel is best, and that the ACC of the dataset incorporating the WiFi signal is higher than that of the dataset with only video features, which also demonstrates that adding the WiFi signal can assist the analysis of human motion recognition.
Table 1 results of selection of different kernel functions of SVM
Table 2 shows, for the front no-occlusion case, the ACC values corresponding to different SVM penalty factors. The ACC result is highest and stable when the penalty factor is greater than 0.01; a value of 0.1 is therefore chosen as the penalty factor C. It can again be seen that human motion recognition works better with the assistance of WiFi.
TABLE 2 results of selection of different penalty factors C for SVM
Data set        | C = 0.0001 | 0.001  | 0.01   | 0.1    | 1      | 10     | 100    | 1000
Vi-Wi15 (video) | 47.61%     | 62.72% | 65.51% | 65.43% | 65.43% | 65.43% | 65.43% | 65.43%
Vi-Wi15         | 58.70%     | 75.07% | 76.12% | 76.05% | 76.05% | 76.05% | 76.05% | 76.05%
For step 4.3, after the best kernel function and parameters are selected, classification is performed with cross-validation to obtain the final classification accuracy.
The significance of this step: selecting the best kernel function and its parameters ensures the reliability and accuracy of the experiment and prevents unsatisfactory experimental results caused by inappropriate kernel functions or parameters; a proper selection improves the classification effect. Classification uses cross-validation, so the experimental results are not affected by errors induced by the ordering of the samples in the dataset.
Example 2:
the embodiment provides a human motion recognition device based on multi-modal feature fusion, comprising:
the data set preprocessing unit is used for extracting video features in the Vi-Wi15 data set by adopting a convolutional neural network and extracting CSI features of WiFi signals in the Vi-Wi15 data set according to a standard statistical algorithm;
the construction unit of the multi-modal feature fusion model, for taking the obtained video features and the CSI features of the WiFi signals as two modalities respectively, establishing the multi-modal feature fusion model and defining the objective function for solving the mapping matrix;
the mapping matrix global optimal solution solving unit, for solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
the action recognition unit, for mapping the multi-modal samples onto the common space y by passing the obtained global optimal solution of the mapping matrix through the formula Y_{ijk} = v_j^T x_{ijk},
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality; and for classifying the samples mapped onto the common space with a linear SVM, finally identifying the human action category.
Performance analysis:
(1) Table 3 gives the specific information of the three datasets:
Table 3 Specific information of the three datasets

Data set        | Sample dimension      | Number of samples | Number of classes
Vi-Wi15         | video 4096 + WiFi 635 | 2760              | 15
Vi-Wi15 (video) | 4096                  | 2760              | 15
Vi-Wi15 (WiFi)  | 635                   | 2760              | 15
(2) Evaluation criteria: the scheme completes the action recognition classification task by carrying out the steps above. Accuracy (ACC), with the cluster-to-label mapping commonly used in clustering evaluation, is used as the evaluation criterion of classification performance. For the i-th sample in the dataset, if g_i is defined as the finally obtained cluster label and h_i as the real label, the calculation formula of ACC is:
\mathrm{ACC} = \frac{\sum_{i=1}^{N} \delta\big(h_i, \mathrm{map}(g_i)\big)}{N}
where N is the number of samples in the training set, map(g_i) is a mapping function that maps the obtained cluster labels onto the real labels, and \delta is a matching function with \delta(x, y) = 1 if x = y and 0 otherwise.
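The mapping map(·) in the ACC formula can be computed with the Hungarian algorithm; the sketch below uses scipy's linear_sum_assignment and assumes integer labels (the function name is illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def acc(h, g):
    """ACC = (1/N) * sum_i delta(h_i, map(g_i)); map(.) found by optimal assignment."""
    h, g = np.asarray(h), np.asarray(g)
    n_cls = int(max(h.max(), g.max())) + 1
    count = np.zeros((n_cls, n_cls), dtype=int)   # count[gi, hi]: co-occurrence counts
    for hi, gi in zip(h, g):
        count[gi, hi] += 1
    rows, cols = linear_sum_assignment(-count)    # maximize the total matched counts
    mapping = dict(zip(rows, cols))               # map(.): obtained label -> real label
    return float(np.mean([mapping[gi] == hi for hi, gi in zip(h, g)]))
```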
(3) Analysis of results:
First, in order to examine the effect of the shooting angle, two different shooting angles (front and side) are applied to the original video frames, and the three datasets of table 3 are tested.
Table 4 results of three dataset performance evaluations
Table 4 shows a comparison between videos at different viewing angles, namely the front and side views without occlusion. It is easy to see that the side view performs better than the front view. This suggests that the side view is better suited to human motion recognition on our dataset: it contains most of the information, and the information loss is not severe. Furthermore, the multi-modal feature fusion scheme is about 10% higher than the early fusion method, achieving an excellent human motion recognition effect.
Table 5 results of three dataset performance evaluations
Second, to simulate real-world environmental constraints, two occlusion modes (stripe occlusion and block occlusion) are applied to the original video frames, and experiments are performed on the three datasets respectively. The experimental results are shown in table 5. From table 5 we can see that when the video is occluded by stripes or blocks, the final performance drops by more than 10% compared with the accuracy of the front no-occlusion case; with the help of the WiFi features, the performance of the classifier improves markedly, and multi-modal feature fusion still gives the best effect.
TABLE 6 Classification accuracy results for four different multimodal algorithms
Scene                   | GMLDA  | GMMFA  | MvDA   | MvDAvc | The scheme of the application
Front, no occlusion     | 76.23% | 83.43% | 82.50% | 82.86% | 82.86%
Front, stripe occlusion | 57.75% | 63.44% | 62.43% | 64.35% | 75.72%
Front, block occlusion  | 61.63% | 67.14% | 68.26% | 69.42% | 80.29%
Side, no occlusion      | 78.91% | 83.77% | 83.80% | 84.39% | 90.40%
Finally, table 6 shows the classification accuracy of four existing multi-modal algorithms and the present scheme on the Vi-Wi15 dataset with video and WiFi under 4 different scenes (front no occlusion, front stripe occlusion, front block occlusion and side no occlusion). As the table shows, the method of the present application obtains the highest accuracy in the two occlusion scenes and the side-view scene, a marked improvement among the multi-modal methods. GMMFA, MvDA and MvDAvc have similar performance, with MvDAvc slightly higher than MvDA by 0.3%-2%. Notably, GMLDA is about 6% lower than the other algorithms, indicating that GMLDA works poorly on the Vi-Wi15 dataset.

Claims (4)

1. A human body action recognition method based on multi-modal feature fusion, characterized in that the method fuses the CSI features of WiFi signals with video features by a multi-modal feature fusion method, maps the two kinds of features onto a common space through a multi-modal feature fusion model for discriminant analysis, and finally recognizes the human action category; the method comprises the following steps:
step 1, preprocessing the data set: the Vi-Wi15 data set comprises video information and the CSI information of the corresponding WiFi signals; a convolutional neural network is adopted to extract the video features in the Vi-Wi15 data set, and the CSI features of the WiFi signals in the Vi-Wi15 data set are extracted with standard statistical measures;
defining the Vi-Wi15 dataset as
X = \{ x_{ijk} \in \mathbb{R}^{D_j} \mid i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij} \}
wherein X is the Vi-Wi15 dataset, x_{ijk} is the k-th sample of the j-th modality in the i-th class, i is the class (each action performed in the video is defined as a class), c is the number of classes, j indexes the modalities, D_j is the dimension of the samples of the j-th modality, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 2, taking the video features obtained in step 1 and the CSI features of the WiFi signals as two modalities respectively, establishing a multi-modal feature fusion model and defining an objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i = \sum_j n_{ij} is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
step 3, solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model;
step 4: taking the global optimal solution of the mapping matrix obtained in step 3 and mapping the samples onto the common space y through the formula:
Y_{ijk} = v_j^T x_{ijk}, \quad i = 1,\dots,c;\; j = 1,2;\; k = 1,\dots,n_{ij}
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space; v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
finally, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category;
step 3 specifically includes:
converting the matrices S and D of formulas (6) and (7) into positive semi-definite matrices:
knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR (trace ratio) value.
2. The human motion recognition method based on multi-modal feature fusion of claim 1, wherein the optimal TR value is calculated by the Newton-Raphson iterative method. Initialization: t = 0, \lambda_0 = 0.
(1) compute the eigenvalues \sigma_k and eigenvectors u_k of S - \lambda_t D;
(2) at the current value \lambda_t, an iterative strategy is used to solve equation (12), and a first-order Taylor expansion is used to approximate the eigenvalues near \lambda_t:
\tilde{\sigma}_k(\lambda) = \sigma_k - (\lambda - \lambda_t)\, u_k^T D u_k, \quad k = 1, \dots, m;
at this time, using the Taylor expansion, the trace difference function f(\lambda) is approximated by \tilde{f}(\lambda), the sum of the first d larger of the \tilde{\sigma}_k(\lambda):
\tilde{f}(\lambda) = \sum_{k=1}^{d} \tilde{\sigma}_{(k)}(\lambda)
wherein \tilde{\sigma}_{(k)}(\lambda) is the k-th largest of the approximated eigenvalues \tilde{\sigma}_1(\lambda), \dots, \tilde{\sigma}_m(\lambda);
(3) update \lambda_{t+1} by solving \tilde{f}(\lambda_{t+1}) = 0;
(4) compute |\lambda_{t+1} - \lambda_t|; when it is less than the threshold \varepsilon (\varepsilon = 10^{-4}), the loop is terminated and the optimum \lambda^* = \lambda_{t+1} is obtained; the optimal mapping matrix V^* is then calculated using formula (13).
3. The human motion recognition method based on multi-modal feature fusion of claim 1, wherein step 4 specifically includes:
step 4.1: passing the global optimal solution of the mapping matrix obtained in step 3 through the formula Y_{ijk} = v_j^T x_{ijk} to map the samples onto the common space y, testing the classification accuracy with different kernel functions, and selecting the kernel function with the best performance; wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality;
step 4.2: cross-validating the kernel function selected in step 4.1, and searching for the optimal parameters of the current kernel function by a parameter search method;
step 4.3: with the kernel function of step 4.1 and the parameters of step 4.2 selected, classifying the samples mapped onto the common space with a linear SVM, and finally identifying the human action category.
4. A human motion recognition device based on multi-modal feature fusion, comprising:
the data set preprocessing unit is used for extracting video features in the Vi-Wi15 data set by adopting a convolutional neural network and extracting CSI features of WiFi signals in the Vi-Wi15 data set according to a standard statistical algorithm;
the construction unit of the multi-modal feature fusion model, for taking the obtained video features and the CSI features of the WiFi signals as two modalities respectively, establishing the multi-modal feature fusion model and defining the objective function for solving the mapping matrix:
(v_1^*, v_2^*) = \arg\max_{v_1, v_2} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (8)
wherein v_1^* is the optimal mapping matrix of the video modality and v_2^* is the optimal mapping matrix of the WiFi modality; v_1 is the mapping matrix of the video modality, v_2 is the mapping matrix of the WiFi modality, and both are the independent variables of the formula; V^T = \{v_1^T, v_2^T\} is the set of transposed mapping matrices and V = \{v_1, v_2\} is the set of mapping matrices; D = [D^{jr}] and S = [S^{jr}] (j, r = 1, 2) are block matrices over the inputs X, whose blocks are defined as:
D^{jr} = \delta_{jr} \sum_{i=1}^{c} \sum_{k=1}^{n_{ij}} x_{ijk} x_{ijk}^T - \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T \quad (6)
S^{jr} = \sum_{i=1}^{c} \frac{n_{ij} n_{ir}}{n_i} m_{ij} m_{ir}^T - \frac{1}{n} \Big(\sum_{i=1}^{c} n_{ij} m_{ij}\Big) \Big(\sum_{i=1}^{c} n_{ir} m_{ir}\Big)^T \quad (7)
wherein m_{ij} is the mean of the samples of the i-th class in the j-th modality with respect to the inputs x_{ijk}, and m_{ij}^T is its transpose; n_{ij} is the number of samples of the i-th class in the j-th modality, n_i = \sum_j n_{ij} is the number of samples of the i-th class in all modalities, and n is the number of all samples; c is the number of classes; \delta_{jr} = 1 if j = r and 0 otherwise; m_{ir} is the mean of the samples of the i-th class in the r-th modality with respect to the inputs x_{irk}, and m_{ir}^T is its transpose; n_{ir} is the number of samples of the i-th class in the r-th modality; j = 1 and r = 1 denote the video modality, j = 2 and r = 2 the WiFi modality;
the mapping matrix global optimal solution solving unit, for solving the objective function to obtain the global optimal solution of the mapping matrix in the multi-modal feature fusion model:
the matrices S and D of formulas (6) and (7) are converted into positive semi-definite matrices; knowing that the matrices S and D are symmetric, the constraints in MvDA are relaxed and D and S are replaced by the following strategy:
D = D + e_1 I \quad (9)
S = S + e_2 I \quad (10)
wherein I is the identity matrix of the corresponding size, and e_1 and e_2 are two constants; by choosing e_1 and e_2, D and S are converted into positive semi-definite matrices, so that the global optimal solution of formula (8) can be obtained by the Newton-Raphson method;
with the positive semi-definite S and D so obtained, an orthogonal constraint V^T V = I is added to preserve the global geometry of the data, and the objective function (8) is described as follows:
V^* = \arg\max_{V^T V = I} \frac{\mathrm{Tr}(V^T S V)}{\mathrm{Tr}(V^T D V)} \quad (11)
the optimal solution of equation (11) is equivalent to the root f(\lambda) = 0 of the trace difference function:
f(\lambda) = \max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda D) V\big) \quad (12)
when f(\lambda) = 0, the optimal mapping matrix V^* is:
V^* = \arg\max_{V^T V = I} \mathrm{Tr}\big(V^T (S - \lambda^* D) V\big) \quad (13)
wherein \lambda^* is the optimal TR (trace ratio) value;
the action recognition unit, for mapping the multi-modal samples onto the common space y by passing the obtained global optimal solution of the mapping matrix through the formula Y_{ijk} = v_j^T x_{ijk},
wherein Y_{ijk} is the sample value in the common space corresponding to x_{ijk}, i.e. the k-th sample of the j-th modality in the i-th class projected onto the common space, v_j is the mapping matrix of the j-th modality, i is the class, c is the number of classes, j indexes the modalities, j = 1 denotes the video modality and j = 2 the WiFi modality, and n_{ij} is the number of samples of the i-th class in the j-th modality; and for classifying the samples mapped onto the common space with a linear SVM, finally identifying the human action category.
CN202010607674.3A (filed 2020-06-29, priority 2020-06-29): Human body action recognition method and device based on multi-modal feature fusion; granted as CN111898442B; status: Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010607674.3A (CN111898442B) | priority 2020-06-29 | filed 2020-06-29 | Human body action recognition method and device based on multi-modal feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010607674.3A (CN111898442B) | priority 2020-06-29 | filed 2020-06-29 | Human body action recognition method and device based on multi-modal feature fusion

Publications (2)

Publication Number Publication Date
CN111898442A CN111898442A (en) 2020-11-06
CN111898442B (en) 2023-08-11

Family

ID=73207221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010607674.3A (CN111898442B, Active) | priority 2020-06-29 | filed 2020-06-29 | Human body action recognition method and device based on multi-modal feature fusion

Country Status (1)

Country Link
CN (1) CN111898442B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113033351B (en) * 2021-03-11 2023-04-07 西北大学 CSI sensing identification method based on video analysis
CN113111778B (en) * 2021-04-12 2022-11-15 内蒙古大学 Large-scale crowd analysis method with video and wireless integration
CN113435603A (en) * 2021-06-01 2021-09-24 浙江师范大学 Agent graph improvement-based late-stage fusion multi-core clustering machine learning method and system
CN116579967B (en) * 2023-07-12 2023-09-12 天津亿科科技有限公司 Three-dimensional point cloud image fusion system based on computer vision

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109059895A (en) * 2018-03-28 2018-12-21 南京航空航天大学 A kind of multi-modal indoor ranging and localization method based on mobile phone camera and sensor
WO2019090878A1 (en) * 2017-11-09 2019-05-16 合肥工业大学 Analog circuit fault diagnosis method based on vector-valued regularized kernel function approximation
EP3492945A1 (en) * 2017-12-01 2019-06-05 Origin Wireless, Inc. Method, apparatus, and system for periodic motion detection and monitoring
CN110892408A (en) * 2017-02-07 2020-03-17 迈恩德玛泽控股股份有限公司 Systems, methods, and apparatus for stereo vision and tracking

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110892408A (en) * 2017-02-07 2020-03-17 迈恩德玛泽控股股份有限公司 Systems, methods, and apparatus for stereo vision and tracking
WO2019090878A1 (en) * 2017-11-09 2019-05-16 合肥工业大学 Analog circuit fault diagnosis method based on vector-valued regularized kernel function approximation
EP3492945A1 (en) * 2017-12-01 2019-06-05 Origin Wireless, Inc. Method, apparatus, and system for periodic motion detection and monitoring
CN109059895A (en) * 2018-03-28 2018-12-21 南京航空航天大学 A kind of multi-modal indoor ranging and localization method based on mobile phone camera and sensor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multimodal face and fingerprint recognition based on generalized canonical correlation analysis fusion and robust probabilistic collaborative representation; 张静, 刘欢喜, 丁德锐, 肖建力; Journal of University of Shanghai for Science and Technology (02); full text *

Also Published As

Publication number Publication date
CN111898442A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111898442B (en) Human body action recognition method and device based on multi-mode feature fusion
CN108960140B (en) Pedestrian re-identification method based on multi-region feature extraction and fusion
CN111814584B (en) Vehicle re-identification method based on multi-center measurement loss under multi-view environment
Tsai et al. Image co-saliency detection and co-segmentation via progressive joint optimization
CN113408492B (en) Pedestrian re-identification method based on global-local feature dynamic alignment
Poursaeed et al. Deep fundamental matrix estimation without correspondences
Nuevo et al. RSMAT: Robust simultaneous modeling and tracking
Dong Optimal Visual Representation Engineering and Learning for Computer Vision
Zhang et al. Second-and high-order graph matching for correspondence problems
Lu et al. Improving 3d vulnerable road user detection with point augmentation
Feizi Hierarchical detection of abnormal behaviors in video surveillance through modeling normal behaviors based on AUC maximization
Zhang et al. Capturing the grouping and compactness of high-level semantic feature for saliency detection
Takezoe et al. Deep active learning for computer vision: Past and future
CN110751005B (en) Pedestrian detection method integrating depth perception features and kernel extreme learning machine
Liao et al. Multi-scale saliency features fusion model for person re-identification
Cilla et al. Human action recognition with sparse classification and multiple‐view learning
Zhu et al. Human pose estimation with multiple mixture parts model based on upper body categories
Sajid et al. Facial asymmetry-based feature extraction for different applications: a review complemented by new advances
Zhou et al. Retrieval and localization with observation constraints
Li et al. Action recognition with spatio-temporal augmented descriptor and fusion method
Li et al. Spatial and temporal information fusion for human action recognition via Center Boundary Balancing Multimodal Classifier
Ying et al. Dynamic random regression forests for real-time head pose estimation
Keyvanpour et al. Detection of individual activities in video sequences based on fast interference discovery and semi-supervised method
Kim et al. Scalable representation for 3D object recognition using feature sharing and view clustering
Deng et al. Abnormal Occupancy Grid Map Recognition using Attention Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant