CN117746518A - Multi-mode feature fusion and classification method - Google Patents

Multi-mode feature fusion and classification method

Info

Publication number
CN117746518A
CN117746518A
Authority
CN
China
Prior art keywords
feature
matrix
semantic
classification
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410181868.XA
Other languages
Chinese (zh)
Other versions
CN117746518B (en)
Inventor
黄倩
谢梦婷
胡鹤轩
狄峰
周晓军
张丽华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiuyisanluling Medical Technology Nanjing Co ltd
Hohai University HHU
Original Assignee
Jiuyisanluling Medical Technology Nanjing Co ltd
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiuyisanluling Medical Technology Nanjing Co ltd, Hohai University HHU filed Critical Jiuyisanluling Medical Technology Nanjing Co ltd
Priority to CN202410181868.XA priority Critical patent/CN117746518B/en
Priority claimed from CN202410181868.XA external-priority patent/CN117746518B/en
Publication of CN117746518A publication Critical patent/CN117746518A/en
Application granted granted Critical
Publication of CN117746518B publication Critical patent/CN117746518B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-modal feature fusion and classification method, which relates to the technical field of computer vision. Center-boundary-balanced projection is adopted to balance the center and boundary information of each feature class space, reducing the negative effect of noisy data and improving the compatibility of the feature fusion and classification stages. Two schemes are then employed to project each modality-specific feature into a semantic space, and the labels of the samples are assigned according to the fused multi-modal action semantic features. Finally, the multi-modal feature fusion and classification method is applied to behavior recognition in depth video sequences. Compatibility of feature fusion and classification is thereby improved, and recognition accuracy is greatly improved at the same time.

Description

Multi-mode feature fusion and classification method
Technical Field
The invention relates to the technical field of computer vision, in particular to a multi-modal feature fusion and classification method.
Background
Depth video is an active field in computer vision applications and is commonly used in human-computer interaction, security monitoring, automatic driving, robotic applications, and the like. Studies of action recognition can be classified into four types according to the data they use: RGB (red, green, blue) video, depth, skeleton, and acceleration data. Depth video is easy to collect without additional algorithms or equipment, and it provides extra geometric and shape information that improves accuracy. Feature maps such as MEI (motion energy image), DMI (depth motion image), DMM (depth motion map), MHI (motion history image), HP-DMM (hierarchical pyramid depth motion maps), and MHIs are compact and efficient representations that characterize human behavior. However, these feature maps do not fully exploit the critical spatiotemporal information of a depth video sequence; researchers often use only a single feature map, which makes it difficult to capture sufficient spatiotemporal texture information.
Disclosure of Invention
In order to solve the technical problems, the invention provides a multi-modal feature fusion and classification method, which comprises the following steps:
S1, mapping original features from different modalities into a common semantic feature space. Let K_1, …, K_N denote N multi-modal training samples, where each training sample contains M sub-samples with the same semantics from different modalities; N and M are the number of samples and the number of modalities, respectively. The training sample data from the M modalities represent the same semantics, and their sub-samples share the same semantic label in the common semantic space, denoted by y_i. Let Y = [y_1, …, y_N] denote the semantic label matrix of the training samples, where y_i is the semantic label vector of K_i and C is the number of classes. Let X_p denote the original feature matrix from the p-th modality, where D_p is the feature dimension of X_p;
S2, calculating a high confidence center of each feature class space;
S3, learning boundary and center information of the sample class space by using a joint loss function;
S4, selecting multi-modal local features, and classifying by analyzing the semantic features of each modality and fusing the multi-modal information;
S5, selecting multi-modal global features, and fusing the multi-modal information by selecting an optimal cross-feature combination;
S6, calculating a projection matrix by adopting an iterative algorithm based on half-quadratic minimization;
S7, after the projection matrices are obtained, projecting the original feature matrices of the test samples into the action semantic space to obtain their semantic feature matrices;
S8, finally outputting the prediction labels of the test samples.
The technical scheme of the invention is as follows:
further, in step S1, the boundary and center information of the feature class space is also learned:
where U_p is a projection matrix and E_p is the high confidence center matrix corresponding to X_p.
In the foregoing multi-modal feature fusion and classification method, in step S2, let x_i^p denote the i-th sample of the original feature matrix X_p, and let x_{i,j}^p denote the j-th feature attribute of x_i^p; let S_j^{p,c} denote the set of the j-th attributes of all original features of the c-th class from the p-th modality, and S_j^{p,c} obeys a normal distribution.
In the aforementioned multi-modal feature fusion and classification method, in step S2, for the c-th feature class space of X_p, parameter estimation is first performed for each feature attribute set S_j^{p,c} using maximum likelihood estimation, and the maximum likelihood estimation function is formalized as follows:
where s_i is an element of S_j^{p,c}, n is the number of samples in the set, and μ and σ² are the statistical mean and variance of S_j^{p,c};
The estimates of the mean and variance are obtained by taking partial derivatives of the maximum likelihood estimation function:
where μ̂ and σ̂² are the estimates of the mean and variance of S_j^{p,c};
The feature attribute set S_j^{p,c} is further divided into three intervals according to μ̂ and σ̂: the two edge intervals (−∞, μ̂ − σ̂) and (μ̂ + σ̂, +∞) are low confidence intervals in which noisy feature attributes are likely to occur, and the middle interval [μ̂ − σ̂, μ̂ + σ̂] is the high confidence interval;
Then, a new mean is calculated over the high confidence interval of each feature attribute set to form the high confidence center of the c-th feature class space; the high confidence center matrix E_p is generated by replacing the original feature data in X_p with the high confidence center of each feature class space.
In the foregoing method for multi-modal feature fusion and classification, in step S3, boundary and center information of a sample class space is learned by using a joint loss function, as shown in the following formula:
where λ is an adjustment parameter for balancing the learning of boundary and center information in the feature class space; the sensitivity of λ reflects the noise level of the training samples: the more sensitive the value of λ, the higher the noise level of the training samples.
In the foregoing method for multi-modal feature fusion and classification, in step S4, the multi-modal local feature selection problem is expressed as a minimization problem:
wherein,regularization parameters for adjusting the sparsity of the projection matrix; delta is a utilization->Norms penalize regularization term to reduce the risk of overfitting; through->Projection matrix of norm penalty->Is sparse with the ability to embed select features.
In the foregoing method for multi-modal feature fusion and classification, in step S5, the multi-modal global feature selection problem is expressed as a minimization problem:
where X is the multi-modal feature matrix obtained by concatenating the original feature matrices of all modalities; E and U are the high confidence center matrix and the projection matrix corresponding to X, respectively; ‖·‖_F and ‖·‖_{2,1} are the Frobenius norm and the L2,1 norm, respectively.
In the foregoing multi-modal feature fusion and classification method, in step S4, to facilitate differentiation, the multi-modal local feature selection problem is re-expressed as:
where R_p is obtained by diagonalizing the auxiliary vector r_p of U_p; the i-th element r_p^i of r_p is calculated by the following formula:
where u_p^i is the i-th row vector of U_p, and ε is a constant used to prevent the denominator from becoming zero;
The loss function of the multi-modal local feature selection problem is differentiated with respect to U_p and set to zero, as follows:
the formula is rewritten as:
in the method for multi-modal feature fusion and classification, in step S6, a projection matrix is calculated by using an iterative algorithm based on half-quadratic minimization, as shown in the following formula:
in the aforementioned method for multi-modal feature fusion and classification, in step S7, after obtaining the projection matrix, the test sample is testedProjecting the original feature matrix of (2) into the action semantic space to obtain their semantic feature matrix
Wherein,,/>the number of test samples is the number of test samples;
wherein,is from the p-th th Sample of modality i th Semantic features, but->Representation->Belonging to the j th th Probability of class;
is provided withRepresenting i consisting of M original features from different modalities th Test specimens, i.e.The method comprises the steps of carrying out a first treatment on the surface of the After multimodal projection, the drug is administered by the method of the present invention>Respectively produce->M semantic features of (a);
test sampleClass labels of (c) are calculated as follows:
wherein a is p Is the weight of each modality semantic feature; setting upI.e. it means that the impact of each modality sample on label prediction is the same.
The beneficial effects of the invention are as follows:
(1) In the invention, the compatibility of feature fusion and classification is improved by adopting a center-boundary-balanced projection method; aiming at the problems of information redundancy and isolation in multi-modal data, two feature selection and fusion schemes are provided and experimentally verified;
(2) In the invention, comparison experiments are carried out with different data modalities (depth video data, RGB video data, skeleton data and acceleration data) on the MSR Action 3D, UTD-MHAD, DHA and NTU RGB+D 60 datasets, and the results show that the method performs best when using depth data. On the MSR Action 3D dataset, the method achieves higher recognition accuracy than other state-of-the-art methods, in particular 0.18% higher than the DDPDI algorithm and 0.71% higher than the Range-Sample Feature algorithm. The multi-modal feature fusion and classification method of the present invention also performs best on the UTD-MHAD dataset, with recognition accuracy at least 0.84% higher than other methods.
Drawings
FIG. 1 is a schematic diagram of a multi-modal feature fusion and classification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the calculation of a high confidence center in an embodiment of the present invention.
Description of the embodiments
The multi-modal feature fusion and classification method provided in this embodiment, as shown in Fig. 1 and Fig. 2, includes the following steps:
S1, mapping original features from different modalities into a common semantic feature space. Let K_1, …, K_N denote N multi-modal training samples, where each training sample contains M sub-samples with the same semantics from different modalities; N and M are the number of samples and the number of modalities, respectively. The training sample data from the M modalities represent the same semantics, and their sub-samples share the same semantic label in the common semantic space, denoted by y_i.
Let Y = [y_1, …, y_N] denote the semantic label matrix of the training samples, where y_i is the semantic label vector of K_i and C is the number of classes.
Let X_p denote the original feature matrix of the p-th modality, where D_p is the feature dimension of X_p.
In order to meet different data requirements of feature fusion and feature classification, boundary and center information of feature class space is learned at the same time:
where U_p is a projection matrix and E_p is the high confidence center matrix corresponding to X_p.
S2, calculating a high confidence center (High Confidence Center, HCC) of each feature class space. When noisy data appear in a class space, the class center is shifted toward the noisy data; a class center affected by noisy data is called a pseudo center, and a pseudo center significantly reduces the fitting performance of the model. To overcome this problem, the HCC of each class space is calculated.
Let x_i^p denote the i-th sample of the original feature matrix X_p, and let x_{i,j}^p denote the j-th feature attribute of x_i^p; let S_j^{p,c} denote the set of the j-th attributes of all original features of the c-th class from the p-th modality. According to statistical laws, S_j^{p,c} obeys a normal distribution.
For the c-th feature class space of X_p, parameter estimation is first performed for each feature attribute set S_j^{p,c} using maximum likelihood estimation, and the maximum likelihood estimation function is formalized as follows:
where s_i is an element of S_j^{p,c}, n is the number of samples in the set, and μ and σ² are the statistical mean and variance of S_j^{p,c}.
The estimates of the mean and variance are obtained by taking partial derivatives of the maximum likelihood estimation function:
where μ̂ and σ̂² are the estimates of the mean and variance of S_j^{p,c}.
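For reference, a plausible reconstruction of these standard Gaussian maximum likelihood formulas is given below; the published formulas are images in the patent, so the exact symbols (S_j^{p,c}, s_i, μ̂, σ̂²) are assumptions used consistently in this text:

```latex
% Log-likelihood of the attribute set S_j^{p,c} = {s_1, ..., s_n} under a Gaussian model
\ln L(\mu,\sigma^2) \;=\; -\frac{n}{2}\ln\!\left(2\pi\sigma^{2}\right)
  \;-\; \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left(s_i-\mu\right)^{2}

% Setting the partial derivatives with respect to \mu and \sigma^2 to zero yields
\hat{\mu} \;=\; \frac{1}{n}\sum_{i=1}^{n} s_i,
\qquad
\hat{\sigma}^{2} \;=\; \frac{1}{n}\sum_{i=1}^{n}\left(s_i-\hat{\mu}\right)^{2}
```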
The feature attribute set S_j^{p,c} is further divided into three intervals according to μ̂ and σ̂: the two edge intervals (−∞, μ̂ − σ̂) and (μ̂ + σ̂, +∞) are low confidence intervals in which noisy feature attributes are likely to occur, and the middle interval [μ̂ − σ̂, μ̂ + σ̂] is the high confidence interval, because noisy data are typically few and fall in the low confidence intervals. Under the normal distribution assumption, the high confidence interval contains approximately 68.3% of the samples near the mean.
Then, a new mean is calculated over the high confidence interval of each feature attribute set to form the high confidence center of the c-th feature class space; the high confidence center matrix E_p is generated by replacing the original feature data in X_p with the high confidence center of each feature class space.
S3, the HCC matrix E_p contains accurate center information of each class space, but it lacks the boundary information of the class space; the original feature matrix X_p contains complete boundary information, but the boundary information in X_p may be affected by noisy samples. Therefore, the boundary and center information of the sample class space are learned using a joint loss function, as shown in the following formula:
where λ is an adjustment parameter for balancing the learning of boundary and center information in the feature class space; the sensitivity of λ reflects the noise level of the training samples: the more sensitive the value of λ, the higher the noise level of the training samples.
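The published formula is an image; a plausible reconstruction of the joint loss, assuming a boundary term on X_p and a center term on E_p weighted by λ, is:

```latex
\min_{U_p}\; \left\| U_p^{\top} X_p - Y \right\|_F^{2}
  \;+\; \lambda \left\| U_p^{\top} E_p - Y \right\|_F^{2}
```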
S4, selecting multi-modal local features, and classifying by analyzing the semantic features of each modality and fusing the multi-modal information. The local feature selection problem of the center boundary balancing multimodal classifier (Center Boundary Balancing Multimodal Classifier, CBBMC) is expressed as a minimization problem:
wherein,regularization parameters for adjusting the sparsity of the projection matrix; delta is a utilization->Norms penalize regularization term to reduce the risk of overfitting; through->Projection matrix of norm penalty->Is sparse with the ability to embed select features.
S5, selecting multi-modal global features, and fusing the multi-modal information by selecting an optimal cross-feature combination. The global feature selection problem of the center boundary balanced multi-modal classifier is expressed as a minimization problem:
where X is the multi-modal feature matrix obtained by concatenating the original feature matrices of all modalities; E and U are the high confidence center matrix and the projection matrix corresponding to X, respectively; ‖·‖_F and ‖·‖_{2,1} are the Frobenius norm and the L2,1 norm, respectively. When M = 1, the calculation process of global feature selection is the same as that of local feature selection.
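A corresponding sketch of the global objective on the concatenated feature matrix (again a hedged reconstruction, with the stacking of X assumed):

```latex
\min_{U}\; \left\| U^{\top} X - Y \right\|_F^{2}
  + \lambda \left\| U^{\top} E - Y \right\|_F^{2}
  + \delta \left\| U \right\|_{2,1},
\qquad
X = \left[ X_1^{\top}, X_2^{\top}, \dots, X_M^{\top} \right]^{\top}
```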
To facilitate differentiation, the local feature selection problem of the center boundary balanced multi-modal classifier is restated as:
where R_p is obtained by diagonalizing the auxiliary vector r_p of U_p; the i-th element r_p^i of r_p is calculated by the following formula:
where u_p^i is the i-th row vector of U_p, and ε is a constant used to prevent the denominator from becoming zero;
The loss function of the local feature selection problem of the center boundary balanced multi-modal classifier is differentiated with respect to U_p and set to zero, as follows:
the formula is rewritten as:
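A hedged reconstruction of this derivation is sketched below: the L2,1 term is replaced by a smooth trace surrogate, the reweighting element r_p^i takes its usual form, and setting the gradient to zero gives a closed-form update. The exact published formulas are images, so these expressions are an assumption consistent with the surrounding definitions:

```latex
\min_{U_p}\;
  \left\| U_p^{\top} X_p - Y \right\|_F^{2}
  + \lambda \left\| U_p^{\top} E_p - Y \right\|_F^{2}
  + \delta\, \operatorname{tr}\!\left( U_p^{\top} R_p U_p \right),
\qquad
r_p^{i} = \frac{1}{2\left\| u_p^{i} \right\|_2 + \varepsilon}

% Setting the derivative with respect to U_p to zero and solving:
U_p = \left( X_p X_p^{\top} + \lambda\, E_p E_p^{\top} + \delta\, R_p \right)^{-1}
      \left( X_p + \lambda\, E_p \right) Y^{\top}
```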
S6, calculating the projection matrix by adopting an iterative algorithm based on half-quadratic minimization, as shown in the following formula:
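The following Python sketch illustrates one way such a half-quadratic iteration can be organized, alternating between the reweighting matrix R_p and a closed-form solve for U_p. It assumes the reconstructed objective above; names, initialization and the stopping rule are illustrative assumptions:

```python
import numpy as np

def solve_projection(X_p, E_p, Y, lam=1.0, delta=0.1, eps=1e-8, n_iter=50):
    """Iteratively estimate the projection matrix U_p for one modality.

    X_p: (D, N) original feature matrix, E_p: (D, N) high confidence center
    matrix, Y: (C, N) semantic label matrix.  Each iteration fixes U_p to
    update the diagonal reweighting matrix R_p induced by the L2,1 norm, then
    solves the resulting regularized least-squares problem in closed form.
    """
    D = X_p.shape[0]
    A = X_p @ X_p.T + lam * (E_p @ E_p.T)       # fixed part of the normal equations
    B = (X_p + lam * E_p) @ Y.T                 # right-hand side, shape (D, C)
    U_p = np.linalg.solve(A + delta * np.eye(D), B)  # ridge-style initialization
    for _ in range(n_iter):
        r = 1.0 / (2.0 * np.linalg.norm(U_p, axis=1) + eps)  # r_p^i per row
        U_new = np.linalg.solve(A + delta * np.diag(r), B)
        if np.linalg.norm(U_new - U_p) < 1e-6:  # assumed convergence test
            U_p = U_new
            break
        U_p = U_new
    return U_p
```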
the two multi-mode feature selection schemes correspond to two multi-mode feature fusion methods, and local feature selection is performed by comprehensively analyzing semantic features of all modes and fusing multi-mode information for classification; global feature selection fuses multimodal information by selecting an optimal combination of intersecting features.
S7, after the projection matrices are obtained, the original feature matrices of the test samples are projected into the action semantic space to obtain their semantic feature matrices:
where N_t is the number of test samples;
where z_i^p is the semantic feature of the i-th sample from the p-th modality, and z_i^p(j) represents the probability that the sample belongs to the j-th class;
Let K̂_i denote the i-th test sample composed of M original features from different modalities; after multi-modal projection, the M semantic features of K̂_i are produced respectively;
The class label of the test sample K̂_i is calculated as follows:
where a_p is the weight of the semantic features of each modality; setting a_p = 1/M means that each modality sample has the same impact on label prediction.
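A compact reconstruction of this weighted voting rule (the symbols z_i^p(j) and a_p follow the text above; the exact published formula is an image):

```latex
\hat{y}_i \;=\; \arg\max_{j \in \{1,\dots,C\}} \; \sum_{p=1}^{M} a_p\, z_i^{p}(j),
\qquad a_p = \frac{1}{M}
```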
S8, finally outputting the prediction label of the test sample.
In the method of this embodiment, the center and boundary information of the feature class space is balanced by adopting center-boundary-balanced projection, which reduces the negative effect of noisy data and improves the compatibility of the feature fusion and classification stages; two schemes are then adopted to project each modality-specific feature into the semantic space; the labels of the samples are assigned according to the multi-modal action semantic features; finally, a multi-modal feature fusion and classification method is designed for behavior recognition in depth video sequences. Compatibility of feature fusion and classification is thereby improved, and recognition accuracy is greatly improved at the same time.
In addition to the embodiments described above, other embodiments of the invention are possible. All technical schemes formed by equivalent substitution or equivalent transformation fall within the protection scope of the invention.

Claims (10)

1. A method for multi-modal feature fusion and classification, characterized by: the method comprises the following steps:
S1, mapping original features from different modalities into a common semantic feature space; let K_1, …, K_N denote N multi-modal training samples, where each training sample contains M sub-samples with the same semantics from different modalities, and N and M are the number of samples and the number of modalities, respectively; the training sample data from the M modalities represent the same semantics, and their sub-samples share the same semantic label in the common semantic space, denoted by y_i; let Y = [y_1, …, y_N] denote the semantic label matrix of the training samples, where y_i is the semantic label vector of K_i and C is the number of classes; let X_p denote the original feature matrix from the p-th modality, where D_p is the feature dimension of X_p;
S2, calculating a high confidence center of each feature class space;
S3, learning boundary and center information of the sample class space by using a joint loss function;
S4, selecting multi-modal local features, and classifying by analyzing the semantic features of each modality and fusing the multi-modal information;
S5, selecting multi-modal global features, and fusing the multi-modal information by selecting an optimal cross-feature combination;
S6, calculating a projection matrix by adopting an iterative algorithm based on half-quadratic minimization;
S7, after the projection matrices are obtained, projecting the original feature matrices of the test samples into the action semantic space to obtain their semantic feature matrices;
S8, finally outputting the prediction labels of the test samples.
2. A method of multimodal feature fusion and classification as claimed in claim 1, wherein: in the step S1, the boundary and center information of the feature class space is also learned:
where U_p is a projection matrix and E_p is the high confidence center matrix corresponding to X_p.
3. A method of multimodal feature fusion and classification as claimed in claim 1, wherein: in the step S2, let x_i^p denote the i-th sample of the original feature matrix X_p, and let x_{i,j}^p denote the j-th feature attribute of x_i^p; let S_j^{p,c} denote the set of the j-th attributes of all original features of the c-th class from the p-th modality, and S_j^{p,c} obeys a normal distribution.
4. A method of multi-modal feature fusion and classification as claimed in claim 3, wherein: in the step S2, for the c-th feature class space of X_p, parameter estimation is first performed for each feature attribute set S_j^{p,c} using maximum likelihood estimation, and the maximum likelihood estimation function is formalized as follows:
where s_i is an element of S_j^{p,c}, n is the number of samples in the set, and μ and σ² are the statistical mean and variance of S_j^{p,c};
The estimates of the mean and variance are obtained by taking partial derivatives of the maximum likelihood estimation function:
where μ̂ and σ̂² are the estimates of the mean and variance of S_j^{p,c};
The feature attribute set S_j^{p,c} is further divided into three intervals according to μ̂ and σ̂: the two edge intervals (−∞, μ̂ − σ̂) and (μ̂ + σ̂, +∞) are low confidence intervals in which noisy feature attributes are likely to occur, and the middle interval [μ̂ − σ̂, μ̂ + σ̂] is the high confidence interval;
Then, a new mean is calculated over the high confidence interval of each feature attribute set to form the high confidence center of the c-th feature class space; the high confidence center matrix E_p is generated by replacing the original feature data in X_p with the high confidence center of each feature class space.
5. A method of multimodal feature fusion and classification as claimed in claim 1, wherein: in the step S3, boundary and center information of the sample class space is learned using the joint loss function, as shown in the following formula:
where λ is an adjustment parameter for balancing the learning of boundary and center information in the feature class space; the sensitivity of λ reflects the noise level of the training samples: the more sensitive the value of λ, the higher the noise level of the training samples.
6. A method of multimodal feature fusion and classification as claimed in claim 1, wherein: in the step S4, the multi-modal local feature selection problem is expressed as a minimization problem:
wherein,regularization parameters for adjusting the sparsity of the projection matrix; delta is a utilization->Norms penalize regularization term to reduce the risk of overfitting; through->Projection matrix of norm penalty->Is sparse withThe ability to embed select features.
7. A method of multimodal feature fusion and classification as claimed in claim 1, wherein: in the step S5, the multi-modal global feature selection problem is expressed as a minimization problem:
where X is the multi-modal feature matrix obtained by concatenating the original feature matrices of all modalities; E and U are the high confidence center matrix and the projection matrix corresponding to X, respectively; ‖·‖_F and ‖·‖_{2,1} are the Frobenius norm and the L2,1 norm, respectively.
8. A method of multi-modal feature fusion and classification as claimed in claim 6, wherein: in the step S4, the multi-modal local feature selection problem is restated as follows:
where R_p is obtained by diagonalizing the auxiliary vector r_p of U_p; the i-th element r_p^i of r_p is calculated by the following formula:
where u_p^i is the i-th row vector of U_p, and ε is a constant used to prevent the denominator from becoming zero;
The loss function of the multi-modal local feature selection problem is differentiated with respect to U_p and set to zero, as follows:
the formula is rewritten as:
9. A method of multimodal feature fusion and classification as claimed in claim 1, wherein: in the step S6, an iterative algorithm based on half-quadratic minimization is used to calculate the projection matrix, as shown in the following formula:
10. A method of multimodal feature fusion and classification as claimed in claim 1, wherein: in the step S7, after the projection matrices are obtained, the original feature matrices of the test samples are projected into the action semantic space to obtain their semantic feature matrices:
where N_t is the number of test samples;
where z_i^p is the semantic feature of the i-th sample from the p-th modality, and z_i^p(j) represents the probability that the sample belongs to the j-th class;
Let K̂_i denote the i-th test sample composed of M original features from different modalities; after multi-modal projection, the M semantic features of K̂_i are produced respectively;
The class label of the test sample K̂_i is calculated as follows:
where a_p is the weight of the semantic features of each modality; setting a_p = 1/M means that each modality sample has the same impact on label prediction.
CN202410181868.XA 2024-02-19 Multi-mode feature fusion and classification method Active CN117746518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410181868.XA CN117746518B (en) 2024-02-19 Multi-mode feature fusion and classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410181868.XA CN117746518B (en) 2024-02-19 Multi-mode feature fusion and classification method

Publications (2)

Publication Number Publication Date
CN117746518A true CN117746518A (en) 2024-03-22
CN117746518B CN117746518B (en) 2024-05-31


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273251A (en) * 2022-07-25 2022-11-01 支付宝(杭州)信息技术有限公司 Model training method, device and equipment based on multiple modes
CN116129233A (en) * 2023-02-23 2023-05-16 华东师范大学 Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
CN116204726A (en) * 2023-04-28 2023-06-02 杭州海康威视数字技术股份有限公司 Data processing method, device and equipment based on multi-mode model
CN116524407A (en) * 2023-05-05 2023-08-01 天津大学 Short video event detection method and device based on multi-modal representation learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant