CN117828281A - Behavior intention recognition method, system and terminal based on cross-modal hypergraph

Info

Publication number
CN117828281A
Authority
CN
China
Prior art keywords
modal
cross
segment
feature
erasure
Prior art date
Legal status
Granted
Application number
CN202410247777.1A
Other languages
Chinese (zh)
Other versions
CN117828281B (en)
Inventor
任卫红
董潜
高宇
王志永
王焦乐
刘洪海
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202410247777.1A
Publication of CN117828281A
Application granted
Publication of CN117828281B
Legal status: Active

Abstract

The invention discloses a behavior intention recognition method, system and terminal based on a cross-modal hypergraph, wherein the method comprises the following steps: obtaining a plurality of different segment modal features of a target object in a target time period; performing time domain information enhancement processing and cross-modal enhancement processing on each segment modal feature to obtain a single-mode enhancement feature and a cross-modal enhancement feature corresponding to each segment modal feature; obtaining a pre-erasure time domain feature, a pre-erasure space domain feature, a post-erasure time domain feature and a post-erasure space domain feature corresponding to each node in a cross-modal hypergraph; fusing these to obtain a final fusion feature; and finally obtaining a behavior prediction result of the target object according to the final fusion feature. The invention combines physical signals and physiological signals, fully utilizes the complementarity of information among different modes, realizes cross-modal interaction and enhancement in the time dimension and the space dimension, can effectively eliminate uncertainty among modes, and realizes cognition and behavior detection of patients.

Description

Behavior intention recognition method, system and terminal based on cross-modal hypergraph
Technical Field
The invention relates to the technical field of computer vision, and in particular to a behavior intention recognition method, system, terminal and computer-readable storage medium based on a cross-modal hypergraph.
Background
Behavior intention understanding for patients with ICU-acquired weakness (ICU-AW) faces two problems: because of the complexity of the ICU environment and because patients differ in rehabilitation stage, their intentions tend to be deep-seated and hidden. A behavior intention recognition system for ICU-AW patients uses computer vision techniques and machine learning techniques to extract information from the patient's physiological and physical signals so as to recognize the patient's current behaviors (shaking the head, opening the mouth, opening the eyes, etc.) and intentions (seeking analgesic assistance, skin dryness, hunger, etc.).
At present, behavior intention recognition methods for patients fall mainly into two categories: recognition methods based on single-modal signals and recognition methods based on multi-modal signals. It is difficult to comprehensively understand the behavior intention of a patient using single-modal signals alone, and such methods generally depend on complex machine learning models, so their results are poorly interpretable. Using multi-modal signals, the complementarity of different modal signals can be exploited to provide additional cues for semantic understanding and behavior analysis. Multi-modal signals can be divided into physical signals and physiological signals, but differing time lags and dimensions between physical signals and physiological signals introduce uncertainty among the multi-modal signals, so that they cannot be effectively aligned and fused.
Accordingly, the prior art is still in need of improvement and development.
Disclosure of Invention
The invention mainly aims to provide a behavior intention recognition method, system, terminal and computer-readable storage medium based on a cross-modal hypergraph, and aims to solve the problem in the prior art that, when recognizing the behavior intention of a patient, physical signals and physiological signals are difficult to align and fuse effectively, so that the accuracy of the generated behavior intention prediction result is low.
In order to achieve the above object, the present invention provides a behavior intention recognition method based on a cross-modal hypergraph, the behavior intention recognition method based on the cross-modal hypergraph includes the following steps:
acquiring a plurality of different modal signals of a target object in a target time period, and preprocessing the different modal signals according to the target time period and each modal signal to obtain a plurality of segment modal characteristics corresponding to each modal signal;
constructing a cross-modal hypergraph according to all the segment modal characteristics, and performing time domain information enhancement processing on each segment modal characteristic in the cross-modal hypergraph to obtain a single-mode enhancement characteristic corresponding to each segment modal characteristic;
calculating cross-modal enhancement features corresponding to each segment modal feature according to a cross-modal attention mechanism, obtaining pre-erasure time domain features corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement features corresponding to each segment modal feature, and obtaining pre-erasure space domain features corresponding to each node according to the segment modal features of each node;
performing time erasure processing on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and acquiring a post-erasure time domain feature and a post-erasure space domain feature corresponding to each node in the cross-modal hypergraph after the time erasure processing;
and carrying out fusion processing on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain final fusion features, and obtaining a behavior prediction result of the target object in a target time period according to the final fusion features.
Optionally, in the behavior intention recognition method based on cross-modal hypergraph, the obtaining a plurality of different modal signals of the target object in a target time period, and preprocessing according to the target time period and each modal signal to obtain a plurality of segment modal features corresponding to each modal signal specifically includes:
acquiring a plurality of different modal signals of a target object in a target time period, wherein the different modal signals are acquired by different preset sensors;
preprocessing each modal signal to obtain a modal characteristic corresponding to each modal signal;
uniformly dividing each modal feature according to the preset time segment length to obtain a plurality of segment modal features corresponding to each modal feature.
Optionally, in the behavior intention recognition method based on a cross-modal hypergraph, constructing a cross-modal hypergraph according to all the segment modal features, and performing time domain information enhancement processing on each segment modal feature in the cross-modal hypergraph to obtain a single-mode enhancement feature corresponding to each segment modal feature, which specifically includes:
respectively integrating the segment modal characteristics corresponding to each modal signal to obtain nodes corresponding to each modal signal, and constructing a cross-modal hypergraph according to all the nodes;
performing linear mapping operation on segment modal characteristics of each node in the cross-modal hypergraph respectively to obtain a key vector, a value vector and a query vector corresponding to each segment modal characteristic;
constructing a similarity matrix according to the key vector, the value vector and the query vector corresponding to each segment modal feature;
and respectively carrying out time domain information enhancement processing on all the segment modal characteristics according to the similarity matrix to obtain single-mode enhancement characteristics corresponding to each segment modal characteristic.
Optionally, in the behavior intention recognition method based on a cross-modal hypergraph, calculating a cross-modal enhancement feature corresponding to each segment modal feature according to a cross-modal attention mechanism, and obtaining a pre-erasure time domain feature corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement feature corresponding to each segment modal feature, and obtaining a pre-erasure space domain feature corresponding to each node according to the segment modal feature of each node, including:
acquiring a time segment corresponding to each segment modal feature, and classifying all the segment modal features according to the time segments to divide a plurality of segment modal features belonging to the same time into the same group to obtain a plurality of modal feature groups;
taking any segment modal feature in any one modal feature group as a feature to be processed according to a cross-modal attention mechanism, and performing cross-modal enhancement processing on the feature to be processed according to all segment modal features in the same modal feature group as the feature to be processed to obtain a cross-modal enhancement feature corresponding to the feature to be processed;
after the cross-modal enhancement processing of all the segment modal features is completed, obtaining cross-modal enhancement features corresponding to each segment modal feature;
Carrying out fusion processing on the single-mode enhancement features and the cross-mode enhancement features belonging to the same node to obtain pre-erasure time domain features corresponding to each node;
and integrating the segment modal characteristics of each node to obtain node characteristics corresponding to each node, and performing coding processing according to each node characteristic to obtain the pre-erasure space domain characteristics corresponding to each node.
Optionally, in the behavior intention recognition method based on a cross-modal hypergraph, the encoding processing is performed according to each node feature to obtain a pre-erasure space domain feature corresponding to each node, which specifically includes:
respectively carrying out coding processing on each node characteristic through a public encoder and a private encoder to obtain a public characteristic and a private characteristic corresponding to each node characteristic;
respectively carrying out graph distillation treatment on the public characteristic and the private characteristic of each node characteristic to obtain public distillation loss corresponding to each public characteristic and private distillation loss corresponding to each private characteristic;
optimizing the corresponding public characteristics and private characteristics according to each public distillation loss and each private distillation loss to obtain final public characteristics and final private characteristics corresponding to each node;
and respectively carrying out fusion processing and linear mapping on the final public features and the final private features of each node to obtain the pre-erasure space domain features corresponding to each node.
Optionally, in the behavior intention recognition method based on a cross-modal hypergraph, the performing time erasure processing on the cross-modal hypergraph according to a single-mode enhancement feature and a cross-modal enhancement feature corresponding to each segment modal feature, and acquiring a post-erasure time domain feature and a post-erasure space domain feature corresponding to each node in the cross-modal hypergraph after the time erasure processing specifically includes:
respectively carrying out addition processing on the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature to obtain an addition feature corresponding to each segment modal feature, and obtaining a time attention map of each segment modal feature according to each addition feature;
performing time erasure processing on all the time attention maps according to a preset time erasure template, and taking the time attention maps meeting preset conditions as post-erasure time attention maps;
acquiring all the segment modal characteristics corresponding to the time attention map after erasure and taking the segment modal characteristics as the segment modal characteristics after erasure corresponding to each node;
performing time domain information enhancement processing on the erased segment modal characteristics, and then calculating the erased segment modal characteristics according to a cross-modal attention mechanism to obtain erased cross-modal enhancement characteristics corresponding to the erased segment modal characteristics;
and obtaining the erased time domain characteristic corresponding to each node according to the erased single-mode enhancement characteristic and the erased cross-modal enhancement characteristic corresponding to each node, and obtaining the erased space domain characteristic corresponding to each node according to each erased segment modal characteristic.
Optionally, in the behavior intention recognition method based on cross-modal hypergraph, the fusing processing is performed on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain a final fused feature, and a behavior prediction result of the target object in a target time period is obtained according to the final fused feature, which specifically includes:
respectively carrying out fusion processing on the pre-erasure time domain features, the pre-erasure space domain features, the post-erasure time domain features and the post-erasure space domain features corresponding to each node to obtain space-time features corresponding to each node;
The space-time characteristics of all the nodes are connected in series to obtain the final fusion characteristics;
and analyzing and processing the final fusion characteristics by using a pre-trained graph neural network to obtain a behavior prediction result of the target object in a target time period.
In addition, in order to achieve the above object, the present invention further provides a behavior intention recognition system based on a cross-modal hypergraph, wherein the behavior intention recognition system based on the cross-modal hypergraph includes:
the segment characteristic acquisition module is used for acquiring a plurality of different modal signals of a target object in a target time period, and preprocessing the different modal signals according to the target time period and each modal signal to obtain a plurality of segment modal characteristics corresponding to each modal signal;
the single-mode enhancement module is used for constructing a cross-mode hypergraph according to all the segment mode characteristics, and carrying out time domain information enhancement processing on each segment mode characteristic in the cross-mode hypergraph to obtain single-mode enhancement characteristics corresponding to each segment mode characteristic;
the pre-erasure feature acquisition module is used for calculating cross-modal enhancement features corresponding to each segment modal feature according to a cross-modal attention mechanism, obtaining pre-erasure time domain features corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement features corresponding to each segment modal feature, and obtaining pre-erasure space domain features corresponding to each node according to the segment modal features of each node;
the post-erasure feature acquisition module is used for performing time erasure processing on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and acquiring post-erasure time domain features and post-erasure space domain features corresponding to each node in the cross-modal hypergraph after the time erasure processing;
and the prediction result generation module is used for carrying out fusion processing on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain final fusion features, and obtaining a behavior prediction result of the target object in a target time period according to the final fusion features.
In addition, to achieve the above object, the present invention also provides a terminal, wherein the terminal includes: a memory, a processor, and a cross-modal hypergraph-based behavior intent recognition program stored on the memory and executable on the processor, which when executed by the processor, implements the steps of the cross-modal hypergraph-based behavior intent recognition method as described above.
In addition, in order to achieve the above object, the present invention also provides a computer-readable storage medium storing a behavior intention recognition program based on a cross-modal hypergraph, which when executed by a processor, implements the steps of the behavior intention recognition method based on a cross-modal hypergraph as described above.
According to the method, a plurality of different modal signals of a target object in a target time period are obtained, and preprocessing is carried out according to the target time period and each modal signal to obtain a plurality of segment modal characteristics corresponding to each modal signal; a cross-modal hypergraph is constructed according to all the segment modal characteristics, and time domain information enhancement processing is performed on each segment modal characteristic in the cross-modal hypergraph to obtain a single-mode enhancement characteristic corresponding to each segment modal characteristic; cross-modal enhancement features corresponding to each segment modal feature are calculated according to a cross-modal attention mechanism, pre-erasure time domain features corresponding to each node in the cross-modal hypergraph are obtained according to the cross-modal enhancement features corresponding to each segment modal feature, and pre-erasure space domain features corresponding to each node are obtained according to the segment modal features of each node; time erasure processing is performed on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and a post-erasure time domain feature and a post-erasure space domain feature corresponding to each node in the cross-modal hypergraph after the time erasure processing are acquired; and fusion processing is carried out on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain final fusion features, and a behavior prediction result of the target object in the target time period is obtained according to the final fusion features. The invention uses physical signals and physiological signals, fully utilizes the complementarity of information among different modes, realizes cross-modal interaction and enhancement in the time dimension and the space dimension, can effectively eliminate uncertainty among modes, and realizes cognition and behavior detection of patients.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the behavior intent recognition method based on cross-modal hypergraph of the present invention;
FIG. 2 is a schematic diagram of a hardware module of the behavior intent recognition method based on cross-modal hypergraph of the present invention;
FIG. 3 is a schematic diagram of signal presentation of a behavior intent recognition method based on cross-modal hypergraph of the present invention;
FIG. 4 is a flow chart of multi-modal signal acquisition of the behavior intent recognition method based on cross-modal hypergraph of the present invention;
FIG. 5 is a flow chart of time domain feature extraction of the behavior intention recognition method based on cross-modal hypergraph of the invention;
FIG. 6 is a flow chart of space domain feature extraction of the behavior intention recognition method based on cross-modal hypergraph;
FIG. 7 is a complete flow chart of behavior prediction for a behavior intent recognition method based on cross-modal hypergraph of the present invention;
FIG. 8 is a schematic diagram of a preferred embodiment of the behavior intent recognition system based on cross-modal hypergraphs of the present invention;
FIG. 9 is a schematic diagram of the operating environment of a preferred embodiment of the terminal of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear and clear, the present invention will be further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
According to the behavior intention recognition method based on the cross-modal hypergraph, as shown in fig. 1, the behavior intention recognition method based on the cross-modal hypergraph comprises the following steps:
step S10, acquiring a plurality of different modal signals of a target object in a target time period, and preprocessing the different modal signals according to the target time period and each modal signal to obtain a plurality of fragment modal characteristics corresponding to each modal signal.
Specifically, in a preferred embodiment of the present invention, a behavioral intention recognition system for an ICU-AW patient is designed, where the system is a somatosensory network system suitable for ICU scenes, and includes a hardware module for acquiring modal signals of the patient, such as a wearable physiological and physical signal sensing module and a multi-view visual sensing module, so as to collect a plurality of different modal signals of the patient in a target time period, and then pre-process the acquired modal signals in different manners, so as to obtain a plurality of segment modal characteristics corresponding to each modal signal.
Further, the acquiring a plurality of different modal signals of the target object in a target time period, and preprocessing according to the target time period and each modal signal to obtain a plurality of segment modal characteristics corresponding to each modal signal, specifically includes:
Acquiring a plurality of different modal signals of a target object in a target time period, wherein the different modal signals are acquired by different preset sensors; preprocessing each modal signal to obtain a modal characteristic corresponding to each modal signal; and uniformly dividing each modal feature according to the preset time segment length to obtain a plurality of segment modal features corresponding to each modal feature.
Specifically, the hardware modules of the invention are shown in fig. 2 and comprise a wearable physiological and physical signal sensing module and a multi-view visual sensing module. The wearable physiological and physical signal sensing module comprises a scalp electroencephalogram module, a myoelectricity module, an ultrasonic module, an IMU (Inertial Measurement Unit) module and a force touch sensing module; the multi-view visual sensing module mainly comprises a plurality of RGBD cameras (three-dimensional color depth cameras) and can capture the facial expressions of the patient (target object) in real time. The behavior intention recognition system is matched with upper computer software and can display the signals of all modes simultaneously, so that the state of the patient can be observed in real time: the left signal column allows selecting the signals of a certain mode or of a plurality of modes to observe simultaneously, the right signal column displays the original signals in real time, and the menu column controls the interface display and the like.
The system can realize independent work of each mode or synchronous acquisition of all modes, and when each mode sensor is arranged, repeated positions are subjected to integrated design so as to synchronously detect myoelectricity, ultrasound and IMU of limb movement. Because the acquisition of each mode adopts an independent module, the synchronous operation of each mode needs to be realized by using a synchronous center during synchronous acquisition so as to obtain time-aligned multi-mode signals. The method for synchronously aligning the multi-mode data is shown in fig. 4, the synchronous center synchronously transmits the signals for starting acquisition to each module to enable the modules to synchronously start working, the visual module positions and shares the time tag with the synchronous center, and the synchronous center transmits the time tag to each module to enable the visual module to be synchronized, so that a plurality of different mode signals of a patient in a target time period are obtained.
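For illustration, the following is a minimal sketch of this kind of time-tag alignment, assuming each module buffers samples as (timestamp, value) pairs referenced to the synchronization center's clock; the function names, sampling rates and interpolation scheme are illustrative assumptions rather than the system's actual implementation.

```python
import numpy as np

def align_streams(streams, start_tag, duration, rate=50.0):
    """Resample independently clocked modality streams onto a shared clock.

    streams: dict mapping modality name -> (timestamps, values) arrays,
             timestamps in seconds relative to the synchronization center.
    start_tag: shared time tag broadcast by the synchronization center.
    duration: length of the target time period in seconds.
    rate: common resampling rate in Hz (hypothetical value).
    """
    common_t = start_tag + np.arange(0.0, duration, 1.0 / rate)
    aligned = {}
    for name, (t, v) in streams.items():
        # Linear interpolation onto the shared clock, one channel at a time.
        v = np.atleast_2d(v)
        aligned[name] = np.stack(
            [np.interp(common_t, t, ch) for ch in v], axis=0
        )
    return common_t, aligned

# Toy usage with two hypothetical modalities sampled at different rates.
t_emg = np.linspace(0.0, 10.0, 2000)          # 200 Hz
t_imu = np.linspace(0.0, 10.0, 1000)          # 100 Hz
streams = {
    "emg": (t_emg, np.sin(2 * np.pi * 1.0 * t_emg)),
    "imu": (t_imu, np.cos(2 * np.pi * 0.5 * t_imu)),
}
common_t, aligned = align_streams(streams, start_tag=0.0, duration=10.0)
print({k: v.shape for k, v in aligned.items()})  # both (1, 500)
```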
After a plurality of different modal signals are acquired, the system needs to synchronously preprocess each modal signal so as to obtain the corresponding modal characteristics of each modal signal, and the preprocessing modes of the different modal signals are different, specifically as follows:
electroencephalogram signal: firstly, converting an electroencephalogram signal into a time-frequency domain feature map by using a wavelet transformation method, then constructing a space transformation matrix according to the space position distribution of an electroencephalogram acquisition electrode, converting the time-frequency domain feature map into a 3D feature map containing a time-frequency-space domain, and extracting features by using a convolutional neural network and a cyclic neural network; firstly, a convolutional neural network is used for extracting features of a frequency domain and a spatial domain, the features are extracted at the tail end of a structure, the features are converted into feature vectors, the feature vectors are used as input of a cyclic neural network, and the cyclic neural network is used for extracting time features to obtain preprocessed electroencephalogram features.
Myoelectric signal: the electromyographic signals are denoised with a low-pass filter, and a graph neural network is then used to learn the spatial characteristics of the high-density electromyographic signals. Specifically, the different channel signals are regarded as independent nodes, the inter-channel relationships of the signals are modeled with an adjacency matrix, the dependency relationship of each node is calculated by a graph convolution module, and the myoelectric features are then extracted by a depthwise separable convolutional network.
Ultrasonic signal: ultrasonic sensors are placed over muscles such as the rectus femoris, the forearm extensor carpi, the flexor digitorum profundus and the flexor hallucis longus to collect multi-channel ultrasonic signals. A Gaussian band-pass filter is then used to filter the ultrasonic echo signals to obtain echo data corresponding to each channel, the echo data of each channel are uniformly cut into a plurality of sections, features are extracted from the cut data by wavelet packet transform, the features extracted from the echo data of the same channel jointly form the feature vector of that channel, and finally the feature vectors of the different ultrasonic channels are connected in series to obtain the ultrasonic features.
Haptic signal: according to the piezoelectric signals and the piezoresistive signals acquired by the haptic signal acquisition module, the piezoelectric signals and the piezoresistive signals are aggregated by using a convolution layer of a convolution neural network, then feature vectors of the piezoelectric signals and the piezoresistive signals are extracted by using a pooling layer to obtain piezoelectric feature vectors and piezoresistive feature vectors, and the piezoelectric feature vectors and the piezoresistive feature vectors are connected in series and then input into a multi-layer perceptron to obtain the haptic characteristics.
IMU signal (inertial signal): the IMU signal is low-pass filtered in time to eliminate high-frequency interference, a convolutional neural network is then used to extract local features with its convolutional layers, and a pre-trained Transformer (a neural network model) is then used to extract the global features of the IMU as the IMU features (inertial features); see also the sketch after this enumeration.
Visual signal: RGBD cameras with different viewing angles can capture different parts of the human body. For the face, a face detection network is first used to extract the face from the image, and standardization is then carried out according to the RGB (color mode) data of the image; for the body, different parts of the human body such as the shoulders, waist, knees, feet and elbows are abstracted into feature points, each part corresponding to one feature point. The facial expression features, facial key point features and gaze features of the patient and the skeleton point features of the human body are then extracted from the face and the feature points through the backbone feature extraction modules of a pre-trained expression recognition network, a gaze estimation network and a skeleton point detection network, and finally the four kinds of features are spliced in series and output through a multi-layer perceptron to obtain the visual features of the external behavior state of the patient.
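For illustration, a simplified sketch of two representative preprocessing branches described above (the electroencephalogram branch with a convolutional network followed by a recurrent network, and the IMU branch with a convolutional layer followed by a Transformer encoder) is given below; all layer sizes, channel counts and module names are hypothetical and merely stand in for the networks actually used.

```python
import torch
import torch.nn as nn

class EEGBranch(nn.Module):
    """CNN over the time-frequency-space map, then a GRU over time (a sketch)."""
    def __init__(self, in_ch=1, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.rnn = nn.GRU(input_size=16 * 4 * 4, hidden_size=feat_dim,
                          batch_first=True)

    def forward(self, x):              # x: (batch, time, 1, freq, space)
        b, t = x.shape[:2]
        z = self.cnn(x.flatten(0, 1))  # (batch*time, 16, 4, 4)
        z = z.flatten(1).view(b, t, -1)
        out, _ = self.rnn(z)           # temporal features per frame
        return out                     # (batch, time, feat_dim)

class IMUBranch(nn.Module):
    """1-D convolution for local features, Transformer encoder for global ones."""
    def __init__(self, in_ch=6, feat_dim=128):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, feat_dim, kernel_size=5, padding=2)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):              # x: (batch, time, channels)
        z = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.encoder(z)         # (batch, time, feat_dim)

eeg = EEGBranch()(torch.randn(2, 20, 1, 32, 8))
imu = IMUBranch()(torch.randn(2, 20, 6))
print(eeg.shape, imu.shape)            # both (2, 20, 128)
```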
After the electroencephalogram feature, myoelectric feature, ultrasonic feature, IMU feature, force touch feature and visual feature of the patient are obtained, each modal feature is uniformly divided according to the preset time segment length to obtain a plurality of segment modal features corresponding to each modal feature. Taking the recording of the modal features of a patient within a target time period as an example, the target time period is divided into $N$ time segments, $k$ consecutive frames are randomly extracted from each time segment, and a 3D-ResNet (a neural network) feature network is then used to map the features of each mode into the same dimension, yielding segment modal features $F \in \mathbb{R}^{N \times d}$, where $F$ denotes the segment modal features, $N$ denotes the number of time segments and $d$ denotes the dimension of the features. The set of all segment modal features of the same time segment thus comprises a segment electroencephalogram feature, a segment myoelectric feature, a segment ultrasonic feature, a segment IMU feature, a segment force touch feature and a segment visual feature.
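The segment division step can be sketched as follows, with a plain linear projection standing in for the 3D-ResNet feature network; all names, shapes and the sampling strategy are assumptions for illustration.

```python
import torch
import torch.nn as nn

def segment_features(feat, num_segments, k, proj):
    """Split per-frame modality features into segments and embed each segment.

    feat: (frames, in_dim) per-frame features of one modality.
    num_segments: number of time segments N the target period is divided into.
    k: number of consecutive frames randomly drawn from each segment.
    proj: module mapping (k, in_dim) -> (d,) shared-dimension segment feature
          (a stand-in for the 3D-ResNet feature network in the text).
    """
    frames = feat.shape[0]
    seg_len = frames // num_segments
    assert seg_len >= k, "each segment must contain at least k frames"
    segs = []
    for i in range(num_segments):
        # Random starting frame of the k consecutive frames inside segment i.
        start = i * seg_len + torch.randint(0, seg_len - k + 1, (1,)).item()
        segs.append(proj(feat[start:start + k]))
    return torch.stack(segs)           # (N, d) segment modal features

in_dim, d, k, N = 64, 128, 8, 10
proj = nn.Sequential(nn.Flatten(0), nn.Linear(k * in_dim, d))
vis_feat = torch.randn(300, in_dim)    # e.g. 300 frames of visual features
seg_vis = segment_features(vis_feat, N, k, proj)
print(seg_vis.shape)                   # torch.Size([10, 128])
```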
And step S20, constructing a cross-modal hypergraph according to all the segment modal characteristics, and carrying out time domain information enhancement processing on each segment modal characteristic in the cross-modal hypergraph to obtain a single-mode enhancement characteristic corresponding to each segment modal characteristic.
Specifically, after the electroencephalogram feature, myoelectric feature, ultrasonic feature, IMU feature, force touch feature and visual feature of a patient are obtained, because the problems of different time lags and multiple dimensions exist among the multi-modal features, a cross-modal hypergraph information fusion algorithm is adopted, a graph substructure is utilized to integrate modal information, the graph is defined as a directed graph containing nodes and edges, space construction of the cross-modal hypergraph is carried out, wherein different nodes represent the features of different modalities, and edges represent the relationship among the nodes.
And then carrying out time domain information enhancement processing on all the segment modal characteristics contained in the cross-modal hypergraph through a self-attention mechanism to obtain single-mode enhancement characteristics corresponding to each segment modal characteristic.
Further, constructing a cross-modal hypergraph according to all the segment modal features, and performing time domain information enhancement processing on each segment modal feature in the cross-modal hypergraph to obtain a single-modal enhancement feature corresponding to each segment modal feature, which specifically comprises:
respectively integrating the segment modal characteristics corresponding to each modal signal to obtain nodes corresponding to each modal signal, and constructing a cross-modal hypergraph according to all the nodes; performing linear mapping operation on segment modal characteristics of each node in the cross-modal hypergraph respectively to obtain a key vector, a value vector and a query vector corresponding to each segment modal characteristic; constructing a similarity matrix according to the key vector, the value vector and the query vector corresponding to each segment modal feature; and respectively carrying out time domain information enhancement processing on all the segment modal characteristics according to the similarity matrix to obtain single-mode enhancement characteristics corresponding to each segment modal characteristic.
Specifically, segment modal features corresponding to the same modal feature are integrated to obtain nodes corresponding to each modal feature, and then construction of a cross-modal hypergraph is completed according to all the nodes; for each node in the cross-mode hypergraph, the time domain information of each node needs to be enhanced, and considering that different segments in the time domain have correlation, the intra-mode correlation modeling is firstly carried out on the mode characteristics corresponding to each node, namely, the time domain information enhancement processing is respectively carried out on the segment mode characteristics belonging to the same node.
Taking the long-range dependency relationship among different frames of a single mode as an example, and taking the visual features as that example: based on a self-attention mechanism, all segment visual features in the node corresponding to the visual features are linearly mapped to obtain the corresponding Key vectors, Value vectors and Query vectors. The temporal correlation between segment visual features of different time segments can then be expressed by a similarity matrix $A$, specifically: $A = \operatorname{softmax}\!\left(QK^{\top}/\sqrt{d}\right)$; wherein $\sqrt{d}$ is the scaling coefficient, $Q$ is the query vector of all segment visual features, and $K$ is the key vector of all segment visual features. The similarity matrix formula normalizes each row of the data in brackets, i.e. the temporal correlations among the segment visual features of all the different time segments in the visual feature node are encoded in the similarity matrix.
Time domain information enhancement processing is then performed on each segment visual feature according to the similarity matrix, with the specific formula: $G = AV + F$; wherein $G$, $F$ and $V$ represent the enhanced segment visual features (i.e. the unimodal enhancement features), the segment visual features before enhancement, and the value vectors corresponding to the segment visual features, respectively, and $G$ and $F$ have the same dimensions.
Time domain information enhancement processing is carried out in the same way on all the segment modal characteristics of the remaining nodes, thereby obtaining the single-mode enhancement characteristics corresponding to all the segment modal characteristics in the cross-modal hypergraph.
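A minimal sketch of this temporal enhancement, assuming a single attention head and the residual form of the enhancement reconstructed above; dimensions are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalSelfAttention(nn.Module):
    """Single-head self-attention over the time segments of one modality."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)   # query projection
        self.k = nn.Linear(d, d)   # key projection
        self.v = nn.Linear(d, d)   # value projection
        self.scale = math.sqrt(d)  # scaling coefficient

    def forward(self, feats):      # feats: (N, d) segment modal features F
        Q, K, V = self.q(feats), self.k(feats), self.v(feats)
        # Similarity matrix: each row is softmax-normalized and encodes the
        # temporal correlation between segments of different time slices.
        A = F.softmax(Q @ K.t() / self.scale, dim=-1)   # (N, N)
        return A @ V + feats       # unimodal enhancement features G (same shape as F)

seg_vis = torch.randn(10, 128)      # 10 time segments, dimension 128 (assumed)
G = TemporalSelfAttention(128)(seg_vis)
print(G.shape)                      # torch.Size([10, 128])
```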
Step S30, calculating cross-modal enhancement features corresponding to each segment modal feature according to a cross-modal attention mechanism, obtaining pre-erasure time domain features corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement features corresponding to each segment modal feature, and obtaining pre-erasure space domain features corresponding to each node according to the segment modal features of each node.
Specifically, after the unimodal enhancement feature corresponding to each segment modal feature in each node is obtained, cross-modal enhancement is performed on the plurality of different unimodal enhancement features of the same time segment (in the preferred embodiment of the present invention, each time segment has six different unimodal enhancement features) to obtain the cross-modal enhancement feature corresponding to each unimodal enhancement feature, the pre-erasure time domain feature corresponding to each node is obtained according to all the cross-modal enhancement features corresponding to each node, and the pre-erasure space domain feature corresponding to each node is then obtained according to the segment modal features of each node.
Further, the calculating a cross-modal enhancement feature corresponding to each segment modal feature according to a cross-modal attention mechanism, obtaining a pre-erasure time domain feature corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement feature corresponding to each segment modal feature, and obtaining a pre-erasure space domain feature corresponding to each node according to the segment modal feature of each node specifically includes:
acquiring a time segment corresponding to each segment modal feature, and classifying all the segment modal features according to the time segments to divide a plurality of segment modal features belonging to the same time into the same group to obtain a plurality of modal feature groups; taking any segment modal feature in any one modal feature group as a feature to be processed according to a cross-modal attention mechanism, and performing cross-modal enhancement processing on the feature to be processed according to all segment modal features in the same modal feature group as the feature to be processed to obtain a cross-modal enhancement feature corresponding to the feature to be processed; after the cross-modal enhancement processing of all the segment modal features is completed, obtaining cross-modal enhancement features corresponding to each segment modal feature; carrying out fusion processing on the single-mode enhancement features and the cross-modal enhancement features belonging to the same node to obtain pre-erasure time domain features corresponding to each node; and integrating the segment modal features of each node to obtain node features corresponding to each node, and performing encoding processing according to each node feature to obtain the pre-erasure space domain features corresponding to each node.
Specifically, the time segment corresponding to each segment modal feature is first obtained, and the segment modal features belonging to the same time segment are divided into the same group to obtain a plurality of modal feature groups (in practice, each modal feature group comprises six segment modal features). The information complementarity among different modalities is then exploited and cross-modal attention is used to learn the modal representation, as shown in fig. 5. Taking the visual features as an example, the cross-modal attention between the segment visual feature $F_{v}$ and another segment modal feature $F_{m}$ in the same modal feature group is expressed as $\operatorname{CM}(F_{v},F_{m})=\operatorname{softmax}\!\left(Q_{v}K_{m}^{\top}/\sqrt{d}\right)V_{m}$, wherein $F_{v}$ and $F_{m}$ denote segment features in the modal feature group that have not yet been cross-modal enhanced, $Q_{v}$ is the query vector of the segment visual feature, and $K_{m}$ and $V_{m}$ are the key and value vectors of the other segment modal feature. The segment visual feature is then updated according to $F_{v}^{(i+1)}=F_{v}^{(i)}+w\,\operatorname{CM}\!\left(F_{v}^{(i)},F_{m_{i}}\right),\ i=1,\ldots,n$, wherein $n$ is the number of cross-modal enhancements, $w$ is a weight parameter, $F_{m_{i}}$ denotes the remaining segment modal features other than the segment visual feature, and $\operatorname{CM}(\cdot)$ represents the enhancement of the segment visual feature by the other segment modal features. The segment electroencephalogram feature, segment myoelectric feature, segment ultrasonic feature, segment IMU feature and segment force touch feature in the modal feature group are used in turn to perform this cross-modal enhancement, so that the cross-modal enhancement feature corresponding to the segment visual feature is obtained; in practice, 5 cross-modal enhancement processes are performed on the remaining 1 segment modal feature by using the 5 other segment modal features in the same modal feature group. Similarly, the same processing is carried out on all the segment modal features in all the modal feature groups, and after the processing is completed the cross-modal enhancement feature corresponding to each segment modal feature is obtained.
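A minimal sketch of the cross-modal enhancement described above, in which one modality's segment features are enhanced in turn by each of the other five modalities in the same modal feature group; the residual update and the value of the weight parameter are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """Attend from a target modality's segments to another modality's segments."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.scale = math.sqrt(d)

    def forward(self, target, source):        # both (N, d)
        Q, K, V = self.q(target), self.k(source), self.v(source)
        A = F.softmax(Q @ K.t() / self.scale, dim=-1)
        return A @ V

def cross_modal_enhance(target, others, attn, weight=0.5):
    """Sequentially enhance `target` with every other modality's segment features.

    `weight` plays the role of the weight parameter in the text; its value and
    the residual form of the update are assumptions.
    """
    out = target
    for src in others:                         # 5 enhancement rounds for 6 modalities
        out = out + weight * attn(out, src)
    return out                                 # cross-modal enhancement features

d, N = 128, 10
attn = CrossModalAttention(d)
vis = torch.randn(N, d)
others = [torch.randn(N, d) for _ in range(5)]  # EEG, EMG, ultrasound, IMU, haptic
H_vis = cross_modal_enhance(vis, others, attn)
print(H_vis.shape)                              # torch.Size([10, 128])
```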
Then, fusion processing is carried out on all single-mode enhancement features and all cross-modal enhancement features of each node (each mode) to obtain the pre-erasure time domain features corresponding to each node; and the segment modal characteristics of each node are integrated to obtain the node characteristics corresponding to each node, while coding processing is carried out on the node characteristics to obtain the pre-erasure space domain characteristics corresponding to each node.
Further, the encoding processing is performed according to each node characteristic to obtain a pre-erasure space domain characteristic corresponding to each node, which specifically includes:
respectively carrying out coding processing on each node characteristic through a public encoder and a private encoder to obtain a public characteristic and a private characteristic corresponding to each node characteristic; respectively carrying out graph distillation processing on the public characteristic and the private characteristic of each node characteristic to obtain the public distillation loss corresponding to each public characteristic and the private distillation loss corresponding to each private characteristic; optimizing the corresponding public characteristics and private characteristics according to each public distillation loss and each private distillation loss to obtain the final public characteristics and final private characteristics corresponding to each node; and respectively carrying out fusion processing and linear mapping on the final public features and the final private features of each node to obtain the pre-erasure space domain features corresponding to each node.
Specifically, as shown in fig. 6, after the pre-erasure time domain features of each node are defined, the pre-erasure space domain features of each node need to be calculated. First, a public encoder $E_{c}$ and a private encoder $E_{p}$ are used to encode the node features of each node respectively, with the specific expressions $C = E_{c}(M)$ and $P = E_{p}(M)$, wherein $C$ and $P$ respectively denote the public features shared among the nodes and the private features unique to the modality, and $M$ denotes the node features.
A graph-based knowledge distillation algorithm is then used. For convenience of description, $z$ is used uniformly to denote either the public feature $C$ or the private feature $P$. With $i$ and $j$ denoting nodes (modalities) of the graph, the distillation intensity $w_{i \to j}$ from modality $i$ to modality $j$ is: $w_{i \to j} = \mathrm{FC}_{2}\!\left(\mathrm{FC}_{1}\!\left([z_{i} \parallel z_{j}]\right)\right)$; wherein $[\cdot \parallel \cdot]$ denotes the concatenation of features, $\mathrm{FC}_{1}$ and $\mathrm{FC}_{2}$ denote fully connected layers, $z_{i}$ denotes the public or private feature of node $i$, and $z_{j}$ denotes the public or private feature of node $j$.
The distance from modality $i$ to modality $j$ is defined as the difference between their logits (output vectors) and is denoted $e_{i \to j}$. The distillation loss $L_{j}$ for modality $j$ is: $L_{j} = \sum_{i \in \mathcal{N}_{j}} w_{i \to j}\, e_{i \to j}$; wherein $\mathcal{N}_{j}$ denotes the set of nodes injecting into modality $j$.
The distillation intensities $w_{i \to j}$ can be taken as elements to form a square matrix $W$, and the distances $e_{i \to j}$ as elements to form a matrix $E$. The graph distillation loss $L_{g}$ is: $L_{g} = \left\lVert W \odot E \right\rVert_{1}$; wherein $\odot$ denotes element-by-element multiplication and $\lVert \cdot \rVert_{1}$ denotes the L1 norm.
Following the above formulas, graph distillation is carried out on the public features and the private features corresponding to each node, so that the public distillation loss corresponding to each public feature and the private distillation loss corresponding to each private feature are obtained respectively; each public distillation loss is then used to optimize the corresponding public feature and each private distillation loss is used to optimize the corresponding private feature, yielding the final public feature and final private feature corresponding to each node; the final public feature and the final private feature of each node are then connected in series and linearly mapped to obtain the pre-erasure space domain feature corresponding to each node.
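A minimal sketch of the graph distillation loss described above; the hidden layer size, activation and classification head are assumptions, and the same routine would be applied separately to the public features and the private features.

```python
import torch
import torch.nn as nn

class GraphDistillation(nn.Module):
    """Compute the graph distillation loss over the modality nodes (a sketch)."""
    def __init__(self, d, num_classes, hidden=64):
        super().__init__()
        self.fc1 = nn.Linear(2 * d, hidden)      # first fully connected layer
        self.fc2 = nn.Linear(hidden, 1)          # second fully connected layer
        self.head = nn.Linear(d, num_classes)    # produces per-node logits

    def forward(self, z):                        # z: (V, d) public (or private) features
        V = z.shape[0]
        logits = self.head(z)                    # (V, num_classes)
        # Distillation intensity w[i, j] from node i to node j, from the
        # concatenated features of the two nodes through two FC layers.
        zi = z.unsqueeze(1).expand(V, V, -1)
        zj = z.unsqueeze(0).expand(V, V, -1)
        w = self.fc2(torch.relu(self.fc1(torch.cat([zi, zj], dim=-1)))).squeeze(-1)
        # Distance e[i, j]: difference between the logits of nodes i and j.
        e = (logits.unsqueeze(1) - logits.unsqueeze(0)).abs().sum(-1)   # (V, V)
        # Graph distillation loss: L1 norm of the element-wise product.
        return (w * e).abs().sum()

nodes = torch.randn(6, 128)                      # 6 modality nodes, public features
loss = GraphDistillation(d=128, num_classes=8)(nodes)
print(loss.item())
```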
Step S40, performing time erasure processing on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and acquiring the post-erasure time domain feature and the post-erasure space domain feature corresponding to each node in the cross-modal hypergraph after the time erasure processing.
Specifically, in order to learn more time slices more significant for behavior intention recognition, time erasure processing needs to be performed on all time slices, and then the post-erasure time domain features and the post-erasure space domain features of each node after the time erasure processing are acquired.
Further, the time erasure processing is performed on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and the time domain feature and the space domain feature after erasure corresponding to each node in the cross-modal hypergraph after the time erasure processing are obtained, which specifically comprises:
respectively carrying out addition processing on the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature to obtain an addition feature corresponding to each segment modal feature, and obtaining a time attention map of each segment modal feature according to each addition feature; performing time erasure processing on all the time attention maps according to a preset time erasure template, and taking the time attention maps meeting preset conditions as post-erasure time attention maps; acquiring all the segment modal characteristics corresponding to the post-erasure time attention maps and taking them as the erased segment modal characteristics corresponding to each node; performing time domain information enhancement processing on the erased segment modal characteristics, and then calculating the erased segment modal characteristics according to a cross-modal attention mechanism to obtain erased cross-modal enhancement characteristics corresponding to the erased segment modal characteristics; and obtaining the erased time domain characteristic corresponding to each node according to the erased single-mode enhancement characteristic and the erased cross-modal enhancement characteristic corresponding to each node, and obtaining the erased space domain characteristic corresponding to each node according to each erased segment modal characteristic.
Specifically, taking the visual features as an example, the unimodal enhancement feature $G$ and the cross-modal enhancement feature $H$ corresponding to a certain segment visual feature are obtained, the unimodal enhancement feature and the cross-modal enhancement feature of the segment visual feature are added, and a multi-layer perceptron (MLP) is then used to obtain the time attention map $A_{t}$, with the specific formula $A_{t} = \mathrm{MLP}(G + H)$. Time erasure is then performed with the preset time erasure template $E$, specifically expressed as $E_{i} = 0$ if $A_{t,i} > \theta$ and $E_{i} = 1$ otherwise; wherein $\theta$ is a manually set hyperparameter, and the time segments with $A_{t,i} > \theta$ are the time segments that are more meaningful for behavior intention recognition. In order to learn further useful features, these time segments are temporarily discarded and the time segments with $A_{t,i} \le \theta$ are retained, i.e. the retained time segments are the time segments meeting the preset condition.
Similarly, after the time erasure processing is completed on all the segment modal features, the time attention maps of the retained time segments are taken as the post-erasure time attention maps, and the erased segment modal features corresponding to each post-erasure time attention map are then acquired; time domain information enhancement processing is performed again on all the erased segment modal features to obtain the erased single-mode enhancement features corresponding to the erased segment modal features, and cross-modal enhancement processing is then carried out on the erased segment modal features to obtain the erased cross-modal enhancement features corresponding to the erased segment modal features; the erased time domain features corresponding to each node are then obtained according to the erased single-mode enhancement features and the erased cross-modal enhancement features corresponding to each node, and the erased space domain features corresponding to each node are obtained according to all the erased segment modal features corresponding to each node; the steps for acquiring the post-erasure time domain features and the post-erasure space domain features are identical to those described above and are not repeated here.
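A minimal sketch of the time erasure step, following the thresholded attention-map form reconstructed above; the MLP structure and the threshold value are assumptions.

```python
import torch
import torch.nn as nn

class TimeErasure(nn.Module):
    """Erase the time segments most salient for recognition (a sketch)."""
    def __init__(self, d, theta=0.6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d // 2), nn.ReLU(),
                                 nn.Linear(d // 2, 1), nn.Sigmoid())
        self.theta = theta                     # manually set hyperparameter

    def forward(self, G, H):                   # G, H: (N, d) unimodal / cross-modal
        attn = self.mlp(G + H).squeeze(-1)     # time attention map, (N,)
        erase = attn > self.theta              # segments most useful for recognition
        keep = ~erase                          # segments meeting the preset condition
        return attn, keep

d, N = 128, 10
G, H = torch.randn(N, d), torch.randn(N, d)
attn, keep = TimeErasure(d)(G, H)
retained_segments = torch.arange(N)[keep]      # indices of the retained segments
print(attn.shape, keep.sum().item())
```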
Step S50, carrying out fusion processing on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain final fusion features, and obtaining a behavior prediction result of the target object in a target time period according to the final fusion features.
Specifically, after the pre-erasure time domain feature, the pre-erasure space domain feature, the post-erasure time domain feature and the post-erasure space domain feature corresponding to each node are obtained, corresponding fusion processing is needed to obtain the final fusion features; the final fusion features reflect the behavior features of the patient in the target time period, so that corresponding behavior intention prediction can be performed for the patient in the target time period according to the final fusion features to obtain the behavior prediction result of the patient in the target time period.
Further, the fusion processing is performed on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain a final fusion feature, and a behavior prediction result of the target object in a target time period is obtained according to the final fusion feature, which specifically includes:
Respectively carrying out fusion processing on the pre-erasure time domain features, the pre-erasure space domain features, the post-erasure time domain features and the post-erasure space domain features corresponding to each node to obtain space-time features corresponding to each node; the space-time characteristics of all the nodes are connected in series to obtain the final fusion characteristics; and analyzing and processing the final fusion characteristics by using a pre-trained graph neural network to obtain a behavior prediction result of the target object in a target time period.
Specifically, as shown in fig. 7, after the pre-erasure time domain feature, the pre-erasure space domain feature, the post-erasure time domain feature and the post-erasure space domain feature corresponding to each node are obtained, the four features of each node need to be fused to obtain the space-time feature corresponding to that node, with the specific formula $Z = \operatorname{Fuse}\!\left(T^{\mathrm{pre}}, T^{\mathrm{post}}, S^{\mathrm{pre}}, S^{\mathrm{post}}\right)$; wherein $Z$ is the space-time feature, $T^{\mathrm{pre}}$ is the pre-erasure time domain feature, $T^{\mathrm{post}}$ is the post-erasure time domain feature, $S^{\mathrm{pre}}$ is the pre-erasure space domain feature, and $S^{\mathrm{post}}$ is the post-erasure space domain feature.
After the space-time characteristics corresponding to each node are obtained, the space-time characteristics of all the nodes are connected in series to be used as final fusion characteristics, the final fusion characteristics are input into a pre-trained graph neural network, the pre-trained graph neural network analyzes and processes the final fusion characteristics, so that the score of each behavior category of a patient in a target time period is obtained, and the behavior category with the highest score is selected to be used as a behavior prediction result of the patient in the target time period.
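A minimal sketch of the final fusion and prediction, with simple concatenation as the per-node fusion operation and a linear classifier standing in for the pre-trained graph neural network; both are assumptions for illustration.

```python
import torch
import torch.nn as nn

def predict_behavior(node_feats, classifier, behaviors):
    """Fuse per-node features and pick the highest-scoring behavior class.

    node_feats: list of (t_pre, t_post, s_pre, s_post) tuples, one per node,
                each tensor of shape (d,).
    classifier: stand-in for the pre-trained graph neural network.
    """
    spatiotemporal = [torch.cat(f) for f in node_feats]   # per-node space-time features
    fused = torch.cat(spatiotemporal)                     # final fusion feature
    scores = classifier(fused)                            # one score per behavior class
    return behaviors[scores.argmax().item()]

d, num_nodes = 128, 6
behaviors = ["shake head", "open mouth", "open eyes", "other"]
classifier = nn.Linear(4 * d * num_nodes, len(behaviors))
node_feats = [tuple(torch.randn(d) for _ in range(4)) for _ in range(num_nodes)]
print(predict_behavior(node_feats, classifier, behaviors))
```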
Further, as shown in fig. 8, based on the above behavior intention recognition method based on the cross-modal hypergraph, the present invention further correspondingly provides a behavior intention recognition system based on the cross-modal hypergraph, where the behavior intention recognition system based on the cross-modal hypergraph includes:
the segment feature obtaining module 51 is configured to obtain a plurality of different modal signals of a target object in a target time period, and to perform preprocessing according to the target time period and each modal signal to obtain a plurality of segment modal features corresponding to each modal signal;
the unimodal enhancement module 52 is configured to construct a cross-modal hypergraph from all the segment modal features, and to perform time domain information enhancement processing on each segment modal feature in the cross-modal hypergraph to obtain a unimodal enhancement feature corresponding to each segment modal feature;
the pre-erasure feature obtaining module 53 is configured to calculate a cross-modal enhancement feature corresponding to each segment modal feature according to a cross-modal attention mechanism, to obtain a pre-erasure time domain feature corresponding to each node in the cross-modal hypergraph from the cross-modal enhancement feature corresponding to each segment modal feature, and to obtain a pre-erasure space domain feature corresponding to each node from the segment modal features of that node;
the post-erasure feature obtaining module 54 is configured to perform time erasure processing on the cross-modal hypergraph according to the unimodal enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and to obtain the post-erasure time domain feature and the post-erasure space domain feature corresponding to each node in the cross-modal hypergraph after the time erasure processing;
and the prediction result generating module 55 is configured to perform fusion processing on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain a final fusion feature, and to obtain a behavior prediction result of the target object in the target time period according to the final fusion feature.
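As a rough illustration of how these five modules could be chained at inference time, the following sketch wires them into a single prediction call; the attribute and method names (`segment_feature_module.extract`, `unimodal_enhancement_module.enhance`, and so on) are hypothetical stand-ins for whatever interfaces a concrete implementation exposes, and are not defined by this disclosure.

```python
def recognize_behavior_intent(modal_signals, target_period, system):
    """Chain the five modules of fig. 8 into a single prediction pass.
    `system` is assumed to expose the modules as attributes with the
    hypothetical method names used below; real interfaces may differ."""
    # Module 51: preprocess each modal signal into segment modal features.
    segment_feats = system.segment_feature_module.extract(modal_signals, target_period)
    # Module 52: build the cross-modal hypergraph and enhance temporal information.
    hypergraph, unimodal_feats = system.unimodal_enhancement_module.enhance(segment_feats)
    # Module 53: cross-modal attention, then pre-erasure time/space domain features.
    cross_feats, t_pre, s_pre = system.pre_erasure_module.compute(hypergraph, segment_feats)
    # Module 54: time erasure, then post-erasure time/space domain features.
    t_post, s_post = system.post_erasure_module.erase(hypergraph, unimodal_feats, cross_feats)
    # Module 55: fuse all four feature sets and score the behavior categories.
    return system.prediction_module.predict(t_pre, s_pre, t_post, s_post)
```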
Further, as shown in fig. 9, based on the above behavior intention recognition method and system based on the cross-modal hypergraph, the present invention further provides a terminal, where the terminal includes a processor 10, a memory 20 and a display 30. Fig. 9 shows only some of the components of the terminal; it should be understood that not all of the illustrated components are required, and that more or fewer components may be implemented instead.
The memory 20 may in some embodiments be an internal storage unit of the terminal, such as a hard disk or an internal memory of the terminal. In other embodiments the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card (Flash Card) provided on the terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the terminal. The memory 20 is used for storing the application software installed on the terminal and various kinds of data, such as the program code installed on the terminal, and may also be used to temporarily store data that has been output or is to be output. In an embodiment, the memory 20 stores a cross-modal hypergraph based behavior intention recognition program 40, and the cross-modal hypergraph based behavior intention recognition program 40 can be executed by the processor 10 to implement the cross-modal hypergraph based behavior intention recognition method of the present application.
The processor 10 may in some embodiments be a central processing unit (Central Processing Unit, CPU), a microprocessor or another data processing chip, and is used to run the program code stored in the memory 20 or to process data, for example to execute the cross-modal hypergraph based behavior intention recognition method.
The display 30 may in some embodiments be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 30 is used for displaying information on the terminal and for presenting a visual user interface. The components 10-30 of the terminal communicate with each other via a system bus.
The invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a cross-modal hypergraph-based behavior intention recognition program which, when executed by a processor, implements the steps of the cross-modal hypergraph-based behavior intention recognition method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article or terminal. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or terminal comprising that element.
Of course, those skilled in the art will appreciate that all or part of the flow of the above methods may be implemented by a computer program instructing the relevant hardware (such as a processor or controller); the program may be stored on a computer readable storage medium and, when executed, may include the flows of the above method embodiments. The computer readable storage medium may be a memory, a magnetic disk, an optical disk, or the like.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims (10)

1. The behavior intention recognition method based on the cross-modal hypergraph is characterized by comprising the following steps of:
acquiring a plurality of different modal signals of a target object in a target time period, and preprocessing the different modal signals according to the target time period and each modal signal to obtain a plurality of segment modal features corresponding to each modal signal;
Constructing a cross-modal hypergraph according to all the segment modal characteristics, and performing time domain information enhancement processing on each segment modal characteristic in the cross-modal hypergraph to obtain a single-mode enhancement characteristic corresponding to each segment modal characteristic;
calculating cross-modal enhancement features corresponding to each segment modal feature according to a cross-modal attention mechanism, obtaining pre-erasure time domain features corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement features corresponding to each segment modal feature, and obtaining pre-erasure space domain features corresponding to each node according to the segment modal features of each node;
performing time erasure processing on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and acquiring a post-erasure time domain feature and a post-erasure space domain feature corresponding to each node in the cross-modal hypergraph after the time erasure processing;
and carrying out fusion processing on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain final fusion features, and obtaining a behavior prediction result of the target object in a target time period according to the final fusion features.
2. The behavior intention recognition method based on cross-modal hypergraph according to claim 1, wherein the obtaining a plurality of different modal signals of a target object in a target time period, and preprocessing according to the target time period and each modal signal to obtain a plurality of segment modal features corresponding to each modal signal specifically comprises:
acquiring a plurality of different modal signals of a target object in a target time period, wherein the different modal signals are acquired by different preset sensors;
preprocessing each modal signal to obtain a modal characteristic corresponding to each modal signal;
and uniformly dividing each modal feature according to the preset time segment length to obtain a plurality of segment modal features corresponding to each modal feature.
3. The behavior intention recognition method based on cross-modal hypergraph according to claim 1, wherein the steps of constructing a cross-modal hypergraph according to all the segment modal features, and performing time domain information enhancement processing on each segment modal feature in the cross-modal hypergraph to obtain a single-modal enhancement feature corresponding to each segment modal feature comprise the following steps:
Respectively integrating the segment modal characteristics corresponding to each modal signal to obtain nodes corresponding to each modal signal, and constructing a cross-modal hypergraph according to all the nodes;
performing linear mapping operation on segment modal characteristics of each node in the cross-modal hypergraph respectively to obtain a key vector, a value vector and a query vector corresponding to each segment modal characteristic;
constructing a similarity matrix according to the key vector, the value vector and the query vector corresponding to each segment modal feature;
and respectively carrying out time domain information enhancement processing on all the segment modal characteristics according to the similarity matrix to obtain single-mode enhancement characteristics corresponding to each segment modal characteristic.
4. The behavior intention recognition method based on a cross-modal hypergraph according to claim 1, wherein the calculating a cross-modal enhancement feature corresponding to each segment modal feature according to a cross-modal attention mechanism, obtaining a pre-erasure time domain feature corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement feature corresponding to each segment modal feature, and obtaining a pre-erasure space domain feature corresponding to each node according to the segment modal features of each node specifically includes:
Acquiring a time segment corresponding to each segment modal feature, and classifying all the segment modal features according to the time segments to divide a plurality of segment modal features belonging to the same time into the same group to obtain a plurality of modal feature groups;
taking any segment modal feature in any one modal feature group as a feature to be processed according to a cross-modal attention mechanism, and performing cross-modal enhancement processing on the feature to be processed according to all segment modal features in the same modal feature group as the feature to be processed to obtain a cross-modal enhancement feature corresponding to the feature to be processed;
after the cross-modal enhancement processing of all the segment modal features is completed, obtaining cross-modal enhancement features corresponding to each segment modal feature;
carrying out fusion processing on the single-mode enhancement features and the cross-mode enhancement features belonging to the same node to obtain pre-erasure time domain features corresponding to each node;
and integrating the segment modal features of each node to obtain a node feature corresponding to each node, and performing encoding processing according to each node feature to obtain the pre-erasure space domain feature corresponding to each node.
5. The behavior intention recognition method based on cross-modal hypergraph according to claim 4, wherein the encoding processing is performed according to each node feature to obtain the pre-erasure space domain feature corresponding to each node, and specifically comprises:
respectively carrying out encoding processing on each node feature through a public encoder and a private encoder to obtain a public feature and a private feature corresponding to each node feature;
respectively carrying out graph distillation processing on the public feature and the private feature of each node feature to obtain a public distillation loss corresponding to each public feature and a private distillation loss corresponding to each private feature;
optimizing the corresponding public features and private features according to each public distillation loss and each private distillation loss to obtain final public features and final private features corresponding to each node;
and respectively carrying out fusion processing and linear mapping on the final public feature and the final private feature of each node to obtain the pre-erasure space domain feature corresponding to each node.
6. The behavior intention recognition method based on a cross-modal hypergraph according to claim 1, wherein the performing time erasure processing on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and acquiring a post-erasure time domain feature and a post-erasure space domain feature corresponding to each node in the cross-modal hypergraph after the time erasure processing, specifically comprises:
respectively carrying out addition processing on the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature to obtain an addition feature corresponding to each segment modal feature, and obtaining a time attention map of each segment modal feature according to each addition feature;
performing time erasure processing on all the time attention maps according to a preset time erasure template, and taking the time attention maps meeting preset conditions as post-erasure time attention maps;
acquiring all the segment modal features corresponding to the post-erasure time attention maps, and taking them as the post-erasure segment modal features corresponding to each node;
performing time domain information enhancement processing on the post-erasure segment modal features to obtain post-erasure single-mode enhancement features, and then calculating post-erasure cross-modal enhancement features corresponding to the post-erasure segment modal features according to a cross-modal attention mechanism;
and obtaining the post-erasure time domain feature corresponding to each node according to the post-erasure single-mode enhancement feature and the post-erasure cross-modal enhancement feature corresponding to that node, and obtaining the post-erasure space domain feature corresponding to each node according to each post-erasure segment modal feature.
7. The behavior intention recognition method based on cross-modal hypergraph according to claim 1, wherein the fusing processing is performed on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain a final fused feature, and the behavior prediction result of the target object in the target time period is obtained according to the final fused feature, and specifically includes:
respectively carrying out fusion processing on the pre-erasure time domain features, the pre-erasure space domain features, the post-erasure time domain features and the post-erasure space domain features corresponding to each node to obtain space-time features corresponding to each node;
the space-time characteristics of all the nodes are connected in series to obtain the final fusion characteristics;
and analyzing and processing the final fusion characteristics by using a pre-trained graph neural network to obtain a behavior prediction result of the target object in a target time period.
8. A cross-modal hypergraph-based behavior intent recognition system, the cross-modal hypergraph-based behavior intent recognition system comprising:
the segment characteristic acquisition module is used for acquiring a plurality of different modal signals of a target object in a target time period, and preprocessing the different modal signals according to the target time period and each modal signal to obtain a plurality of segment modal characteristics corresponding to each modal signal;
the single-mode enhancement module is used for constructing a cross-modal hypergraph according to all the segment modal features, and carrying out time domain information enhancement processing on each segment modal feature in the cross-modal hypergraph to obtain a single-mode enhancement feature corresponding to each segment modal feature;
the pre-erasure feature acquisition module is used for calculating cross-modal enhancement features corresponding to each segment modal feature according to a cross-modal attention mechanism, obtaining pre-erasure time domain features corresponding to each node in the cross-modal hypergraph according to the cross-modal enhancement features corresponding to each segment modal feature, and obtaining pre-erasure space domain features corresponding to each node according to the segment modal features of each node;
the post-erasure feature acquisition module is used for performing time erasure processing on the cross-modal hypergraph according to the single-mode enhancement feature and the cross-modal enhancement feature corresponding to each segment modal feature, and acquiring post-erasure time domain features and post-erasure space domain features corresponding to each node in the cross-modal hypergraph after the time erasure processing;
and the prediction result generation module is used for carrying out fusion processing on all the pre-erasure time domain features, all the pre-erasure space domain features, all the post-erasure time domain features and all the post-erasure space domain features to obtain final fusion features, and obtaining a behavior prediction result of the target object in a target time period according to the final fusion features.
9. A terminal, the terminal comprising: memory, a processor and a cross-modal hypergraph based behavior intent recognition program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the cross-modal hypergraph based behavior intent recognition method as claimed in any one of claims 1-7.
10. A computer readable storage medium, characterized in that it stores a cross-modal hypergraph based behavior intention recognition program, which when executed by a processor implements the steps of the cross-modal hypergraph based behavior intention recognition method according to any of claims 1-7.
CN202410247777.1A 2024-03-05 2024-03-05 Behavior intention recognition method, system and terminal based on cross-mode hypergraph Active CN117828281B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410247777.1A CN117828281B (en) 2024-03-05 2024-03-05 Behavior intention recognition method, system and terminal based on cross-mode hypergraph

Publications (2)

Publication Number Publication Date
CN117828281A true CN117828281A (en) 2024-04-05
CN117828281B CN117828281B (en) 2024-05-07

Family

ID=90513794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410247777.1A Active CN117828281B (en) 2024-03-05 2024-03-05 Behavior intention recognition method, system and terminal based on cross-mode hypergraph

Country Status (1)

Country Link
CN (1) CN117828281B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792768A (en) * 2021-08-27 2021-12-14 清华大学 Hypergraph neural network classification method and device
WO2023024017A1 (en) * 2021-08-26 2023-03-02 Ebay Inc. Multi-modal hypergraph-based click prediction
CN116432053A (en) * 2023-03-21 2023-07-14 浙江师范大学 Multi-mode data representation method based on modal interaction deep hypergraph neural network
CN116548985A (en) * 2023-05-06 2023-08-08 平安科技(深圳)有限公司 Electroencephalogram signal feature extraction method and device, computer equipment and storage medium
CN117037017A (en) * 2023-05-23 2023-11-10 南开大学 Video emotion detection method based on key frame erasure
CN117173394A (en) * 2023-08-07 2023-12-05 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龚勋; 杨菲; 杜章锦; 师恩; 赵绪; 杨子奇; 邹海鹏; 罗俊: "A survey of automatic analysis techniques for thyroid and breast ultrasound images" (甲状腺、乳腺超声影像自动分析技术综述), Journal of Software (软件学报), no. 07, 15 July 2020 (2020-07-15), pages 319-356 *

Also Published As

Publication number Publication date
CN117828281B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
Du et al. Representation learning of temporal dynamics for skeleton-based action recognition
WO2019245768A1 (en) System for predicting articulated object feature location
CN112906604B (en) Behavior recognition method, device and system based on skeleton and RGB frame fusion
CN111222486B (en) Training method, device and equipment for hand gesture recognition model and storage medium
CN109993131B (en) Design intention distinguishing system and method based on multi-mode signal fusion
CN109765991A (en) Social interaction system is used to help system and non-transitory computer-readable storage media that user carries out social interaction
CN111428583B (en) Visual compensation method based on neural network and touch lattice
CN108960171B (en) Method for converting gesture recognition into identity recognition based on feature transfer learning
CN111915618B (en) Peak response enhancement-based instance segmentation algorithm and computing device
Li et al. Global co-occurrence feature learning and active coordinate system conversion for skeleton-based action recognition
CN117828281B (en) Behavior intention recognition method, system and terminal based on cross-mode hypergraph
Rahman et al. Air writing: Recognizing multi-digit numeral string traced in air using RNN-LSTM architecture
CN112241001A (en) Radar human body action recognition method and device, electronic equipment and storage medium
CN111914925A (en) Patient behavior multi-modal perception and analysis system based on deep learning
CN110956599A (en) Picture processing method and device, storage medium and electronic device
CN113627349B (en) Dynamic facial expression recognition method based on self-attention transformation network
Tao et al. A novel human behaviour information coding method based on eye-tracking technology
Weibel et al. Hands that speak: An integrated approach to studying complex human communicative body movements
CN116700471A (en) Method and system for enhancing user experience of virtual reality system
JP2021170247A (en) Information processing device, information processing system, information processing method and program
Ghosh et al. AV-Gaze: A study on the effectiveness of audio guided visual attention estimation for non-profilic faces
CN112884076B (en) Sensor data generation model and method for generating confrontation network based on conditional expression
CN114723963B (en) Task action and object physical attribute identification method based on visual touch signal
Wang et al. UMSNet: An Universal Multi-sensor Network for Human Activity Recognition
Patel et al. Implementation of A Dynamic Gesture Recognition Based Indian Sign Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant