CN114973411A - Self-adaptive evaluation method, system, equipment and storage medium for attitude motion - Google Patents

Self-adaptive evaluation method, system, equipment and storage medium for attitude motion

Info

Publication number
CN114973411A
CN114973411A (application CN202210604517.6A)
Authority
CN
China
Prior art keywords: heartbeat, feature, human body, body posture, posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210604517.6A
Other languages
Chinese (zh)
Inventor
刘海
张昭理
朱俊艳
宋云霄
李家豪
刘婷婷
杨兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University
Central China Normal University
Original Assignee
Hubei University
Central China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University, Central China Normal University
Priority to CN202210604517.6A
Publication of CN114973411A
Legal status: Pending

Classifications

    • G06V 40/20 — Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 — Recognition of whole body movements, e.g. for sport training
    • A61B 5/0205 — Simultaneously evaluating both cardiovascular conditions and different types of body conditions, e.g. heart and respiratory condition
    • A61B 5/1135 — Measuring movement of the entire body or parts thereof occurring during breathing, by monitoring thoracic expansion
    • G06F 18/253 — Fusion techniques of extracted features
    • G06N 3/044 — Recurrent networks, e.g. Hopfield networks
    • G06N 3/045 — Combinations of networks
    • G06N 3/08 — Learning methods (neural networks)
    • G06V 10/806 — Fusion of extracted features at the sensor, preprocessing, feature-extraction or classification level
    • G06V 10/82 — Image or video recognition or understanding using neural networks
    • G06V 20/46 — Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames


Abstract

The application discloses a method, a system, a device and a storage medium for adaptive evaluation of posture-type motion. The method comprises the following steps: acquiring a video sequence and a respiration-heartbeat echo signal of an object to be detected, and preprocessing the respiration-heartbeat echo signal; inputting the preprocessed respiration-heartbeat echo signal into a trained first network model to obtain respiration-heartbeat features; inputting the video sequence into a trained second network model to obtain 3D human body posture features; interactively fusing the respiration-heartbeat features and the 3D human body posture features to obtain fused interactive features, and outputting an action score and a respiratory state prediction result according to the fused interactive features; and predicting the 3D human body posture according to the 3D human body posture features, and calculating the similarity between the predicted 3D human body posture and the standard action. The method and the device evaluate the object to be detected by combining its actions with its respiration and heartbeat, which helps to improve the efficiency and quality of motion teaching and training.

Description

Self-adaptive evaluation method, system, equipment and storage medium for attitude motion
Technical Field
The present application relates to the field of gesture-based motion adaptive evaluation technologies, and in particular, to a gesture-based motion adaptive evaluation method, system, device, and storage medium.
Background
At present, artificial intelligence technology is widely used in many fields. In the training and guidance of sports, fitness and dance learning, artificial intelligence plays an important role thanks to its convenience. On a campus or in other fitness venues, a teacher or coach is generally unable to give targeted one-to-one guidance to every student; demonstrating the movements and correcting each student's mistakes one by one not only greatly increases the teacher's workload, but the monotony of this approach also dampens the students' enthusiasm for learning.
Disclosure of Invention
Aiming at at least one defect or improvement requirement in the prior art, the present invention provides a posture-type motion adaptive evaluation method, system, device and storage medium, which evaluate the object to be detected by combining its actions with its respiration and heartbeat, and which help to improve the efficiency and quality of motion teaching and training.
To achieve the above object, according to a first aspect of the present invention, there is provided a method for adaptively evaluating a gesture-like motion, including:
collecting a video sequence and a respiratory heartbeat echo signal of an object to be detected by using a visible light camera and a millimeter wave radar, and preprocessing the respiratory heartbeat echo signal;
inputting the preprocessed breathing heartbeat echo signal into the trained first network model to obtain breathing heartbeat characteristics;
inputting the video sequence into a trained second network model to obtain 3D human body posture characteristics;
interactively fusing the respiratory heartbeat feature and the 3D human posture feature to obtain a fused interactive feature, and outputting an action score and a respiratory state prediction result according to the fused interactive feature;
and predicting the 3D human body posture according to the 3D human body posture characteristics, and calculating the similarity between the predicted 3D human body posture and the standard action.
Further, the pre-processing comprises:
mixing the respiration heartbeat echo signal with a transmitting signal of the millimeter wave radar, and then performing low-pass filtering to obtain an intermediate frequency signal;
performing fast Fourier transform on the intermediate frequency signal to obtain a signal frequency domain energy spectrogram, and acquiring target distance information from the signal frequency domain energy spectrogram;
solving phase information based on the distance information to obtain a respiration heartbeat oscillogram;
and expanding the respiration heartbeat oscillogram into a time-frequency spectrogram through Fourier transform.
Further, the first network model includes:
the ResNet50 backbone convolution neural network is used for extracting image characteristics of each frame from the preprocessed breathing heartbeat echo signals;
the long-time and short-time memory network is used for establishing the characteristic fusion of the context information in the time domain to obtain the enhanced characteristic;
a normalization layer to normalize the enhancement features;
and the multilayer perceptron is used for carrying out feature conversion on the normalized features to obtain the breathing heartbeat features.
Further, the second network model includes:
a multi-hypothesis pose generation module for generating an initialized plurality of pose hypotheses from the video sequence;
a temporal information embedding module for embedding a temporal position code into the feature representation of the postulated posture;
the single-hypothesis characteristic enhancement module is used for enhancing the characteristics inside the single posture hypothesis;
the multi-hypothesis feature fusion module is used for realizing feature information fusion among a plurality of enhanced attitude hypotheses;
and the 3D posture regression module is used for applying linear transformation operation regression to obtain the 3D human posture characteristics.
Further, the generating the initialized plurality of pose hypotheses comprises:
extracting a 2D pose sequence X ∈ R^(N×J×2) of the human body in each frame image from the video sequence, wherein R^(N×J×2) denotes a vector of dimension N×J×2, N denotes the total number of input frames and J denotes the total number of human joints; letting (x, y) denote the joint coordinates, the coordinates (x, y) of the 2D pose sequence are concatenated into X̃ ∈ R^(N×(J·2));
using a learnable position embedding E_pos ∈ R^(N×(J·2)) to preserve the position information of each joint point, taking the embedded result as the input of a Transformer encoder for feature extraction, and performing residual connection to initialize a plurality of pose hypotheses.
Further, the enhancing features inside the single pose hypothesis includes:
firstly, performing layer normalization on each posture hypothesis, and then calculating self attention; obtaining a new characteristic block after residual connection; the different channel information of a single pose hypothesis is then blended using a multi-layered perceptron to further enhance the features.
Further, the interactive fusion comprises:
converting the 3D human body posture feature and the respiration heartbeat feature into d-dimensional vectors through a fully connected layer, wherein the vector obtained from the 3D human body posture feature is denoted P = (p_1, …, p_d) and the vector obtained from the respiration heartbeat feature is denoted Q = (q_1, …, q_d);
constructing circulant matrices A and B from the projection vectors:
A = circ(p_1, p_2, …, p_d),
B = circ(q_1, q_2, …, q_d),
i.e., each row is a cyclic shift of the corresponding projection vector;
multiplying the circulant matrices with the vectors P, Q to obtain F and G, wherein F = PA and G = QB;
converting F, G into a fused interactive feature M of dimension k through a projection matrix W of dimension d×k.
According to a second aspect of the present invention, there is also provided an adaptive evaluation system for gesture-like motion, including:
the signal acquisition and preprocessing module is used for acquiring a video sequence and a respiratory heartbeat echo signal of an object to be detected and preprocessing the respiratory heartbeat echo signal;
the breathing heartbeat feature extraction module is used for inputting the preprocessed breathing heartbeat echo signal into a first network model to obtain breathing heartbeat features;
the 3D human body posture feature extraction module is used for inputting the video sequence into a trained second network model to obtain 3D human body posture features;
the binary feature cycle interaction module is used for interactively fusing the respiratory heartbeat feature and the 3D human posture feature to obtain a fused interaction feature, and outputting an action score and a respiratory state prediction result according to the fused interaction feature;
and the evaluating module is used for predicting the 3D human body posture according to the 3D human body posture characteristics and calculating the similarity between the predicted 3D human body posture and the standard action.
According to a third aspect of the present invention, there is also provided an electronic device comprising at least one processor, and at least one memory module, wherein the memory module stores a computer program that, when executed by the processor, causes the processor to perform the steps of any of the methods described above.
According to a fourth aspect of the present invention, there is also provided a storage medium storing a computer program executable by a processor, the computer program, when run on the processor, causing the processor to perform the steps of any of the methods described above.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) In the invention, the 3D human body posture information and the respiration-heartbeat information are fused to predict the result, so the degree to which an action meets the standard is judged more comprehensively. Compared with the respiration-heartbeat data under standard conditions, a respiration and heartbeat that are too gentle may indicate that the learner has not yet mastered the action proficiently, while a respiration and heartbeat that are too violent may call for an abnormality prompt so that the learner slows down the action training. Therefore, the judgment method of binary feature cyclic interaction is more helpful for guiding learners to train reasonably.
(2) The invention uses a neural network model to extract the intrinsic features of the respiration-heartbeat information and uses the time-frequency spectrogram as the input of the model, which further simplifies the fixed and tedious steps of manual feature extraction and realizes automatic end-to-end extraction of various essential features by the model, thereby effectively improving the accuracy and efficiency of the model.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic flow chart of a method for adaptively evaluating gesture-like motion according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a method for self-adaptively evaluating gesture-like motions according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a first network model, a second network model and a binary attention cycle interaction network according to an embodiment of the present invention;
fig. 4 is a schematic diagram illustrating the principle of 3D human body posture estimation according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The terms "first," "second," and the like in the description and claims of the present application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules but may alternatively include other steps or modules not listed or inherent to such process, method, article, or apparatus.
As shown in fig. 1 and fig. 2, a method for adaptively evaluating a gesture-like motion according to an embodiment of the present invention includes the steps of:
s101, collecting a video sequence and a respiration heartbeat echo signal of an object to be detected in real time by using a visible light camera and a millimeter wave radar, and preprocessing the respiration heartbeat echo signal.
The method specifically comprises the following substeps:
(1) positioning a human body target in a detection area of the visible light and millimeter wave radar sensors; in this embodiment, the human target is a learner who is performing athletic training;
(2) acquiring an RGB video sequence by using a visible light camera, and receiving an echo signal by using a millimeter wave radar;
(3) the echo signals are preprocessed to enhance relevant information and eliminate useless information.
Further, the preprocessing of the echo comprises the sub-steps of:
(1) acquiring an intermediate frequency signal from the echo signal, wherein the specific processing process comprises the following steps:
the time delay of the echo received by the millimeter wave radar transmitting the chirp continuous wave to the chest area of the target user varies with the movement of the chest. The radar receiving signals are subjected to frequency mixing and filtering to obtain intermediate frequency signals, and the intermediate frequency signals contain motion information of the target chest breathing and the target heartbeat after mixing. The emission signal in one frequency modulation period of the frequency modulation continuous wave adopted by the millimeter wave radar is as follows:
Figure BDA0003670807510000061
wherein A is T To transmit the amplitude of the signal, f c Is the center frequency, W is the bandwidth, T m τ represents the time of extension of the transmitted signal to the received signal, which is the signal chirp period. After the reflection of the target and the environment, the echo signal is obtained
Figure BDA0003670807510000062
Wherein A is R For the amplitude of the echo signal, Δ t is the time delay, Δ f d Indicating the doppler shift. The transmitting signal and the echo signal are subjected to frequency mixing processing and low-pass filtering to obtain an intermediate frequency signal:
S IF (t)=S T (t)S R (t)≈A T A R exp{j2π[f c Δt+(f I -Δf d )t], (7)
wherein the content of the first and second substances,
Figure BDA0003670807510000063
representing the frequency of the intermediate frequency signal at time t.
(2) Obtaining a target distance through FFT processing, wherein the specific processing process comprises the following steps:
the method for acquiring vital sign signals by radar mainly measures the phase of a target at a corresponding distance. Since the chest movement of the chirped continuous wave radar is coupled with the distance of the target, the signal needs to be preprocessed before obtaining the phase information so as to obtain the distance of the target. In order to obtain the range of the target, which corresponds to the distance bin with the largest energy, a Fast Fourier Transform (FFT) is used to obtain the frequency domain energy spectrum of the signal. However, due to the limitation of frequency resolution, there is only one distance range, and accurate positioning cannot be achieved, so two adjacent distance bins with the largest energy are selected for principal component analysis, and then the first principal component is extracted as the target actual distance.
(3) Solving target phase information, and acquiring respiratory heartbeat sign expression, wherein the specific processing process comprises the following steps:
the millimeter wave radar can obtain I/Q double-path signals, and the mismatching phenomenon can occur between the two paths, so that the invention uses a derivation and Cross-multiplication (DACM) method to recover the phase information of the signals. The procedure of the DACM algorithm is as follows:
Figure BDA0003670807510000071
wherein the content of the first and second substances,
Figure BDA0003670807510000072
represents the phase information of the recovered signal, t represents time, Q (t) represents Q path signals of t time, I (t) represents I path signals of t time, Q '(t) represents Q (t) differentiates t, and I' (t) represents I (t) differentiates t. Depending on the definition of the derivative, the above equation can be further refined as:
Figure BDA0003670807510000073
where Δ t represents a time interval that goes to 0.
Obtained because the differentiation enhances the interference of high-frequency noise
Figure BDA0003670807510000074
Is more sensitive to noise interference. To suppress noise, Δ t in the above expression is set to 1, and the signal is integrated to obtain final phase information
Figure BDA0003670807510000075
Figure BDA0003670807510000076
In this way, the actual phase information of the target is obtained. The phase of the intermediate frequency signal is related to the distance of the target as follows:

Δφ = 4π·Δd / λ,

where Δφ denotes the phase change of the intermediate frequency signal, Δd denotes the displacement change caused by the heartbeat or the thorax, and λ is the wavelength of the radar signal.
Therefore, a relationship can be established between the respiration-heartbeat waveform of the human body and the phase of the distance unit where the human target is located, and the respiration-heartbeat waveform of the human target can be obtained by extracting the phase information of that distance unit over a period of time.
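As an illustrative sketch of the DACM accumulation above (assuming real-valued I/Q samples in NumPy arrays; names are hypothetical):

```python
import numpy as np

def dacm_phase(i: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Differentiate-and-cross-multiply phase recovery with Δt = 1.

    i, q: real-valued I-channel and Q-channel samples of equal length.
    Returns the accumulated phase, proportional to chest displacement.
    """
    di = np.diff(i)          # I[k] - I[k-1]
    dq = np.diff(q)          # Q[k] - Q[k-1]
    num = i[1:] * dq - q[1:] * di
    den = i[1:] ** 2 + q[1:] ** 2
    # Summing the per-sample increments integrates the derivative, which
    # also suppresses the high-frequency noise amplified by differentiation.
    return np.concatenate(([0.0], np.cumsum(num / den)))
```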
(4) Converting the oscillogram into a time-frequency spectrogram through Fourier expansion, where the specific processing procedure is as follows:
Feature extraction is a necessary step before signals are input into a neural network. To embody the end-to-end idea as much as possible, the respiration-heartbeat oscillogram is converted into a time-frequency spectrogram to serve as the input data of the neural network, and the convolutional neural network and the long short-term memory network are used to predict whether the physiological information of the target user is abnormal, so as to guide the user to make adjustments.
The above analysis shows that the phase reflects the motion change of the thorax, and the motion change of the thorax is caused by respiration and heartbeat. Therefore, using the wavelet transform, a respiration-heartbeat oscillogram whose horizontal axis is time can be converted into a time-frequency spectrogram whose horizontal axis is time, whose vertical axis is frequency, and whose color represents amplitude; this better preserves the required features. The time-frequency spectrogram is used as the input of the neural network, and after a large amount of training the model can identify the standard degree or rationality of the physiological data of the target user.
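One possible realization of this wavelet-based conversion, sketched with the PyWavelets library (the Morlet wavelet, scale range and sampling rate here are illustrative assumptions, not values fixed by the disclosure):

```python
import numpy as np
import pywt

def waveform_to_spectrogram(phase: np.ndarray, fs: float = 20.0) -> np.ndarray:
    """Continuous wavelet transform of the respiration-heartbeat waveform.

    Returns |CWT| as a (scales x time) magnitude map: horizontal axis time,
    vertical axis frequency, values (rendered as color) the amplitude.
    """
    scales = np.arange(1, 128)
    coeffs, freqs = pywt.cwt(phase, scales, "morl", sampling_period=1.0 / fs)
    return np.abs(coeffs)
```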
S102, inputting the preprocessed breathing heartbeat echo signal into the trained first network model to obtain breathing heartbeat characteristics.
As shown in fig. 3, the first network model includes a ResNet50 backbone convolutional neural network, a long short-term memory network, a normalization layer, and a multi-layer perceptron (MLP). The preprocessed respiration-heartbeat echo signal, i.e., the time-frequency spectrogram sequence, is taken as the input of this network model: the image features of each frame are first extracted through the ResNet50 backbone convolutional neural network, and the long short-term memory network is then used to depict the temporal correlation information of the sequence data; next, the enhanced features are layer-normalized and input into the MLP for feature conversion; finally, the required respiration-heartbeat feature B ∈ R^o is obtained.
The ResNet50 network is mainly composed of convolutional layers, pooling layers, fully connected layers, and residual connections. Convolution is a mathematical operation that, in neural networks, is understood as an operation used to extract image features; the convolution operation gradually enlarges the receptive field to extract high-level features of the image. Pooling is a spatial operation along the height and width directions; it reduces the dimensionality of the feature map while keeping the most important information, thereby compressing the amount of data and parameters, reducing overfitting, and improving the fault tolerance of the model. The fully connected layer makes each output determined by all inputs. Residual connections are used to solve the problem of feature information being lost as the network deepens. After passing through this network, the features of each frame of the time-frequency spectrogram are extracted.
The long short-term memory network is a special recurrent neural network that can effectively exploit the temporal correlation among times t − 1, t and t + 1 to obtain rich features of the data at time t. Compared with a general recurrent neural network, the long short-term memory network can solve the problems of useful information from long ago being ignored and of vanishing gradients. Through the long short-term memory network, feature fusion of time-domain context information can be established between the time-frequency spectrograms, yielding richer feature representations Y_i (i ∈ [1, …, S]).
It is further preferred that cross-attention is calculated over the plurality of features obtained above so that each feature receives a different degree of attention, the goal being to pay more attention to primary information and less attention to secondary information. Specifically, the feature matrix Y_i (i ∈ [1, …, S]) is first linearly mapped to obtain Q, K and V, and the following is then calculated:

Sim(Q, K) = Q·K^T,   (12)
A = Softmax(Sim(Q, K)),   (13)
Attention(Q) = A·V.   (14)

Equation (12) calculates the similarity Sim(Q, K) between features by means of a dot product; equation (13) performs Softmax normalization on the similarity scores; equation (14) uses A as weight coefficients for a weighted summation, yielding the final attention score result matrix.
In this way, the degree of attention of each feature is obtained and a weighted fusion is performed; after layer normalization, the final respiration-heartbeat feature is obtained through MLP processing. The MLP is used to enhance the features; it comprises two linear layers and an activation layer:

MLP(x) = σ(x·W_1 + b_1)·W_2 + b_2,   (15)

where σ denotes the GELU activation function, and W_1 ∈ R^(d×d_h), W_2 ∈ R^(d_h×d) and b_1 ∈ R^(d_h), b_2 ∈ R^d respectively denote the weights and biases of the two linear layers (with d_h the hidden dimension).
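A condensed PyTorch sketch of this first network model (ResNet50 backbone per frame, LSTM over the sequence, attention-weighted fusion per equations (12)–(14), layer normalization, MLP); the hidden sizes, output dimension o, pooling over frames, and the use of torchvision's ResNet50 are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class BreathHeartNet(nn.Module):
    """First network model: spectrogram sequence -> respiration-heartbeat feature B."""
    def __init__(self, feat_dim: int = 512, out_dim: int = 128):
        super().__init__()
        backbone = resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)
        self.mlp = nn.Sequential(                      # eq. (15): two linear layers + GELU
            nn.Linear(feat_dim, 2 * feat_dim), nn.GELU(),
            nn.Linear(2 * feat_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, 3, H, W) time-frequency spectrogram sequence
        b, s = x.shape[:2]
        f = self.backbone(x.flatten(0, 1)).view(b, s, -1)  # per-frame image features
        y, _ = self.lstm(f)                                # temporal context Y_i
        a, _ = self.attn(y, y, y)                          # eqs. (12)-(14)
        return self.mlp(self.norm(a)).mean(dim=1)          # feature B ∈ R^o
```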
S103, inputting the video sequence into the trained second network model to obtain the 3D human body posture characteristics.
As shown in fig. 3 and fig. 4, the second network model uses a multi-hypothesis pose interaction network model (MPHInteraction), including: a multi-hypothesis pose generation module for generating an initialized plurality of pose hypotheses from the video sequence; a temporal information embedding module for embedding a temporal position code into the feature representation of the hypothesized poses; a single-hypothesis feature enhancement module for enhancing the features inside a single pose hypothesis; a multi-hypothesis feature fusion module for realizing feature information fusion among the plurality of enhanced pose hypotheses; and a 3D pose regression module for obtaining the 3D human body posture features by applying a linear transformation operation for regression.
The details are as follows:
(1) Multi-hypothesis pose generation, where the specific processing procedure is as follows:
Let (x, y) denote the coordinates of a joint in a frame, N denote the total number of input frames, and J denote the total number of human joints. First, a 2D pose sequence X ∈ R^(N×J×2) is extracted with OpenPose and its coordinates are concatenated into X̃ ∈ R^(N×(J·2)); a learnable position embedding E_pos ∈ R^(N×(J·2)) is then used to preserve the position information of each joint point; the result is fed into Transformer encoders for processing. The whole process can be expressed as:

Z_m^0 = X̃ + E_pos,
Z̃_m^l = Z_m^(l−1) + MSA(LN(Z_m^(l−1))),
Z_m^l = Z̃_m^l + MLP(LN(Z̃_m^l)),
X_m = LN(Z_m^(L_1)),

where l ∈ [1, …, L_1], and L_1 denotes the number of Transformer encoders used in the multi-hypothesis pose generation module. X_m ∈ R^(N×(J·2)) denotes the generated m-th hypothetical pose, and after L_1 encoder layers the M hypothetical poses are finally represented as X_1, …, X_M. Z_m^0 is the result of embedding the position code into X̃ for the m-th pose; Z_m^(l−1) denotes the feature representation of the m-th pose at layer (l−1); Z̃_m^l is a temporary representation produced during the Transformer encoder computation. LN denotes the Layer Normalization operation. MSA denotes the multi-head self-attention operation, which linearly maps the input x ∈ R^(n×d) to Query Q ∈ R^(n×d), Key K ∈ R^(n×d) and Value V ∈ R^(n×d), where n denotes the sequence length and d the dimension; attention is calculated as follows:

Attention(Q, K, V) = Softmax(Q·K^T / √d)·V.

The MSA performs the above operation with h heads in parallel, and finally concatenates the output results of the h attention heads.
(2) Time information embedding, where the specific processing procedure is as follows:
The position coding of the joint points above belongs to the spatial domain, and the features obtained from it alone are not rich enough. To exploit temporal information, the spatial-domain features are converted to the temporal domain. The features X_m ∈ R^(N×(J·2)) extracted from the pose hypotheses of each frame are transformed with one reshaping operation and one linear embedding into high-dimensional features Z_m ∈ R^(N×C), where C denotes the embedding dimension. Then, a learnable temporal position code E_T ∈ R^(N×C) is used to preserve the time information between frames. This process can be expressed as:

Z̄_m = Z_m + E_T,

where Z̄_m is the representation of Z_m after the temporal position code is embedded.
(3) Single-hypothesis feature enhancement, where the specific processing procedure is as follows:
A self-attention calculation is first performed for each pose hypothesis, and the different channel information of each single pose hypothesis is then blended through a multi-layer perceptron to further enhance the features. Specifically, the embedded features of the different pose hypotheses are fed in parallel into multiple MSA blocks:

Z̃_m^l = Z_m^(l−1) + MSA(LN(Z_m^(l−1))),

where l ∈ [1, …, L_2] denotes the index of the SHR layer and Z_m^(l−1) denotes the feature representation of the m-th hypothetical pose at layer (l−1). In this way the features of each pose hypothesis are enhanced, which facilitates the evaluation of the final result. Further, the features of the multiple hypotheses are concatenated and used as the input of the MLP:

Z̃^l = Concat(Z̃_1^l, …, Z̃_M^l),
Z^l = Z̃^l + MLP(LN(Z̃^l)),

where Concat(·) denotes the concatenation operation and Z̃^l denotes the result of concatenating the multiple hypothesis features. These aggregated features are uniformly divided into non-overlapping blocks along the channel dimension, and this process yields a mixture of the relationships between the channels of different hypotheses.
(4) Multi-hypothesis feature fusion, where the specific processing procedure is as follows:
To obtain interaction between the hypothetical poses, cross-attention calculations need to be done. Specifically, letting Z_m^(l−1) denote the features of the m-th pose hypothesis at layer (l−1), the following is computed:

Z̃_m^l = Z_m^(l−1) + MCA(LN(Z_m^(l−1)), LN(Z_{m1}^(l−1)), LN(Z_{m2}^(l−1))),

where l ∈ [1, …, L_3] denotes the index of the CHI layer and m_1, m_2 are the other two pose hypotheses. MCA(Q, K, V) denotes the multi-head cross-attention computation, whose attention calculation is the same as the multi-head attention described above, except that Query, Key and Value come from different hypotheses. After the attention calculation, the information between different channels is mixed through a multi-layer perceptron:

Z̃^l = Concat(Z̃_1^l, …, Z̃_M^l),
Z^l = Z̃^l + MLP(LN(Z̃^l)).

Finally, no partitioning operation is performed, so that the aggregated features can be synthesized into a single pose hypothesis representation Z_M ∈ R^(N×(C·M)).
(5) 3D pose regression, where the specific processing procedure is as follows:
For the Z_M obtained above, a linear conversion layer is used to regress the 3D pose sequence Ỹ ∈ R^(N×J×3); finally, the 3D pose of the center frame Ỹ_center ∈ R^(J×3) is selected from Ỹ as the final prediction result.
The network is trained in an end-to-end fashion, and the loss function used is the Mean Per Joint Position Error (MPJPE). The goal of model training is to minimize the error between the predicted values and the true values:

L_MPJPE = 1/(N·J) · Σ_{n=1}^{N} Σ_{i=1}^{J} ‖Ỹ_i^n − Y_i^n‖_2,

where Ỹ_i^n and Y_i^n respectively denote the predicted 3D coordinates and the ground-truth values of the i-th joint point in the n-th frame.
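A direct PyTorch rendering of this loss, assuming predictions and ground truth as (N, J, 3) tensors:

```python
import torch

def mpjpe(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Per Joint Position Error over all frames and joints.

    pred, target: (frames N, joints J, 3) 3D joint coordinates.
    """
    return torch.linalg.norm(pred - target, dim=-1).mean()
```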
And S104, interactively fusing the respiratory heartbeat feature and the 3D human posture feature to obtain a fused interactive feature, predicting the 3D human posture according to the 3D human posture feature, calculating the similarity between the predicted 3D human posture and a standard action, and outputting an action score and a respiratory state prediction result according to the fused interactive feature.
Interactive fusion of the features is performed using a binary feature cyclic interaction model, and model regression finally yields the action standard-degree score and the respiratory state prediction. The details are as follows:
given a 3D human pose feature Z M ∈R N×(C·M) Features of respiration and heartbeat B ∈ R o Firstly, the vector data is converted into vector data with the same dimension through a full connection layer, and the vector data is respectively expressed as P (P) 1 ,…,p d )∈R d ,Q(q 1 ,…,q d )∈R d (ii) a Then using the projection vector P ∈ R d ,Q∈R d Constructing a cyclic matrix A e R d×d ,B∈R d×d
Figure BDA0003670807510000128
Figure BDA0003670807510000131
In order to make the elements in the projection vector and the circulant matrix fully functional, the circulant matrix and the projection vector are multiplied by:
F=PA,G=QB. (31)
finally, a projection matrix W ∈ R is passed d×k Let F be equal to R d ,G∈R d Conversion to a target vector M ∈ R k . After the model is trained by a large amount of data, the score scoreX of the action standard degree can be predicted according to the target vector. The MSE loss function is used during training:
Figure BDA0003670807510000132
wherein, scoreY i And (5) representing the score condition of the manual marking of the ith sample, wherein m is the total number of samples.
Figure BDA0003670807510000133
For the regularization term, λ is called the regularization parameter, whose role is to control the cut-off between two different targets, thus avoiding overfitting. The training goal is to make the loss function as small as possible.
In a specific example, the classification network may adopt a softmax classifier; the prediction of the respiratory heartbeat state is divided into three classification results, rapid, stable and slow, and cross entropy is used as the loss function for training, so that an optimal prediction result can be obtained.
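A sketch of this binary feature cyclic interaction with both output heads; since the disclosure does not fully specify how W combines F and G, the sketch concatenates them before the projection, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

def circulant(v: torch.Tensor) -> torch.Tensor:
    """d x d circulant matrix whose rows are cyclic shifts of v."""
    d = v.numel()
    idx = (torch.arange(d).unsqueeze(1) - torch.arange(d)) % d
    return v[idx]

class BinaryFeatureCyclicInteraction(nn.Module):
    def __init__(self, pose_dim: int, breath_dim: int, d: int = 256, k: int = 64):
        super().__init__()
        self.fc_pose = nn.Linear(pose_dim, d)
        self.fc_breath = nn.Linear(breath_dim, d)
        self.proj = nn.Linear(2 * d, k)        # plays the role of W ∈ R^(d×k)
        self.score_head = nn.Linear(k, 1)      # action standard-degree score
        self.state_head = nn.Linear(k, 3)      # rapid / stable / slow

    def forward(self, pose_feat: torch.Tensor, breath_feat: torch.Tensor):
        # unbatched sketch: pose_feat (pose_dim,), breath_feat (breath_dim,)
        p = self.fc_pose(pose_feat)             # P ∈ R^d
        q = self.fc_breath(breath_feat)         # Q ∈ R^d
        f = p @ circulant(p)                    # F = P·A, eq. (31)
        g = q @ circulant(q)                    # G = Q·B, eq. (31)
        m = self.proj(torch.cat([f, g]))        # fused interactive feature M ∈ R^k
        return self.score_head(m), self.state_head(m)
```

Training would then combine nn.MSELoss for the score head (with weight decay standing in for the λ‖w‖² term of equation (32)) and nn.CrossEntropyLoss for the three-state respiratory classifier.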
And S105, predicting the 3D human body posture according to the 3D human body posture characteristic, calculating the similarity between the predicted 3D human body posture and the standard action, and outputting an action score according to the fusion interaction characteristic.
A similarity score between each body part of the learner and the standard action is calculated according to the regression result of the learner's estimated 3D human body pose, and the average similarity score is then obtained.
In a specific example, motion pose matching can be performed with a dynamic time warping algorithm based on the feature difference of the 3D pose coordinates (3DPose-DTW); this algorithm can overcome the matching problem of sequences of different lengths and handle actions that run ahead of or behind the reference. Training is generally performed to background music or background guidance prompt tones. Therefore, using a dynamic programming idea, the time series to be recognized and the time series of the standard template are first nonlinearly normalized, the optimal corresponding points between the two sequences are found according to the sound, and the Euclidean distance between the learner's pose feature vector and the standard-action feature vector in the video frames at the corresponding times is then calculated. The learner's 3D pose feature vector Ỹ ∈ R^(J×3) is obtained by regression, and the 3D pose feature vector Ŷ ∈ R^(J×3) of the video frame of the standard motion is likewise obtained by regression; each is expanded through a fully connected layer into a one-dimensional vector, v, v̂ ∈ R^(3J), after which the following Euclidean distance is calculated:

d(v, v̂) = ‖v − v̂‖_2 = sqrt(Σ_{i=1}^{3J} (v_i − v̂_i)²).

The similarity of the two pose features is expressed by the Euclidean distance: the smaller the distance, the more similar the pose features; the larger the distance, the less similar. In the ideal case, d = 0 and the two pose features are exactly the same, so the two actions are judged to be completely identical. To obtain the final score on a percentile basis, the distance is then converted into a score in [0, 100] by a monotonically decreasing mapping, so that smaller distances yield higher scores.
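An illustrative NumPy sketch of this matching step: classic DTW with the Euclidean local cost above, plus a percentile mapping that is only an assumed stand-in for the conversion formula, which the original renders as an image:

```python
import numpy as np

def dtw_pose_distance(learner: np.ndarray, standard: np.ndarray) -> float:
    """DTW alignment of two pose-feature sequences with Euclidean local cost.

    learner: (T1, D) per-frame flattened pose feature vectors; standard: (T2, D).
    Returns the mean Euclidean distance along the optimal warping path.
    """
    t1, t2 = len(learner), len(standard)
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            cost = np.linalg.norm(learner[i - 1] - standard[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[t1, t2] / max(t1, t2)

def percentile_score(distance: float) -> float:
    """Hypothetical percentile mapping: 100 at distance 0, decaying monotonically."""
    return 100.0 / (1.0 + distance)
```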
And performing evaluation guidance on the learner according to the obtained action part average similarity score, the obtained action standard degree score and the respiratory state prediction result.
In a specific example, as a preferred scheme, the average similarity score of the action parts and the action standard-degree score are weighted and summed to obtain a final score, which serves as the quantitative evaluation of the learner. In order to qualitatively evaluate the exercise condition of each part of the learner, according to the obtained action similarity score of a video frame, when the score is below 60 points, the video frame image of the learner can be provided and the parts whose action deviates more can be displayed. Meanwhile, according to the feature analysis of the respiration-heartbeat data and the prediction of the respiratory state, guidance for the learner's breathing adjustment is given.
The invention provides a self-adaptive evaluation system for attitude motion, which comprises:
the signal acquisition and preprocessing module is used for acquiring a video sequence and a respiratory heartbeat echo signal of an object to be detected and preprocessing the respiratory heartbeat echo signal;
the breathing heartbeat feature extraction module is used for inputting the preprocessed breathing heartbeat echo signal into the first network model to obtain breathing heartbeat features;
the 3D human body posture feature extraction module is used for inputting the video sequence into the trained second network model to obtain 3D human body posture features;
the double-element feature cycle interaction module is used for interactively fusing the respiratory heartbeat feature and the 3D human posture feature to obtain a fused interaction feature and outputting an action score and a respiratory state prediction result according to the fused interaction feature;
and the evaluation module is used for predicting the 3D human body posture according to the 3D human body posture characteristics and calculating the similarity between the predicted 3D human body posture and the standard action.
The implementation principle of the system is the same as that of the method, and the details are not repeated here.
The embodiment also provides an electronic device, which includes at least one processor and at least one memory, where the memory stores a computer program, and when the computer program is executed by the processor, the processor executes any one of the steps of the above gesture-like motion adaptive evaluation method, and the specific steps refer to the method embodiment and are not described herein again; in this embodiment, the types of the processor and the memory are not particularly limited, for example: the processor may be a microprocessor, digital information processor, on-chip programmable logic system, or the like; the memory may be volatile memory, non-volatile memory, a combination thereof, or the like.
The present application further provides a storage medium storing a computer program executable by a processor, wherein the computer program causes the processor to execute the steps of any one of the above gesture-based motion adaptive evaluation methods when the computer program runs on the processor. The computer-readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed system may be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be through some service interfaces, indirect coupling or communication connection of systems or modules, and may be electrical or in other forms.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable memory. Based on such understanding, the technical solution of the present application may be substantially implemented or a part of or all or part of the technical solution contributing to the prior art may be embodied in the form of a software product stored in a memory, and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method described in the embodiments of the present application. And the aforementioned memory comprises: various media capable of storing program codes, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program, which is stored in a computer-readable memory, and the memory may include: flash disks, Read-Only memories (ROMs), Random Access Memories (RAMs), magnetic or optical disks, and the like.
The above description is only an exemplary embodiment of the present disclosure, and the scope of the present disclosure should not be limited thereby. That is, all equivalent changes and modifications made in accordance with the teachings of the present disclosure are intended to be included within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A self-adaptive evaluation method for gesture-like motion is characterized by comprising the following steps:
collecting a video sequence and a respiratory heartbeat echo signal of an object to be detected by using a visible light camera and a millimeter wave radar, and preprocessing the respiratory heartbeat echo signal;
inputting the preprocessed breathing heartbeat echo signal into the trained first network model to obtain breathing heartbeat characteristics;
inputting the video sequence into a trained second network model to obtain 3D human body posture characteristics;
interactively fusing the respiratory heartbeat feature and the 3D human body posture feature to obtain a fused interactive feature, and outputting an action score and a respiratory state prediction result according to the fused interactive feature;
and predicting the 3D human body posture according to the 3D human body posture characteristics, and calculating the similarity between the predicted 3D human body posture and the standard action.
2. The method for adaptive evaluation of gesture-like motion according to claim 1, wherein the preprocessing comprises:
mixing the respiration heartbeat echo signal with a transmitting signal of the millimeter wave radar, and then performing low-pass filtering to obtain an intermediate frequency signal;
performing fast Fourier transform on the intermediate frequency signal to obtain a signal frequency domain energy spectrogram, and acquiring target distance information from the signal frequency domain energy spectrogram;
solving phase information based on the distance information to obtain a respiration heartbeat oscillogram;
and expanding the respiration and heartbeat oscillogram into a time-frequency spectrogram through Fourier transform.
3. The method for adaptive evaluation of gesture-like motion according to claim 1, wherein the first network model comprises:
the ResNet50 backbone convolution neural network is used for extracting image characteristics of each frame from the preprocessed breathing heartbeat echo signals;
the long-time and short-time memory network is used for establishing the characteristic fusion of the context information in the time domain to obtain the enhanced characteristic;
a normalization layer to normalize the enhancement features;
and the multilayer perceptron is used for carrying out feature conversion on the normalized features to obtain the breathing heartbeat features.
4. The method for adaptive evaluation of gesture-like motion according to claim 1, wherein the second network model comprises:
a multi-hypothesis pose generation module for generating an initialized plurality of pose hypotheses from the video sequence;
a temporal information embedding module for embedding a temporal position code into the feature representation of the postulated posture;
the single-hypothesis characteristic enhancement module is used for enhancing the characteristics inside the single posture hypothesis;
the multi-hypothesis feature fusion module is used for realizing feature information fusion among a plurality of enhanced attitude hypotheses;
and the 3D posture regression module is used for applying linear transformation operation regression to obtain the 3D human body posture characteristics.
5. The method for adaptive evaluation of posture motion according to claim 4, wherein generating the initialized multiple pose hypotheses comprises:
extracting from the video sequence the 2D pose sequence X ∈ R^(N×J×2) of the human body in each frame, where N is the total number of input frames, J is the total number of human joints, and (x, y) are the joint coordinates; concatenating the coordinates (x, y) of the 2D pose sequence into X' ∈ R^(N×(J·2)); adding a learnable position embedding E_pos ∈ R^(N×(J·2)) to retain the position information of each joint point; and taking the embedding result as the input of a Transformer encoder for feature extraction, with a residual connection, to obtain the multiple initialized pose hypotheses.
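A sketch of this hypothesis generation, assuming MHFormer-style shapes; here each hypothesis gets its own Transformer encoder, and all dimensions (27 frames, 17 joints, 256-d embedding, 3 hypotheses) are illustrative assumptions rather than the patent's values.

```python
import torch
import torch.nn as nn

class MultiHypothesisGenerator(nn.Module):
    def __init__(self, n_frames=27, n_joints=17, dim=256, n_hyp=3):
        super().__init__()
        # Concatenated (x, y) coordinates of all joints -> embedding dimension.
        self.embed = nn.Linear(n_joints * 2, dim)
        # Learnable position embedding retains each frame's position information.
        self.pos_embed = nn.Parameter(torch.zeros(1, n_frames, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoders = nn.ModuleList(
            [nn.TransformerEncoder(layer, num_layers=1) for _ in range(n_hyp)])

    def forward(self, x2d):               # x2d: (batch, N, J, 2) 2D pose sequence
        x = self.embed(x2d.flatten(2))    # splice (x, y) into (batch, N, J*2)
        x = x + self.pos_embed            # add the learnable position embedding
        # Transformer feature extraction plus a residual connection yields
        # one initialized pose hypothesis per encoder.
        return [enc(x) + x for enc in self.encoders]
```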
6. The method for adaptive evaluation of posture motion according to claim 4, wherein enhancing the features within a single pose hypothesis comprises:
first performing layer normalization on each pose hypothesis and then computing its self-attention; obtaining a new feature block after a residual connection; and then blending the different channel information of the single pose hypothesis with a multilayer perceptron to further enhance the features.
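A sketch of this single-hypothesis enhancement as a pre-norm Transformer block; the dimensions and the 4× MLP expansion are assumptions.

```python
import torch
import torch.nn as nn

class SingleHypothesisBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):             # x: (batch, N, dim), one pose hypothesis
        h = self.norm1(x)             # layer normalization first
        a, _ = self.attn(h, h, h)     # self-attention within the hypothesis
        x = x + a                     # residual connection -> new feature block
        # The MLP blends the different channel information to further
        # enhance the features, again with a residual connection.
        return x + self.mlp(self.norm2(x))
```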
7. The method for adaptive evaluation of posture motion according to claim 1, wherein the interactive fusion comprises:
converting the 3D human body posture feature and the respiration and heartbeat feature into d-dimensional vectors through fully connected layers, the converted 3D human body posture feature being denoted P = (p_1, …, p_d) and the converted respiration and heartbeat feature being denoted Q = (q_1, …, q_d);
constructing circulant matrices A and B from the projected vectors, each matrix taking the circulant form
circ(c) = [c_1 c_d … c_2; c_2 c_1 … c_3; …; c_d c_(d-1) … c_1]
with c drawn from the projected feature vectors;
multiplying the circulant matrices by the vectors P and Q to obtain the interaction vectors F = PA and G = QB;
and transforming F and G into the fused interaction feature M of dimension k through a projection matrix W of dimension d×k.
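A NumPy sketch of this circulant interaction. Two points are assumptions on top of the claim text: which projected vector generates which circulant matrix (here A from Q and B from P, so that F = PA and G = QB each cross the two modalities), and how F and G are combined before the d×k projection W (here simply summed).

```python
import numpy as np

def circulant(v):
    # Circulant matrix whose i-th row is v cyclically shifted by i positions.
    return np.stack([np.roll(v, i) for i in range(len(v))])

def circulant_fusion(p, q, W):
    A, B = circulant(q), circulant(p)  # assumed: A built from Q, B from P
    F = p @ A                          # F = PA: posture against shifted heartbeat
    G = q @ B                          # G = QB: heartbeat against shifted posture
    return (F + G) @ W                 # k-dimensional fused interaction feature M

# e.g. with d = 64 and k = 16:
rng = np.random.default_rng(0)
p, q = rng.normal(size=64), rng.normal(size=64)
W = rng.normal(size=(64, 16))
M = circulant_fusion(p, q, W)          # shape (16,)
```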
8. An adaptive evaluation system for posture motion, comprising:
a signal acquisition and preprocessing module for collecting a video sequence and a respiration and heartbeat echo signal of a subject under test and preprocessing the respiration and heartbeat echo signal;
a respiration and heartbeat feature extraction module for inputting the preprocessed respiration and heartbeat echo signal into a trained first network model to obtain respiration and heartbeat features;
a 3D human body posture feature extraction module for inputting the video sequence into a trained second network model to obtain 3D human body posture features;
a binary feature circulant interaction module for interactively fusing the respiration and heartbeat features with the 3D human body posture features to obtain a fused interaction feature, and outputting an action score and a respiration-state prediction result according to the fused interaction feature;
and an evaluation module for predicting the 3D human body posture from the 3D human body posture features and calculating the similarity between the predicted 3D human body posture and a standard action.
9. An electronic device comprising at least one processor and at least one memory module, wherein the memory module stores a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 7.
10. A storage medium, characterized in that it stores a computer program which, when run on a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.
CN202210604517.6A 2022-05-31 2022-05-31 Self-adaptive evaluation method, system, equipment and storage medium for attitude motion Pending CN114973411A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210604517.6A CN114973411A (en) 2022-05-31 2022-05-31 Self-adaptive evaluation method, system, equipment and storage medium for attitude motion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210604517.6A CN114973411A (en) 2022-05-31 2022-05-31 Self-adaptive evaluation method, system, equipment and storage medium for attitude motion

Publications (1)

Publication Number Publication Date
CN114973411A true CN114973411A (en) 2022-08-30

Family

ID=82958194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210604517.6A Pending CN114973411A (en) 2022-05-31 2022-05-31 Self-adaptive evaluation method, system, equipment and storage medium for attitude motion

Country Status (1)

Country Link
CN (1) CN114973411A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129525A (en) * 2023-01-24 2023-05-16 中国人民解放军陆军防化学院 Respiratory protection training evaluation system and method
CN116129525B (en) * 2023-01-24 2023-11-14 中国人民解放军陆军防化学院 Respiratory protection training evaluation system and method
CN117158924A (en) * 2023-08-08 2023-12-05 知榆科技有限公司 Health monitoring method, device, system and storage medium
CN117036891A (en) * 2023-08-22 2023-11-10 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system
CN117036891B (en) * 2023-08-22 2024-03-29 睿尔曼智能科技(北京)有限公司 Cross-modal feature fusion-based image recognition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination