CN116092189A - Bimodal human behavior recognition method based on RGB data and bone data


Info

Publication number
CN116092189A
Authority
CN
China
Prior art keywords
rgb
heat map
attention
frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310010763.3A
Other languages
Chinese (zh)
Inventor
陈良银
石静
张媛媛
廖俊华
刘圣杰
倪浩文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202310010763.3A
Publication of CN116092189A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/70Multimodal biometrics, e.g. combining information from different biometric modalities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a bimodal human behavior recognition method based on RGB data and skeleton data, which uses different network structures to accurately capture the motion information in the skeleton modality and the spatial information in the RGB modality, and addresses the difficulty of fusing skeleton information and RGB information within a Transformer framework so that both are exploited to full effect. First, a pseudo heat map is generated from the skeleton data, which avoids the lack of robustness caused by representing the skeleton as a graph and the inability to handle multi-person scenes. Then, a dual-stream Transformer architecture with different numbers of attention layers and different window sizes is designed, and the pseudo heat maps and RGB frames are input into the dual-stream architecture at different temporal and spatial resolutions. Finally, experiments verify that the proposed behavior recognition method achieves higher accuracy and can handle behavior recognition in multi-person scenes. The dual-stream structure and the skeleton heat map generation scheme of the method are suitable for behavior recognition in a variety of public surveillance settings.

Description

Bimodal human behavior recognition method based on RGB data and bone data
Technical Field
The invention relates to the field of human behavior recognition, in particular to video-based human behavior recognition, and specifically to a human behavior recognition method based on RGB data and skeleton heat map data.
Background
Human behavior recognition refers to the retrieval and recognition of human behaviors from human behavior data acquired through surveillance video, motion capture and the like. Human behavior recognition has broad application prospects in public safety, intelligent transportation, medical monitoring and production safety. Widely used human behavior recognition techniques can be divided into two categories according to the input data: human behavior recognition based on RGB video, and human behavior recognition based on skeleton data. Compared with skeleton data, RGB video is very convenient to collect and can provide detailed and rich appearance information such as shape, color and texture, but it is very sensitive to lighting conditions, shooting angles and the like; when illumination is weak, the accuracy of RGB-video-based human behavior recognition algorithms drops sharply. In contrast, skeleton data cannot provide specific appearance information, but it is robust to illumination conditions and shooting angles, and good recognition accuracy can be obtained through methods such as graph convolutional networks (GCN). However, skeleton data is generally difficult to collect: accurate skeletal key-point information can be obtained with a motion capture system, but in many application scenarios motion capture systems are difficult to deploy widely. The development of pose estimation algorithms provides a more convenient way to extract skeleton data: feeding an RGB video into a pose estimation algorithm yields the skeleton data of the corresponding video, and combining the RGB data and the skeleton data as the input of a human behavior recognition algorithm can achieve higher accuracy than algorithms that rely on RGB data or skeleton data alone.
There has been work on combining RGB data with skeleton data to improve the accuracy of human behavior recognition. Luvizon et al. jointly address pose estimation and behavior recognition, achieving efficient pose extraction and behavior recognition through a mechanism shared between the two tasks. Das et al. propose a pose-driven spatio-temporal attention mechanism and apply it to 3D CNNs for human behavior recognition; in their follow-up work, they compute spatio-temporal features by adding attention mechanisms over the skeleton topology. Li et al. propose a dual-stream network structure with three main modules: an ST-GCN module to extract skeleton features, an R(2+1)D network to extract RGB features, and a module that uses the two features to enhance motion-related information in the RGB video, with score fusion finally used to obtain the classification result. Cai et al. also adopt a dual-stream network architecture, but differ in that the input to the other stream, besides the skeleton data stream, is key-point-aligned flow data extracted from the RGB video. The above works treat skeleton data as a topological graph, but compared with RGB data the topological graph is not a robust representation, and the loss of some skeletal key-point information has a large influence on the whole skeleton data. Jing and Wang propose a two-path network based on ViT, whose inputs are RGB frames with different frame rates and different resolutions, and fuse the skeleton data with the RGB data in three different fusion modes to obtain the final classification result; the skeleton data is fused by encoding the skeleton into tokens and embedding them together with the RGB data. This fusion mode still does not avoid the non-robustness of skeleton data, and at the same time increases the feature dimension and thereby the algorithm complexity.
Disclosure of Invention
The invention aims to provide a Transformer-based skeleton and RGB bimodal behavior recognition method. A dual-stream Transformer architecture with RGB frames and skeleton heat maps as inputs is presented. The RGB frames and skeleton heat maps are input into the Transformer at different temporal and spatial resolutions, and different numbers of attention layers and window sizes are used for the two inputs inside the network. In addition, to avoid the robustness problems caused by representing the skeleton as a graph, a heat map is generated for the skeleton, which improves robustness and makes it possible to handle behavior recognition in multi-person scenes. A dual-stream lateral-connection fusion scheme is provided, which reduces the influence of noisy backgrounds in the behavior recognition algorithm and addresses the problem of missing key appearance information in person-interaction behaviors.
The aim of the invention is achieved as follows:
A dual-stream Transformer architecture with the RGB and skeleton modalities as inputs is presented. By using a pose estimation algorithm and generating skeleton heat maps, the reduced robustness of representing the skeleton as a graph is avoided, and fusion with the RGB information becomes more convenient. Owing to the different spatial resolutions of the inputs and the different structures inside the network, the RGB stream captures spatial information while the skeleton stream accurately captures motion information. Finally, the fusion method provided by the invention is used inside the network to fully fuse the appearance information of the RGB modality with the motion information of the skeleton modality, which compensates for the insufficiency of single-modality information and reduces the influence of noisy backgrounds.
The specific scheme is as follows:
A Transformer-based dual-stream RGB and skeleton bimodal behavior recognition method comprises the following steps:
Step 1, acquire skeleton information from the RGB video using a pose estimation algorithm.
Step 2, generate a skeleton heat map.
Step 3, sample the RGB video and the skeleton heat map.
Step 4, input both into the dual-stream Transformer structure.
Step 5, acquire the classification token information and perform lateral fusion.
Step 6, acquire the network output and map it to a classification result in the linear classifier.
Further, in step 1, the data owner uses HRNet with a model pre-trained on the COCO-keypoint dataset to perform pose estimation on the RGB video.
Further, in step 2, after the 2D poses resulting from step 1 are obtained, a skeleton heat map is generated in the following manner. For each skeletal joint:
J_k(i, j) = exp(-((i - x_k)^2 + (j - y_k)^2) / (2σ^2)) · c_k
wherein (x_k, y_k) represents the coordinates of the k-th joint, c_k represents the confidence of the k-th joint, and σ controls the spread of the Gaussian.
For the limb:
L_k(i, j) = exp(-D((i, j), seg[a_k, b_k])^2 / (2σ^2)) · min(c_{a_k}, c_{b_k})
wherein D((i, j), seg[a_k, b_k]) represents the distance from the point (i, j) to the line segment seg[a_k, b_k] between the two endpoint joints a_k and b_k, and c_{a_k}, c_{b_k} are the confidences of the two endpoints. All heat maps are then organized along the time dimension into a heat map video.
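The following is a minimal NumPy sketch of the heat-map construction described above. The value of σ, the frame size, the limb list, and the use of the minimum endpoint confidence for limb maps are illustrative assumptions, not values fixed by the invention.

```python
import numpy as np

def joint_heatmaps(h, w, joints, conf, sigma=2.0):
    """One Gaussian heat map per joint: exp(-((i-x_k)^2+(j-y_k)^2)/(2*sigma^2)) * c_k."""
    yy, xx = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(joints), h, w), dtype=np.float32)
    for k, ((x, y), c) in enumerate(zip(joints, conf)):
        maps[k] = np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2)) * c
    return maps

def limb_heatmaps(h, w, joints, conf, limbs, sigma=2.0):
    """One Gaussian heat map per limb, built from the point-to-segment distance D((i, j), seg[a_k, b_k])."""
    yy, xx = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(limbs), h, w), dtype=np.float32)
    for k, (a, b) in enumerate(limbs):
        pa, pb = np.asarray(joints[a], float), np.asarray(joints[b], float)
        d = pb - pa
        # project every pixel onto the segment and clamp to [0, 1]
        t = ((xx - pa[0]) * d[0] + (yy - pa[1]) * d[1]) / (d @ d + 1e-6)
        t = np.clip(t, 0.0, 1.0)
        dist2 = (xx - (pa[0] + t * d[0])) ** 2 + (yy - (pa[1] + t * d[1])) ** 2
        maps[k] = np.exp(-dist2 / (2 * sigma ** 2)) * min(conf[a], conf[b])
    return maps
```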
Further, in step 3 the heat map video obtained in step 2 and the RGB video are uniformly sampled: for the heat map video, 32 heat map frames are uniformly sampled along the time dimension; for the RGB video, 8 RGB frames are uniformly sampled along the time dimension.
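A small sketch of the uniform temporal sampling just described; the total clip length of 120 frames is only an example.

```python
import numpy as np

def uniform_sample(num_total, num_out):
    """Pick num_out frame indices spread uniformly over a clip of num_total frames."""
    return np.round(np.linspace(0, num_total - 1, num_out)).astype(int)

heatmap_idx = uniform_sample(120, 32)  # 32 heat map frames
rgb_idx = uniform_sample(120, 8)       # 8 RGB frames (4x lower temporal resolution)
```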
Further, before input to the Transformer, the RGB video is resized so that its short side is 320 pixels and then cropped to 224×224: random cropping is used in the training stage, center cropping in the validation stage, and in the test stage three 224×224 crops are taken at the upper-left, center and lower-right spatial positions, each input to the network, with the softmax average taken as the final result. For the heat map video, when the heat maps are generated, the smallest detection box covering all target persons is obtained, zero padding is applied inside the box, and the background outside the box, which is irrelevant to human behavior recognition, is cropped away; the finally generated heat maps are 224×224, so no further cropping or resizing is required.
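A sketch of the test-time three-crop scheme described above, assuming the clip has already been resized so that its short side is 320 pixels; the tensor layout (T, C, H, W) is an assumption.

```python
import torch

def three_crop_224(frames):
    """Return 224x224 crops at the upper-left, center and lower-right of each frame.
    frames: tensor of shape (T, C, H, W) with min(H, W) >= 224."""
    _, _, h, w = frames.shape
    size = 224
    offsets = [(0, 0), ((h - size) // 2, (w - size) // 2), (h - size, w - size)]
    return [frames[:, :, top:top + size, left:left + size] for top, left in offsets]
```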
For the RGB video and the heat map video, the same decomposition operation is used. Taking the heat map video as an example, it is decomposed into N non-overlapping space-time "tubes" x_1, x_2, ..., x_N ∈ R^{t×h×w×3}, where N = (T/t)·(H/h)·(W/w) for an input clip of size T×H×W. Next, each tube x_i is linearly mapped to a token (embedding) z_i = E x_i. Finally, all embeddings z_i are concatenated to form the initial sequence z_0. A special learnable vector z_cls ∈ R^d representing the classification-token embedding is added at the first position. A positional embedding p_pos ∈ R^{(N+1)×d} is also added to this sequence.
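A hedged PyTorch sketch of the tube decomposition and token construction just described. The tube size (t, h, w), the embedding dimension d, and the use of a 3D convolution as the linear map E are illustrative choices, not details fixed by the text.

```python
import torch
import torch.nn as nn

class TubeEmbedding(nn.Module):
    """Split a clip into non-overlapping t x h x w tubes, map each tube to a d-dim token,
    prepend a learnable classification token z_cls and add positional embeddings p_pos."""
    def __init__(self, clip_t=32, clip_h=224, clip_w=224, t=2, h=16, w=16, d=768):
        super().__init__()
        self.proj = nn.Conv3d(3, d, kernel_size=(t, h, w), stride=(t, h, w))  # z_i = E x_i
        n = (clip_t // t) * (clip_h // h) * (clip_w // w)                      # number of tubes N
        self.cls = nn.Parameter(torch.zeros(1, 1, d))                          # z_cls
        self.pos = nn.Parameter(torch.zeros(1, n + 1, d))                      # p_pos

    def forward(self, video):                                # video: (B, 3, T, H, W)
        z = self.proj(video).flatten(2).transpose(1, 2)      # (B, N, d)
        cls = self.cls.expand(z.size(0), -1, -1)
        return torch.cat([cls, z], dim=1) + self.pos         # (B, N+1, d)
```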
The RGB stream and the skeleton stream use the same attention mechanism. The mechanism first computes frame-level spatial attention among tokens at the same time index. The skeleton stream uses L' = 10 spatial attention layers and the RGB stream uses L = 12. For the spatial attention module of layer l, the query/key/value are first computed:
q^{(l,a)}_{(p,t)} = W_Q^{(l,a)} LN(z^{(l-1)}_{(p,t)})
k^{(l,a)}_{(p,t)} = W_K^{(l,a)} LN(z^{(l-1)}_{(p,t)})
v^{(l,a)}_{(p,t)} = W_V^{(l,a)} LN(z^{(l-1)}_{(p,t)})
where a = 1, ..., A indexes the attention heads, p = 1, ..., N indexes the spatial positions, t indexes the time position, LN denotes layer normalization, and z^{(l-1)}_{(p,t)} denotes the output of the previous layer. Spatial attention is then calculated:
α^{(l,a),space}_{(p,t)} = softmax( (q^{(l,a)}_{(p,t)})^T / √D_h · [ k^{(l,a)}_{(0,0)}, { k^{(l,a)}_{(p',t)} }_{p'=1,...,N} ] )
wherein D_h = d/A is the dimension of each attention head. The output of layer l can then be obtained by the following formulas:
s^{(l,a)}_{(p,t)} = α^{(l,a),space}_{(p,t),(0,0)} v^{(l,a)}_{(0,0)} + Σ_{p'=1}^{N} α^{(l,a),space}_{(p,t),(p',t)} v^{(l,a)}_{(p',t)}
z'^{(l)}_{(p,t)} = W_O [ s^{(l,1)}_{(p,t)} ; ... ; s^{(l,A)}_{(p,t)} ] + z^{(l-1)}_{(p,t)}
where s represents the output vectors of all the attention heads.
After the attention of a layer is computed, the output is passed to that layer's MLP block, which contains two linear layers separated by a GELU function:
z^{(l)}_{(p,t)} = MLP( LN( z'^{(l)}_{(p,t)} ) ) + z'^{(l)}_{(p,t)}
At this point the data owner obtains the frame-level spatial attention representation: after the last spatial attention layer, the classification-token output of frame t, z^{(L')}_{(0,t)} ∈ R^d, can be regarded as the classification feature of that frame, so it is written as the frame-level expression h_t ∈ R^d, and all frame-level expressions are collected into H = [h_1, h_2, ..., h_F] ∈ R^{F×d}, where F is the number of sampled frames.
After the spatial attention block outputs are obtained, dual-stream fusion can be performed.
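A simplified PyTorch sketch of one spatial-only attention layer: tokens attend only to tokens of the same frame, so each frame produces its own frame-level feature. The handling of the classification token and the residual/normalization layout follow common ViT practice and are assumptions rather than details fixed by the text; stacking L' = 10 such layers for the skeleton stream and L = 12 for the RGB stream and reading out token 0 of every frame would give the frame-level expressions h_t.

```python
import torch
import torch.nn as nn

class FrameSpatialAttention(nn.Module):
    """Divided (spatial-only) self-attention over the patches of each frame."""
    def __init__(self, d=768, heads=12):
        super().__init__()
        self.norm1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, tokens):                 # tokens: (B, F, P, d) = frames x patches
        b, f, p, d = tokens.shape
        x = tokens.reshape(b * f, p, d)        # restrict attention to one frame at a time
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x.reshape(b, f, p, d)
```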
Further, in step 5 the frame-level expressions obtained by the skeleton stream in step 4 are averaged in groups of four consecutive frames, so that they can be combined with the frame-level expressions obtained by the RGB stream at the same time positions.
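A small sketch of the lateral fusion just described: the 32 skeleton frame-level tokens are averaged in groups of four so they line up with the 8 RGB frame-level tokens. Combining the aligned tokens by addition is an assumption; the text only states that the two expressions are combined at the same time positions.

```python
import torch

def lateral_fusion(skel_frames, rgb_frames):
    """skel_frames: (B, 32, d) skeleton frame-level tokens; rgb_frames: (B, 8, d)."""
    b, f_skel, d = skel_frames.shape
    pooled = skel_frames.reshape(b, f_skel // 4, 4, d).mean(dim=2)  # (B, 8, d)
    return rgb_frames + pooled                                      # combined tokens, (B, 8, d)
```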
Further, the output of the spatial attention module is sent to the temporal attention module, whose attention is computed as:
α^{(l,a),time}_{(p,t)} = softmax( (q^{(l,a)}_{(p,t)})^T / √D_h · [ k^{(l,a)}_{(0,0)}, { k^{(l,a)}_{(p,t')} }_{t'=1,...,F} ] )
Thereafter, the frame-level representation is obtained in the same way as for spatial attention.
After L_t = 4 temporal attention layers, the obtained classification token is input into the MLP head.
Further, in step 6 the data owner inputs the output of step 5 into the linear classification layer to obtain class scores, and the scores of the two streams are averaged to obtain the final classification result.
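A minimal sketch of the final score fusion: each stream's linear classifier produces class scores and the two are averaged. Applying softmax before averaging mirrors the test-time procedure described earlier and is an assumption here.

```python
import torch

def fuse_scores(logits_rgb, logits_skel):
    """Average the class scores of the two streams and return the predicted class."""
    scores = (torch.softmax(logits_rgb, dim=-1) + torch.softmax(logits_skel, dim=-1)) / 2
    return scores.argmax(dim=-1)
```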
The invention has the positive effects that:
(1) A dual-stream Transformer framework for the RGB and skeleton modalities is provided; the framework integrates the advantages of skeleton data and RGB data and thereby improves the accuracy of behavior recognition.
(2) The framework generates a skeleton heat map from the skeleton data, which avoids the influence on recognition accuracy that different pose extractors introduce when the skeleton is represented as a graph, as well as the rapidly growing computation of multi-person scenes that a graph representation cannot handle. The skeleton heat map and the RGB frames are then input into the dual-stream network at different temporal and spatial resolutions. Within the network, the skeleton heat map uses fewer attention layers, so that the skeleton stream can extract motion information more accurately.
(3) Meanwhile, a new lateral-connection fusion scheme between the two streams is provided, which fuses the information from the skeleton stream with the information from the RGB stream.
Drawings
Fig. 1 is a structure diagram of the dual-stream framework.
Fig. 2 is a diagram of the skeleton stream attention mechanism.
Fig. 3 is a detail view of the dual-stream lateral connection.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
Step 1: as shown in Fig. 1, we acquire skeleton data from the original RGB video using a pose extractor, and in step 2 we generate the skeleton heat map.
Step 3: as shown in Fig. 1, the skeleton heat map and the RGB frames are input into the dual-stream Transformer at different temporal resolutions; the frame rate of the heat map is 4 times that of the RGB frames. In step 4, the same attention mechanism is used in the Transformer: spatial attention is computed first to obtain the frame-level expressions, the lateral connection of step 5 is applied, temporal attention is then computed, and finally the result is input to the MLP head. The class scores of the two streams are obtained and averaged to give the final classification result.
The detailed structure of step 4 is shown in Fig. 2. The input data is first decomposed into non-overlapping tubes. After the tubes are mapped to tokens and the classification token is added, the sequence is input to the attention mechanism; after L' layers of spatial attention, the result is output to the temporal attention, and finally input to the MLP head.
The detailed structure of step 5 is shown in Fig. 3. After the spatial attention layers, the two streams obtain frame-level expressions; the frame-level expressions of the skeleton stream are averaged, the frame-level expressions of the two streams are combined at the same time positions, and the result is then input to the temporal attention layers.

Claims (7)

1. The bimodal human behavior recognition method based on RGB data and skeleton data is characterized by comprising the following steps:
step 1, acquiring skeleton information from the RGB video using a pose estimation algorithm;
step 2, generating a skeleton heat map;
step 3, sampling the RGB video and the skeleton heat map;
step 4, inputting both into the dual-stream Transformer structure;
step 5, acquiring the classification token information for lateral fusion;
and step 6, acquiring the network output and mapping it to a classification result in the linear classifier.
2. The method of claim 1, wherein in step 1 the data owner uses HRNet with a model pre-trained on the COCO-keypoint dataset to perform pose estimation on the RGB video.
3. The method of claim 1, wherein in step 2, after the 2D poses resulting from step 1 are obtained, the data owner generates a skeleton heat map in the following manner; for each skeletal joint:
J_k(i, j) = exp(-((i - x_k)^2 + (j - y_k)^2) / (2σ^2)) · c_k
wherein (x_k, y_k) represents the coordinates of the k-th joint, c_k represents the confidence of the k-th joint, and σ controls the spread of the Gaussian;
for the limb:
L_k(i, j) = exp(-D((i, j), seg[a_k, b_k])^2 / (2σ^2)) · min(c_{a_k}, c_{b_k})
wherein D((i, j), seg[a_k, b_k]) represents the distance from the point (i, j) to the line segment seg[a_k, b_k] between the two endpoint joints a_k and b_k, and c_{a_k}, c_{b_k} are the confidences of the two endpoints; all heat maps are then organized along the time dimension into a heat map video.
4. The method of claim 1, wherein in step 3 the data owner uniformly samples the heat map video obtained in step 2 and the RGB video: for the heat map video, 32 heat map frames are uniformly sampled along the time dimension; for the RGB video, 8 RGB frames are uniformly sampled along the time dimension.
5. The method of claim 1, wherein before input to the Transformer, the data owner resizes the RGB video so that its short side is 320 pixels and crops it to 224×224: random cropping is used in the training phase, center cropping in the verification phase, and in the testing phase three 224×224 crops are taken at the upper-left, center and lower-right spatial positions, each input to the network, with the softmax average taken as the final result; for the heat map video, when the heat maps are generated, the smallest detection box covering all target persons is obtained, zero padding is applied inside the box, and the background outside the box, which is irrelevant to human behavior recognition, is cropped away; the finally generated heat maps are 224×224, so no further cropping or resizing is required;
for the RGB video and the heat map video, the same decomposition operation is adopted; taking the heat map video as an example, it is decomposed into N non-overlapping space-time "tubes" x_1, x_2, ..., x_N ∈ R^{t×h×w×3}, where N = (T/t)·(H/h)·(W/w) for an input clip of size T×H×W; next, each tube x_i is linearly mapped to a token (embedding) z_i = E x_i; finally, all embeddings z_i are concatenated to form the initial sequence z_0; a special learnable vector z_cls ∈ R^d representing the classification-token embedding is added at the first position; a positional embedding p_pos ∈ R^{(N+1)×d} is also added to this sequence;
the RGB streams and the skeletal streams use the same attention mechanism; the mechanism first calculates the frame-level spatial attention at the same time pointer; the spatial attention layer number of the bone stream is L' =10, and the spatial attention layer number of the rgb stream is l=12; for the spatial attention module of layer l, first calculate query/key/value:
Figure FDA0004038215380000023
Figure FDA0004038215380000024
Figure FDA0004038215380000025
where a=1,..a represents the attention head, p=1..n represents a spatial position,
Figure FDA0004038215380000026
representing a time position;
Figure FDA0004038215380000027
an output representing the previous layer; spatial attention is then calculated:
α^{(l,a),space}_{(p,t)} = softmax( (q^{(l,a)}_{(p,t)})^T / √D_h · [ k^{(l,a)}_{(0,0)}, { k^{(l,a)}_{(p',t)} }_{p'=1,...,N} ] )
wherein D_h = d/A is the dimension of each attention head; the output of layer l is then obtained by the following formulas:
s^{(l,a)}_{(p,t)} = α^{(l,a),space}_{(p,t),(0,0)} v^{(l,a)}_{(0,0)} + Σ_{p'=1}^{N} α^{(l,a),space}_{(p,t),(p',t)} v^{(l,a)}_{(p',t)}
z'^{(l)}_{(p,t)} = W_O [ s^{(l,1)}_{(p,t)} ; ... ; s^{(l,A)}_{(p,t)} ] + z^{(l-1)}_{(p,t)}
where s represents the output vectors of all the attention heads;
after calculating the L' layer, the output is passed to the MLP layer, which contains one gel function and two linear layers separated by the gel function:
Figure FDA0004038215380000032
at this point the data owner obtains the frame-level spatial attention representation: after the last spatial attention layer, the classification-token output of frame t, z^{(L')}_{(0,t)} ∈ R^d, can be regarded as the classification feature of that frame, so it is written as the frame-level expression h_t ∈ R^d, and all frame-level expressions are collected into H = [h_1, h_2, ..., h_F] ∈ R^{F×d}; after the spatial attention block outputs are obtained, dual-stream fusion can be performed.
6. The method of claim 1, wherein the data owner averages the frame-level expressions obtained by the skeleton stream in step 4 in groups of four consecutive frames, so that they can be combined with the frame-level expressions obtained by the RGB stream at the same time positions;
further, the output of the spatial attention module is sent to the temporal attention module, whose attention is computed as:
α^{(l,a),time}_{(p,t)} = softmax( (q^{(l,a)}_{(p,t)})^T / √D_h · [ k^{(l,a)}_{(0,0)}, { k^{(l,a)}_{(p,t')} }_{t'=1,...,F} ] )
thereafter, the frame-level representation is obtained in the same way as for spatial attention; after L_t = 4 temporal attention layers, the obtained classification token is input into the MLP head.
7. The method of claim 1, wherein in step 6 the data owner inputs the output of step 5 into the linear classification layer to obtain class scores, and the scores of the two streams are averaged to obtain the final classification result.
CN202310010763.3A 2023-01-05 2023-01-05 Bimodal human behavior recognition method based on RGB data and bone data Pending CN116092189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310010763.3A CN116092189A (en) 2023-01-05 2023-01-05 Bimodal human behavior recognition method based on RGB data and bone data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310010763.3A CN116092189A (en) 2023-01-05 2023-01-05 Bimodal human behavior recognition method based on RGB data and bone data

Publications (1)

Publication Number Publication Date
CN116092189A (en) 2023-05-09

Family

ID=86187874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310010763.3A Pending CN116092189A (en) 2023-01-05 2023-01-05 Bimodal human behavior recognition method based on RGB data and bone data

Country Status (1)

Country Link
CN (1) CN116092189A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117114083A (en) * 2023-10-25 2023-11-24 阿米华晟数据科技(江苏)有限公司 Method and device for constructing attitude estimation model and attitude estimation method
CN117114083B (en) * 2023-10-25 2024-02-23 阿米华晟数据科技(江苏)有限公司 Method and device for constructing attitude estimation model and attitude estimation method

Similar Documents

Publication Publication Date Title
WO2021098261A1 (en) Target detection method and apparatus
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
CN108062525B (en) Deep learning hand detection method based on hand region prediction
CN109919977B (en) Video motion person tracking and identity recognition method based on time characteristics
CN113240691B (en) Medical image segmentation method based on U-shaped network
CN110555412B (en) End-to-end human body gesture recognition method based on combination of RGB and point cloud
CN107767419A (en) A kind of skeleton critical point detection method and device
CN110728220A (en) Gymnastics auxiliary training method based on human body action skeleton information
CN107025661B (en) Method, server, terminal and system for realizing augmented reality
CN114187665B (en) Multi-person gait recognition method based on human skeleton heat map
CN113283444B (en) Heterogeneous image migration method based on generation countermeasure network
CN111199207B (en) Two-dimensional multi-human body posture estimation method based on depth residual error neural network
CN112926475B (en) Human body three-dimensional key point extraction method
CN111783520A (en) Double-flow network-based laparoscopic surgery stage automatic identification method and device
CN114399838A (en) Multi-person behavior recognition method and system based on attitude estimation and double classification
CN113792641A (en) High-resolution lightweight human body posture estimation method combined with multispectral attention mechanism
CN116092189A (en) Bimodal human behavior recognition method based on RGB data and bone data
CN117095128A (en) Priori-free multi-view human body clothes editing method
CN117409476A (en) Gait recognition method based on event camera
CN113283372A (en) Method and apparatus for processing image of person
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN116109673A (en) Multi-frame track tracking system and method based on pedestrian gesture estimation
CN117152829A (en) Industrial boxing action recognition method of multi-view self-adaptive skeleton network
CN113255514B (en) Behavior identification method based on local scene perception graph convolutional network
CN115359513A (en) Multi-view pedestrian detection method based on key point supervision and grouping feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination