CN116721468A - Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection - Google Patents

Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Info

Publication number
CN116721468A
CN116721468A (application CN202310762371.2A)
Authority
CN
China
Prior art keywords: amplitude, gesture, motion, image, action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310762371.2A
Other languages
Chinese (zh)
Inventor
帅千钧
何健强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN202310762371.2A
Publication of CN116721468A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using neural networks
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an intelligent guided-broadcast switching method based on multi-person pose estimation and motion amplitude detection. Frames are first extracted from the video as images and input into a built and trained multi-person pose estimation model, which detects joint-point coordinates and their association information. The motion amplitude detection module explicitly accounts for differences in picture depth of field and person scale: based on the joint-point coordinates and association information, a motion amplitude detection algorithm performs a normalized calculation to obtain pose feature values, and the motion amplitude is judged from these values. If any one pose feature value exceeds the threshold set by the system, the action is judged to be a large-amplitude action; otherwise it is judged to be a small-amplitude action. Finally, the intelligent guided-broadcast system switches shots according to the detection result. The invention aims to improve the accuracy of human joint-point localization and motion amplitude detection, and achieves automatic or assisted program production.

Description

Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection
Technical Field
The invention relates to an intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection, and belongs to the technical field of artificial intelligence.
Background
Performance-type programs are usually shot from multiple camera positions covering different angles of the stage, and the director selects, switches, and broadcasts the shots from these positions; the choice of shots plays a vital role in the overall effect and expressiveness of the program. However, the traditional directing workflow requires an experienced directing team, consumes a great deal of manpower and material resources, and needs a long production cycle for multi-camera shooting, directed switching, and post-editing, making the production process cumbersome. Especially in live performance scenes, the director must rapidly judge multiple groups of camera pictures and capture the performers' highlight moments, so the selection and switching of shots is particularly important. Intelligent broadcast directing based on artificial intelligence aims to solve the directing problem for performance programs of various forms: it provides an automatic, intelligent decision method and performs directed switching through real-time recognition of the state of the people in the scene. Applying this technology can greatly shorten the production cycle, reduce manpower and material costs, and improve production efficiency and program quality. Unlike action recognition and action detection, which aim to detect the type of an action, the motion amplitude detection of this method aims to detect the stretching-variation amplitude of human motion in a specific scene, and belongs to a new application direction in the field of artificial intelligence. Motion amplitude influences a performer's style and also helps the performer convey more emotion and dynamics on stage. By analyzing the video frame by frame with motion amplitude detection, the real-time state of hosts and guests is recognized and the result is fed back to the intelligent guided-broadcast system, achieving automatic or assisted program production; this has significant research and application value.
With the development of deep learning, current deep-learning-based motion amplitude detection follows two main ideas. The first is motion amplitude detection based on image classification, which uses image classification techniques to recognize the performer's actions and thus realizes motion amplitude detection in a simple way; however, it lacks pose-feature analysis of the motion, and its real-time performance and multi-person analysis capability remain challenging. The second is motion amplitude detection based on multi-person pose estimation, which mainly detects human joints such as the shoulders, knees, and ankles. The amplitude change of a performance action is mainly reflected by the joints below the head, and the relative positions and motion of these joints are analyzed to detect the performance motion amplitude. This approach therefore reduces interference from the background environment, focuses on the person, and offers fast computation and accurate detection.
Chinese patent CN201910530100.8 discloses a method for measuring the motion amplitude of a person: a first video and a second video are parsed into frame sequences, the last frame of the first video and the first frame of the second video are extracted, corresponding joint points in the two frames are found with a joint-point detection algorithm, the average displacement of each pair of corresponding joint points is calculated, and the average displacement Dis is normalized to obtain the motion amplitude value of the two frames. The stated benefit is that the method calculates the motion amplitude of the next frame relative to the previous frame and uses it as a criterion for frame interpolation: frame pairs with larger motion amplitude are cut, frame pairs with smaller motion amplitude are interpolated, and the resulting spliced video is smoother.
As in that patent, current motion amplitude detection relies primarily on the relative displacement of joint points between adjacent video frames. In a performance scene, however, the variety of actions makes existing motion amplitude detection algorithms prone to misjudgment, so their reliability is low. In addition, the multi-person pose estimation algorithms used in current artificial intelligence applications are two-stage models, either top-down or bottom-up: top-down models suffer from large memory requirements, poor real-time performance, and high computational cost, while bottom-up models are strongly affected by complex backgrounds and are prone to misjudgment and mismatching, so existing multi-person pose estimation technology still needs improvement. At present there is no motion amplitude detection algorithm based on multi-person pose estimation that offers both high speed and high accuracy and thus meets the requirements of an intelligent guided-broadcast system in performance scenes.
In summary, existing motion amplitude detection schemes have the following drawbacks:
1. They occupy a large amount of computing resources:
the models require manual operations such as cropping, non-maximum suppression, and grouping, and rely on an auxiliary object detection model for motion amplitude detection, which consumes substantial computing resources.
2. The multi-person pose estimation model cannot balance speed and accuracy:
existing motion-analysis methods based on multi-person pose estimation (including action recognition, action classification, and motion amplitude detection) almost all use two-stage, top-down or bottom-up pose estimation models, so they cannot achieve fully end-to-end multi-person pose estimation with balanced speed and accuracy.
3. The accuracy of motion amplitude detection is low:
motion amplitude detection is mainly realized through image classification or the relative displacement of joint points between adjacent video frames; the lack of human pose feature analysis makes it prone to misjudgment, with poor reliability and low accuracy.
Disclosure of Invention
To solve these technical problems, the invention provides an intelligent guided-broadcast switching method based on multi-person pose estimation and motion amplitude detection, which addresses the large computing-resource consumption caused by multi-person pose estimation models that cannot balance speed and accuracy, as well as the low accuracy of motion amplitude detection, and achieves excellent results. The invention applies an End-to-End Multi-Person Pose Estimation with Transformers (PETR) model to motion amplitude detection, explicitly considers differences in scene depth of field and person scale, computes pose feature values through a normalized motion amplitude detection algorithm based on joint-point coordinates, and judges the motion amplitude accordingly, thereby realizing intelligent guided-broadcast switching, achieving automatic or assisted program production, saving manpower and material resources, and improving working efficiency.
The technical scheme adopted by the invention is an intelligent guided-broadcast switching method based on multi-person pose estimation and motion amplitude detection, comprising the following steps:
Step 1, extracting frames from the performance video captured by the cameras to obtain images;
Step 2, building and training a multi-person pose estimation model;
Step 3, inputting the images from step 1 into the multi-person pose estimation model from step 2, and computing each pose feature with a motion amplitude detection algorithm;
Step 4, judging the motion amplitude from the pose features obtained in step 3;
Step 5, switching shots according to the motion amplitude judgment result.
The step 1 specifically includes:
multi-person pose videos are acquired from the plurality of cameras, and multi-person pose images are obtained from these videos.
Step 1.1, video frames are selected from each video at intervals of n frames, and the other, redundant video frames are directly discarded; n is a positive integer greater than 1, e.g. n = 10.
Step 1.2, each selected video frame is named according to its camera position to obtain a plurality of images, where each image represents the current program picture of one camera position. For example, 1_01.jpg represents the first frame image of camera No. 1.
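A minimal frame-extraction sketch in Python is given below, assuming OpenCV is available and that each camera feed is a separate video file; the function name, the interval n, and the zero-padded naming scheme are illustrative choices that merely follow the 1_01.jpg convention described above.

```python
# Sketch of steps 1.1-1.2: keep every n-th frame of one camera's video and name it
# "<camera>_<index>.jpg". OpenCV (cv2) is assumed; error handling is omitted.
import os
import cv2

def extract_frames(video_path: str, camera_id: int, out_dir: str, n: int = 10):
    """Save every n-th frame as '<camera_id>_<key frame index>.jpg' and return the names."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % n == 0:                       # other frames are redundant and discarded
            name = f"{camera_id}_{frame_idx // n + 1:02d}.jpg"
            cv2.imwrite(os.path.join(out_dir, name), frame)
            saved.append(name)
        frame_idx += 1
    cap.release()
    return saved

# extract_frames("camera1.mp4", camera_id=1, out_dir="frames", n=10)
# produces 1_01.jpg, 1_02.jpg, ..., one image per key frame of camera No. 1.
```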
The step 2 specifically includes:
the multi-person pose estimation model is the PETR model, which mainly comprises a backbone network module, a position encoding module, a visual feature encoder module, a pose decoder module, and a joint-point decoder module. Based on the images obtained in step 1, the PETR model outputs the coordinates of the human joint points in each image and draws the human poses in the image according to the joint association information.
Step 2.1, the backbone network module takes the images from step 1 as input and outputs multi-scale feature maps. The backbone is ResNet-50, a 50-layer residual network used to extract high-resolution, multi-scale feature maps from the image.
Step 2.2, based on the multi-scale feature maps from step 2.1 and the position code generated for each pixel by the position encoding module, the visual feature encoder produces multi-scale feature tokens and pose queries;
Step 2.3, the pose decoder takes the multi-scale feature tokens F obtained in step 2.2 and N randomly initialized pose queries $Q_{pose} \in \mathbb{R}^{N \times D}$ and predicts N body poses $\{\hat{P}_i\}_{i=1}^{N}$, where $\hat{P}_i \in \mathbb{R}^{K \times 2}$ denotes the coordinates of the K joint points of the i-th person and D denotes the query dimension. The pose decoder estimates the pose coordinates layer by layer across all decoder layers, each layer refining the pose based on the predictions of the previous layer.
Step 2.4, the joint-point decoder uses the K joint points of each human pose predicted by the pose decoder in step 2.3 as randomly initialized joint-point queries and refines the joint-point position information and joint-point structure information. The feature information of the joint-point queries and keys is continuously updated through self-attention and deformable cross-attention, and the joint-point queries are finally aggregated through the predicted offsets, so the joint-point position and structure information can be further refined.
Step 2.5, during training, a set-based Hungarian loss is adopted so that each ground-truth pose is forced to have a unique prediction, reducing missed and false joint-point detections. The classification loss, denoted $L_{cls}$, is used in the classification head. To eliminate the relative error of the prediction results, the PETR model adopts an OKS loss, computed as a regression loss between the joint-point coordinates of the predicted human poses and the ground-truth joint-point coordinates. The $L_1$ loss, denoted $L_{reg}$, and the OKS loss, denoted $L_{oks}$, serve as the loss functions of the pose regression head and the joint-point regression head respectively, so that the pose decoder and the joint-point decoder converge better;
training is further assisted by heatmap regression, which adds a directional guide: the closer the model gets to a target joint point, the larger the activation value, so the model can approach the target joint point quickly along the guided direction and converge faster. The PETR model therefore uses a deformable Transformer encoder to generate heatmap predictions and computes a variant focal loss, denoted $L_{hm}$, between the predicted and ground-truth heatmaps. The heatmap branch is used only to assist training and is removed in the inference stage. In summary, the overall loss function of the PETR model is:
$L = L_{cls} + \lambda_1 L_{reg} + \lambda_2 L_{oks} + \lambda_3 L_{hm}$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the corresponding loss weights.
Step 2.6, the COCO keypoint dataset is used as the training set to train and test the PETR model. Specifically, the COCO keypoint data contain 200,000 images in total, of which 64,355 are keypoint images, with keypoint annotations for 250,000 person instances; each person is annotated with 17 joint points covering the head, shoulders, arms, hands, knees, ankles, and so on. The PETR model is trained for 100 epochs. The learning rate is 2.5e-5 and is reduced to 2.5e-6 at epoch 80 so that training converges better. The trained PETR model detects the human joint-point coordinates in each input image and generates the human poses.
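As a rough illustration of the combined objective $L = L_{cls} + \lambda_1 L_{reg} + \lambda_2 L_{oks} + \lambda_3 L_{hm}$, the PyTorch sketch below sums the four terms and includes a simplified OKS-based regression term following the COCO keypoint-similarity definition; the per-keypoint constants, the default weight values, and the omission of keypoint-visibility masking are placeholders and simplifications, not the exact losses of the PETR implementation.

```python
# Simplified sketch of the PETR training objective. The classification, L1 and heatmap
# focal terms are assumed to be computed elsewhere by the matching/criterion code.
import torch

def oks_loss(pred: torch.Tensor, gt: torch.Tensor, area: torch.Tensor,
             k: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """1 - OKS for one person: pred/gt are (K, 2) keypoints, area is the object scale s^2,
    k holds the per-keypoint fall-off constants (visibility masking omitted for brevity)."""
    d2 = ((pred - gt) ** 2).sum(dim=-1)                       # squared joint distances, shape (K,)
    oks = torch.exp(-d2 / (2.0 * area * k ** 2 + eps)).mean()
    return 1.0 - oks

def total_loss(l_cls, l_reg, l_oks, l_hm, lambdas=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted sum L = L_cls + lambda1*L_reg + lambda2*L_oks + lambda3*L_hm."""
    lam1, lam2, lam3 = lambdas
    return l_cls + lam1 * l_reg + lam2 * l_oks + lam3 * l_hm
```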
The step 3 specifically includes:
the input of this step is the human joint-point coordinates from step 2, and the output is the pose feature values of the human bodies in the image. Differences in picture depth of field and person scale are explicitly taken into account, and the motion amplitude detection algorithm consists of four detection conditions.
Step 3.1, the joint-point connection format for pose estimation is the COCO human joint-point format, as shown in Fig. 5. $(x_i, y_i)$ denotes the x and y coordinates of the i-th joint point of the human body. The coordinates of the midpoint s of the left and right shoulder joint points (joint-point numbers 5, 6), of the midpoint h of the left and right hip joint points (numbers 11, 12), and of the midpoint k of the left and right knee joint points (numbers 13, 14) are computed.
Step 3.2, detection condition one: compute the ratio of the two-hand opening length to the body-center torso length. The two-hand opening length is defined by the left and right wrist joint points (numbers 9, 10), and the body-center torso length by s and h. The two-hand opening length d_hand and the body-center torso length d_body are computed from the joint-point coordinates; when their ratio is greater than or equal to a first threshold the action is judged to be large-amplitude, otherwise small-amplitude. Based on the experimental results the first threshold is set to 1.8, and the judgment criterion is d_hand / d_body ≥ 1.8.
Step 3.3, detection condition two: compute the torso inclination angle. The vector $x_1$ formed by s and h and the vector $x_2$ formed by h and k define the torso inclination angle angle_1 as the angle between the two vectors, $angle\_1 = \arccos\!\left(\frac{x_1 \cdot x_2}{\lVert x_1 \rVert \lVert x_2 \rVert}\right)$. When angle_1 is less than or equal to a second threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the second threshold is set to 150 degrees, i.e. angle_1 ≤ 150°.
Step 3.4, detection condition three: compute the opening angles between the arms and the torso. The left-arm opening angle angle_2 is formed by the vector $x_3$ of the left elbow and left shoulder (joint-point numbers 7, 5) and the vector $x_4$ of the left shoulder and left hip (numbers 5, 11); the right-arm opening angle angle_3 is formed by the vector $x_5$ of the right elbow and right shoulder (numbers 8, 6) and the vector $x_6$ of the right shoulder and right hip (numbers 6, 12). When angle_2 or angle_3 is greater than or equal to a third threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the third threshold is set to 90 degrees, i.e. angle_2 ≥ 90° or angle_3 ≥ 90°.
Step 3.5, detection condition four: compute the opening angles between the legs and the torso. The left-leg opening angle angle_4 is formed by the vector $x_7$ of the left knee and left hip (joint-point numbers 13, 11) and the vector $x_8$ of the left and right hips (numbers 11, 12); the right-leg opening angle angle_5 is formed by the vector $x_9$ of the right knee and right hip (numbers 14, 12) and the vector $x_{10}$ of the right and left hips (numbers 11, 12). When angle_4 or angle_5 is greater than or equal to a fourth threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the fourth threshold is set to 125 degrees, i.e. angle_4 ≥ 125° or angle_5 ≥ 125°.
the step 4 specifically includes:
the input of the step is the characteristic value of the human body in the image output in the step 3, and the detection result of the action amplitude is output. And judging whether the motion is a large-amplitude motion or not by calculating the amplitude of the change of the gesture characteristics and adopting a threshold method.
And 4.1, comparing the output result of the step 3 with a set threshold value, and judging that the performance acts as a large-amplitude action when one of the gesture characteristic values in the four detection conditions exceeds the set threshold value, or else, judging as a small-amplitude action.
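Applied to the feature dictionary of the sketch after step 3.5, the threshold judgment of step 4.1 reduces to a single boolean test; the thresholds are the values stated in the text.

```python
# Step 4.1 as a threshold test: any single feature exceeding its threshold marks the
# pose as a large-amplitude action, otherwise it is a small-amplitude action.
def is_large_amplitude(f: dict) -> bool:
    return (
        f["hand_ratio"] >= 1.8                                   # condition 1
        or f["trunk_angle"] <= 150.0                             # condition 2 (smaller = more bent)
        or f["left_arm"] >= 90.0 or f["right_arm"] >= 90.0       # condition 3
        or f["left_leg"] >= 125.0 or f["right_leg"] >= 125.0     # condition 4
    )
```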
The step 5 specifically includes:
the input of this step is the motion amplitude detection result output by step 4; the intelligent guided-broadcast system controls the cameras and switches the shot to the camera picture corresponding to the large-amplitude action image.
Step 5.1, as described in step 1.2, each image is named according to its camera position, so each image represents the current program picture of one camera position; for example, 1_01.jpg represents the first frame image of camera No. 1. The intelligent guided-broadcast system can therefore switch the shot to the camera picture corresponding to the large-amplitude action image according to the image name and the motion amplitude detection result.
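A sketch of the shot-switching decision is shown below, reusing pose_features() and is_large_amplitude() from the sketches above; the structure of detections and the switch_to_camera() call are assumptions standing in for whatever interface the guided-broadcast system actually exposes.

```python
# Step 5: pick the camera whose current key frame contains a large-amplitude action.
# detections is assumed to be a list of (image_name, list_of_keypoint_arrays) pairs,
# one entry per camera, with image names following the "<camera>_<frame>.jpg" convention.
def choose_camera(detections):
    for image_name, people in detections:
        camera_id = int(image_name.split("_")[0])                 # "1_01.jpg" -> camera 1
        if any(is_large_amplitude(pose_features(kp)) for kp in people):
            return camera_id                                      # first camera with a large action
    return None                                                   # no switch: keep the current shot

# selected = choose_camera(detections)
# if selected is not None:
#     switch_to_camera(selected)   # hypothetical call into the guided-broadcast system
```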
The invention applies a fully end-to-end multi-person pose estimation model to motion amplitude detection, improving both functionality and performance, with the goal of increasing the accuracy of human joint-point localization and motion amplitude detection. Pose features are computed accurately and rapidly from the joint-point coordinates, the motion amplitude is judged from these features, and the intelligent guided-broadcast system thereby achieves automatic or assisted program production, saving a large amount of manpower and material resources and improving working efficiency.
Compared with the prior art, the invention has the following beneficial effects:
(1) Efficient operation: compared with motion amplitude detection based on image classification and with two-stage pose-estimation-based detection, the method better suppresses interference from the background environment and focuses on the person; the model avoids manual operations and does not need an auxiliary object detection model, which saves a large amount of computing resources and gives fast operation with low memory consumption.
(2) Fully end-to-end, with balanced speed and accuracy: compared with top-down and bottom-up multi-person pose estimation models, the PETR model applied here to motion amplitude detection generalizes human pose estimation into a hierarchical set-prediction problem and handles human instances and fine-grained human joint-point coordinates uniformly, realizing fully end-to-end multi-person pose estimation in which speed and accuracy both reach a high level and are balanced against each other.
(3) High accuracy: traditional motion amplitude judgment relies mainly on the relative displacement of joint points between adjacent video frames and covers few action categories. The invention explicitly considers differences in scene depth of field and person scale, proposes a motion amplitude detection algorithm that normalizes motion amplitude detection through pose-feature analysis, and achieves a better detection effect and high accuracy for performance motion amplitude in performance scenes.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention;
FIG. 2 is an overall block diagram of the PETR model of the present invention;
FIG. 3 is a block diagram of a gesture decoder of the present invention;
FIG. 4 is a block diagram of a joint decoder according to the present invention;
FIG. 5 is a graph of COCO human body joint connection according to the present invention.
Detailed Description
The method is described in detail below with reference to the drawings and examples.
The flow chart of the embodiment is shown in fig. 1, and comprises the following steps:
step S10, video frame extraction;
step S20, constructing and training a multi-person gesture estimation model;
step S30, calculating attitude characteristics through an action amplitude detection algorithm;
step S40, judging the action amplitude;
and S50, switching the lens by the intelligent guide system according to the judging result.
Intelligent broadcast directing based on artificial intelligence aims to solve problems that arise in producing various types of performance programs, including the real-time state recognition of hosts and guests in different scenes and the corresponding directed switching. The core of the technology is to recognize the dynamic changes of human joint points using multi-person pose estimation, thereby realizing automatic guided-broadcast switching for many kinds of programs. Applying this technology can greatly shorten the production cycle, reduce manpower and material costs, and improve production efficiency and program quality. Unlike action recognition and action detection, which aim to detect the type of an action, motion amplitude detection aims to detect the stretching-variation amplitude of human motion, and belongs to a new application direction in the field of artificial intelligence. Motion amplitude influences a performer's style and helps the performer convey more emotion and dynamics on stage. The video is analyzed frame by frame with motion amplitude detection to recognize the real-time state of hosts and guests, and the result is fed back to the intelligent guided-broadcast system to achieve automatic or assisted program production.
The video frame extraction step S10 of the embodiment specifically includes:
the step inputs videos of a plurality of cameras and outputs images obtained by video frame extraction. An image is obtained by decimating the video frames as input to the PETR model of step S20.
Step S100, the naming mode of the images is as follows, each image is named according to the camera position information, so that each image represents the current program picture of one camera position camera; for example, 1_01.Jpg represents the first frame image of camera No. 1.
Step S110, the video frame extraction method of step S100: selecting video frames at intervals of n frames, and directly discarding other video frames due to information redundancy; where n is a positive integer greater than 1, e.g., n is 10.
The step S20 of building and training the multi-person posture estimation model specifically comprises the following steps:
the step is to input the image obtained in the step S10, output the image as the coordinates of the joint point of the human body, and draw the human body posture in the image according to the joint point related information. The PETR model mainly comprises a backbone network module, a position coding module, a visual characteristic coder module, a gesture decoder module and a joint point decoder module; the overall structure of the PETR model is shown in fig. 2.
Step S200, which is input as the image obtained in step S10 and output as a multi-scale feature map. ResNet-50 is used as a backbone network of the PETR model, resNet-50 is a residual network of 50 layers and can be used for extracting the characteristic map of the image, and the model is used for extracting the multi-scale characteristic map of the last three stages. Input imageWherein H is the image height, W is the width, and the multi-scale feature map of the last three stages is extracted through a ResNet-50 network; resNet-50 is a 50layer residual network that can be used to extract feature maps of images.
Step S210, a visual feature encoder of the PETR model. The step is input into a multi-scale feature map obtained in the step S200 and a position code generated by a position code module for each pixel point, and the multi-scale feature map and the position code are output into multi-scale feature information and a gesture query key. Because the multi-headed attention module has square complexity of the input scale, the visual feature encoder employs a deformable multi-headed attention module to implement feature encoding. Will contain multi-scale feature information F epsilon R L×256 Where L is the total amount of tokens. And finally, inputting F and the gesture query key into a gesture decoder to perform gesture prediction.
Step S220, the pose decoder of the PETR model. The input of this step is the multi-scale features F obtained in step S210 and N randomly initialized pose queries $Q_{pose} \in \mathbb{R}^{N \times D}$; the pose decoder outputs the poses of N bodies $\{\hat{P}_i\}_{i=1}^{N}$, where $\hat{P}_i \in \mathbb{R}^{K \times 2}$ denotes the coordinates of the K joint points of the i-th person and D denotes the query dimension.
The structure of the pose decoder is shown in Fig. 3. The pose queries first enter a self-attention module for interaction between pose queries, i.e. Pose-to-Pose self-attention. Then, layer by layer, each pose query extracts K feature-map pixels from the multi-scale feature memory F as keys through deformable cross-attention, and the features are aggregated as values according to the offsets predicted from the pose query. The cross-attention module outputs K inferred joint-point coordinates as the initial coordinates of the joint points of the human pose. The pose queries, now carrying the attended key features, are then fed into multi-task heads: the classification head predicts the confidence of each target through a linear mapping layer, and the pose regression head uses an MLP with a hidden size of 256 to predict the relative position offsets of the K inferred points.
The pose decoder consists of multiple decoding layers. Unlike other Transformer-based methods that use only the last decoder layer to predict the pose coordinates, the pose decoder of PETR estimates the pose coordinates layer by layer with all decoder layers, each layer refining the pose based on the predictions of the previous layer.
Step S230, the joint-point decoder of the PETR model. The input of this step is the K joint points of each human pose predicted by the pose decoder in step S220, used as randomly initialized joint-point queries; the output is further refined joint-point position information and joint-point structure information. The feature information of the joint-point queries and keys is continuously updated through self-attention and deformable cross-attention, and the joint-point queries are finally aggregated through the predicted offsets, so the joint-point position and structure information can be further refined. Since the joint points of different human poses are independent of each other, all poses can be processed in parallel, which greatly reduces the time complexity of network prediction and inference. The structure of the joint-point decoder is shown in Fig. 4.
The joint-point queries first interact with each other through the self-attention module, i.e. Joint-to-Joint self-attention. Visual features are then extracted by a deformable cross-attention module, i.e. Feature-to-Joint cross-attention. Subsequently, the joint-point prediction head predicts the relative displacement $\Delta J = (\Delta x, \Delta y)$ of the 2D joint points through an MLP. As in the pose decoder, the joint-point coordinates are refined progressively.
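To make the layer-by-layer refinement concrete, the following schematic PyTorch sketch shows one refinement head that predicts a coordinate offset and adds it to the estimate of the previous layer; the module name, dimensions, and structure are illustrative only and do not reproduce the actual PETR implementation.

```python
# Schematic of progressive joint refinement: each decoder layer predicts (dx, dy) offsets
# from its query features and adds them to the coordinates estimated by the previous layer.
import torch
import torch.nn as nn

class RefinementHead(nn.Module):
    def __init__(self, dim: int = 256, num_joints: int = 17):
        super().__init__()
        self.offset_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, num_joints * 2)
        )

    def forward(self, query: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # query: (N, dim) per-pose features; coords: (N, num_joints, 2) current estimate
        delta = self.offset_mlp(query).view(coords.shape)
        return coords + delta                      # refined joint coordinates for the next layer
```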
Steps S200 to S230 are building portions of the PETR model global framework, and specific steps of PETR model training are described below.
Step S240, during training, a set-based Hungarian loss is adopted so that each ground-truth pose is forced to have a unique prediction, reducing missed and false joint-point detections. The classification loss, denoted $L_{cls}$, is used in the classification head. To eliminate the relative error of the prediction results, the PETR model adopts an OKS loss, computed as a regression loss between the joint-point coordinates of the predicted human poses and the ground-truth joint-point coordinates. The $L_1$ loss, denoted $L_{reg}$, and the OKS loss, denoted $L_{oks}$, serve as the loss functions of the pose regression head and the joint-point regression head respectively, so that the pose decoder and the joint-point decoder converge better.
Training is further assisted by heatmap regression, which adds a directional guide: the closer the model gets to a target joint point, the larger the activation value, so the model can approach the target joint point quickly along the guided direction and converge faster. The PETR model therefore uses a deformable Transformer encoder to generate heatmap predictions and computes a variant focal loss, denoted $L_{hm}$, between the predicted and ground-truth heatmaps. The heatmap branch is used only to assist training and is removed in the inference stage. In summary, the overall loss function of the PETR model is:
$L = L_{cls} + \lambda_1 L_{reg} + \lambda_2 L_{oks} + \lambda_3 L_{hm}$
where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the corresponding loss weights.
Step S250, the COCO keypoint dataset is used as the training set and ResNet-50 as the backbone network. The model is trained for 100 epochs. The learning rate is 2.5e-5 and is reduced to 2.5e-6 at epoch 80 so that training converges better. The PETR model is trained and tested in this way, and the trained model detects the human joint-point coordinates in each input image and generates the human poses.
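The schedule described in step S250 can be written, for example, with a standard PyTorch optimizer and a MultiStepLR scheduler, as in the sketch below; the choice of AdamW and the shape of the training loop are assumptions, since the embodiment specifies only the epoch count and the learning-rate drop.

```python
# Training-schedule sketch: 100 epochs, learning rate 2.5e-5 reduced to 2.5e-6 at epoch 80.
# "model" and "train_one_epoch" are placeholders for the PETR network and its training loop.
import torch

def train(model, train_one_epoch, epochs: int = 100):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
    for epoch in range(epochs):
        train_one_epoch(model, optimizer, epoch)   # one pass over the COCO keypoint training set
        scheduler.step()
```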
The step S30 of calculating the pose features through the motion amplitude detection algorithm specifically includes:
the input is the human joint-point coordinates from step S20, and the output is the pose feature values of the human bodies in the image. Differences in picture depth of field and person scale are explicitly taken into account, and the motion amplitude detection algorithm consists of four detection conditions.
Step S300, the joint-point connection format for pose estimation in the invention is the COCO human joint-point format, as shown in Fig. 5. $(x_i, y_i)$ denotes the x and y coordinates of the i-th joint point of the human body. The coordinates of the midpoint s of the left and right shoulder joint points (joint-point numbers 5, 6), of the midpoint h of the left and right hip joint points (numbers 11, 12), and of the midpoint k of the left and right knee joint points (numbers 13, 14) are computed;
Step S310, detection condition one: compute the ratio of the two-hand opening length to the body-center torso length. The two-hand opening length is defined by the left and right wrist joint points (numbers 9, 10), and the body-center torso length by s and h. The two-hand opening length d_hand and the body-center torso length d_body are computed from the joint-point coordinates; when their ratio is greater than or equal to a first threshold the action is judged to be large-amplitude, otherwise small-amplitude. Based on the experimental results the first threshold is set to 1.8, and the judgment criterion is d_hand / d_body ≥ 1.8.
Step S320, detection condition two: compute the torso inclination angle. The vector $x_1$ formed by s and h and the vector $x_2$ formed by h and k define the torso inclination angle angle_1 as the angle between the two vectors. When angle_1 is less than or equal to a second threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the second threshold is set to 150 degrees, i.e. angle_1 ≤ 150°.
Step S330, detection condition three: compute the opening angles between the arms and the torso. The left-arm opening angle angle_2 is formed by the vector $x_3$ of the left elbow and left shoulder (joint-point numbers 7, 5) and the vector $x_4$ of the left shoulder and left hip (numbers 5, 11); the right-arm opening angle angle_3 is formed by the vector $x_5$ of the right elbow and right shoulder (numbers 8, 6) and the vector $x_6$ of the right shoulder and right hip (numbers 6, 12). When angle_2 or angle_3 is greater than or equal to a third threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the third threshold is set to 90 degrees, i.e. angle_2 ≥ 90° or angle_3 ≥ 90°.
Step S340, detection condition four: compute the opening angles between the legs and the torso. The left-leg opening angle angle_4 is formed by the vector $x_7$ of the left knee and left hip (joint-point numbers 13, 11) and the vector $x_8$ of the left and right hips (numbers 11, 12); the right-leg opening angle angle_5 is formed by the vector $x_9$ of the right knee and right hip (numbers 14, 12) and the vector $x_{10}$ of the right and left hips (numbers 11, 12). When angle_4 or angle_5 is greater than or equal to a fourth threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the fourth threshold is set to 125 degrees, i.e. angle_4 ≥ 125° or angle_5 ≥ 125°.
the action amplitude judging step S40 specifically includes:
the input of the step is the gesture characteristic value of the human body in the image output in the step S30, and the detection result of the action amplitude is output. Judging whether the motion is a large-amplitude motion or not by calculating the amplitude of the gesture characteristic change and adopting a threshold method;
step S400, comparing the output result of step S30 with a set threshold value, and judging that the performance acts as a large-amplitude action when one of the gesture characteristic values in the four detection conditions exceeds the threshold value set by the system, or else, the performance acts as a small-amplitude action.
The step S50, in which the intelligent guided-broadcast system switches shots according to the judgment result, specifically includes:
the input of this step is the motion amplitude detection result output by step S40; the intelligent guided-broadcast system switches the shot to the camera picture corresponding to the large-amplitude action image.
Step S500, as described in step S100, each image is named according to its camera position, so that each image represents the current program picture of one camera; for example, 1_01.jpg represents the first frame image of camera No. 1. The intelligent guided-broadcast system can therefore switch the shot to the camera picture corresponding to the large-amplitude action image according to the image name and the motion amplitude detection result.
The experimental results of applying the present invention are given below.
The interval between key-frame images in the intelligent guided-broadcast system is approximately $T_1 = 0.3333$ s, and the time required to process one key-frame image is $T_2 = 0.3024$ s. Since $T_1 > T_2$, the system completes the motion amplitude detection before the next image arrives and achieves real-time shot switching; the detection speed and accuracy of the invention therefore meet the requirements of practical application scenarios.
Table 1 shows the results of testing the invention on 600 performance-class test images.
Table 1. Performance action amplitude test results

Image category | Quantity | Accuracy (%)
Large-amplitude performance action | 300 | 88.0
Small-amplitude performance action | 300 | 86.33
The beneficial effects of the invention are as follows:
the invention applies the complete end-to-end multi-person gesture estimation model to the motion amplitude detection, improves the acquisition function and performance, and aims at improving the accuracy of human body joint point positioning and motion amplitude detection. The gesture features are accurately and rapidly calculated based on the joint point coordinates, and the action amplitude is judged according to the gesture features, so that the effect of automatic or auxiliary program production is achieved, a large amount of manpower and material resources are saved, and the working efficiency is improved. In addition, the invention has the following beneficial effects:
1. high operation: the method can better reduce the interference of the background environment, is focused on the person, omits manual operation, does not need to use a target detection model, reduces a large amount of computing resources, and has the advantages of high-speed operation and low memory consumption.
2. Complete end-to-end, speed and accuracy trade-off with each other: compared with a method using a top-down multi-person gesture estimation method and other two-stage models, the model has the advantages that the human gesture estimation is generalized into a layered set prediction problem, human body examples and fine granularity human joint point coordinates are processed uniformly, the full end-to-end multi-person gesture estimation is realized, the speed and the precision reach higher levels, and the two are balanced with each other.
3. The accuracy is high: the traditional motion amplitude detection and judgment mainly depends on the relative displacement of the joint points of adjacent video frames, and the detected motion categories are fewer. The invention actually considers the problem of different scene depth and character scale, proposes a motion amplitude detection algorithm, normalizes motion amplitude detection through gesture characteristic analysis, and has better detection effect and high accuracy on the motion amplitude of the performance in the performance scene.

Claims (6)

1. An intelligent guided broadcast switching method based on multi-person gesture estimation motion amplitude detection is characterized by comprising the following steps:
step 1, extracting frames from the performance video captured by the cameras to obtain images;
step 2, building and training a multi-person pose estimation model;
step 3, inputting the images from step 1 into the multi-person pose estimation model from step 2, and computing each pose feature with a motion amplitude detection algorithm;
step 4, judging the motion amplitude from the pose features obtained in step 3;
and step 5, switching shots according to the motion amplitude judgment result.
2. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 1, wherein the step 1 specifically includes:
acquiring multi-person pose videos from the plurality of cameras, and obtaining multi-person pose images from these videos;
step 1.1, selecting video frames from each video at intervals of n frames and directly discarding the other, redundant frames, wherein n is a positive integer greater than 1;
and step 1.2, naming each selected video frame according to its camera position to obtain a plurality of images, wherein each image represents the current program picture of one camera position.
3. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 2, wherein the step 2 specifically includes:
the multi-person pose estimation model is the PETR model, which mainly comprises a backbone network module, a position encoding module, a visual feature encoder module, a pose decoder module, and a joint-point decoder module; based on the images obtained in step 1, the PETR model outputs the coordinates of the human joint points in each image and draws the human poses in the image according to the joint association information;
step 2.1, the backbone network module takes the images from step 1 as input and outputs multi-scale feature maps, the backbone being ResNet-50, a 50-layer residual network used to extract high-resolution, multi-scale feature maps from the image;
step 2.2, the visual feature encoder, based on the multi-scale feature maps from step 2.1 and the position code generated for each pixel by the position encoding module, produces multi-scale feature tokens and pose queries;
step 2.3, the pose decoder takes the multi-scale feature tokens F obtained in step 2.2 and N randomly initialized pose queries $Q_{pose} \in \mathbb{R}^{N \times D}$ and predicts N body poses $\{\hat{P}_i\}_{i=1}^{N}$, wherein $\hat{P}_i \in \mathbb{R}^{K \times 2}$ denotes the coordinates of the K joint points of the i-th person and D denotes the query dimension; the pose decoder estimates the pose coordinates layer by layer across all decoder layers, each layer refining the pose based on the predictions of the previous layer;
step 2.4, the joint-point decoder uses the K joint points of each human pose predicted by the pose decoder in step 2.3 as randomly initialized joint-point queries and refines the joint-point position information and joint-point structure information; the feature information of the joint-point queries and keys is continuously updated through self-attention and deformable cross-attention, and the joint-point queries are finally aggregated through the predicted offsets, so that the joint-point position and structure information are further refined;
step 2.5, training adopts a set-based Hungarian loss; the classification loss, denoted $L_{cls}$, is used in the classification head; to eliminate the relative error of the prediction results, the PETR model adopts an OKS loss, computed as a regression loss between the joint-point coordinates of the predicted human poses and the ground-truth joint-point coordinates; the $L_1$ loss, denoted $L_{reg}$, and the OKS loss, denoted $L_{oks}$, serve as the loss functions of the pose regression head and the joint-point regression head respectively;
the PETR model uses a deformable Transformer encoder to generate heatmap predictions and computes a variant focal loss, denoted $L_{hm}$, between the predicted and ground-truth heatmaps; the heatmap branch is used only to assist training and is removed in the inference stage; all the loss functions of the PETR model are expressed as:
$L = L_{cls} + \lambda_1 L_{reg} + \lambda_2 L_{oks} + \lambda_3 L_{hm}$
wherein $\lambda_1$, $\lambda_2$, and $\lambda_3$ denote the corresponding loss weights;
step 2.6, the COCO keypoint dataset is constructed as the training set, and the PETR model is trained and tested; the COCO keypoint data contain 200,000 images in total, of which 64,355 are keypoint images with keypoint annotations for 250,000 person instances, each person being annotated with 17 joint points; the PETR model is trained for 100 epochs with a learning rate of 2.5e-5, reduced to 2.5e-6 at epoch 80 so that training converges better; the trained PETR model detects the human joint-point coordinates in each input image and generates the human poses.
4. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 3, wherein the step 3 specifically includes:
the input is the human joint-point coordinates from step 2, and the output is the pose feature values of the human bodies in the image; differences in picture depth of field and person scale are explicitly taken into account, and the motion amplitude detection algorithm consists of four detection conditions;
step 3.1, the joint-point connection format for pose estimation is the COCO human joint-point format; $(x_i, y_i)$ denotes the x and y coordinates of the i-th joint point of the human body, and the coordinates of the midpoint s of the left and right shoulder joint points, of the midpoint h of the left and right hip joint points, and of the midpoint k of the left and right knee joint points are computed;
step 3.2, detection condition one: compute the ratio of the two-hand opening length to the body-center torso length; the two-hand opening length is defined by the left and right wrist joint points, and the body-center torso length by s and h; the two-hand opening length d_hand and the body-center torso length d_body are computed from the joint-point coordinates, and when their ratio is greater than or equal to a first threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the first threshold is set to 1.8, the judgment criterion being d_hand / d_body ≥ 1.8;
step 3.3, detection condition two: compute the torso inclination angle; the vector $x_1$ formed by s and h and the vector $x_2$ formed by h and k define the torso inclination angle angle_1 as the angle between the two vectors; when angle_1 is less than or equal to a second threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the second threshold is set to 150 degrees, i.e. angle_1 ≤ 150°;
step 3.4, detection condition three: compute the opening angles between the arms and the torso; the left-arm opening angle angle_2 is formed by the vector $x_3$ of the left elbow and left shoulder and the vector $x_4$ of the left shoulder and left hip, and the right-arm opening angle angle_3 is formed by the vector $x_5$ of the right elbow and right shoulder and the vector $x_6$ of the right shoulder and right hip; when angle_2 or angle_3 is greater than or equal to a third threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the third threshold is set to 90 degrees, i.e. angle_2 ≥ 90° or angle_3 ≥ 90°;
step 3.5, detection condition four: compute the opening angles between the legs and the torso; the left-leg opening angle angle_4 is formed by the vector $x_7$ of the left knee and left hip and the vector $x_8$ of the left and right hips, and the right-leg opening angle angle_5 is formed by the vector $x_9$ of the right knee and right hip and the vector $x_{10}$ of the right and left hips; when angle_4 or angle_5 is greater than or equal to a fourth threshold the action is judged to be large-amplitude, otherwise small-amplitude; based on the experimental results the fourth threshold is set to 125 degrees, i.e. angle_4 ≥ 125° or angle_5 ≥ 125°.
5. the intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 4, wherein the step 4 specifically includes:
the input of the step is the characteristic value of the human body in the image output in the step 3, and the detection result of the action amplitude is output. And judging whether the motion is a large-amplitude motion or not by calculating the amplitude of the change of the gesture characteristics and adopting a threshold method.
And 4.1, comparing the output result of the step 3 with a set threshold value, and judging that the performance acts as a large-amplitude action when one of the gesture characteristic values in the four detection conditions exceeds the set threshold value, or else, judging as a small-amplitude action.
6. The intelligent guided broadcast switching method based on the multi-person gesture estimation motion amplitude detection of claim 5, wherein the step 5 specifically includes:
the input of this step is the motion amplitude detection result output by step 4, and the intelligent guided-broadcast system controls the cameras and switches the shot to the camera picture corresponding to the large-amplitude action image;
and step 5.1, as described in step 1.2, each image is named according to its camera position, each image representing the current program picture of one camera position, and the intelligent guided-broadcast system switches the shot to the camera picture corresponding to the large-amplitude action image according to the image name and the motion amplitude detection result.
CN202310762371.2A 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection Pending CN116721468A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310762371.2A CN116721468A (en) 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310762371.2A CN116721468A (en) 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Publications (1)

Publication Number Publication Date
CN116721468A true CN116721468A (en) 2023-09-08

Family

ID=87867724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310762371.2A Pending CN116721468A (en) 2023-06-27 2023-06-27 Intelligent guided broadcast switching method based on multi-person gesture estimation action amplitude detection

Country Status (1)

Country Link
CN (1) CN116721468A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117860242A (en) * 2024-03-12 2024-04-12 首都儿科研究所 Infant walking action development detection method, equipment and device
CN117860242B (en) * 2024-03-12 2024-05-28 首都儿科研究所 Infant walking action development detection method, equipment and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination