WO2023197390A1 - Posture tracking method and apparatus, electronic device, and computer readable medium


Info

Publication number
WO2023197390A1
WO2023197390A1 (PCT/CN2022/092143)
Authority
WO
WIPO (PCT)
Prior art keywords
human body
posture information
body posture
optical flow
video frame
Prior art date
Application number
PCT/CN2022/092143
Other languages
French (fr)
Chinese (zh)
Inventor
傅泽华
左文航
胡征慧
刘庆杰
王蕴红
Original Assignee
北京航空航天大学杭州创新研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京航空航天大学杭州创新研究院 (Hangzhou Innovation Institute, Beihang University)
Publication of WO2023197390A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person

Definitions

  • Embodiments of the present disclosure relate to the field of computer technology, and specifically to posture tracking methods, apparatuses, electronic devices, and computer-readable media.
  • The multi-person pose tracking task in the field of computer vision refers to processing an input video, detecting the pose of each pedestrian in every frame of the video, and then computationally analyzing each target's appearance features, position, motion state, and other information so as to correctly record each person's continuous pose trajectory over time.
  • Some embodiments of the present disclosure provide posture tracking methods, apparatuses, electronic devices, and computer-readable media to solve one or more of the technical problems mentioned in the background section above.
  • In a first aspect, some embodiments of the present disclosure provide a posture tracking method.
  • The method includes: detecting a pedestrian video frame by frame to obtain a human body bounding box group set, where each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video; inputting the human body bounding box group set into a joint point confidence network to obtain a human body posture information group set; according to the human body posture information group set, performing matching processing on each human body posture information group in the set to generate a matching result group and obtain a matching result group set, where each matching result in a matching result group includes at least one video frame number corresponding to the human body bounding box of that matching result; in response to determining that the matching result group set contains matching results satisfying a preset video frame number condition, generating a human body optical flow frame group set from the matching result groups corresponding to those matching results, where the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the frame number of the video frame following the video frame number corresponding to the matching result among the at least one video frame number; inputting the human body optical flow frame group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set; and filtering the optical flow human body posture information group set based on the optical flow joint point confidence information group set to obtain a target human body posture information group set.
  • In a second aspect, some embodiments of the present disclosure provide a posture tracking apparatus.
  • The apparatus includes: a detection unit configured to detect a pedestrian video frame by frame to obtain a human body bounding box group set, where each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video; a first input unit configured to input the human body bounding box group set into a joint point confidence network to obtain a human body posture information group set; a matching unit configured to, according to the human body posture information group set, perform matching processing on each human body posture information group in the set to generate a matching result group and obtain a matching result group set, where each matching result in a matching result group includes at least one video frame number corresponding to the human body bounding box of that matching result; a generation unit configured to, in response to determining that the matching result group set contains matching results satisfying a preset video frame number condition, generate a human body optical flow frame group set from the matching result groups corresponding to those matching results, where the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the frame number of the video frame following the video frame number corresponding to the matching result among the at least one video frame number; a second input unit configured to input the human body optical flow frame group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set; and a filtering unit configured to filter the optical flow human body posture information group set based on the optical flow joint point confidence information group set to obtain a target human body posture information group set.
  • In a third aspect, some embodiments of the present disclosure provide an electronic device, including: at least one processor; and a storage device on which at least one program is stored, where the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described in any implementation of the first aspect above.
  • In a fourth aspect, some embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, where the program, when executed by a processor, implements the method described in any implementation of the first aspect above.
  • Figure 1 is a flowchart of some embodiments of a posture tracking method according to the present disclosure.
  • Figure 2 is a network schematic diagram of the joint point confidence network of the posture tracking method according to the present disclosure.
  • Figure 3 is a schematic structural diagram of some embodiments of a posture tracking apparatus according to the present disclosure.
  • Figure 4 is a schematic structural diagram of an electronic device suitable for implementing some embodiments of the present disclosure.
  • Some embodiments of the present disclosure propose posture tracking methods, apparatuses, electronic devices, and computer-readable media that can improve the accuracy of multi-person posture tracking tasks.
  • FIG. 1 illustrates a process 100 of some embodiments of a posture tracking method according to the present disclosure.
  • The posture tracking method includes the following steps:
  • Step 101: Detect the pedestrian video frame by frame to obtain a human body bounding box group set.
  • the execution subject of the posture tracking method can detect the pedestrian video frame by frame to obtain a set of human body bounding boxes.
  • Each human body bounding box group in the above human body bounding box group set corresponds to a pedestrian video frame included in the pedestrian video.
  • the human bounding boxes in the human bounding box group correspond to pedestrians appearing in pedestrian video frames.
  • the pedestrian video may be a video in which at least one pedestrian appears in the recorded scene.
  • For example, the above execution subject can use an HTC (Hybrid Task Cascade) detector to detect the pedestrian video frame by frame to obtain the human body bounding box group set. In this way, the positions of the pedestrians in the pedestrian video can be initially obtained.
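  • As an illustrative sketch of this detection step (assuming a generic `detector` callable that returns per-person boxes with scores; the specific HTC implementation is not shown here), the frame-by-frame loop could look like:

```python
import cv2  # OpenCV, used here only for video decoding

def detect_video_frame_by_frame(video_path, detector, score_thr=0.5):
    """Run a pedestrian detector on every frame of a pedestrian video.

    `detector` is a hypothetical callable: given a BGR frame, it returns
    a list of (x1, y1, x2, y2, score) tuples for the person class.
    Returns one human body bounding box group per video frame.
    """
    bbox_group_set = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break  # end of video
        boxes = [det[:4] for det in detector(frame) if det[4] >= score_thr]
        bbox_group_set.append(boxes)  # one group per pedestrian video frame
    cap.release()
    return bbox_group_set
```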
  • The above computing device may be hardware or software.
  • When the computing device is hardware, it can be implemented as a distributed cluster composed of multiple servers or terminal devices, or as a single server or a single terminal device.
  • When the computing device is embodied as software, it can be installed in the hardware devices listed above and implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module. No specific limitation is made here. It should be understood that there may be any number of computing devices depending on implementation needs.
  • Step 102: Input the human body bounding box group set into the joint point confidence network to obtain a human body posture information group set.
  • the execution subject may input the human body bounding box set into the joint point confidence network to obtain a human body posture information set and a joint point confidence information set.
  • the above-mentioned joint point confidence network can be a neural network that takes a human body bounding box set as an input and uses a human body posture information set and a joint point confidence information set as an output.
  • the above-mentioned joint point confidence network can be Hourglass Network.
  • a set of human posture information representing the human posture can be obtained to facilitate tracking of the human posture.
  • The above joint point confidence network includes a backbone network, a joint point prediction branch, and a joint point availability branch, and the joint point availability branch includes a residual network and at least one classifier.
  • The above backbone network can be HRNet (High-Resolution Net).
  • the above-mentioned joint point prediction branch can be a neural network that takes the feature map set output by the above-mentioned backbone network as input and uses the joint point position probability information set and the joint point position coordinate information set as output.
  • For example, the above joint point prediction branch can apply successive transposed convolutions to the feature map set output by the backbone network.
  • The loss function for training the above joint point prediction branch can be:

    $$L=\sum_{k=1}^{K}v_{k}\sum_{i=1}^{W}\sum_{j=1}^{H}\left(h_{k}[i,j]-g_{k}[i,j]\right)^{2}$$

  • Here L represents the loss function of the joint point prediction branch. K represents the number of preset joint point types; for example, K can take the value 15. W represents the length and H the width of the heat map, the heat map being generated during the operation of the joint point prediction branch. k indexes the k-th joint point, i the i-th column, and j the j-th row of the heat map, and h_k[i,j] represents the value at the i-th column and j-th row of the heat map matrix of the k-th joint point. g_k[i,j] is the corresponding entry of the second-order Gaussian distribution label matrix, which is obtained by converting the real labels of the preset joint points and is the same size as the heat map. v_k represents the preset real label of the k-th joint point; the real labels of the preset joint points are set in advance, and their value is 0 or 1.
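  • A minimal sketch of this loss in PyTorch (tensor shapes are assumptions for illustration; `pred` holds the predicted heat maps and `target` the second-order Gaussian label matrices):

```python
import torch

def joint_heatmap_loss(pred, target, visible):
    """Visibility-weighted squared-error heat map loss.

    pred:    (B, K, H, W) predicted heat maps h_k
    target:  (B, K, H, W) Gaussian label matrices g_k
    visible: (B, K) real labels v_k in {0, 1}
    """
    sq_err = (pred - target) ** 2        # per-pixel squared error
    per_joint = sq_err.sum(dim=(2, 3))   # sum over all rows and columns
    # weight each joint by its label v_k; averaging over the batch is an
    # added convention, not specified by the patent text
    return (visible * per_joint).sum() / pred.shape[0]
```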
  • the above-mentioned joint point availability branch may be a neural network that takes the feature map set output by the above-mentioned backbone network as input and uses the joint point availability probability information set as output.
  • The focal loss function can be used as the loss function for training the above joint point availability branch. The loss function can be expressed as follows:

    $$FL(p)=-\alpha_{p}\,(1-p)^{\gamma}\log(p),\qquad \alpha_{p}=\alpha^{y}(1-\alpha)^{1-y}$$

  • Here y represents the preset real label of the joint point, whose value is 0 or 1. p represents the joint point classification probability, and p_avl represents the joint point availability probability, which may be the probability output by a classifier included in the joint point availability branch. When y = 1, the joint point classification probability p is p_avl; when y = 0, p is 1 - p_avl. FL(p) represents the focal loss function, α_p is the weight factor corresponding to the joint point classification probability p, and γ is the focusing parameter. The default parameters of the focal loss function can be used for the remaining values; for example, γ takes the value 2 and α takes the value 0.25.
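  • A minimal sketch of this focal loss for the availability branch (PyTorch; the epsilon term is an added numerical-stability convention):

```python
import torch

def availability_focal_loss(p_avl, y, alpha=0.25, gamma=2.0, eps=1e-8):
    """Focal loss for the joint point availability branch.

    p_avl: (N,) availability probabilities output by the classifier
    y:     (N,) preset real labels in {0, 1}
    """
    y = y.float()
    # joint point classification probability: p_avl when y = 1, else 1 - p_avl
    p = torch.where(y == 1, p_avl, 1.0 - p_avl)
    alpha_p = (alpha ** y) * ((1.0 - alpha) ** (1.0 - y))  # weight factor
    return (-alpha_p * (1.0 - p) ** gamma * torch.log(p + eps)).mean()
```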
  • the human body bounding boxes included in the above human body bounding box group set can be cropped and scaled.
  • the above-mentioned execution subject can crop the human body bounding boxes included in the above-mentioned human body bounding box group set from the corresponding pedestrian video frames, and scale each of the cropped human body bounding boxes to a fixed size (for example, 384 ⁇ 288).
  • In the first step, the cropped and rescaled human body bounding box group set is input into the above backbone network to obtain a feature map group set.
  • For example, each high-resolution feature output by the last exchange unit of the fourth stage of HRNet can be used as the feature map output by the backbone network.
  • In the second step, the feature map group set is input into the joint point prediction branch and the joint point availability branch respectively, to obtain a joint point position probability information group set, a joint point position coordinate information group set, and a joint point availability probability information group set.
  • In the third step, a joint point confidence information group set is generated based on the joint point position probability information group set and the joint point availability probability information group set. For example, the product of the joint point position probability and the joint point availability probability corresponding to the same joint point can be determined as that joint point's confidence, yielding the joint point confidence information.
  • In the fourth step, the human body posture represented by each feature map in the feature map group set is determined as human body posture information, yielding a human body posture information group set.
  • In the fifth step, the joint point confidence information group set and the human body posture information group set are output (a sketch of the full forward pass follows below).
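  • The five steps can be sketched as follows (the `backbone`, `joint_branch`, and `avail_branch` modules are hypothetical stand-ins for HRNet and the two branches; reading the position probability as the heat map peak is an assumption):

```python
import torch

def joint_confidence_forward(crops, backbone, joint_branch, avail_branch):
    """Sketch of the forward pass of the joint point confidence network.

    crops: (N, 3, 384, 288) cropped-and-rescaled human body bounding boxes.
    Returns posture information (joint coordinates) and per-joint confidences.
    """
    feats = backbone(crops)                        # step 1: feature map group set
    heatmaps, coords = joint_branch(feats)         # step 2: position probabilities/coordinates
    p_avl = avail_branch(feats)                    # step 2: availability probabilities, (N, K)
    p_pos = heatmaps.flatten(2).max(dim=2).values  # peak of each heat map, (N, K)
    confidence = p_pos * p_avl                     # step 3: product as joint confidence
    return coords, confidence                      # steps 4-5: posture info and confidences
```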
  • Step 103: According to the human body posture information group set, perform matching processing on each human body posture information group in the set to generate a matching result group, obtaining a matching result group set.
  • The execution subject may, according to the human body posture information group set, perform matching processing on each human body posture information group in the set to generate a matching result group and obtain a matching result group set.
  • the matching results in the above matching result group include at least one video frame number corresponding to the human body bounding box of the above matching result.
  • the above video frame number may be the frame number of each pedestrian video frame in the pedestrian video.
  • the Hungarian algorithm can be used to perform matching processing on each human posture information group in the above human posture information group set. From this, the matching results corresponding to each human body posture information can be obtained.
  • As an example, the above execution subject can perform the following operations for the human body posture information groups corresponding to each pair of two adjacent pedestrian video frames in the human body posture information group set (a sketch of the matching follows this list):
  • In the first step, the human body posture information group corresponding to the earlier of the two adjacent pedestrian video frames is determined as the first human body posture information group.
  • In the second step, the human body posture information group corresponding to the later of the two adjacent pedestrian video frames is determined as the second human body posture information group.
  • In the third step, the distance between each piece of first human body posture information in the first human body posture information group and each piece of second human body posture information in the second human body posture information group is determined, yielding a distance set.
  • For example, the above distance can be the IoU (Intersection over Union) distance.
  • In the fourth step, according to the above distance set, the first human body posture information in the first group and the second human body posture information in the second group are subjected to assignment processing, yielding adjacent-video-frame matching results.
  • For example, the Hungarian algorithm can be used to perform the assignment based on the above distance set to obtain the adjacent-video-frame matching results.
  • In the fifth step, a matching result group set can be generated based on the obtained matching results of each pair of adjacent video frames.
  • For example, the video frame numbers corresponding to each successfully matched piece of second human body posture information can be combined into a matching result, thereby obtaining the matching result group set.
  • inter-frame matching of human postures can be achieved.
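  • A minimal sketch of this adjacent-frame assignment (using SciPy's Hungarian-algorithm solver; representing each piece of posture information by its bounding box is an assumption for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-8)

def match_adjacent_frames(first_boxes, second_boxes, max_dist=0.9):
    """Hungarian assignment between two adjacent frames using IoU distance."""
    cost = np.array([[1.0 - iou(a, b) for b in second_boxes]
                     for a in first_boxes])      # IoU distance matrix
    rows, cols = linear_sum_assignment(cost)     # minimum-cost assignment
    # keep only sufficiently close pairs; the 0.9 cutoff is illustrative
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```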
  • Step 104: In response to determining that the matching result group set contains matching results satisfying the preset video frame number condition, generate a human body optical flow frame group set from the matching result groups corresponding to those matching results.
  • The execution subject may, in response to determining that there are matching results satisfying the preset video frame number condition in the matching result group set, generate a human body optical flow frame group set according to each matching result group corresponding to those matching results.
  • The above preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number.
  • The next video frame number is the frame number of the video frame following the video frame number corresponding to the matching result among the at least one video frame number.
  • various methods can be used to generate human body optical flow frame sets. From this, a set of human optical flow frames representing the predicted human posture can be obtained.
  • the execution subject may determine the set of video frames to be processed based on each matching result group corresponding to the matching result that satisfies the above preset video frame number condition.
  • the video frames corresponding to each matching result group corresponding to the matching results that meet the above preset video frame number conditions can be determined as the video frames to be processed.
  • Then, the following sub-steps can be performed for each pair of two adjacent video frames to be processed in the above set of video frames to be processed (a sketch follows after the sub-steps):
  • the first sub-step is to generate an optical flow graph matrix based on the two adjacent video frames to be processed.
  • the above-mentioned optical flow map matrix includes a set of pixel offsets.
  • the optical flow method can be used to generate an optical flow graph matrix.
  • the second sub-step is to select the human body posture information group corresponding to the two adjacent video frames to be processed from the human body posture information group set as the first human posture information group to obtain the first human posture information group set.
  • the third sub-step is to select the first human body posture information corresponding to the matching result that satisfies the above preset video frame number condition from the above first human body posture information group set as the second human body posture information to obtain at least one second human body posture information.
  • In the fourth sub-step, for each piece of second human body posture information in the at least one piece of second human body posture information, the pixel offsets of the pixels falling within the range corresponding to that second human body posture information are selected from the above pixel offset set as target pixel offsets, yielding a target pixel offset set corresponding to the second human body posture information.
  • Here, each such pixel may be the offset pixel of the two pixels corresponding to the pixel offset.
  • the fifth sub-step is to generate an optical flow mask map based on the above target pixel offset set.
  • the target pixel offsets in the above target pixel offset set can be constructed as a matrix to obtain the optical flow mask map.
  • the sixth sub-step is to determine the minimum circumscribed rectangle of the above-mentioned optical flow mask image as the human body optical flow frame.
  • the human body optical flow frame corresponds to the latter video frame to be processed among the two adjacent video frames to be processed. From this, the human body optical flow frame that represents the prediction of the next video frame to be processed can be obtained.
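  • These sub-steps can be sketched as follows (using OpenCV's Farneback dense optical flow; sampling the offsets only at the joint locations, rather than over every pixel of the posture range, is a simplification):

```python
import cv2
import numpy as np

def human_optical_flow_box(prev_gray, next_gray, keypoints):
    """Predict a pedestrian's box in the next frame from dense optical flow.

    prev_gray, next_gray: grayscale images of two adjacent frames to be processed.
    keypoints: (K, 2) joint coordinates of the pedestrian in the earlier frame.
    """
    # sub-step 1: optical flow graph matrix of per-pixel offsets
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # sub-steps 4-5: offset the sampled pixels and build an optical flow mask map
    mask = np.zeros(prev_gray.shape[:2], dtype=np.uint8)
    h, w = mask.shape
    for x, y in keypoints.astype(int):
        dx, dy = flow[y, x]
        tx = int(np.clip(x + dx, 0, w - 1))
        ty = int(np.clip(y + dy, 0, h - 1))
        mask[ty, tx] = 255
    # sub-step 6: minimum circumscribed rectangle of the mask
    x0, y0, bw, bh = cv2.boundingRect(cv2.findNonZero(mask))
    return (x0, y0, x0 + bw, y0 + bh)  # human body optical flow frame
```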
  • Step 105: Input the human body optical flow frame group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set.
  • the execution subject may input the human body optical flow frame set to the joint point confidence network to obtain an optical flow human posture information set and an optical flow joint point confidence information set.
  • In this way, an optical flow human body posture information group set representing the human postures of the human body optical flow frames in the human body optical flow frame group set can be obtained.
  • Step 106: Filter the optical flow human body posture information group set based on the optical flow joint point confidence information group set to obtain a target human body posture information group set.
  • The execution subject may filter the optical flow human body posture information group set based on the optical flow joint point confidence information group set to obtain the target human body posture information group set.
  • As an example, a joint point confidence threshold condition can be set, and in response to the presence of optical flow joint point confidence information satisfying the condition in the optical flow joint point confidence information group set, the corresponding optical flow human body posture information is deleted.
  • the above joint point confidence threshold condition may be: the optical flow joint point confidence included in the optical flow joint point confidence information is less than the preset confidence threshold.
  • The above execution subject can generate an average optical flow joint point confidence from each piece of the above optical flow joint point confidence information.
  • the average value of the optical flow joint point confidence included in the optical flow joint point confidence information may be determined as the average optical flow joint point confidence.
  • the average optical flow joint point confidence that is smaller than the preset optical flow threshold can be selected from the obtained average optical flow joint point confidence as the first confidence to be filtered.
  • The optical flow human body posture information corresponding to the first confidence levels to be filtered can be deleted from the above optical flow human body posture information group set. In this way, the remaining optical flow human body posture information has relatively high confidence.
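  • A minimal sketch of this average-confidence filtering (the threshold value is illustrative, not fixed by the patent):

```python
import numpy as np

def filter_by_average_confidence(flow_poses, joint_confidences, flow_thr=0.3):
    """Keep optical flow poses whose average joint confidence is high enough.

    flow_poses: list of optical flow human posture records.
    joint_confidences: list of (K,) arrays of optical flow joint confidences.
    """
    kept = []
    for pose, conf in zip(flow_poses, joint_confidences):
        if float(np.mean(conf)) >= flow_thr:  # average optical flow joint confidence
            kept.append(pose)                 # below-threshold poses are deleted
    return kept
```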
  • the above execution subject can determine the video frame number corresponding to each optical flow human posture information group in the deleted optical flow human posture information group set as the target video frame number, and obtain the target video frame number set.
  • a human body posture information group whose corresponding video frame number is within the above target video frame number set can be selected from the above human body posture information group set as a comparison human body posture information group to obtain a comparison human posture information group set.
  • A posture overlap degree group set can be generated based on the deletion-processed optical flow human body posture information group set and the above comparison human body posture information group set.
  • The following formula can be used to generate the posture overlap degree:

    $$O_{p,q}=\delta\left(\mathrm{IOU}_{p,q}>0\right)\cdot\frac{\sum_{k}\exp\left(-\frac{d_{p,q,k}^{2}}{2s^{2}\kappa^{2}}\right)\delta\left(v_{pk}>0\right)\delta\left(v_{qk}>0\right)}{\sum_{k}\delta\left(v_{pk}>0\right)\delta\left(v_{qk}>0\right)}$$

  • Here δ(·) is a function that converts a Boolean result into 0 or 1: it yields 1 when the condition in the parentheses is met and 0 when it is not. p and q respectively represent the two pedestrians corresponding to the optical flow human body posture information and the comparison human body posture information. IOU_{p,q} represents the IoU (Intersection over Union) value of pedestrian p and pedestrian q, which can be determined in advance. k indexes the k-th joint point, and d_{p,q,k} represents the Euclidean distance between the k-th joint point of pedestrian p and the k-th joint point of pedestrian q. s² represents the scale factor of pedestrians p and q; for example, the scale factor can be determined from the sum of the square roots of the areas of the human body bounding boxes corresponding to the optical flow human body posture information and the comparison human body posture information. v_pk and v_qk represent the confidences of the k-th joint point of pedestrian p and of pedestrian q, respectively. κ is a preset parameter.
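  • A minimal sketch of this posture overlap computation (the value of the preset parameter κ is an illustrative assumption):

```python
import numpy as np

def posture_overlap(kp_p, kp_q, v_p, v_q, iou_pq, s2, kappa=0.05):
    """Posture overlap degree of pedestrians p and q per the formula above.

    kp_p, kp_q: (K, 2) joint coordinates; v_p, v_q: (K,) joint confidences;
    iou_pq: precomputed IoU of the two pedestrians; s2: scale factor s^2.
    """
    if iou_pq <= 0:                  # delta(IOU_{p,q} > 0) is 0: no box overlap
        return 0.0
    both = (v_p > 0) & (v_q > 0)     # joints confident for both pedestrians
    if not both.any():
        return 0.0
    d2 = np.sum((kp_p[both] - kp_q[both]) ** 2, axis=1)  # squared distances
    # mean over counted joints = sum of kernels / number of counted joints
    return float(np.mean(np.exp(-d2 / (2.0 * s2 * kappa ** 2))))
```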
  • the posture overlap degree greater than the preset posture overlap degree threshold can be selected from the above-mentioned posture overlap degree group set as the second confidence level to be filtered.
  • the optical flow human posture information corresponding to the second confidence level to be filtered can be deleted from the above optical flow human posture information set.
  • By filtering joint points with the above posture overlap method, the present disclosure can effectively filter out overlapping optical flow human body posture information. This enables accurate tracking of the affected pedestrians, thereby improving the accuracy of multi-person posture tracking tasks.
  • the execution subject may determine the matching results that satisfy the preset single video frame number condition in the matching result group set as target matching results to obtain a target matching result set.
  • the above preset single video frame number condition can be that the matching result only includes one video frame number.
  • the human body posture information corresponding to the target matching results included in the above target matching result set can be selected from the above human body posture information group set as the human body posture information to be integrated, thereby obtaining the human body posture information set to be integrated.
  • the above-mentioned human body posture information set to be integrated can be input into the preset pedestrian feature model to obtain a pedestrian feature information set to be integrated.
  • the above-mentioned preset pedestrian feature model may be a neural network model that takes a human body posture information set to be integrated as input and a pedestrian feature information set to be integrated as an output.
  • the above-mentioned preset pedestrian feature model may include a backbone network, a pedestrian feature extraction module, and a classifier.
  • the above-mentioned backbone network may be HRNet.
  • the above pedestrian feature extraction module may include a convolution layer, an average pooling layer and a batch normalization layer.
  • the above classifier can include a fully connected layer and a Softmax layer. It should be noted that the above classifier is only used for training the above preset pedestrian feature model, and is not used when the above preset pedestrian feature model is actually used.
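  • An illustrative sketch of the pedestrian feature extraction module (channel and feature sizes are assumptions; the classifier shown in the trailing comment is used during training only):

```python
import torch.nn as nn

class PedestrianFeatureHead(nn.Module):
    """Convolution layer + average pooling layer + batch normalization layer,
    applied to backbone (e.g., HRNet) feature maps."""

    def __init__(self, in_channels=32, feat_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, feat_dim, kernel_size=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.bn = nn.BatchNorm1d(feat_dim)

    def forward(self, feature_map):          # (B, C, H, W) backbone output
        x = self.conv(feature_map)
        x = self.pool(x).flatten(1)           # (B, feat_dim)
        return self.bn(x)                     # pedestrian feature vector

# For training only, a classifier such as
# nn.Sequential(nn.Linear(128, num_identities), nn.Softmax(dim=1))
# would be stacked on top; it is discarded at inference time.
```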
  • pedestrian feature similarity information for each pedestrian feature information to be integrated in the set of pedestrian feature information to be integrated can be generated based on the set of pedestrian feature information to be integrated, to obtain a set of pedestrian feature similarity information.
  • the similarity between each pedestrian feature information to be integrated and other pedestrian feature information to be integrated in the pedestrian feature information set to be integrated can be determined to obtain a pedestrian feature similarity information set.
  • The formula for determining the similarity can be written as:

    $$S(p,q)=\frac{1}{K}\sum_{k=1}^{K}\frac{f_{pk}^{\top}f_{qk}}{\lVert f_{pk}\rVert\,\lVert f_{qk}\rVert}$$

  • Here p and q respectively represent the two pedestrians corresponding to the two pieces of pedestrian feature information to be integrated. S(p, q) represents the pedestrian feature similarity. D represents the dimension of the feature vectors represented by the pedestrian feature information to be integrated, i.e., f_{pk}, f_{qk} ∈ ℝ^D. k indexes the k-th joint point, with f_{pk} representing the feature vector of the k-th joint point of pedestrian p and f_{qk} that of pedestrian q, and K is the number of joint point types.
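  • A minimal sketch of this similarity computation under the cosine reading above (feature shapes are assumptions):

```python
import numpy as np

def pedestrian_feature_similarity(f_p, f_q, eps=1e-8):
    """Average per-joint cosine similarity between two pedestrians' features.

    f_p, f_q: (K, D) arrays of joint feature vectors for pedestrians p and q.
    """
    num = np.sum(f_p * f_q, axis=1)                                  # dot products
    den = np.linalg.norm(f_p, axis=1) * np.linalg.norm(f_q, axis=1)  # norms
    return float(np.mean(num / (den + eps)))
```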
  • the pedestrian feature similarity information that satisfies the preset similarity conditions in the above pedestrian feature similarity information set can be determined as the target similarity information, and a target similarity information set is obtained.
  • the above preset similarity condition may be that the pedestrian feature similarity in the above pedestrian feature similarity information is greater than the preset pedestrian feature similarity threshold.
  • the matching result group set can be updated according to the target similarity information set.
  • the video frame number corresponding to each target similarity information in the above target similarity information set can be recorded in the corresponding matching result.
  • the pedestrian feature similarity information corresponding to the target similarity information can be deleted from the pedestrian feature similarity information set.
  • In this way, human body posture information whose matching result has only one video frame number can be matched again, so that posture information that previously failed to match because of detector performance can be matched successfully, thereby improving the accuracy of multi-person posture tracking tasks.
  • The posture tracking methods of some embodiments of the present disclosure have the following beneficial effects: they can improve the accuracy of multi-person posture tracking tasks. Specifically, multi-person posture tracking accuracy is low because joint points whose position probabilities are lowered by motion blur are mistakenly filtered out, so some pedestrians cannot be tracked accurately. Based on this, the posture tracking method of some embodiments of the present disclosure first detects the pedestrian video frame by frame to obtain a human body bounding box group set. In this way, the positions of the pedestrians in the pedestrian video can be initially obtained.
  • the above human body bounding box set is input into the joint point confidence network to obtain the human body posture information set.
  • a set of human posture information representing the human posture can be obtained to facilitate tracking of the human posture.
  • matching processing is performed on each human body posture information group in the above-mentioned human posture information group set to generate a matching result group and obtain a matching result group set. From this, the matching results corresponding to each human body posture information can be obtained.
  • a human body optical flow frame group set is generated based on each matching result group corresponding to the matching result that satisfies the above preset video frame number condition. From this, a set of human optical flow frames representing the predicted human posture can be obtained. After that, the above-mentioned human body optical flow frame group set is input to the above-mentioned joint point confidence network, and an optical flow human body posture information group set and an optical flow joint node confidence information group set are obtained. Thus, the optical flow human posture information group set representing the human posture of the human optical flow frame set in the human optical flow frame group set can be obtained.
  • the above-mentioned optical flow human posture information group set is filtered to obtain the target human body posture information group set.
  • Because the optical flow joint point confidence information group set is used to filter the optical flow human body posture information group set, the confidence corresponding to the resulting target human body posture information group set is relatively high.
  • In addition, joint points whose position probabilities are lowered by motion blur are retained rather than mistakenly filtered out, so the corresponding pedestrians' joint points can be tracked accurately, which improves the accuracy of multi-person posture tracking tasks.
  • The present disclosure provides some embodiments of a posture tracking apparatus. These apparatus embodiments correspond to the method embodiments shown in Figure 1.
  • The apparatus can be specifically applied to various electronic devices.
  • the posture tracking device 300 of some embodiments includes: a detection unit 301 , a first input unit 302 , a matching unit 303 , a generation unit 304 , a second input unit 305 and a filtering unit 306 .
  • The detection unit 301 is configured to detect the pedestrian video frame by frame to obtain a human body bounding box group set, where each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video;
  • the first input unit 302 is configured to input the above-mentioned human body bounding box group set to the joint point confidence network to obtain a human body posture information group set;
  • The matching unit 303 is configured to, according to the human body posture information group set, perform matching processing on each human body posture information group in the set to generate a matching result group and obtain a matching result group set, where each matching result in a matching result group includes at least one video frame number corresponding to the human body bounding box of that matching result;
  • The generation unit 304 is configured to, in response to determining that the matching result group set contains matching results satisfying the preset video frame number condition, generate a human body optical flow frame group set from the matching result groups corresponding to those matching results, where the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the frame number of the video frame following the video frame number corresponding to the matching result;
  • The second input unit 305 is configured to input the human body optical flow frame group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set; and the filtering unit 306 is configured to filter the optical flow human body posture information group set based on the optical flow joint point confidence information group set to obtain a target human body posture information group set.
  • The units recorded in the apparatus 300 correspond to the respective steps of the method described with reference to Figure 1. Therefore, the operations, features, and beneficial effects described above for the method also apply to the apparatus 300 and the units included therein, and are not repeated here.
  • FIG. 4 a schematic structural diagram of an electronic device (eg, computing device) 400 suitable for implementing some embodiments of the present disclosure is shown.
  • the electronic device shown in FIG. 4 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.
  • The electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403.
  • In the RAM 403, various programs and data required for the operation of the electronic device 400 are also stored.
  • the processing device 401, ROM 402 and RAM 403 are connected to each other via a bus 404.
  • An input/output (I/O) interface 405 is also connected to bus 404.
  • The following devices may be connected to the I/O interface 405: an input device 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 407 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 408 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 409.
  • the communication device 409 may allow the electronic device 400 to communicate wirelessly or wiredly with other devices to exchange data.
  • Although FIG. 4 illustrates an electronic device 400 having various means, it should be understood that it is not required to implement or provide all of the means illustrated; more or fewer means may alternatively be implemented or provided. Each block shown in Figure 4 may represent one device, or may represent multiple devices as needed.
  • the processes described above with reference to the flowcharts may be implemented as a computer software program.
  • some embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 409, or from storage device 408, or from ROM 402.
  • When the computer program is executed by the processing device 401, the above-described functions defined in the methods of some embodiments of the present disclosure are performed.
  • the computer-readable medium recorded in some embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having at least one conductor, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • The client and server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communications network).
  • Examples of communications networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs.
  • When the above one or more programs are executed by the electronic device, the electronic device is caused to: detect the pedestrian video frame by frame to obtain a human body bounding box group set, where each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video; input the human body bounding box group set into the joint point confidence network to obtain a human body posture information group set; according to the human body posture information group set, perform matching processing on each human body posture information group in the set to generate a matching result group and obtain a matching result group set, where each matching result in a matching result group includes at least one video frame number corresponding to the human body bounding box of that matching result; in response to determining that the matching result group set contains matching results satisfying the preset video frame number condition, generate a human body optical flow frame group set from the matching result groups corresponding to those matching results, where the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the frame number of the video frame following the video frame number corresponding to the matching result; input the human body optical flow frame group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set; and filter the optical flow human body posture information group set based on the optical flow joint point confidence information group set to obtain a target human body posture information group set.
  • Computer program code for performing the operations of some embodiments of the present disclosure may be written in one or more programming languages, or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains at least one executable instruction for implementing the specified logical function.
  • It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by combinations of special-purpose hardware and computer instructions.
  • the units described in some embodiments of the present disclosure may be implemented in software or hardware.
  • The described units may also be provided in a processor; for example, it may be described as: a processor including a detection unit, a first input unit, a matching unit, a generation unit, a second input unit, and a filtering unit. The names of these units do not, under certain circumstances, constitute a limitation on the units themselves.
  • For example, the first input unit can also be described as "a unit that inputs the human body bounding box group set into the joint point confidence network to obtain a human body posture information group set".
  • Exemplary types of hardware logic components that can be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed in embodiments of the present disclosure are a posture tracking method and apparatus, an electronic device, and a computer readable medium. A specific implementation of the method comprises: performing frame-by-frame detection on a pedestrian video to obtain a human body bounding box group set; inputting the human body bounding box group set into a joint point confidence network to obtain a human body posture information group set; performing matching processing on each human body posture information group in the human body posture information group set to obtain a matching result group set; generating a human body optical flow box group set; inputting the human body optical flow box group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set; and, on the basis of the optical flow joint point confidence information group set, filtering the optical flow human body posture information group set to obtain a target human body posture information group set.

Description

Posture tracking method and apparatus, electronic device, and computer-readable medium

Cross-reference to related applications

This application claims priority to the Chinese patent application filed on April 15, 2022, with application number 202210395730.0 and entitled "Posture Tracking Method, Apparatus, Electronic Device and Computer-Readable Medium", the entire content of which is incorporated into this application by reference.

Technical field

Embodiments of the present disclosure relate to the field of computer technology, and specifically to posture tracking methods, apparatuses, electronic devices, and computer-readable media.

Background

The multi-person pose tracking task in the field of computer vision refers to processing an input video, detecting the pose of each pedestrian in every frame of the video, and then computationally analyzing each target's appearance features, position, motion state, and other information so as to correctly record each person's continuous pose trajectory over time. Currently, when performing multi-person pose tracking tasks, in addition to correctly identifying the location of each pedestrian's joint points, it is also necessary to filter out unavailable joint points caused by occlusion and other factors.
发明内容Contents of the invention
本公开的内容部分用于以简要的形式介绍构思,这些构思将在后面的具体实施方式部分被详细描述。本公开的内容部分并不旨在标识要求保护的技术方案的关键特征或必要特征,也不旨在用于限制所要求的保护的技术方案的范围。This Summary is provided to introduce in simplified form concepts that are later described in detail in the Detailed Description. The content of this disclosure is not intended to identify key features or essential features of the claimed technical solutions, nor is it intended to be used to limit the scope of the claimed technical solutions.
本公开的一些实施例提出了姿态跟踪方法、装置、电子设备和计算机可读介质,来解决以上背景技术部分提到的技术问题中的一项或多项。Some embodiments of the present disclosure provide gesture tracking methods, devices, electronic devices, and computer-readable media to solve one or more of the technical problems mentioned in the background section above.
第一方面,本公开的一些实施例提供了一种姿态跟踪方法,该方法包括:对行人视频进行逐帧检测,得到人体边界框组集,其中,上述人体边界框组集中的人体边界框组对应于行人视频包括的行人视频帧;将上述人体边界框组集输入至关节点置信度网络,得到人体姿态信息组集;根据上述人体姿态信息组集,对上述人体姿态信息组集中的每个人体姿态信息组进行匹配处理,以生成匹配结果组,得到匹配结果组集,其中,上述匹配结果组中的匹配结果包括对应上述匹配结果的人体边界框的至少一个视频帧号;响应于确定上述匹配结果组集中存在满足预设视频帧号条件的匹配结果,根据满足上述预设视频帧号条件的匹配结果所对应的各个匹配结果组,生成人体光流框组集,其中,上述预设视频帧号条件为匹配结果包括的至少一个视频帧号中不包含下一视频帧号,上述下一视频帧号为上述至少一个视频帧号中上述匹配结果对应的视频帧号的下一帧视频的视频帧号;将上述人体光流框组集输入至上述关节点置信度网络,得到光流人体姿态信息组集和光流关节点置信度信息组集;基于上述光流关节点置信度信息组集,对上述光流人体姿态信息组集进行过滤,得到目标人体姿态信息组集。In a first aspect, some embodiments of the present disclosure provide a posture tracking method. The method includes: detecting pedestrian videos frame by frame to obtain a human body bounding box group set, wherein the human body bounding box group in the human body bounding box group set Corresponding to the pedestrian video frames included in the pedestrian video; input the above-mentioned human body bounding box group set to the joint point confidence network to obtain the human body posture information group set; according to the above-mentioned human body posture information group set, for each person in the above-mentioned human posture information group set The body posture information group is subjected to matching processing to generate a matching result group and obtain a matching result group set, wherein the matching results in the above-mentioned matching result group include at least one video frame number corresponding to the human body bounding box of the above-mentioned matching result; in response to determining the above-mentioned There are matching results that satisfy the preset video frame number condition in the matching result group set. According to each matching result group corresponding to the matching result that satisfies the above preset video frame number condition, a human body optical flow frame group set is generated, wherein the above preset video The frame number condition is that at least one video frame number included in the matching result does not include the next video frame number, and the above-mentioned next video frame number is the next video frame number of the video frame number corresponding to the above-mentioned matching result in the above-mentioned at least one video frame number. Video frame number; input the above-mentioned human body optical flow frame set to the above-mentioned joint point confidence network to obtain the optical flow human body posture information set and the optical flow joint point confidence information set; based on the above-mentioned optical flow joint point confidence information set , filter the above optical flow human posture information set to obtain the target human posture information set.
第二方面,本公开的一些实施例提供了一种姿态跟踪装置,装置包括:检测单元,被配置成对行人视频进行逐帧检测,得到人体边界框组集,其中,上述人体边界框组集中的人体边界框组对应于行人视频包括的行人视频帧;第一输入单元,被配置成将上述人体边界框组集输入至关节点置信度网络,得到人体姿态信息组集;匹配单元,被配置成根据上述人体姿态信息组集,对上述人体姿态信息组集中的每个人体姿态信息组进行匹配处理,以生成匹配结果组,得到匹配结果组集,其中,上述匹配结果组中的匹配结果包括对应上述匹配结果的人体边界框的至少一个视频帧号;生成单元,被配置成响应于确定上述匹配结果组集中存在满足预设视频帧号条件的匹配结果,根据满足上述预设视频帧号条件的匹配结果所对应的各个匹配结果组,生成人体光流框组集,其中,上述预设视频帧号条件为匹配结果包括的至少一个视频帧号中不包含下一视频帧号,上述下一视频帧号为上述至少 一个视频帧号中上述匹配结果对应的视频帧号的下一帧视频的视频帧号;第二输入单元,被配置成将上述人体光流框组集输入至上述关节点置信度网络,得到光流人体姿态信息组集和光流关节点置信度信息组集;过滤单元,被配置成基于上述光流关节点置信度信息组集,对上述光流人体姿态信息组集进行过滤,得到目标人体姿态信息组集。In a second aspect, some embodiments of the present disclosure provide a gesture tracking device. The device includes: a detection unit configured to detect pedestrian videos frame by frame to obtain a human body bounding box group set, wherein the human body bounding box group set The human body bounding box group corresponds to the pedestrian video frame included in the pedestrian video; the first input unit is configured to input the above human body bounding box group set to the joint point confidence network to obtain the human body posture information group set; the matching unit is configured According to the above-mentioned human body posture information group set, matching processing is performed on each human body posture information group in the above-mentioned human posture information group set to generate a matching result group and obtain a matching result group set, wherein the matching results in the above-mentioned matching result group include At least one video frame number corresponding to the human body boundary box of the above-mentioned matching result; the generation unit is configured to respond to determining that there is a matching result that satisfies the preset video frame number condition in the above-mentioned matching result group set, based on satisfying the above-mentioned preset video frame number condition. Each matching result group corresponding to the matching result generates a human body optical flow frame group set, wherein the above-mentioned preset video frame number condition is that at least one video frame number included in the matching result does not include the next video frame number, and the above-mentioned next The video frame number is the video frame number of the next frame of the video corresponding to the video frame number corresponding to the matching result in the above-mentioned at least one video frame number; the second input unit is configured to input the above-mentioned human body optical flow frame set to the above-mentioned joint point The confidence network obtains the optical flow human posture information group set and the optical flow joint point confidence information group set; the filtering unit is configured to perform the optical flow human posture information group set based on the optical flow joint point confidence information group set. Filter to obtain the target human body posture information set.
In a third aspect, some embodiments of the present disclosure provide an electronic device, including: at least one processor; and a storage apparatus storing at least one program which, when executed by the at least one processor, causes the at least one processor to implement the method described in any implementation of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer-readable medium storing a computer program which, when executed by a processor, implements the method described in any implementation of the first aspect.
Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
Figure 1 is a flowchart of some embodiments of a posture tracking method according to the present disclosure;
Figure 2 is a schematic network diagram of the joint point confidence network of the posture tracking method according to the present disclosure;
Figure 3 is a schematic structural diagram of some embodiments of a posture tracking apparatus according to the present disclosure;
Figure 4 is a schematic structural diagram of an electronic device suitable for implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of protection of the present disclosure.
It should also be noted that, for ease of description, only the parts related to the invention are shown in the drawings. The embodiments of the present disclosure and the features of the embodiments may be combined with one another provided they do not conflict.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are used only to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or interdependence between, the functions performed by these apparatuses, modules, or units.
It should be noted that the modifiers "a/an" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they are to be understood as "one or more".
The names of messages or information exchanged between multiple apparatuses in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
Existing ways of filtering joint points, for example filtering joint points according to position probability, often suffer from the following technical problem: joint points whose position probabilities are low because of motion blur are mistakenly filtered out, so that some pedestrians cannot be tracked accurately, resulting in low accuracy of the multi-person posture tracking task.
To solve the problems described above, some embodiments of the present disclosure provide posture tracking methods, apparatuses, electronic devices, and computer-readable media, which can improve the accuracy of the multi-person posture tracking task.
The present disclosure is described in detail below with reference to the accompanying drawings and in conjunction with embodiments.
Figure 1 shows a flow 100 of some embodiments of a posture tracking method according to the present disclosure. The posture tracking method includes the following steps:
Step 101: detect a pedestrian video frame by frame to obtain a set of human body bounding box groups.
In some embodiments, the execution subject of the posture tracking method (for example, a computing device) may detect the pedestrian video frame by frame to obtain a set of human body bounding box groups. Each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video, and each human body bounding box in a group corresponds to a pedestrian appearing in that frame. The pedestrian video may be a video whose recorded scene contains at least one pedestrian. In practice, the execution subject may use an HTC (Hybrid Task Cascade) detector to detect the pedestrian video frame by frame to obtain the set of human body bounding box groups. In this way, the positions of pedestrians in the pedestrian video can be obtained preliminarily.
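As an illustrative, non-limiting sketch of this frame-by-frame detection (OpenCV is assumed for video decoding, and detect_humans is a hypothetical stand-in for an HTC-style detector; only the looping and grouping logic reflects this step):

import cv2  # OpenCV, assumed available for video decoding

def detect_frame_by_frame(video_path, detect_humans):
    """Run a person detector on every frame; return one bounding box group per frame.

    detect_humans(frame) is assumed to return a list of (x, y, w, h) boxes,
    one per pedestrian detected in the frame.
    """
    capture = cv2.VideoCapture(video_path)
    bounding_box_groups = []  # one group of human bounding boxes per video frame
    frame_number = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        boxes = detect_humans(frame)  # hypothetical HTC-style detector call
        bounding_box_groups.append({"frame_number": frame_number, "boxes": boxes})
        frame_number += 1
    capture.release()
    return bounding_box_groups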
It should be noted that the computing device may be hardware or software. When the computing device is hardware, it may be implemented as a distributed cluster composed of multiple servers or terminal devices, or as a single server or a single terminal device. When the computing device is embodied as software, it may be installed in the hardware devices listed above and may be implemented, for example, as multiple pieces of software or software modules for providing distributed services, or as a single piece of software or software module; no specific limitation is imposed here. It should be understood that there may be any number of computing devices according to implementation needs.
Step 102: input the set of human body bounding box groups into a joint point confidence network to obtain a set of human body posture information groups.
In some embodiments, the execution subject may input the set of human body bounding box groups into the joint point confidence network to obtain a set of human body posture information groups and a set of joint point confidence information groups. The joint point confidence network may be a neural network that takes the set of human body bounding box groups as input and outputs the set of human body posture information groups and the set of joint point confidence information groups; for example, it may be an Hourglass Network. In this way, a set of human body posture information groups characterizing human postures can be obtained, which facilitates tracking the human postures.
As shown in Figure 2, in some optional implementations of some embodiments, the joint point confidence network includes a backbone network, a joint point prediction branch, and a joint point availability branch, where the joint point availability branch includes a residual network and at least one classifier. The backbone network may be an HRNet (High-Resolution Net). The joint point prediction branch may be a neural network that takes the feature map group set output by the backbone network as input and outputs a joint point position probability information group set and a joint point position coordinate information group set. In practice, the joint point prediction branch may apply successive transposed convolutions to the feature map group set output by the backbone network. The loss function for training the joint point prediction branch may be:
L = Σ_{k=1}^{K} v_k × Σ_{i=1}^{W} Σ_{j=1}^{H} (h_k(i,j) - g_k(i,j))²

Here, L denotes the loss function of the joint point prediction branch; K denotes the preset number of joint point types (for example, K may be 15); W denotes the length of the heatmap and H its width, the heatmap being the one generated by the joint point prediction branch during operation; k indexes the k-th joint point, i the i-th column, and j the j-th row of the heatmap. h_k(i,j) denotes the value in the i-th column and j-th row of the heatmap matrix of the k-th joint point, and g_k(i,j) denotes the value in the i-th column and j-th row of the second-order Gaussian distribution label matrix of the k-th joint point; the second-order Gaussian distribution label matrix is obtained by converting the ground-truth label of each preset joint point, and the matrices h_k and g_k have the same size. v_k denotes the preset ground-truth label of the k-th joint point, which may be a label set in advance and takes the value 0 or 1.
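As an illustrative NumPy sketch of this loss (heatmaps h and Gaussian label maps g of shape (K, H, W), ground-truth labels v in {0, 1}; the array names follow the reconstruction above and are assumptions):

import numpy as np

def joint_prediction_loss(h, g, v):
    """Sum of squared differences between predicted heatmaps and Gaussian
    label maps, counting only joints whose ground-truth label v_k is 1."""
    # h, g: (K, H, W) arrays; v: (K,) array of 0/1 labels
    per_joint = ((h - g) ** 2).sum(axis=(1, 2))  # squared error per joint
    return float((v * per_joint).sum())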
The joint point availability branch may be a neural network that takes the feature map group set output by the backbone network as input and outputs a joint point availability probability information group set. A focal loss function may be used as the loss function for training the joint point availability branch, as shown below:
FL(p) = -α_p × (1 - p)^γ × log(p)
p = (p_avl)^y × (1 - p_avl)^(1-y)
α_p = α^y × (1 - α)^(1-y)
Here, y denotes the preset ground-truth label of the joint point, taking the value 0 or 1; p denotes the joint point classification probability; and p_avl denotes the joint point availability probability, which may be the probability output by the classifier included in the joint point availability branch. As the formulas show, when the ground-truth label y is 1, the classification probability p equals p_avl; when y is 0, p equals 1 - p_avl. FL(p) denotes the focal loss function, α_p is the weight factor corresponding to the classification probability p, and γ is the focusing parameter. The remaining parameters may take the default values of the focal loss function, for example γ = 2 and α = 0.25.
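A schematic NumPy version of the focal loss above, using the document's default parameters γ = 2 and α = 0.25 (p_avl is the availability probability output by the classifier, y the 0/1 ground-truth label; the clipping constant is an added numerical safeguard, not part of the formula):

import numpy as np

def availability_focal_loss(p_avl, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for the joint point availability branch."""
    p = np.where(y == 1, p_avl, 1.0 - p_avl)        # classification probability p
    alpha_p = np.where(y == 1, alpha, 1.0 - alpha)  # weight factor alpha_p
    p = np.clip(p, eps, 1.0)                        # numerical safety for log
    return float((-alpha_p * (1.0 - p) ** gamma * np.log(p)).mean())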
In practice, first, the human body bounding boxes included in the set of human body bounding box groups may be cropped and scaled: the execution subject may crop each human body bounding box out of its corresponding pedestrian video frame and scale each cropped box to a fixed size (for example, 384 × 288). Then, when the joint point confidence network processes the cropped and rescaled set of human body bounding box groups: in a first step, the cropped and rescaled set may be input into the backbone network to obtain a feature map group set, where each high-resolution feature output by the last exchange unit of the fourth stage of HRNet may be taken as a feature map output by the backbone network. In a second step, the feature map group set may be input into the joint point prediction branch and the joint point availability branch respectively, to obtain a joint point position probability information group set, a joint point position coordinate information group set, and a joint point availability probability information group set. In a third step, a joint point confidence information group set may be generated from the joint point position probability information group set and the joint point availability probability information group set; in practice, for each joint point, the product of the joint point position probability and the joint point availability probability corresponding to the same joint point may be determined as the joint point confidence, yielding the joint point confidence information. In a fourth step, the human posture characterized by each feature map in the feature map group set may be determined as human body posture information, yielding the set of human body posture information groups. In a fifth step, the joint point confidence information group set and the set of human body posture information groups are output.
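As a sketch of the cropping/scaling and of the per-joint confidence product described in the first and third steps above (OpenCV is assumed for resizing; array and function names are illustrative assumptions):

import cv2
import numpy as np

def crop_and_resize(frame, box, size=(288, 384)):
    """Crop a human bounding box from its video frame and scale it to a fixed
    input size (width 288, height 384, matching the 384 x 288 example)."""
    x, y, w, h = box
    crop = frame[y:y + h, x:x + w]
    return cv2.resize(crop, size)  # cv2.resize takes (width, height)

def joint_confidences(position_probs, availability_probs):
    """Per-joint confidence = position probability x availability probability."""
    return np.asarray(position_probs) * np.asarray(availability_probs)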
Step 103: perform, according to the set of human body posture information groups, matching processing on each human body posture information group in the set to generate a matching result group, thereby obtaining a set of matching result groups.
In some embodiments, the execution subject may perform, according to the set of human body posture information groups, matching processing on each human body posture information group in the set to generate a matching result group, thereby obtaining a set of matching result groups. A matching result in a matching result group includes at least one video frame number of the human body bounding box corresponding to that matching result; a video frame number may be the frame number of a pedestrian video frame in the pedestrian video. In practice, the Hungarian algorithm may be used to perform the matching processing on each human body posture information group. In this way, a matching result corresponding to each piece of human body posture information can be obtained.
In some optional implementations of some embodiments, first, for the human body posture information groups corresponding to each pair of adjacent pedestrian video frames in the set, the execution subject may perform the following operations:
In a first step, the human body posture information group corresponding to the earlier of the two adjacent pedestrian video frames is determined as a first human body posture information group.
In a second step, the human body posture information group corresponding to the later of the two adjacent pedestrian video frames is determined as a second human body posture information group.
In a third step, the distance between each piece of first human body posture information in the first group and each piece of second human body posture information in the second group is determined, yielding a distance set. The distance may be an IoU (Intersection over Union) distance.
In a fourth step, according to the distance set, the first human body posture information in the first group and the second human body posture information in the second group are assigned to each other, yielding an adjacent-video-frame matching result. In practice, the Hungarian algorithm may be applied to the distance set to perform this assignment.
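A sketch of this assignment step, assuming a precomputed matrix of pairwise IoU distances (commonly 1 minus the IoU of the corresponding boxes; SciPy's linear_sum_assignment implements the Hungarian algorithm, and the cutoff value is illustrative):

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_adjacent_frames(distance_matrix, max_distance=0.7):
    """Assign poses of the previous frame (rows) to poses of the next frame
    (columns) by minimizing total IoU distance; pairs whose distance exceeds
    max_distance are treated as unmatched."""
    cost = np.asarray(distance_matrix)
    rows, cols = linear_sum_assignment(cost)
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if cost[r, c] <= max_distance]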
Then, the set of matching result groups may be generated from the obtained adjacent-video-frame matching results. In practice, the video frame numbers corresponding to the successfully matched pieces of second human body posture information may be combined into a matching result, thereby obtaining the set of matching result groups. In this way, inter-frame matching of human postures can be achieved.
Step 104: in response to determining that the set of matching result groups contains a matching result satisfying a preset video frame number condition, generate a set of human body optical flow frame groups according to the matching result groups corresponding to the matching results satisfying the preset video frame number condition.
In some embodiments, in response to determining that the set of matching result groups contains a matching result satisfying the preset video frame number condition, the execution subject may generate a set of human body optical flow frame groups according to the matching result groups corresponding to the matching results satisfying the condition. The preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the video frame number of the frame that follows the video frame corresponding to the matching result. In practice, the set of human body optical flow frame groups may be generated in various ways. In this way, a set of human body optical flow frame groups characterizing predicted human postures can be obtained.
In some optional implementations of some embodiments, in a first step, the execution subject may determine a set of to-be-processed video frames according to the matching result groups corresponding to the matching results satisfying the preset video frame number condition; in practice, the video frames corresponding to those matching result groups may be determined as the to-be-processed video frames. In a second step, for each pair of adjacent to-be-processed video frames in the set, the following sub-steps may be performed:
In a first sub-step, an optical flow map matrix is generated from the two adjacent to-be-processed video frames. The optical flow map matrix includes a set of pixel offsets. In practice, an optical flow method may be used to generate the optical flow map matrix from the two adjacent frames.
In a second sub-step, the human body posture information groups corresponding to the two adjacent to-be-processed video frames are selected from the set of human body posture information groups as first human body posture information groups, yielding a set of first human body posture information groups.
In a third sub-step, the first human body posture information corresponding to the matching results satisfying the preset video frame number condition is selected from the set of first human body posture information groups as second human body posture information, yielding at least one piece of second human body posture information.
In a fourth sub-step, for each piece of second human body posture information, the pixel offsets whose corresponding pixels lie within the range corresponding to that second human body posture information are selected from the pixel offset set as target pixel offsets, yielding a target pixel offset set corresponding to the second human body posture information. Here, the corresponding pixel may be the shifted one of the two pixels related by the pixel offset.
In a fifth sub-step, an optical flow mask map is generated from the target pixel offset set. In practice, the target pixel offsets in the set may be assembled into a matrix to obtain the optical flow mask map.
In a sixth sub-step, the minimum enclosing rectangle of the optical flow mask map is determined as a human body optical flow frame, which corresponds to the later of the two adjacent to-be-processed video frames. In this way, a human body optical flow frame characterizing the prediction for the later to-be-processed video frame can be obtained.
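As an illustrative sketch of the fifth and sixth sub-steps (the mask-building step is one possible reading of assembling target pixel offsets into a matrix; flow is assumed to be a dense (H, W, 2) pixel-offset field and region_mask a 0/1 map of the pose region, both hypothetical names):

import numpy as np

def shifted_mask(region_mask, flow):
    """Shift each pixel of a pose region by its flow offset to form the
    optical flow mask map for the next frame."""
    h, w = region_mask.shape
    out = np.zeros_like(region_mask)
    ys, xs = np.nonzero(region_mask)
    new_xs = np.clip((xs + flow[ys, xs, 0]).round().astype(int), 0, w - 1)
    new_ys = np.clip((ys + flow[ys, xs, 1]).round().astype(int), 0, h - 1)
    out[new_ys, new_xs] = 1
    return out

def optical_flow_box(mask):
    """Minimum axis-aligned rectangle enclosing the nonzero pixels of an
    optical flow mask map; returned as (x, y, w, h)."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None  # no flow pixels fell inside the pose region
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))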
Step 105: input the set of human body optical flow frame groups into the joint point confidence network to obtain a set of optical flow human body posture information groups and a set of optical flow joint point confidence information groups.
In some embodiments, the execution subject may input the set of human body optical flow frame groups into the joint point confidence network to obtain a set of optical flow human body posture information groups and a set of optical flow joint point confidence information groups. In this way, optical flow human body posture information characterizing the human postures of the human body optical flow frames can be obtained.
Step 106: filter the set of optical flow human body posture information groups based on the set of optical flow joint point confidence information groups to obtain a set of target human body posture information groups.
In some embodiments, the execution subject may filter the set of optical flow human body posture information groups based on the set of optical flow joint point confidence information groups to obtain a set of target human body posture information groups. In practice, a joint point confidence threshold condition may be set; in response to the presence, in the set of optical flow joint point confidence information groups, of joint point confidence information satisfying the set condition, the corresponding optical flow human body posture information is deleted. For example, the condition may be that the optical flow joint point confidence information includes an optical flow joint point confidence smaller than a preset confidence threshold. In this way, the set of optical flow human body posture information groups can be filtered, yielding a set of target human body posture information groups with relatively high confidence.
In some optional implementations of some embodiments, first, for each piece of optical flow joint point confidence information in the set, the execution subject may generate an average optical flow joint point confidence from that information; in practice, the mean of the optical flow joint point confidences included in the information may be determined as the average. Next, the average optical flow joint point confidences smaller than a preset optical flow threshold may be selected as first to-be-filtered confidences. Finally, the optical flow human body posture information corresponding to the first to-be-filtered confidences may be deleted from the set of optical flow human body posture information groups. In this way, optical flow human body posture information with relatively high confidence is retained.
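A sketch of this averaging-and-thresholding rule, assuming each pose record carries its per-joint optical flow confidences (the field name and the threshold value are illustrative assumptions):

import numpy as np

def filter_by_mean_confidence(poses, flow_threshold=0.4):
    """Keep only optical flow poses whose mean joint confidence reaches the
    preset optical flow threshold."""
    return [pose for pose in poses
            if float(np.mean(pose["joint_confidences"])) >= flow_threshold]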
Optionally, first, the execution subject may determine the video frame number corresponding to each optical flow human body posture information group in the deletion-processed set as a target video frame number, yielding a target video frame number set. Next, the human body posture information groups whose corresponding video frame numbers fall within the target video frame number set may be selected from the set of human body posture information groups as comparison human body posture information groups, yielding a set of comparison human body posture information groups. Then, a set of posture overlap degree groups may be generated from the deletion-processed set of optical flow human body posture information groups and the set of comparison human body posture information groups. In practice, the posture overlap degree may be generated using the following formula:
O(p,q) = δ(IOU_{p,q} > ε) × [ Σ_k exp(-d_{p,q}(k)² / (2 × s² × σ_k²)) × δ(v_pk > 0) × δ(v_qk > 0) ] / [ Σ_k δ(v_pk > 0) × δ(v_qk > 0) ]

Here, δ denotes a function converting a Boolean result into 0 or 1: when the condition in parentheses holds, δ returns 1, otherwise 0. p and q denote the two pedestrians corresponding to the optical flow human body posture information and the comparison human body posture information respectively. IOU_{p,q} denotes the IoU value (Intersection over Union) of pedestrian p and pedestrian q, which may be determined in advance. k indexes the k-th joint point, and d_{p,q}(k) denotes the Euclidean distance between the k-th joint point of pedestrian p and the k-th joint point of pedestrian q. s² denotes the scale factor of pedestrians p and q, which may be determined from the sum of the square roots of the areas of the human body bounding boxes corresponding to the optical flow human body posture information and the comparison human body posture information. σ_k denotes the normalization factor of the k-th joint point, which may be determined in advance. v_pk denotes the confidence of the k-th joint point of pedestrian p, and v_qk the confidence of the k-th joint point of pedestrian q. ε is a preset parameter.
Afterwards, the posture overlap degrees greater than a preset posture overlap degree threshold may be selected from the set of posture overlap degree groups as second to-be-filtered confidences. Finally, the optical flow human body posture information corresponding to the second to-be-filtered confidences may be deleted from the set of optical flow human body posture information groups.
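A sketch consistent with the overlap measure reconstructed above (keypoints as (K, 2) coordinate arrays, confidences as (K,) arrays; the IoU value, scale factor s, per-joint normalization factors, and the preset parameter ε are supplied by the caller, and all names are illustrative):

import numpy as np

def pose_overlap(kp_p, kp_q, v_p, v_q, iou, s, sigmas, eps_iou):
    """OKS-style overlap between two poses, gated by the IoU of their boxes."""
    if iou <= eps_iou:  # delta(...) evaluates to 0: boxes barely overlap
        return 0.0
    visible = (v_p > 0) & (v_q > 0)  # joints confident in both poses
    if not visible.any():
        return 0.0
    d2 = ((kp_p - kp_q) ** 2).sum(axis=1)  # squared per-joint distances
    oks = np.exp(-d2 / (2.0 * s ** 2 * sigmas ** 2))
    return float(oks[visible].sum() / visible.sum())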
As an inventive point of the embodiments of the present disclosure, the above content solves technical problem two mentioned in the background: "in scenes where human bodies occlude one another, joint points that are wrongly labeled onto other people and therefore carry high position probabilities cannot be filtered out, so some pedestrians cannot be tracked accurately, resulting in low accuracy of the multi-person posture tracking task." The factors leading to the low accuracy are as follows: in such interleaved-occlusion scenes, because joint points wrongly labeled onto others carry high position probabilities, the similarity obtained by measuring the intersection-over-union with IoU is low and cannot be used for filtering; in addition, because the joint points corresponding to the two boxes differ, the similarity is likewise low and filtering fails. If these factors are resolved, the tracking results can be put into actual production use. To achieve this effect, the present disclosure filters joint points by means of the posture overlap degree described above, which allows overlapping optical flow human body posture information to be filtered effectively, so that the affected pedestrians can be tracked accurately, improving the accuracy of the multi-person posture tracking task.
Optionally, first, the execution subject may determine the matching results in the set of matching result groups that satisfy a preset single-video-frame-number condition as target matching results, yielding a target matching result set; the preset single-video-frame-number condition may be that a matching result includes only one video frame number. Next, the human body posture information corresponding to the target matching results may be selected from the set of human body posture information groups as to-be-integrated human body posture information, yielding a to-be-integrated human body posture information set. Then, the to-be-integrated human body posture information set may be input into a preset pedestrian feature model to obtain a to-be-integrated pedestrian feature information set. The preset pedestrian feature model may be a neural network model that takes the to-be-integrated human body posture information set as input and outputs the to-be-integrated pedestrian feature information set; it may include a backbone network, a pedestrian feature extraction module, and a classifier. The backbone network may be HRNet; the pedestrian feature extraction module may include a convolution layer, an average pooling layer, and a batch normalization layer; and the classifier may include a fully connected layer and a Softmax layer. It should be noted that the classifier is used only for training the preset pedestrian feature model and is not used when the model is actually applied. Afterwards, pedestrian feature similarity information may be generated for each piece of to-be-integrated pedestrian feature information according to the to-be-integrated pedestrian feature information set, yielding a pedestrian feature similarity information set; in practice, the similarity between each piece of to-be-integrated pedestrian feature information and every other piece may be determined. The similarity may be determined by the following formula:
S(p,q) = (1/K) × Σ_k (f_p(k) · f_q(k)) / (‖f_p(k)‖ × ‖f_q(k)‖)

Here, p and q denote the two pedestrians corresponding to the two pieces of to-be-integrated pedestrian feature information; S(p,q) denotes the pedestrian feature similarity; D denotes the dimension of the feature vectors characterized by the to-be-integrated pedestrian feature information; k indexes the k-th joint point; f_p(k) denotes the D-dimensional feature vector of the k-th joint point of pedestrian p, and f_q(k) the D-dimensional feature vector of the k-th joint point of pedestrian q.
Next, the pedestrian feature similarity information in the set that satisfies a preset similarity condition may be determined as target similarity information, yielding a target similarity information set; the preset similarity condition may be that the pedestrian feature similarity is greater than a preset pedestrian feature similarity threshold. Then, the set of matching result groups may be updated according to the target similarity information set; in practice, the video frame number corresponding to each piece of target similarity information may be recorded into the corresponding matching result. Finally, for each piece of target similarity information, the pedestrian feature similarity information corresponding to it may be deleted from the pedestrian feature similarity information set. In this way, human body posture information whose matching result contains only one video frame number can be matched again, so that posture information left unmatched because of detector performance is matched successfully, improving the accuracy of the multi-person posture tracking task.
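Under the reconstructed formula above, with per-joint D-dimensional feature vectors stacked as (K, D) arrays (names and the averaging over joints are assumptions of the reconstruction), the similarity may be sketched as:

import numpy as np

def pedestrian_feature_similarity(f_p, f_q, eps=1e-8):
    """Mean cosine similarity between corresponding per-joint feature vectors."""
    num = (f_p * f_q).sum(axis=1)  # per-joint dot products
    den = np.linalg.norm(f_p, axis=1) * np.linalg.norm(f_q, axis=1) + eps
    return float((num / den).mean())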
The embodiments of the present disclosure described above have the following beneficial effects: the posture tracking methods of some embodiments of the present disclosure can improve the accuracy of the multi-person posture tracking task. Specifically, the accuracy of existing multi-person posture tracking is low because joint points whose position probabilities are low due to motion blur are mistakenly filtered out, so that some pedestrians cannot be tracked accurately. On this basis, the posture tracking method of some embodiments of the present disclosure first detects the pedestrian video frame by frame to obtain a set of human body bounding box groups, so that the positions of pedestrians in the pedestrian video are obtained preliminarily. The set of human body bounding box groups is then input into the joint point confidence network to obtain a set of human body posture information groups characterizing human postures, which facilitates tracking. Next, matching processing is performed on each human body posture information group according to the set, generating matching result groups and thus a set of matching result groups, so that a matching result corresponding to each piece of human body posture information is obtained. Then, in response to determining that the set of matching result groups contains a matching result satisfying the preset video frame number condition, a set of human body optical flow frame groups characterizing predicted human postures is generated from the corresponding matching result groups. The set of human body optical flow frame groups is then input into the joint point confidence network, yielding a set of optical flow human body posture information groups and a set of optical flow joint point confidence information groups. Finally, the set of optical flow human body posture information groups is filtered based on the set of optical flow joint point confidence information groups to obtain a set of target human body posture information groups with relatively high confidence. Because the optical flow joint point confidence information is used for the filtering, a joint point whose imaging is blurred by motion is retained as long as its confidence is high; this avoids mistakenly filtering out joint points whose position probabilities are low due to motion blur, so that those joint points of the pedestrian can be tracked accurately, improving the accuracy of the multi-person posture tracking task.
Continuing to refer to Figure 3, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a posture tracking apparatus. These apparatus embodiments correspond to the method embodiments shown in Figure 1, and the apparatus may be applied to various electronic devices.
As shown in Figure 3, the posture tracking apparatus 300 of some embodiments includes: a detection unit 301, a first input unit 302, a matching unit 303, a generation unit 304, a second input unit 305, and a filtering unit 306. The detection unit 301 is configured to detect a pedestrian video frame by frame to obtain a set of human body bounding box groups, where each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video. The first input unit 302 is configured to input the set of human body bounding box groups into a joint point confidence network to obtain a set of human body posture information groups. The matching unit 303 is configured to perform, according to the set of human body posture information groups, matching processing on each human body posture information group in the set to generate a matching result group, thereby obtaining a set of matching result groups, where a matching result in a matching result group includes at least one video frame number of the human body bounding box corresponding to that matching result. The generation unit 304 is configured to, in response to determining that the set of matching result groups contains a matching result satisfying a preset video frame number condition, generate a set of human body optical flow frame groups according to the matching result groups corresponding to the matching results satisfying the condition, where the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the video frame number of the frame that follows the video frame corresponding to the matching result. The second input unit 305 is configured to input the set of human body optical flow frame groups into the joint point confidence network to obtain a set of optical flow human body posture information groups and a set of optical flow joint point confidence information groups. The filtering unit 306 is configured to filter the set of optical flow human body posture information groups based on the set of optical flow joint point confidence information groups to obtain a set of target human body posture information groups.
It can be understood that the units recorded in the apparatus 300 correspond to the respective steps of the method described with reference to Figure 1. Accordingly, the operations, features, and beneficial effects described above for the method also apply to the apparatus 300 and the units contained in it, and are not repeated here.
Referring now to Figure 4, it shows a schematic structural diagram of an electronic device (for example, a computing device) 400 suitable for implementing some embodiments of the present disclosure. The electronic device shown in Figure 4 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Figure 4, the electronic device 400 may include a processing apparatus (for example, a central processing unit, a graphics processing unit, etc.) 401, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage apparatus 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400. The processing apparatus 401, the ROM 402, and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following apparatuses may be connected to the I/O interface 405: an input apparatus 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output apparatus 407 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage apparatus 408 including, for example, a magnetic tape and a hard disk; and a communication apparatus 409. The communication apparatus 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 4 shows the electronic device 400 with various apparatuses, it should be understood that it is not required to implement or possess all of the apparatuses shown; more or fewer apparatuses may alternatively be implemented or provided. Each block shown in Figure 4 may represent one apparatus, or may represent multiple apparatuses as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network via the communication apparatus 409, or installed from the storage apparatus 408, or installed from the ROM 402. When the computer program is executed by the processing apparatus 401, the above-described functions defined in the methods of some embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described in some embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having at least one wire, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In some embodiments of the present disclosure, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, and may send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to a wire, an optical cable, RF (radio frequency), or any suitable combination of the above.
In some implementations, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), an end-to-end network (for example, an ad hoc end-to-end network), and any currently known or future-developed network.
The computer-readable medium may be contained in the electronic device, or may exist separately without being assembled into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: detect a pedestrian video frame by frame to obtain a set of human body bounding box groups, where each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video; input the set of human body bounding box groups into a joint point confidence network to obtain a set of human body posture information groups; perform, according to the set of human body posture information groups, matching processing on each human body posture information group in the set to generate a matching result group, thereby obtaining a set of matching result groups, where a matching result in a matching result group includes at least one video frame number of the human body bounding box corresponding to that matching result; in response to determining that the set of matching result groups contains a matching result satisfying a preset video frame number condition, generate a set of human body optical flow frame groups according to the matching result groups corresponding to the matching results satisfying the condition, where the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the video frame number of the frame that follows the video frame corresponding to the matching result; input the set of human body optical flow frame groups into the joint point confidence network to obtain a set of optical flow human body posture information groups and a set of optical flow joint point confidence information groups; and filter the set of optical flow human body posture information groups based on the set of optical flow joint point confidence information groups to obtain a set of target human body posture information groups.
Computer program code for carrying out the operations of some embodiments of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains at least one executable instruction for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented in software or in hardware. The described units may also be provided in a processor; for example, a processor may be described as comprising a detection unit, a first input unit, a matching unit, a generation unit, a second input unit, and a filtering unit. The names of these units do not in some cases limit the units themselves; for example, the first input unit may also be described as "a unit that inputs the set of human body bounding box groups into the joint point confidence network to obtain the set of human body posture information groups".
The functions described herein above may be performed, at least in part, by at least one hardware logic component. For example, without limitation, exemplary types of hardware logic components that may be used include field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.
The above description is merely a description of some preferred embodiments of the present disclosure and of the technical principles applied. Those skilled in the art should understand that the scope of invention involved in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above inventive concept, for example, technical solutions formed by replacing the above features with technical features of similar functions disclosed in (but not limited to) the embodiments of the present disclosure.

Claims (9)

  1. A posture tracking method, comprising:
    detecting a pedestrian video frame by frame to obtain a set of human body bounding box groups, wherein each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video;
    inputting the set of human body bounding box groups into a joint point confidence network to obtain a set of human body posture information groups;
    performing, according to the set of human body posture information groups, matching processing on each human body posture information group in the set to generate matching result groups, obtaining a set of matching result groups, wherein each matching result in a matching result group includes at least one video frame number of the human body bounding box corresponding to that matching result;
    in response to determining that the set of matching result groups contains a matching result satisfying a preset video frame number condition, generating a set of human body optical flow frame groups according to the matching result groups corresponding to the matching results satisfying the preset video frame number condition, wherein the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the frame number of the video frame that follows the video frame number corresponding to the matching result;
    inputting the set of human body optical flow frame groups into the joint point confidence network to obtain a set of optical flow human body posture information groups and a set of optical flow joint point confidence information groups; and
    filtering the set of optical flow human body posture information groups based on the set of optical flow joint point confidence information groups to obtain a set of target human body posture information groups.
  2. The method according to claim 1, wherein generating the set of human body optical flow frame groups comprises:
    determining a set of to-be-processed video frames according to the matching result groups corresponding to the matching results satisfying the preset video frame number condition; and
    for each pair of adjacent to-be-processed video frames in the set of to-be-processed video frames, performing the following steps:
    generating an optical flow map matrix from the two adjacent to-be-processed video frames, wherein the optical flow map matrix includes a set of pixel offsets;
    selecting, from the set of human body posture information groups, the human body posture information groups corresponding to the two adjacent to-be-processed video frames as first human body posture information groups, obtaining a set of first human body posture information groups;
    selecting, from the set of first human body posture information groups, the first human body posture information corresponding to the matching results satisfying the preset video frame number condition as second human body posture information, obtaining at least one piece of second human body posture information;
    for each piece of second human body posture information, selecting, from the set of pixel offsets, the pixel offsets whose corresponding pixels fall within the range corresponding to the second human body posture information as target pixel offsets, obtaining a set of target pixel offsets corresponding to the second human body posture information;
    generating an optical flow mask map according to the set of target pixel offsets; and
    determining the minimum enclosing rectangle of the optical flow mask map as a human body optical flow frame.
  3. The method according to claim 1 or 2, wherein filtering the set of optical flow human body posture information groups comprises:
    for each piece of optical flow joint point confidence information included in the set of optical flow joint point confidence information groups, generating an average optical flow joint point confidence according to the optical flow joint point confidence information;
    selecting, from the obtained average optical flow joint point confidences, those smaller than a preset optical flow threshold as first to-be-filtered confidences; and
    deleting the optical flow human body posture information corresponding to the obtained first to-be-filtered confidences from the set of optical flow human body posture information groups.
  4. The method according to any one of claims 1-3, wherein the joint point confidence network includes a backbone network, a joint point prediction branch, and a joint point availability branch, the joint point availability branch including a residual network and at least one classifier.
  5. The method according to any one of claims 1-4, further comprising:
    determining the matching results in the set of matching result groups that satisfy a preset single video frame number condition as target matching results, obtaining a set of target matching results;
    selecting, from the set of human body posture information groups, the human body posture information corresponding to the target matching results in the set of target matching results as to-be-integrated human body posture information, obtaining a set of to-be-integrated human body posture information;
    inputting the set of to-be-integrated human body posture information into a preset pedestrian feature model to obtain a set of to-be-integrated pedestrian feature information;
    generating, according to the set of to-be-integrated pedestrian feature information, pedestrian feature similarity information for each piece of to-be-integrated pedestrian feature information in the set, obtaining a set of pedestrian feature similarity information;
    determining the pedestrian feature similarity information in the set that satisfies a preset similarity condition as target similarity information, obtaining a set of target similarity information;
    updating the set of matching result groups according to the set of target similarity information; and
    for each piece of target similarity information in the set of target similarity information, deleting the pedestrian feature similarity information corresponding to the target similarity information from the set of pedestrian feature similarity information.
  6. The method according to any one of claims 1-5, wherein the matching processing on the set of human body posture information groups comprises:
    for the human body posture information groups respectively corresponding to each pair of adjacent pedestrian video frames in the set of human body posture information groups, performing the following operations:
    determining the human body posture information group corresponding to the earlier of the two adjacent pedestrian video frames as a first human body posture information group;
    determining the human body posture information group corresponding to the later of the two adjacent pedestrian video frames as a second human body posture information group;
    determining the distance between each piece of first human body posture information in the first human body posture information group and each piece of second human body posture information in the second human body posture information group, obtaining a distance set;
    performing, according to the distance set, assignment processing on the first human body posture information in the first human body posture information group and the second human body posture information in the second human body posture information group, obtaining adjacent video frame matching results; and
    generating the set of matching result groups according to the obtained adjacent video frame matching results.
  7. A posture tracking apparatus, comprising:
    a detection unit configured to detect a pedestrian video frame by frame to obtain a set of human body bounding box groups, wherein each human body bounding box group in the set corresponds to a pedestrian video frame included in the pedestrian video;
    a first input unit configured to input the set of human body bounding box groups into a joint point confidence network to obtain a set of human body posture information groups;
    a matching unit configured to perform, according to the set of human body posture information groups, matching processing on each human body posture information group in the set to generate matching result groups, obtaining a set of matching result groups, wherein each matching result in a matching result group includes at least one video frame number of the human body bounding box corresponding to that matching result;
    a generation unit configured to, in response to determining that the set of matching result groups contains a matching result satisfying a preset video frame number condition, generate a set of human body optical flow frame groups according to the matching result groups corresponding to the matching results satisfying the preset video frame number condition, wherein the preset video frame number condition is that the at least one video frame number included in a matching result does not contain the next video frame number, the next video frame number being the frame number of the video frame that follows the video frame number corresponding to the matching result;
    a second input unit configured to input the set of human body optical flow frame groups into the joint point confidence network to obtain a set of optical flow human body posture information groups and a set of optical flow joint point confidence information groups; and
    a filtering unit configured to filter the set of optical flow human body posture information groups based on the set of optical flow joint point confidence information groups to obtain a set of target human body posture information groups.
  8. An electronic device, comprising:
    at least one processor; and
    a storage device having at least one program stored thereon,
    wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1-6.
  9. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
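The sketches below are editorial illustrations of selected claims above, not code from the disclosure. For the optical flow frame construction of claim 2, this first sketch assumes OpenCV's Farnebäck dense optical flow as the (unnamed) flow estimator, assumes keypoints are given as (x, y) coordinates, and uses an illustrative motion threshold and margin; `human_flow_box` and all its parameters are hypothetical.

```python
import cv2
import numpy as np

def human_flow_box(prev_frame, next_frame, keypoints,
                   motion_thresh=1.0, margin=10):
    """Sketch of claim 2 for one lost person between two adjacent frames.

    keypoints: (K, 2) array of the person's joint (x, y) coordinates in
    prev_frame (the 'second human body posture information'). Returns the
    minimum enclosing rectangle (x, y, w, h) of the optical flow mask,
    or None if no sufficiently moving pixels are found.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)

    # Optical flow map matrix: per-pixel (dx, dy) offsets between the frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

    # Target pixel offsets: restrict to the range around the person's pose.
    x_min, y_min = np.floor(keypoints.min(axis=0)).astype(int) - margin
    x_max, y_max = np.ceil(keypoints.max(axis=0)).astype(int) + margin
    region = np.zeros(prev_gray.shape, dtype=bool)
    region[max(y_min, 0):max(y_max, 0), max(x_min, 0):max(x_max, 0)] = True

    # Optical flow mask: in-region pixels whose offset magnitude suggests motion.
    magnitude = np.linalg.norm(flow, axis=2)
    mask = region & (magnitude > motion_thresh)

    # Minimum enclosing axis-aligned rectangle of the mask.
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))
```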
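The confidence filtering of claim 3 reduces to a mean-threshold test per recovered pose. In this sketch the 0.4 threshold is an illustrative assumption; the disclosure does not fix a value for the preset optical flow threshold.

```python
import numpy as np

def filter_flow_poses(flow_poses, flow_joint_confidences, flow_threshold=0.4):
    """Sketch of claim 3: drop flow-recovered poses with weak joint confidence.

    flow_poses: list of pose arrays for the recovered people.
    flow_joint_confidences: list of per-joint confidence arrays aligned with
    flow_poses. Poses whose average optical flow joint point confidence falls
    below the preset optical flow threshold are deleted.
    """
    return [pose
            for pose, joint_conf in zip(flow_poses, flow_joint_confidences)
            if np.mean(joint_conf) >= flow_threshold]
```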
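For claim 5, a natural reading is that single-frame leftovers are linked by appearance. The sketch below assumes the "preset pedestrian feature model" is a re-identification embedder producing L2-normalized vectors and that the "preset similarity condition" is a cosine similarity threshold; both are assumptions, as is the 0.7 value.

```python
from itertools import combinations
import numpy as np

def merge_by_appearance(track_features, similarity_thresh=0.7):
    """Sketch of claim 5: propose merges between single-frame tracks.

    track_features: dict of track id -> L2-normalized appearance embedding,
    assumed to come from some pedestrian re-identification model.
    Returns (id_a, id_b, similarity) tuples that pass the threshold.
    """
    proposals = []
    for a, b in combinations(track_features, 2):
        # Cosine similarity of normalized vectors is a plain dot product.
        sim = float(np.dot(track_features[a], track_features[b]))
        if sim >= similarity_thresh:  # assumed form of the similarity condition
            proposals.append((a, b, sim))
    # Highest-similarity pairs first, mirroring 'select target similarity
    # information, update the matches, then remove it from the set'.
    return sorted(proposals, key=lambda t: -t[2])
```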
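Finally, claim 6 matches poses between adjacent frames via a distance set followed by assignment processing. The disclosure does not name the assignment solver; this sketch assumes mean joint distance and the Hungarian algorithm via scipy.optimize.linear_sum_assignment, with an illustrative gating distance.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_adjacent_frames(prev_poses, next_poses, max_dist=80.0):
    """Sketch of claim 6: match poses of frame t (prev) to frame t+1 (next).

    prev_poses, next_poses: arrays of shape (N, K, 2) and (M, K, 2) holding
    K joint (x, y) coordinates per person. Returns (i, j) index pairs.
    The mean-joint distance metric and the Hungarian solver are assumptions.
    """
    if len(prev_poses) == 0 or len(next_poses) == 0:
        return []

    # Distance set: mean joint-to-joint distance between every pose pair.
    dist = np.array([[np.mean(np.linalg.norm(p - q, axis=-1))
                      for q in next_poses] for p in prev_poses])

    # Assignment processing: one-to-one matching minimizing total distance.
    rows, cols = linear_sum_assignment(dist)

    # Reject pairs too far apart to be the same person (gating, assumed).
    return [(i, j) for i, j in zip(rows, cols) if dist[i, j] <= max_dist]
```

Indices left unmatched on either side then become new track starts or candidates for the optical flow recovery of claims 1 and 2.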
PCT/CN2022/092143 2022-04-15 2022-05-11 Posture tracking method and apparatus, electronic device, and computer readable medium WO2023197390A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210395730.0 2022-04-15
CN202210395730.0A CN115311324A (en) 2022-04-15 2022-04-15 Attitude tracking method and apparatus, electronic device, and computer-readable medium

Publications (1)

Publication Number Publication Date
WO2023197390A1 (en)

Family

ID=83854891

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092143 WO2023197390A1 (en) 2022-04-15 2022-05-11 Posture tracking method and apparatus, electronic device, and computer readable medium

Country Status (2)

Country Link
CN (1) CN115311324A (en)
WO (1) WO2023197390A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682302A (en) * 2012-03-12 2012-09-19 Zhejiang University of Technology Human body posture identification method based on multi-characteristic fusion of key frame
US20120281918A1 (en) * 2011-05-04 2012-11-08 National Chiao Tung University Method for dynamically setting environmental boundary in image and method for instantly determining human activity
US20140270540A1 (en) * 2013-03-13 2014-09-18 Mecommerce, Inc. Determining dimension of target object in an image using reference object
CN104392223A (en) * 2014-12-05 2015-03-04 Qingdao University of Science and Technology Method for recognizing human postures in two-dimensional video images
CN112651291A (en) * 2020-10-01 2021-04-13 新加坡依图有限责任公司(私有) Video-based posture estimation method, device, medium and electronic equipment

Also Published As

Publication number Publication date
CN115311324A (en) 2022-11-08

Similar Documents

Publication Publication Date Title
US11734851B2 (en) Face key point detection method and apparatus, storage medium, and electronic device
CN108710885B (en) Target object detection method and device
CN108062525B (en) Deep learning hand detection method based on hand region prediction
US20230267735A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
CN112668588B (en) Parking space information generation method, device, equipment and computer readable medium
WO2023082453A1 (en) Image processing method and device
WO2021249114A1 (en) Target tracking method and target tracking device
CN111582074A (en) Monitoring video leaf occlusion detection method based on scene depth information perception
JP2023525462A (en) Methods, apparatus, electronics, storage media and computer programs for extracting features
CN114241386A (en) Method for detecting and identifying hidden danger of power transmission line based on real-time video stream
CN116363748A (en) Power grid field operation integrated management and control method based on infrared-visible light image fusion
CN108229281B (en) Neural network generation method, face detection device and electronic equipment
CN113992860B (en) Behavior recognition method and device based on cloud edge cooperation, electronic equipment and medium
CN114037087B (en) Model training method and device, depth prediction method and device, equipment and medium
CN114169425B (en) Training target tracking model and target tracking method and device
WO2023197390A1 (en) Posture tracking method and apparatus, electronic device, and computer readable medium
CN111310595A (en) Method and apparatus for generating information
CN116110095A (en) Training method of face filtering model, face recognition method and device
CN112686828B (en) Video denoising method, device, equipment and storage medium
CN111652831B (en) Object fusion method and device, computer-readable storage medium and electronic equipment
CN114120423A (en) Face image detection method and device, electronic equipment and computer readable medium
CN113642510A (en) Target detection method, device, equipment and computer readable medium
CN112069997A (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN115861684B (en) Training method of image classification model, image classification method and device
WO2024099026A1 (en) Image processing method and apparatus, device, storage medium and program product

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22937032

Country of ref document: EP

Kind code of ref document: A1