WO2023197390A1

WO2023197390A1 - Posture tracking method and apparatus, electronic device, and computer readable medium

Info

Publication number: WO2023197390A1
Application number: PCT/CN2022/092143
Authority: WO
Inventors: 傅泽华; 左文航; 胡征慧; 刘庆杰; 王蕴红
Original assignee: 北京航空航天大学杭州创新研究院
Priority date: 2022-04-15
Filing date: 2022-05-11
Publication date: 2023-10-19
Also published as: CN115311324A

Abstract

Disclosed in embodiments of the present disclosure are a posture tracking method and apparatus, an electronic device, and a computer readable medium. A specific implementation of the method comprises: performing frame-by-frame detection on a pedestrian video to obtain a human body bounding box group set; inputting the human body bounding box group set into a joint point confidence network to obtain a human body posture information group set; performing matching processing on each human body posture information group in the human body posture information group set to obtain a matching result group set; generating a human body optical flow box group set; and inputting the human body optical flow box group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set; and on the basis of the optical flow joint point confidence information group set, filtering the optical flow human body posture information group set to obtain a target human body posture information group set.

Description

Attitude tracking method, device, electronic device and computer-readable medium

Cross-references to related applications

This application requires the priority of a Chinese patent application with an application date of April 15, 2022, an application number of 202210395730.0, and an invention title of "Attitude Tracking Method, Device, Electronic Equipment and Computer-Readable Medium", the entire content of which is incorporated as a whole. into this application.

Technical field

Embodiments of the present disclosure relate to the field of computer technology, and specifically to attitude tracking methods, devices, electronic devices and computer-readable media.

Background technique

The multi-person pose tracking task in the field of computer vision refers to processing the input video, detecting the pose of each pedestrian in each frame of the video, and then analyzing the target's appearance characteristics, position, motion status and other information. Computational analysis correctly records each person's continuous posture trajectory over time. Currently, when performing multi-person pose tracking tasks, in addition to correctly identifying the location of each pedestrian's joint points, it is also necessary to filter unavailable joint points caused by occlusion and other reasons.

Contents of the invention

This Summary is provided to introduce in simplified form concepts that are later described in detail in the Detailed Description. The content of this disclosure is not intended to identify key features or essential features of the claimed technical solutions, nor is it intended to be used to limit the scope of the claimed technical solutions.

Some embodiments of the present disclosure provide gesture tracking methods, devices, electronic devices, and computer-readable media to solve one or more of the technical problems mentioned in the background section above.

In a first aspect, some embodiments of the present disclosure provide a posture tracking method. The method includes: detecting pedestrian videos frame by frame to obtain a human body bounding box group set, wherein the human body bounding box group in the human body bounding box group set Corresponding to the pedestrian video frames included in the pedestrian video; input the above-mentioned human body bounding box group set to the joint point confidence network to obtain the human body posture information group set; according to the above-mentioned human body posture information group set, for each person in the above-mentioned human posture information group set The body posture information group is subjected to matching processing to generate a matching result group and obtain a matching result group set, wherein the matching results in the above-mentioned matching result group include at least one video frame number corresponding to the human body bounding box of the above-mentioned matching result; in response to determining the above-mentioned There are matching results that satisfy the preset video frame number condition in the matching result group set. According to each matching result group corresponding to the matching result that satisfies the above preset video frame number condition, a human body optical flow frame group set is generated, wherein the above preset video The frame number condition is that at least one video frame number included in the matching result does not include the next video frame number, and the above-mentioned next video frame number is the next video frame number of the video frame number corresponding to the above-mentioned matching result in the above-mentioned at least one video frame number. Video frame number; input the above-mentioned human body optical flow frame set to the above-mentioned joint point confidence network to obtain the optical flow human body posture information set and the optical flow joint point confidence information set; based on the above-mentioned optical flow joint point confidence information set , filter the above optical flow human posture information set to obtain the target human posture information set.

In a second aspect, some embodiments of the present disclosure provide a gesture tracking device. The device includes: a detection unit configured to detect pedestrian videos frame by frame to obtain a human body bounding box group set, wherein the human body bounding box group set The human body bounding box group corresponds to the pedestrian video frame included in the pedestrian video; the first input unit is configured to input the above human body bounding box group set to the joint point confidence network to obtain the human body posture information group set; the matching unit is configured According to the above-mentioned human body posture information group set, matching processing is performed on each human body posture information group in the above-mentioned human posture information group set to generate a matching result group and obtain a matching result group set, wherein the matching results in the above-mentioned matching result group include At least one video frame number corresponding to the human body boundary box of the above-mentioned matching result; the generation unit is configured to respond to determining that there is a matching result that satisfies the preset video frame number condition in the above-mentioned matching result group set, based on satisfying the above-mentioned preset video frame number condition. Each matching result group corresponding to the matching result generates a human body optical flow frame group set, wherein the above-mentioned preset video frame number condition is that at least one video frame number included in the matching result does not include the next video frame number, and the above-mentioned next The video frame number is the video frame number of the next frame of the video corresponding to the video frame number corresponding to the matching result in the above-mentioned at least one video frame number; the second input unit is configured to input the above-mentioned human body optical flow frame set to the above-mentioned joint point The confidence network obtains the optical flow human posture information group set and the optical flow joint point confidence information group set; the filtering unit is configured to perform the optical flow human posture information group set based on the optical flow joint point confidence information group set. Filter to obtain the target human body posture information set.

In a third aspect, some embodiments of the present disclosure provide an electronic device, including: at least one processor; a storage device on which at least one program is stored, and when the at least one program is executed by at least one processor, at least one process The device implements the method described in any implementation manner of the first aspect above.

In a fourth aspect, some embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, wherein when the program is executed by a processor, the method described in any implementation manner of the first aspect is implemented.

Description of the drawings

The above and other features, advantages, and aspects of various embodiments of the present disclosure will become more apparent with reference to the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and elements are not necessarily drawn to scale.

Figure 1 is a flowchart of some embodiments of a gesture tracking method according to the present disclosure;

Figure 2 is a network schematic diagram of a joint point confidence network according to the attitude tracking method of the present disclosure;

Figure 3 is a schematic structural diagram of some embodiments of a posture tracking device according to the present disclosure;

Figure 4 is a schematic structural diagram of an electronic device suitable for implementing some embodiments of the present disclosure.

Detailed ways

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of the present disclosure.

It should also be noted that, for convenience of description, only the parts related to the invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.

It should be noted that concepts such as “first” and “second” mentioned in this disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units. Or interdependence.

It should be noted that the modifications of "one" and "plurality" mentioned in this disclosure are illustrative and not restrictive. Those skilled in the art will understand that unless the context clearly indicates otherwise, it should be understood as "one or Multiple”.

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.

Related methods of filtering joint points, such as filtering joint points based on position probability, often have the following technical problems: joint points corresponding to lower position probabilities caused by motion blur will be mistakenly filtered out, resulting in Some pedestrians cannot be accurately tracked, resulting in low accuracy in multi-person attitude tracking tasks.

In order to solve the problems described above, some embodiments of the present disclosure propose attitude tracking methods, devices, electronic devices and computer-readable media, which can improve the accuracy of multi-person attitude tracking tasks.

The present disclosure will be described in detail below in conjunction with embodiments with reference to the accompanying drawings.

Figure 1 illustrates a process 100 of some embodiments of a gesture tracking method according to the present disclosure. The attitude tracking method includes the following steps:

Step 101: Detect pedestrian videos frame by frame to obtain a set of human body bounding boxes.

In some embodiments, the execution subject of the posture tracking method (such as a computing device) can detect the pedestrian video frame by frame to obtain a set of human body bounding boxes. Wherein, the human body boundary frame group in the above human body boundary frame group set corresponds to the pedestrian video frame included in the pedestrian video. The human bounding boxes in the human bounding box group correspond to pedestrians appearing in pedestrian video frames. The pedestrian video may be a video in which at least one pedestrian appears in the recorded scene. In practice, the above execution subject can use the HTC (Hybrid Task Cascade) detector to detect the pedestrian video frame by frame to obtain the human body bounding box group set. From this, the position of the pedestrian in the pedestrian video can be initially obtained.

It should be noted that the above computing device may be hardware or software. When the computing device is hardware, it can be implemented as a distributed cluster composed of multiple servers or terminal devices, or it can be implemented as a single server or a single terminal device. When the computing device is embodied as software, it can be installed in the hardware device listed above. It may be implemented, for example, as multiple software or software modules for providing distributed services, or as a single software or software module. There are no specific limitations here. It should be understood that there may be any number of computing devices depending on implementation needs.

Step 102: Input the human body bounding box set to the joint point confidence network to obtain the human body posture information set.

In some embodiments, the execution subject may input the human body bounding box set into the joint point confidence network to obtain a human body posture information set and a joint point confidence information set. Wherein, the above-mentioned joint point confidence network can be a neural network that takes a human body bounding box set as an input and uses a human body posture information set and a joint point confidence information set as an output. For example, the above-mentioned joint point confidence network can be Hourglass Network. Thus, a set of human posture information representing the human posture can be obtained to facilitate tracking of the human posture.

As shown in Figure 2, in some optional implementations of some embodiments, the above-mentioned joint point confidence network includes a backbone network, a joint point prediction branch and a joint point availability branch, and the above-mentioned joint point availability branch includes a residual network and at least a classifier. The above-mentioned backbone network can be HRNet (High-Resoultion Net, high-resolution network). The above-mentioned joint point prediction branch can be a neural network that takes the feature map set output by the above-mentioned backbone network as input and uses the joint point position probability information set and the joint point position coordinate information set as output. In practice, the above-mentioned joint point prediction branch can perform continuous transposed convolution on the set of feature maps output by the above-mentioned backbone network. Among them, the loss function for training the above joint point prediction branch can be:

Among them, L represents the loss function of the above-mentioned joint point prediction branch. K represents the number of preset joint point types. For example, K can take the value 15. W represents the length of the heat map. H represents the width of the heat map. Wherein, the above-mentioned heat map is a heat map generated during the operation of the above-mentioned joint point prediction branch. k represents the k-th joint point. i represents the i-th column of the heat map. j represents the jth row of the heat map.

Represents the value of the i-th column and j-th row in the heat map matrix of the k-th joint point.

Represents the value of the i-th column and j-th row in the second-order Gaussian distribution label matrix of the k-th joint point. The above-mentioned second-order Gaussian distribution label matrix is obtained by converting the real labels of each preset joint point.

and

are the same size. v _k represents the real label of the preset joint point of the k-th joint point. The above-mentioned preset real labels of joint points may be preset real labels of joint points. The value of the real label of the above-mentioned preset joint point is 0 or 1.

The above-mentioned joint point availability branch may be a neural network that takes the feature map set output by the above-mentioned backbone network as input and uses the joint point availability probability information set as output. Among them, the focal loss function can be used as the loss function for training the above-mentioned joint point availability branch. The loss function can be expressed as follows:

FL(p)=-α _p ×(1-p) ^γ ×log(p).

p=( ^pavl ) ^y ×(1- ^pavl ) ^1-y .

α _p =α ^y ×(1-α) ^1-y .

Among them, y represents the real label of the preset joint point. The default true label of the joint point is 0 or 1. p represents the joint point classification probability. p ^avl represents the joint point availability probability. Wherein, the above joint point availability probability may be a probability output by a classifier included in the joint point availability branch. As shown in the above formula, when the preset true label y of the joint point is 1, the joint point classification probability p is p ^avl . When the preset true label y of the joint point is 0, the joint point classification probability p is 1-p ^avl . FL(p) represents the Focal Floss function. α _p is the weight factor corresponding to the above-mentioned joint point classification probability p, and γ is the focus parameter. Here, the values of other parameters can use the default parameters of the focal loss function. For example, γ takes the value 2. The value of α is 0.25.

In practice, first, the human body bounding boxes included in the above human body bounding box group set can be cropped and scaled. In practice, the above-mentioned execution subject can crop the human body bounding boxes included in the above-mentioned human body bounding box group set from the corresponding pedestrian video frames, and scale each of the cropped human body bounding boxes to a fixed size (for example, 384×288). Then, when the above joint point confidence network processes the cropped and rescaled human body bounding box set, the first step is to input the above cropped and rescaled human body bounding box set into the above backbone network to obtain the feature map Group set. Here, each high-resolution feature output by the last switching unit of the fourth stage of HRNet can be used as the feature map output by the above-mentioned backbone network. In the second step, the above feature map sets can be input into the above joint point prediction branch and the joint point availability branch respectively to obtain the joint point position probability information group set, the joint point position coordinate information group set and the joint point availability probability information group set. In the third step, a joint point confidence information group set can be generated based on the above joint point position probability information group set and the above joint point availability probability information group set. In practice, for joint point position probability information and joint point availability probability information, the product of the joint point position probability and the joint point availability probability corresponding to the same joint point can be determined as the joint point confidence, and the joint point confidence information can be obtained. In the fourth step, the human body posture represented by each feature map in the above feature map set can be determined as human body posture information, and a human body posture information group set can be obtained. The fifth step is to output the above-mentioned joint point confidence information group set and the above-mentioned human posture information group set.

Step 103: Perform matching processing on each human body posture information group in the human body posture information group set according to the human body posture information group set to generate a matching result group and obtain a matching result group set.

In some embodiments, the execution subject may perform matching processing on each human body posture information group in the human body posture information group set according to the human body posture information group set to generate a matching result group and obtain a matching result group set. Wherein, the matching results in the above matching result group include at least one video frame number corresponding to the human body bounding box of the above matching result. The above video frame number may be the frame number of each pedestrian video frame in the pedestrian video. In practice, the Hungarian algorithm can be used to perform matching processing on each human posture information group in the above human posture information group set. From this, the matching results corresponding to each human body posture information can be obtained.

In some optional implementations of some embodiments, first, the above-mentioned execution subject can perform the following operations for the human body posture information groups corresponding to two adjacent pedestrian video frames in the above-mentioned human posture information group set:

In the first step, the human body posture information group corresponding to the previous pedestrian video frame among the two adjacent pedestrian video frames is determined as the first human body posture information group.

In the second step, the human body posture information group corresponding to the latter pedestrian video frame of the two adjacent pedestrian video frames is determined as the second human body posture information group.

The third step is to determine the distance between each first human body posture information in the above-mentioned first human body posture information group and each second human body posture information in the above-mentioned second human body posture information group, and obtain a distance set. Among them, the above distance can be IoU (Intersection over union) distance.

The fourth step is to perform allocation processing on the first human body posture information in the first human body posture information group and the second human posture information in the second human body posture information group according to the above distance set, and obtain the adjacent video frame matching result. . In practice, the Hungarian algorithm can be used to allocate the first human posture information in the first human posture information group and the second human posture information in the second human posture information group based on the above distance set to obtain adjacent videos. Frame matching results.

Then, a matching result group set can be generated based on the obtained matching results of each adjacent video frame. In practice, the video frame numbers corresponding to each successfully matched second human body posture information can be combined into a matching result, thereby obtaining a matching result set. As a result, inter-frame matching of human postures can be achieved.

Step 104: In response to determining that the matching result group set contains a matching result that satisfies the preset video frame number condition, generate a human body optical flow frame group set based on each matching result group corresponding to the matching result that satisfies the preset video frame number condition.

In some embodiments, the above-mentioned execution subject may respond to determining that there are matching results that satisfy the preset video frame number condition in the above-mentioned matching result group set, and according to each matching result group corresponding to the matching result that satisfies the above-mentioned preset video frame number condition, Generate human body optical flow frame group set. Wherein, the above-mentioned preset video frame number condition is that at least one video frame number included in the matching result does not include the next video frame number. The above-mentioned next video frame number is the video frame number of the next video frame of the video frame number corresponding to the above-mentioned matching result in the above-mentioned at least one video frame number. In practice, various methods can be used to generate human body optical flow frame sets. From this, a set of human optical flow frames representing the predicted human posture can be obtained.

In some optional implementations of some embodiments, in the first step, the execution subject may determine the set of video frames to be processed based on each matching result group corresponding to the matching result that satisfies the above preset video frame number condition. In practice, the video frames corresponding to each matching result group corresponding to the matching results that meet the above preset video frame number conditions can be determined as the video frames to be processed. In the second step, the following sub-steps can be performed based on each pair of two adjacent video frames to be processed in the above set of video frames to be processed:

The first sub-step is to generate an optical flow graph matrix based on the two adjacent video frames to be processed. Wherein, the above-mentioned optical flow map matrix includes a set of pixel offsets. In practice, based on the above two adjacent video frames to be processed, the optical flow method can be used to generate an optical flow graph matrix.

The second sub-step is to select the human body posture information group corresponding to the two adjacent video frames to be processed from the human body posture information group set as the first human posture information group to obtain the first human posture information group set.

The third sub-step is to select the first human body posture information corresponding to the matching result that satisfies the above preset video frame number condition from the above first human body posture information group set as the second human body posture information to obtain at least one second human body posture information.

The fourth sub-step is: for each second human body posture information in the at least one second human body posture information, select the pixel offset of the corresponding pixel from the above pixel offset set within the range corresponding to the second human body posture information. The shift amount is used as the target pixel offset to obtain a set of target pixel offsets corresponding to the above-mentioned second human body posture information. Wherein, the above-mentioned pixel may be an offset pixel among the two pixels corresponding to the above-mentioned pixel offset amount.

The fifth sub-step is to generate an optical flow mask map based on the above target pixel offset set. In practice, the target pixel offsets in the above target pixel offset set can be constructed as a matrix to obtain the optical flow mask map.

The sixth sub-step is to determine the minimum circumscribed rectangle of the above-mentioned optical flow mask image as the human body optical flow frame. Wherein, the human body optical flow frame corresponds to the latter video frame to be processed among the two adjacent video frames to be processed. From this, the human body optical flow frame that represents the prediction of the next video frame to be processed can be obtained.

Step 105: Input the human body optical flow frame set to the joint point confidence network to obtain the optical flow human posture information set and the optical flow joint point confidence information set.

In some embodiments, the execution subject may input the human body optical flow frame set to the joint point confidence network to obtain an optical flow human posture information set and an optical flow joint point confidence information set. Thus, the optical flow human posture information group set representing the human posture of the human optical flow frame set in the human optical flow frame group set can be obtained.

Step 106: Filter the optical flow human posture information group based on the optical flow joint point confidence information group to obtain the target human posture information group.

In some embodiments, the execution subject may filter the optical flow human posture information group set based on the optical flow joint point confidence information group set to obtain the target human posture information group set. In practice, the joint point confidence threshold condition can be set, and in response to the presence of joint point confidence information that satisfies the set joint point confidence threshold condition in the above optical flow joint point confidence information group, the corresponding optical flow human body posture Information deleted. For example, the above joint point confidence threshold condition may be: the optical flow joint point confidence included in the optical flow joint point confidence information is less than the preset confidence threshold. Thus, it is possible to filter the optical flow human posture information set, thereby obtaining the corresponding target human posture information set with higher confidence.

In some optional implementations of some embodiments, first, for each optical flow joint point confidence information included in the above-mentioned optical flow joint point confidence information group set, the above-mentioned execution subject can calculate the above-mentioned optical flow joint point confidence level according to the above-mentioned optical flow joint point confidence level. information to generate average optical flow joint node confidence. In practice, the average value of the optical flow joint point confidence included in the optical flow joint point confidence information may be determined as the average optical flow joint point confidence. Secondly, the average optical flow joint point confidence that is smaller than the preset optical flow threshold can be selected from the obtained average optical flow joint point confidence as the first confidence to be filtered. Finally, the obtained optical flow human body posture information corresponding to the first confidence level to be filtered can be deleted from the above optical flow human body posture information set. From this, the corresponding optical flow human posture information with higher confidence can be obtained.

Optionally, first, the above execution subject can determine the video frame number corresponding to each optical flow human posture information group in the deleted optical flow human posture information group set as the target video frame number, and obtain the target video frame number set. Secondly, a human body posture information group whose corresponding video frame number is within the above target video frame number set can be selected from the above human body posture information group set as a comparison human body posture information group to obtain a comparison human posture information group set. Then, a posture overlap degree group set can be generated based on the deleted processed optical flow human posture information group set and the above-mentioned comparison human posture information group set. In practice, the following formula can be used to generate attitude overlap:

Among them, δ represents a function that converts Boolean type results into 0 and 1. Among them, δ converts the Boolean type result to 1 when the conditions in the brackets are met. δ converts the Boolean result to 0 when the condition within the brackets is not met. p and q respectively represent the two pedestrians corresponding to the optical flow human posture information and the control human posture information. IOU _{p, q} represents the IoU value (Intersection over Union) of pedestrian p and pedestrian q. The IoU value can be determined in advance. k represents the k-th joint point.

Represents the Euclidean distance between the k-th joint point of pedestrian p and the k-th joint point of pedestrian q. s ² represents the scale factor of pedestrian p and pedestrian q. Among them, the scale factor can be determined by the sum of the square roots of the human body bounding box areas corresponding to the optical flow human posture information and the control human posture information.

Represents the normalization factor of the k-th joint point. This normalization factor can be determined in advance. v _pk represents the confidence of the k-th joint point of pedestrian p. v _qk represents the confidence of the k-th joint point of pedestrian q. ε is a preset parameter.

Afterwards, the posture overlap degree greater than the preset posture overlap degree threshold can be selected from the above-mentioned posture overlap degree group set as the second confidence level to be filtered. Finally, the optical flow human posture information corresponding to the second confidence level to be filtered can be deleted from the above optical flow human posture information set.

The above content, as an inventive point of the embodiments of the present disclosure, solves the second technical problem mentioned in the background art: "In the scene where the human body is interlaced and blocked, for the location probability corresponding to a high value due to being erroneously marked on someone else, Joint points cannot be filtered, resulting in the inability to accurately track some pedestrians, resulting in low accuracy in multi-person posture tracking tasks." The factors that lead to the low accuracy of multi-person pose tracking tasks are as follows: In the scene where human bodies are intertwined and blocked, due to the high value of the corresponding position probability due to incorrect marking on others, the intersection and union ratio is determined using the IoU measurement method. The resulting similarity is low and cannot be filtered; in addition, because the joint points corresponding to the two boxes are different, the similarity is also low and cannot be filtered. If the above factors are solved, the effect of putting the target positioning information and target classification information of the above targets into actual production can be achieved. In order to achieve this effect, the present disclosure filters joint points by adopting the above-mentioned method of using posture overlap. Therefore, by adopting the above method of using posture overlap, the overlapping optical flow human posture information can be effectively filtered. This can achieve accurate tracking of some pedestrians, thereby improving the accuracy of multi-person posture tracking tasks.

Optionally, first, the execution subject may determine the matching results that satisfy the preset single video frame number condition in the matching result group set as target matching results to obtain a target matching result set. The above preset single video frame number condition can be that the matching result only includes one video frame number. Secondly, the human body posture information corresponding to the target matching results included in the above target matching result set can be selected from the above human body posture information group set as the human body posture information to be integrated, thereby obtaining the human body posture information set to be integrated. Then, the above-mentioned human body posture information set to be integrated can be input into the preset pedestrian feature model to obtain a pedestrian feature information set to be integrated. The above-mentioned preset pedestrian feature model may be a neural network model that takes a human body posture information set to be integrated as input and a pedestrian feature information set to be integrated as an output. The above-mentioned preset pedestrian feature model may include a backbone network, a pedestrian feature extraction module, and a classifier. The above-mentioned backbone network may be HRNet. The above pedestrian feature extraction module may include a convolution layer, an average pooling layer and a batch normalization layer. The above classifier can include a fully connected layer and a Softmax layer. It should be noted that the above classifier is only used for training the above preset pedestrian feature model, and is not used when the above preset pedestrian feature model is actually used. Thereafter, pedestrian feature similarity information for each pedestrian feature information to be integrated in the set of pedestrian feature information to be integrated can be generated based on the set of pedestrian feature information to be integrated, to obtain a set of pedestrian feature similarity information. In practice, the similarity between each pedestrian feature information to be integrated and other pedestrian feature information to be integrated in the pedestrian feature information set to be integrated can be determined to obtain a pedestrian feature similarity information set. Among them, the formula for determining the similarity is as follows:

Among them, p and q respectively represent the two pedestrians corresponding to the two pedestrian feature information to be integrated. S(p, q) represents the pedestrian feature similarity. D represents the dimension of the feature vector represented by the pedestrian feature information to be integrated. k represents the k-th joint point.

Represents the feature vector of the k-th joint point of pedestrian p.

Represents the feature vector of the k-th joint point of pedestrian q.

Secondly, the pedestrian feature similarity information that satisfies the preset similarity conditions in the above pedestrian feature similarity information set can be determined as the target similarity information, and a target similarity information set is obtained. The above preset similarity condition may be that the pedestrian feature similarity in the above pedestrian feature similarity information is greater than the preset pedestrian feature similarity threshold. Then, the matching result group set can be updated according to the target similarity information set. In practice, the video frame number corresponding to each target similarity information in the above target similarity information set can be recorded in the corresponding matching result. Finally, for each target similarity information in the target similarity information set, the pedestrian feature similarity information corresponding to the target similarity information can be deleted from the pedestrian feature similarity information set. As a result, the human posture information whose matching result only has one video frame number can be matched again, so that the human posture information that has not been successfully matched due to the detector performance can be successfully matched, thereby improving the accuracy of multi-person posture tracking tasks.

The above-mentioned embodiments of the present disclosure have the following beneficial effects: through the posture tracking methods of some embodiments of the present disclosure, the accuracy of multi-person posture tracking tasks can be improved. Specifically, the reason for the low accuracy of the multi-person attitude tracking task is that the corresponding low position probabilities caused by motion blur will be mistakenly filtered out, resulting in the inability to accurately track some pedestrians, resulting in the multi-person attitude tracking task. The accuracy is lower. Based on this, the posture tracking method of some embodiments of the present disclosure first detects the pedestrian video frame by frame to obtain a human body bounding box set. From this, the position of the pedestrian in the pedestrian video can be initially obtained. Then, the above human body bounding box set is input into the joint point confidence network to obtain the human body posture information set. Thus, a set of human posture information representing the human posture can be obtained to facilitate tracking of the human posture. Secondly, according to the above-mentioned human posture information group set, matching processing is performed on each human body posture information group in the above-mentioned human posture information group set to generate a matching result group and obtain a matching result group set. From this, the matching results corresponding to each human body posture information can be obtained. Then, in response to determining that there is a matching result that satisfies the preset video frame number condition in the above matching result group set, a human body optical flow frame group set is generated based on each matching result group corresponding to the matching result that satisfies the above preset video frame number condition. From this, a set of human optical flow frames representing the predicted human posture can be obtained. After that, the above-mentioned human body optical flow frame group set is input to the above-mentioned joint point confidence network, and an optical flow human body posture information group set and an optical flow joint node confidence information group set are obtained. Thus, the optical flow human posture information group set representing the human posture of the human optical flow frame set in the human optical flow frame group set can be obtained. Finally, based on the above-mentioned optical flow joint point confidence information group set, the above-mentioned optical flow human posture information group set is filtered to obtain the target human body posture information group set. Thus, it is possible to filter the optical flow human posture information set, thereby obtaining the corresponding target human posture information set with higher confidence. Because the optical flow joint point confidence information group is used to filter the optical flow human posture information group, the confidence level corresponding to the target human posture information group obtained is relatively high. For joint points that are blurred due to motion, when the corresponding When the confidence level is high, the joint points will also be retained, thereby avoiding the false filtering out of joint points corresponding to lower position probabilities caused by motion blur, and thus the above-mentioned joint points of the pedestrian can be accurately tracked. , which can improve the accuracy of multi-person posture tracking tasks.

Continuing to refer to Figure 3, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a posture tracking device. These device embodiments correspond to those method embodiments shown in Figure 3. The device is specifically Can be used in various electronic devices.

As shown in FIG. 3 , the posture tracking device 300 of some embodiments includes: a detection unit 301 , a first input unit 302 , a matching unit 303 , a generation unit 304 , a second input unit 305 and a filtering unit 306 . Wherein, the detection unit 301 is configured to detect the pedestrian video frame by frame to obtain a human body boundary frame group set, wherein the human body boundary frame group in the human body boundary frame group set corresponds to the pedestrian video frame included in the pedestrian video; the first input unit 302 is configured to input the above-mentioned human body bounding box group set to the joint point confidence network to obtain a human body posture information group set; the matching unit 303 is configured to, according to the above-mentioned human body posture information group set, match each person in the above-mentioned human body posture information group set. The body posture information group is subjected to matching processing to generate a matching result group and obtain a matching result group set, wherein the matching results in the above-mentioned matching result group include at least one video frame number corresponding to the human body bounding box of the above-mentioned matching result; the generation unit 304 is configured to generate a human body optical flow frame group set according to each matching result group corresponding to the matching result that satisfies the above preset video frame number condition in response to determining that there is a matching result that satisfies the preset video frame number condition in the above matching result group set, Wherein, the above-mentioned preset video frame number condition is that at least one video frame number included in the matching result does not include the next video frame number, and the above-mentioned next video frame number is the video frame number corresponding to the above-mentioned matching result in the above-mentioned at least one video frame number. The video frame number of the next frame of video; the second input unit 305 is configured to input the above-mentioned human body optical flow frame set to the above-mentioned joint point confidence network to obtain the optical flow human body posture information set and the optical flow joint point confidence information Group set; the filtering unit 306 is configured to filter the above-mentioned optical flow human posture information group set based on the above-mentioned optical flow joint point confidence information group set to obtain the target human body posture information group set.

It can be understood that the units recorded in the device 300 correspond to various steps in the method described with reference to FIG. 3 . Therefore, the operations, features and beneficial effects described above for the method are also applicable to the device 300 and the units included therein, and will not be described again here.

Referring now to FIG. 4 , a schematic structural diagram of an electronic device (eg, computing device) 400 suitable for implementing some embodiments of the present disclosure is shown. The electronic device shown in FIG. 4 is only an example and should not bring any limitations to the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 4, the electronic device 400 may include a processing device (eg, central processing unit, graphics processor, etc.) 401, which may be loaded into a random access device according to a program stored in a read-only memory (ROM) 402 or from a storage device 408. The program in the memory (RAM) 403 executes various appropriate actions and processes. In the RAM 403, various programs and data required for the operation of the electronic device 400 are also stored. The processing device 401, ROM 402 and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; including, for example, a liquid crystal display (LCD), speakers, vibration An output device 407 such as a computer; a storage device 408 including a magnetic tape, a hard disk, etc.; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or wiredly with other devices to exchange data. Although FIG. 4 illustrates electronic device 400 with various means, it should be understood that implementation or availability of all illustrated means is not required. More or fewer means may alternatively be implemented or provided. Each block shown in Figure 4 may represent one device, or may represent multiple devices as needed.

In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as a computer software program. For example, some embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart. In some such embodiments, the computer program may be downloaded and installed from the network via communication device 409, or from storage device 408, or from ROM 402. When the computer program is executed by the processing device 401, the above-described functions defined in the methods of some embodiments of the present disclosure are performed.

It should be noted that the computer-readable medium recorded in some embodiments of the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer readable storage media may include, but are not limited to: an electrical connection having at least one conductor, a portable computer disk, a hard disk, random access memory (RAM), read only memory (ROM), erasable programmable memory Read memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above. In some embodiments of the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device . Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.

In some embodiments, the client and server can communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and can communicate with digital data in any form or medium. Communications (e.g., communications network) interconnections. Examples of communications networks include local area networks ("LAN"), wide area networks ("WAN"), the Internet (e.g., the Internet), and end-to-end networks (e.g., ad hoc end-to-end networks), as well as any currently known or developed in the future network of.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device. The above-mentioned computer-readable medium carries one or more programs. When the above-mentioned one or more programs are executed by the electronic device, the electronic device causes the electronic device to: detect the pedestrian video frame by frame and obtain the human body bounding box group set, wherein the above-mentioned The human body bounding box group set in the human body bounding box group set corresponds to the pedestrian video frame included in the pedestrian video; the above human body bounding box group set is input to the joint point confidence network to obtain the human body posture information group set; according to the above human body posture information group set, Perform matching processing on each human body posture information group in the above human body posture information group set to generate a matching result group and obtain a matching result group set, wherein the matching results in the above matching result group include the human body bounding box corresponding to the above matching result. At least one video frame number; in response to determining that there is a matching result that satisfies the preset video frame number condition in the above matching result group set, generate human body optical flow according to each matching result group corresponding to the matching result that satisfies the above preset video frame number condition Frame group set, wherein the above-mentioned preset video frame number condition is that at least one video frame number included in the matching result does not contain the next video frame number, and the above-mentioned next video frame number is the above-mentioned matching result in the above-mentioned at least one video frame number. The video frame number of the next video frame of the video frame number; input the above-mentioned human body optical flow frame group set to the above-mentioned joint point confidence network to obtain the optical flow human body posture information group set and the optical flow joint point confidence information group set; based on The above-mentioned optical flow joint point confidence information group set is used to filter the above-mentioned optical flow human posture information group set to obtain the target human body posture information group set.

Computer program code for performing the operations of some embodiments of the present disclosure may be written in one or more programming languages, including object-oriented programming languages—such as Java, Smalltalk, C++, or a combination thereof, Also included are conventional procedural programming languages—such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In situations involving remote computers, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (such as an Internet service provider). connected via the Internet).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains at least one operable function for implementing the specified logical function. Execute instructions. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved. It will also be noted that each block of the block diagram and/or flowchart illustration, and combinations of blocks in the block diagram and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or operations. , or can be implemented using a combination of specialized hardware and computer instructions.

The units described in some embodiments of the present disclosure may be implemented in software or hardware. The described unit may also be provided in a processor, for example, it may be described as follows: a processor includes a detection unit, a first input unit, a matching unit, a generation unit, a second input unit and a filtering unit. Among them, the names of these units do not constitute a limitation on the unit itself under certain circumstances. For example, the first input unit can also be described as “input the above human body bounding box set into the joint point confidence network, and obtain the human body Unit of attitude information grouping".

The functions described above herein may be performed, at least in part, by at least one hardware logic component. For example, and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical device (CPLD) and so on.

The above description is only an illustration of some preferred embodiments of the present disclosure and the technical principles applied. Persons skilled in the art should understand that the scope of the invention involved in the embodiments of the present disclosure is not limited to technical solutions composed of specific combinations of the above technical features, and should also cover the above-mentioned technical solutions without departing from the above-mentioned inventive concept. Other technical solutions formed by any combination of technical features or their equivalent features. For example, a technical solution is formed by replacing the above features with technical features with similar functions disclosed in the embodiments of the present disclosure (but not limited to).

Claims

A posture tracking method, including:

Perform frame-by-frame detection on the pedestrian video to obtain a human body bounding box group set, wherein the human body bounding box group in the human body bounding box group set corresponds to the pedestrian video frame included in the pedestrian video;

Input the human body bounding box set into the joint point confidence network to obtain the human body posture information set;

According to the human body posture information group set, matching processing is performed on each human body posture information group in the human body posture information group set to generate a matching result group, and obtain a matching result group set, wherein the matching in the matching result group The result includes at least one video frame number corresponding to the human body bounding box of the matching result;

In response to determining that the matching result group set contains a matching result that satisfies the preset video frame number condition, generating a human body optical flow frame group set according to each matching result group corresponding to the matching result that satisfies the preset video frame number condition, Wherein, the preset video frame number condition is that at least one video frame number included in the matching result does not include the next video frame number, and the next video frame number is the matching result corresponding to the at least one video frame number. The video frame number of the next video frame of the video frame number;

Input the human body optical flow frame group set into the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set;

Based on the optical flow joint point confidence information group, the optical flow human posture information group is filtered to obtain the target human body posture information group.
The method according to claim 1, wherein generating a human body optical flow frame group set includes:

Determine the set of video frames to be processed according to each matching result group corresponding to the matching result that satisfies the preset video frame number condition;

According to each pair of two adjacent video frames to be processed in the set of video frames to be processed, perform the following steps:

Generate an optical flow map matrix based on the two adjacent video frames to be processed, where the optical flow map matrix includes a set of pixel offsets;

Select the human body posture information group corresponding to the two adjacent video frames to be processed from the human body posture information group set as the first human body posture information group to obtain the first human body posture information group set;

Select the first human body posture information corresponding to the matching result that satisfies the preset video frame number condition from the first human body posture information group set as the second human body posture information to obtain at least one second human body posture information;

For each second human body posture information in the at least one second human body posture information, select a pixel offset of the corresponding pixel within the range corresponding to the second human posture information from the pixel offset set. As the target pixel offset, a set of target pixel offsets corresponding to the second human body posture information is obtained;

Generate an optical flow mask map according to the target pixel offset set;

The minimum circumscribed rectangle of the optical flow mask is determined as the human body optical flow frame.
The method according to claim 1 or 2, wherein filtering the optical flow human posture information set includes:

For each optical flow joint point confidence information included in the optical flow joint point confidence information group set, generate an average optical flow joint point confidence based on the optical flow joint point confidence information;

Select the average optical flow joint point confidence that is less than the preset optical flow threshold from the obtained average optical flow joint point confidence as the first confidence to be filtered;

The optical flow human body posture information corresponding to the obtained first confidence level to be filtered is deleted from the optical flow human body posture information set.
The method according to one of claims 1-3, wherein the joint point confidence network includes a backbone network, a joint point prediction branch and a joint point availability branch, and the joint point availability branch includes a residual network and at least one classification device.
The method according to one of claims 1-4, wherein the method further includes:

Determine the matching results in the matching result group that meet the preset single video frame number condition as the target matching results to obtain a target matching result set;

Select the human body posture information corresponding to the target matching results included in the target matching result set from the human body posture information group set as the human body posture information to be integrated, and obtain the human body posture information set to be integrated;

Input the human body posture information set to be integrated into the preset pedestrian feature model to obtain the pedestrian feature information set to be integrated;

According to the pedestrian feature information set to be integrated, pedestrian feature similarity information for each pedestrian feature information to be integrated in the pedestrian feature information set to be integrated is generated, and a pedestrian feature similarity information set is obtained;

Determine the pedestrian feature similarity information that satisfies the preset similarity conditions in the pedestrian feature similarity information set as the target similarity information, and obtain the target similarity information set;

Update the matching result group set according to the target similarity information set;

For each target similarity information in the target similarity information set, pedestrian feature similarity information corresponding to the target similarity information is deleted from the pedestrian feature similarity information set.
The method according to any one of claims 1 to 5, wherein the matching process on the human body posture information group set includes:

For the human body posture information group corresponding to two adjacent pedestrian video frames in the human body posture information group set, perform the following operations:

Determine the human body posture information group corresponding to the previous pedestrian video frame in the two adjacent pedestrian video frames as the first human body posture information group;

Determine the human body posture information group corresponding to the latter of the two adjacent pedestrian video frames as the second human body posture information group;

Determine the distance between each first human body posture information in the first human body posture information group and each second human body posture information in the second human body posture information group to obtain a distance set;

According to the distance set, perform allocation processing on the first human body posture information in the first human body posture information group and the second human body posture information in the second human body posture information group to obtain adjacent video frame matching results;

According to the obtained matching results of each adjacent video frame, a matching result group set is generated.
An attitude tracking device, including:

a detection unit configured to detect the pedestrian video frame by frame to obtain a human body bounding box group set, wherein the human body bounding box group set corresponds to the pedestrian video frame included in the pedestrian video;

The first input unit is configured to input the human body bounding box set to the joint point confidence network to obtain the human body posture information set;

The matching unit is configured to perform matching processing on each human body posture information group in the human body posture information group set according to the human body posture information group set to generate a matching result group and obtain a matching result group set, wherein: The matching results in the matching result group include at least one video frame number corresponding to the human body bounding box of the matching result;

a generating unit configured to generate a human body according to each matching result group corresponding to the matching result that satisfies the preset video frame number condition in response to determining that a matching result satisfying a preset video frame number condition exists in the matching result group set. Optical flow frame group set, wherein the preset video frame number condition is that at least one video frame number included in the matching result does not include the next video frame number, and the next video frame number is the at least one video frame number. The video frame number of the next frame of video corresponding to the video frame number corresponding to the matching result;

The second input unit is configured to input the human body optical flow frame group set to the joint point confidence network to obtain an optical flow human body posture information group set and an optical flow joint point confidence information group set;

The filtering unit is configured to filter the optical flow human posture information group based on the optical flow joint point confidence information group to obtain a target human posture information group.
An electronic device including:

at least one processor;

a storage device having at least one program stored thereon;

When the at least one program is executed by the at least one processor, the at least one processor implements the method according to any one of claims 1-6.
A computer-readable medium with a computer program stored thereon, wherein when the program is executed by a processor, the method according to any one of claims 1-6 is implemented.