CN111881887A

CN111881887A - Multi-camera-based motion attitude monitoring and guiding method and device

Info

Publication number: CN111881887A
Application number: CN202010853582.3A
Authority: CN
Inventors: 董秀园
Original assignee: Individual
Current assignee: Individual
Priority date: 2020-08-21
Filing date: 2020-08-21
Publication date: 2020-11-03

Abstract

The disclosure relates to a multi-camera based motion gesture monitoring and guidance method and device. The method comprises the following steps: acquiring moving images and/or video data of at least one target sporter through a plurality of cameras; identifying a target sporter from the moving image and/or video data through a gesture identification algorithm, and outputting a human body posture graph required by motion gesture monitoring; reconstructing human body three-dimensional posture information of a target sporter through a three-dimensional reconstruction algorithm based on a human body posture graph; carrying out bone registration on the target sporter and the reference person by utilizing a body state key node in a three-dimensional space; comparing the human body three-dimensional posture information of the target sporter and the reference person registered by the skeleton at a certain moment or within a preset time period; evaluating the action completion degree and quality of the target sporter based on the comparison result; and providing feedback to the target athlete based on the evaluation, wherein the feedback includes whether the action is up to standard and/or an athletic optimization recommendation.

Description

Multi-camera-based motion attitude monitoring and guiding method and device

Technical Field

The present disclosure relates generally to the field of image processing, and in particular to a multi-camera based motion gesture monitoring and guidance method and apparatus.

Background

In recent years, with the increasing income level and health awareness of the people, people are increasingly engaged in sports to improve physical fitness. Traditional athletic coaching methods often rely on the observation of a sports coach who then gives a targeted coaching program. However, at present, the level of the sports coaches is uneven, the number of the professional coaches is short and the price is high, and the requirements of low-consumption people on fitness rehabilitation and sports cannot be met continuously. However, many sports beginners may be injured by exercising by mistake, or may be unaware of how to improve their performance due to lack of instruction, or may not be able to engage in more specialized athletic activities without engaging in a specialized trainer.

Systems related to motion posture detection are emerging to meet the demands of people for fitness rehabilitation, low cost and high quality of sports. In the prior art, detection is mostly performed based on single video or two-dimensional motion video information shot by a single camera, but the method cannot accurately show the action posture of a person in a three-dimensional space, so that scientific guidance cannot be given. In addition, most of the video-dependent methods need to be compared with a certain normative action, and then the user cannot select favorite professional coaches for remote synchronous fitness aiming at different people or select a fitness system suitable for the current self condition. For example, beginners with little body-building experience or athletes with ill health can not adapt to fixed and mechanical movements.

Disclosure of Invention

In view of the above technical problems, the present disclosure provides a method and an apparatus for monitoring and guiding a motion gesture based on multiple cameras. The method and the device shoot video images of the target sporter based on multiple cameras, reconstruct three-dimensional information through videos and images acquired from multiple visual angles, monitor the movement of the sporter in an all-around manner, monitor the movement posture by utilizing a deep learning neural network and combining a traditional machine learning technology, and feed correction information back to the target sporter through comparison.

In one aspect of the present disclosure, a multi-camera based motion gesture monitoring and guidance method is provided, comprising the steps of: acquiring moving images and/or video data of at least one target sporter through a plurality of cameras; identifying the target sporter from the moving image and/or video data through a posture identification algorithm, and outputting a human body posture graph required by movement posture monitoring; reconstructing human body three-dimensional posture information of the target sporter through a three-dimensional reconstruction algorithm based on the human body posture graph; carrying out bone registration on the target sporter and a reference person by utilizing a posture key node in a three-dimensional space; comparing the human body three-dimensional posture information of the target sporter and the reference person which are subjected to bone registration at a certain moment or within a preset time period; evaluating the action completion degree and quality of the target sporter based on the comparison result; and providing feedback to the target athlete based on the assessment, wherein the feedback comprises whether the action is up to standard and/or an exercise optimization suggestion.

In a preferred embodiment, the camera may include at least one of: a planar camera, a depth camera, an infrared camera or a thermal imager, wherein the depth camera may comprise at least one of: a time-of-flight camera, a structured light camera, or a binocular camera.

In another preferred embodiment, the outputting the body state diagram required for motion gesture monitoring further includes: determining the posture of the human body through key nodes of the human body, wherein the key nodes comprise at least one of the following: limb joint points and facial key points, and the position information of the key nodes is represented by coordinates; determining the position coordinates of at least one key node in the moving image or video; determining category information of at least one of the key nodes, wherein the category information includes: body feature information of interest, the body feature information of interest comprising: key characteristic points of human body parts required by human body monitoring tasks and human body biomechanical model analysis aiming at different applications; determining state information of at least one of the key nodes, wherein the state information comprises: visible, invisible, and either speculative or non-speculative; and linking the key nodes into the human body posture graph through the position relation and the reliability among the key nodes, and judging the action through the change of the human body posture graph.

In a further preferred embodiment, the gesture recognition algorithm comprises a deep learning neural network prediction algorithm, wherein the deep learning neural network requires training, the training comprising: preparing a human body posture image set, wherein human body posture image data in the human body posture image set is marked according to the key nodes; and training a deep learning model by using the human body posture image set, and updating parameters of the deep learning neural network through error back propagation until convergence to obtain the deep learning neural network which is completely trained.

In another preferred embodiment, the reconstructing the human three-dimensional pose information of the target moving person by the three-dimensional reconstruction algorithm based on the human body state diagram further comprises: acquiring shooting parameters of the plurality of cameras, and establishing a three-dimensional space coordinate system according to the shooting parameters, wherein the shooting parameters comprise at least one of the following: the orientation, angle and viewing angle of the camera, and the focal length.

In another preferred embodiment, the reconstructing the human three-dimensional pose information of the target moving person by the three-dimensional reconstruction algorithm based on the human body state diagram further comprises: in the case of a single depth camera, reconstructing the human three-dimensional pose information of the target user by converting the human body pose image generated from the depth image acquired by the depth camera into a three-dimensional point cloud image; in the case of a combination of a plane camera and a depth camera, processing the human body posture image generated by a plane image acquired by the plane camera and a three-dimensional point cloud image converted from a depth image acquired by the depth camera to reconstruct the human body three-dimensional posture information of the target sporter; or in the case of the combination of multi-view image data acquisition devices, the human body three-dimensional posture information of the target sporter is reconstructed by projecting the human body posture image into the three-dimensional space coordinate system.

In another preferred embodiment, the bone registration of the target moving person with the reference person using the body state key nodes in the three-dimensional space further includes global bone scaling and local bone scaling, wherein the global bone scaling refers to registration of coordinate sets of key nodes for the whole human body, and the local bone scaling refers to registration of coordinates of local key nodes in the key nodes for the human body, including: calculating a bone length of the target actor and the reference, wherein the bone length is a distance between the location coordinates of the key nodes linked together, wherein the distance comprises at least one of: euclidean distance, standardized Euclidean distance, Mahalanobis distance, cosine distance; performing bone registration on the bone length of the target sporter according to the corresponding bone length of the reference person; or carrying out bone registration on the bone length of the reference person according to the corresponding bone length of the target sporter.

In a further preferred embodiment, comparing the human three-dimensional pose information of the skeletal registered target player and the reference player over a predetermined continuous period of time further comprises at least one of: comparing the distances of the key nodes of the target sporter and the corresponding key nodes of the reference person on the three-dimensional space by calculating, wherein the larger the distance is, the larger the action gap is; calculating distances between a plurality of key nodes of the target sporter and a plurality of corresponding key nodes of the reference person on the three-dimensional space and averaging the distances for comparison, wherein the larger the average value is, the larger the action gap is; comparing the included angle between the line segment formed by the linked key nodes of the target sporter and the line segment formed by the corresponding linked key nodes of the reference person, wherein the larger the included angle is, the larger the action difference is; and extracting features of the target sporter and the human body three-dimensional posture of the reference person, which change in space-time, and comparing the extracted features of the target sporter with corresponding features of the reference person, wherein the features of the target sporter change in space-time include a movement direction and a movement speed, and the difference between the features and the corresponding features is larger, which indicates that the action gap is larger.

In a further preferred embodiment, the evaluating the action completion degree and quality of the target exerciser based on the comparison result further comprises: and evaluating the action completion degree and quality of the target sporter by assigning a weight to a comparison result obtained by one or more comparison modes.

In another aspect of the present disclosure, there is provided a multi-camera based motion gesture monitoring and guidance apparatus, including: a motion capture module configured to acquire moving images or video data of at least one target player from a plurality of shooting angles through a plurality of cameras; a gesture recognition module configured to recognize the target sporter from the moving image or video data through a gesture recognition algorithm and output a human body posture map required for motion gesture monitoring; a three-dimensional reconstruction module configured to reconstruct human three-dimensional pose information of the target actor through a three-dimensional reconstruction algorithm based on the human body posture map; a bone registration module configured to bone register the target actor with a reference actor utilizing a posture key node in three-dimensional space; a pose comparison module configured to compare the human three-dimensional pose information of the target player and the reference player registered via bones over a predetermined continuous period of time; a pose evaluation module configured to evaluate the degree and quality of action completion of the target athlete based on the comparison; and a motion feedback module configured to provide feedback to the target athlete based on the assessment, wherein the feedback includes whether the action is up to standard and/or a motion optimization suggestion.

In addition, the multi-camera based motion gesture monitoring and guidance device can realize the multi-camera based motion gesture monitoring and guidance method as described above.

Compared with the prior art, the beneficial effects of the disclosure are: the method and the device shoot video images of the target sporter based on multiple cameras, reconstruct three-dimensional information through videos and images acquired from multiple visual angles, monitor the movement of the sporter in an all-around manner, monitor the movement posture by utilizing a deep learning neural network and combining a traditional machine learning technology, and feed correction information back to the target sporter through comparison, so that the development of a body posture intelligent monitoring and feedback system in the market of the movement health industry is promoted, and the requirements of people on low cost and high quality of fitness rehabilitation and sports are met.

Drawings

The novel features believed characteristic of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following drawings and detailed description that set forth illustrative embodiments, in which the principles of the invention are utilized. The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, in the drawings, wherein like reference numerals refer to like elements throughout:

FIG. 1 illustrates a flow chart of a multi-camera based motion gesture monitoring and guidance method according to an exemplary embodiment of the present disclosure;

FIG. 2 illustrates one arrangement of cameras according to an exemplary embodiment of the present disclosure;

FIG. 3 illustrates a model application flow diagram in accordance with an exemplary embodiment of the present disclosure;

FIG. 4 illustrates a training flow diagram of a deep learning network model according to an exemplary embodiment of the present disclosure;

FIG. 5 illustrates an example of a body key feature point marker template according to an exemplary embodiment of the present disclosure;

FIG. 6 shows a schematic diagram of body key feature points used for curl gesture analysis in accordance with an exemplary embodiment of the present disclosure;

FIG. 7 illustrates an example of a human pose three-dimensional reconstruction flow according to an exemplary embodiment of the present disclosure;

FIG. 8 illustrates one example of a spatial transform neural network in accordance with an exemplary embodiment of the present disclosure;

9(a) -9 (c) illustrate three different measures of geometric similarity measure according to exemplary embodiments of the present disclosure;

10(a) -10 (d) illustrate four different characterizations of trajectory displacement according to exemplary embodiments of the present disclosure;

FIG. 11 shows a schematic diagram of a multi-camera based motion gesture monitoring and guidance apparatus according to an exemplary embodiment of the present disclosure; and

FIG. 12 illustrates a schematic diagram of a bone registration module and a pose comparison module in an apparatus according to an exemplary embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. Nothing in the following detailed description is intended to indicate that any particular component, feature, or step is essential to the invention. Those skilled in the art will appreciate that various features or steps may be substituted for or combined with one another without departing from the scope of the present disclosure.

FIG. 1 shows a flow diagram of a multi-camera based motion gesture monitoring and guidance method according to an example embodiment of the present disclosure. Referring to fig. 1, the method comprises the following steps: at S101, moving images and/or video data of at least one target player are acquired through a plurality of cameras; in S102, the target sporter is identified from the moving image and/or video data through a gesture identification algorithm, and a human body posture graph required by motion gesture monitoring is output; at S103, reconstructing human body three-dimensional posture information of the target sporter through a three-dimensional reconstruction algorithm based on the human body posture graph; at S104, carrying out bone registration on the target sporter and a reference person by utilizing a body state key node in a three-dimensional space; at S105, comparing the human body three-dimensional posture information of the target sporter and the reference person which are subjected to bone registration at a certain moment or within a preset time period; at S106, based on the comparison result, evaluating the action completion degree and quality of the target sporter; and at S107, providing feedback to the target athlete based on the assessment results, wherein the feedback includes whether the action meets the standard and/or an exercise optimization suggestion.

In some embodiments, the camera may include at least one of: a planar camera, a depth camera, an infrared camera or a thermal imager, wherein the depth camera may comprise at least one of: a time-of-flight camera, a structured light camera, or a binocular camera. Time of flight (TOF) is a method of obtaining distance information from an object to a camera by continuously transmitting light pulses to the object, receiving light returning from the object by a sensor, and detecting the Time of flight of the light pulses. The structured light camera scans a measured object by emitting laser so as to obtain the distance information from the surface of the measured object to the camera. The binocular camera determines distance information from a shooting target to the camera through parallax calculation of images collected by the two cameras.

For the mode of combining the planar camera and the depth camera, the planar camera provides user appearance information, and the depth camera provides distance information of a user from the shooting direction of the camera, so that the body joint position of a photographer can be analyzed. The configuration using a combination of a flat-screen camera and a depth camera allows a user to be observed when there is only a single viewing angle. The core of the configuration mode is that the planar color image and the depth point cloud are overlapped and fused on the pixels, and the fusion method at the pixel level enables the model of the invention to definitely reason the local appearance and the geometric information. In addition, a binocular camera can be used as a preferred scheme, and the binocular camera is relatively reasonable in price and can output a plane color image and a depth image.

Fig. 2 is an arrangement of cameras according to an embodiment of the present invention. As shown in fig. 2, the

cameras

202 and 203 can acquire moving images and/or video data of at least one target player 201 from a plurality of angles. And extracting geometric feature points of the image based on geometric feature information (edges, lines, contours, interest points, corner points, geometric primitives and the like) according to the relative posture between the camera intrinsic parameters and the shot view. And performing parallax estimation on the geometric feature points, and reconstructing a three-dimensional space scene by using the obtained parallax information to obtain the position of the body skeleton joint of the target sporter in the three-dimensional space.

In some embodiments, the outputting the body state diagram required for motion gesture monitoring may further include: determining the posture of the human body through key nodes of the human body, wherein the key nodes comprise at least one of the following: limb joint points and facial key points, and the position information of the key nodes is represented by coordinates; determining the position coordinates of at least one key node in the moving image or video; determining category information of at least one of the key nodes, wherein the category information includes: body feature information of interest, the body feature information of interest comprising: key characteristic points of human body parts required by human body monitoring tasks and human body biomechanical model analysis aiming at different applications; determining state information of at least one of the key nodes, wherein the state information comprises: visible, invisible, and either speculative or non-speculative; and linking the key nodes into the human body posture graph through the position relation and the reliability among the key nodes, and judging the action through the change of the human body posture graph.

FIG. 3 shows a model application flow diagram of an embodiment of the invention. As shown in fig. 3, the process includes the following steps: s301, training a posture recognition deep learning network model; and S302, reasoning by using the fully trained gesture recognition deep learning neural network.

The gesture recognition algorithm comprises a deep learning neural network prediction algorithm, wherein the deep learning neural network needs to be trained. As shown in fig. 3, at S301, the training includes: preparing a human body posture image set, wherein human body posture image data in the human body posture image set is marked according to the key nodes; and training a deep learning model by using the human body posture image set, and updating parameters of the deep learning neural network through error back propagation until convergence to obtain the deep learning neural network which is completely trained.

Fig. 4 is a flowchart of training a deep learning network model according to an embodiment of the present invention, and as shown in fig. 4, the flowchart includes: s401, preparing an image set containing a human body, marking the human body in the human body image based on key body characteristic points required by stress analysis, wherein the marked characteristic points are used for distinguishing different human body parts; s402, constructing a neural network model; s403 defines a loss function; s404, model training: performing end-to-end training by using the built neural network model and the image data as input and the marked human posture data as output, and updating parameters of the neural network until convergence by back propagation of errors to obtain well-trained neural network parameters; the training degree of the deep learning model is preferably verified in a cross-validation mode so as to enhance the generalization ability of the deep learning model and avoid overfitting of the model.

In S401, the pose image in the human body pose image set is obtained by manually labeling the feature point labels. In order to obtain more human body posture images for training, the images can be preprocessed in the steps, and the preprocessing mode includes but is not limited to at least one of the following modes: rotation processing, cutting processing, brightness adjustment and down-sampling processing. Preferably, when preparing the gesture image database, the human gestures in the image may be labeled according to feature points required for biomechanical analysis of the region of interest, by modifying the existing mass labeled human database.

Fig. 5 shows an example of a body key feature point marker template according to an exemplary embodiment of the present disclosure. In this embodiment, popular public data sets may be used including, but not limited to: a Human 3.6M three-dimensional Human body posture data set, a COCO Human body posture data set, an MPII Human body posture database and the like. In the MPII dataset, each person has 14 body key feature point markers with different numbers, such as right shoulder, right elbow, right wrist, left shoulder, left elbow, left wrist, right hip, right knee, right ankle, left hip, left knee, left ankle, vertex, neck; each marker records coordinate information of a corresponding body key feature point, wherein the body key feature point marker template is shown in fig. 5.

FIG. 6 shows a schematic diagram of body key feature points used for curl gesture analysis in accordance with an exemplary embodiment of the present disclosure. In some embodiments, in order to perform biomechanical analysis to obtain the influence of the motion on muscles, joints and bones, a human body posture feature marking method which is more adaptive to requirements based on human engineering is provided, that is, important feature points of a human body in an image are additionally marked according to different human body mechanical analysis requirements. As shown in FIG. 6, for the curl gesture analysis, the feature points used include, but are not limited to, the following body features: head, neck center, shoulder center, back center, waist, sacrum, hip joint, thigh bone, knee joint, ischial contact point, etc.; for stance analysis, the feature points used include, but are not limited to, the following physical features: head, neck center, shoulder center, back center, waist, hip joint, knee joint, ankle joint, sole, etc. By adopting the mode, the key points required by the human body biomechanical analysis are predicted, the accuracy rate of the gesture recognition task can be improved, and the calculated amount and the prediction time of the gesture recognition model can be further reduced.

In this embodiment, in order to further improve the accuracy and stability of the estimation of the position of the body key feature point, a confidence region is generated based on the pixel position of the body key feature point marked on the image. In this region, the estimated body key feature points will have a relatively high probability of containing true values. Those skilled in the art will appreciate that other confidence region generation methods may also be used to generate confidence regions.

In some embodiments, the body key feature point monitoring deep learning neural network structure comprises a top-down deep learning neural network or a bottom-up deep learning neural network, wherein the bottom-up method is to monitor all body key feature points in an image first and then allocate all body key feature points to different human body examples respectively; and the top-down method is that the human body monitor is operated on the image firstly to find all human body examples, then all the human body examples are divided into a plurality of human body example sub-images, and the monitoring of the key characteristic points of the human body is carried out on each human body example sub-image. The method has the advantages that the speed and the accuracy of identifying the key body characteristic points of a single person are high, the high performance is realized, and the time for identifying the posture of the whole image is increased along with the increase of the number of the persons. However, the accuracy of recognizing complex body poses tends to be somewhat lower because the body key points cannot be modeled more carefully. In the bottom-up method, the key points of the human body parts in the image are firstly monitored on the image, and then the key points of the parts of the human body of a plurality of people in the image are respectively distributed to different human body examples. Although the method cannot benefit from the overall human body structure information, in the case that the image comprises a plurality of human body examples, the accuracy of detecting and classifying the key feature points of each human body is high, and the identification time cannot be obviously increased along with the increase of the number of detected people.

In other words, in the case of multi-person pose extraction, a bottom-up deep learning body key feature point monitoring neural network architecture model is preferably used. However, in the case of single-person gesture extraction, in order to improve the accuracy of single-person gesture detection, it is preferable to employ a top-down deep learning body key feature point monitoring neural network architecture model.

At S302, performing inference using the well-trained gesture recognition deep learning neural network may include: and taking the image or video data as input, and outputting each estimated human body state key feature point through a completely trained neural network, wherein the feature point information comprises: coordinates on the graph, state and confidence, etc.

Fig. 7 illustrates an example of a human pose three-dimensional reconstruction flow according to an exemplary embodiment of the present disclosure. As shown in fig. 7, reconstructing the human body three-dimensional posture information of the target sporter through a three-dimensional reconstruction algorithm based on the human body posture diagram further includes: acquiring shooting parameters of the plurality of cameras, and establishing a three-dimensional space coordinate system according to the shooting parameters, wherein the shooting parameters comprise at least one of the following: the orientation, angle and viewing angle of the camera, and the focal length.

In S701, the image capturing device parameters, the captured image, and the predicted human posture feature points obtained in S302 are obtained. In the above step, the acquired parameters of the image capturing device include: position information and shooting direction of the image acquisition device. Before scanning, shooting parameters of each camera are obtained, and a space coordinate system of the camera is calibrated and adjusted. The method comprises the steps of placing a camera at a certain position or certain positions of a motion space, collecting user action information from a plurality of selectable visual angles, and then performing three-dimensional reconstruction. The position of the camera is then calibrated. Each camera can establish a spatial three-dimensional rectangular coordinate system according to the origin of the position of the camera. The preferred definition in this example is: the coordinate origin is fixed at the lens of the camera, the x and y axes are parallel to two sides of the imaging surface of the camera, and the z axis is positioned in the direction of the shooting optical axis of the lens and is vertical to the phase surface.

At S702, the same person in each view is matched. In the above steps, the method for matching the same person in each view by an algorithm includes but is not limited to: determining the position of a target user through position tracking in the wearable device; under the condition of obtaining the depth image information, determining a human body according to the position of the human body positioned in the depth view in the three-dimensional space; and under the condition of acquiring the plane image information, calculating the human body feature similarity across multiple views according to the human body appearance similarity and the geometric compatibility. Preferably, the deep learning algorithm is used to calculate the output feature result of the specific convolutional layer and match the feature result. Alternative methods are machine learning or optical flow methods, etc.

In S703, three-dimensional pose information of the target is reconstructed from the matched image data and the body key feature point information.

In some embodiments, reconstructing the human three-dimensional pose information of the target actor through a three-dimensional reconstruction algorithm based on the human body state diagram further comprises: in the case of a single depth camera, reconstructing the human three-dimensional pose information of the target user by converting the human body pose image generated from the depth image acquired by the depth camera into a three-dimensional point cloud image; in the case of a combination of a plane camera and a depth camera, processing the human body posture image generated by a plane image acquired by the plane camera and a three-dimensional point cloud image converted from a depth image acquired by the depth camera to reconstruct the human body three-dimensional posture information of the target sporter; or in the case of a multi-plane camera or a multi-depth camera combination, the position of the bone joint of the target sporter body in the three-dimensional space is obtained by projecting the human body posture image into the three-dimensional space coordinate system through a triangulation method according to the position and the relative posture of the camera placed at multiple angles and based on the marked human body posture data output in S302, thereby calculating the three-dimensional space coordinates of the feature point.

In some embodiments, the bone registration of the target moving person and the reference person by using the key nodes of the posture in the three-dimensional space is a mathematical calculation process of converting the coordinate set of the three-dimensional posture data points of the target moving person obtained in step S703 into the coordinate system of the coordinate set of the posture characteristic in the reference three-dimensional space of the reference person.

For example, two posture feature coordinates of the target exerciser and the reference exerciser may be considered as two point sets, and the posture feature coordinate set is composed of at least one feature joint coordinate. Registration target objects can be classified as global bone registration based and local bone registration based. The global bone registration refers to the registration of the whole human posture feature coordinate set. Local bone registration refers to registration of local key feature coordinates in a human body posture feature coordinate set.

In a preferred embodiment, the registration method is a spatial transformation process that finds the two sets of pose points aligned: the purpose of the transformation mainly comprises: merging a plurality of attitude characteristic node sets into a global unified model; and mapping the set of unknown pose feature nodes onto the set of reference nodes to identify features or estimate poses thereof.

Furthermore, the problem of the point set registration study can be summarized as follows: assuming that X, Y are two point sets in euclidean space, a transformation T needs to be found, which, after being applied to point set X, can minimize the difference between transformed point set T (X) and point set Y. The difference between the transformed set of points t (x) and the set of points Y may be defined by some distance function. One preferred distance function is the euclidean distance. Additionally or alternatively, the distance function may also be a cosine distance.

Those skilled in the art will appreciate that Point set registration techniques include, but are not limited to, iterative closest Point (iterative closest Point), Robust Point matching (Robust Point matching), Kernel correlation (Kernel correlation), and consistent Point drift (Coherent Point drift).

Those skilled in the art will appreciate that the transformation methods in the registration of the point sets may also include, but are not limited to, simple linear transformations (translation, scaling, rotation, cropping), affine transformations, similarity transformations, and the like.

In some embodiments, a linear scaling method comprises: calculating a bone length of the target actor and the reference, wherein the bone length is a distance between the location coordinates of the key nodes linked together, wherein the distance comprises at least one of: euclidean distance, standardized Euclidean distance, Mahalanobis distance, cosine distance; stretching (enlarging, reducing) the skeleton of the target player in length in the corresponding direction of the skeleton of the target player according to the corresponding skeleton length of the reference player; or stretching the skeleton length of the reference person according to the corresponding skeleton length of the target sporter.

Wherein,

respectively representing a reference characteristic coordinate set and a target sporter characteristic coordinate set, wherein the parameter p is a scaling multiple and p is more than 0.

In some embodiments, an affine transformation method comprises: a transformation relationship is calculated from the target player coordinate system skeleton feature coordinate set to a corresponding skeleton coordinate set in the reference coordinate system. For example, a linear mapping matrix a and a three-dimensional translation vector b can be described as follows:

wherein,

respectively representing a reference feature coordinate set and a target actor feature coordinate set. Those skilled in the art will appreciate that the linear mapping matrix a includes, but is not limited to: and linear operations such as translation, zooming, rotation, clipping and the like.

In some embodiments, the coordinates may appear decimal when the target player coordinate set is mapped to the reference coordinate system after the registration transformation is performed on the image. This can be done, for example, by aligning the data to integer coordinates by bilinear interpolation.

In some embodiments, for example, spatial transformation neural network learning may be used to set the parameters of the affine matrix so that the target player pose coordinate set is effectively normalized to the reference coordinate system by the transformation matrix. FIG. 8 illustrates one example of a spatial transform neural network in accordance with an exemplary embodiment of the present disclosure. As shown in fig. 8, the spatial transform neural network 800 is generally divided into three parts: 1) a positioning Network 801 (localization Network); 2) a Grid generator 802(Grid generator); and 3) Sampler 803 (Sampler). The positioning network 801 is used to calculate a parameter θ of spatial transformation, the grid generator 802 obtains a correspondence relationship T θ from (x, y, z) of the input target player coordinate set U to (x, y, z) of V at each position of the pose coordinate set output to the reference coordinate system, and the sampler 803 generates a final output coordinate set in the reference coordinate system according to the input coordinate set U and the correspondence relationship T θ.

In an exemplary embodiment, the positioning network 801 functions to generate the parameters θ of the spatial transformation through a sub-network (full-link or convolutional network, plus a regression layer). The mesh generator 802 is operative to calculate the coordinate position in the original image V for each position in the target image U by a selected transformation matrix operation (e.g., non-linear transformation, affine transformation). I.e. to generate t (g). The sampler 803 is used for sampling in the original image U according to the coordinate information in t (g) and copying the pixels in U into the target image V. And finally acquiring a completely trained spatial transformation neural network through back propagation of the training network parameters.

Fig. 9(a) to 9(c) show three different measurement modes of the geometric similarity measurement according to an exemplary embodiment of the present disclosure. Specifically, fig. 9(a) shows two trajectories having exactly the same shape direction but at a larger distance from each other. Fig. 9(b) shows two overlapping tracks but in different directions. Fig. 9(c) shows two tracks synchronized in two directions but different in length.

When the key limbs of the moving target move in three dimensions, the trajectory similarity analysis is an important index. Trajectory similarity measures may include, but are not limited to, Euclidean distance methods, dynamic time warping methods, edit distance methods, longest common subsequences, Fourier distance methods, Hausdorff distance methods, deep learning distance measurements, and the like.

As shown in fig. 9(a), one of the most intuitive methods is to consider two tracks as two point sets, and then represent the similarity of the two tracks by the minimum distance.

In some embodiments, the euclidean distance method may also be employed. For example, for a sequence of the same length, the euclidean distance between corresponding points between the converted target player and the reference trajectory is calculated, and then accumulated and summed as the final metric value. For example, the target and reference trajectory distances are calculated as:

wherein the motion trail of the target sporter is L_iAnd a reference trajectory of L_jThe method comprises the following steps of firstly, obtaining a target sporter track sequence, then, obtaining a reference track sequence, obtaining k track points, obtaining n total track length, obtaining m key node coordinates and obtaining p key node coordinate sets. This method is also very limited, for example, requiring that the sampling rate and the trace points must be identical. As shown in fig. 9(c), considering the lengths of the tracks at the same time, even if the directions of the two tracks are exactly the same, the similarity of the two tracks is actually low because the lengths are different.

In some embodiments, in order to measure the tracks with different sampling rates and different lengths, the tracks are locally stretched or scaled, preferably using a dynamic time warping rule, so that the tracks with different sampling rates and different lengths can be compared.

The above methods have some drawbacks. For example, as shown in FIG. 9(b), if two tracks have overlapping regions, then the minimum distance is 0, and then the algorithm result will give the exact same result for both tracks, which is clearly wrong because the direction of the tracks must be taken into account.

In some embodiments, it is preferable to integrate all the factors and introduce the following three concepts to define the similarity measure of the trajectory:

1. centroid of the trace

The centroid of a three-dimensional shape is recorded

I.e. its geometric centre. The trajectory centroid is expressed by ctr (tra), and its mathematical expression is:

f, (x), f (y), f (z) respectively represent the density distribution of the track points in the x, y and z directions. If the density is assumed to be isotropic and uniformly distributed (i.e., f (x) (y) (f) (z) (1)), the trajectory centroid of any critical limb is mathematically expressed as:

2. displacement of track

Fig. 10(a) to 10(d) show four different characterizations of trajectory displacement according to exemplary embodiments of the present disclosure. Specifically, fig. 10(a) shows a trajectory tra1, fig. 10(b) shows a trajectory tra2 (dots are trajectory centroids), fig. 10(c) shows trajectory displacements (vectors from start points to end points), and fig. 10(d) shows euclidean distances between two trajectory centroids. For any trajectory, as shown in FIG. 10(c), the trajectory displacement may be defined as the vector from the start point to the end point of the trajectory, denoted as

3. Cosine similarity

Cosine similarity is the cosine value of the angle between two vectors [ -1, 1 [ ]]-1 represents that the two vectors point in exactly opposite directions, i.e. at an angle of 180 degrees, and 1 represents that the two vectors point in the same direction, i.e. at an angle of 0 degrees. The track displacement vectors of the tracks tra1 and tra2 are respectively denoted as

And

the mathematical expression of cosine similarity is:

in summary, if tra1 is the target user trajectory and tra2 is the reference trajectory, then the mathematical expression of the geometric distance between the two trajectories (the greater the value the less similar) is:

the expression is composed of three terms, the first term measures the Euclidean distance of the centroid, the second term measures the difference of the track lengths, and the third term is the cosine similarity multiplied by the average length of the two tracks.

In addition, based on the attitude geometric similarity measurement, an attitude regression loss function can be further defined, and the expression is as follows:

where J is the key limb index and J is the total number of key limbs (e.g., J17).

In some embodiments, comparing the human three-dimensional pose information of the skeletal registered target player and the reference player over a predetermined continuous period of time further comprises at least one of: comparing the distances of the key nodes of the target sporter and the corresponding key nodes of the reference person on the three-dimensional space by calculating, wherein the larger the distance is, the larger the action gap is; calculating distances between a plurality of key nodes of the target sporter and a plurality of corresponding key nodes of the reference person on the three-dimensional space and averaging the distances for comparison, wherein the larger the average value is, the larger the action gap is; comparing the included angle between the line segment formed by the linked key nodes of the target sporter and the line segment formed by the corresponding linked key nodes of the reference person, wherein the larger the included angle is, the larger the action difference is; and extracting features of the target sporter and the human body three-dimensional posture of the reference person, which change in space-time, and comparing the extracted features of the target sporter with corresponding features of the reference person, wherein the features of the target sporter change in space-time include a movement direction and a movement speed, and the difference between the features and the corresponding features is larger, which indicates that the action gap is larger.

Additionally or alternatively, in some embodiments, evaluating the motion completion and quality of the target athlete based on the comparison further comprises: and evaluating the action completion degree and quality of the target sporter by assigning a weight to a comparison result obtained by one or more comparison modes.

FIG. 11 shows a schematic diagram of a multi-camera based motion gesture monitoring and guidance apparatus according to an example embodiment of the present disclosure. As shown in fig. 11, includes: a motion capture module 1101 configured to acquire moving images or video data of at least one target sporter from a plurality of photographing angles through a plurality of cameras; a gesture recognition module 1102 configured to recognize the target sporter from the moving image or video data through a gesture recognition algorithm and output a human body posture map required for motion gesture monitoring; a three-dimensional reconstruction module 1103 configured to reconstruct human three-dimensional pose information of the target actor through a three-dimensional reconstruction algorithm based on the human body state diagram; a bone registration module 1104 configured to bone register the target actor with a reference actor using a posture key node in three-dimensional space; a pose comparison module 1105 configured to compare the human three-dimensional pose information of the target player and the reference player registered via bones over a predetermined continuous period of time; a pose evaluation module 1106 configured to evaluate the degree and quality of action completion of the target athlete based on the comparison; and a motion feedback module 1107 configured to provide feedback to the target athlete based on the assessment results, wherein the feedback comprises whether the action is up to standard and/or a motion optimization suggestion.

FIG. 12 illustrates a schematic diagram of a bone registration module and a pose comparison module in an apparatus according to an exemplary embodiment of the present disclosure. The principle of the present invention can be more clearly understood by those skilled in the art through the schematic view shown in fig. 12.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.

In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some embodiments, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

While exemplary embodiments of the present invention have been shown and described herein, it will be readily understood by those skilled in the art that such embodiments are provided by way of example only. Numerous modifications, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims

1. A motion attitude monitoring and guiding method based on multiple cameras is characterized by comprising the following steps:

acquiring moving images and/or video data of at least one target sporter through a plurality of cameras;

identifying the target sporter from the moving image and/or video data through a posture identification algorithm, and outputting a human body posture graph required by movement posture monitoring;

reconstructing human body three-dimensional posture information of the target sporter through a three-dimensional reconstruction algorithm based on the human body posture graph;

carrying out bone registration on the target sporter and a reference person by utilizing a posture key node in a three-dimensional space;

comparing the human body three-dimensional posture information of the target sporter and the reference person which are subjected to bone registration at a certain moment or within a preset time period;

evaluating the action completion degree and quality of the target sporter based on the comparison result; and

providing feedback to the target athlete based on the assessment, wherein the feedback comprises whether the action is up to standard and/or an exercise optimization suggestion.

2. The method of claim 1, wherein the camera comprises at least one of: planar camera, degree of depth camera, infrared camera or thermal imaging system, wherein the degree of depth camera includes following at least one: a time-of-flight camera, a structured light camera, or a binocular camera.

3. The method of claim 1, wherein outputting the body state diagram required for motion gesture monitoring further comprises:

determining the posture of the human body through key nodes of the human body, wherein the key nodes comprise at least one of the following: limb joint points and facial key points, and the position information of the key nodes is represented by coordinates;

determining the position coordinates of at least one key node in the moving image or video;

determining category information of at least one of the key nodes, wherein the category information includes: body feature information of interest, the body feature information of interest comprising: key characteristic points of human body parts required by human body monitoring tasks and human body biomechanical model analysis aiming at different applications;

determining state information of at least one of the key nodes, wherein the state information comprises: visible, invisible, and either speculative or non-speculative; and

and linking the key nodes into the human body posture diagram through the position relation and the reliability among the key nodes, and judging the action through the change of the human body posture diagram.

4. The method of claim 3, wherein the gesture recognition algorithm comprises a deep learning neural network prediction algorithm, wherein the deep learning neural network requires training, the training comprising:

preparing a human body posture image set, wherein human body posture image data in the human body posture image set is marked according to the key nodes; and

and training a deep learning model by using the human body posture image set, and updating the parameters of the deep learning neural network through error back propagation until convergence to obtain the deep learning neural network which is completely trained.

5. The method of claim 1, wherein reconstructing the three-dimensional pose information of the target actor by a three-dimensional reconstruction algorithm based on the body state diagram further comprises:

acquiring shooting parameters of the plurality of cameras, and establishing a three-dimensional space coordinate system according to the shooting parameters, wherein the shooting parameters comprise at least one of the following: the orientation, angle and viewing angle of the camera, and the focal length.

6. The method of claim 5, wherein reconstructing the three-dimensional pose information of the target actor by a three-dimensional reconstruction algorithm based on the body state diagram further comprises:

in the case of a single depth camera, reconstructing the human three-dimensional pose information of the target user by converting the human body pose image generated from the depth image acquired by the depth camera into a three-dimensional point cloud image;

in the case of a combination of a plane camera and a depth camera, processing the human body posture image generated by a plane image acquired by the plane camera and a three-dimensional point cloud image converted from a depth image acquired by the depth camera to reconstruct the human body three-dimensional posture information of the target sporter; or

And under the condition of the combination of multi-view image data acquisition devices, the human body three-dimensional posture information of the target sporter is reconstructed by projecting the human body posture image into the three-dimensional space coordinate system.

7. The method of claim 3, wherein the bone registration of the target player with the referee using the key nodes of the posture in the three-dimensional space further comprises global bone scaling and local bone scaling, wherein the global bone scaling refers to the registration of the coordinate sets of the key nodes for the entire human body, and the local bone scaling refers to the registration of the coordinates of the local key nodes among the key nodes for the human body, comprising:

calculating a bone length of the target actor and the reference, wherein the bone length is a distance between the location coordinates of the key nodes linked together, wherein the distance comprises at least one of: euclidean distance, standardized Euclidean distance, Mahalanobis distance, cosine distance;

performing bone registration on the bone length of the target sporter according to the corresponding bone length of the reference person; or

And carrying out bone registration on the bone length of the reference person according to the corresponding bone length of the target sporter.

8. The method of claim 7, wherein comparing the human three-dimensional pose information of the skeletal registered target actor and the reference actor over a predetermined continuous period of time further comprises at least one of:

comparing the distances of the key nodes of the target sporter and the corresponding key nodes of the reference person on the three-dimensional space by calculating, wherein the larger the distance is, the larger the action gap is;

calculating distances between a plurality of key nodes of the target sporter and a plurality of corresponding key nodes of the reference person on the three-dimensional space and averaging the distances for comparison, wherein the larger the average value is, the larger the action gap is;

comparing the included angle between the line segment formed by the linked key nodes of the target sporter and the line segment formed by the corresponding linked key nodes of the reference person, wherein the larger the included angle is, the larger the action difference is; and

by extracting features of the target sporter and the human body three-dimensional posture of the reference person changing in space-time, and comparing the extracted features of the target sporter and corresponding features of the reference person, wherein the features changing in space-time comprise a movement direction and a movement speed, and the difference between the features and the corresponding features is larger, which indicates that the action gap is larger.

9. The method of claim 8, wherein evaluating the completion and quality of the target athlete's actions based on the comparison further comprises:

evaluating the performance and quality of the target athlete's performance by assigning weights to the results of one or more comparisons according to claim 8.

10. An apparatus for using the method of any of claims 1-9, comprising:

a motion capture module configured to acquire moving images or video data of at least one target player from a plurality of shooting angles through a plurality of cameras;

a gesture recognition module configured to recognize the target sporter from the moving image or video data through a gesture recognition algorithm and output a human body posture map required for motion gesture monitoring;

a three-dimensional reconstruction module configured to reconstruct human three-dimensional pose information of the target actor through a three-dimensional reconstruction algorithm based on the human body posture map;

a bone registration module configured to bone register the target actor with a reference actor utilizing a posture key node in three-dimensional space;

a pose comparison module configured to compare the human three-dimensional pose information of the target player and the reference player registered via bones over a predetermined continuous period of time;

a pose evaluation module configured to evaluate the degree and quality of action completion of the target athlete based on the comparison; and

a motion feedback module configured to provide feedback to the target athlete based on the assessment, wherein the feedback includes whether the action is up to standard and/or a motion optimization suggestion.