CN112153468A - Method, computer readable medium and system for synchronizing video playback with user motion - Google Patents

Method, computer readable medium and system for synchronizing video playback with user motion Download PDF

Info

Publication number
CN112153468A
Authority
CN
China
Prior art keywords
video
user
motion
recorded
mode
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010489056.3A
Other languages
Chinese (zh)
Inventor
D·G·金贝尔
L·德努
M·J·贝克
P·邱
金哲暄
张艳霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujifilm Business Innovation Corp
Original Assignee
Fuji Xerox Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuji Xerox Co Ltd filed Critical Fuji Xerox Co Ltd
Publication of CN112153468A publication Critical patent/CN112153468A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/11 Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information not detectable on the record carrier
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/47202 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for requesting content on demand, e.g. video on demand
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/48 Matching video sequences
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/005 Reproducing at a different information rate from the information rate of recording
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B 27/00 Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
    • G11B 27/10 Indexing; Addressing; Timing or synchronising; Measuring tape travel
    • G11B 27/34 Indicating arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/432 Content retrieval operation from a local storage medium, e.g. hard-disk
    • H04N 21/4325 Content retrieval operation from a local storage medium, e.g. hard-disk by playing back content from the storage medium
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H04N 21/47217 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for controlling playback functions for recorded or on-demand content, e.g. using progress bars, mode or play-point indicators or bookmarks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)
  • Studio Devices (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Methods, computer-readable media, and systems for synchronizing video playback with user motion. A computer-implemented method for coordinating a sensed gesture associated with real-time motion of a first object in a live video feed with a pre-recorded gesture associated with motion of a second object in the video is provided. The computer-implemented method includes: applying a matching function to determine a match between a point of one of the sensed gestures and a corresponding point of the pre-recorded gesture; and based on the matching, determining a playing time and outputting at least one frame of the video associated with the second object at the playing time.

Description

Method, computer readable medium and system for synchronizing video playback with user motion
Technical Field
Aspects of the example implementations relate to and provide methods, systems, and user experiences for video playback that is synchronized in real time with user motion, so as to facilitate the user following along with the video.
Background
In the prior art, a user may observe a video intended to provide the user with training regarding performing an activity. For example, prior art videos may provide a user with training regarding a series of operations required to perform assembly of a kit of parts or facilitate the user moving his or her body in a prescribed manner. Some prior art videos may provide training on how a user may move his or her body to perform activities such as yoga, dance, physical therapy, and the like.
Related art video playback methods can play a series of operations, and when a user needs to view a subset of the operations, it is difficult for the user to return to a desired position. For example, the user may need to rewind or fast-forward the video to a desired operation.
The prior art methods may have disadvantages or problems. For example, but not by way of limitation, requiring a user to rewind a video back to a desired point can disrupt the user's training experience and can distract the user from the actual task that the user is attempting to learn from the video. The user's attention shifts to the task of controlling the video rather than to the series of tasks that the video playback is intended to train.
Furthermore, in training activities such as yoga, dance, physical therapy, etc., the related art methods of controlling video prevent a user from learning efficiently with the related art video playback methods. As a result of these prior art shortcomings, users are often discouraged from continuing the activity, may fail to further improve their skills in areas such as yoga or dance, or, for video playback associated with physical therapy or the like, may continue to suffer physical discomfort or see their improvement impeded.
Other prior art methods of controlling video may include voice commands. However, these methods also have disadvantages and problems. When the user has to use or recall a voice command for rewinding or repositioning the video to a desired point, the user is distracted from the training activity, similar to the prior art methods that require manipulation using a remote control or the like. The prior art methods may also employ gesture-related commands, as well as handheld devices. However, these methods also have problems and disadvantages, as the user may not be able to perform the training activities, particularly where the activities include positioning of the arms, hands, fingers, and the like.
Additional prior art approaches may include the use of computer-generated graphics (e.g., avatars or animations of avatars) as an alternative to video playback. However, these prior art methods also have various drawbacks and problems. For example, but not by way of limitation, these prior art methods may have a relatively high development cost. Moreover, those methods do not provide the photorealism that the user desires for the training activity.
Accordingly, there is an unmet need in the art to be able to track the motion of a user and control and present playback in an appropriate manner to provide feedback as to whether the user was successful in following the training operation provided in the video, without requiring the user to be distracted or to cease the training activity.
Disclosure of Invention
According to aspects of an example implementation, there is provided a computer-implemented method of coordinating a sensed gesture associated with real-time motion of a first object in a live video feed with a pre-recorded gesture associated with motion of a second object in the video, the computer-implemented method comprising the steps of: applying a matching function to determine a match between a point of one of the sensed gestures and a corresponding point of the pre-recorded gesture; and based on the matching, determining a play time and outputting at least one frame of the video associated with the second object for the play time.
According to some aspects, the points of the sensed gesture and the corresponding points of the pre-recorded gesture include a single point of the first object and the second object in the first mode and a plurality of points of the first object and the second object in the second mode. Optionally, a user control is provided to select between a first mode and a second mode, and for the second mode, the number and location of the plurality of points of the first object and the second object are selected.
According to other aspects, the first object includes a user receiving the video output and sequentially performing gestures to coordinate with the pre-recorded gestures, and the second object includes a coach displayed in the video that sequentially performs the pre-recorded gestures. Further, the motion may comprise one of a physical therapy activity, at least a portion of an exercise, at least a portion of a dance, or at least a portion of a performance.
According to other aspects, the computer-implemented method comprises: determining an instantaneous velocity estimate over the time window based on the results of applying the matching function and determining the play time; applying a smoothing method to the instantaneous speed estimate to generate a play speed; and updating the playing time based on the playing speed. Additionally, the smoothing method may include using an empirically determined fixed smoothing factor or a sliding average of a variable that changes based on the degree of motion of the second object over the time window.
According to a further aspect, the video comprises one or more visual feedback indicators associated with the pre-recorded pose of the second object. Optionally, the one or more visual feedback indicators may include one or more of a trajectory indicative of a path of motion of the corresponding point of the second object, a trajectory indicative of a relative speed and a relative direction of the corresponding point of the second object along the trajectory, and a stretch band indicative of a difference between a position of the corresponding point of the second object and a position of the point of the first object.
Example implementations may also include a non-transitory computer-readable medium having a storage and a processor capable of executing instructions for implementing the methods of the present disclosure.
Drawings
Fig. 1 illustrates various aspects of a system according to an example implementation.
Figs. 2(a)-2(c) illustrate various aspects of a system according to an example implementation.
Fig. 3 illustrates example hardware aspects used in some example implementations.
FIG. 4 illustrates an example process for some example implementations.
FIG. 5 illustrates an example computing environment with an example computer apparatus suitable for use in some example implementations.
FIG. 6 illustrates an example environment suitable for some example implementations.
Detailed Description
The following detailed description provides further details of the figures and example implementations of the present application. For clarity, the numbering and description of redundant elements between the drawings is omitted. The terminology used throughout the description is provided as an example and is not intended to be limiting.
Aspects of example implementations relate to methods and systems for video playback. More specifically, example implementations include video playback schemes in which playback of a video is synchronized in real-time with the motion and actions of a user who may be watching the video. According to this example implementation, the user may be able to engage and focus on physical activities and/or tasks associated with the video playback at their selected pace while watching the video, and without being distracted in order to rewind or find a particular point in the video playback. Example implementations may be applied to a variety of videos, including but not limited to physical instruction such as dancing, yoga, taijiquan, etc., as well as physical therapy, introductory videos, and entertainment.
According to these example implementations, the system tracks the user's motion in real-time during playback and provides output to the user that reacts to the motion to play the video in a manner that matches or coordinates the user's physical actions. Thus, example implementations will provide a user with control over video playback of a video, which may include, but is not limited to, a recorded expert performance. More specifically, through the user's motion, the system provides the user with an experience of being in the location of the expert in the video being played back. Example implementations are provided without the use of computer-generated graphics or avatars.
In an example implementation, a video playback system tracks motion in an environment. More specifically, the user's motion is tracked in a manner that allows the user to continue playing the video or to interact with the video in a manner that provides a useful feedback mechanism without requiring the user to be distracted from physical activity. For example, but not by way of limitation, the video may include an expert performing a physical activity. In one example implementation, the video may show a "cloud hands" movement (a movement in taijiquan) that the user observing the video is attempting to learn or practice.
Fig. 1 illustrates an example implementation 100. In the example implementation 100, the user 101 is viewing a display 103, such as a video screen, a projection screen, or the like. One or more video cameras 105 are provided that sense the user 101. A speaker may also be provided. More specifically, physical activity of the user 101, such as physical movement in response to information provided on the display 103, is sensed.
In the display 103, the video playback includes an object 109. In this example implementation, the object 109 is a pre-recorded video of a teacher performing one or more movements. The video playback is synchronized with the motion of the user 101. For example, but not by way of limitation, if the user increases his or her motion speed, the playback speed of the video also increases at a corresponding rate. Similarly, if the user 101 slows down, stops, or moves in reverse, the video will correspondingly slow down, stop, or move in reverse. As a result, if the user 101 moves slower than the speed of the object 109 in the video, the video reacts to the speed difference and slows down. As a result, the user 101 does not need to rewind or reposition the video playback to a previous point, and the attention of the user 101 is not diverted or distracted.
In addition, visual indicators 111, 113 are provided to give the user 101 feedback about her position, as well as correct movement (e.g., the user 101 moves her hands as shown at 115 and 117). For example, but not by way of limitation, in the "cloud hand" example shown in fig. 1, as the user 101 moves, the video plays together such that the pose of the object 109 shown in the video is in time with the movement of the user 101.
To implement the example implementation, the motion of the object 109 in the video is tracked, and the real-time motion of the user 101 is tracked. In an example implementation, the body of the user 101 is tracked. The tracking provides estimates of joint positions (e.g., 115 and 117) associated with the body of the user 101 at various times. However, example implementations are not limited thereto; other targets may be tracked instead of the user's joint positions, such as individuals or objects in a video, and the object 109 need not be a body, without departing from the scope of the invention.
The estimates generated in the example implementations are described in more detail below. For example, the estimate may be generated in 3D world space or 2D image space. Further, the estimate may be generated concurrently with the recording of the video (e.g., live with the initial recording of the video) prior to playback to the user. Further, as an alternative to or in addition to the live initial recording estimate, the estimate may be generated after the initial video has been recorded.
Figs. 2(a)-2(c) illustrate additional example depictions 200 of visual indicators according to example implementations. For example, as shown in fig. 2(a), at 201, an object 109, such as a teacher (in this example implementation, a taijiquan teacher), has various indicators shown on the display 103 in the video. More specifically, the color traces 203 (green), 205 (vermilion) show the movements of the master's hands as he demonstrates the action. Arrows 207 (purple), 209 (purple) are provided, indicating the relative speed and direction along the respective trajectories 203, 205.
As shown in fig. 2(b), the user 101 is sensed and tracked by a video camera at 211. The signal received from the video camera is processed to indicate color traces 213 (vermilion), 215 (green), 221 (green), 223 (vermilion) indicating past and suggested future movements of the user's 101 hands. Arrows 217 (purple), 219 (purple), 225 (purple), 227 (purple) are provided to show the speed and direction of motion of the user's 101 hands along the respective trajectories.
As shown in fig. 2(c), the object 109 of the video (in this case, a master (e.g., coach)) is shown with colored tracks 231 (green), 233 (vermilion) and arrows indicating relative speeds and directions along tracks 235 (purple), 237 (purple). Further, indicators 239, 231 (sometimes referred to as "stretch bands") indicate the relative position of the user's 101 hand compared to the matching joints of the recorded video from fig. 2(a), as shown in fig. 2(c) above.
With the example implementations described above, no additional visual feedback is provided or required for further training of the user, other than playback of the video itself. Optionally, the indicator is associated with a degree of match of the real-time motion of the user compared to the motion of the corresponding portion of the video. For example, but not by way of limitation, if the synchronization between the user and the pre-recorded video is based on the position of one or more joints or positions on the user's body (e.g., the position of the hand), the video displayed to the user may show the trajectory of the hand in the recorded video, along with instructions on how the user may follow those trajectories.
Thus, the user may move his or her hand along the trajectory at the speed desired by the user, or even in the opposite direction of the trajectory in the video, and the video, in conjunction with the camera, follows the user's activity. As also explained above, in addition to the trajectory itself, arrows may be provided to indicate the direction and speed of the trajectory.
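By way of non-limiting illustration only, the following sketch shows one way such a trajectory overlay could be computed: a short window of the recorded hand positions around the current play time is extracted for drawing over the video. The function name, window lengths, and the pose-sequence format (a list of dictionaries mapping joint names to 2D image coordinates) are assumptions made for this sketch and are not taken from the described implementation.

```python
# Hypothetical sketch: extract a segment of the recorded hand trajectory
# around the current play time for display as a visual guide.
def trajectory_window(video_poses, play_index, joint="right hand",
                      frames_before=15, frames_after=45):
    """Return the (x, y) points of `joint` in a window around play_index."""
    start = max(0, play_index - frames_before)
    end = min(len(video_poses), play_index + frames_after)
    return [video_poses[t][joint] for t in range(start, end)]

# Example usage with a toy pose sequence:
poses = [{"right hand": (100 + t, 200 - t)} for t in range(100)]
segment = trajectory_window(poses, play_index=30)
print(segment[:3])  # first few points of the displayed trajectory
```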
Similarly, the user may select various display options with respect to the viewing of the video. According to one example implementation, a user may simply view a recorded video and follow (or guide) the recorded video. According to another example implementation, users may view recorded videos while viewing their own videos in live mode to be able to view visual feedback of their own movements alongside the recorded videos. In another example implementation, when recorded video and live video are displayed simultaneously as described herein, motion feedback may be provided in the live video in addition to or instead of displaying motion feedback in the recorded video.
According to another example implementation, the user's motion may be recorded while the user is following the video or moving freely. The record of the movement associated with the user may then be used by another user and/or by a teacher. For example, but not by way of limitation, example implementations may be used in telemedicine example implementations.
In this approach, the physical therapist may record treatment movements (e.g., exercises) for the patient to perform. The patient may view the video to perform the exercise or movement while recording. After performing the exercise or movement, the record may be provided to the physical therapist. The physical therapist may later review the patient's movements and provide additional movements or feedback regarding the recorded movements. In addition, the physical therapist may use the pre-recorded video in comparison to the user's recorded video, and incorporate the trajectories, arrows, and stretch bands as described above to provide detailed guidance and feedback to the user.
Although the example implementations described above relate to movement of a user's body synchronized in real-time with movement of a body of a subject in a video, example implementations are not so limited and other example implementations may be substituted therefor without departing from the scope of the invention. For example, but not by way of limitation, a pre-recorded video may include multiple people and tracked objects. Further, during playback, the user's finger position may be tracked and may be used to drag the video based on the motion profile of the finger position.
Fig. 3 shows a schematic diagram of a system according to an example implementation. More specifically, the system is physically located in region 300 such that user 301 having joints 303 and 305 can perform physical activities requiring movement of one or more joints 303 and 305. The area 300 also includes a display 307, the display 307 being viewable by a user. The playback of the previously recorded video is output to the display 307 and includes a series of motions or gestures that the user 301 is attempting to follow or learn.
A sensor 309, such as a video camera, is provided with the display 307. The sensor 309 may be separate from the display 307 or may be integrally formed with the display 307, such as a camera device built into the display 307. The sensors 309 sense the motion of the user 301, including the motion of the joints 303 and 305 that the user moves based on the output of the pre-recorded video on the display 307. Optionally, audio outputs 311a, 311b (e.g., speakers, etc.) may also be provided, for example, so that the user may also hear directions, commands, music, or other audio information associated with the pre-recorded video on the display 307.
The display 307, camera 309, and optional audio outputs 311a, 311b may be communicatively coupled with the processor 313 by wired or wireless communication. For example, but not by way of limitation, processor 313 may be a personal computer or laptop computer located in area 300. Alternatively, processor 313 may be a server remote from area 300. The processor 313 may be used to adjust one or more settings associated with the display 307 and/or the sensors 309 to adjust (e.g., without limitation) a matching mode, a playback mode, a timing mode, a speed, a trajectory, an arrow, a stretch band, or other characteristics associated with the output of the display 307 and the input of the sensors 309.
More specifically, display 307 may include a pre-recorded video of object 315, such as a trainer performing a sport or activity. In this example implementation, the object 315 is shown as having joints 317, 319, and the user 301 follows the motion of the joints 317, 319 and attempts to move his or her joints 303, 305. Further, as described above, indicators 321, 323, such as tracks, arrows, or stretch bands, may also be provided, either individually or in combination.
As described above, example implementations relate to performing a matching operation between a first sequence of video poses, captured in real time as the video is recorded or generated afterwards by offline processing of the video, and a second sequence of poses. The second sequence of poses is obtained in real time by live tracking of the user during use.
For the case where the first sequence of poses and the second sequence of poses are each sets of 2D image positions of tracked joints, a basic playback mode is provided that, for each new time step in the sequence, indicates the frame of the first sequence that is most similar to the current pose of the second sequence. To obtain the frame, a matching function is applied. The matching function may be different for different modes or applications. However, according to one example implementation, the focus is on prescribed joint positions, and other joint positions are ignored. In this method, the match is indicated by the Euclidean distance between the image coordinates of the prescribed joint positions. For example, the joint positions may be those associated with the right hand, left hand, or other body parts.
More specifically, the first sequence may be represented as a sequence Pv(t) of video poses available for the video, and the second sequence may be represented as live poses PL(tc), where tc represents the real-time clock time and t represents the playback time in the video. For Pv(t) and PL(tc), each a set of 2D image positions of the tracked joints, the basic playback mode shows, for each new time step tc, the frame of the recorded video whose pose Pv(t) is most similar to PL(tc). For example, for the frame shown at time t*, equation (1) provides the following relationship:
t* = argmin_t match[Pv(t), PL(tc)]    (1)
where match[·,·] is a matching function. For example, match[Pv(t), PL(tc)] may be the Euclidean distance between the image coordinates of the right-hand positions. If P(t)[j] represents the 2D position of joint j, then for this example implementation, the matching function is represented by the following equation (2):
match[Pv(t), PL(tc)]=dist[Pv(t)[‘right hand’], PL(tc)[‘right hand’]] (2)
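A minimal sketch of equations (1) and (2) follows, assuming poses are stored as dictionaries mapping joint names to 2D image coordinates; the function names and toy data are illustrative assumptions rather than the recorded implementation.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D image points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def match_single_joint(pv, pl, joint="right hand"):
    """Equation (2): distance between one prescribed joint in the two poses."""
    return dist(pv[joint], pl[joint])

def best_frame(video_poses, live_pose, match=match_single_joint):
    """Equation (1): index t* of the recorded pose most similar to the live pose."""
    return min(range(len(video_poses)),
               key=lambda t: match(video_poses[t], live_pose))

# Toy example: the recorded hand sweeps to the right; the live hand is near frame 3.
video_poses = [{"right hand": (10 * t, 50)} for t in range(10)]
live_pose = {"right hand": (31, 52)}
print(best_frame(video_poses, live_pose))  # -> 3
```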
example implementations provide one or more matching patterns. For example, but not by way of limitation, a single joint point and drag matching mode may be provided, and a multi-joint tracking mode may be provided.
According to the single-joint point-and-drag mode of an example implementation, a joint, such as a hand, may be used to select a point in the video to track, and the user moves that particular joint to scrub back and forth along the trajectory of the joint in the video. To help the user make the correct movements without distraction, a portion of the trajectory is displayed in the video, so the user can see in real time where to move his or her hand. According to the single-joint point-and-drag mode, dragging is engaged when the distance from the hand point to the trajectory is smaller than a start threshold, and dragging is released when the distance from the hand point to the trajectory is larger than a stop threshold.
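The start and stop thresholds described above act as a simple hysteresis. A sketch under assumed names and threshold values follows; the actual thresholds would be chosen empirically.

```python
# Hypothetical sketch of the point-and-drag hysteresis: dragging engages when
# the hand is close to the displayed trajectory and releases when it strays.
START_THRESHOLD = 20.0   # pixels; assumed value for illustration
STOP_THRESHOLD = 60.0    # pixels; assumed value for illustration

class DragState:
    def __init__(self):
        self.dragging = False

    def update(self, distance_to_trajectory):
        """Update the drag state given the hand's distance to the trajectory."""
        if not self.dragging and distance_to_trajectory < START_THRESHOLD:
            self.dragging = True     # engage drag
        elif self.dragging and distance_to_trajectory > STOP_THRESHOLD:
            self.dragging = False    # release drag
        return self.dragging

state = DragState()
for d in (100, 15, 30, 70, 25):
    print(d, state.update(d))  # engages at 15, stays engaged at 30, releases at 70
```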
Alternative methods may be employed including, but not limited to, using specific gestures or hand movements, such as making a fist and releasing the fist, corresponding to starting and stopping the drag. Thus, the user's joint is used as a pointer in real time, which can also be used to drag objects or joints in the recorded video. Further, the dragged joint is not limited to the joint used as the pointer in the video.
According to another example implementation, a plurality of joints may be tracked on a user, and the tracked plurality of joints of the user may be matched with corresponding joints in a recorded video. To implement this method, a matching function may be provided as identified by the relationship in equation (3) below:
match[Pv(t), PL(tc)] = Σj∈J dist[Pv(t)[j], PL(tc)[j]]²    (3)
where J represents the set of tracked joints, e.g., the left hand and right hand. According to the matching function, the set of joints to be tracked may range from a single joint to the complete set of tracked joints. While using fewer joints may be cognitively easier for a user, who must maintain attention, understand the feedback, and adapt his or her motion to the recording on which he or she is training, using more joints provides the user with a more comprehensive characterization of the motion, and thus a more complete training opportunity.
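Under the same assumed pose format as the earlier sketch, the multi-joint matching function of equation (3) might be written as follows; the default joint set and toy data are illustrative only.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2D image points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def match_multi_joint(pv, pl, joints=("left hand", "right hand")):
    """Equation (3): sum of squared distances over the tracked joint set J."""
    return sum(dist(pv[j], pl[j]) ** 2 for j in joints)

# Toy usage: find the recorded frame whose tracked joints best match the live pose.
video_poses = [{"left hand": (5 * t, 40), "right hand": (10 * t, 50)}
               for t in range(10)]
live_pose = {"left hand": (21, 41), "right hand": (39, 49)}
best = min(range(len(video_poses)),
           key=lambda t: match_multi_joint(video_poses[t], live_pose))
print(best)  # -> 4
```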
According to one example implementation, the user may be able to control the number of joints tracked based on his or her concentration level or intent with respect to the training opportunity. For example, but not by way of limitation, during an early training phase the user may select one or a few joints and focus on understanding the feedback to adapt fine motion to the video recording, and then gradually increase the number of joints in order to obtain a more comprehensive training method.
Further, example implementations may include one or more playback timing modes. For example, according to a first mode, the pre-recorded video provides a basis for the user to follow the motion in the pre-recorded video (e.g., as in prior art methods). According to a second mode, the user generates motion and the video tracks the user. According to a third mode, the user also generates motion and the video tracks the user's speed.
In the first mode (e.g., as in prior art methods), the pre-recorded video is played, for example, at a fixed rate. The user follows the motion of objects in the pre-recorded video in real time and attempts to keep his or her motion in time with the pre-recorded video. For each time step tc, the first mode can be represented by the following formula (4):
playTime=playTime+playSpeed*Δt (4)
according to an example implementation in the first mode, the recorded frames are displayed for a play time.
According to this example implementation, a user may be presented with user interface elements that allow the user to set or change certain variables. For example, but not by way of limitation, the play time or play speed may be modified by the user. In this example implementation, the system does not track the user. However, if the user falls behind the pace of the video and wants to rewind or slow down the video, he or she must explicitly control the video to revert to an earlier playing time or change speed.
According to a second mode, the pre-recorded video may be paused and rewound to match motion associated with an aspect of the user's motion detected by the system. Thus, the playback time is determined by the motion of the user. Furthermore, the continuity of the playback speed is based on the motion of the user. As described above, for each real time tc, the time t* of the best-matching frame is calculated from equation (1) shown and described above, and the frame from t* is used.
More specifically, for each time step tc, the live frame is observed and the value of PL(tc) is determined. Furthermore, the value of t* is determined as described above in relation to formula (1). Therefore, the playback time is set to t*, and the recorded frame at the playback time is displayed.
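A sketch of this second (pose-following) playback timing mode is shown below: at every time step the live pose is matched against the recorded pose sequence and the display jumps to the best-matching frame. The helper names and the single-joint matcher are assumptions for illustration.

```python
# Hypothetical sketch of the second (pose-following) playback mode: each time
# step, the play time is set to the best-matching recorded frame (equation (1)).
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def best_frame(video_poses, live_pose, joint="right hand"):
    return min(range(len(video_poses)),
               key=lambda t: dist(video_poses[t][joint], live_pose[joint]))

def follow_user(video_poses, live_pose_stream, display_frame):
    """For each observed live pose, display the matching recorded frame."""
    for live_pose in live_pose_stream:
        t_star = best_frame(video_poses, live_pose)
        play_time = t_star          # playback time tracks the user directly
        display_frame(play_time)

# Toy run: the "user" moves forward, pauses, then reverses along the motion.
video_poses = [{"right hand": (10 * t, 0)} for t in range(10)]
stream = [{"right hand": (x, 0)} for x in (0, 20, 40, 40, 20)]
follow_user(video_poses, stream, display_frame=print)  # prints 0 2 4 4 2
```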
According to a third mode, the video playback system tracks the motion of the user along the track of the video. Based on the tracking, a movement speed of the user is estimated. The speed may be positive forward motion on the track, negative or reverse motion on the track, or zero for a user pausing along the track. More specifically, an instantaneous speed estimate is determined over the time window based on the matching function and the result of determining the play time. A smoothing method is applied to the instantaneous speed estimate to generate a play speed, and the play time is updated based on the play speed. Further, the smoothing method may include a running average (smoothing) with an empirically determined fixed smoothing factor or a variable that changes based on the degree of motion of the second object over the time window.
According to this example implementation, for each time step tc, live frames associated with the user's activity are observed and PL(tc) is determined. Furthermore, t* is determined according to formula (1) as described above. In addition, the value of the instantaneous speed estimate s is determined based on the following equation (5):
s=(t*-playTime)/Δt (5)
based on the determined value of S, the playback speed is defined by the following equation (6):
playSpeed=w*s+(1-w)*playSpeed (6)
in equation (6) above, w is provided as a smoothing weight that determines how fast the running estimate of speed adapts to the newly observed speed estimate. For example, a value of 1 for w may effectively eliminate the speed pattern aspect. In this case, the individual time steps of the frame at t will be generated. On the other hand, when the system tracks the speed of movement of the user, a value of w close to zero will result indicating a more significant degree of smoothing.
Therefore, the playback speed is used to determine the playback time, as shown in the following equation (7):
playTime=playTime+playSpeed*Δt (7)
based on the above, the recorded frames are displayed for the play time.
In the above example implementation associated with the third mode, the value of w determines how strongly a change in the detected position of the user affects the updated play-speed estimate. For example, but not by way of limitation, w may be a fixed value that is empirically determined. Alternatively, w may be determined in another way, not as a constant but correlated with how fast the trajectory in the recording changes with respect to time. For example, the velocity in the pose space of the recorded trajectory motion at time t may be defined according to equation (8):
v(t)=dist[Pv(t), Pv(t+Δt)]/Δt (8)
A larger value of v(t) indicates a more significant amount of motion; a detected change in the user's position then provides a reliable estimate of the change in playback time, and therefore w may be set higher. On the other hand, a small value of v(t) may indicate that there is little motion for a period of time. In this case, the estimate of speed should not be updated while there is little motion, and the value of w may remain close to zero, so that when motion resumes at a later time in the recording, the video continues to play at the same speed.
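One possible realization of a motion-dependent smoothing weight based on equation (8) is sketched below: when the recorded trajectory is nearly stationary at the current play time, the weight stays near zero so the speed estimate is not updated. The saturating mapping from v(t) to w is an assumption; the description only requires that w increase with the amount of recorded motion.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def recorded_velocity(video_poses, t, dt=1, joint="right hand"):
    """Equation (8): pose-space speed of the recorded trajectory at frame t."""
    t_next = min(t + dt, len(video_poses) - 1)
    return dist(video_poses[t][joint], video_poses[t_next][joint]) / dt

def smoothing_weight(v, v_ref=5.0, w_max=0.5):
    """Map recorded motion v(t) to a smoothing weight w in [0, w_max].
    v_ref and the saturating form are illustrative assumptions."""
    return w_max * v / (v + v_ref)

video_poses = [{"right hand": (10 * t, 0)} for t in range(5)] + \
              [{"right hand": (40, 0)} for _ in range(5)]      # motion, then a hold
for t in range(len(video_poses)):
    v = recorded_velocity(video_poses, t)
    print(t, round(smoothing_weight(v), 2))   # weight drops to 0 during the hold
```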
In the case of a trajectory that intersects itself, "flicker" may be generated near the intersection: a slight disturbance in the tracked position may cause successive frames to be presented from significantly different recording times, resulting in a discontinuity in the video playback. To address this event, a term may be added to the match score that disfavors large time differences from the current play time. Further, a synchronized playback mode may be provided in which the system maintains continuity but adapts the play speed to follow the user. Additionally, tracking of additional joints may be added. In the event that there is a period of time in the recording with no significant movement of the tracked joints, the third mode as described above may be used, with a motion-related update rate also as described above.
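A possible form of the penalty term mentioned above is sketched below: the match score is augmented with a term that grows with the gap between a candidate frame time and the current play time, so that near a trajectory self-intersection the frame closer in time wins. The linear form and the weight alpha are assumptions for illustration.

```python
def penalized_match(match_score, t, current_play_time, alpha=0.5):
    """Add a term disfavoring candidate frames far from the current play time.
    alpha (the penalty weight) and the linear form are illustrative assumptions."""
    return match_score + alpha * abs(t - current_play_time)

# Usage: wrap the per-frame match score before taking the argmin over t.
scores = {10: 1.0, 50: 0.9}           # two near-identical poses at an intersection
current = 12
best = min(scores, key=lambda t: penalized_match(scores[t], t, current))
print(best)  # -> 10: continuity is preserved despite frame 50's slightly better score
```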
According to some example implementations, a user may position himself or herself such that the position of his or her joints in the video image may be used to control the system. For example, but not by way of limitation, the user may stand in a position with his or her joints proximate to the joint locations in the video that the user may wish to control or associate.
Alternatively, a transformation may be applied from the tracked joint positions to registered positions approximately aligned with the joints in the video. For example, but not by way of limitation, the tracked joints may be translated by a translation function T that aligns head positions, centroids, joint points, and the like. The registration may be performed at the beginning of use and updated on an explicit request or periodically (e.g., every frame). For frequently occurring updates, a small update weight may be applied to the motion observed over a period of time, such that the time scale on which the registration adapts is slower than the time scale of the typical motion of the activity associated with the training.
For 3D positions, registration may implement a Procrustes fit to obtain a least-squares alignment of points with respect to translation, rotation, and optionally scaling transformations. An additional registration method may employ techniques for motion adaptation between skeletonized models. The 3D example implementation may have advantages. For example, but not by way of limitation, when a user performs motions that involve moving back and forth relative to the position of the camera, 3D joint tracking may detect and process these motions with significantly greater accuracy and precision than the 2D example implementation.
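A least-squares Procrustes fit of the kind mentioned above (translation and rotation; the optional scaling is omitted) can be sketched with NumPy as follows. This is a generic orthogonal Procrustes (Kabsch) solution offered for illustration, not the specific registration of the described implementation.

```python
import numpy as np

def procrustes_align(live_points, video_points):
    """Find rotation R and translation t minimizing ||R @ live_i + t - video_i||^2.
    Both inputs are (N, 3) arrays of corresponding 3D joint positions."""
    live_c = live_points - live_points.mean(axis=0)
    video_c = video_points - video_points.mean(axis=0)
    u, _, vt = np.linalg.svd(live_c.T @ video_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))             # correct for reflections
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = video_points.mean(axis=0) - r @ live_points.mean(axis=0)
    return r, t

# Toy check: recover a known rotation and translation of four joints.
rng = np.random.default_rng(0)
video = rng.normal(size=(4, 3))
angle = np.pi / 6
true_r = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
live = (video - np.array([0.1, 0.2, 0.3])) @ true_r    # displaced, rotated copy
r, t = procrustes_align(live, video)
print(np.allclose(r @ live.T + t[:, None], video.T))   # -> True
```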
According to an example implementation, the system may automatically identify when the user switches between, for example, one workout and another workout. In one example implementation, the motion being performed live may be used as a query to an information base, such as historical data, and the system may use a matching function to automatically learn or predict activities.
In addition, the user activity being sensed by the system may be recorded and provided to a coach or, for example, a physical therapist. Thus, the user's activity can be recorded, fed through the tracker and converted to recorded content, and provided to the coach for comparison with the pre-recorded video using the matching function.
Fig. 4 illustrates an example process 400 according to an example implementation. As illustrated herein, the example process 400 may be executed on one or more devices.
At 401, a playback video or training video is received. For example, but not by way of limitation, the training video may be a recording of a trainer performing the sport. During training, a training video will be displayed to the user on the display.
At 403, the training video is processed to define a sequence of video poses in the training video. This process may be performed in real time as the video is recorded. Alternatively, the video pose may be captured after the training video has been recorded (e.g., during offline processing of the video). As described above, the pose is associated with tracking of one or more joints.
Thus, after 403, a training video has been generated. The training video may be used by one or more users in training the motion performed in the video content. Operations 401 and 403 need only be performed the first time a video is generated, and need not be performed each time a user views a video.
At 405, the user performs the motion sensed by the sensor. For example, but not by way of limitation, the sensor may be a camera that senses real-time motion of the user as video input. Depending on the playback timing mode, the user may perform the motion simultaneously with the video playback, with the video leading and the user following. Alternatively, as explained herein, the user may lead and the video may track the user.
Live tracking is also performed in real time from the sensed video input at 407. As a result of the live tracking, one or more live gestures of the user are detected in real-time in the sensed video input.
Once one or more live poses of the user are detected in real-time in the sensed video input, and the video poses of the training video have been previously defined, as described above, a matching function may be applied at 409. Aspects of the matching function are described above. As a result of the matching function, a play time (e.g., t*) is defined for the frame. Accordingly, the playing of the pre-recorded playback video may be determined in terms of a play time based on the tracked poses of the joints. As described above, the matching mode may include single-joint point and drag, multi-joint tracking, or a combination thereof. In this example implementation, the argmin function is used as described above. However, as will be appreciated by those skilled in the art, other functions may be substituted therefor.
At 411, the playback timing mode is evaluated as to whether the video playback system is a mode (e.g., a "speed mode") that tracks the motion of the user and the track of the video and estimates the speed of the motion. As described below, if the playback timing mode is in the "speed mode", operations 415 to 419 are performed. If the playback timing mode is not in the "speed mode", operation 413 is performed.
For playback timing modes that are not in "speed mode," recorded frames of the training video are output 413 for the play time defined in operation 409. Then, as the user continues to change his or her posture, the operation continues.
For playback timing mode in "speed mode," at 415, the instantaneous speed estimate is determined as described above.
At 417, the determined instantaneous speed estimate is applied to the smoothing weight w as described above to determine the play speed. Alternatively, and as described above, the value of w may be determined empirically, or based on a relationship between how fast the track in the recorded video changes over time (e.g., v (t)).
At 419, the play time is updated by adding the play speed multiplied by the time interval, and then operation 413 is performed as described above.
Fig. 5 illustrates an example computing environment 500 having an example computer apparatus 505 suitable for use in some example implementations. The computing device 505 in the computing environment 500 may include one or more processing units, cores or processors 510, memory 515 (e.g., RAM, ROM, etc.), internal storage 520 (e.g., magnetic, optical, solid state storage, and/or organic), and/or I/O interfaces 525, any of which may be coupled to a communication mechanism or bus 530 for communicating information or embedded in the computing device 505.
Computing device 505 may be communicatively coupled to input/interface 535 and output device/interface 540. Either or both of input/interface 535 and output device/interface 540 may be wired or wireless interfaces and may be removable. Input/interface 535 may include any device, component, sensor, or interface (physical or virtual) that may be used to provide input (e.g., buttons, touch screen interfaces, keyboards, pointing/cursor controls, microphones, cameras, braille, motion sensors, optical readers, etc.).
Output device/interface 540 may include a display, television, monitor, printer, speakers, braille, etc. In some example implementations, input/interface 535 (e.g., a user interface) and output device/interface 540 may be embedded in or physically coupled to computing device 505. In other example implementations, other computing devices may be used as or provide the functionality of input/interface 535 and output device/interface 540 for computing device 505.
Examples of computing devices 505 may include, but are not limited to, highly mobile devices (e.g., smart phones, devices in vehicles and other machines, devices carried by humans and animals, etc.), mobile devices (e.g., tablets, notebooks, laptop computers, personal computers, portable televisions, radios, etc.), and devices that are not designed for mobility (e.g., desktop computers, server devices, other computers, kiosks, televisions with embedded and/or coupled with one or more processors, radios, etc.).
Computing device 505 may be communicatively coupled (e.g., via I/O interface 525) to external storage 545 and network 550 for communication with any number of networked components, devices, and systems, including one or more computing devices of the same or different configurations. Computing device 505 or any connected computing device may serve, provide its services, or be referred to as a server, a client, a thin server, a general purpose machine, a special purpose machine, or another tag. For example, but not by way of limitation, the network 550 may include a blockchain network and/or a cloud.
I/O interface 525 may include, but is not limited to, a wired and/or wireless interface using any communication or I/O protocol or standard (e.g., Ethernet, 802.11x, Universal Serial Bus, WiMAX, modem, cellular network protocol, etc.) for communicating information at least to and/or from all connected components, devices, and networks in computing environment 500. The network 550 may be any network or combination of networks (e.g., the Internet, a local area network, a wide area network, a telephone network, a cellular network, a satellite network, etc.).
Computing device 505 may communicate using and/or communicate using computer-usable or computer-readable media, including transitory media and non-transitory media. Transitory media include transmission media (e.g., metal cables, optical fibers), signals, carrier waves, and the like. Non-transitory media include magnetic media (e.g., disks and tapes), optical media (e.g., CD ROMs, digital video disks, blu-ray disks), solid state media (e.g., RAMs, ROMs, flash memory, solid state storage), and other non-volatile storage or memory.
Computer device 505 may be used to implement techniques, methods, applications, processes, or computer-executable instructions in some example computing environments. Computer-executable instructions may be retrieved from a transitory medium, as well as stored on and retrieved from a non-transitory medium. The executable instructions may be derived from one or more of any programming, scripting, and machine language (e.g., C, C++, C#, Java, Visual Basic, Python, Perl, JavaScript, etc.).
Processor 510 may execute under any Operating System (OS) (not shown) in a native or virtual environment. One or more applications may be deployed including a logic unit 555, an Application Programming Interface (API) unit 560, an input unit 565, an output unit 570, a matching unit 575, a tracking unit 580, a playback timing unit 585, and an inter-unit communication mechanism 595 for the different units to communicate with each other, with the OS, and with other applications (not shown).
For example, the matching unit 575, the tracking unit 580, and the playback timing unit 585 may implement one or more processes shown above with respect to the above-described structure. The design, function, configuration, or implementation of the described units and elements may vary and are not limited to the descriptions provided.
In some example implementations, when information or execution instructions are received through API unit 560, they may be communicated to one or more other units (e.g., logic unit 555, input unit 565, matching unit 575, tracking unit 580, and playback timing unit 585).
For example, the matching unit 575 may receive and process information associated with a pre-recorded playback video having captured video poses and a real-time live video feed in which poses are detected. The output of the matching unit may provide a value indicative of a time t* associated with a comparison of the tracked joints between the pre-recorded playback video and the live pose of the user. For example, the output of the matching unit 575 may be used to determine a play time, which in turn is used to determine which recorded frame should be displayed. Similarly, a tracking unit 580 may be provided to perform the function of tracking the movement speed of the user along the trajectory of the video as described above. In addition, the playback timing unit 585 may perform playback based on information obtained from the matching unit 575 and the tracking unit 580.
In some cases, in some example implementations described above, logic unit 555 may be configured to control the flow of information between units and direct services provided by API unit 560, input unit 565, matching unit 575, tracking unit 580, and playback timing unit 585. For example, the flow of one or more processes or implementations may be controlled by the logic unit 555 alone or in conjunction with the API unit 560.
FIG. 6 illustrates an example environment suitable for some example implementations. Environment 600 includes devices 605-645, and each device is communicatively connected to at least one other device via, for example, network 660 (e.g., by a wired and/or wireless connection). Some devices may be communicatively connected to one or more storage devices 630 and 645.
Examples of the one or more devices 605-645 may each be the computing device 505 described in fig. 5. Devices 605-645 may include, but are not limited to, a computer 605 (e.g., a laptop computing device) having a monitor and associated webcam as described above, a mobile device 610 (e.g., a smartphone or tablet), a television 615, a device associated with a vehicle 620, a server computer 625, computing devices 635-640, and storage devices 630 and 645.
In some implementations, devices 605-620 can be considered user devices associated with users that can remotely receive a display of a workout at various locations. The devices 625-645 may be devices associated with a service provider (e.g., for storing and processing information associated with the operation of the display device, the sensing device, and settings associated with the matching mode and the playback timing mode). In this example implementation, one or more of these user devices may be associated with sensor locations proximate to the user to enable sensing of real-time motion of the user and providing a real-time live video feed to the system to facilitate processing of gestures based on tracking of one or more joints of the user as described above.
Example implementations may have various benefits and advantages. For example, but not by way of limitation, example implementations of the invention do not require the user to hold an object, such as a remote control or mouse, that can change the user's body movements, or cause the user to become distracted from the training process. Rather, example implementations track the user's body movements so that the user can maintain attention to the training activity.
Additionally, according to an example implementation, multi-joint tracking is provided. As described above, the user may select the number of joints to track according to the user's intent (e.g., the user is only focused on the motion of a single joint, or on learning a single activity associated with a single trajectory, as opposed to the user being focused on learning a series of motions as part of a more comprehensive training).
Further, example implementations provide for multiple playback modes that can be determined interactively by the system itself or set manually. For example, and not by way of limitation, example implementations provide not only a user-led, video-following mode, but also a video-led, user-following mode, as well as a mutually synchronized mode. In contrast, prior art methods do not provide for synchronization of the video with the user as illustrated herein with respect to example implementations.
Additionally, according to example implementations, events that the user must perform in order to proceed to other parts of the training need not be predefined. In contrast, with the motion analysis schemes described herein, example implementations provide for matching gestures from pre-recorded video to live motion of a user's body and user motion so that the user can experience actions from the video without having to pre-define the actions.
According to various aspects of example implementations, simulated training may be used in which a user performs a set of gestures or motions in real time that are compared to a prescribed set of training gestures or motions, without requiring the user to wear or hold any means for tracking joint positions in order for the gestures to be recognized.
According to another example implementation, aspects described herein may be applied to entertainment activities. For example, a video of a celebrity performing or dancing may be the playback video, and the user may be a student performer or entertainer aiming to learn the skill of that particular celebrity. By applying trajectories, arrows, and/or stretch bands, the user may learn to control the fine motion of the tracked joints to be able to mimic the celebrity's style or technique. Furthermore, as the user becomes more skilled in specific motions, the multi-joint functionality may be applied to allow the user to learn additional motions and, ultimately, the celebrity's overall body motion. Since the speed of the pre-recorded video is based on the user's motion, the user may practice certain motions, styles, or techniques without being distracted and having to rewind or return to a specified point in the video as in prior art methods.
Additionally, there may be other aspects and example implementations associated with the inventive concepts. For example, but not by way of limitation, a query mechanism may be provided such that for a desired sequence of poses, a search may be performed to find similar available online videos. In this example implementation, the user's real-time live video is replaced with pre-recorded online video segments that are compared to the pre-recorded video using a matching function to determine the similarity and match of joint motion.
Further, where the user performs one or more motions that include a plurality of gestures, a query may be performed on the user's motions to determine the closest set of pre-recorded videos. The most relevant query results are then classified, and the user may be informed that his or her movements appear most similar to the most relevant categories. For example, but not by way of limitation, a user may attempt to perform a certain type of dance. However, the user's movements may be more similar to another type of dance than to the type of dance in the pre-recorded training video. The user may be informed that the dance he or she is performing is a different dance, and may be provided with the option to train on that different dance instead of the dance in the pre-recorded training video.
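By way of illustration only, the following Python sketch shows one non-limiting way a user's pose sequence could be compared against a small library of labeled pre-recorded sequences. The uniform resampling and mean per-frame distance stand in for whatever matching function is actually used (dynamic time warping would be another option), and the array shapes are illustrative assumptions.

    import numpy as np

    def resample(sequence, n):
        """Resample a (frames, joints, 2) pose array to n frames by nearest frame index."""
        sequence = np.asarray(sequence, dtype=float)
        idx = np.linspace(0, len(sequence) - 1, n).round().astype(int)
        return sequence[idx]

    def sequence_distance(a, b, n=50):
        """Mean per-frame, per-joint distance after resampling both sequences."""
        a, b = resample(a, n), resample(b, n)
        return float(np.mean(np.linalg.norm(a - b, axis=-1)))

    def classify_motion(user_sequence, library):
        """library: dict mapping a label (e.g. a dance style) to a pre-recorded pose sequence."""
        scores = {label: sequence_distance(user_sequence, seq)
                  for label, seq in library.items()}
        return min(scores, key=scores.get), scores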
Similarly, a user performing some type of exercise or physical therapy may be provided with an indication that the most relevant joint motion associated with the user is actually different from that in the pre-recorded video. In this case, it may be recommended that the user attempt to show the type of exercise or physical therapy most relevant to the query, rather than the type of exercise or physical therapy selected by the user.
As an additional aspect of the example implementations, and as described above, the "speed mode" may provide an indication that the user has paused the activity or that the motion has begun to slow down. The pause or slowdown may indicate that the user is becoming tired, is losing interest in the activity, or is experiencing pain (e.g., in the case of physical therapy). Thus, for example, the results of the "speed mode" may be used as an indicator to recommend that the user consider taking a break, trying another activity, or resuming the pre-recorded video at a later time.
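By way of illustration only, the following Python sketch shows one non-limiting way a pause or slowdown could be flagged from the recent history of matched video times. The window length and slowdown threshold are illustrative assumptions.

    def speed_indicator(matched_times, window=30, slow_rate=0.25):
        """Classify recent progress through the pre-recorded video as paused,
        slowing down, or on pace, based on the last `window` matched times."""
        if len(matched_times) < window:
            return "warming_up"
        advance = matched_times[-1] - matched_times[-window]
        rate = advance / window  # average video-time advance per live frame
        if rate <= 0.0:
            return "paused"
        if rate < slow_rate:
            return "slowing_down"
        return "on_pace"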
Alternatively, the numerical value associated with the matching function may be used to generate a similarity score between the user's motion and the pre-recorded video. These scores may be used to determine which user, among multiple users performing the motion against a common pre-recorded video, best matches the pre-recorded video. For example, in determining which user most closely mimics a defined set of movements associated with sports, dancing, performing, etc., the user with the higher score may be identified based on the value associated with the matching function.
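By way of illustration only, the following Python sketch shows one non-limiting way matching-function values could be converted to scores and used to rank users. The inverse-distance mapping and scale are illustrative assumptions.

    def similarity_score(distance, scale=100.0):
        """Map a matching-function distance to a higher-is-better score."""
        return scale / (1.0 + distance)

    def rank_users(user_distances):
        """user_distances: dict of user name -> mean matching distance against the common video."""
        scored = {user: similarity_score(d) for user, d in user_distances.items()}
        return sorted(scored.items(), key=lambda item: item[1], reverse=True)

    # Example: the user with the smallest distance receives the highest score.
    print(rank_users({"user_a": 3.2, "user_b": 1.1, "user_c": 5.7}))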
According to other example implementations, aspects may be integrated into an augmented reality or virtual reality scene. For example, but not by way of limitation, for forensics and identification of individuals, the body movements of an individual captured in a security camera recording may be compared in real time to a live video feed at the same location, to determine whether, and how often, an individual with matching gait and body movements appears. Furthermore, the inclusion of a 3D camera may allow a user in an augmented reality or virtual reality scene to perform body movements in a manner appropriate to the scene.
In the example implementations described above, various terms are used to describe aspects and features. However, example implementations are not so limited, and other terms and descriptions may be substituted without departing from the scope of the invention. For example, but not by way of limitation, a computer-implemented method includes coordinating a sensed gesture with a pre-recorded gesture. The sensed gesture may be sensed by a 2D or 3D video camera as described above. Further, the sensed gesture may be associated with real-time motion of a first object in the live video feed (e.g., but not limited to, a user sensed by a video camera). The pre-recorded gesture may be associated with motion of a second object (e.g., a coach in the video). However, the first object need not be a user and the second object need not be a coach.
Further, the computer-implemented method includes applying a matching function to determine a match between a point of one of the sensed gestures and a corresponding point of the pre-recorded gesture and, based on the match, determining a play time and outputting at least one frame of the video associated with the second object for the play time. In some example implementations, the point of one of the sensed gestures and the corresponding point of the pre-recorded gesture may be a joint or joints (e.g., hand, foot, etc.) of the user. Additionally, as will be understood by those skilled in the art, the matching functions to be applied may include those described above, as well as other matching functions substituted therefor.
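By way of illustration only, the following Python sketch shows one non-limiting way the play time could be updated from a matched video time in the manner recited in claims 6 and 7 below. The exponential smoothing with an assumed factor alpha stands in for an empirically determined smoothing factor, and matched_time is assumed to come from a matching step such as the window search sketched earlier.

    def update_playback(play_time, play_speed, matched_time, dt, alpha=0.2):
        """One update step: estimate instantaneous speed from the matched time,
        smooth it into a play speed, and advance the play time accordingly."""
        instant_speed = (matched_time - play_time) / dt                   # instantaneous speed estimate
        play_speed = (1.0 - alpha) * play_speed + alpha * instant_speed   # smoothed play speed
        play_time = play_time + play_speed * dt                           # updated play time
        return play_time, play_speed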
Although a few example implementations have been shown and described, these example implementations are provided to convey the subject matter described herein to those skilled in the art. It should be understood that the subject matter described herein may be implemented in various forms and is not limited to the example implementations described. The subject matter described herein may be practiced without those specifically defined or described elements or with other or different elements or elements not described. It will be appreciated by those skilled in the art that changes may be made to these exemplary implementations without departing from the subject matter described herein as defined in the appended claims and their equivalents.

Claims (20)

1. A computer-implemented method for coordinating a sensed pose associated with real-time motion of a first object in a live video feed with a pre-recorded pose associated with motion of a second object in the video, the computer-implemented method comprising the steps of:
applying a matching function to determine a match between a point of one of the sensed gestures and a corresponding point of the pre-recorded gesture; and
based on the matching, determining a play time and outputting at least one frame of the video associated with the second object for the play time.
2. The computer-implemented method of claim 1, wherein the point of one of the sensed gestures and the corresponding point of the pre-recorded gesture comprise a single point of the first object and the second object in a first mode and a plurality of points of the first object and the second object in a second mode.
3. The computer-implemented method of claim 2, wherein a user control is provided to select between the first mode and the second mode, and for the second mode, the number and location of the plurality of points of the first object and the second object are selected.
4. The computer-implemented method of claim 1, wherein the first object comprises a user receiving the output of the video and sequentially performing gestures to coordinate with the pre-recorded gestures, and the second object comprises a coach displayed in the video that sequentially performs the pre-recorded gestures.
5. The computer-implemented method of claim 4, wherein the motion comprises one of a physical therapy activity, at least a portion of an exercise, at least a portion of a dance, or at least a portion of a performance.
6. The computer-implemented method of claim 1, further comprising the steps of:
determining an instantaneous speed estimate over a time window based on a result of applying the matching function and determining the play time;
applying a smoothing method to the instantaneous speed estimate to generate a play speed; and
updating the playing time based on the playing speed.
7. The computer-implemented method of claim 6, wherein the smoothing method comprises utilizing an empirically determined smoothing factor or a running average of a variable that changes based on a degree of motion of the second object over the time window.
8. The computer-implemented method of claim 1, wherein the video includes one or more visual feedback indicators associated with the pre-recorded gesture of the second object.
9. The computer-implemented method of claim 8, wherein the one or more visual feedback indicators comprise one or more of a trajectory indicative of a path of motion of the corresponding point of the second object, a trajectory indicative of a relative speed and a relative direction of the corresponding point of the second object along the trajectory, and a stretch band indicative of a difference between a position of the corresponding point of the second object and a position of the point of the first object.
10. A system capable of coordinating a sensed gesture associated with real-time motion of a first object in a live video feed with a pre-recorded gesture associated with motion of a second object in the video, the system configured to perform the following operations:
applying a matching function to determine a match between a point of one of the sensed gestures and a corresponding point of the pre-recorded gesture; and is
Based on the matching, determining a play time and outputting at least one frame of the video associated with the second object for the play time.
11. The system of claim 10, wherein the point of one of the sensed gestures and the corresponding point of the pre-recorded gesture comprise a single point of the first object and the second object in a first mode and a plurality of points of the first object and the second object in a second mode, wherein a user control is provided to select between the first mode and the second mode and, for the second mode, to select a number and a location of the plurality of points of the first object and the second object.
12. The system of claim 10, wherein the first object comprises a user receiving the output of the video and sequentially performing gestures to coordinate with the pre-recorded gestures, and the second object comprises a coach displayed in the video that sequentially performs the pre-recorded gestures, wherein the motion comprises one of a physical therapy activity, at least a portion of an exercise, at least a portion of a dance, or at least a portion of a performance.
13. The system of claim 10, further comprising:
determining an instantaneous speed estimate over a time window based on a result of applying the matching function and determining the play time;
applying the instantaneous speed estimate to a smoothing factor to generate a play speed; and
updating the playing time based on the playing speed,
wherein the smoothing factor comprises a fixed value determined empirically or a variable that changes based on the degree of motion of the second object over the time window.
14. The system of claim 10, wherein the video comprises one or more visual feedback indicators associated with the pre-recorded pose of the second object, wherein the one or more visual feedback indicators comprise one or more of a trajectory indicative of a motion path of the corresponding point of the second object, a trajectory indicative of a relative speed and a relative direction of the corresponding point of the second object along the trajectory, and a stretch band indicative of a difference between a position of the corresponding point of the second object and a position of the point of the first object.
15. A non-transitory computer-readable medium having storage storing instructions for coordinating a sensed pose associated with real-time motion of a first object in a live video feed and a pre-recorded pose associated with motion of a second object in a video, the instructions being executed by a processor, the instructions comprising:
applying a matching function to determine a match between a point of one of the sensed gestures and a corresponding point of the pre-recorded gesture; and
based on the matching, determining a play time and outputting at least one frame of the video associated with the second object for the play time.
16. The non-transitory computer-readable medium of claim 15, wherein the point of one of the sensed gestures and the corresponding point of the pre-recorded gesture include a single point of the first object and the second object in a first mode and a plurality of points of the first object and the second object in a second mode, wherein a user control is provided to select between the first mode and the second mode and for the second mode a number and a position of the plurality of points of the first object and the second object are selected.
17. The non-transitory computer readable medium of claim 15, wherein the first object comprises a user receiving the output of the video and sequentially performing gestures to coordinate with the pre-recorded gestures, and the second object comprises a coach displayed in the video that sequentially performs the pre-recorded gestures, wherein the motion comprises one of a physical therapy activity, at least a portion of an exercise, at least a portion of a dance, or at least a portion of a performance.
18. The non-transitory computer-readable medium of claim 15, wherein the instructions further comprise:
determining an instantaneous speed estimate over a time window based on a result of applying the matching function and determining the play time;
applying the instantaneous speed estimate to a smoothing factor to generate a play speed; and
updating the playing time based on the playing speed,
wherein the smoothing factor comprises a fixed value determined empirically or a variable that changes based on the degree of motion of the second object over the time window.
19. The non-transitory computer-readable medium of claim 15, wherein the video includes one or more visual feedback indicators associated with the pre-recorded gesture of the second object.
20. The non-transitory computer-readable medium of claim 19, wherein the one or more visual feedback indicators comprise one or more of a trajectory indicative of a motion path of the corresponding point of the second object, a trajectory indicative of a relative speed and a relative direction of the corresponding point of the second object along the trajectory, and a stretch band indicative of a difference between a position of the corresponding point of the second object and a position of the point of the first object.
CN202010489056.3A 2019-06-27 2020-06-02 Method, computer readable medium and system for synchronizing video playback with user motion Pending CN112153468A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/455,485 2019-06-27
US16/455,485 US10811055B1 (en) 2019-06-27 2019-06-27 Method and system for real time synchronization of video playback with user motion

Publications (1)

Publication Number Publication Date
CN112153468A true CN112153468A (en) 2020-12-29

Family

ID=72838498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010489056.3A Pending CN112153468A (en) 2019-06-27 2020-06-02 Method, computer readable medium and system for synchronizing video playback with user motion

Country Status (3)

Country Link
US (1) US10811055B1 (en)
JP (1) JP7559360B2 (en)
CN (1) CN112153468A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113923361A (en) * 2021-10-19 2022-01-11 北京字节跳动网络技术有限公司 Data processing method, device, equipment and computer readable storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DK181103B1 (en) 2020-05-11 2022-12-15 Apple Inc User interfaces related to time
JP7486441B2 (en) 2021-01-20 2024-05-17 株式会社アマダ Bending method
US11195552B1 (en) 2021-03-17 2021-12-07 International Business Machines Corporation Playback control of a video based on competency assessment
WO2022212946A1 (en) * 2021-04-02 2022-10-06 Ifit Inc. Virtual environment workout controls
US20230211226A1 (en) * 2022-01-04 2023-07-06 Liteboxer Technologies, Inc. Embedding a trainer in virtual reality (VR) environment using chroma-keying
KR20240009746A (en) * 2022-07-14 2024-01-23 현대자동차주식회사 Vehicle apparatus for displaying a receiving contents image information from external device and method thereof
US20240077937A1 (en) * 2022-09-06 2024-03-07 Apple Inc. Devices, methods, and graphical user interfaces for controlling avatars within three-dimensional environments

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110275045A1 (en) * 2010-01-22 2011-11-10 Foerster Bhupathi International, L.L.C. Video Overlay Sports Motion Analysis
CN102937860A (en) * 2011-10-25 2013-02-20 微软公司 Distribution semi-synchronous even driven multimedia playback
CN106325481A (en) * 2015-06-30 2017-01-11 展讯通信(天津)有限公司 A non-contact type control system and method and a mobile terminal
US20170266491A1 (en) * 2016-03-21 2017-09-21 Ying Chieh Mitchell Method and system for authoring animated human movement examples with scored movements
KR20180039321A (en) * 2016-10-10 2018-04-18 이철우 Method and program for producing reactive video and sub-file to make reactive video
US20180268865A1 (en) * 2017-03-20 2018-09-20 International Business Machines Corporation Auto-adjusting instructional video playback based on cognitive user activity detection analysis
CN108805068A (en) * 2018-06-01 2018-11-13 李泽善 A kind of motion assistant system, method, apparatus and medium based on student movement
CN109144247A (en) * 2018-07-17 2019-01-04 尚晟 The method of video interactive and based on can interactive video motion assistant system

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4287231B2 (en) 2003-09-30 2009-07-01 株式会社日立製作所 Image information editing system
JP2008528195A (en) 2005-01-26 2008-07-31 ベントレー・キネティクス・インコーポレーテッド Method and system for analyzing and indicating motor movement
US20080263592A1 (en) 2007-04-18 2008-10-23 Fuji Xerox Co., Ltd. System for video control by direct manipulation of object trails
US20100195867A1 (en) * 2009-01-30 2010-08-05 Microsoft Corporation Visual target tracking using model fitting and exemplar
US8755569B2 (en) * 2009-05-29 2014-06-17 University Of Central Florida Research Foundation, Inc. Methods for recognizing pose and action of articulated objects with collection of planes in motion
US8487871B2 (en) * 2009-06-01 2013-07-16 Microsoft Corporation Virtual desktop coordinate transformation
JP6171374B2 (en) * 2013-02-06 2017-08-02 ソニー株式会社 Information processing apparatus, information processing method, program, and information processing system
AU2014244138A1 (en) 2013-03-13 2015-10-01 James Witt Method and apparatus for teaching repetitive kinesthetic motion
US20150139505A1 (en) * 2013-11-18 2015-05-21 Electronics And Telecommunications Research Institute Method and apparatus for predicting human motion in virtual environment
US10304458B1 (en) * 2014-03-06 2019-05-28 Board of Trustees of the University of Alabama and the University of Alabama in Huntsville Systems and methods for transcribing videos using speaker identification
US9317921B2 (en) * 2014-07-10 2016-04-19 Qualcomm Incorporated Speed-up template matching using peripheral information
GB201415586D0 (en) * 2014-09-03 2014-10-15 Green Tim Game system
US9865072B2 (en) * 2015-07-23 2018-01-09 Disney Enterprises, Inc. Real-time high-quality facial performance capture
US9734435B2 (en) * 2015-12-31 2017-08-15 Microsoft Technology Licensing, Llc Recognition of hand poses by classification using discrete values
US9600717B1 (en) * 2016-02-25 2017-03-21 Zepp Labs, Inc. Real-time single-view action recognition based on key pose analysis for sports videos
US10127668B2 (en) * 2016-03-04 2018-11-13 Disney Enterprises, Inc. Systems and methods for re-identifying objects in images
US10163003B2 (en) * 2016-12-28 2018-12-25 Adobe Systems Incorporated Recognizing combinations of body shape, pose, and clothing in three-dimensional input images
US20180225799A1 (en) * 2017-02-03 2018-08-09 Cognex Corporation System and method for scoring color candidate poses against a color image in a vision system
US20190188533A1 (en) * 2017-12-19 2019-06-20 Massachusetts Institute Of Technology Pose estimation
CN112469269B (en) * 2018-05-21 2023-04-28 康帕宁实验室公司 Method for autonomously training animals to respond to oral commands

Also Published As

Publication number Publication date
JP7559360B2 (en) 2024-10-02
US10811055B1 (en) 2020-10-20
JP2021007214A (en) 2021-01-21

Similar Documents

Publication Publication Date Title
CN112153468A (en) Method, computer readable medium and system for synchronizing video playback with user motion
US11858118B2 (en) Robot, server, and human-machine interaction method
US11798237B2 (en) Method for establishing a common reference frame amongst devices for an augmented reality session
AU2017203641B2 (en) Interactive virtual reality platforms
US20200005544A1 (en) Analyzing 2d movement in comparison with 3d avatar
US8418085B2 (en) Gesture coach
Varona et al. Hands-free vision-based interface for computer accessibility
RU2708027C1 (en) Method of transmitting motion of a subject from a video to an animated character
US20160314620A1 (en) Virtual reality sports training systems and methods
KR101563312B1 (en) System for gaze-based providing education content
JPWO2014162787A1 (en) Body motion scoring device, dance scoring device, karaoke device, and game device
KR20140059109A (en) System and method for human computer interaction
Clarke et al. Reactive video: adaptive video playback based on user motion for supporting physical activity
JP6165815B2 (en) Learning system, learning method, program, recording medium
JP7278307B2 (en) Computer program, server device, terminal device and display method
CN112424736A (en) Machine interaction
WO2021044787A1 (en) Information processing device, information processing method, and program
WO2020145224A1 (en) Video processing device, video processing method and video processing program
US20210385554A1 (en) Information processing device, information processing method, and information processing program
Jan et al. Augmented tai-chi chuan practice tool with pose evaluation
CN105225270A (en) A kind of information processing method and electronic equipment
US20230162458A1 (en) Information processing apparatus, information processing method, and program
KR20200081529A (en) HMD based User Interface Method and Device for Social Acceptability
Jo et al. Enhancing virtual and augmented reality interactions with a mediapipe-based hand gesture recognition user interface
KR102438488B1 (en) 3d avatar creation apparatus and method based on 3d markerless motion capture

Legal Events

Date Code Title Description
PB01 Publication
CB02 Change of applicant information
     Address after: Tokyo, Japan
     Applicant after: Fuji film business innovation Co.,Ltd.
     Address before: Tokyo, Japan
     Applicant before: Fuji Xerox Co.,Ltd.
SE01 Entry into force of request for substantive examination