CN112528768A - Action processing method and device in video, electronic equipment and storage medium

Action processing method and device in video, electronic equipment and storage medium

Info

Publication number
CN112528768A
CN112528768A (application CN202011349038.1A)
Authority
CN
China
Prior art keywords: video, action, keypoints, similarity, standard
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011349038.1A
Other languages
Chinese (zh)
Inventor
韩瑞
王丽云
李婷婷
郑任君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011349038.1A priority Critical patent/CN112528768A/en
Publication of CN112528768A publication Critical patent/CN112528768A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/42: Scenes; scene-specific elements in video content; higher-level semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items, of sport video content
    • G06F 18/22: Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06T 3/40: Geometric image transformations in the plane of the image; scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 7/38: Image analysis; determination of transform parameters for the alignment of images (image registration); registration of image sequences
    • G06T 7/62: Image analysis; analysis of geometric attributes of area, perimeter, diameter or volume
    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 40/23: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition; recognition of whole body movements, e.g. for sport training
    • G06T 2207/10016: Indexing scheme for image analysis or image enhancement; image acquisition modality; video; image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Geometry (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a method and a device for processing actions in a video, an electronic device and a computer-readable storage medium, relating to computer vision technology in the field of artificial intelligence. The method includes: displaying at least one imitation action performed by a participating object in a first video; and, in response to an action analysis trigger operation, synchronously displaying a plurality of imitation key points of the imitation action together with the imitation action, and applying to the plurality of imitation key points a display mode corresponding to the similarity, wherein the similarity characterizes the degree of agreement between the imitation action and a reference action. The method and the device can improve the efficiency of action imitation.

Description

Action processing method and device in video, electronic equipment and storage medium
Technical Field
The present application relates to artificial intelligence technology and internet technology, and in particular, to a method and an apparatus for processing an action in a video, an electronic device, and a computer-readable storage medium.
Background
Artificial Intelligence (AI) refers to theories, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. With continued research and development, artificial intelligence technology has been applied in a growing number of fields.
Taking the application scenario of motion analysis as an example, when imitating a specific motion (such as dance or martial arts), a user can usually only receive guidance through instructional videos or a coach; a video learner cannot perceive in real time whether his or her motion is standard, and although coaching is effective, a coach cannot give guidance at all times. In the related art, although body motions can be analyzed by a motion capture system, only the motions of a single person can be analyzed and no comparison is made, so the user still cannot perceive in real time whether his or her motions are standard. This reduces the efficiency of motion imitation and wastes motion analysis resources, and the related art offers no effective solution.
Disclosure of Invention
The embodiment of the application provides a method and a device for processing actions in a video, an electronic device and a computer-readable storage medium, which can improve the efficiency of action simulation.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a method for processing actions in a video, which comprises the following steps:
displaying at least one impersonation action implemented by the participant in the first video;
responding to action analysis trigger operation, synchronously displaying a plurality of imitation key points in the imitation action and the imitation action, and applying a display mode corresponding to the similarity to the plurality of imitation key points;
wherein the similarity characterizes a degree of agreement between the mimicking action and a reference action.
In the above solution, the comparing the first motion amplitude and the second motion amplitude to obtain the similarity between the simulated keypoint and the corresponding reference keypoint comprises:
determining the area occupied by all first mimic keypoints in the first video frame and the area occupied by all second reference keypoints in the second video frame;
determining the ratio of the area occupied by all first simulation key points in the first video frame to the area occupied by all second reference key points in the second video frame as a scaling ratio;
determining an amplitude ratio between the first motion amplitude and the second motion amplitude;
determining a product between the amplitude ratio and the scaling as a similarity between the mimicking keypoint and the corresponding reference keypoint.
An embodiment of the present application provides an action processing apparatus in a video, including:
a display module to display at least one impersonation action performed by a participant in a first video;
the analysis module is used for responding to action analysis trigger operation, synchronously displaying a plurality of imitation key points in the imitation action and the imitation action, and applying a display mode corresponding to the similarity to the plurality of imitation key points;
wherein the similarity characterizes a degree of agreement between the mimicking action and a reference action.
In the above solution, the display module is further configured to identify a plurality of key frames in the first video, and filter out key frames including the same mimic action from the plurality of key frames; and sequentially playing the key frames remained after filtering.
In the above scheme, the display module is further configured to play the standard video frames in the first video at a first play speed and to play the non-standard video frames in the first video at a second play speed, wherein the first play speed is greater than the second play speed; the standard video frames are video frames in which the number of non-standard key points in the imitation action is below a number threshold, and the non-standard video frames are video frames in which the number of non-standard key points in the imitation action is not less than the number threshold.
In the above solution, the analysis module is configured to determine similarities between the plurality of simulated key points and the plurality of reference key points of the reference action; applying a corresponding display mode to the simulated key points according to the similarity; wherein different similarities correspond to different display modes.
In the above scheme, the analysis module is further configured to display the standard key points and the non-standard key points in the imitation action in different display manners; wherein the non-standard key points are imitation key points whose similarity with the corresponding reference key points of the reference action is lower than a similarity threshold, and the standard key points are imitation key points whose similarity with the corresponding reference key points of the reference action is not lower than the similarity threshold.
In the above scheme, when the first video includes at least two participant objects, namely a first participant object and a second participant object, the analysis module is further configured to display a standard key point and a non-standard key point in the simulated motion of the first participant object in different display manners; determining a part corresponding to the non-standard key point; displaying the standard key points of the part in the second participating object in the same display mode as the non-standard key points of the part in the first participating object.
In the foregoing solution, the analysis module is further configured to determine a degree of significance corresponding to each of the simulated key points according to the similarity of each of the simulated key points, and display the simulated key points according to the degree of significance corresponding to each of the simulated key points; wherein there is a negative correlation between the degree of significance and the degree of similarity.
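As an illustrative sketch only (the mapping below, including the radius range, is an assumption and not part of the disclosure), the negative correlation between display significance and similarity could be realized by enlarging the markers of less similar key points:
```python
def significance_from_similarity(similarity, min_radius=4, max_radius=10):
    """Map a key point similarity to a display significance, here a marker radius.

    Lower similarity -> larger (more significant) marker, i.e. the negative
    correlation between significance and similarity described above.
    The radius range is an illustrative assumption.
    """
    s = min(max(similarity, 0.0), 1.0)  # clamp the similarity to [0, 1]
    return min_radius + (max_radius - min_radius) * (1.0 - s)
```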
In the above solution, when the reference action of the reference object is derived from a second video, the analysis module is further configured to align a plurality of video frames included in the second video and a plurality of video frames included in the first video to a same time axis; performing the following processing for each time point in the time axis: determining a first video frame of the first video aligned to the time point and a second video frame of the second video aligned to the time point; determining similarities between a plurality of mimic keypoints in the first video frame and a plurality of reference keypoints in the second video frame.
In the foregoing solution, the analysis module is further configured to perform at least one of the following operations: aligning a first audio frame in the first video with a second audio frame included in the second video to align a video frame synchronized with the first audio frame and a video frame synchronized with the second audio frame; the method includes extracting a mock action in a first video frame in the first video, determining a second video frame in the second video that includes a same reference action as the mock action, aligning the first video frame with the second video frame.
In the foregoing solution, the analysis module is configured to perform the following operations for a first mimic keypoint in the first video frame and a first reference keypoint in the second video frame that is in the same position as the first mimic keypoint: determining a first motion magnitude between the first mimic keypoint and a second mimic keypoint of the same location in a third video frame, wherein the third video frame is a video frame subsequent to the first video frame in the first video; determining a second motion amplitude between the first reference key point and a second reference key point at the same position in a fourth video frame, wherein the fourth video frame is a video frame after the second video frame in the second video; comparing the first motion magnitude and the second motion magnitude to obtain a similarity between the mimicking keypoint and a corresponding reference keypoint; wherein the first impersonation keypoint is any keypoint of a plurality of impersonation keypoints in the first video frame, and the first reference keypoint is any keypoint of a plurality of reference keypoints in the second video frame.
In the foregoing solution, the analysis module is further configured to determine an area occupied by all first simulation key points in the first video frame and an area occupied by all second reference key points in the second video frame; determining the ratio of the area occupied by all first simulation key points in the first video frame to the area occupied by all second reference key points in the second video frame as a scaling ratio; determining an amplitude ratio between the first motion amplitude and the second motion amplitude; determining a product between the amplitude ratio and the scaling as a similarity between the mimicking keypoint and the corresponding reference keypoint.
In the above scheme, when the first video includes a plurality of participating objects, the analysis module is further configured to display the plurality of participating objects included in the first video; in response to a selection operation for the plurality of participant objects, determining at least one selected participant object as a participant object for triggering an operation in response to the action analysis.
In the above scheme, the analysis module is further configured to identify an occlusion relationship between the participating objects in the first video, and automatically identify a participating object that satisfies an unoccluded condition as a participating object for responding to the action analysis trigger operation; wherein the non-occlusion condition comprises at least one of: in the playing process of the first video, the duration which is not shielded is greater than a duration threshold; in the playing process of the first video, the proportion between the duration which is not shielded and the total duration of the first video is greater than a duration proportion threshold; the ratio between the area of the occluded part and the overall area of the participating object is less than an area ratio threshold.
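A minimal sketch of how the non-occlusion conditions listed above might be checked (the inputs and thresholds are assumptions; the claims do not prescribe an implementation):
```python
def satisfies_unoccluded_condition(unoccluded_seconds, total_seconds,
                                   occluded_area, object_area,
                                   duration_threshold,
                                   duration_ratio_threshold,
                                   area_ratio_threshold):
    """Return True if a participating object meets at least one non-occlusion condition.

    The three checks mirror the conditions above: absolute unoccluded duration,
    ratio of unoccluded duration to total video duration, and ratio of occluded
    area to the object's overall area.
    """
    if unoccluded_seconds > duration_threshold:
        return True
    if total_seconds > 0 and unoccluded_seconds / total_seconds > duration_ratio_threshold:
        return True
    if object_area > 0 and occluded_area / object_area < area_ratio_threshold:
        return True
    return False
```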
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the action processing method in the video provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer-readable storage medium, which stores computer-executable instructions and is used for realizing the action processing method in the video provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
the simulation key points of the simulated motion of the participated object are compared with the reference key points of the reference motion, and the simulation key points synchronously displayed with the simulated motion are displayed in a differentiation mode according to the matching degree between the simulation key points and the reference key points, so that the participated object can sense the difference between the motion of the participated object and the reference motion in real time, the motion simulation efficiency is improved, and the motion analysis resources are saved.
Drawings
Fig. 1A, 1B, 1C, and 1D are schematic diagrams of application scenarios provided by the related art;
fig. 2 is an application scene diagram of a method for processing actions in a video according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a terminal 400 provided in an embodiment of the present application;
fig. 4A and fig. 4B are schematic flow charts of a method for processing an action in a video according to an embodiment of the present application;
fig. 5 is a flowchart illustrating a method for processing an action in a video according to an embodiment of the present application;
fig. 6 is an application scenario diagram of a method for processing an action in a video according to an embodiment of the present application;
fig. 7 is a flowchart illustrating a method for processing an action in a video according to an embodiment of the present application;
8A, 8B, 8C and 8D are schematic diagrams of action processing methods in video provided by embodiments of the present application;
fig. 9A, 9B, 9C, 9D, 9E, and 9F are schematic application scenarios of the method for processing actions in video according to the embodiment of the present application.
Detailed Description
In order to make the objectives, technical solutions and advantages of the present application clearer, the present application will be described in further detail with reference to the attached drawings, the described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.
1) "In response to" indicates the condition or state on which a performed operation depends; when the dependent condition or state is satisfied, the one or more operations performed may be carried out in real time or with a set delay. Unless otherwise specified, there is no restriction on the order in which the operations are performed.
2) The client is an application program running in the terminal for providing various services, such as a video client having an action analysis function.
3) Computer vision is a science that studies how to make machines "see": it uses cameras and computers in place of human eyes to identify, track and measure targets, and further performs graphics processing so that the result becomes an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
4) Human motion capture refers to a process of converting the motion of the main body joints of the human body into data.
5) Alignment means matching a plurality of materials (e.g., videos) with identical content to one another on the same time axis; it is generally used to place materials shot at multiple locations on the same editing timeline. In the present application, since the recording start times of the motion videos are not consistent, the videos need to be aligned before being compared with a reference action (also referred to as a standard action).
6) Participating objects, which refer to objects participating in the simulation; mimic action refers to an action performed by a participant.
7) The reference object refers to an object which is referred to when the participating object performs action simulation; the reference motion refers to a motion performed by a reference object.
The "digging action" refers to the action of the participators, which makes each imitation action tend to be standardized after practice, and the related technology usually depends on human eyes to observe the imitation action, so that some misjudgments often occur once the participators are too many, and each imitation action of each participator cannot be intuitively and specifically judged.
Referring to fig. 1A, 1B, 1C, and 1D, fig. 1A, 1B, 1C, and 1D are schematic diagrams of application scenarios provided in the related art.
Fig. 1A is an image captured by a marked motion capture system, a wearable easily-recognized mark point 101 is provided on a garment of a participant, and a plurality of time series data (including information such as motion tracks and motion accelerations of different limb key points) of the participant can be obtained by different means such as optical, inertial, visual motion or hybrid motion capture.
Fig. 1B and 1C are images captured by a pose detection (PoseNet) model that enables real-time analysis of pictures captured by a local camera in a browser to obtain limb keypoints 102.
Fig. 1D is an image captured by a body sensor for human motion recognition, which is mainly used in somatosensory interaction devices for games. The body sensor captures a 3D depth image through a depth-sensing camera, and the depth image is converted by a skeleton tracking system to obtain the limb key points 103.
The embodiments of the present application find the following technical problems in the related art: the related art analyzes the actions of a single character or multiple characters and then recognizes them as a specific action, so as to trigger some system feedback (such as in games) or to record the set of actions into a computer so that it becomes the actions of a virtual character. However, these approaches do not compare actions across multiple persons, so the user cannot perceive in real time whether his or her actions are standard, which not only reduces the efficiency of action imitation but also wastes action analysis resources.
In view of the above technical problems, embodiments of the present application provide a method for processing actions in a video, which can improve the efficiency of action simulation. An exemplary application of the method for processing the motion in the video provided by the embodiment of the present application is described below, and the method for processing the motion in the video provided by the embodiment of the present application may be implemented by various electronic devices, for example, may be applied to various types of user terminals (hereinafter also referred to as simply terminals) such as a smart phone, a tablet computer, a vehicle-mounted terminal, a notebook computer, a desktop computer, and an intelligent wearable device.
Next, taking an electronic device as an example, an exemplary application scenario in which a terminal implements the method for processing actions in a video provided by the embodiment of the present invention is described, referring to fig. 2, fig. 2 is an application scenario diagram of the method for processing actions in a video provided by the embodiment of the present application, and action analysis is completed by various types of terminals 400 such as a smart phone, a tablet computer, a vehicle-mounted terminal, a notebook computer, and an intelligent wearable device, which will be described with reference to fig. 2.
The terminal 400 is used to operate a client 410, and the client 410 is a client with a motion analysis function, such as a video client and a live broadcast client.
A client 410 for displaying at least one imitation action performed by a participating object in a first video; the client is further configured to, in response to a user's action analysis trigger operation, determine the similarity between a plurality of imitation key points of the imitation action and a plurality of reference key points of the reference action, display the plurality of imitation key points synchronously with the imitation action, and apply a corresponding display mode to the imitation key points of the participating object according to the similarity.
In some embodiments, the terminal 400 implements the action processing method in the video provided by the embodiment of the present application by running a computer program, for example, the computer program may be a native program or a software module in an operating system; may be a local (Native) Application (APP), i.e. a program that needs to be installed in an operating system to run, such as a video client; or may be an applet, i.e. a program that can be run only by downloading it to the browser environment; but also an applet that can be embedded into any APP. In general, the computer programs described above may be any form of application, module or plug-in.
The structure of the terminal 400 in fig. 2 is explained next. Referring to fig. 3, fig. 3 is a schematic structural diagram of a terminal 400 according to an embodiment of the present application, where the terminal 400 shown in fig. 3 includes: at least one processor 410, memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable communications among the components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 440 in FIG. 3.
The Processor 410 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.
The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
The operating system 451, which includes system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., is used for implementing various basic services and for processing hardware-based tasks.
A network communication module 452 for communicating to other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), among others.
A presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430.
An input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the motion processing apparatus in the video provided by the embodiment of the present application may be implemented in software, and fig. 2 shows the motion processing apparatus 455 in the video stored in the memory 450, which may be software in the form of programs and plug-ins, and includes the following software modules: a display module 4551 and an analysis module 4552, which are logical and thus may be arbitrarily combined or further split depending on the functions implemented. The functions of the respective modules will be explained below.
Next, a method for executing action processing in video provided in the embodiment of the present application by terminal 400 in fig. 2 will be described as an example. Referring to fig. 4A, fig. 4A is a schematic flowchart of a method for processing actions in a video according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 4A.
It should be noted that the method shown in fig. 4A can be executed by various forms of computer programs executed by the terminal 400, and is not limited to the client 410, such as the operating system 451, the software modules, and the scripts described above, so that the following example of the client should not be construed as limiting the embodiments of the present application.
In step S101, at least one impersonation action performed by the participant in the first video is displayed.
In some embodiments, the at least one simulated action implemented by the participant object in the first video may be displayed by playing the first video, and the at least one simulated action implemented by the participant object may be displayed according to an action sequence-driven model of the participant object after three-dimensional modeling of the participant object in the first video.
The following description will be given taking as an example at least one impersonation action performed by displaying a participant in a first video in such a manner that the first video is played.
In some embodiments, a plurality of key frames in the first video are identified and key frames comprising the same simulated action are filtered out of the plurality of key frames; and sequentially playing the key frames remained after filtering.
As an example, all key frames in the first video may be identified, or only some key frames in the first video may be identified, for example, key frames including the participating object in the first video may be identified, so that the identification resource may be saved.
As an example, a plurality of key frames in a first video are identified, and the similarity between a first key frame and a second key frame in the first video is determined; when the similarity between the first key frame and the second key frame is not lower than the adjacent-key-frame similarity threshold, the first key frame is filtered out; and the key frames remaining after filtering are played in sequence.
Here, the first key frame is any key frame in the first video, and the second key frame is a key frame located before the first key frame in the first video. The threshold of the similarity of the adjacent key frames may be a default value, or a value set by a user, a client, or a server, or may be determined according to the similarities corresponding to all the adjacent key frames, for example, an average value of the similarities corresponding to all the adjacent key frames is used as the threshold of the similarity of the adjacent key frames.
Therefore, the key frames with high similarity can be filtered by calculating the similarity between the adjacent key frames, so that the similar key frames in the first video are prevented from being decoded and played, the action simulation efficiency can be improved, and the decoding resources and the playing resources are saved.
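A minimal sketch of the adjacent-key-frame filtering described above; frame_similarity is a hypothetical callable (the disclosure does not fix a particular similarity measure), and the threshold defaults to the average adjacent-pair similarity as suggested above:
```python
def filter_similar_key_frames(key_frames, frame_similarity, threshold=None):
    """Drop key frames that are too similar to the key frame immediately before them.

    key_frames: decoded key frames in playback order (e.g. numpy arrays).
    frame_similarity: hypothetical callable returning a similarity in [0, 1].
    threshold: adjacent-key-frame similarity threshold; if None, the average
        similarity over all adjacent pairs is used.
    """
    if len(key_frames) < 2:
        return list(key_frames)
    pair_sims = [frame_similarity(prev, cur)
                 for prev, cur in zip(key_frames, key_frames[1:])]
    if threshold is None:
        threshold = sum(pair_sims) / len(pair_sims)
    kept = [key_frames[0]]
    for frame, sim in zip(key_frames[1:], pair_sims):
        if sim < threshold:  # keep only frames that differ enough from their predecessor
            kept.append(frame)
    return kept
```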
In some embodiments, the first video is played with normal decoding, i.e., video frames in the first video are decoded frame by frame to display at least one impersonation action performed by the participant in the first video.
For example, the video frames of the first video are always played at a speed of 20 frames/second(s) (or 1 × speed).
In some embodiments, the standard video frames in the first video are played at a first play speed, and the non-standard video frames in the first video are played at a second play speed, wherein the first play speed is greater than the second play speed; the standard video frames are video frames in which the number of non-standard key points in the imitation action is below a number threshold, and the non-standard video frames are video frames in which the number of non-standard key points in the imitation action is not less than the number threshold.
As an example, the first playback speed may be a frame rate (unit: frame/s), and the inverse of the first playback speed is used to characterize the display duration of the standard video frame. The second playback speed may be a frame rate, and an inverse of the second playback speed may be used to characterize a display duration of the non-standard video frame.
As an example, a non-standard key point is an imitation key point whose similarity with the corresponding reference key point of the reference action is below a similarity threshold, and a standard key point is an imitation key point whose similarity with the corresponding reference key point is not below the similarity threshold. The similarity threshold may be a default value, a value set by a user, a client, or a server, or a value determined according to the similarities of all the imitation key points, for example, the average of the similarities of all the imitation key points.
For example, the video frames with the simulated key points being the standard key points are played at a speed of 20 frames/second (or 1 time speed), and the video frames with the non-standard key points in the simulated key points are played at a speed of 10 frames/second (or 0.5 time speed), so that the non-standard video frames are played at a slow speed, which is beneficial for a user to carefully observe the key points with the simulated error, the standard video frames are played at a fast speed, the video watching speed can be improved, the action simulation efficiency is improved, and the playing resources are saved.
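As a hedged sketch (the per-frame counts are assumed to be precomputed; the 20 fps and 10 fps values are simply the example above), the per-frame display duration could be assigned as follows:
```python
def frame_display_durations(nonstandard_counts, number_threshold,
                            standard_fps=20.0, nonstandard_fps=10.0):
    """Assign a display duration (in seconds) to each video frame.

    nonstandard_counts: number of non-standard key points in each frame's
        imitation action (assumed precomputed).
    Standard frames (count below the threshold) play at the faster frame rate;
    non-standard frames play at the slower rate, so imitation errors can be
    observed more carefully.
    """
    durations = []
    for count in nonstandard_counts:
        fps = standard_fps if count < number_threshold else nonstandard_fps
        durations.append(1.0 / fps)  # display duration is the inverse of the frame rate
    return durations
```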
In step S102, in response to the motion analysis trigger operation, a plurality of mimic key points in the mimic motion are displayed in synchronization with the mimic motion.
In some embodiments, the mimic keypoints are displayed embedded in the region of the mimic action.
As an example, in fig. 6, while the mimic actions of the participation objects b1 and b3 are displayed, the non-standard keypoint 601 and the standard keypoint 602 are also displayed in synchronization with the mimic actions of the participation objects b1 and b3, and the non-standard keypoint 601 and the standard keypoint 602 are displayed embedded in the part of the mimic actions of the participation objects b1 and b 3.
In some embodiments, after step S102, the display of the plurality of imitation key points of a participating object may also be cancelled in response to an action analysis cancelling operation for that participating object.
For example, in fig. 9C, when the user deselects the participating object b1 in the object selection box 602, the shape of the selection button corresponding to b1 in the object selection box 602 changes, and the imitation key points of the participating object b1 are no longer displayed in the human-computer interaction interface. In this way, display resources of the terminal can be saved.
In step S103, a display mode corresponding to the similarity is applied to the plurality of simulation key points.
Here, the similarity characterizes a degree of coincidence between the mimic action and the reference action.
In some embodiments, referring to fig. 4B, fig. 4B is a flowchart illustrating a method for processing an action in a video provided in an embodiment of the present application, and based on fig. 4A, step S103 may include step S1031 and step S1032.
In step S1031, the similarities between the plurality of mimic key points and the plurality of reference key points of the reference action are determined.
Here, the reference motion is a motion performed by the reference object. The participating object and the reference object may be in the same video, e.g., the participating object and the reference object may be in the first video at the same time; or the participating object and the reference object may be in different videos, e.g., the participating object in a first video and the reference object in a second video.
Taking the example that the participating object and the reference object are in the first video at the same time, when the first video includes a plurality of objects and does not specify the reference object, the reference object selection prompt information may be further presented before step S1031; in response to a selection operation for a plurality of objects, a selected object is determined as a reference object, and unselected objects are determined as participating objects. In this way, the user is supported to select the reference object in a personalized manner, so that the action simulation efficiency is improved.
For example, the reference object selection hint information is used to hint at setting the reference object; the reference object selection hint information may display all objects in the first video (regardless of whether the key points of the motion thereof are occluded or not), may display only objects that are not occluded in the first video, or may display occluded objects in the reference object selection hint information, but the occluded objects are in a state that is not selectable in the reference object selection hint information.
For convenience of description, the following description is given by way of example of a handheld terminal, but it should be noted that the device applied in the present application is not limited to a handheld terminal, and may be implemented as a desktop computer or a notebook computer. In fig. 9B, the first video includes 6 objects, respectively, B1, B2, B3, B4, B5 and B6, the reference object selection prompt information 902 is presented before the motion analysis, the user may designate a reference object in the reference object selection prompt information 902, taking the designated object B3 as the reference object as an example, and the remaining objects B1, B2, B4, B5 and B6 as the participating objects, so that the participating objects B1, B2, B4, B5 and B6 may be respectively motion-compared with the reference object B3.
In some embodiments, the first video may be a real-time video, such as an in-process video (hereinafter referred to as a live video). The participant object and the reference object may appear in the live video at the same time, or alternatively, the live video may include only the participant object and the reference object is derived from a second video (e.g., a pre-recorded video).
The following describes a selection manner of a reference object in a live video by taking an example in which a participating object and a reference object appear in the live video at the same time, the live video includes a plurality of objects, and the reference object is not specified.
Example one, in a live broadcast process, presenting reference object selection prompt information; in response to the selection operation of the plurality of objects, the selected object is determined as a reference object, and the unselected object is determined as a participating object, so that the actions of the participating object and the reference object are compared in real time in the subsequent live broadcasting process to analyze whether the action of the participating object in the live broadcasting process is standard or not. Therefore, the user is supported to select the reference object, and the watching experience of the user is improved.
In the second example, during live broadcasting, position recognition is performed on an object in the live video, an object closest to a camera (e.g., a video camera) is automatically determined as a reference object, and the remaining objects in the live video are determined as participating objects. Since the instructor is usually located at a position close to the imaging device in the video, the object closest to the imaging device is automatically determined as the reference object, and the operation resources of the terminal can be saved.
In the third example, during live broadcasting, mouth-shape action recognition is performed on the objects in the live video; an object whose mouth-shape action duration exceeds the guide duration threshold is automatically determined as the reference object, and the other objects in the live video are determined as participating objects. Because the teacher usually gives oral instructions during teaching in the video, automatically determining the object with the longest duration of mouth-shape action as the reference object can save the computing resources of the terminal.
For example, the guide time threshold may be a default value, or may be a value set by a user, a client, or a server.
In the fourth example, during the live broadcast, contour recognition is performed on the objects in the live video; the anchor in the live video is automatically determined as the reference object, and the remaining objects in the live video are determined as participating objects. In this way, the anchor's actions can be used as the reference actions for comparison, which increases the interest of the live broadcast.
As an example, when only the participating object is included in the live video and the reference object is derived from the second video, the object in the second video is automatically determined as the reference object and the object in the live video is automatically determined as the participating object.
As an example, when a plurality of participant objects are included in the live video, in the live process, in response to a selection operation for the plurality of participant objects, at least one selected participant object is determined as a participant object that simulates a key point displayed in the live process. Thus, the user is supported to select the participation object by himself.
For example, the participating objects on different audience sides may be set independently, for example, users a and b may each select the same or different participating objects when watching the live broadcast.
For example, an object selection box is displayed, wherein the object selection box comprises a plurality of participating objects in a live video; and responding to the selection operation received in the object selection frame, determining the selected at least one participant to serve as the participant displaying the simulated key points in the live broadcasting process, and improving the watching experience of the user.
For example, the objects displayed in the object selection box may default to a fully selected state, i.e., display the impersonation keypoints of all participating objects in the live video by default; it is also possible to default to a fully unselected state, i.e. by default not to display impersonation keypoints for all participating objects in the live video.
In some embodiments, the client may invoke a corresponding service (e.g., an action analysis service) of the terminal, and the process of action analysis is completed by the terminal. The client may also call a corresponding service (e.g., an action analysis service) of the server, and the action analysis process is completed through the server.
As an example, when the client invokes the corresponding service (e.g., the action analysis service) of the server to complete the process of the action analysis, the alternative steps of step S1031 are: the server determines similarities between a plurality of mimic key points of the mimic action and a plurality of reference key points of the reference action; the server sends the similarity to the client.
Next, a procedure in which the client calls a service (for example, an action analysis service) corresponding to the terminal and the terminal performs action analysis will be described as an example. It should be noted that the process of the client invoking the corresponding service (e.g., the action analysis service) of the server to complete the action analysis is similar to that described below, and will not be described again.
In some embodiments, when the mimic action of the reference object is derived from the first video and the reference action of the reference object is derived from the second video, step S1031 may be to align the plurality of video frames included in the second video and the plurality of video frames included in the first video to the same time axis (e.g., to the time axis of the first video or the second video); the following processing is performed for each time point in the time axis: determining a first video frame aligned to a time point in a first video and a second video frame aligned to the time point in a second video; similarities are determined between the plurality of mimic keypoints in the first video frame and the plurality of reference keypoints in the second video frame.
Here, aligning the plurality of video frames included in the second video and the plurality of video frames included in the first video to the same time axis is to make time points (i.e., play time points) of the video frames including the same action in the first video and the second video coincide.
As an example, determining similarity between the plurality of mimicking keypoints in the first video frame and the plurality of reference keypoints in the second video frame may be for the first mimicking keypoints in the first video frame and the first reference keypoints in the second video frame that are in the same location as the first mimicking keypoints, performing the following operations: determining a first motion amplitude between the first mimic keypoint and a second mimic keypoint of the same location in a third video frame, wherein the third video frame is a video frame of the first video that is subsequent to the first video frame; determining a second motion amplitude between the first reference key point and a second reference key point at the same position in a fourth video frame, wherein the fourth video frame is a video frame behind the second video frame in the second video; comparing the first motion amplitude with the second motion amplitude to obtain a similarity between the simulated keypoints and the corresponding reference keypoints; wherein the first mimicking keypoint is any keypoint of a plurality of mimicking keypoints in the first video frame, and the first reference keypoint is any keypoint of a plurality of reference keypoints in the second video frame.
Here, the first mimicking keypoint may be any keypoint in the first video frame. The first motion amplitude and the second motion amplitude may be vectors, i.e. comprising a direction and a displacement; it may also be scalar, i.e. include a displacement.
For example, the first video frame and the third video frame in the first video may be video frames determined according to time points, that is, the actions in the first video frame and the third video frame may be the same or different, and taking the frame rate as 1 frame/s as an example, the first video frame is a 1:30 (i.e., 1 min 30 sec) video frame, and the third video frame may be a 1:31 video frame. The first video frame and the third video frame in the first video may also be video frames determined according to the motion, that is, the motions in the first video frame and the third video frame are different, taking the frame rate as 1 frame/s as an example, the first video frame is a 1:30 video frame, and the third video frame may be a 1:35 video frame, where the motions in the video frames corresponding to 1:30-1:34 are the same.
For example, comparing the first motion magnitude and the second motion magnitude to obtain the similarity between the mimicking keypoints and the corresponding reference keypoints may be determining the area occupied by all first mimicking keypoints in the first video frame and the area occupied by all second reference keypoints in the second video frame; determining the ratio of the area occupied by all first simulation key points in the first video frame to the area occupied by all second reference key points in the second video frame as a scaling ratio; determining an amplitude ratio between the first motion amplitude and the second motion amplitude; the product between the amplitude ratio and the scale is determined as the similarity between the simulated keypoint and the corresponding reference keypoint.
When the participating object is farther from the camera than the reference object, the motion amplitude of the participating object's key points will appear smaller than that of the reference object's key points; in this case it would be unreasonable to conclude that the participating object's key points are non-standard. Therefore, the area occupied by all key points of the participating object can be divided by the area occupied by all key points of the reference object to obtain the scaling. For example, when the area occupied by all first imitation key points in the first video frame is 5 cm² and the area occupied by all second reference key points in the second video frame is 10 cm², the scaling is 5/10 = 0.5; and when the first motion amplitude is 1 cm and the second motion amplitude is 1.25 cm, the amplitude ratio is 1/1.25 = 0.8, so the similarity between the imitation key point and the corresponding reference key point is 0.5 × 0.8 = 0.4.
In the embodiment of the application, the scaling ratio is determined by the area occupied by the first simulation key point and the area occupied by the second reference key point, so that the similarity calculation error caused by inconsistent distances between the participating object and the reference object and the lens can be avoided, and the accuracy of motion analysis is improved.
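A minimal sketch of the similarity computation described above, assuming each key point is an (x, y) coordinate and approximating the area occupied by a set of key points with their bounding box (the disclosure does not fix a particular area measure):
```python
import math

def motion_amplitude(p_now, p_next):
    """Displacement of one key point between two frames (scalar variant)."""
    return math.hypot(p_next[0] - p_now[0], p_next[1] - p_now[1])

def bounding_area(points):
    """Area occupied by a set of key points, approximated by their bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return max(1e-6, (max(xs) - min(xs)) * (max(ys) - min(ys)))

def keypoint_similarity(mimic_now, mimic_next, ref_now, ref_next,
                        all_mimic_points, all_ref_points):
    """similarity = scaling * amplitude ratio, as in the 0.5 * 0.8 = 0.4 example."""
    scaling = bounding_area(all_mimic_points) / bounding_area(all_ref_points)
    first_amplitude = motion_amplitude(mimic_now, mimic_next)   # participating object
    second_amplitude = motion_amplitude(ref_now, ref_next)      # reference object
    if second_amplitude == 0:
        # assumption: a key point the reference does not move is fully similar
        # only if the imitation does not move it either
        return 1.0 if first_amplitude == 0 else 0.0
    return scaling * (first_amplitude / second_amplitude)
```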
An example of aligning a plurality of video frames included in the second video and a plurality of video frames included in the first video to the same time axis is explained below.
As a first example, a first audio in a first video is extracted, a second audio in a second video is extracted, a first audio frame in the first video is aligned with a second audio frame included in the second video, such that a video frame synchronized with the first audio frame and a video frame synchronized with the second audio frame are aligned.
Here, the first audio frame may be any one of audio frames in the first audio, and the second audio frame may be any one of audio frames in the second audio.
For example, after aligning a video frame synchronized with the first audio frame and a video frame synchronized with the second audio frame, all video frames in the first video and all video frames in the second video may be automatically aligned according to the playing time point of the video frames, and there is no need to align each audio frame in the first video one by one.
Taking dancing as an example, even if the recording start times of the first video and the second video are not consistent, on the premise that the first video and the second video are the same background music, the corresponding action of each same audio frame is the same, so that the plurality of video frames included in the first video and the plurality of video frames included in the second video can be aligned to the same time axis through audio.
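The disclosure does not specify how matching audio frames are found; one common approach, shown here purely as an assumption, is to cross-correlate the two mono audio tracks and shift one video's time axis by the resulting offset:
```python
import numpy as np

def estimate_audio_lag(audio_a, audio_b, sample_rate):
    """Estimate the time offset between two mono audio tracks via cross-correlation.

    audio_a, audio_b: 1-D arrays of samples at the same sample rate, assumed to
    contain the same background music. The returned offset (in seconds) is the
    amount by which one time axis must be shifted so that both videos align;
    the sign convention should be verified against the chosen decoding pipeline.
    """
    corr = np.correlate(audio_a, audio_b, mode="full")
    lag_samples = int(np.argmax(corr)) - (len(audio_b) - 1)
    return lag_samples / float(sample_rate)
```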
As a second example, a mock action in a first video frame in a first video is extracted, a second video frame in a second video is determined that includes the same reference action as the mock action, and the first video frame is aligned with the second video frame.
Here, the first video frame is any video frame in the first video.
For example, after a first video frame and a second video frame including the same action are aligned, all video frames in the first video and all video frames in the second video can be automatically aligned according to the playing time point of the video frames, and it is not necessary to align each video frame in the first video one by one.
In the embodiment of the application, since the recording start times of the first video and the second video may not be consistent, the first video and the second video need to be aligned to the same time axis, so that the time points of the video frames including the same action in the first video and the second video are consistent, thereby ensuring the accuracy of the subsequent similarity comparison.
In some embodiments, a plurality of impersonation keypoints for impersonating an action are identified; in response to a selection operation for the plurality of mimic keypoints, determining a similarity between the selected mimic keypoints and the corresponding reference keypoints. Thus, step S102 may be to display the selected mimic key point and the mimic action in synchronization.
As an example, in fig. 9A, 6 key points of the participating object are identified, which are c1, c2, c3, c4, c5 and c6, respectively, and the user may select key points that need to be displayed in the key point selection box 901, for example, cancel displaying c1, c2 and c6, and thus, only key points c3, c4 and c5 of the participating object are subsequently presented.
According to the method and the device, the key points selected by the user can be displayed in a personalized manner, which not only avoids the reduction in action simulation efficiency caused by presenting overly complex information on the display interface, but also saves action analysis resources and display resources.
In step S1032, a corresponding display manner is applied to the simulation keypoints according to the similarity.
Here, different degrees of similarity correspond to different display modes.
In some embodiments, the standard key points and the non-standard key points in the simulated action are displayed in different display modes; the non-standard key points are simulation key points of which the similarity between the simulation action and a plurality of reference key points of the reference action is lower than a similarity threshold, and the standard key points are simulation key points of which the similarity between the simulation action and the reference action is not lower than the similarity threshold.
For example, the similarity threshold may be a default value, a value set by a user, a client, or a server, or a value determined according to the similarities corresponding to all the key points in the simulated motion, for example, an average value of the similarities corresponding to all the key points in the simulated motion is used as the similarity threshold.
For example, the non-standard keypoints 601 (i.e., solid points) in fig. 6 indicate that the motion of those keypoints in the frame is not standard, and the standard keypoints 602 (i.e., hollow points) indicate that the motion of those keypoints in the frame is standard.
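As an illustrative sketch of this classification (Python, with assumed names; the mean-similarity variant of the threshold described above is used when no threshold is supplied):

```python
def classify_mimic_keypoints(similarities, similarity_threshold=None):
    """Split mimic keypoints into standard and non-standard sets by a similarity threshold.

    similarities: dict mapping keypoint id -> similarity to its reference keypoint.
    If no threshold is supplied, the mean similarity of all keypoints of the
    simulated action is used, which is one of the threshold choices described above.
    """
    if similarity_threshold is None:
        similarity_threshold = sum(similarities.values()) / len(similarities)
    standard = {k for k, s in similarities.items() if s >= similarity_threshold}
    non_standard = {k for k, s in similarities.items() if s < similarity_threshold}
    return standard, non_standard

# Example: c3 and c4 would be drawn as hollow (standard) points, c5 as a solid point.
print(classify_mimic_keypoints({"c3": 0.9, "c4": 0.8, "c5": 0.4}, similarity_threshold=0.6))
```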
As an example, when at least two of the first and second participation objects are included in the first video, displaying the standard key point and the non-standard key point in the mimic action in different display manners may be displaying the standard key point and the non-standard key point in the mimic action of the first participation object in different display manners; determining the part corresponding to the non-standard key point; the standard keypoints of the part in the second participating object are displayed in the same display mode as the non-standard keypoints of the part in the first participating object.
For example, fig. 9D includes a plurality of participating objects, namely b1 and b3; when the four key points of the head of participating object b3 are non-standard key points, the four key points of the head of participating object b1 are also displayed as non-standard key points (solid points), so that the user is prompted to pay attention to the four key points of the head, thereby implementing a teaching function.
In other embodiments, non-standard keypoints in the mock action are displayed; wherein the non-standard keypoints are simulated keypoints of the simulated motion with similarity between a plurality of reference keypoints of the reference motion lower than a similarity threshold. Therefore, only the part with nonstandard action can be prompted, and the display resource of the terminal is saved.
In still other embodiments, determining a degree of saliency corresponding to each simulated keypoint based on the similarity of each simulated keypoint, and displaying the simulated keypoints based on the degree of saliency corresponding to each simulated keypoint; wherein the significance and the similarity are inversely correlated.
For example, the similarity of the mimic keypoint b1 in fig. 9E is 0.5 and the similarity of the mimic keypoint b2 is 0.8, so the mimic keypoint b1 is displayed as solid black and the mimic keypoint b2 as solid gray; the significance of b1 is higher than that of b2, indicating that the similarity of b1 is lower than that of b2.
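As an illustrative sketch of mapping similarity to display significance (assuming significance is rendered as the darkness of the drawn point, which is only one possible display mode):

```python
def keypoint_gray_level(similarity):
    """Map similarity to an 8-bit gray level: lower similarity -> darker, more salient point.

    similarity is assumed to be normalized to [0, 1]; 0 maps to solid black (gray 0)
    and 1 maps to white (gray 255).
    """
    similarity = min(max(similarity, 0.0), 1.0)
    return round(255 * similarity)

# e.g. b1 with similarity 0.5 -> gray 128 (darker), b2 with similarity 0.8 -> gray 204 (lighter)
```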
According to the embodiment of the application, the simulated key points synchronously displayed with the simulated actions are displayed in a differentiated mode according to the similarity between the simulated key points and the reference key points, so that the participator can sense the difference between the actions and the reference actions in real time, the action simulation efficiency is improved, and the action analysis resources are saved.
In some embodiments, after step S1031, an action score of the participating object may be further determined according to the similarity, and the action score is displayed; wherein there is a positive correlation between the action score and the similarity. In this way, by scoring the participating objects, the efficiency of motion simulation of the participating objects can be improved.
In some embodiments, after step S1031, the frame similarity of each video frame included in the first video may further be determined according to the similarity; in the playing process of the first video, according to the frame similarity, a corresponding display mode is applied to the position of the corresponding video frame in the progress bar of the first video; wherein different frame similarities correspond to different display modes.
As an example, according to the similarity, determining the frame similarity of each video frame included in the first video may be to perform the following for each video frame in the first video: determining a plurality of impersonation keypoints included in the video frame; and summing the similarity corresponding to each simulation key point, and determining the result of the summation as the frame similarity of the video frame.
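As an illustrative sketch of this per-frame aggregation and of the frame-similarity threshold used below (Python, with assumed names):

```python
def frame_similarity(keypoint_similarities):
    """Frame similarity: sum of the similarities of all mimic keypoints in one video frame."""
    return sum(keypoint_similarities)

def non_standard_frame_indices(per_frame_keypoint_similarities, frame_similarity_threshold):
    """Indices of video frames whose frame similarity falls below the threshold.

    per_frame_keypoint_similarities: one list of keypoint similarities per video frame.
    The returned positions can then be highlighted on the progress bar of the first video.
    """
    return [i for i, sims in enumerate(per_frame_keypoint_similarities)
            if frame_similarity(sims) < frame_similarity_threshold]
```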
As an example, according to the frame similarity, applying a corresponding display manner to the position of the corresponding video frame in the progress bar of the first video may be to display the corresponding position of the video frame of which the frame similarity is lower than the frame similarity threshold in the progress bar in a display manner different from the corresponding position of the standard video frame in the first video in the progress bar; wherein the standard video frame is a video frame with the frame similarity not lower than the frame similarity threshold.
As an example, according to the frame similarity, applying the corresponding display manner to the position of the corresponding video frame in the progress bar of the first video may be to determine a degree of significance of the display manner of each video frame at the corresponding position in the progress bar according to the frame similarity of each video frame, and display in the progress bar according to the degree of significance of the display manner of the corresponding position of each video frame in the progress bar; wherein there is a negative correlation between the degree of saliency and the frame similarity.
For example, when the participant has 2 mimic keypoint errors at 1:20 and 8 mimic keypoint errors at 1:50, a red dot may be displayed in the progress bar of the first video at a position corresponding to 1:50 to prompt the user that there are many wrong actions at the position of 1:50, and the user may focus on viewing.
According to the method and the device, the speed of positioning the non-standard video frame in the first video by the user can be increased, so that the action simulation efficiency is improved, and resources for searching the non-standard video frame are saved.
In some embodiments, before step S102, an occlusion relationship between the participant objects in the first video may also be identified, and the participant objects satisfying the non-occlusion condition are automatically identified as the participant objects for responding to the action analysis trigger operation.
As an example, the non-occlusion condition includes at least one of: in the playing process of the first video, the duration which is not shielded is greater than a duration threshold; in the playing process of the first video, the proportion between the duration which is not shielded and the total duration of the first video is greater than a duration proportion threshold; the ratio between the area of the occluded part and the overall area of the participating object is less than an area ratio threshold.
Here, the duration threshold may be a default value, a value set by a user, a client, or a server, or a value determined according to the duration of the non-occlusion corresponding to all the participating objects included in the first video, for example, an average value of the duration of the non-occlusion corresponding to all the participating objects included in the first video is used as the duration threshold.
For example, when the duration threshold is a positive number close to 0, this means that a participating object that has not been occluded is automatically identified as the participating object for responding to the action analysis trigger operation.
Here, the duration ratio threshold may be a default value, a value set by a user, a client, or a server, or a value determined according to the ratios, corresponding to all the participating objects included in the first video, between the non-occluded duration and the total duration of the first video; for example, the average of these ratios is used as the duration ratio threshold.
For example, when the duration ratio threshold is a positive number close to 0, this means that a participating object that has not been occluded is automatically identified as the participating object for responding to the action analysis trigger operation.
Here, the area ratio threshold may be a default value, may be a value set by a user, a client, or a server, or may be determined according to a ratio between the area of the occluded part corresponding to all the participating objects included in the first video and the entire area of the participating objects, for example, an average value of the ratios between the area of the occluded part corresponding to all the participating objects included in the first video and the entire area of the participating objects is used as the area ratio threshold.
For example, when the area ratio threshold is a positive number close to 0, this means that an unoccluded participating object is automatically identified as the participating object for responding to the action analysis trigger operation.
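As an illustrative sketch of the non-occlusion test (Python, with assumed field names; the embodiment states that the condition includes at least one of the three tests, so combining them with a logical OR is only one possible reading):

```python
from dataclasses import dataclass

@dataclass
class OcclusionStats:
    unoccluded_duration: float  # seconds the participating object is visible in the first video
    total_duration: float       # total duration of the first video, in seconds
    occluded_area: float        # area of the occluded part of the participating object
    object_area: float          # overall area of the participating object

def satisfies_non_occlusion(stats, duration_threshold,
                            duration_ratio_threshold, area_ratio_threshold):
    """True if the participating object meets at least one of the three conditions above."""
    return (stats.unoccluded_duration > duration_threshold
            or stats.unoccluded_duration / stats.total_duration > duration_ratio_threshold
            or stats.occluded_area / stats.object_area < area_ratio_threshold)
```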
According to the method and the device, the participating object used for responding to the action analysis trigger operation is determined according to whether the participating object is occluded, which avoids display errors caused by key points of occluded participating objects failing to be identified, and saves the resources used for identifying key points of the participating objects.
In some embodiments, when a plurality of participant objects are included in the first video, the plurality of participant objects included in the video may be further displayed before step S102; in response to a selection operation for a plurality of participant objects, at least one selected participant object is determined as a participant object for triggering an operation in response to the action analysis.
As an example, an object selection box is displayed, wherein the object selection box comprises a plurality of participation objects in the first video; in response to a selection operation for a plurality of participant objects received in the object selection box, at least one selected participant object is determined as a participant object for responding to the action analysis trigger operation.
For example, the plurality of objects displayed in the object selection box may default to a fully selected state, i.e., by default, key points of all participating objects included in the first video are displayed; it is also possible to default to an all unselected state, i.e. by default not to display key points of the participating objects comprised in the first video.
For example, the object selection box may be displayed all the time during the playing of the first video, or may be displayed by triggering the selection button during the playing of the first video, so that the participating object for responding to the action analysis triggering operation may be updated at any time as long as the playing of the first video is not finished.
For example, all the participant objects in the first video may be displayed in the object selection frame (regardless of whether the key points of the actions are occluded or not), only the participant objects that are not occluded in the first video may be displayed, or the occluded participant objects may be in a non-selectable state in the object selection frame although the occluded participant objects are displayed in the object selection frame.
For example, in fig. 6, the user may select the participating objects to be observed in the object selection box 602 in the upper left corner, e.g. selecting to display the key points of the participating objects b1 and b3 in fig. 6.
In some embodiments, referring to fig. 5, fig. 5 is a schematic flowchart of a method for processing an action in a video provided in an embodiment of the present application; based on fig. 4B, step S104 may be performed after step S1031, and it should be noted that step S1032 and step S104 may be executed in any order.
In step S104, corresponding action prompt information is presented at the position of each simulated keypoint whose similarity is lower than the similarity threshold.
In some embodiments, the action cue information is determined from reference keypoints corresponding to the mimicking keypoints. The form of the action prompt message includes at least one of the following: animation information; freeze motion information; and (5) character information.
Taking the example that the motion prompt information is text information, in fig. 9F, the position of the simulated key point b1 of the participating object is lower than that of the corresponding reference key point, and the motion prompt information 905 for prompting the simulated key point b1 to move up may be presented.
The embodiment of the application presents the action prompt information aiming at the non-standard key points, and can facilitate the user to quickly adjust the action, thereby improving the action simulation efficiency and saving the action analysis resources.
The following describes a motion processing method in a video provided in an embodiment of the present application, taking a dance simulation performed by a participant as an example.
The reference object (such as a dance coach) can record a series of dance movements facing a stable camera under normal lighting; the recorded video is the standard video (i.e., the second video). The participating objects (e.g., trainees) b1, b2, b3, etc. can record imitation videos (i.e., the first video) separately or simultaneously. Based on the degree of coincidence of the motions or of the audio tracks, the embodiment of the present application can place the standard video and the imitation video on the same timeline for frame-by-frame comparison, and display the skeletal nodes (or joint points, i.e., the key points) with larger motion deviation of the trainee in a differentiated manner (e.g., as solid points or in red).
In some embodiments, referring to fig. 6, fig. 6 is an application scenario diagram of the method for processing actions in a video provided by an embodiment of the present application. After comparison with the reference actions in the standard library, a dynamic representation of each participating object (e.g., b1 and b3 in fig. 6) at each time is obtained through analysis; a non-standard keypoint 601 (i.e., a solid point) in fig. 6 indicates that the action of that node in the frame is not standard, and a standard keypoint 602 (i.e., a hollow point) indicates that the action of that node in the frame is standard. However, the occluded portion of the leftmost person (b6) is too large, so the key points of b6 are not displayed.
Referring to fig. 7, fig. 8A, fig. 8B, fig. 8C, and fig. 8D, fig. 7 is a schematic flowchart of a method for processing an action in a video provided by an embodiment of the present application, and fig. 8A, fig. 8B, fig. 8C, and fig. 8D are schematic diagrams of a principle of the method for processing an action in a video provided by an embodiment of the present application. Description will be made with reference to fig. 7, 8A, 8B, 8C, and 8D.
In step S701, a standard video including a standard motion of the reference object a (i.e., the above-described reference motion) is recorded.
In some embodiments, the actions of the reference object a are recorded by a camera, and the human body in the video is identified to obtain 17 key body joints; the joints are tracked frame by frame through the video, and the acquired motion set a is determined as the "standard action". For example, fig. 8A shows a motion set a including a plurality of video frames, for example video frames whose playing time points are 00:02:16, 00:02:15, 00:02:13, and so on.
In step S702, a mimic video including the mimic action of the participating object b is recorded.
In some embodiments, the simulated motions of a plurality of participating objects are recorded by a camera, and the collected and analyzed motion set b is determined as the "simulated actions" in a similar manner as in step S701. If multiple participating objects appear in a video segment at the same time, occlusion of the body should be avoided; otherwise the body joint data of the occluded objects cannot be acquired and tracking may fail. For example, there are six participating objects in fig. 6, but since the occluded portion of the leftmost participating object b6 is too large, tracking of b6 fails.
In step S703, the standard video and the mimic video are aligned.
In some embodiments, the standard video and the simulated video are time aligned so that the analysis can subsequently be performed on the same timeline. For example, in fig. 8B, since the imitation video starts earlier, alignment is performed based on the sound waveform of the background music (alternatively, a loud prompt may be given to the participating object or the reference object before the dance starts), so that video frames of the standard video and the imitation video that contain the same motion can be placed at the same point in time.
In step S704, the motion deviations of each joint point of the participating object are compared on the same time track frame by frame.
In some embodiments, the standard motion of the reference object a is analyzed to derive a standard displacement range for each keypoint.
In some embodiments, in fig. 8C, taking the right elbow joint point a12 of the reference object a as an example, a12 moves to the left during the motion, and then the position parameter of a12 of the previous frame is subtracted from the position parameter of a12 of the next frame at the data level, so that a motion vector x (i.e. the motion deviation, including direction and displacement, described above) can be obtained.
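As an illustrative sketch of this motion-vector computation (assuming 2-D pixel coordinates for each joint; the coordinates below are hypothetical):

```python
import numpy as np

def motion_vector(prev_position, next_position):
    """Motion vector of one joint between two consecutive frames (direction and displacement)."""
    return np.asarray(next_position, dtype=float) - np.asarray(prev_position, dtype=float)

# Hypothetical pixel coordinates of the right elbow joint a12 in two consecutive frames:
x = motion_vector(prev_position=(120, 80), next_position=(117, 79))  # -> array([-3., -1.])
displacement = np.linalg.norm(x)                                      # magnitude of the motion
```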
As an example, it may be considered that the displacement ranges are not directly comparable because the person's overall scale changes with distance from the shot; for example, when the participating object is farther from the shot than the reference object, the motion amplitude of the participating object's key points is smaller than that of the standard motion, and it is not reasonable to judge a key point as non-standard merely for that reason. Therefore, the total area occupied by all the key points of the participating object can be divided by the area occupied by all the key points of the reference object a to obtain the scaling ratio s.
In step S705, it is determined whether the motion deviation is within the tolerance range.
In some embodiments, for the motion set b, the ideal motion deviation of a joint point of a participating object is s·x. For example, in fig. 8D, the deviation between the actual motion vector y of the right elbow joint point b12 of the participating object and the ideal motion deviation s·x is greater than z (where z is the fault-tolerant deviation), so the motion deviation of b12 is not within the tolerance range, indicating that the motion of b12 is not standard; at this time, b12 is shown as a solid point (or in yellow/red).
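As an illustrative sketch of the tolerance check in step S705 (Python, with assumed names), the ideal deviation s·x is subtracted from the actual motion vector y and the residual magnitude is compared against the fault-tolerant deviation z:

```python
import numpy as np

def within_tolerance(actual_vector_y, standard_vector_x, scaling_s, tolerance_z):
    """Check whether a joint's motion deviation lies within the fault-tolerant range.

    actual_vector_y: motion vector of the participating object's joint (e.g. b12)
    standard_vector_x: motion vector of the corresponding reference joint (e.g. a12)
    scaling_s: scaling ratio between participating and reference keypoint areas
    tolerance_z: fault-tolerant deviation
    """
    ideal = scaling_s * np.asarray(standard_vector_x, dtype=float)          # ideal deviation s * x
    residual = np.linalg.norm(np.asarray(actual_vector_y, dtype=float) - ideal)
    return residual <= tolerance_z   # False -> joint shown as a solid (or yellow/red) point
```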
As an example, the motion deviation of each key point in each frame is analyzed and judged one by one, and the display mode of each key point is determined according to the judgment result. For an imitation video containing multiple participating objects, the key points may be difficult to distinguish; thus, in fig. 6, the user may select the participating object to view in the object selection box 602 in the upper left corner, e.g. selecting to view the key points of the participating objects b1 and b3.
In the embodiment of the application, the standard action of the reference object is first recorded (for example, motion data is acquired by a sensor), then the action of the participating object is recorded, and finally the degree of standardness is compared and real-time feedback is given, so that the participating object can intuitively perceive how standard its action is.
An exemplary structure of the motion processing apparatus implemented as a software module in video provided by the embodiment of the present application is described below with reference to fig. 2.
In some embodiments, as shown in fig. 2, the software modules in the action processing device 455 stored in the video of the memory 450 may include:
a display module 4551 for displaying at least one mock action implemented by the participant in the first video;
an analysis module 4552, configured to respond to a motion analysis trigger operation, synchronously display a plurality of mimic key points in the mimic motion and the mimic motion, and apply a display manner corresponding to the similarity to the plurality of mimic key points;
wherein the similarity characterizes a degree of agreement between the mimicking action and a reference action.
In the above solution, the display module 4551 is further configured to identify a plurality of key frames in the first video, and filter out key frames including the same simulated motion from the plurality of key frames; and sequentially playing the key frames remained after filtering.
In the above scheme, the display module 4551 is further configured to play the standard video frames in the first video at a first play speed, and play the non-standard video frames in the first video at a second play speed; wherein the first playback speed is greater than the second playback speed; the standard video frames are video frames that include impersonation actions for which the number of non-standard keypoints is below a number threshold; the non-standard video frames are video frames that include no less than a number threshold of non-standard keypoints in the mimic action.
In the above solution, the analyzing module 4552 is configured to determine similarities between the plurality of simulated key points and the plurality of reference key points of the reference action; applying a corresponding display mode to the simulated key points according to the similarity; wherein different similarities correspond to different display modes.
In the above solution, the analyzing module 4552 is further configured to display the standard key points and the non-standard key points in the simulated motion in different display manners; wherein the non-standard keypoints are simulation keypoints whose similarity to the corresponding reference keypoints of the reference motion is lower than a similarity threshold, and the standard keypoints are simulation keypoints whose similarity is not lower than the similarity threshold.
In the above scheme, when at least two participant objects, namely the first participant object and the second participant object, are included in the first video, the analysis module 4552 is further configured to display the standard key points and the non-standard key points in the simulated motion of the first participant object in different display manners; determining a part corresponding to the non-standard key point; displaying the standard key points of the part in the second participating object in the same display mode as the non-standard key points of the part in the first participating object.
In the foregoing solution, the analyzing module 4552 is further configured to determine the degree of significance corresponding to each of the simulated key points according to the similarity of each of the simulated key points, and display the simulated key points according to the degree of significance corresponding to each of the simulated key points; wherein there is a negative correlation between the degree of significance and the degree of similarity.
In the above solution, when the reference action of the reference object is derived from a second video, the analysis module 4552 is further configured to align a plurality of video frames included in the second video and a plurality of video frames included in the first video to a same time axis; performing the following processing for each time point in the time axis: determining a first video frame of the first video aligned to the time point and a second video frame of the second video aligned to the time point; determining similarities between a plurality of mimic keypoints in the first video frame and a plurality of reference keypoints in the second video frame.
In the foregoing solution, the analyzing module 4552 is further configured to perform at least one of the following operations: aligning a first audio frame in the first video with a second audio frame included in the second video, so as to align a video frame synchronized with the first audio frame and a video frame synchronized with the second audio frame; or extracting a mock action in a first video frame of the first video, determining a second video frame in the second video that includes the same reference action as the mock action, and aligning the first video frame with the second video frame.
In the above solution, the analysis module 4552 is further configured to, for a first mimicking keypoint in the first video frame and a first reference keypoint in the second video frame that is in the same position as the first mimicking keypoint: determining a first motion magnitude between the first mimic keypoint and a second mimic keypoint of the same location in a third video frame, wherein the third video frame is a video frame subsequent to the first video frame in the first video; determining a second motion amplitude between the first reference key point and a second reference key point at the same position in a fourth video frame, wherein the fourth video frame is a video frame after the second video frame in the second video; comparing the first motion magnitude and the second motion magnitude to obtain a similarity between the mimicking keypoint and a corresponding reference keypoint; wherein the first impersonation keypoint is any keypoint of a plurality of impersonation keypoints in the first video frame, and the first reference keypoint is any keypoint of a plurality of reference keypoints in the second video frame.
In the above solution, the analyzing module 4552 is further configured to determine an area occupied by all first mimic keypoints in the first video frame and an area occupied by all second reference keypoints in the second video frame; determining the ratio of the area occupied by all first simulation key points in the first video frame to the area occupied by all second reference key points in the second video frame as a scaling ratio; determining an amplitude ratio between the first motion amplitude and the second motion amplitude; determining a product between the amplitude ratio and the scaling as a similarity between the mimicking keypoint and the corresponding reference keypoint.
In the above scheme, when a plurality of participating objects are included in the first video, the analyzing module 4552 is further configured to display the plurality of participating objects included in the first video; in response to a selection operation for the plurality of participant objects, determining at least one selected participant object as a participant object for triggering an operation in response to the action analysis.
In the above solution, the analyzing module 4552 is further configured to identify an occlusion relationship between the participating objects in the first video, and automatically identify a participating object that meets an unoccluded condition as a participating object for triggering an operation in response to the action analysis; wherein the non-occlusion condition comprises at least one of: in the playing process of the first video, the duration which is not shielded is greater than a duration threshold; in the playing process of the first video, the proportion between the duration which is not shielded and the total duration of the first video is greater than a duration proportion threshold; the ratio between the area of the occluded part and the overall area of the participating object is less than an area ratio threshold.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the action processing method in the video according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, cause the processor to perform a method for processing actions in videos provided by embodiments of the present application, for example, the method for processing actions in videos shown in fig. 4A, 4B, 5 and 7, where the computer includes various computing devices including a smart terminal and a server.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, the computer-executable instructions may be in the form of programs, software modules, scripts or code written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and they may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, computer-executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, e.g., in one or more scripts in a hypertext markup language document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, computer-executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the present application has the following beneficial effects:
(1) the simulation key points of the simulated motion of the participated object are compared with the reference key points of the reference motion, and the simulation key points synchronously displayed with the simulated motion are displayed in a differentiation mode according to the similarity between the simulation key points and the reference key points, so that the participated object can sense the difference between the motion of the participated object and the reference motion in real time, the motion simulation efficiency is improved, and the motion analysis resources are saved.
(2) The scaling ratio is determined by the area occupied by the first simulation key point and the area occupied by the second reference key point, so that the similarity calculation error caused by inconsistent distances between the participating object and the reference object and the lens can be avoided, and the accuracy of motion analysis is improved.
(3) Since the recording start times of the first video and the second video may not be consistent, the first video and the second video need to be aligned to the same time axis, so that the time points of the video frames comprising the same action in the first video and the second video are consistent, thereby ensuring the correctness of the subsequent similarity comparison.
(4) The key points selected by the user can be displayed in a personalized manner, which not only avoids the reduction in action simulation efficiency caused by presenting overly complex information on the display interface, but also saves action analysis resources and display resources.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. A method for motion processing in video, the method comprising:
displaying at least one impersonation action implemented by the participant in the first video;
responding to action analysis trigger operation, synchronously displaying a plurality of imitation key points in the imitation action and the imitation action, and applying a display mode corresponding to the similarity to the plurality of imitation key points;
wherein the similarity characterizes a degree of agreement between the mimicking action and a reference action.
2. The method of claim 1, wherein displaying at least one impersonation action implemented by a participant in the first video comprises:
identifying a plurality of key frames in the first video and filtering out key frames comprising the same mimicking action among the plurality of key frames;
and sequentially playing the key frames remained after filtering.
3. The method of claim 1, wherein displaying at least one impersonation action implemented by a participant in the first video comprises:
playing the standard video frames in the first video at a first playing speed, and playing the non-standard video frames in the first video at a second playing speed;
wherein the first playback speed is greater than the second playback speed; the standard video frames are video frames that include impersonation actions for which the number of non-standard keypoints is below a number threshold; the non-standard video frames are video frames that include no less than a number threshold of non-standard keypoints in the mimic action.
4. The method of claim 1, wherein applying a display corresponding to similarity to the plurality of simulated keypoints comprises:
determining similarities between the plurality of mimic keypoints and a plurality of reference keypoints of the reference action;
applying a corresponding display mode to the simulated key points according to the similarity;
wherein different similarities correspond to different display modes.
5. The method of claim 4, wherein applying a corresponding display mode to the simulated keypoints according to the similarity comprises:
displaying the standard key points and the non-standard key points in the imitation action in different display modes;
wherein the non-standard keypoints are simulation keypoints whose similarity to the corresponding reference keypoints of the reference motion is lower than a similarity threshold, and the standard keypoints are simulation keypoints whose similarity is not lower than the similarity threshold.
6. The method of claim 5, wherein displaying the standard keypoints and the non-standard keypoints in the simulated action in different display manners when at least two participant objects, namely a first participant object and a second participant object, are included in the first video, comprises:
displaying the standard key points and the non-standard key points in the simulated action of the first participating object in different display modes;
determining a part corresponding to the non-standard key point;
displaying the standard key points of the part in the second participating object in the same display mode as the non-standard key points of the part in the first participating object.
7. The method of claim 4, wherein applying a corresponding display mode to the simulated keypoints according to the similarity comprises:
determining the degree of significance corresponding to each simulated key point according to the similarity of each simulated key point, and
displaying the simulated key points according to the significance degree corresponding to each simulated key point;
wherein there is a negative correlation between the degree of significance and the degree of similarity.
8. The method of claim 4,
when the reference action of a reference object is derived from a second video, the determining similarities between the plurality of mimicking keypoints and a plurality of reference keypoints of the reference action comprises:
aligning a plurality of video frames included in the second video and a plurality of video frames included in the first video to a same time axis;
performing the following processing for each time point in the time axis:
determining a first video frame of the first video aligned to the time point and a second video frame of the second video aligned to the time point;
determining similarities between a plurality of mimic keypoints in the first video frame and a plurality of reference keypoints in the second video frame.
9. The method of claim 8, wherein aligning the plurality of video frames included in the second video and the plurality of video frames included in the first video to a same time axis comprises:
performing at least one of the following operations:
aligning a first audio frame in the first video with a second audio frame included in the second video to align a video frame synchronized with the first audio frame and a video frame synchronized with the second audio frame;
the method includes extracting a mock action in a first video frame in the first video, determining a second video frame in the second video that includes a same reference action as the mock action, aligning the first video frame with the second video frame.
10. The method of claim 8, wherein determining the similarity between the plurality of mimic keypoints in the first video frame and the plurality of reference keypoints in the second video frame comprises:
for a first mimicking keypoint in the first video frame and a first reference keypoint in the second video frame that is co-located with the first mimicking keypoint, performing the following operations:
determining a first motion magnitude between the first mimic keypoint and a second mimic keypoint of the same location in a third video frame, wherein the third video frame is a video frame subsequent to the first video frame in the first video;
determining a second motion amplitude between the first reference key point and a second reference key point at the same position in a fourth video frame, wherein the fourth video frame is a video frame after the second video frame in the second video;
comparing the first motion magnitude and the second motion magnitude to obtain a similarity between the mimicking keypoint and a corresponding reference keypoint;
wherein the first impersonation keypoint is any keypoint of a plurality of impersonation keypoints in the first video frame, and the first reference keypoint is any keypoint of a plurality of reference keypoints in the second video frame.
11. The method of claim 1, wherein when a plurality of participant objects are included in the first video, prior to the synchronously displaying a plurality of impersonation key points in the impersonation action with the impersonation action in response to an action analysis trigger operation, the method further comprises:
displaying a plurality of participant objects included in the first video;
in response to a selection operation for the plurality of participant objects, determining at least one selected participant object as a participant object for triggering an operation in response to the action analysis.
12. The method of claim 1, wherein prior to said synchronously displaying a plurality of impersonation key points in the impersonation action with the impersonation action in response to an action analysis trigger operation, the method further comprises:
identifying an occlusion relationship between the participating objects in the first video, and automatically identifying the participating objects meeting an unoccluded condition as participating objects for responding to the action analysis trigger operation;
wherein the non-occlusion condition comprises at least one of:
in the playing process of the first video, the duration which is not shielded is greater than a duration threshold;
in the playing process of the first video, the proportion between the duration which is not shielded and the total duration of the first video is greater than a duration proportion threshold;
the ratio between the area of the occluded part and the overall area of the participating object is less than an area ratio threshold.
13. An apparatus for processing a motion in a video, comprising:
a display module to display at least one impersonation action performed by a participant in a first video;
the analysis module is used for responding to action analysis trigger operation, synchronously displaying a plurality of imitation key points in the imitation action and the imitation action, and applying a display mode corresponding to the similarity to the plurality of imitation key points;
wherein the similarity characterizes a degree of agreement between the mimicking action and a reference action.
14. An electronic device, comprising:
a memory for storing computer executable instructions;
a processor for implementing the method of action processing in video according to any one of claims 1 to 12 when executing computer executable instructions stored in the memory.
15. A computer-readable storage medium having stored thereon computer-executable instructions for implementing a method of action processing in a video according to any one of claims 1 to 12 when executed.
CN202011349038.1A 2020-11-26 2020-11-26 Action processing method and device in video, electronic equipment and storage medium Pending CN112528768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011349038.1A CN112528768A (en) 2020-11-26 2020-11-26 Action processing method and device in video, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112528768A true CN112528768A (en) 2021-03-19

Family

ID=74993909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011349038.1A Pending CN112528768A (en) 2020-11-26 2020-11-26 Action processing method and device in video, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112528768A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019085941A1 (en) * 2017-10-31 2019-05-09 腾讯科技(深圳)有限公司 Key frame extraction method and apparatus, and storage medium
CN108198601A (en) * 2017-12-27 2018-06-22 广东欧珀移动通信有限公司 Motion scores method, apparatus, equipment and storage medium
CN108537867A (en) * 2018-04-12 2018-09-14 北京微播视界科技有限公司 According to the Video Rendering method and apparatus of user's limb motion
CN109432753A (en) * 2018-09-26 2019-03-08 Oppo广东移动通信有限公司 Act antidote, device, storage medium and electronic equipment
CN109344796A (en) * 2018-10-22 2019-02-15 Oppo广东移动通信有限公司 Information processing method and device, electronic equipment, computer readable storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906650A (en) * 2021-03-24 2021-06-04 百度在线网络技术(北京)有限公司 Intelligent processing method, device and equipment for teaching video and storage medium
CN112906650B (en) * 2021-03-24 2023-08-15 百度在线网络技术(北京)有限公司 Intelligent processing method, device, equipment and storage medium for teaching video
CN113542855A (en) * 2021-07-21 2021-10-22 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and readable storage medium
CN114280978A (en) * 2021-11-29 2022-04-05 中国航空工业集团公司洛阳电光设备研究所 Tracking decoupling control method for photoelectric pod
CN114280978B (en) * 2021-11-29 2024-03-15 中国航空工业集团公司洛阳电光设备研究所 Tracking decoupling control method for photoelectric pod
CN114363685A (en) * 2021-12-20 2022-04-15 咪咕文化科技有限公司 Video interaction method and device, computing equipment and computer storage medium
CN114582029A (en) * 2022-05-06 2022-06-03 山东大学 Non-professional dance motion sequence enhancement method and system
CN114582029B (en) * 2022-05-06 2022-08-02 山东大学 Non-professional dance motion sequence enhancement method and system

Similar Documents

Publication Publication Date Title
CN112528768A (en) Action processing method and device in video, electronic equipment and storage medium
WO2022001593A1 (en) Video generation method and apparatus, storage medium and computer device
US11410570B1 (en) Comprehensive three-dimensional teaching field system and method for operating same
US11871109B2 (en) Interactive application adapted for use by multiple users via a distributed computer-based system
KR101936692B1 (en) Dance training apparatus and method using automatic generation of dance key motion
CN109034397A (en) Model training method, device, computer equipment and storage medium
CN105426827A (en) Living body verification method, device and system
KR20220093342A (en) Method, device and related products for implementing split mirror effect
CN103258338A (en) Method and system for driving simulated virtual environments with real data
CN113709543B (en) Video processing method and device based on virtual reality, electronic equipment and medium
CN110874859A (en) Method and equipment for generating animation
CN114363689B (en) Live broadcast control method and device, storage medium and electronic equipment
CN112669422A (en) Simulated 3D digital human generation method and device, electronic equipment and storage medium
Horiuchi et al. Virtual stage linked with a physical miniature stage to support multiple users in planning theatrical productions
CN112289239A (en) Dynamically adjustable explaining method and device and electronic equipment
CN114007064B (en) Special effect synchronous evaluation method, device, equipment and storage medium
EP3141985A1 (en) A gazed virtual object identification module, a system for implementing gaze translucency, and a related method
Blascovich et al. Immersive virtual environments and education simulations
CN112055257B (en) Video classroom interaction method, device, equipment and storage medium
CN116645247A (en) Panoramic view-based augmented reality industrial operation training system and method
CN110850976A (en) Virtual reality projection and retrieval system based on environment perception
CN110957021A (en) Logic thinking ability training method and system for autism patient
CN111435268A (en) Human-computer interaction method based on image recognition and reconstruction and system and device using same
CN115442519A (en) Video processing method, device and computer readable storage medium
CN111860206B (en) Image acquisition method and device, storage medium and intelligent equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40040534

Country of ref document: HK