CN117880588A - Video editing method, device, equipment and storage medium - Google Patents

Video editing method, device, equipment and storage medium

Info

Publication number
CN117880588A
Authority
CN
China
Prior art keywords
action
video
real
target object
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311619549.4A
Other languages
Chinese (zh)
Inventor
邢剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Partner Intelligent Technology Co ltd
Original Assignee
Wuxi Partner Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Partner Intelligent Technology Co ltd filed Critical Wuxi Partner Intelligent Technology Co ltd
Priority to CN202311619549.4A
Publication of CN117880588A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiment of the application provides a video editing method, apparatus, device, and storage medium. The method comprises the following steps: acquiring a real-time video of a target object; processing each image frame in the real-time video based on an action recognition model to recognize the body action of the target object contained in each image frame; and when the body action contained in an image frame is determined to match a preset action, clipping a video segment containing the body action from the real-time video. In this way, when the target object is recognized as performing the preset action, the video segment containing that action can be clipped automatically from the real-time video, so that highlight moments of a child's growth are captured efficiently and accurately, preserving treasured memories.

Description

Video editing method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of video processing, and in particular to a video editing method, apparatus, device, and storage medium.
Background
Currently, children are the center of many families. With the development of technology, various intelligent devices for caring for children have appeared. These smart devices can capture and save video of a child in real time for parents busy with work to view at their leisure.
It will be appreciated that the video of the child captured by the smart device inevitably contains segments of highlight moments in the child's growth. Parents who do not want to miss these moments must watch the video from beginning to end and manually cut out the desired segments.
However, in practice this approach is time-consuming and inefficient; moreover, because home users differ in their video-editing skills, the highlight segments may also be captured inaccurately.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a video editing method, apparatus, device, and storage medium, so as to capture highlight moments of a child's growth efficiently and accurately, preserving treasured memories.
According to a first aspect of embodiments of the present application, there is provided a video editing method, the method comprising:
acquiring a real-time video of a target object;
processing each image frame in the real-time video based on an action recognition model to recognize the body action of the target object contained in each image frame;
and when the body action contained in an image frame is determined to match a preset action, clipping a video segment containing the body action from the real-time video.
In some embodiments, the processing each image frame in the real-time video based on the action recognition model to recognize the body action of the target object contained in each image frame includes:
processing each image frame based on multi-layer temporal graph convolution and spatial graph convolution algorithms, respectively, to recognize the body action of the target object contained in each image frame.
In some embodiments, the determining that the body action contained in the image frame matches a preset action comprises:
extracting a first skeleton graph feature of the target object from the image frame;
determining a first pose corresponding to the first skeleton graph feature;
and determining, according to the first pose, whether the body action contained in the image frame matches the preset action.
In some embodiments, the determining, according to the first pose, whether the body action contained in the image frame matches the preset action includes:
obtaining the similarity between the first pose and a preset pose, wherein the preset pose is the pose corresponding to a skeleton graph feature of the preset action;
determining that the body action matches the preset action when the similarity is greater than a preset similarity threshold;
and determining that the body action does not match the preset action when the similarity is less than or equal to the preset similarity threshold.
In some embodiments, the clipping a video segment containing the body action from the real-time video comprises:
identifying, from the real-time video, all image frames containing the action course of the body action, wherein the action course includes the onset of the body action, the duration of the body action, and the end of the body action;
and arranging all the image frames in chronological order to obtain the video segment of the body action.
In some embodiments, the acquiring real-time video of the target object includes:
acquiring a real-time video of the target object through a camera; the camera can rotate along with the action of the target object so as to ensure that real-time video of the target object can be captured.
In some embodiments, before the capturing the real-time video of the target object, the method further includes:
acquiring training data, wherein the training data comprises at least one set of training samples, and each set of training samples comprises a test object, pose information of the test object, and label data of the test object;
and performing training on the training data based on a convolutional neural network algorithm and a supervised learning algorithm to obtain the action recognition model.
According to a second aspect of embodiments of the present application, there is provided a video editing apparatus, including:
the acquisition module is used for acquiring real-time video of the target object;
the recognition module is used for processing each image frame in the real-time video based on the action recognition model, so as to recognize the body action of the target object contained in each image frame;
and the clipping module is used for clipping a video segment containing the body action from the real-time video when the body action contained in the image frame is determined to match a preset action.
According to a third aspect of embodiments of the present application, there is provided a video editing apparatus, comprising: a processor; the processor is configured to execute a computer executable program or instructions in a memory to cause the video clip device to perform the method according to any of the first aspects of the embodiments of the present application.
According to a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium storing a computer executable program or instructions arranged to perform the method of any of the first aspects of embodiments of the present application.
With the video editing method, apparatus, device, and storage medium provided by the embodiments of the present application, a real-time video of a target object is acquired; each image frame in the real-time video is processed based on an action recognition model to recognize the body action of the target object contained in each image frame; and when the body action contained in an image frame is determined to match a preset action, a video segment containing the body action is clipped from the real-time video. In this way, when the target object is recognized as performing the preset action, the video segment containing that action can be clipped automatically from the real-time video, so that highlight moments of a child's growth are captured efficiently and accurately, preserving treasured memories.
The foregoing description is only an overview of the technical solutions of the embodiments of the present application. To make the technical means of the embodiments more clearly understood, and to make the foregoing and other objects, features, and advantages of the embodiments more comprehensible, a detailed description of the present application follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flowchart of a video editing method according to an embodiment of the present application.
Fig. 2 is a schematic block diagram of a video editing apparatus provided in an embodiment of the present application.
Fig. 3 is a schematic block diagram of a video editing apparatus provided in an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, based on the embodiments herein, which would be apparent to one of ordinary skill in the art without making any inventive effort, are intended to be within the scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the applications herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
The terms "comprising" and "having", and any variations thereof, in the description and claims of the present application and in the description of the drawings are intended to cover a non-exclusive inclusion. The word "a" or "an" does not exclude the presence of a plurality.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of the phrase "an embodiment" in various places in the specification do not necessarily all refer to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will appreciate, both explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Furthermore, the terms "first", "second", and the like in the description and claims of the present application or in the above-described figures are used to distinguish between different objects rather than to describe a particular sequential order, and features so designated may explicitly or implicitly include one or more of such features.
In the description of the present application, unless otherwise indicated, the meaning of "plurality" means two or more (including two), and similarly, "plural sets" means two or more (including two).
In the description of the present application, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly. For a mechanical structure, "connected" or "coupled" may refer to a physical connection, which may be fixed (e.g., by a fastener such as a screw or bolt), detachable (e.g., a snap-fit connection), or integral (e.g., a welded, glued, or integrally formed connection). For a circuit structure, "connected" or "coupled" may refer to a physical connection, an electrical connection, or a signal connection; the connection may be direct, i.e., a physical connection, or indirect through at least one intermediate element, so long as circuit communication or internal communication between the two elements is achieved. A signal connection may be made through a circuit, or through a medium such as radio waves. The specific meaning of these terms in this application will be understood by those of ordinary skill in the art in a specific context.
In the current social environment, children are the center of the family, and parents and family members often do not want to miss any moment of a child's growth. Based on this practical need, various intelligent child-care devices have emerged.
Such an intelligent device generally comprises a monitoring end and a control end. The monitoring end is typically arranged in the main activity space of the monitored person, so as to capture and store video of the monitored person in real time for guardians to watch at their leisure. It will be appreciated that the video of the child captured by the smart device inevitably includes segments of highlight moments in the child's growth, for example, the first roll-over, the first time sitting up, the first crawl, the first steps, and so on.
Guardians who do not want to miss these moments must watch the video from beginning to end and manually cut out the desired segments, which is an extremely inefficient way of editing. Meanwhile, because guardians differ in their video-editing skills, this approach may capture the highlight segments inaccurately, causing those moments to be missed.
Based on this, the embodiment of the application provides a video editing method which, upon recognizing that the monitored person performs a preset action, can automatically clip the video segment containing that body action from the real-time video, so as to capture highlight moments of the monitored person's growth efficiently and accurately and preserve treasured memories for the guardian.
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It should be noted that, in the case of no conflict, different technical features may be combined with each other.
Fig. 1 is a schematic flowchart of a video editing method according to an embodiment of the present application. As shown in fig. 1, the video clipping method of the embodiment of the present application may be performed by a video clipping apparatus, which is provided in a video clipping device and has a camera, and specifically, the video clipping method of the embodiment of the present application may include the steps of:
step 110, obtaining real-time video of the target object.
The target object is the monitored object and may be, for example, an infant. The real-time video contains at least the body actions of the target object, and may additionally include the facial expressions of the target object.
In some scenarios, the video clipping apparatus may acquire real-time video of the target object through a camera. Illustratively, the camera is capable of rotating with the motion of the target object to ensure that real-time video of the target object can be captured.
For example, in a scenario where the person to be monitored is an infant, when the infant turns over, the camera can adjust the angle along with the turning over of the infant, so that the facial image of the infant can be obtained while the body motion of the infant is obtained.
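The camera-following behavior in this scenario can be sketched as a simple proportional pan controller that re-centres the detected subject in the frame. The function below, including its name and parameters (`pan_adjustment`, `deadband`, `gain_deg`), is an illustrative assumption; the patent does not specify how the camera tracks the target.

```python
def pan_adjustment(frame_width: int, bbox: tuple, deadband: float = 0.1,
                   gain_deg: float = 2.0) -> float:
    """Return a pan angle (degrees) that re-centres the subject.

    bbox is (x_min, x_max) of the detected subject in pixels; a positive
    result means "pan right". Within the deadband no correction is made.
    """
    centre = (bbox[0] + bbox[1]) / 2.0
    # Normalised horizontal offset of the subject in [-0.5, 0.5]
    offset = centre / frame_width - 0.5
    if abs(offset) < deadband:
        return 0.0                      # subject near centre: hold still
    return gain_deg * offset / 0.5      # proportional correction


# Subject centred -> no movement; subject at the right edge -> pan right.
print(pan_adjustment(640, (300, 340)))   # 0.0
print(pan_adjustment(640, (600, 640)))   # 1.875
```

A real tracker would feed detections frame by frame; the controller logic stays the same.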
Step 120, processing each image frame in the real-time video based on an action recognition model to recognize the body action of the target object contained in each image frame.
The action recognition model is used for analyzing each image frame in the real-time video. Specifically, when each image frame is processed based on the action recognition model, the frame may be processed based on multi-layer temporal graph convolution and spatial graph convolution algorithms, so as to recognize the body action of the target object contained in each image frame.
Specifically, the video editing apparatus may take each frame of the acquired real-time video, in chronological order, as input to the action recognition model and process it with the multi-layer temporal graph convolution and spatial graph convolution algorithms, so as to recognize the body action of the target object.
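The frame-by-frame processing described above can be sketched as follows. The `model` callable is a hypothetical stand-in for the multi-layer temporal and spatial graph convolution network; the sketch only shows the chronological feed-and-record loop.

```python
from typing import Callable, List, Tuple


def recognise_actions(frames: List[object],
                      model: Callable[[object], str]) -> List[Tuple[int, str]]:
    """Feed each frame of the real-time video to the action-recognition
    model in chronological order and record (frame_index, action) pairs.

    `model` is any callable mapping a frame to an action label; in the
    patent it would be the graph-convolution network.
    """
    return [(i, model(frame)) for i, frame in enumerate(frames)]


# Toy stand-in: pretend each "frame" is already its action label.
frames = ["lying", "lying", "rolling", "rolling", "lying"]
print(recognise_actions(frames, model=lambda f: f))
```

In a real system `frames` would be decoded images and `model` a trained network; the loop structure is unchanged.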
Here, a body action refers to an action of the monitored person, such as lying down, turning over, raising the arms, standing, walking, sitting down, or falling down.
Specifically, in this embodiment, the action recognition model may be pre-trained based on a convolutional neural network algorithm and a supervised learning algorithm, so that the expression, action, and sleep state of the target object in each image frame of the real-time video can be analyzed.
For example, the process of training the action recognition model may include: acquiring training data, the training data comprising at least one set of training samples, each set comprising a test object, pose information of the test object, and label data of the test object. The pose information of the test object is obtained by extracting skeleton graph features from the image frames of a video of the test object and deriving the pose from those features.
After the training data is obtained, the test object of each set of training samples and its pose information are fed as input to the convolutional neural network and supervised learning algorithms to obtain a corresponding output result; the output result is compared with the label data of the test object to obtain a difference value, and the parameters of the model are updated based on the difference value and a loss function, so that the action recognition model is obtained through iterative training.
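The predict–compare–update loop described above can be illustrated with a deliberately minimal stand-in: a perceptron over pose-feature vectors instead of the patent's convolutional network. All names and the specific update rule here are illustrative assumptions; only the overall supervised-learning structure (compare output with the label, update parameters from the difference) mirrors the text.

```python
def train_action_model(samples, lr=0.1, epochs=50):
    """Minimal supervised-learning stand-in for the patent's training:
    a perceptron over pose-feature vectors.

    Each sample is (pose_features, label) with label in {0, 1}, e.g.
    1 for "rolling over" and 0 otherwise. Real training would use a
    convolutional network over skeleton graphs.
    """
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred                              # difference vs. label data
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err                               # parameter update
    return w, b


# Linearly separable toy pose features.
data = [([0.0, 0.1], 0), ([0.1, 0.0], 0), ([0.9, 1.0], 1), ([1.0, 0.8], 1)]
w, b = train_action_model(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
print([predict(x) for x, _ in data])   # [0, 0, 1, 1]
```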
After the body action of the target object contained in each image frame is recognized, step 130 is performed:
Step 130, when the body action contained in an image frame is determined to match a preset action, clipping a video segment containing the body action from the real-time video.
Here, the preset action is a predetermined action generally recognized as a highlight moment of a child's growth, such as the first roll-over, the first time sitting up, the first crawl, the first time standing, or the first steps. Of course, in practical applications the user may also configure the preset action autonomously; this embodiment is not particularly limited in this respect.
Specifically, when determining whether the body action contained in an image frame matches the preset action, the video editing apparatus may first extract a first skeleton graph feature of the target object from the image frame, and determine the first pose corresponding to that feature; after the first pose is obtained, it determines, according to the first pose, whether the body action contained in the image frame matches the preset action.
In one implementation, the similarity between the first pose and a preset pose is obtained, wherein the preset pose is the pose corresponding to a skeleton graph feature of the preset action. When the similarity is greater than a preset similarity threshold, it is determined that the body action matches the preset action; when the similarity is less than or equal to the preset similarity threshold, it is determined that the body action does not match the preset action.
For example, assume that the similarity between the acquired first pose and the preset pose is 85%. If the preset similarity threshold is 80%, then since 85% is greater than 80%, it may be determined that the body action matches the preset action. If the preset similarity threshold is 90%, then since 85% is less than 90%, it may be determined that the body action does not match the preset action.
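The threshold decision illustrated above can be sketched as follows. Cosine similarity over pose vectors is an illustrative assumption, since the patent does not specify the similarity measure; only the strict-greater-than rule mirrors the text.

```python
def matches_preset(first_pose, preset_pose, threshold=0.8):
    """Compare a detected pose with the preset pose via cosine
    similarity; the action counts as a match only when the similarity
    strictly exceeds the threshold (<= threshold means no match,
    mirroring the decision rule in the text)."""
    dot = sum(a * b for a, b in zip(first_pose, preset_pose))
    norm = lambda v: sum(x * x for x in v) ** 0.5
    sim = dot / (norm(first_pose) * norm(preset_pose))
    return sim > threshold


pose, preset = [1.0, 0.2, 0.0], [1.0, 0.0, 0.0]
print(matches_preset(pose, preset, threshold=0.80))   # similarity ~0.98 -> True
print(matches_preset(pose, preset, threshold=0.99))   # False
```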
It will be appreciated that the above implementation is merely an example; this embodiment does not specifically limit how to determine, according to the first pose, whether the body action contained in the image frame matches the preset action.
When the body action contained in the image frame is determined to match a preset action, the video editing apparatus automatically clips the video segment containing the body action from the real-time video.
Specifically, the video editing apparatus may identify, from the real-time video, all image frames containing the action course of the body action, and then arrange those image frames in chronological order to obtain the video segment of the body action.
Here, the action course includes the onset of the body action, its duration, and its end. For example, for a walking action, the course may include standing up, walking with the left and right feet alternating, standing, and sitting down.
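The two steps above, selecting all frames of the action course and then ordering them chronologically, can be sketched as:

```python
def clip_action(frames, is_action_frame):
    """Collect every frame that belongs to the action course (onset,
    duration, end) and return them in chronological order as the clip.

    `frames` is a list of (timestamp, frame) pairs; `is_action_frame`
    is a predicate over the frame content. Both names are illustrative.
    """
    selected = [(t, f) for t, f in frames if is_action_frame(f)]
    selected.sort(key=lambda tf: tf[0])      # chronological order
    return [f for _, f in selected]


# Toy walking course: the clip keeps only the course frames, in order.
video = [(0, "sit"), (1, "stand"), (2, "step"), (3, "step"), (4, "sit")]
walking = {"stand", "step"}
print(clip_action(video, lambda f: f in walking))   # ['stand', 'step', 'step']
```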
After the video editing apparatus obtains the video segment of the body action, it may store the segment in memory; a guardian can then view it by tapping a button for viewing highlight moments, or transfer it to the guardian's own storage device to preserve treasured memories.
Further, in some embodiments, the guardian may also rate a highlight segment while viewing it. In this way, the video editing apparatus can update the automatic editing algorithm according to the real-time video and the user's preference for the clipped segments, continuously optimizing the algorithm so that it clips the highlight moments of a child's growth more accurately.
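One way the described feedback loop could work is to nudge the match threshold based on guardian ratings. This particular update rule is purely an illustrative assumption, not the patent's specified algorithm.

```python
def update_threshold(threshold, liked, step=0.02, lo=0.5, hi=0.95):
    """Nudge the match threshold from guardian feedback: a disliked
    clip suggests the matcher was too permissive (raise the threshold);
    a liked clip lets it relax slightly. Clamped to [lo, hi].

    Illustrative assumption only; the patent does not specify how the
    editing algorithm is updated from user preference.
    """
    threshold += -step if liked else step
    return max(lo, min(hi, threshold))


t = 0.80
t = update_threshold(t, liked=False)   # false positive -> stricter
print(round(t, 2))   # 0.82
t = update_threshold(t, liked=True)    # good clip -> relax slightly
print(round(t, 2))   # 0.8
```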
With the video editing method of the embodiment of the present application, a real-time video of the target object is acquired; each image frame in the real-time video is processed based on an action recognition model to recognize the body action of the target object contained in each image frame; and when the body action contained in an image frame is determined to match a preset action, a video segment containing the body action is clipped from the real-time video. In this way, when the target object is recognized as performing the preset action, the video segment containing that action can be clipped automatically from the real-time video, so that highlight moments of a child's growth are captured efficiently and accurately, preserving treasured memories.
Fig. 2 is a schematic block diagram of a video editing apparatus according to an embodiment of the present application. As shown in fig. 2, the video editing apparatus of the embodiment of the present application may be used to perform the method in the above method embodiment. Specifically, the video editing apparatus 200 of the embodiment of the present application may include: an acquisition module 210, a recognition module 220, and a clipping module 230.
The acquisition module 210 is configured to acquire a real-time video of the target object.
The recognition module 220 is configured to process each image frame in the real-time video based on the action recognition model, so as to recognize the body action of the target object contained in each image frame.
The clipping module 230 is configured to clip a video segment containing the body action from the real-time video when it is determined that the body action contained in the image frame matches a preset action.
In one implementation, the recognition module 220 may specifically be configured to process each image frame based on multi-layer temporal graph convolution and spatial graph convolution algorithms, respectively, to recognize the body action of the target object contained in each image frame.
In one implementation, the clipping module 230 may be specifically configured to extract the first skeleton graph feature of the target object from the image frame; determine the first pose corresponding to the first skeleton graph feature; and determine, according to the first pose, whether the body action contained in the image frame matches the preset action.
In one implementation, the clipping module 230 may be specifically configured to obtain the similarity between the first pose and a preset pose, wherein the preset pose is the pose corresponding to a skeleton graph feature of the preset action; determine that the body action matches the preset action when the similarity is greater than a preset similarity threshold; and determine that the body action does not match the preset action when the similarity is less than or equal to the preset similarity threshold.
In one implementation, the clipping module 230 may be specifically configured to identify, from the real-time video, all image frames containing the action course of the body action, wherein the action course includes the onset, duration, and end of the body action; and arrange all the image frames in chronological order to obtain the video segment of the body action.
In one implementation, the obtaining module 210 may specifically be configured to obtain, by using a camera, a real-time video of the target object; the camera can rotate along with the action of the target object so as to ensure that real-time video of the target object can be captured.
In one implementation, the obtaining module 210 may be further configured to obtain training data, where the training data includes at least one set of training samples, and each set of training samples includes a test object, pose information of the test object, and label data of the test object; and to train a model on the training data based on a convolutional neural network algorithm and a supervised learning algorithm to obtain the action recognition model.
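The training step can be sketched as supervised learning on (pose, label) samples. For a self-contained illustration, a logistic classifier trained by gradient descent stands in for the convolutional neural network named above; the sample fields mirror the training data described (pose information plus label data per test object), and all names are hypothetical.

```python
import numpy as np

# Supervised training on labeled pose samples. Each sample is a dict
# with "pose" (feature vector) and "label" (0/1 action label).
def train_classifier(samples, lr=0.5, epochs=1000):
    X = np.array([s["pose"] for s in samples], dtype=float)
    y = np.array([s["label"] for s in samples], dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid prediction
        grad = p - y                             # cross-entropy gradient
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(w, b, pose):
    # 1 if the linear score is positive, else 0
    return int((np.asarray(pose, dtype=float) @ w + b) > 0)
```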
The video editing apparatus of the embodiments of the present application may be used to execute the technical solutions of the foregoing method embodiments; its implementation principles and technical effects are similar and are not repeated here.
Fig. 3 is a schematic block diagram of a video clipping device according to an embodiment of the present application. As shown in Fig. 3, the video clipping device 300 of the embodiment of the present application may include the video editing apparatus shown in Fig. 2; alternatively, the video clipping device 300 may include a processor 310, where the processor 310 is configured to execute computer-executable programs or instructions in a memory 320 to cause the video clipping device 300 to perform the method of the embodiment shown in Fig. 1.
An embodiment of the present application also provides a computer-readable storage medium storing a computer-executable program or instructions configured to perform the method of the embodiment shown in fig. 1.
The computer-readable storage medium includes, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. The memory stores program code or instructions, the program code includes computer operating instructions, and the processor executes the program code or instructions of the above-described methods stored in the memory.
The definition of the memory and the processor may refer to the description of the foregoing electronic device embodiments, and will not be repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into modules or units is merely a logical functional division, and other divisions are possible in actual implementations; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The functional units or modules in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions that cause an electronic device (which may be a personal computer, a server, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Those skilled in the art will appreciate that although some embodiments herein include certain features that are included in other embodiments rather than others, combinations of features of different embodiments are meant to be within the scope of the present application and to form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The above embodiments are merely intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A method of video editing, the method comprising:
acquiring a real-time video of a target object;
processing each image frame in the real-time video based on an action recognition model to identify the body action of the target object contained in each image frame;
and when it is determined that the body action contained in the image frame matches a preset action, clipping a video clip containing the body action from the real-time video.
2. The method of claim 1, wherein the processing each image frame in the real-time video based on the action recognition model to identify the body action of the target object contained in each image frame comprises:
processing each image frame based on a multi-layer temporal graph convolution algorithm and a spatial graph convolution algorithm, respectively, to identify the body action of the target object contained in each image frame.
3. The method of claim 2, wherein the determining that the body action contained in the image frame matches the preset action comprises:
extracting a first skeleton graph feature of the target object from the image frame;
determining, according to the first skeleton graph feature, a first pose corresponding to the first skeleton graph feature;
and determining, according to the first pose, whether the body action contained in the image frame matches the preset action.
4. The method according to claim 3, wherein the determining, according to the first pose, whether the body action contained in the image frame matches the preset action comprises:
obtaining a similarity between the first pose and a preset pose, wherein the preset pose is the pose corresponding to a skeleton graph feature of the preset action;
determining that the body action matches the preset action when the similarity is greater than a preset similarity threshold;
and determining that the body action does not match the preset action when the similarity is less than or equal to the preset similarity threshold.
5. The method of claim 1, wherein the clipping a video clip containing the body action from the real-time video comprises:
identifying, from the real-time video, all image frames containing the course of the body action, wherein the course of the action includes the onset of the body action, the duration of the body action, and the end of the body action;
and sorting all of the image frames in chronological order to obtain the video clip of the body action.
6. The method of claim 1, wherein the acquiring a real-time video of the target object comprises:
acquiring the real-time video of the target object through a camera, wherein the camera can rotate to follow the movement of the target object, ensuring that the real-time video of the target object is captured.
7. The method of claim 1, wherein before the acquiring a real-time video of the target object, the method further comprises:
acquiring training data, wherein the training data includes at least one set of training samples, and each set of training samples includes a test object, pose information of the test object, and label data of the test object;
and training a model on the training data based on a convolutional neural network algorithm and a supervised learning algorithm to obtain the action recognition model.
8. A video editing apparatus, comprising:
an obtaining module, configured to acquire a real-time video of a target object;
an identifying module, configured to process each image frame in the real-time video based on an action recognition model, so as to identify the body action of the target object contained in each image frame;
and a clipping module, configured to clip, from the real-time video, a video clip containing the body action when it is determined that the body action contained in the image frame matches a preset action.
9. A video editing device, comprising: a processor, wherein the processor is configured to execute computer-executable programs or instructions in a memory to cause the video editing device to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer-executable program or instructions arranged to perform the method of any one of claims 1 to 7.
CN202311619549.4A 2023-11-27 2023-11-27 Video editing method, device, equipment and storage medium Pending CN117880588A (en)

Publications (1)

Publication Number Publication Date
CN117880588A true CN117880588A (en) 2024-04-12


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106982314A * 2017-03-30 2017-07-25 联想(北京)有限公司 Shooting adjustment device and adjustment method therefor
CN110166650A * 2019-04-29 2019-08-23 北京百度网讯科技有限公司 Video collection generation method and device, computer equipment, and readable medium
CN112464808A * 2020-11-26 2021-03-09 成都睿码科技有限责任公司 Rope-skipping posture and count recognition method based on computer vision
CN113709384A * 2021-03-04 2021-11-26 腾讯科技(深圳)有限公司 Deep-learning-based video editing method, related device, and storage medium
CN113887387A * 2021-09-29 2022-01-04 北京万达文旅规划设计院有限公司 Ski-field target image generation method, system, and server
CN114821771A * 2022-04-11 2022-07-29 影石创新科技股份有限公司 Method for determining a clipping object in an image, video clipping method, and related device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination