CN114363720B - Video slicing method, system, equipment and medium based on computer vision - Google Patents

Video slicing method, system, equipment and medium based on computer vision

Info

Publication number
CN114363720B
CN114363720B (granted from application CN202111492456.0A)
Authority
CN
China
Prior art keywords
hand
node
video
determining
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111492456.0A
Other languages
Chinese (zh)
Other versions
CN114363720A (en)
Inventor
郝禄国
曾文彬
罗杰强
李泽伟
杨琳
葛海玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Hison Computer Technology Co ltd
Original Assignee
Guangzhou Hison Computer Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Hison Computer Technology Co ltd
Priority to CN202111492456.0A
Publication of CN114363720A
Application granted
Publication of CN114363720B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a computer-vision-based video slicing method, system, equipment and medium. The method comprises: decoding a video to be sliced and outputting video frames; extracting hand keypoints from the video frames and determining a hand keypoint time-series stream; matching hand actions against the hand keypoint time-series stream through a sliding window and determining hand behavior information; performing target detection on the video frames and determining node triplets; performing position detection on the node distance according to the hand behavior information and determining behavior triplets; and clipping the video to be sliced according to the action start/stop timestamps in the behavior triplets to determine the video slices. The start and stop timestamps of a slice can be determined from the interaction between a hand and an object, so video slicing is performed automatically and its efficiency is improved. The method can be widely applied in the technical field of video slicing.

Description

Video slicing method, system, equipment and medium based on computer vision
Technical Field
The invention relates to the technical field of video slicing, and in particular to a computer-vision-based video slicing method, system, equipment and medium.
Background
As video technology matures, video slicing techniques continue to evolve. Video slicing refers to extracting the valuable, highlight or otherwise noteworthy time periods from a long video. Existing approaches fall into manual and automatic video slicing, depending on the application scenario. In manual video slicing, an editor cuts the video afterwards with video editing software; in automatic video slicing, the video is cut automatically from real-time dotting (event-marking) data recorded during capture, which restricts it to scenarios such as game recording where process data can be acquired in real time. For skill assessment of practical examinations, the examination is filmed with a camera and scored by watching the recording manually, which is time-consuming. Because existing automatic video slicing requires real-time dotting data, it is unsuitable for skill assessment scoring scenarios.
Disclosure of Invention
In view of this, embodiments of the present invention provide an efficient computer-vision-based video slicing method, system, equipment and medium, so as to slice video automatically.
In one aspect, the present invention provides a computer-vision-based video slicing method, comprising:
decoding a video to be sliced and outputting video frames;
extracting hand keypoints from the video frames and determining a hand keypoint time-series stream;
matching hand actions against the hand keypoint time-series stream through a sliding window and determining hand behavior information, wherein the hand behavior information comprises a hand action type and action start/stop timestamps;
performing target detection on the video frames and determining a node triplet, wherein the node triplet comprises a hand node, an object node and the node distance between them;
performing position detection on the node distance according to the hand behavior information and determining a behavior triplet, wherein the behavior triplet comprises the hand node, the object node and the hand behavior information;
and clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining a video slice.
Optionally, extracting hand keypoints from the video frames and determining the hand keypoint time-series stream comprises:
performing hand recognition on each video frame through a hand recognition algorithm and detecting whether the frame contains a hand target;
when a hand target is detected in a video frame, extracting keypoints from the frame and determining the hand keypoints;
and determining the hand keypoint time-series stream from the hand keypoints and the timestamps of the video frames.
Optionally, matching hand actions against the hand keypoint time-series stream through the sliding window and determining the hand behavior information comprises:
combining the hand keypoint time series through a sliding window and determining time windows, wherein a time window represents the hand keypoint data of multiple consecutive video frames;
and matching hand actions against each time window through an action recognition algorithm and determining the hand behavior information.
Optionally, performing target detection on the video frames and determining a node triplet comprises:
detecting the hand and the objects in each video frame through a target detection algorithm and determining hand nodes and object nodes, wherein a hand node represents the hand coordinates and an object node represents an object name and object coordinates;
calculating the distance between the hand node and the object node and determining the node distance;
and combining the hand node, the object node and the node distance into a node triplet.
Optionally, performing position detection on the node distance according to the hand behavior information and determining a behavior triplet comprises:
determining the expected distance between an object and the hand according to the hand action type in the hand behavior information;
and checking the node distance against that expected distance, and when the check passes, determining the hand behavior information, the hand node and the object node as a behavior triplet.
Optionally, clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining the video slice comprises:
determining slice video file name information from the behavior triplet;
and slicing the video to be sliced at the action start/stop timestamps in the behavior triplet, naming the slice video according to the slice video file name information, and determining the video slice.
Optionally, matching hand actions against the time window through an action recognition algorithm and determining the hand behavior information comprises:
acquiring a hand action model, wherein the hand action model is obtained by training an action recognition algorithm on preset hand actions;
and matching the hand keypoints of the time window against the hand action model through the action recognition algorithm and determining the hand behavior information.
In another aspect, an embodiment of the present invention further discloses a computer-vision-based video slicing system, comprising:
a first module for decoding a video to be sliced and outputting video frames;
a second module for extracting hand keypoints from the video frames and determining a hand keypoint time-series stream;
a third module for matching hand actions against the hand keypoint time-series stream through a sliding window and determining hand behavior information, wherein the hand behavior information comprises a hand action type and action start/stop timestamps;
a fourth module for performing target detection on the video frames and determining a node triplet, wherein the node triplet comprises a hand node, an object node and the node distance between them;
a fifth module for performing position detection on the node distance according to the hand behavior information and determining a behavior triplet, wherein the behavior triplet comprises the hand node, the object node and the hand behavior information;
and a sixth module for clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining a video slice.
In another aspect, an embodiment of the present invention further discloses an electronic device comprising a processor and a memory;
the memory is used for storing a program;
and the processor executes the program to implement the method described above.
In another aspect, an embodiment of the present invention further discloses a computer-readable storage medium storing a program, the program being executed by a processor to implement the method described above.
In another aspect, an embodiment of the present invention further discloses a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the foregoing method.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects. The embodiment of the invention decodes the video to be sliced and outputs video frames; extracts hand keypoints from the video frames and determines a hand keypoint time-series stream; matches hand actions against the hand keypoint time-series stream through a sliding window and determines hand behavior information, wherein the hand behavior information comprises a hand action type and action start/stop timestamps; performs target detection on the video frames and determines a node triplet, wherein the node triplet comprises a hand node, an object node and the node distance between them; performs position detection on the node distance according to the hand behavior information and determines a behavior triplet, wherein the behavior triplet comprises the hand node, the object node and the hand behavior information; and clips the video to be sliced according to the action start/stop timestamps in the behavior triplet to determine a video slice. The start and stop timestamps of a slice can thus be determined from the interaction between the hand and an object, realizing automatic video slicing and improving its efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; a person skilled in the art may obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a video slicing method based on computer vision according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application, not to limit it.
In the skill assessment scenario of practical examinations, the examination is filmed on site by a camera and the stored video is scored by watching the whole examination process, which consumes a lot of time. To save this time, a computer should automatically produce video slices of the key steps of the experimental assessment. Based on this experimental skill assessment scenario, the invention recognizes the actions that trigger the start and end time points of each key action and thereby slices the video automatically.
Referring to fig. 1, an embodiment of the present invention provides a computer-vision-based video slicing method, comprising:
decoding a video to be sliced and outputting video frames;
extracting hand keypoints from the video frames and determining a hand keypoint time-series stream;
matching hand actions against the hand keypoint time-series stream through a sliding window and determining hand behavior information, wherein the hand behavior information comprises a hand action type and action start/stop timestamps;
performing target detection on the video frames and determining a node triplet, wherein the node triplet comprises a hand node, an object node and the node distance between them;
performing position detection on the node distance according to the hand behavior information and determining a behavior triplet, wherein the behavior triplet comprises the hand node, the object node and the hand behavior information;
and clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining a video slice.
In the embodiment of the invention, the video to be sliced is acquired and decoded, and video frames are output continuously. The hand keypoints in each video frame are extracted and combined with the frame timestamps to generate a hand keypoint time-series stream. The stream is grouped through a sliding window and matched against hand actions to obtain hand behavior information, which represents the hand action type found by the matching together with the start and end timestamps of the action derived from the time-series stream. In parallel, target detection is performed on each video frame to detect the hand and the other objects in the frame, yielding node triplets, where a node triplet comprises a hand node, an object node and the node distance between them. The node distance is then position-checked according to the hand action type in the hand behavior information; for triplets whose position is consistent, the node distance is replaced by the hand behavior information, generating a behavior triplet. Finally, the video to be sliced is cut at the action start/stop timestamps stored in the behavior triplets, generating video slices of the key actions.
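As an illustration of the decoding step above, the following minimal sketch reads a video file and yields timestamped frames. OpenCV is an assumption here, since the embodiment does not prescribe a particular decoder, and the fallback frame rate is an illustrative value.

```python
# Minimal sketch of the decoding step, assuming OpenCV; the embodiment does
# not prescribe a decoder. Yields (timestamp_seconds, frame) pairs.
import cv2

def decode_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fallback when FPS is unreported
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield index / fps, frame  # timestamp in seconds, BGR image
        index += 1
    cap.release()
```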
Further, as a preferred embodiment, extracting hand keypoints from the video frames and determining the hand keypoint time-series stream comprises:
performing hand recognition on each video frame through a hand recognition algorithm and detecting whether the frame contains a hand target;
when a hand target is detected in a video frame, extracting keypoints from the frame and determining the hand keypoints;
and determining the hand keypoint time-series stream from the hand keypoints and the timestamps of the video frames.
Specifically, hand recognition is performed on the video frames through a hand recognition algorithm to screen out the frames in which a hand appears, and keypoint extraction is performed on the frames containing a hand target to obtain the hand keypoints of each frame. The hand keypoints are recorded against the timestamps of their video frames to obtain the hand keypoint time-series stream. By recognizing hand keypoints and combining them with the frame timestamps into a time-series stream, the embodiment can recognize hand actions from the stream, determine the start and stop times of the key actions, and slice the video automatically at those times.
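A minimal sketch of this keypoint-extraction step follows. MediaPipe Hands is assumed as the hand recognition algorithm (21 keypoints per hand); the embodiment does not mandate a specific library, so the function name and output format are illustrative only.

```python
# Sketch of hand-keypoint extraction, assuming MediaPipe Hands; frames with
# no detected hand target are skipped, as described above.
import cv2
import mediapipe as mp

def extract_keypoint_stream(frames):
    """frames: iterable of (timestamp, BGR image) -> list of (t, 21 (x, y))."""
    stream = []
    with mp.solutions.hands.Hands(static_image_mode=False,
                                  max_num_hands=1) as hands:
        for t, frame in frames:
            result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if not result.multi_hand_landmarks:
                continue  # no hand target in this frame
            lm = result.multi_hand_landmarks[0].landmark
            stream.append((t, [(p.x, p.y) for p in lm]))
    return stream
```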
Further, as a preferred embodiment, matching hand actions against the hand keypoint time-series stream through the sliding window and determining the hand behavior information comprises:
combining the hand keypoint time series through a sliding window and determining time windows, wherein a time window represents the hand keypoint data of multiple consecutive video frames;
and matching hand actions against each time window through an action recognition algorithm and determining the hand behavior information.
The embodiment sets a sliding window of variable size and groups the hand keypoint time series into multiple time windows, each containing the hand keypoint data of multiple consecutive video frames; the time windows are stored in a buffer. Hand action matching is then performed on each time window through an action recognition algorithm, checking at each moment whether the hand keypoint data of the action to be recognized is present; if the match succeeds, the hand behavior information is output, comprising the hand action type and the start/stop timestamps of the action.
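One way such a buffered sliding window could be realized is sketched below; the window and stride sizes are illustrative placeholders, not values taken from this description.

```python
# Sketch of the sliding window over the keypoint stream. WINDOW and STRIDE
# are illustrative; the embodiment allows the window size to vary.
from collections import deque

WINDOW = 16  # frames per time window (assumed)
STRIDE = 4   # frames the window advances per step (assumed)

def sliding_windows(keypoint_stream, window=WINDOW, stride=STRIDE):
    buf = deque(maxlen=window)
    for i, sample in enumerate(keypoint_stream):
        buf.append(sample)
        if len(buf) == window and (i + 1 - window) % stride == 0:
            yield list(buf)  # keypoint data of `window` consecutive frames
```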
Further, as a preferred embodiment, performing target detection on the video frames and determining a node triplet comprises:
detecting the hand and the objects in each video frame through a target detection algorithm and determining hand nodes and object nodes, wherein a hand node represents the hand coordinates and an object node represents an object name and object coordinates;
calculating the distance between the hand node and the object node and determining the node distance;
and combining the hand node, the object node and the node distance into a node triplet.
Target detection is performed on the video frames through a target detection algorithm to recognize the hand and the other objects in each frame, yielding hand nodes and object nodes, where a hand node carries the hand coordinate information and an object node carries an object name and its coordinate information. The node distance between the hand node and an object node is computed from the two sets of coordinates, and the hand node, the object node and the node distance are combined into a node triplet.
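The node-triplet construction could look like the following sketch. The bounding-box format and the use of box centers with Euclidean distance are assumptions; the embodiment only requires that a node distance be computed from the detected coordinates.

```python
# Sketch of node-triplet construction from detector output; the detection
# format (x, y, w, h boxes keyed by object name) is hypothetical.
import math

def node_triplets(hand_box, object_boxes):
    def center(box):
        x, y, w, h = box
        return (x + w / 2.0, y + h / 2.0)

    hand_node = center(hand_box)
    triplets = []
    for name, box in object_boxes.items():
        object_node = (name, center(box))
        dist = math.hypot(hand_node[0] - object_node[1][0],
                          hand_node[1] - object_node[1][1])
        triplets.append((hand_node, object_node, dist))  # node triplet
    return triplets
```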
Further, as a preferred embodiment, performing position detection on the node distance according to the hand behavior information and determining a behavior triplet comprises:
determining the expected distance between an object and the hand according to the hand action type in the hand behavior information;
and checking the node distance against that expected distance, and when the check passes, determining the hand behavior information, the hand node and the object node as a behavior triplet.
The hand behavior information includes the hand action type, and different action types imply different expected distances between the object and the hand: for action types such as clicking or grabbing, different hand keypoints are involved and the hand-object distance differs, so a distance threshold can be set for each action type according to the actual application scenario. The node distance in the node triplet is then judged against the expected distance; when it satisfies the preset condition for the scenario, the hand node and the object node of the triplet, together with the hand behavior information, are determined as a behavior triplet.
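A sketch of this position check follows; the per-action distance thresholds are invented placeholder values that would in practice be set per application scenario, as described above.

```python
# Sketch of the position check. Thresholds are assumed example values in
# pixels, chosen per hand action type for a hypothetical scene.
DIST_THRESHOLDS = {"click": 30.0, "grab": 15.0}

def to_behavior_triplet(node_triplet, behavior):
    hand_node, object_node, node_dist = node_triplet
    threshold = DIST_THRESHOLDS.get(behavior["action_type"])
    if threshold is not None and node_dist <= threshold:
        # check passed: the node distance is replaced by the behavior info
        return (hand_node, object_node, behavior)
    return None  # position inconsistent with the recognized action
```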
Further, as a preferred embodiment, clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining the video slice comprises:
determining slice video file name information from the behavior triplet;
and slicing the video to be sliced at the action start/stop timestamps in the behavior triplet, naming the slice video according to the slice video file name information, and determining the video slice.
The slice video file name information is determined from the object node name and the action type in the behavior triplet. The video to be sliced is cut at the action start/stop timestamps carried in the hand behavior information of the behavior triplet, and the resulting slice video is named according to the slice video file name information, yielding the video slice.
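The clipping and naming step might be realized as sketched below, assuming the ffmpeg command-line tool is available; the embodiment does not specify a clipping tool, and the file-name pattern merely follows the object-name-plus-action-type scheme described above.

```python
# Sketch of the clipping step, assuming the ffmpeg CLI; cuts [start, end]
# without re-encoding and names the slice from the behavior triplet.
import subprocess

def cut_slice(src, behavior_triplet):
    _, (obj_name, _), behavior = behavior_triplet
    start, end = behavior["start"], behavior["end"]  # seconds
    out = f"{obj_name}_{behavior['action_type']}_{start:.2f}-{end:.2f}.mp4"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ss", str(start), "-to", str(end),
         "-c", "copy", out],
        check=True,
    )
    return out
```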
Further, as a preferred embodiment, matching hand actions against the time window through an action recognition algorithm and determining the hand behavior information comprises:
acquiring a hand action model, wherein the hand action model is obtained by training an action recognition algorithm on preset hand actions;
and matching the hand keypoints of the time window against the hand action model through the action recognition algorithm and determining the hand behavior information.
Different hand action templates are generated for different application scenarios, and a hand action model is trained from these templates through an action recognition algorithm. The hand keypoint sequence in each time window is then recognized through the action recognition algorithm and matched against the hand action model; if the match succeeds, the hand behavior information is obtained, comprising the hand action type and the action start/stop timestamps.
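As one possible stand-in for the unspecified action recognition algorithm, the sketch below matches a time window against per-action template sequences by cosine similarity; the model format, the score threshold, and the equal-length assumption are all illustrative.

```python
# Sketch of template matching for a time window; templates map an action
# type to a keypoint sequence of the same shape as the window (assumed).
import numpy as np

def match_action(window_keypoints, templates, min_score=0.9):
    x = np.asarray(window_keypoints, dtype=float).ravel()
    best_action, best_score = None, min_score
    for action, tmpl in templates.items():
        t = np.asarray(tmpl, dtype=float).ravel()
        if t.shape != x.shape:
            continue  # only equal-length sequences are compared here
        score = float(np.dot(x, t) /
                      (np.linalg.norm(x) * np.linalg.norm(t)))
        if score > best_score:
            best_action, best_score = action, score
    return best_action  # None when no template matches well enough
```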
With reference to fig. 1, the flow of the invention is as follows: the video to be sliced is decoded into consecutive video frames; hand keypoints are extracted from the frames to obtain a timestamped hand keypoint time-series stream; and hand action matching is performed on the stream through the sliding window to obtain the hand action type and the action start/stop timestamps. In parallel, target detection on the video frames yields the hand and object coordinate information, from which node triplets are formed using the hand-object distance. A distance condition is chosen per application scenario and the node distance of each triplet is judged against it; triplets that satisfy the condition are updated into behavior triplets carrying the hand behavior information. The video is then segmented at the start/stop timestamps in the behavior triplets to obtain the sliced video. By detecting hand-object interaction behavior with computer vision, the embodiment obtains the start/stop timestamp information of each slice and completes automatic video slicing.
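The sketch below wires the pieces above into the flow of fig. 1; `detect_nodes` is a hypothetical callable standing in for the per-frame target detection step, and every helper name reuses the illustrative sketches given earlier in this description.

```python
# Sketch of the end-to-end flow of fig. 1, reusing the earlier sketches;
# detect_nodes(window) -> node triplets is a hypothetical detector hook.
def slice_video(video_path, templates, detect_nodes):
    stream = extract_keypoint_stream(decode_frames(video_path))
    slices = []
    for window in sliding_windows(stream):
        action = match_action([kp for _, kp in window], templates)
        if action is None:
            continue  # no key action recognized in this time window
        behavior = {"action_type": action,
                    "start": window[0][0], "end": window[-1][0]}
        for triplet in detect_nodes(window):
            bt = to_behavior_triplet(triplet, behavior)
            if bt is not None:
                slices.append(cut_slice(video_path, bt))
                break  # one slice per matched action
    return slices
```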
The embodiment of the invention also provides a computer-vision-based video slicing system, comprising:
a first module for decoding a video to be sliced and outputting video frames;
a second module for extracting hand keypoints from the video frames and determining a hand keypoint time-series stream;
a third module for matching hand actions against the hand keypoint time-series stream through a sliding window and determining hand behavior information, wherein the hand behavior information comprises a hand action type and action start/stop timestamps;
a fourth module for performing target detection on the video frames and determining a node triplet, wherein the node triplet comprises a hand node, an object node and the node distance between them;
a fifth module for performing position detection on the node distance according to the hand behavior information and determining a behavior triplet, wherein the behavior triplet comprises the hand node, the object node and the hand behavior information;
and a sixth module for clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining a video slice.
Corresponding to the method of fig. 1, an embodiment of the invention further provides an electronic device comprising a processor and a memory; the memory is used for storing a program, and the processor executes the program to implement the method described above.
Corresponding to the method of fig. 1, an embodiment of the present invention further provides a computer-readable storage medium storing a program, the program being executed by a processor to implement the method described above.
Embodiments of the present invention also disclose a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method shown in fig. 1.
In summary, the embodiments of the invention have the following advantages:
(1) Through computer vision, the embodiment recognizes the specific hand-object interaction actions and obtains the start/stop timestamp information for slicing those actions, so the video can be sliced automatically.
(2) By analyzing hand behaviors from the time series of hand keypoints, the embodiment reduces the complexity of hand behavior analysis.
In some alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may in fact be executed substantially concurrently or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flowcharts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed, and in which sub-operations described as part of a larger operation are performed independently.
Furthermore, while the invention is described in the context of functional modules, it should be appreciated that, unless otherwise indicated, one or more of the described functions and/or features may be integrated in a single physical device and/or software module or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary to an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be apparent to those skilled in the art from consideration of their attributes, functions and internal relationships. Accordingly, one of ordinary skill in the art can implement the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative and are not intended to be limiting upon the scope of the invention, which is to be defined in the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the embodiments described above, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present invention, and these equivalent modifications or substitutions are included in the scope of the present invention as defined in the appended claims.

Claims (7)

1. A computer-vision-based video slicing method, comprising:
decoding a video to be sliced and outputting video frames;
extracting hand keypoints from the video frames and determining a hand keypoint time-series stream;
matching hand actions against the hand keypoint time-series stream through a sliding window and determining hand behavior information, wherein the hand behavior information comprises a hand action type and action start/stop timestamps;
performing target detection on the video frames and determining a node triplet, wherein the node triplet comprises a hand node, an object node and the node distance between them;
performing position detection on the node distance according to the hand behavior information and determining a behavior triplet, wherein the behavior triplet comprises the hand node, the object node and the hand behavior information;
and clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining a video slice;
wherein extracting hand keypoints from the video frames and determining the hand keypoint time-series stream comprises:
performing hand recognition on each video frame through a hand recognition algorithm and detecting whether the frame contains a hand target;
when a hand target is detected in a video frame, extracting keypoints from the frame and determining the hand keypoints;
and determining the hand keypoint time-series stream from the hand keypoints and the timestamps of the video frames;
wherein matching hand actions against the hand keypoint time-series stream through the sliding window and determining the hand behavior information comprises:
combining the hand keypoint time series through a sliding window and determining time windows, wherein a time window represents the hand keypoint data of multiple consecutive video frames;
and matching hand actions against each time window through an action recognition algorithm and determining the hand behavior information;
and wherein performing position detection on the node distance according to the hand behavior information and determining the behavior triplet comprises:
determining the expected distance between an object and the hand according to the hand action type in the hand behavior information;
and checking the node distance against that expected distance, and when the check passes, determining the hand behavior information, the hand node and the object node as the behavior triplet.
2. The computer-vision-based video slicing method of claim 1, wherein performing target detection on the video frames and determining the node triplet comprises:
detecting the hand and the objects in each video frame through a target detection algorithm and determining hand nodes and object nodes, wherein a hand node represents the hand coordinates and an object node represents an object name and object coordinates;
calculating the distance between the hand node and the object node and determining the node distance;
and combining the hand node, the object node and the node distance into the node triplet.
3. The computer-vision-based video slicing method of claim 1, wherein clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining the video slice comprises:
determining slice video file name information from the behavior triplet;
and slicing the video to be sliced at the action start/stop timestamps in the behavior triplet, naming the slice video according to the slice video file name information, and determining the video slice.
4. The computer-vision-based video slicing method of claim 1, wherein matching hand actions against the time window through the action recognition algorithm and determining the hand behavior information comprises:
acquiring a hand action model, wherein the hand action model is obtained by training an action recognition algorithm on preset hand actions;
and matching the hand keypoints of the time window against the hand action model through the action recognition algorithm and determining the hand behavior information.
5. A computer-vision-based video slicing system, comprising:
a first module for decoding a video to be sliced and outputting video frames;
a second module for extracting hand keypoints from the video frames and determining a hand keypoint time-series stream;
a third module for matching hand actions against the hand keypoint time-series stream through a sliding window and determining hand behavior information, wherein the hand behavior information comprises a hand action type and action start/stop timestamps;
a fourth module for performing target detection on the video frames and determining a node triplet, wherein the node triplet comprises a hand node, an object node and the node distance between the hand node and the object node;
a fifth module for performing position detection on the node distance according to the hand behavior information and determining a behavior triplet, wherein the behavior triplet comprises the hand node, the object node and the hand behavior information;
and a sixth module for clipping the video to be sliced according to the action start/stop timestamps in the behavior triplet and determining a video slice;
wherein the second module is specifically configured to:
perform hand recognition on each video frame through a hand recognition algorithm and detect whether the frame contains a hand target;
when a hand target is detected in a video frame, extract keypoints from the frame and determine the hand keypoints;
and determine the hand keypoint time-series stream from the hand keypoints and the timestamps of the video frames;
the third module is specifically configured to:
combine the hand keypoint time series through a sliding window and determine time windows, wherein a time window represents the hand keypoint data of multiple consecutive video frames;
and match hand actions against each time window through an action recognition algorithm and determine the hand behavior information;
and the fifth module is specifically configured to:
determine the expected distance between an object and the hand according to the hand action type in the hand behavior information;
and check the node distance against that expected distance, and when the check passes, determine the hand behavior information, the hand node and the object node as the behavior triplet.
6. An electronic device comprising a processor and a memory;
wherein the memory is used for storing a program;
and the processor executes the program to implement the method of any one of claims 1-4.
7. A computer-readable storage medium, wherein the storage medium stores a program, the program being executed by a processor to implement the method of any one of claims 1-4.
CN202111492456.0A 2021-12-08 2021-12-08 Video slicing method, system, equipment and medium based on computer vision Active CN114363720B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111492456.0A CN114363720B (en) 2021-12-08 2021-12-08 Video slicing method, system, equipment and medium based on computer vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111492456.0A CN114363720B (en) 2021-12-08 2021-12-08 Video slicing method, system, equipment and medium based on computer vision

Publications (2)

Publication Number Publication Date
CN114363720A CN114363720A (en) 2022-04-15
CN114363720B (en) 2024-03-12

Family

ID=81098214

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111492456.0A Active CN114363720B (en) 2021-12-08 2021-12-08 Video slicing method, system, equipment and medium based on computer vision

Country Status (1)

Country Link
CN (1) CN114363720B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528716A (en) * 2019-09-19 2021-03-19 杭州海康威视数字技术股份有限公司 Event information acquisition method and device
CN111209897A (en) * 2020-03-09 2020-05-29 腾讯科技(深圳)有限公司 Video processing method, device and storage medium
CN113536864A (en) * 2020-04-22 2021-10-22 深圳市优必选科技股份有限公司 Gesture recognition method and device, computer readable storage medium and terminal equipment
CN111571567A (en) * 2020-05-12 2020-08-25 广东工业大学 Robot translation skill training method and device, electronic equipment and storage medium
CN111757147A (en) * 2020-06-03 2020-10-09 苏宁云计算有限公司 Method, device and system for event video structuring
CN113395542A (en) * 2020-10-26 2021-09-14 腾讯科技(深圳)有限公司 Video generation method and device based on artificial intelligence, computer equipment and medium
CN113301430A (en) * 2021-07-27 2021-08-24 腾讯科技(深圳)有限公司 Video clipping method, video clipping device, electronic equipment and storage medium
CN113610034A (en) * 2021-08-16 2021-11-05 脸萌有限公司 Method, device, storage medium and electronic equipment for identifying person entity in video

Also Published As

Publication number Publication date
CN114363720A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
CN107707931B (en) Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment
CN110008797B (en) Multi-camera multi-face video continuous acquisition method
CN111339806B (en) Training method of lip language recognition model, living body recognition method and device
CN113365147B (en) Video editing method, device, equipment and storage medium based on music card point
CN107679578B (en) Target recognition algorithm testing method, device and system
CN110795595A (en) Video structured storage method, device, equipment and medium based on edge calculation
CN107609149B (en) Video positioning method and device
Wang et al. Soccer replay detection using scene transition structure analysis
CN113515997A (en) Video data processing method and device and readable storage medium
CN111881320A (en) Video query method, device, equipment and readable storage medium
CN106162222B (en) A kind of method and device of video lens cutting
CN114363720B (en) Video slicing method, system, equipment and medium based on computer vision
CN104170367A (en) Virtual shutter image capture
CN112287771A (en) Method, apparatus, server and medium for detecting video event
Medentzidou et al. Video summarization based on shot boundary detection with penalized contrasts
CN113312951A (en) Dynamic video target tracking system, related method, device and equipment
CN111860122A (en) Method and system for recognizing reading comprehensive behaviors in real scene
CN111428589A (en) Identification method and system for transition
CN113395586B (en) Tag-based video editing method, device, equipment and storage medium
CN113194333B (en) Video editing method, device, equipment and computer readable storage medium
US20240062545A1 (en) Information processing device, information processing method, and recording medium
CN112784813A (en) Motion recognition data set generation method and device based on image detection
CN111160279B (en) Method, device, equipment and medium for generating target recognition model by using small sample
CN110264496A (en) Video structural processing system and method
CN110719440B (en) Video playback method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant