CN113038271B - Video automatic editing method, device and computer storage medium - Google Patents

Video automatic editing method, device and computer storage medium

Info

Publication number
CN113038271B
CN113038271B (application CN202110321530.6A)
Authority
CN
China
Prior art keywords
video
video frame
frame
target person
optical flow
Prior art date
Legal status
Active
Application number
CN202110321530.6A
Other languages
Chinese (zh)
Other versions
CN113038271A (en)
Inventor
黄锐
胡攀文
Current Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Artificial Intelligence and Robotics
Priority to CN202110321530.6A
Publication of CN113038271A
Application granted
Publication of CN113038271B


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4858End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows

Abstract

The embodiment of the application discloses a method, a device and a computer storage medium for automatically editing video, which enable the video generated by editing to maximally present the information of a target person and avoid presenting the information of other irrelevant persons. The embodiment of the application comprises the following steps: calculating the face pose information of the target person in each video frame of at least one path of video, a pose information quantization value corresponding to the face pose information, and an optical flow energy change value of each video frame; based on a reinforcement learning algorithm, selecting frame by frame the candidate video frame corresponding to the maximum return value so as to obtain an initial synthesized video; and simultaneously, determining a video picture window based on the position and the size of the target person in the video frame, and extracting the video picture related to the target person according to the video picture window, so that the finally synthesized video presents information related to the target person to the maximum extent and avoids presenting information related to other irrelevant persons.

Description

Video automatic editing method, device and computer storage medium
Technical Field
The embodiment of the application relates to the field of video editing, in particular to an automatic video editing method, an automatic video editing device and a computer storage medium.
Background
In the prior art, automatic video editing can improve the efficiency of video editing work in fields such as security, education and video entertainment. After a video is clipped, its data volume is greatly reduced and it occupies less storage space, so automatic video clipping can also relieve the storage pressure of massive videos by releasing more storage space.
Existing automatic video editing systems are mainly designed for videos such as dance videos, concert videos, outdoor activity videos and football match videos, and focus on enriching and diversifying the video content, increasing its interest and improving its look and feel. However, in scenes where a target person in the video needs to be emphasized, such systems cannot handle the situation well: because they focus on presenting more video content, they cannot focus on the target person or present more information about the target person. Meanwhile, the edited video produced by these systems presents information of other people unrelated to the target person, which can leak the privacy of those people.
Disclosure of Invention
The embodiment of the application provides a video automatic editing method, a video automatic editing device and a computer storage medium, which enable the video generated by editing to maximally present the information of a target person while avoiding presenting the information of other irrelevant persons.
An embodiment of the present application provides a method for automatically editing video, where the method includes:
calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame; the action is to select one candidate video frame from each video of the at least one video;
determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
A second aspect of an embodiment of the present application provides an automatic video editing apparatus, including:
the computing unit is used for computing the face posture information of the target person of each video frame in at least one path of video, computing the posture information quantization value corresponding to the face posture information and computing the optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate a return value of an action under a current video frame according to a quantized value of pose information and a change value of optical flow energy of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to a maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to execute the reinforcement learning algorithm based on the quantized value of pose information and the change value of optical flow energy of the current video frame, and calculate the return value of the action under the current video frame; the action is to select one candidate video frame from each video of the at least one video;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frames and obtaining an initial synthesized video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video frame window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and the extraction unit is used for extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
A third aspect of an embodiment of the present application provides an automatic video editing apparatus, including:
a memory for storing a computer program; a processor for implementing the steps of the video automatic editing method as described in the foregoing first aspect when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the method of the first aspect described above.
From the above technical solutions, the embodiment of the present application has the following advantages:
In this embodiment, the quantization value of the face pose information and the change value of the optical flow energy of the target person are calculated for each video frame, and the calculated pose information quantization value and optical flow energy change value are applied to the calculation of the return value of the action in the reinforcement learning algorithm. The candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, the next video frame of the current video frame is taken as the new current video frame, and the step of calculating the return value of the action under the current video frame is executed again, so that the video frame determined from the at least one path of video at each step can maximally present the information of the target person and avoid presenting pictures in which the target person is blocked. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and a video picture about the target person is extracted according to the video picture window, so that the finally synthesized video presents information about the target person to the maximum extent and avoids presenting information about other unrelated persons.
Drawings
FIG. 1 is a schematic flow chart of an automatic video editing method according to an embodiment of the application;
FIG. 2 is a schematic diagram of another embodiment of an automatic video editing method;
FIG. 3 is a schematic diagram of facial pose information according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of an automatic video editing apparatus;
fig. 5 is a schematic diagram of another structure of an automatic video editing apparatus according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a video automatic editing method, a video automatic editing device and a computer storage medium, which enable the video generated by editing to maximally present the information of a target person while avoiding presenting the information of other irrelevant persons.
Referring to fig. 1, an embodiment of an automatic video editing method according to an embodiment of the present application includes:
101. calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
the method of the embodiment can be applied to an automatic video editing device, and the automatic video editing device can be specifically a terminal, a server and other computer equipment with data processing capability.
In an application scenario in which a target person in a video needs to be emphasized, this embodiment acquires at least one path of video, where the video picture of each path of video includes the target person. The task of this embodiment is to automatically clip the at least one path of video so that the generated video emphasizes the information of the target person and of the object with which the target person interacts, while ensuring that information of other unrelated persons is not displayed in the generated video, thereby protecting the privacy of those unrelated persons.
After at least one path of video is acquired, calculating the face posture information of the target person of each video frame of each path of video, and calculating a posture information quantization value corresponding to the face posture information. In addition, the embodiment also provides a method for determining the occlusion situation of the target person in the video picture, namely determining the occlusion situation of the target person in the video picture according to the optical flow energy change value. Therefore, this step also calculates an optical flow energy variation value for each video frame, and reflects the occlusion condition of the target person by the optical flow energy variation value.
102. Taking any video frame in any path of video as a current video frame;
the present embodiment employs a reinforcement learning algorithm to determine each frame in the video generated by the clip, the video frame being the state in the reinforcement learning algorithm. The user can designate any video frame in any path of video as the current video frame, so that the video automatic clipping device determines the current video frame according to the designation of the user, the current video frame is used as one state in the reinforcement learning algorithm, and the next state is determined according to the state corresponding to the current video frame in the subsequent step.
103. Calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame;
In the automatic clipping process, this embodiment determines each video frame in the video generated by clipping in turn. Specifically, after determining the current video frame, the video automatic clipping apparatus calculates, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the posture information quantization value and the optical flow energy variation value of the current video frame.
The reinforcement learning algorithm of this embodiment may specifically be a Markov decision process. In the reinforcement learning algorithm, the larger the return value of an action is, the more significant the action is; the virtual agent in the reinforcement learning algorithm optimizes its strategy according to the action corresponding to the maximum return value, and then takes the next action according to the optimized strategy. Therefore, after the return values of a plurality of actions under the current video frame are calculated, the candidate video frame selected by the action with the maximum return value is determined as the next video frame of the current video frame.
After the next video frame of the current video frame is determined, the next video frame is taken as the new current video frame, and the process returns to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantization value and the optical flow energy change value of the current video frame. An action under the current video frame is to select one candidate video frame from each path of the at least one path of video.
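The frame-by-frame selection described above can be illustrated with a minimal Python sketch of a greedy search. All names (select_frames, quant, flow_delta, has_target) are placeholders for illustration; in particular, the way the reward combines the pose information quantization value and the optical flow energy change value, and the fact that it is evaluated on the candidate frame, are assumptions rather than the patent's exact formulas.

```python
# Minimal sketch of the greedy frame-selection loop (steps 102-103).
# quant[c][t]      : pose information quantization value of frame t in video c
# flow_delta[c][t] : optical flow energy change value of frame t in video c
# has_target[c][t] : whether the target person appears in frame t of video c
def select_frames(num_videos, num_frames, start, quant, flow_delta, has_target):
    c, t = start                      # current state: (video index, frame index)
    sequence = [(c, t)]
    while t + 1 < num_frames:
        best_action, best_return = None, float("-inf")
        for a in range(num_videos):   # action a: take frame t+1 from video a
            if not has_target[a][t + 1]:
                continue              # the transition probability would be 0
            # placeholder reward: favour frontal faces and low occlusion
            r = quant[a][t + 1] - flow_delta[a][t + 1]
            if r > best_return:
                best_action, best_return = a, r
        if best_action is None:       # no candidate frame contains the target person
            break
        c, t = best_action, t + 1     # the chosen candidate becomes the new current frame
        sequence.append((c, t))
    return sequence                   # ordered frame indices of the initial composite video
```

In this sketch the state is the pair (video index, frame index) and each action corresponds to taking the next frame from one of the paths of video, which mirrors the Markov decision process view described above.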
104. Determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
each video frame can be determined in sequence through step 103, and the determined plurality of video frames have a determined sequence, so that a video frame sequence can be determined according to the determined sequence of the current video frames in step 103, and further an initial synthesized video can be obtained based on the video frame sequence.
105. Determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
after the initial synthesized video is obtained, since the object of the present embodiment is to emphasize the target person in the video frame, the position and size of the target person in the initial synthesized video at each frame are further determined, and the position and size of the video frame window of each frame in the initial synthesized video are determined according to the position and size of the target person at each frame.
106. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of a video picture window to obtain a target synthesized video;
after the position and the size of the video picture window are determined, the video picture of each frame in the initial synthesized video is extracted based on the position and the size of the video picture window, so that the target synthesized video is obtained, and automatic video editing is realized.
In this embodiment, the quantization value of the face pose information and the change value of the optical flow energy of the target person are calculated for each video frame, and the calculated pose information quantization value and optical flow energy change value are applied to the calculation of the return value of the action in the reinforcement learning algorithm. The candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, the next video frame of the current video frame is taken as the new current video frame, and the step of calculating the return value of the action under the current video frame is executed again, so that the video frame determined from the at least one path of video at each step can maximally present the information of the target person and avoid presenting pictures in which the target person is blocked. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and a video picture about the target person is extracted according to the video picture window, so that the finally synthesized video presents information about the target person to the maximum extent and avoids presenting information about other unrelated persons.
An embodiment of the present application will be described in further detail below on the basis of the foregoing embodiment shown in fig. 1. Referring to fig. 2, another embodiment of an automatic video editing method according to an embodiment of the present application includes:
201. calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
In this embodiment, the face pose information of the target person in each video frame may be calculated according to a face pose estimation algorithm. Specifically, the face pose information calculated by the face pose estimation algorithm may be represented by a rotation matrix, a rotation vector, a quaternion, or Euler angles. Since Euler angles are more readable, it may be preferable to use Euler angles to represent the face pose information. As shown in fig. 3, the angles that constitute the face pose information of the target person, namely the pitch angle (pitch), the yaw angle (yaw) and the rotation angle (roll), may be calculated according to the face pose estimation algorithm.
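As a sketch of how the Euler angles can be obtained, the helper below converts a rotation matrix returned by an arbitrary face pose estimator into pitch, yaw and roll. The ZYX decomposition and the mapping of the three axes to pitch, yaw and roll are assumptions, since different estimators use different conventions.

```python
import numpy as np

# Convert a 3x3 rotation matrix R into Euler angles in degrees (ZYX convention).
def rotation_to_euler(R):
    sy = np.hypot(R[0, 0], R[1, 0])
    if sy > 1e-6:
        pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))  # rotation about the x axis
        yaw   = np.degrees(np.arctan2(-R[2, 0], sy))      # rotation about the y axis
        roll  = np.degrees(np.arctan2(R[1, 0], R[0, 0]))  # rotation about the z axis
    else:  # near gimbal lock
        pitch = np.degrees(np.arctan2(-R[1, 2], R[1, 1]))
        yaw   = np.degrees(np.arctan2(-R[2, 0], sy))
        roll  = 0.0
    return pitch, yaw, roll
```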
To keep the calculated face pose information consistent with the symmetric structure of the face, this embodiment adopts a multivariate Gaussian model to calculate the pose information quantization value corresponding to the face pose information. Further, for convenience of calculation, the calculated pose information quantization value can be normalized, and the normalized pose information quantization value is used in the subsequent calculation process.
Specifically, the optical flow energy change value is calculated in this embodiment as follows: calculate the optical flow information of each video frame in the at least one path of video and the optical flow information of the other video frame that belongs to the same path of video as that video frame; calculate the optical flow energy of each video frame according to its optical flow information, and the optical flow energy of the other video frame according to the optical flow information of the other video frame; calculate the optical flow energy difference between the video frame and the other video frame, and the interval time between them; and take the quotient of the optical flow energy difference and the interval time as the optical flow energy change value of the video frame.
For example, assume that the video automatic editing device acquires C paths of video (C ≥ 1), each containing T video frames. A video frame in the C paths of video can be denoted f_{c,t} (c = 1, …, C; t = 1, …, T), and the video frame that belongs to the same path of video as f_{c,t} and is adjacent to it can be denoted f_{c,t+1}. The optical flow information of f_{c,t} and of f_{c,t+1} is calculated respectively; the optical flow energy of f_{c,t} is calculated from the optical flow information of f_{c,t}, and the optical flow energy of f_{c,t+1} is calculated from the optical flow information of f_{c,t+1}. The difference between the optical flow energy of f_{c,t} and the optical flow energy of f_{c,t+1} is then calculated, together with the interval time between f_{c,t} and f_{c,t+1}, and the quotient of the optical flow energy difference and the interval time is taken as the optical flow energy change value of f_{c,t}.
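A minimal sketch of this computation is given below using OpenCV's Farneback dense optical flow. Defining the optical flow energy as the mean squared flow magnitude, and pairing each frame with its neighbouring frame for the flow computation, are assumptions, since the text above does not give the exact formulas.

```python
import cv2
import numpy as np

def flow_energy(prev_gray, next_gray):
    # dense optical flow between two consecutive grayscale frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # assumed energy definition: mean squared flow magnitude
    return float(np.mean(flow[..., 0] ** 2 + flow[..., 1] ** 2))

def flow_energy_change(frame_before, frame_t, frame_after, dt):
    # energy of f_{c,t} (flow from its predecessor) and of f_{c,t+1} (flow from f_{c,t})
    e_t  = flow_energy(frame_before, frame_t)
    e_t1 = flow_energy(frame_t, frame_after)
    # quotient of the optical flow energy difference and the interval time dt
    return (e_t1 - e_t) / dt
```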
In the multivariate Gaussian model, when the Euler angles of the face tend to 0, the pose information quantization value corresponding to the face pose information is maximal; when the face deflects so that an Euler angle is no longer equal to 0, the pose information quantization value decreases. How strongly the Euler angles change the quantization value can be controlled by a variance matrix; therefore, the variance of each Euler angle can be set to adjust the degree of influence of that Euler angle on the pose information quantization value.
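The quantization can be sketched as an unnormalized zero-mean multivariate Gaussian kernel evaluated at the Euler angles, as below. The diagonal variance values are placeholders standing in for the variance matrix mentioned above.

```python
import numpy as np

# Pose information quantization value: maximal (1.0) for a frontal face whose Euler
# angles are all 0, and decreasing as the face turns away. The variances control how
# strongly each Euler angle reduces the value; the numbers used here are placeholders.
def pose_quantization(pitch, yaw, roll, variances=(400.0, 400.0, 400.0)):
    angles = np.array([pitch, yaw, roll], dtype=float)
    cov_inv = np.diag(1.0 / np.asarray(variances, dtype=float))
    return float(np.exp(-0.5 * angles @ cov_inv @ angles))   # value in (0, 1]
```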
202. Taking any video frame in any path of video as a current video frame;
the operations performed in this step are similar to those performed in step 102 in the embodiment shown in fig. 1, and will not be described here again.
203. Calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame;
in this embodiment, when calculating the return value of the action, a specific calculation manner is to determine the transition probability from the current video frame to the candidate video frame under the action, calculate the initial return value of the action under the current video frame according to the quantized value of the pose information and the change value of the optical flow energy of the current video frame, and take the product of the initial return value and the transition probability as the return value of the action.
Specifically, the transition probability is determined as follows: when the preset conditions are all met, the transition probability is determined to be 1; when any one of the preset conditions is not met, the transition probability is determined to be 0. The preset conditions comprise: the current video frame and the candidate video frame are adjacent on the time line, the target person exists in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
Here, being adjacent on the time line means that the current video frame and the candidate video frame occupy adjacent positions on the time line of the video. For example, if the current video frame is the t-th frame of the first path of video, the candidate video frame may be the (t+1)-th frame of the current path of video or of another path of video (such as the second path or the third path).
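The following sketch makes the transition probability and the resulting return value explicit. The three preset conditions follow the text above; the way the initial return value combines the pose information quantization value and the optical flow energy change value of the current video frame is again only a placeholder.

```python
# cur and cand are (video index, frame index) pairs; action_video_index is the index
# of the path of video selected by the action; has_target, quant and flow_delta are
# the per-frame tables used in the earlier sketches.
def transition_probability(cur, cand, action_video_index, has_target):
    c_cur, t_cur = cur
    c_cand, t_cand = cand
    adjacent  = (t_cand == t_cur + 1)            # adjacent on the time line
    target_ok = has_target[c_cand][t_cand]       # target person appears in the candidate
    index_ok  = (action_video_index == c_cand)   # action index matches the candidate's video
    return 1.0 if (adjacent and target_ok and index_ok) else 0.0

def action_return(cur, cand, action_video_index, quant, flow_delta, has_target):
    c, t = cur
    initial_return = quant[c][t] - flow_delta[c][t]          # placeholder combination
    return initial_return * transition_probability(cur, cand, action_video_index, has_target)
```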
204. Determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
the operations performed in this step are similar to those performed in step 104 in the embodiment shown in fig. 1, and will not be described here again.
205. Determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
In this embodiment, the position and size of the target person in the video frame may be determined according to the information of the target person and of the object with which the target person interacts, where the information of the object with which the target person interacts may be the information of the object of interest in the line-of-sight direction of the target person. For example, if the gaze of the target person is focused on a chair, the determined position and size of the target person in the video frame should cover not only the target person but also the chair on which the target person's gaze is focused.
Specifically, the line-of-sight direction of the target person in each frame of the initial synthesized video may be determined from the facial pose information of the target person in that frame, since the line-of-sight direction is related to the facial pose: for example, a face tilted upward may be taken as looking upward, and a face tilted downward may be taken as looking downward.
For example, whether the target person looks left or right depends mainly on the yaw angle (yaw) in the face pose information, so the line-of-sight direction of the target person can be determined from the yaw angle. To facilitate the subsequent calculation, the line-of-sight direction may be quantified, i.e. represented numerically by a formula of the yaw angle (the formula itself is not reproduced in this text), where g denotes the line-of-sight direction and φ denotes the yaw angle. The value of the line-of-sight direction corresponding to a given yaw angle can thus be determined from the formula.
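Because the formula for g is not available in this text, the thresholded mapping below is only an illustrative stand-in for the quantified line-of-sight value; the threshold and the output values are assumptions.

```python
# Illustrative quantization of the line-of-sight direction from the yaw angle (degrees).
def gaze_value(yaw_deg, threshold=15.0):
    if yaw_deg > threshold:
        return 1       # looking toward one side
    if yaw_deg < -threshold:
        return -1      # looking toward the other side
    return 0           # looking roughly straight ahead
```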
After the value of the sight line direction is determined, the position and the size of the target person in each frame in the initial synthesized video are further determined according to the sight line direction of the target person in each frame in the initial synthesized video, and the position and the size of the video picture window of each frame in the initial synthesized video are determined according to the position and the size of the target person in each frame in the initial synthesized video.
Specifically, the position and size of the target person in each frame of the initial composite video can be expressed by the coordinates, the width and the height of the target person in that frame: the coordinates determine the position of the target person in the video frame, and the width and the height determine the size of the target person in the video frame.
Meanwhile, the position and size of the video picture window of each frame in the initial composite video can likewise be expressed by the coordinates, the width and the height of the window: the coordinates determine the specific position of the video picture window within the video frame, and the width and the height determine the size of the video picture window.
In this embodiment, the position and size of the video picture window of each frame are obtained by solving an objective function (the formula itself is not reproduced in this text), in which c_t refers to the initial composite video, t refers to any video frame in c_t, and g_t refers to the aforementioned value of the line-of-sight direction.
Since the objective function is a convex function, it can be solved by a convex optimization algorithm, and the optimal position and size of the video picture window of each frame in the initial composite video c_t are obtained as the optimal solution.
Thus, as can be seen from the objective function, when the position and size of the video picture window are determined, the information of the object with which the target person interacts in the line-of-sight direction of the target person is also taken into account, and that object is included in the video picture window.
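Since the objective function itself is not reproduced above, the following cvxpy sketch is only an illustrative convex program that mirrors the described intent: keep the window on the target person, shift it toward the gaze direction, and make it at least as large as the person box. The weights, the gaze-offset term and all names are assumptions, not the patent's objective.

```python
import cvxpy as cp

# (px, py) is the assumed centre of the target person box, (pw, ph) its size,
# g the quantified line-of-sight value, (frame_w, frame_h) the frame size.
def solve_window(px, py, pw, ph, g, frame_w, frame_h, gaze_shift=0.25, margin=1.2):
    wx, wy = cp.Variable(), cp.Variable()          # window centre
    ww, wh = cp.Variable(), cp.Variable()          # window width and height
    objective = cp.Minimize(
        cp.square(wx - (px + g * gaze_shift * pw)) + cp.square(wy - py)
        + 0.1 * (cp.square(ww - margin * pw) + cp.square(wh - margin * ph))
    )
    constraints = [
        ww >= pw, wh >= ph,                        # window at least as large as the person box
        wx - ww / 2 >= 0, wy - wh / 2 >= 0,        # window stays inside the frame
        wx + ww / 2 <= frame_w, wy + wh / 2 <= frame_h,
    ]
    cp.Problem(objective, constraints).solve()
    return wx.value, wy.value, ww.value, wh.value
```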
206. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of a video picture window to obtain a target synthesized video;
after the position and the size of the video picture window are determined, the video picture of each frame in the initial composite video may be extracted based on the position and the size of the video picture window, and the extracted multi-frame video pictures constitute the target composite video. As can be seen from the above description, since the video frame window determines the position and the size based on the target person and the object interacting with the target person, the video frame extracted from the video frame window includes the information of the target person and the information of the object interacting with the target person, and prevents the information of other unrelated persons from being presented in the video frame, on the one hand, the information of the target person can be presented maximally, and on the other hand, the information of other unrelated persons is also prevented from being presented, thereby avoiding the problem of privacy leakage.
In this embodiment, automatic video editing emphasizes and highlights the information of the target person while avoiding privacy disclosure, which gives the technical solution more practical application value and improves the realizability of the solution.
The method for automatically editing video in the embodiment of the present application is described above, and the apparatus for automatically editing video in the embodiment of the present application is described below, referring to fig. 4, where an embodiment of the apparatus for automatically editing video in the embodiment of the present application includes:
a calculating unit 401, configured to calculate facial pose information of a target person of each video frame in at least one path of video, calculate a pose information quantization value corresponding to the facial pose information, and calculate an optical flow energy variation value of each video frame;
a determining unit 402, configured to take any video frame in any path of video as a current video frame;
a clipping unit 403, configured to calculate a return value of an action under the current video frame according to the posture information quantization value and the optical flow energy variation value of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to a maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to the step of executing the reinforcement learning algorithm, and calculate a return value of an action under the current video frame according to the posture information quantization value and the optical flow energy variation value of the current video frame; the action is to select one candidate video frame from each video of the at least one video;
a generating unit 404, configured to determine a video frame sequence according to the sequencing of the current video frame, and obtain an initial synthesized video based on the video frame sequence;
the determining unit 402 is further configured to determine a position and a size of a video frame window of each frame in the initial synthesized video according to the position and the size of the target person in each frame in the initial synthesized video;
an extracting unit 405, configured to extract a video picture of each frame in the initial composite video based on the position and the size of the video picture window, so as to obtain a target composite video.
In a preferred implementation manner of this embodiment, the calculating unit 401 is specifically configured to calculate, according to a face pose estimation algorithm, face pose information of the target person in each video frame, where the face pose information includes an angle of pitch angle, an angle of yaw angle, and an angle of rotation angle.
In a preferred implementation manner of this embodiment, the calculating unit 401 is specifically configured to calculate the pose information quantization value corresponding to the face pose information by using a multivariate gaussian model.
In a preferred implementation manner of this embodiment, the calculating unit 401 is specifically configured to calculate optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video as each video frame; calculating the optical flow energy of each video frame according to the optical flow information of each video frame, calculating the optical flow energy of other video frames according to the optical flow information of the other video frames, and calculating the optical flow energy difference value between each video frame and the other video frames and the interval time between each video frame and the other video frames; taking the quotient of the difference value of the optical flow energy and the interval time as the optical flow energy change value of each video frame.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine a transition probability of the current video frame to the candidate video frame under the action; calculating an initial return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame; and taking the product of the initial return value and the transition probability as the return value.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine that the transition probability is 1 when a preset condition is satisfied; when any one of the preset conditions is not met, determining that the transition probability is 0;
wherein, the preset conditions include: the current video frame and the candidate video frame are adjacent on a time line, the target person exists in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
In a preferred implementation manner of this embodiment, the determining unit 402 is specifically configured to determine, according to facial pose information of the target person in each frame in the initial synthesized video, a line of sight direction of the target person in each frame in the initial synthesized video; determining the position and the size of each frame of the initial synthesized video of the target person according to the sight direction of the target person in each frame of the initial synthesized video; and determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video.
In this embodiment, the operations performed by the units in the video automatic editing apparatus are similar to those described in the embodiments shown in fig. 1 to 2, and are not repeated here.
In this embodiment, the calculating unit 401 calculates the quantization value of the face pose information and the change value of the optical flow energy of the target person in each video frame, and the clipping unit 403 applies the calculated values to the calculation of the return value of the action in the reinforcement learning algorithm, determines the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, takes the next video frame of the current video frame as the new current video frame, and returns to the step of calculating the return value of the action under the current video frame, so that each video frame determined from the at least one path of video can maximally present the information of the target person and avoid presenting pictures in which the target person is blocked. Meanwhile, the determining unit 402 determines a video picture window based on the position and size of the target person in the video frame, and the extracting unit 405 extracts the video picture about the target person according to the video picture window, so that the finally synthesized video maximally presents information about the target person and avoids presenting information about other unrelated persons.
Referring to fig. 5, an embodiment of the video automatic editing apparatus according to the present application includes:
the video automatic editing apparatus 500 may include one or more central processing units (central processing units, CPU) 501 and a memory 505, where one or more application programs or data are stored in the memory 505.
Wherein the memory 505 may be volatile storage or persistent storage. The program stored in the memory 505 may include one or more modules, each of which may include a series of instruction operations in the video automatic editing apparatus. Still further, the central processor 501 may be configured to communicate with the memory 505 and execute a series of instruction operations in the memory 505 on the video automatic editing apparatus 500.
The video automatic editing device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input output interfaces 504, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The central processing unit 501 may perform the operations performed by the video automatic editing apparatus in the embodiments shown in fig. 1 to 2, which are not described herein again.
The embodiment of the application also provides a computer storage medium, wherein one embodiment comprises: the computer storage medium has stored therein instructions that, when executed on a computer, cause the computer to perform the operations performed by the video automatic editing apparatus in the embodiments shown in fig. 1 to 2 described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A method for automatically editing video, the method comprising:
calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute the reinforcement learning algorithm, wherein the return value of the action under the current video frame is calculated according to the attitude information quantized value and the optical flow energy change value of the current video frame; the action is to select one candidate video frame from each video of the at least one video;
determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
2. The method of claim 1, wherein the computing facial pose information for the target person for each video frame of at least one video comprises:
and calculating the face posture information of the target person of each video frame according to a face posture estimation algorithm, wherein the face posture information comprises the angle of a pitch angle, the angle of a yaw angle and the angle of a rotation angle.
3. The method according to claim 1, wherein the calculating the pose information quantization value corresponding to the face pose information includes:
and calculating the pose information quantization value corresponding to the face pose information by using a multi-element Gaussian model.
4. The method of claim 1, wherein said calculating an optical flow energy variation value for each video frame comprises:
calculating optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video with each video frame;
calculating the optical flow energy of each video frame according to the optical flow information of each video frame, calculating the optical flow energy of other video frames according to the optical flow information of the other video frames, and calculating the optical flow energy difference value between each video frame and the other video frames and the interval time between each video frame and the other video frames;
taking the quotient of the difference value of the optical flow energy and the interval time as the optical flow energy change value of each video frame.
5. The method of claim 1, wherein calculating the return value of the action under the current video frame from the pose information quantization value and the optical flow energy variation value of the current video frame comprises:
determining a transition probability of the current video frame to the candidate video frame under the action;
calculating an initial return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame;
and taking the product of the initial return value and the transition probability as the return value.
6. The method of claim 5, wherein said determining a transition probability of the current video frame to the candidate video frame in the action comprises:
when a preset condition is met, determining that the transition probability is 1; when any one of the preset conditions is not met, determining that the transition probability is 0;
wherein, the preset conditions include: the current video frame and the candidate video frame are adjacent on a time line, the target person exists in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
7. The method of claim 1, wherein the determining the position and size of the video picture window for each frame in the initial composite video based on the position and size of each frame in the initial composite video for the target person comprises:
determining the sight direction of the target person in each frame in the initial synthetic video according to the facial pose information of the target person in each frame in the initial synthetic video;
determining the position and the size of each frame of the initial synthesized video of the target person according to the sight direction of the target person in each frame of the initial synthesized video;
and determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video.
8. An automatic video editing apparatus, the apparatus comprising:
the computing unit is used for computing the face posture information of the target person of each video frame in at least one path of video, computing the posture information quantization value corresponding to the face posture information and computing the optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate a return value of an action under a current video frame according to a quantized value of pose information and a change value of optical flow energy of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to a maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to execute the reinforcement learning algorithm based on the quantized value of pose information and the change value of optical flow energy of the current video frame, and calculate the return value of the action under the current video frame; the action is to select one candidate video frame from each video of the at least one video;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frames and obtaining an initial synthesized video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video frame window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and the extraction unit is used for extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
9. An automatic video editing apparatus, the apparatus comprising:
a memory for storing a computer program; a processor for implementing the steps of the video automatic editing method according to any of claims 1 to 7 when executing said computer program.
10. A computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of any of claims 1 to 7.
CN202110321530.6A 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium Active CN113038271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110321530.6A CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110321530.6A CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113038271A CN113038271A (en) 2021-06-25
CN113038271B true CN113038271B (en) 2023-09-08

Family

ID=76473798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110321530.6A Active CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113038271B (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI430185B (en) * 2010-06-17 2014-03-11 Inst Information Industry Facial expression recognition systems and methods and computer program products thereof
JP5569329B2 (en) * 2010-10-15 2014-08-13 大日本印刷株式会社 Conference system, monitoring system, image processing apparatus, image processing method, image processing program, etc.
US20150318020A1 (en) * 2014-05-02 2015-11-05 FreshTake Media, Inc. Interactive real-time video editor and recorder
GB2583676B (en) * 2018-01-18 2023-03-29 Gumgum Inc Augmenting detected regions in image or video data
US20200380274A1 (en) * 2019-06-03 2020-12-03 Nvidia Corporation Multi-object tracking using correlation filters in video analytics applications

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106534967A (en) * 2016-10-25 2017-03-22 司马大大(北京)智能系统有限公司 Video editing method and device
EP3410353A1 (en) * 2017-06-01 2018-12-05 eyecandylab Corp. Method for estimating a timestamp in a video stream and method of augmenting a video stream with information
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context
CN109618184A (en) * 2018-12-29 2019-04-12 北京市商汤科技开发有限公司 Method for processing video frequency and device, electronic equipment and storage medium
CN110691202A (en) * 2019-08-28 2020-01-14 咪咕文化科技有限公司 Video editing method, device and computer storage medium
CN111063011A (en) * 2019-12-16 2020-04-24 北京蜜莱坞网络科技有限公司 Face image processing method, device, equipment and medium
CN111131884A (en) * 2020-01-19 2020-05-08 腾讯科技(深圳)有限公司 Video clipping method, related device, equipment and storage medium
CN111294524A (en) * 2020-02-24 2020-06-16 中移(杭州)信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN111800644A (en) * 2020-07-14 2020-10-20 深圳市人工智能与机器人研究院 Video sharing and acquiring method, server, terminal equipment and medium
CN112203115A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Video identification method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of automatic face recognition methods; Li Gang et al.; 《计算机应用研究》 (Application Research of Computers); 2003-08-28 (No. 08); full text *

Also Published As

Publication number Publication date
CN113038271A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
US11100644B2 (en) Neural network for eye image segmentation and image quality estimation
Zhang et al. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks
Adler Sample images can be independently restored from face recognition templates
US9813909B2 (en) Cloud server for authenticating the identity of a handset user
CN108776775B (en) Old people indoor falling detection method based on weight fusion depth and skeletal features
CN111464834B (en) Video frame processing method and device, computing equipment and storage medium
KR102317223B1 (en) System and method for implementing metaverse using biometric data
CN111191599A (en) Gesture recognition method, device, equipment and storage medium
CN111310705A (en) Image recognition method and device, computer equipment and storage medium
US20220222969A1 (en) Method for determining the direction of gaze based on adversarial optimization
CN113284073A (en) Image restoration method, device and storage medium
CN112532882B (en) Image display method and device
JP2022185096A (en) Method and apparatus of generating virtual idol, and electronic device
CN105931204B (en) Picture restoring method and system
CN113038271B (en) Video automatic editing method, device and computer storage medium
KR101791604B1 (en) Method and apparatus for estimating position of head, computer readable storage medium thereof
US11361467B2 (en) Pose selection and animation of characters using video data and training techniques
CN114648601A (en) Virtual image generation method, electronic device, program product and user terminal
CN112714337A (en) Video processing method and device, electronic equipment and storage medium
Li et al. Real-time human tracking based on switching linear dynamic system combined with adaptive Meanshift tracker
CN111991808A (en) Face model generation method and device, storage medium and computer equipment
CN111008577A (en) Virtual face-based scoring method, system, device and storage medium
CN115999156B (en) Role control method, device, equipment and storage medium
CN116152896A (en) Face key point prediction method, APP, terminal equipment and storage medium
CN113784207A (en) Video picture display method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant