CN113038271B - Video automatic editing method, device and computer storage medium - Google Patents

Video automatic editing method, device and computer storage medium

Info

Publication number
CN113038271B
CN113038271B (application CN202110321530.6A)
Authority
CN
China
Prior art keywords
video
video frame
frame
target person
optical flow
Prior art date
Legal status
Active
Application number
CN202110321530.6A
Other languages
Chinese (zh)
Other versions
CN113038271A (en)
Inventor
黄锐
胡攀文
Current Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Original Assignee
Shenzhen Institute of Artificial Intelligence and Robotics
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Artificial Intelligence and Robotics
Priority to CN202110321530.6A
Publication of CN113038271A
Application granted
Publication of CN113038271B


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/441Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card
    • H04N21/4415Acquiring end-user identification, e.g. using personal code sent by the remote control or by inserting a card using biometric characteristics of the user, e.g. by voice recognition or fingerprint scanning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4858End-user interface for client configuration for modifying screen layout parameters, e.g. fonts, size of the windows

Abstract

The embodiment of the application discloses a method, a device and a computer storage medium for automatically editing video, which enable the video generated by editing to maximally present the information of a target person and avoid presenting the information of other irrelevant persons. The embodiment of the application comprises the following steps: calculating the face pose information of the target person in each video frame of at least one path of video, a pose information quantization value corresponding to the face pose information, and an optical flow energy change value of each video frame; based on a reinforcement learning algorithm, selecting frame by frame the candidate video frame corresponding to the maximum return value so as to obtain an initial synthesized video; and simultaneously, determining a video picture window based on the position and the size of the target person in the video frame, and extracting the video picture related to the target person according to the video picture window, so that the finally synthesized video presents information related to the target person to the maximum extent and avoids presenting information related to other irrelevant persons.

Description

Video automatic editing method, device and computer storage medium
Technical Field
The embodiment of the application relates to the field of video editing, in particular to an automatic video editing method, an automatic video editing device and a computer storage medium.
Background
In the prior art, automatic video editing can improve the efficiency of video editing work in fields such as security, education and video entertainment. After a video is clipped, its data volume is greatly reduced and it occupies less storage space, so automatic video clipping can also relieve the storage pressure of massive videos by releasing more storage space.
Existing automatic video editing systems are mainly designed for videos such as dance videos, concert videos, outdoor activity videos and football match videos, and focus on enriching and diversifying the video content, increasing its interest and improving its look and feel. However, in scenes where a target person in the video needs to be emphasized, such systems cannot handle the situation well: because they focus on presenting more video content, they cannot focus on the target person or present more information about the target person. Meanwhile, the edited video produced by these systems presents information of other people unrelated to the target person, which can leak the privacy of those people.
Disclosure of Invention
The embodiment of the application provides a video automatic editing method, a video automatic editing device and a computer storage medium, which enable the video generated by editing to maximally present the information of a target person while avoiding presenting the information of other irrelevant persons.
An embodiment of the present application provides a method for automatically editing video, where the method includes:
calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame; the action is to select one candidate video frame from each video of the at least one video;
determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
A second aspect of an embodiment of the present application provides an automatic video editing apparatus, including:
the computing unit is used for computing the face posture information of the target person of each video frame in at least one path of video, computing the posture information quantization value corresponding to the face posture information and computing the optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate a return value of an action under a current video frame according to a quantized value of pose information and a change value of optical flow energy of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to a maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to execute the reinforcement learning algorithm based on the quantized value of pose information and the change value of optical flow energy of the current video frame, and calculate the return value of the action under the current video frame; the action is to select one candidate video frame from each video of the at least one video;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frames and obtaining an initial synthesized video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video frame window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and the extraction unit is used for extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
A third aspect of an embodiment of the present application provides an automatic video editing apparatus, including:
a memory for storing a computer program; a processor for implementing the steps of the video automatic editing method as described in the foregoing first aspect when executing the computer program.
A fourth aspect of the embodiments of the present application provides a computer storage medium having stored therein instructions which, when executed on a computer, cause the computer to perform the method of the first aspect described above.
From the above technical solutions, the embodiment of the present application has the following advantages:
In this embodiment, the quantization value of the face pose information and the change value of the optical flow energy of the target person are calculated for each video frame, and the calculated pose information quantization value and optical flow energy change value are applied to the calculation of the return value of the action in the reinforcement learning algorithm. The candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, the next video frame of the current video frame is taken as the new current video frame, and the step of calculating the return value of the action under the current video frame is executed again, so that the video frame determined from the at least one path of video at each step can maximally present the information of the target person and avoid presenting pictures in which the target person is blocked. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and a video picture about the target person is extracted according to the video picture window, so that the finally synthesized video presents information about the target person to the maximum extent and avoids presenting information about other unrelated persons.
Drawings
FIG. 1 is a schematic flow chart of an automatic video editing method according to an embodiment of the application;
FIG. 2 is a schematic diagram of another embodiment of an automatic video editing method;
FIG. 3 is a schematic diagram of facial pose information according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of an automatic video editing apparatus;
fig. 5 is a schematic diagram of another structure of an automatic video editing apparatus according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a video automatic editing method, a video automatic editing device and a computer storage medium, which enable the video generated by editing to maximally present the information of a target person while avoiding presenting the information of other irrelevant persons.
Referring to fig. 1, an embodiment of an automatic video editing method according to an embodiment of the present application includes:
101. calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
the method of the embodiment can be applied to an automatic video editing device, and the automatic video editing device can be specifically a terminal, a server and other computer equipment with data processing capability.
In an application scenario in which a target person in a video needs to be emphasized, this embodiment acquires at least one path of video, where the video picture of each path of video includes the target person. The task of this embodiment is to automatically clip the at least one path of video so that the generated video emphasizes the information of the target person and of the object with which the target person interacts, while ensuring that information of other unrelated persons is not displayed in the generated video, thereby protecting the privacy of those unrelated persons.
After at least one path of video is acquired, calculating the face posture information of the target person of each video frame of each path of video, and calculating a posture information quantization value corresponding to the face posture information. In addition, the embodiment also provides a method for determining the occlusion situation of the target person in the video picture, namely determining the occlusion situation of the target person in the video picture according to the optical flow energy change value. Therefore, this step also calculates an optical flow energy variation value for each video frame, and reflects the occlusion condition of the target person by the optical flow energy variation value.
102. Taking any video frame in any path of video as a current video frame;
the present embodiment employs a reinforcement learning algorithm to determine each frame in the video generated by the clip, the video frame being the state in the reinforcement learning algorithm. The user can designate any video frame in any path of video as the current video frame, so that the video automatic clipping device determines the current video frame according to the designation of the user, the current video frame is used as one state in the reinforcement learning algorithm, and the next state is determined according to the state corresponding to the current video frame in the subsequent step.
103. Calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame;
In the automatic clipping process, this embodiment determines each video frame in the video generated by clipping in turn. Specifically, after determining the current video frame, the video automatic clipping apparatus calculates, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the posture information quantization value and the optical flow energy variation value of the current video frame.
The reinforcement learning algorithm of this embodiment may specifically be a Markov decision process. In the reinforcement learning algorithm, the larger the return value of an action is, the more significant the action is; the virtual agent in the reinforcement learning algorithm optimizes its strategy according to the action corresponding to the maximum return value, and then takes the next action according to the optimized strategy. Therefore, after the return values of a plurality of actions under the current video frame are calculated, the candidate video frame selected by the action with the maximum return value is determined as the next video frame of the current video frame.
After the next video frame of the current video frame is determined, the next video frame is taken as the new current video frame, and the process returns to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantization value and the optical flow energy change value of the current video frame. An action under the current video frame is to select one candidate video frame from each path of the at least one path of video.
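The frame-by-frame selection described above can be illustrated with a minimal Python sketch of a greedy search. All names (select_frames, quant, flow_delta, has_target) are placeholders for illustration; in particular, the way the reward combines the pose information quantization value and the optical flow energy change value, and the fact that it is evaluated on the candidate frame, are assumptions rather than the patent's exact formulas.

```python
# Minimal sketch of the greedy frame-selection loop (steps 102-103).
# quant[c][t]      : pose information quantization value of frame t in video c
# flow_delta[c][t] : optical flow energy change value of frame t in video c
# has_target[c][t] : whether the target person appears in frame t of video c
def select_frames(num_videos, num_frames, start, quant, flow_delta, has_target):
    c, t = start                      # current state: (video index, frame index)
    sequence = [(c, t)]
    while t + 1 < num_frames:
        best_action, best_return = None, float("-inf")
        for a in range(num_videos):   # action a: take frame t+1 from video a
            if not has_target[a][t + 1]:
                continue              # the transition probability would be 0
            # placeholder reward: favour frontal faces and low occlusion
            r = quant[a][t + 1] - flow_delta[a][t + 1]
            if r > best_return:
                best_action, best_return = a, r
        if best_action is None:       # no candidate frame contains the target person
            break
        c, t = best_action, t + 1     # the chosen candidate becomes the new current frame
        sequence.append((c, t))
    return sequence                   # ordered frame indices of the initial composite video
```

In this sketch the state is the pair (video index, frame index) and each action corresponds to taking the next frame from one of the paths of video, which mirrors the Markov decision process view described above.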
104. Determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
each video frame can be determined in sequence through step 103, and the determined plurality of video frames have a determined sequence, so that a video frame sequence can be determined according to the determined sequence of the current video frames in step 103, and further an initial synthesized video can be obtained based on the video frame sequence.
105. Determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
after the initial synthesized video is obtained, since the object of the present embodiment is to emphasize the target person in the video frame, the position and size of the target person in the initial synthesized video at each frame are further determined, and the position and size of the video frame window of each frame in the initial synthesized video are determined according to the position and size of the target person at each frame.
106. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of a video picture window to obtain a target synthesized video;
after the position and the size of the video picture window are determined, the video picture of each frame in the initial synthesized video is extracted based on the position and the size of the video picture window, so that the target synthesized video is obtained, and automatic video editing is realized.
In this embodiment, the quantization value of the face pose information and the change value of the optical flow energy of the target person are calculated for each video frame, and the calculated pose information quantization value and optical flow energy change value are applied to the calculation of the return value of the action in the reinforcement learning algorithm. The candidate video frame corresponding to the maximum return value is determined as the next video frame of the current video frame, the next video frame of the current video frame is taken as the new current video frame, and the step of calculating the return value of the action under the current video frame is executed again, so that the video frame determined from the at least one path of video at each step can maximally present the information of the target person and avoid presenting pictures in which the target person is blocked. Meanwhile, a video picture window is determined based on the position and the size of the target person in the video frame, and a video picture about the target person is extracted according to the video picture window, so that the finally synthesized video presents information about the target person to the maximum extent and avoids presenting information about other unrelated persons.
An embodiment of the present application will be described in further detail below on the basis of the foregoing embodiment shown in fig. 1. Referring to fig. 2, another embodiment of an automatic video editing method according to an embodiment of the present application includes:
201. calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
In this embodiment, the face pose information of the target person in each video frame may be calculated according to a face pose estimation algorithm. Specifically, the face pose information calculated by the face pose estimation algorithm may be represented by a rotation matrix, a rotation vector, a quaternion, or Euler angles. Since Euler angles are more readable, it may be preferable to use Euler angles to represent the face pose information. As shown in fig. 3, the angles that constitute the face pose information of the target person, namely the pitch angle (pitch), the yaw angle (yaw) and the rotation angle (roll), may be calculated according to the face pose estimation algorithm.
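As a sketch of how the Euler angles can be obtained, the helper below converts a rotation matrix returned by an arbitrary face pose estimator into pitch, yaw and roll. The ZYX decomposition and the mapping of the three axes to pitch, yaw and roll are assumptions, since different estimators use different conventions.

```python
import numpy as np

# Convert a 3x3 rotation matrix R into Euler angles in degrees (ZYX convention).
def rotation_to_euler(R):
    sy = np.hypot(R[0, 0], R[1, 0])
    if sy > 1e-6:
        pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))  # rotation about the x axis
        yaw   = np.degrees(np.arctan2(-R[2, 0], sy))      # rotation about the y axis
        roll  = np.degrees(np.arctan2(R[1, 0], R[0, 0]))  # rotation about the z axis
    else:  # near gimbal lock
        pitch = np.degrees(np.arctan2(-R[1, 2], R[1, 1]))
        yaw   = np.degrees(np.arctan2(-R[2, 0], sy))
        roll  = 0.0
    return pitch, yaw, roll
```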
To keep the calculated face pose information consistent with the symmetric structure of the face, this embodiment adopts a multivariate Gaussian model to calculate the pose information quantization value corresponding to the face pose information. Further, for convenience of calculation, the calculated pose information quantization value can be normalized, and the normalized pose information quantization value is used in the subsequent calculation process.
Specifically, the optical flow energy change value is calculated in this embodiment as follows: calculate the optical flow information of each video frame in the at least one path of video and the optical flow information of the other video frame that belongs to the same path of video as that video frame; calculate the optical flow energy of each video frame according to its optical flow information, and the optical flow energy of the other video frame according to the optical flow information of the other video frame; calculate the optical flow energy difference between the video frame and the other video frame, and the interval time between them; and take the quotient of the optical flow energy difference and the interval time as the optical flow energy change value of the video frame.
For example, assume that the video automatic editing device acquires C paths of video (C ≥ 1), each containing T video frames. A video frame in the C paths of video can be denoted f_{c,t} (c = 1, …, C; t = 1, …, T), and the video frame that belongs to the same path of video as f_{c,t} and is adjacent to it can be denoted f_{c,t+1}. The optical flow information of f_{c,t} and of f_{c,t+1} is calculated respectively; the optical flow energy of f_{c,t} is calculated from the optical flow information of f_{c,t}, and the optical flow energy of f_{c,t+1} is calculated from the optical flow information of f_{c,t+1}. The difference between the optical flow energy of f_{c,t} and the optical flow energy of f_{c,t+1} is then calculated, together with the interval time between f_{c,t} and f_{c,t+1}, and the quotient of the optical flow energy difference and the interval time is taken as the optical flow energy change value of f_{c,t}.
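A minimal sketch of this computation is given below using OpenCV's Farneback dense optical flow. Defining the optical flow energy as the mean squared flow magnitude, and pairing each frame with its neighbouring frame for the flow computation, are assumptions, since the text above does not give the exact formulas.

```python
import cv2
import numpy as np

def flow_energy(prev_gray, next_gray):
    # dense optical flow between two consecutive grayscale frames
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # assumed energy definition: mean squared flow magnitude
    return float(np.mean(flow[..., 0] ** 2 + flow[..., 1] ** 2))

def flow_energy_change(frame_before, frame_t, frame_after, dt):
    # energy of f_{c,t} (flow from its predecessor) and of f_{c,t+1} (flow from f_{c,t})
    e_t  = flow_energy(frame_before, frame_t)
    e_t1 = flow_energy(frame_t, frame_after)
    # quotient of the optical flow energy difference and the interval time dt
    return (e_t1 - e_t) / dt
```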
In the multivariate Gaussian model, when the Euler angles of the face tend to 0, the pose information quantization value corresponding to the face pose information is maximal; when the face deflects so that an Euler angle is no longer equal to 0, the pose information quantization value decreases. How strongly the Euler angles change the quantization value can be controlled by a variance matrix; therefore, the variance of each Euler angle can be set to adjust the degree of influence of that Euler angle on the pose information quantization value.
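The quantization can be sketched as an unnormalized zero-mean multivariate Gaussian kernel evaluated at the Euler angles, as below. The diagonal variance values are placeholders standing in for the variance matrix mentioned above.

```python
import numpy as np

# Pose information quantization value: maximal (1.0) for a frontal face whose Euler
# angles are all 0, and decreasing as the face turns away. The variances control how
# strongly each Euler angle reduces the value; the numbers used here are placeholders.
def pose_quantization(pitch, yaw, roll, variances=(400.0, 400.0, 400.0)):
    angles = np.array([pitch, yaw, roll], dtype=float)
    cov_inv = np.diag(1.0 / np.asarray(variances, dtype=float))
    return float(np.exp(-0.5 * angles @ cov_inv @ angles))   # value in (0, 1]
```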
202. Taking any video frame in any path of video as a current video frame;
the operations performed in this step are similar to those performed in step 102 in the embodiment shown in fig. 1, and will not be described here again.
203. Calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on the reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to the step of calculating, based on the reinforcement learning algorithm, the return value of the action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame;
in this embodiment, when calculating the return value of the action, a specific calculation manner is to determine the transition probability from the current video frame to the candidate video frame under the action, calculate the initial return value of the action under the current video frame according to the quantized value of the pose information and the change value of the optical flow energy of the current video frame, and take the product of the initial return value and the transition probability as the return value of the action.
Specifically, the transition probability is determined as follows: when the preset conditions are all met, the transition probability is determined to be 1; when any one of the preset conditions is not met, the transition probability is determined to be 0. The preset conditions comprise: the current video frame and the candidate video frame are adjacent on the time line, the target person exists in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
Here, being adjacent on the time line means that the current video frame and the candidate video frame occupy adjacent positions on the time line of the video. For example, if the current video frame is the t-th frame of the first path of video, the candidate video frame may be the (t+1)-th frame of the current path of video or of another path of video (such as the second path or the third path).
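The following sketch makes the transition probability and the resulting return value explicit. The three preset conditions follow the text above; the way the initial return value combines the pose information quantization value and the optical flow energy change value of the current video frame is again only a placeholder.

```python
# cur and cand are (video index, frame index) pairs; action_video_index is the index
# of the path of video selected by the action; has_target, quant and flow_delta are
# the per-frame tables used in the earlier sketches.
def transition_probability(cur, cand, action_video_index, has_target):
    c_cur, t_cur = cur
    c_cand, t_cand = cand
    adjacent  = (t_cand == t_cur + 1)            # adjacent on the time line
    target_ok = has_target[c_cand][t_cand]       # target person appears in the candidate
    index_ok  = (action_video_index == c_cand)   # action index matches the candidate's video
    return 1.0 if (adjacent and target_ok and index_ok) else 0.0

def action_return(cur, cand, action_video_index, quant, flow_delta, has_target):
    c, t = cur
    initial_return = quant[c][t] - flow_delta[c][t]          # placeholder combination
    return initial_return * transition_probability(cur, cand, action_video_index, has_target)
```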
204. Determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
the operations performed in this step are similar to those performed in step 104 in the embodiment shown in fig. 1, and will not be described here again.
205. Determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
In this embodiment, the position and size of the target person in the video frame may be determined according to the information of the target person and of the object with which the target person interacts, where the information of the object with which the target person interacts may be the information of the object of interest in the line-of-sight direction of the target person. For example, if the gaze of the target person is focused on a chair, the determined position and size of the target person in the video frame should cover not only the target person but also the chair on which the target person's gaze is focused.
Specifically, the line-of-sight direction of the target person in each frame of the initial synthesized video may be determined from the facial pose information of the target person in that frame, since the line-of-sight direction is related to the facial pose: for example, a face tilted upward may be taken as looking upward, and a face tilted downward may be taken as looking downward.
For example, whether the target person looks left or right depends mainly on the yaw angle (yaw) in the face pose information, so the line-of-sight direction of the target person can be determined from the yaw angle. To facilitate the subsequent calculation, the line-of-sight direction may be quantified, i.e. represented numerically by a formula of the yaw angle (the formula itself is not reproduced in this text), where g denotes the line-of-sight direction and φ denotes the yaw angle. The value of the line-of-sight direction corresponding to a given yaw angle can thus be determined from the formula.
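Because the formula for g is not available in this text, the thresholded mapping below is only an illustrative stand-in for the quantified line-of-sight value; the threshold and the output values are assumptions.

```python
# Illustrative quantization of the line-of-sight direction from the yaw angle (degrees).
def gaze_value(yaw_deg, threshold=15.0):
    if yaw_deg > threshold:
        return 1       # looking toward one side
    if yaw_deg < -threshold:
        return -1      # looking toward the other side
    return 0           # looking roughly straight ahead
```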
After the value of the sight line direction is determined, the position and the size of the target person in each frame in the initial synthesized video are further determined according to the sight line direction of the target person in each frame in the initial synthesized video, and the position and the size of the video picture window of each frame in the initial synthesized video are determined according to the position and the size of the target person in each frame in the initial synthesized video.
Specifically, the position and size of the target person in each frame of the initial composite video can be expressed by the coordinates, the width and the height of the target person in that frame: the coordinates determine the position of the target person in the video frame, and the width and the height determine the size of the target person in the video frame.
Meanwhile, the position and size of the video picture window of each frame in the initial composite video can likewise be expressed by the coordinates, the width and the height of the window: the coordinates determine the specific position of the video picture window within the video frame, and the width and the height determine the size of the video picture window.
In this embodiment, the position and size of the video picture window of each frame are obtained by solving an objective function (the formula itself is not reproduced in this text), in which c_t refers to the initial composite video, t refers to any video frame in c_t, and g_t refers to the aforementioned value of the line-of-sight direction.
Since the objective function is a convex function, it can be solved by a convex optimization algorithm, and the optimal position and size of the video picture window of each frame in the initial composite video c_t are obtained as the optimal solution.
Thus, as can be seen from the objective function, when the position and size of the video picture window are determined, the information of the object with which the target person interacts in the line-of-sight direction of the target person is also taken into account, and that object is included in the video picture window.
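Since the objective function itself is not reproduced above, the following cvxpy sketch is only an illustrative convex program that mirrors the described intent: keep the window on the target person, shift it toward the gaze direction, and make it at least as large as the person box. The weights, the gaze-offset term and all names are assumptions, not the patent's objective.

```python
import cvxpy as cp

# (px, py) is the assumed centre of the target person box, (pw, ph) its size,
# g the quantified line-of-sight value, (frame_w, frame_h) the frame size.
def solve_window(px, py, pw, ph, g, frame_w, frame_h, gaze_shift=0.25, margin=1.2):
    wx, wy = cp.Variable(), cp.Variable()          # window centre
    ww, wh = cp.Variable(), cp.Variable()          # window width and height
    objective = cp.Minimize(
        cp.square(wx - (px + g * gaze_shift * pw)) + cp.square(wy - py)
        + 0.1 * (cp.square(ww - margin * pw) + cp.square(wh - margin * ph))
    )
    constraints = [
        ww >= pw, wh >= ph,                        # window at least as large as the person box
        wx - ww / 2 >= 0, wy - wh / 2 >= 0,        # window stays inside the frame
        wx + ww / 2 <= frame_w, wy + wh / 2 <= frame_h,
    ]
    cp.Problem(objective, constraints).solve()
    return wx.value, wy.value, ww.value, wh.value
```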
206. Extracting a video picture of each frame in the initial synthesized video based on the position and the size of a video picture window to obtain a target synthesized video;
after the position and the size of the video picture window are determined, the video picture of each frame in the initial composite video may be extracted based on the position and the size of the video picture window, and the extracted multi-frame video pictures constitute the target composite video. As can be seen from the above description, since the video frame window determines the position and the size based on the target person and the object interacting with the target person, the video frame extracted from the video frame window includes the information of the target person and the information of the object interacting with the target person, and prevents the information of other unrelated persons from being presented in the video frame, on the one hand, the information of the target person can be presented maximally, and on the other hand, the information of other unrelated persons is also prevented from being presented, thereby avoiding the problem of privacy leakage.
In this embodiment, automatic video editing emphasizes and highlights the information of the target person while avoiding privacy disclosure, which gives the technical solution more practical application value and improves the realizability of the solution.
The method for automatically editing video in the embodiment of the present application is described above, and the apparatus for automatically editing video in the embodiment of the present application is described below, referring to fig. 4, where an embodiment of the apparatus for automatically editing video in the embodiment of the present application includes:
a calculating unit 401, configured to calculate facial pose information of a target person of each video frame in at least one path of video, calculate a pose information quantization value corresponding to the facial pose information, and calculate an optical flow energy variation value of each video frame;
a determining unit 402, configured to take any video frame in any path of video as a current video frame;
a clipping unit 403, configured to calculate a return value of an action under the current video frame according to the posture information quantization value and the optical flow energy variation value of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to a maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to the step of executing the reinforcement learning algorithm, and calculate a return value of an action under the current video frame according to the posture information quantization value and the optical flow energy variation value of the current video frame; the action is to select one candidate video frame from each video of the at least one video;
a generating unit 404, configured to determine a video frame sequence according to the sequencing of the current video frame, and obtain an initial synthesized video based on the video frame sequence;
the determining unit 402 is further configured to determine a position and a size of a video frame window of each frame in the initial synthesized video according to the position and the size of the target person in each frame in the initial synthesized video;
an extracting unit 405, configured to extract a video picture of each frame in the initial composite video based on the position and the size of the video picture window, so as to obtain a target composite video.
In a preferred implementation manner of this embodiment, the calculating unit 401 is specifically configured to calculate, according to a face pose estimation algorithm, face pose information of the target person in each video frame, where the face pose information includes an angle of pitch angle, an angle of yaw angle, and an angle of rotation angle.
In a preferred implementation manner of this embodiment, the calculating unit 401 is specifically configured to calculate the pose information quantization value corresponding to the face pose information by using a multivariate gaussian model.
In a preferred implementation manner of this embodiment, the calculating unit 401 is specifically configured to calculate optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video as each video frame; calculating the optical flow energy of each video frame according to the optical flow information of each video frame, calculating the optical flow energy of other video frames according to the optical flow information of the other video frames, and calculating the optical flow energy difference value between each video frame and the other video frames and the interval time between each video frame and the other video frames; taking the quotient of the difference value of the optical flow energy and the interval time as the optical flow energy change value of each video frame.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine a transition probability of the current video frame to the candidate video frame under the action; calculating an initial return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame; and taking the product of the initial return value and the transition probability as the return value.
In a preferred implementation manner of this embodiment, the clipping unit 403 is specifically configured to determine that the transition probability is 1 when a preset condition is satisfied; when any one of the preset conditions is not met, determining that the transition probability is 0;
wherein, the preset conditions include: the current video frame and the candidate video frame are adjacent on a time line, the target person exists in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
In a preferred implementation manner of this embodiment, the determining unit 402 is specifically configured to determine, according to facial pose information of the target person in each frame in the initial synthesized video, a line of sight direction of the target person in each frame in the initial synthesized video; determining the position and the size of each frame of the initial synthesized video of the target person according to the sight direction of the target person in each frame of the initial synthesized video; and determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video.
In this embodiment, the operations performed by the units in the video automatic editing apparatus are similar to those described in the embodiments shown in fig. 1 to 2, and are not repeated here.
In this embodiment, the calculating unit 401 calculates the quantization value of the face pose information and the change value of the optical flow energy of the target person in each video frame, and the clipping unit 403 applies the calculated values to the calculation of the return value of the action in the reinforcement learning algorithm, determines the candidate video frame corresponding to the maximum return value as the next video frame of the current video frame, takes the next video frame of the current video frame as the new current video frame, and returns to the step of calculating the return value of the action under the current video frame, so that each video frame determined from the at least one path of video can maximally present the information of the target person and avoid presenting pictures in which the target person is blocked. Meanwhile, the determining unit 402 determines a video picture window based on the position and size of the target person in the video frame, and the extracting unit 405 extracts the video picture about the target person according to the video picture window, so that the finally synthesized video maximally presents information about the target person and avoids presenting information about other unrelated persons.
Referring to fig. 5, an embodiment of the video automatic editing apparatus according to the present application includes:
the video automatic editing apparatus 500 may include one or more central processing units (central processing units, CPU) 501 and a memory 505, where one or more application programs or data are stored in the memory 505.
Wherein the memory 505 may be volatile storage or persistent storage. The program stored in the memory 505 may include one or more modules, each of which may include a series of instruction operations in the video automatic editing apparatus. Still further, the central processor 501 may be configured to communicate with the memory 505 and execute a series of instruction operations in the memory 505 on the video automatic editing apparatus 500.
The video automatic editing device 500 may also include one or more power supplies 502, one or more wired or wireless network interfaces 503, one or more input output interfaces 504, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The central processing unit 501 may perform the operations performed by the video automatic editing apparatus in the embodiments shown in fig. 1 to 2, which are not described herein again.
The embodiment of the application also provides a computer storage medium, wherein one embodiment comprises: the computer storage medium has stored therein instructions that, when executed on a computer, cause the computer to perform the operations performed by the video automatic editing apparatus in the embodiments shown in fig. 1 to 2 described above.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims (10)

1. A method for automatically editing video, the method comprising:
calculating face posture information of a target person of each video frame in at least one path of video, calculating a posture information quantization value corresponding to the face posture information, and calculating an optical flow energy change value of each video frame;
taking any video frame in any path of video as a current video frame;
calculating a return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame based on a reinforcement learning algorithm, determining a candidate video frame corresponding to the maximum return value as a next video frame of the current video frame, taking the next video frame of the current video frame as a new current video frame, and returning to execute the reinforcement learning algorithm, wherein the return value of the action under the current video frame is calculated according to the attitude information quantized value and the optical flow energy change value of the current video frame; the action is to select one candidate video frame from each video of the at least one video;
determining a video frame sequence according to the sequence of the current video frames, and obtaining an initial synthesized video based on the video frame sequence;
determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
2. The method of claim 1, wherein the computing facial pose information for the target person for each video frame of at least one video comprises:
and calculating the face posture information of the target person of each video frame according to a face posture estimation algorithm, wherein the face posture information comprises the angle of a pitch angle, the angle of a yaw angle and the angle of a rotation angle.
3. The method according to claim 1, wherein the calculating the pose information quantization value corresponding to the face pose information includes:
and calculating the pose information quantization value corresponding to the face pose information by using a multi-element Gaussian model.
4. The method of claim 1, wherein said calculating an optical flow energy variation value for each video frame comprises:
calculating optical flow information of each video frame and optical flow information of other video frames belonging to the same path of video with each video frame;
calculating the optical flow energy of each video frame according to the optical flow information of each video frame, calculating the optical flow energy of other video frames according to the optical flow information of the other video frames, and calculating the optical flow energy difference value between each video frame and the other video frames and the interval time between each video frame and the other video frames;
taking the quotient of the difference value of the optical flow energy and the interval time as the optical flow energy change value of each video frame.
5. The method of claim 1, wherein calculating the return value of the action under the current video frame from the pose information quantization value and the optical flow energy variation value of the current video frame comprises:
determining a transition probability of the current video frame to the candidate video frame under the action;
calculating an initial return value of an action under the current video frame according to the attitude information quantized value and the optical flow energy change value of the current video frame;
and taking the product of the initial return value and the transition probability as the return value.
6. The method of claim 5, wherein said determining a transition probability of the current video frame to the candidate video frame in the action comprises:
when a preset condition is met, determining that the transition probability is 1; when any one of the preset conditions is not met, determining that the transition probability is 0;
wherein, the preset conditions include: the current video frame and the candidate video frame are adjacent on a time line, the target person exists in the picture of the candidate video frame, and the video index corresponding to the action is consistent with the index of the candidate video frame in the at least one path of video.
7. The method of claim 1, wherein the determining the position and size of the video picture window for each frame in the initial composite video based on the position and size of each frame in the initial composite video for the target person comprises:
determining the sight direction of the target person in each frame in the initial synthetic video according to the facial pose information of the target person in each frame in the initial synthetic video;
determining the position and the size of each frame of the initial synthesized video of the target person according to the sight direction of the target person in each frame of the initial synthesized video;
and determining the position and the size of a video picture window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video.
8. An automatic video editing apparatus, the apparatus comprising:
the computing unit is used for computing the face posture information of the target person of each video frame in at least one path of video, computing the posture information quantization value corresponding to the face posture information and computing the optical flow energy change value of each video frame;
the determining unit is used for taking any video frame in any path of video as a current video frame;
a clipping unit, configured to calculate a return value of an action under a current video frame according to a quantized value of pose information and a change value of optical flow energy of the current video frame based on a reinforcement learning algorithm, determine a candidate video frame corresponding to a maximum return value as a next video frame of the current video frame, take the next video frame of the current video frame as a new current video frame, and return to execute the reinforcement learning algorithm based on the quantized value of pose information and the change value of optical flow energy of the current video frame, and calculate the return value of the action under the current video frame; the action is to select one candidate video frame from each video of the at least one video;
the generating unit is used for determining a video frame sequence according to the sequence of the current video frames and obtaining an initial synthesized video based on the video frame sequence;
the determining unit is further configured to determine a position and a size of a video frame window of each frame in the initial synthesized video according to the position and the size of each frame of the target person in the initial synthesized video;
and the extraction unit is used for extracting the video picture of each frame in the initial synthesized video based on the position and the size of the video picture window to obtain a target synthesized video.
9. An automatic video editing apparatus, the apparatus comprising:
a memory for storing a computer program; a processor for implementing the steps of the video automatic editing method according to any of claims 1 to 7 when executing said computer program.
10. A computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of any of claims 1 to 7.
CN202110321530.6A 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium Active CN113038271B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110321530.6A CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110321530.6A CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113038271A CN113038271A (en) 2021-06-25
CN113038271B true CN113038271B (en) 2023-09-08

Family

ID=76473798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110321530.6A Active CN113038271B (en) 2021-03-25 2021-03-25 Video automatic editing method, device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113038271B (en)


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI430185B (en) * 2010-06-17 2014-03-11 Inst Information Industry Facial expression recognition systems and methods and computer program products thereof
JP5569329B2 (en) * 2010-10-15 2014-08-13 大日本印刷株式会社 Conference system, monitoring system, image processing apparatus, image processing method, image processing program, etc.
US20150318020A1 (en) * 2014-05-02 2015-11-05 FreshTake Media, Inc. Interactive real-time video editor and recorder
GB2583676B (en) * 2018-01-18 2023-03-29 Gumgum Inc Augmenting detected regions in image or video data
US20200380274A1 (en) * 2019-06-03 2020-12-03 Nvidia Corporation Multi-object tracking using correlation filters in video analytics applications

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106534967A (en) * 2016-10-25 2017-03-22 司马大大(北京)智能系统有限公司 Video editing method and device
EP3410353A1 (en) * 2017-06-01 2018-12-05 eyecandylab Corp. Method for estimating a timestamp in a video stream and method of augmenting a video stream with information
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context
CN109618184A (en) * 2018-12-29 2019-04-12 北京市商汤科技开发有限公司 Method for processing video frequency and device, electronic equipment and storage medium
CN110691202A (en) * 2019-08-28 2020-01-14 咪咕文化科技有限公司 Video editing method, device and computer storage medium
CN111063011A (en) * 2019-12-16 2020-04-24 北京蜜莱坞网络科技有限公司 Face image processing method, device, equipment and medium
CN111131884A (en) * 2020-01-19 2020-05-08 腾讯科技(深圳)有限公司 Video clipping method, related device, equipment and storage medium
CN111294524A (en) * 2020-02-24 2020-06-16 中移(杭州)信息技术有限公司 Video editing method and device, electronic equipment and storage medium
CN111800644A (en) * 2020-07-14 2020-10-20 深圳市人工智能与机器人研究院 Video sharing and acquiring method, server, terminal equipment and medium
CN112203115A (en) * 2020-10-10 2021-01-08 腾讯科技(深圳)有限公司 Video identification method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A survey of automatic face recognition methods; Li Gang et al.; 《计算机应用研究》 (Application Research of Computers); 2003-08-28 (No. 08); full text *

Also Published As

Publication number Publication date
CN113038271A (en) 2021-06-25

Similar Documents

Publication Publication Date Title
US11100644B2 (en) Neural network for eye image segmentation and image quality estimation
Zhang et al. Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks
Adler Sample images can be independently restored from face recognition templates
US9813909B2 (en) Cloud server for authenticating the identity of a handset user
CN108776775B (en) Old people indoor falling detection method based on weight fusion depth and skeletal features
CN111464834B (en) Video frame processing method and device, computing equipment and storage medium
KR102317223B1 (en) System and method for implementing metaverse using biometric data
CN111191599A (en) Gesture recognition method, device, equipment and storage medium
CN111310705A (en) Image recognition method and device, computer equipment and storage medium
US20220222969A1 (en) Method for determining the direction of gaze based on adversarial optimization
CN113284073A (en) Image restoration method, device and storage medium
CN112532882B (en) Image display method and device
JP2022185096A (en) Method and apparatus of generating virtual idol, and electronic device
CN105931204B (en) Picture restoring method and system
CN113038271B (en) Video automatic editing method, device and computer storage medium
KR101791604B1 (en) Method and apparatus for estimating position of head, computer readable storage medium thereof
US11361467B2 (en) Pose selection and animation of characters using video data and training techniques
CN114648601A (en) Virtual image generation method, electronic device, program product and user terminal
CN112714337A (en) Video processing method and device, electronic equipment and storage medium
Li et al. Real-time human tracking based on switching linear dynamic system combined with adaptive Meanshift tracker
CN111991808A (en) Face model generation method and device, storage medium and computer equipment
CN111008577A (en) Virtual face-based scoring method, system, device and storage medium
CN115999156B (en) Role control method, device, equipment and storage medium
CN116152896A (en) Face key point prediction method, APP, terminal equipment and storage medium
CN113784207A (en) Video picture display method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant