WO2022127877A1

WO2022127877A1 - Video editing method and system, electronic device, and storage medium

Info

Publication number: WO2022127877A1
Application number: PCT/CN2021/138917
Authority: WO
Inventors: 龙良曲; 唐小林; 陈勃霖; 符峥
Original assignee: 影石创新科技股份有限公司
Priority date: 2020-12-16
Filing date: 2021-12-16
Publication date: 2022-06-23
Also published as: CN112770061A

Abstract

The present invention is applicable to the technical field of image processing, and provides a video editing method and system, an electronic device, and a storage medium. The method comprises: segmenting, according to a preset video segmentation algorithm, a video to be edited to obtain a first number of video sub-segments; analyzing said video according to preset dimensions to obtain score curves corresponding to the dimensions; processing all the score curves by using a preset ensemble learning model to generate a comprehensive score curve of said video; selecting an editing area from the first number of video sub-segments according to the comprehensive score curve; and editing said video according to the editing area to obtain an edited video. Therefore, analysis is carried out on the basis of multiple dimensions, automatic editing of the video is realized, the editing efficiency is improved, and the video editing effect is ensured.

Description

Video editing method, system, electronic device and storage medium

technical field

The invention belongs to the technical field of image processing, and in particular relates to a video editing method, system, electronic device and storage medium.

Background technique

Short videos are streamlined and focused, making them easy to read, and easier to share and disseminate on social media. However, how to edit and obtain relatively "wonderful" short videos is often a major pain point for users.

technical problem

The number of original videos shot by users varies in length, including many uninteresting clips, as well as some irrelevant clips, while “wonderful” clips require users to manually select and edit, which is more difficult to operate and more efficient. Low.

technical solutions

The purpose of the present invention is to provide a video editing method, system, electronic device and storage medium, aiming to solve the problems of high difficulty and low efficiency in manual video editing in the prior art.

In one aspect, the present invention provides a video editing method, the method includes the following steps:

Segment the video to be edited according to the preset video segmentation algorithm to obtain the first number of video sub-segments;

Analyze the video to be edited from a preset dimension to obtain a score curve corresponding to the dimension;

Use the preset integrated learning model to process all the score curves, and generate the comprehensive score curve of the video to be edited;

Selecting a clipping region from the first number of video sub-segments according to the comprehensive score curve;

Edit the video to be edited according to the editing area to obtain the edited video.

Preferably, the step of analyzing the video to be edited from a preset dimension to obtain a score curve corresponding to the dimension includes:

By calculating the summary score of each video subsection, a score curve of the video to be edited on the summary dimension is generated; and/or

Analyze the occlusion probability of each video frame through an occlusion analysis model, and generate a score curve of the video to be edited in the occlusion dimension; and/or

Analyze the aesthetic score of each video frame through an aesthetic evaluation model, and generate a score curve of the video to be edited in the aesthetic dimension; and/or

Use a preset shooting habit analysis model to generate a score curve of the video to be edited in the shooting habit dimension; and/or

Generate a score curve of the video to be edited in the image entropy dimension by counting the amount of information of each video frame; and/or

By analyzing the texture features of each video frame, a score curve of the video to be edited in the texture dimension is generated.

Preferably, the function value of the preset distance function is used as the summary score of the video sub-segment.

Preferably, the preset distance function is a correlation coefficient distance function, and the summary score of the video sub-segment:

Wherein, s _i represents the summary score of the ith video sub-segment, v _i and v _j represent the ith and j th video sub-segments, respectively, and N is the first number.

Preferably, the training process of the occlusion analysis model includes:

constructing an occlusion training sample set, the occlusion training sample set includes a clean sample set and an occlusion sample set, the clean sample does not include an occluder, and the occlusion sample includes at least one type of occluder;

Input the occlusion training samples into the occlusion analysis model, calculate the cross-entropy loss value of the output occlusion probability and the true occlusion probability, and optimize the parameters of the occlusion analysis model through the gradient descent algorithm until the occlusion detection of the occlusion analysis model is accurate. rate reaches the preset value.

Preferably, the training process of the aesthetic evaluation model includes:

Construct an aesthetic training sample set, and the true aesthetic score marked by each aesthetic training sample is the average of multiple users' scores based on preset aesthetic angles;

The aesthetic evaluation model is trained using the aesthetic training sample set to obtain a trained aesthetic evaluation model.

Preferably, the shooting habit analysis model is defined as:

defined as

in,

is a hyperparameter, used to adjust the weight parameter of the probability density of the Gaussian distribution, μ represents the mean value of the Gaussian distribution, σ represents the standard deviation of the Gaussian distribution, υ _duration represents the duration of the video to be edited, and β is a hyperparameter used to adjust The variance of the probability density of a Gaussian distribution.

Preferably, the integrated learning model adopts a single-layer linear dense connection layer, and the comprehensive score curve is as follows:

Among them, M is the dimension set, w _k and b _k are the parameters of the dense connection layer, which are used to weight the weight of the standardized score curve vector corresponding to each dimension, and s' _k is the standardized score of the score curve corresponding to each dimension Curve vector, s is the comprehensive score curve vector.

Preferably, the step of selecting the editing region from the first number of video sub-segments according to the comprehensive score curve includes:

Calculate the average video score of each of the video sub-segments according to the comprehensive score curve;

Sort the video sub-segments in descending order according to the average video score, select a second number of video sub-segments according to the sorting result, and select the editing region from the second number of video sub-segments .

Preferably, the step of selecting a second number of video sub-segments according to the sorting result, and selecting the editing region from the second number of video sub-segments, includes:

The clipping or complementing operation is performed on the video sub-segments whose video lengths do not belong to the preset video length interval in the second number of video sub-segments, the second number of video sub-segments after the operation are obtained, and the The area corresponding to the second number of video sub-segments is used as the clip area.

In another aspect, the present invention provides a video editing system, the system comprising:

a video segmentation module, used for segmenting the video to be edited according to a preset video segmentation algorithm to obtain a first number of video sub-segments;

a dimension analysis module, configured to analyze the video to be edited from a preset dimension to obtain a score curve corresponding to the dimension;

An integrated learning module for processing all the score curves using a preset integrated learning model to generate a comprehensive score curve of the video to be edited;

a segment search module for selecting clip regions from the first number of video sub-segments according to the comprehensive score curve; and

The editing module is used for editing the video to be edited according to the editing area to obtain the edited video.

Preferably, the dimension analysis module includes:

A digest analysis module, configured to generate a score curve of the video to be edited in the digest dimension by calculating the digest score of each video sub-segment; and/or

An occlusion analysis module, configured to analyze the occlusion probability of each video frame through an occlusion analysis model, and generate a score curve of the video to be edited in the occlusion dimension; and/or

An aesthetic analysis module for analyzing the aesthetic score of each video frame through an aesthetic evaluation model, and generating a score curve of the video to be edited in the aesthetic dimension; and/or

A shooting habit analysis module, configured to use a preset shooting habit analysis model to generate a score curve of the video to be edited in the shooting habit dimension; and/or

An image entropy analysis module, for generating a score curve of the video to be edited in the image entropy dimension by counting the amount of information of each video frame; and/or

The texture analysis module is configured to generate a score curve of the video to be edited in the texture dimension by analyzing the texture features of each video frame.

In another aspect, the present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor is implemented when the processor executes the computer program The steps of the method as described above.

In another aspect, the present invention also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, implements the steps of the above-mentioned method.

beneficial effect

The present invention divides the video to be edited according to a preset video segmentation algorithm, obtains a first number of video sub-segments, analyzes the video to be edited from a preset dimension, obtains a score curve corresponding to the dimension, and uses a preset integrated The learning model processes all the score curves, generates a comprehensive score curve of the video to be edited, selects the editing area from the first number of video sub-segments according to the comprehensive score curve, and edits the video to be edited according to the editing area to obtain the edited video. , so as to realize automatic video editing based on multiple dimensional analysis, improve the editing efficiency, and ensure the video editing effect.

Description of drawings

Fig. 1 is the realization flow chart of the video editing method provided in Embodiment 1 of the present invention;

2 is an example diagram of a video segmentation result provided in Embodiment 1 of the present invention;

3 is an example diagram of a score curve on the summary dimension provided by Embodiment 1 of the present invention;

FIG. 4 is an example diagram of a clipped sub-segment extension provided by Embodiment 1 of the present invention;

5 is a schematic structural diagram of a video editing system provided in Embodiment 2 of the present invention;

6 is a schematic diagram of a preferred structure of a video editing system provided by Embodiment 2 of the present invention; and

FIG. 7 is a schematic structural diagram of an electronic device according to Embodiment 3 of the present invention.

Embodiments of the present invention

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

The specific implementation of the present invention is described in detail below in conjunction with specific embodiments:

Example 1:

FIG. 1 shows the implementation process of the video editing method provided by the first embodiment of the present invention. For the convenience of description, only the part related to the embodiment of the present invention is shown, and the details are as follows:

In step S101, the video to be edited is segmented according to a preset video segmentation algorithm to obtain a first number of video sub-segments.

The embodiment of the present invention is applicable to automatic video editing, and the video to be edited may be a video shot by a user based on a camera or a mobile phone. In the embodiment of the present invention, when the video to be edited is segmented according to a preset video segmentation algorithm, it is usually performed based on the frame extraction sequence of the video to be edited. In the specific implementation, the ffmpeg program can be used to decode the video to be edited, etc. , after decoding, perform frame extraction processing on the video to be edited. When performing frame extraction processing, frame extraction processing can be performed according to a preset fps (frames per second) sampling rate, for example, fps is set to 5 or 3. For the convenience of description, the frame extraction sequence of the video to be edited is expressed as: {I _t },t∈[1,M],I _t ∈R ^h×w , the height and width h×w of the frame feature map can be set to 224× 224, obtained from the original resolution image after differential scaling, the frame feature vector at time t

_Defined as the feature histogram of It, then use the video segmentation algorithm to divide the video to be edited into a plurality of video sub-segments according to different scenes, wherein, the preset video segmentation algorithm can be KTS (kernel function time segmentation) algorithm Or other ordered clustering algorithms, which are not limited here. After the above operations, the video v to be edited can be represented by a matrix of shape [M, 96], where M represents the total number of video frames, the lengths of the video sub-segments obtained after segmentation are different, and the positions within the segment are continuous, FIG. 2 is an example diagram of a video segmentation result.

In order to avoid the situation that the length of the video sub-segments is too large or too small, a video sub-segment length interval (the first video length interval) can be preset, and further, the length of each video sub-segment is obtained. When the length of the segment does not belong to the length range of the video sub-segment, the video sub-segment is segmented or spliced. Specifically, when performing the splicing process on the video sub-segments, the splicing may be performed in combination with the lengths of the video sub-segments before and after the video sub-segment and/or the similarity with the preceding and following video sub-segments. Of course, in this step, the length of the video sub-segment may not be processed, but processed in the process of determining the cropping region.

In step S102, the video to be edited is analyzed from a preset dimension to obtain a score curve corresponding to the dimension.

In this embodiment of the present invention, when analyzing the video to be edited from a preset dimension, it is usually also based on frame sampling sequence analysis to improve analysis efficiency. Wherein, the preset dimension may be one or more, and when there are multiple preset dimensions, correspondingly, there are multiple score curves, that is, a score curve corresponding to each dimension will be obtained. Preferably, the preset dimensions include a video summary dimension, an occlusion dimension, an aesthetic dimension, a shooting habit dimension, an image entropy dimension and/or a texture dimension.

Considering that when a user actually cuts a video, the editing is usually based on the main content of the video, so preferably, by calculating the summary score of each video sub-segment, a score curve of the video to be edited in the summary dimension is generated, so as to obtain a summary score from the video summary dimension. Analyze the video to be edited and provide a basis for subsequent cropping. Specifically, the main content of the video to be edited can be analyzed, and the degree to which each video frame is close to the main content can be evaluated according to the main content, so as to form a representative curve that each frame can represent the main content of the video. Generally speaking, there are two main optimization objectives for video summarization: representative optimization objective and diversity optimization objective.

In order to obtain the above-mentioned summary score _si of each video sub-segment, preferably, the function value of the dist(·) function (distance function) is used as the score of each video sub-segment to simplify the complexity of obtaining the score curve.

Further preferably, the distance function is a correlation coefficient distance function, and the summary score of the video sub-segment _vi :

Among them, s _i represents the summary score of the ith video sub-segment, vi and v _j represent the ith and _j th video sub-segments, respectively, and N is the first number. According to the above formula, the correlation coefficient matrix of each segment v _i and other segments v _j can be calculated, and the sum of the _{i-th row of the matrix is used as the summary score s i} _of the video sub-segment vi. Further, a normalization operation is performed on _si to obtain a normalized score curve. FIG. 3 is an example diagram of a video summary score curve after normalization operation.

Considering that in the process of shooting videos, users may produce clips with hands, heads, or hair that block the lens, and the pictures taken due to occlusions such as hands, heads, or hair are not pure and beautiful enough to be used as clips. Therefore, preferably, the occlusion probability of each video frame is analyzed by an occlusion analysis model, and a score curve of the video to be edited in the occlusion dimension is generated, so as to analyze the video to be edited from the occlusion dimension and provide a basis for subsequent cropping. Among them, the occlusion analysis model is used to analyze whether there is an occluder in the video frame, and output the probability of the occlusion in the video frame. The occlusion probability of each video frame is calculated by the occlusion model, and the occlusion dimension of the video can be obtained. score curve. Among them, the occlusion model can be implemented based on a deep neural network, which can be implemented based on lightweight mobile networks such as MobileNetv2 and EfficientNet, and can also be implemented based on other deep convolutional neural network models.

When training the occlusion analysis model, preferably, an occlusion training sample set is constructed, the occlusion training sample set includes a clean sample set and an occlusion sample set, the occlusion training samples are input into the occlusion analysis model, and the output occlusion probability and the real occlusion probability are calculated. Cross entropy loss value, and optimize the parameters of the occlusion analysis model through the gradient descent algorithm until the occlusion detection accuracy of the occlusion analysis model reaches the preset value. The clean samples do not contain occluders, and the occlusion samples include at least one type of occlusions such as head occlusion, hand occlusion, and hair occlusion. In the specific implementation, the above two types of samples can be collected from the pictures taken by the real camera, and the manual marking of occlusion is performed manually. For clean samples, it is marked as 0, and for occluded samples, it is marked as 1. Before inputting the occlusion analysis model, the video frame can be preprocessed. The preprocessing can include random data enhancement, scaling, and normalization operations. After the occlusion training samples are occluded in the analysis model, the last layer of the occlusion model passes through the softmax activation function. The vector p(I _t ) with a length of 2 is output, representing the probability of no occlusion and occlusion, respectively. The sum of the two probabilities is 1. By calculating the cross-entropy loss value between p(I _t ) and the true label, and by gradient descent The algorithm can optimize the model parameters to obtain an accurate occlusion model.

Considering that in the process of video editing, editing is usually performed based on aesthetics, so it is preferable to analyze the aesthetic score of each video frame through an aesthetic evaluation model, and generate a score curve of the video to be edited in the aesthetic dimension, so as to treat it from the occlusion dimension. Clip the video for analysis and provide the basis for subsequent cropping. The aesthetic analysis model can be implemented using deep convolutional neural networks, which can be based on lightweight mobile models, such as MobileNetv2/v3, MobileNext, GhostNet, etc. In order to obtain a more accurate aesthetic evaluation model, the model input needs to be designed with a larger feature size, or the model adopts the method of randomly cropping the picture area to perform aesthetic evaluation on the video frame. By calculating the occlusion probability of each video frame through the aesthetic analysis model, the score curve of the video to be edited in the occlusion dimension can be obtained.

When training the aesthetic analysis model, preferably, an aesthetic training sample set is constructed, and the real aesthetic score marked by each aesthetic training sample is the average of the scores of multiple users based on a preset aesthetic angle, and the aesthetic evaluation model is evaluated by using the aesthetic training sample set. Perform training to obtain a trained aesthetic evaluation model. Preprocessing operations such as scaling and normalization can also be performed on the samples before they are fed into the aesthetic analysis model. When constructing the aesthetic training sample set, considering that aesthetics is a very subjective concept, in order to obtain a more accurate aesthetic evaluation annotation, multiple people are used to score the same video frame, and the average score is used as the real annotation value of the video frame. In a specific implementation, the user can score the samples of the data set according to the scoring standard and personal understanding, and the aesthetic mean of each sample can be counted based on the user's score.

Considering that shooting habits can reflect the real shooting intention of users, it is preferable to use a preset shooting habit analysis model to generate a score curve of the video to be edited in the dimension of shooting habits, so as to analyze the edited video from the dimension of shooting habits, and Provide a basis for subsequent cutting. In the specific implementation, the distribution of the positions of the artificial clips can be counted, and the machine learning model can be used to approximate the distribution of the artificial clips, so preferably, the Gaussian distribution model can be used to approximate the distribution of the artificial clips to obtain the shooting habit model:

defined as

in,

is a hyperparameter used to adjust the weight parameter of the probability density of the Gaussian distribution, μ represents the mean of the Gaussian distribution, σ represents the standard deviation of the Gaussian distribution, υ _duration represents the duration of the video to be edited, and β is a hyperparameter used to adjust the Gaussian distribution The variance of the probability density.

By adjusting the α and β hyperparameters, the distribution curves of different users’ shooting habits can be generated. The Gaussian distribution focuses more on the video clips whose sampling timestamp is near μ, that is, the central area of the video, which is also in line with human shooting habits, that is, the middle The captured video clips are more likely to be the user's real shooting intent.

Considering that when manually editing a video, the editing is usually performed based on the amount of information of the image, so preferably, by counting the amount of information of each video frame, the score curve of the video to be edited in the image entropy dimension is generated to treat from the image entropy dimension. Clip the video for analysis and provide the basis for subsequent cropping. In the specific implementation, the information amount of the video frame It is analyzed, and the image _entropy _Entropy (It) is defined as:

The image entropy can be counted from the RGB pixel vector or the _grayscale pixel vector of the frame picture feature It, and after the statistics are _averaged , the score of the information content of It can be obtained. The larger the Entropy(I _t ), the more random the pixel distribution is, and the greater the amount of information it contains. By scoring the information content of each frame of the video, the score curve of the video in the image entropy dimension can be obtained.

Table 1 below shows the entropy calculation results of the three example images.

Table 1

Considering that the image texture can better represent the detailed information of the picture, it is preferable to generate a score curve of the video to be edited in the texture dimension by analyzing the texture features of each video frame, so as to analyze the video to be edited from the texture dimension, And provide a basis for subsequent cutting. Specifically, the texture of the picture can be detected by the Laplacian operator, and before the Laplacian calculation is performed, Gaussian smoothing filtering can be performed to suppress the noise pixels in the picture. The calculation process can be implemented by the GaussianBlur function and the Laplacian function. The size of the Gaussian kernel and the size of the Laplacian kernel can be freely set. By counting the average intensity value of the feature map of the Gaussian Laplacian operator, the image texture information can be evaluated. :

Among them, s _texture represents the texture statistical value, and L(I _t ) represents the intensity value of the feature map of the Laplacian of Gaussian operator.

The following Table 2 shows the texture calculation results of the three example images.

Table 2

In step S103, a preset integrated learning model is used to process all the score curves to generate a comprehensive score curve of the video to be edited.

In the embodiment of the present invention, all the score curves obtained in step 2 are learned through integrated learning model learning, and a comprehensive score curve of the video to be edited is generated. The ensemble learning model can be implemented based on a shallow neural network. In order to simplify the model parameters, a single-layer linear dense connection layer can be used, and combined with standardized operations to accelerate model convergence:

In step S104, a clip region is selected from the first number of video sub-segments according to the comprehensive score curve.

In this embodiment of the present invention, a score threshold may be set, and an area in each video sub-segment greater than the score threshold may be used as a clipping area. Preferably, the average video score of each video sub-segment is calculated according to the comprehensive score curve. The average score sorts the video sub-segments in descending order, selects the second number of video sub-segments according to the sorting result, and selects the editing area from the second number of video sub-segments to improve the continuity of the cropping area. In order to select sub-segments with suitable duration, further preferably, the video sub-segments whose video lengths do not belong to the preset video length interval (the second video length interval) in the second quantity of video sub-segments are trimmed or complemented to obtain For the second number of video sub-segments after the operation, the area corresponding to the second number of video sub-segments after the operation is used as the clipping area. The area corresponding to the second number of video sub-segments refers to a video area between the start time and the end time of each video sub-segment in the second number of video sub-segments.

As an example, if the clipped or completed video sub-segments are three A, B, and C, sub-segment A is the 3-5 minute video clip, sub-segment B is the 7-8 minute video clip, The sub-segment C is a video segment of the 10th-12th minute, and the regions corresponding to the above-mentioned second number of video sub-segments are the video regions of the 3rd-5th minute, the 7th-8th minute, and the 10th-12th minute.

As an example, clip the sub-segment length interval t _clip ∈ [2,5], for the sub-segment with longer or shorter duration, it needs to perform clipping and completion operations:

If t _clip ∈ [0, T _min ), the short-term length needs to be completed from the left or right area of the clip. Specifically, a segment with a higher average score of T _min -t _clip on the left or on the right is selected for completion, as shown in FIG. 4 , for example.

If t _clip ∈ [T _max ,+∞), that is, the length is longer, select a sub-segment with a higher score from the sub-segment. A sub-segment can be randomly sampled from the current region, and the duration of the sub-segment is guaranteed to be t' _clip ∈ [T _min , T _max ].

In step S105, the video to be edited is edited according to the editing area to obtain the edited video.

In the embodiment of the present invention, according to the start time and end time of each time segment corresponding to the above-mentioned editing area, the editing of the video can be completed, and the edited video can be obtained by splicing a plurality of editing sub-segments. When splicing multiple editing sub-clips, you can splicing multiple editing sub-clips according to time, score, or scene category.

As an example, the editing area includes three editing sub-segments A, B, and C, where sub-segment A is the video segment from the 3rd to the 5th minute, sub-segment B is the video segment from the 7th to the 8th minute, and sub-segment C is the 10th minute. -12-minute video clips, the average video score corresponding to each clip sub clip is 0.6, 0.8 and 0.5, and the scene categories corresponding to each clip sub clip are outdoor, indoor, outdoor, if according to the time (each video sub clip start time) to splicing the three editing sub-segments, then splicing in the order of A, B, C; Splicing is performed in the order of B, A, and C; if the three editing sub-segments are spliced according to the scene and time (in the order from outdoor to indoor, and in chronological order in the same scene), the three clips are spliced according to the sequence of A, C, and B. Splicing in sequence.

In the embodiment of the present invention, the video to be edited is segmented according to a preset video segmentation algorithm to obtain a first number of video sub-segments, and the video to be edited is analyzed from a preset dimension to obtain a score curve corresponding to the dimension, Use a preset integrated learning model to process all the score curves, generate a comprehensive score curve of the video to be edited, select the editing area from the first number of video sub-segments according to the comprehensive score curve, and edit the video to be edited according to the editing area, The edited video is obtained, so as to realize the automatic editing of the video based on the analysis of multiple dimensions, and ensure the video editing effect.

Embodiment 2:

5-6 show the structure of the video editing system provided by the second embodiment of the present invention. For the convenience of description, only the parts related to the embodiment of the present invention are shown, including:

The video segmentation module 51 is used for segmenting the video to be edited according to a preset video segmentation algorithm to obtain a first number of video sub-segments;

A dimension analysis module 52, configured to analyze the video to be edited from a preset dimension to obtain a score curve corresponding to the dimension;

The integrated learning module 53 is used to process all the score curves using a preset integrated learning model, and generate a comprehensive score curve of the video to be edited;

a segment search module 54 for selecting clip regions from the first number of video sub-segments according to the composite score curve; and

The editing module 55 is used for editing the video to be edited according to the editing area to obtain the edited video.

Preferably, the dimension analysis module 52 includes:

The digest analysis module 521 is configured to generate a score curve of the video to be edited in the digest dimension by calculating the digest score of each video sub-segment; and/or

An occlusion analysis module 522, configured to analyze the occlusion probability of each video frame through an occlusion analysis model, and generate a score curve of the video to be edited in the occlusion dimension; and/or

Aesthetic analysis module 523, for analyzing the aesthetic score of each video frame through the aesthetic evaluation model, and generating the score curve of the video to be edited in the aesthetic dimension; and/or

A shooting habit analysis module 524, configured to use a preset shooting habit analysis model to generate a score curve of the video to be edited in the shooting habit dimension; and/or

The image entropy analysis module 525 is used to generate a score curve of the video to be edited in the image entropy dimension by counting the amount of information of each video frame; and/or

The texture analysis module 526 is configured to generate a score curve of the video to be edited in the texture dimension by analyzing the texture features of each video frame.

Among them, s _i represents the summary score of the ith video sub-segment, vi and v _j represent the ith and _j th video sub-segments, respectively, and N is the first number.

Preferably, the shooting habit analysis model is defined as:

defined as

in,

Preferably, the ensemble learning model adopts a single-layer linear dense connection layer, and the comprehensive score curve is as follows:

In the embodiment of the present invention, each module of the video editing system may be implemented by corresponding hardware or software units, and each unit may be an independent software and hardware unit, or may be integrated into a software and hardware unit, which is not intended to limit the present invention. . For the specific implementation of each module of the video editing system, reference may be made to the description of the foregoing method embodiments, which will not be repeated here.

Embodiment three:

FIG. 7 shows the structure of the electronic device provided by the third embodiment of the present invention. For convenience of description, only the part related to the embodiment of the present invention is shown.

The electronic device 7 of the embodiment of the present invention includes a processor 70 , a memory 71 , and a computer program 72 stored in the memory 71 and executable on the processor 70 . When the processor 70 executes the computer program 72, the steps in the foregoing method embodiments are implemented, for example, steps S101 to S105 shown in FIG. 1 . Alternatively, when the processor 70 executes the computer program 72, the functions of the units in the above-mentioned apparatus embodiments, for example, the functions of the units 51 to 55 shown in FIG. 5 are realized.

In the embodiment of the present invention, the video to be edited is segmented according to a preset video segmentation algorithm to obtain a first number of video sub-segments, and the video to be edited is analyzed from a preset dimension to obtain a score curve corresponding to the dimension, Use a preset integrated learning model to process all the score curves, generate a comprehensive score curve of the video to be edited, select the editing area from the first number of video sub-segments according to the comprehensive score curve, and edit the video to be edited according to the editing area, The edited video is obtained, which realizes automatic video editing based on multiple dimension analysis, improves the editing efficiency, and ensures the video editing effect.

Embodiment 4:

In an embodiment of the present invention, a computer-readable storage medium is provided, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments are implemented, for example, as shown in FIG. 1 . Steps S101 to S105 shown. Alternatively, when the computer program is executed by the processor, the functions of the units in the above-mentioned apparatus embodiments, for example, the functions of the units 51 to 55 shown in FIG. 5 , are implemented.

The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program codes, recording medium, for example, memory such as ROM/RAM, magnetic disk, optical disk, flash memory, and the like.

The above descriptions are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall be included in the protection of the present invention. within the range.

Claims

A video editing method, characterized in that the method comprises the following steps:

Segment the video to be edited according to the preset video segmentation algorithm to obtain the first number of video sub-segments;

Analyze the video to be edited from a preset dimension to obtain a score curve corresponding to the dimension;

Use the preset integrated learning model to process all the score curves, and generate the comprehensive score curve of the video to be edited;

Selecting a clipping region from the first number of video sub-segments according to the comprehensive score curve;

Edit the video to be edited according to the editing area to obtain the edited video.
The method of claim 1, wherein the step of analyzing the video to be edited from a preset dimension to obtain a score curve corresponding to the dimension comprises:

By calculating the summary score of each video sub-segment, a score curve of the video to be edited in the summary dimension is generated; and/or

Analyze the occlusion probability of each video frame through an occlusion analysis model, and generate a score curve of the video to be edited in the occlusion dimension; and/or

Analyze the aesthetic score of each video frame through an aesthetic evaluation model, and generate a score curve of the video to be edited in the aesthetic dimension; and/or

Use a preset shooting habit analysis model to generate a score curve of the video to be edited in the shooting habit dimension; and/or

Generate a score curve of the video to be edited in the image entropy dimension by counting the amount of information of each video frame; and/or

By analyzing the texture features of each video frame, a score curve of the video to be edited in the texture dimension is generated.
The method according to claim 2, wherein the function value of the preset distance function is used as the summary score of the video sub-segment.
The method of claim 3, wherein the preset distance function is a correlation coefficient distance function, and the summary score of the video sub-segment:
Wherein, s i represents the summary score of the ith video sub-segment, v i and v j represent the ith and j th video sub-segments, respectively, and N is the first number.
The method of claim 2, wherein the training process of the occlusion analysis model comprises:

constructing an occlusion training sample set, the occlusion training sample set includes a clean sample set and an occlusion sample set, the clean samples do not include occlusion objects, and the occlusion samples include at least one type of occlusion objects;

Input the occlusion training samples into the occlusion analysis model, calculate the cross-entropy loss value of the output occlusion probability and the true occlusion probability, and optimize the parameters of the occlusion analysis model through the gradient descent algorithm until the occlusion detection of the occlusion analysis model is accurate. rate reaches the preset value.
The method of claim 2, wherein the training process of the aesthetic evaluation model comprises:

Construct an aesthetic training sample set, and the true aesthetic score marked by each aesthetic training sample is the average of multiple users' scores based on preset aesthetic angles;

The aesthetic evaluation model is trained using the aesthetic training sample set to obtain a trained aesthetic evaluation model.
The method of claim 2, wherein the shooting habit analysis model is defined as:
defined as

in,
is a hyperparameter, used to adjust the weight parameter of the probability density of the Gaussian distribution, μ represents the mean value of the Gaussian distribution, σ represents the standard deviation of the Gaussian distribution, υ duration represents the duration of the video to be edited, and β is a hyperparameter used to adjust The variance of the probability density of a Gaussian distribution.
The method of claim 1, wherein the ensemble learning model adopts a single-layer linear dense connection layer, and the comprehensive score curve is as follows:

Among them, M is the dimension set, w k and b k are the parameters of the dense connection layer, which are used to weight the weight of the standardized score curve vector corresponding to each dimension, and s′ k is the standardized score of the score curve corresponding to each dimension Curve vector, s is the comprehensive score curve vector.
The method of claim 1, wherein the step of selecting a clipping region from the first number of video sub-segments according to the comprehensive score curve comprises:

Calculate the average video score of each of the video sub-segments according to the comprehensive score curve;

Sort the video sub-segments in descending order according to the average video score, select a second number of video sub-segments according to the sorting result, and select the editing region from the second number of video sub-segments .
The method of claim 9, wherein the step of selecting a second number of video sub-segments according to the sorting result, and selecting the editing region from the second number of video sub-segments, comprises:

The clipping or complementing operation is performed on the video sub-segments whose video lengths do not belong to the preset video length interval in the second number of video sub-segments, the second number of video sub-segments after the operation are obtained, and the The region corresponding to the second number of video sub-segments is used as the clipping region.
A video editing system, characterized in that the system comprises:

a video segmentation module, used for segmenting the video to be edited according to a preset video segmentation algorithm to obtain a first number of video sub-segments;

a dimension analysis module, configured to analyze the video to be edited from a preset dimension to obtain a score curve corresponding to the dimension;

An integrated learning module for processing all the score curves using a preset integrated learning model to generate a comprehensive score curve of the video to be edited;

a segment search module for selecting clip regions from the first number of video sub-segments according to the comprehensive score curve; and

The editing module is used for editing the video to be edited according to the editing area to obtain the edited video.
The system of claim 11, wherein the dimensional analysis module comprises:

A digest analysis module, configured to generate a score curve of the video to be edited in the digest dimension by calculating the digest score of each video sub-segment; and/or

An occlusion analysis module, configured to analyze the occlusion probability of each video frame through an occlusion analysis model, and generate a score curve of the video to be edited in the occlusion dimension; and/or

An aesthetic analysis module for analyzing the aesthetic score of each video frame through an aesthetic evaluation model, and generating a score curve of the video to be edited in the aesthetic dimension; and/or

A shooting habit analysis module, configured to use a preset shooting habit analysis model to generate a score curve of the video to be edited in the shooting habit dimension; and/or

An image entropy analysis module for generating a score curve of the video to be edited on the image entropy dimension by counting the amount of information of each video frame; and/or

The texture analysis module is configured to generate a score curve of the video to be edited in the texture dimension by analyzing the texture features of each video frame.
An electronic device, comprising a memory, a processor, and a computer program stored in the memory and running on the processor, characterized in that, when the processor executes the computer program, the computer program according to claim 1 to 10. The steps of any one of the methods.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 10 are implemented.