CN112218005B - Video editing method based on artificial intelligence


Info

Publication number
CN112218005B
CN112218005B (Application CN202011011696.XA)
Authority
CN
China
Prior art keywords
type
target area
video
target
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011011696.XA
Other languages
Chinese (zh)
Other versions
CN112218005A (en)
Inventor
杨邵华
谢金元
廖海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sz Reach Tech Co ltd
Original Assignee
Sz Reach Tech Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sz Reach Tech Co ltd
Priority to CN202011011696.XA
Publication of CN112218005A
Application granted
Publication of CN112218005B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85 Assembly of content; Generation of multimedia applications
    • H04N21/854 Content authoring
    • H04N21/85406 Content authoring involving a specific file format, e.g. MP4 format

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application relates to the technical field of video editing and provides an artificial-intelligence-based video editing method comprising the following steps: acquiring multiple frame images from a video to be edited; determining, in each frame image, an area containing at least one type of target as a target area, and determining a weight value for the at least one type of target area; and, when the weight value of at least one type of target area is greater than or equal to a preset threshold, editing the video to be edited according to that type of target area to generate a target video. The method completes video editing automatically, which reduces the workload of manual editing, improves editing efficiency, and lowers labor cost; it also preserves image continuity in the target video and avoids screen flicker.

Description

Video editing method based on artificial intelligence
Technical Field
The application relates to the technical field of video editing, in particular to an artificial intelligence-based video editing method.
Background
With the development of internet technology, video recording has become a major means for people to share and communicate. In the field of education, for example, a teacher records a teaching video that then serves as learning material for students. Producing a video involves both recording and post-production. During recording, factors such as environmental changes and repeated takes yield raw footage with poor viewability. In post-production, video editing turns this raw footage into a complete video that is coherent, clearly themed, and highly watchable, so that viewers can better absorb the information it conveys.
Existing video editing is done manually; however, the workload of video editing is very large, manual editing is complex and time-consuming, and editing efficiency is low.
Disclosure of Invention
The embodiments of the application aim to provide an artificial-intelligence-based video editing method that can complete video editing automatically and improve editing efficiency. The specific technical scheme is as follows:
in a first aspect, embodiments of the present application provide an artificial intelligence based video editing method, the method including:
acquiring multi-frame images in a video to be clipped;
determining an area including at least one type of target in each frame of image as a target area, wherein the types of target areas corresponding to the targets of the same type are the same;
determining a weight value of the at least one type of target area;
and when the weight value of the at least one type of target area is larger than or equal to a preset threshold value, editing the video to be edited according to the at least one type of target area to generate a target video.
In particular, the acquiring the multi-frame image in the video to be clipped includes:
analyzing the video to be clipped to obtain all frame images included in the video to be clipped;
and extracting part of the frame images from all the frame images to obtain the multi-frame image.
In particular, the determining the weight value of the at least one type of target area includes:
determining an initial weight value of each target area in the multi-frame image;
counting the sum of initial weight values of the at least one type of target areas and the number of the at least one type of target areas;
calculating the average weight of the at least one type of target area according to the sum of the initial weight values and the number;
and determining the weight value of the at least one type of target area according to the average weight and configuration information, wherein the configuration information comprises the type of the target area, the highest weight score and the lowest weight score of each type of target area.
In particular, the determining the weight value of the at least one type of target area includes:
according to the multi-frame images, determining that the multi-frame images comprise a first group of frame images and a second group of frame images, wherein videos corresponding to the first group of frame images are continuously shot, and videos corresponding to the second group of frame images are continuously shot;
and determining at least two weight values corresponding to the at least one type of target area according to the first group of frame images and the second group of frame images.
Specifically, when the weight value of the at least one type of target area is greater than or equal to a preset threshold, clipping the video to be clipped according to the at least one type of target area to generate a target video, including:
according to the configuration information, determining the target area corresponding to the first type as a target area which is required to be cut, and determining the target area corresponding to the second type as a target area which is not required to be cut;
determining, for the multi-frame image, whether the at least one type includes the first type;
when the at least one type comprises the first type, editing the video to be edited according to a target area corresponding to at least one type, wherein the weight value of the target area is larger than or equal to a preset threshold value, in the first type, so as to generate a target video;
or when the at least one type does not comprise the first type and the at least one type comprises the second type, editing the video to be edited according to a target area corresponding to at least one type, in which the weight value in the second type is greater than or equal to a preset threshold value, and generating a target video.
In particular, the editing the video to be edited according to the at least one type of target area to generate a target video includes:
cropping the at least one type of target region from the multi-frame image;
and generating a target video according to the width-height dimension of the at least one type of target region after cutting and the preset width-height dimension.
Specifically, the generating the target video according to the width-height dimension of the at least one type of target area after clipping and the preset width-height dimension includes:
processing the cut at least one type of target area, wherein the size of at least one side in the width and height of the processed at least one type of target area is equal to the size of the side corresponding to the at least one side in the preset width and height;
color filling the width of the processed at least one type of target area when the width of the processed at least one type of target area is not equal to a preset width, or color filling the height of the processed at least one type of target area when the height of the processed at least one type of target area is not equal to a preset height;
and generating the target video according to the at least one type of the filled target area.
In a second aspect, an embodiment of the present application provides an electronic device, where the device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the artificial intelligence based video editing method of the first aspect when executing the computer program.
In a third aspect, embodiments of the present application provide a computer readable storage medium storing a computer program, which when executed by a processor implements the artificial intelligence based video editing method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer program product comprising a computer program for performing the artificial intelligence based video editing method of the first aspect when the computer program is executed by a processor.
The beneficial effects of the embodiment of the application are that:
according to the video editing method based on the artificial intelligence, the area including at least one type of target in each frame of image can be determined as the target area, and video to be edited is edited according to the target area, so that video editing work can be automatically completed, workload of manual editing is reduced, editing efficiency is improved, and labor cost is reduced. Further, the same type of target areas may be distributed in one frame of image or may be distributed in different frame of image, and the weight values between two target areas under the same type may be different.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a first video editing method based on artificial intelligence according to an embodiment of the present application.
Fig. 2 is a flowchart of a second video editing method based on artificial intelligence according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a clipping region according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an apparatus 400 according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an apparatus 500 according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
Fig. 1 is a schematic flow chart of an artificial intelligence based video editing method 100 provided in an embodiment of the present application. The method 100 includes at least the steps of:
s101, acquiring multi-frame images in a video to be clipped;
s102, determining an area including at least one type of target in each frame of image as a target area, wherein the types of target areas corresponding to the targets of the same type are the same;
s103, determining a weight value of at least one type of target area;
and S104, when the weight value of at least one type of target area is larger than or equal to a preset threshold value, editing the video to be clipped according to the at least one type of target area, and generating the target video.
According to the artificial-intelligence-based video editing method provided by the embodiments of the application, an area including at least one type of target in each frame image can be determined as a target area, and the video to be edited is edited according to the target area. Video editing is thus completed automatically, reducing the workload of manual editing, improving editing efficiency, and lowering labor cost. Further, target areas of the same type may be distributed within one frame image or across different frame images, and the weight values of two target areas of the same type may differ.
The device performing the video editing acquires a video file to be edited, in formats such as mp4, mov, and wmv; the file may be obtained locally, downloaded from a network, or imported from another device. The video to be edited is parsed and decoded to obtain the original video data, i.e., all frame images included in the video, where one frame image is one picture.
After the original video data is obtained, one implementation performs feature extraction directly on the images; another processes the original video data first and then performs feature extraction on the images.
Taking the processing of the original video data as an example, as shown in fig. 2, in the embodiment of the present application, the processing of the original video data includes image scaling and reducing the frame rate, where the image scaling is used to reduce the resolution of the image, for example, the resolution of the image is reduced from 1080P to 480P or from 1080P to 720P. Image scaling can reduce the amount of data processing and improve the efficiency of editing.
Reducing the frame rate means extracting frames from all frame images at a specified extraction frequency to obtain the multi-frame images. The extraction frequency specifies extracting a preset number of video frames at intervals of a certain number of frames, such as one frame every 3 frames, one frame every 5 frames, or two frames every 10 frames. Reducing the frame rate cuts the number of video frames to be processed and hence the amount of data processing, improving editing efficiency for videos with too many frames. The extracted multi-frame images are uniformly distributed across the video to be edited and can accurately represent its content, ensuring the quality and accuracy of the editing.
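As a rough illustration of this preprocessing step, the following sketch decodes a video, keeps every N-th frame, and downscales each kept frame. It assumes OpenCV (cv2) is available; the function name and default values are illustrative and not taken from the patent.

```python
import cv2

def extract_frames(video_path, step=5, target_height=480):
    """Keep every `step`-th frame of a video, downscaled to `target_height`.

    A minimal sketch of the preprocessing described above (image scaling
    plus frame-rate reduction); names and defaults are illustrative.
    """
    cap = cv2.VideoCapture(video_path)
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            h, w = frame.shape[:2]
            scale = target_height / h
            # Reduce resolution (e.g. 1080P to 480P) to cut the data volume.
            frame = cv2.resize(frame, (int(w * scale), target_height))
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```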
In the embodiments of the application, video editing is completed automatically through artificial intelligence techniques such as target detection, face detection, and target tracking. After the original video data is processed into multi-frame images, features are extracted from each frame image to detect the targets appearing in it. Targets fall into various types, including faces, human bodies, tables, mobile phones, cars, animals, and plants; the types to be detected are determined by the actual video content and user settings. In a teaching video, for example, the targets include faces, human bodies, the blackboard, and desks.
In the embodiments of the application, a face detection model detects the faces appearing in each frame image, yielding the coordinates and areas of several face target areas per frame; one target area corresponds to one face. A face may be a front face or a side face, and the target area's type is marked accordingly; a front face contains all facial features, while a side face contains the important ones. A target area is rectangular: its coordinates are those of the rectangle's four corner points, from which its area is computed. The face detection model is BlazeFace, a lightweight model that detects faces in an image and recognizes facial key points, thereby producing the face target areas.
The various target types other than faces are collectively called objects. Objects appearing in each frame image are detected by a pre-trained object detection model, yielding the coordinates of several object target areas per frame; one target area corresponds to one object, and each target area's type is marked according to its object's type. The object detection model is trained on the features of the different object types to be detected, so that those types can be detected at image-detection time. In the embodiments of the application, the object detection model is a Single Shot MultiBox Detector (SSD), a fully convolutional detector in which different convolution layers detect objects of different sizes.
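The patent names BlazeFace for faces and an SSD model for other objects. As a minimal stand-in, the sketch below uses the Haar cascade bundled with the opencv-python distribution to show how per-frame target areas with type labels could be collected; the detector choice, function name, and parameters are assumptions for illustration, not the patent's models.

```python
import cv2

# Stand-in for the BlazeFace detector named in the text; the Haar cascade
# ships with opencv-python and returns rectangles as (x, y, w, h).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_target_areas(frame):
    """Return a list of (type, rect) target areas for one frame image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    areas = []
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
        # Each detection becomes one target area tagged with its type.
        areas.append(("face", (x, y, w, h)))
    # An SSD-style object detector would append further entries here,
    # e.g. ("desk", rect) or ("blackboard", rect).
    return areas
```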
After at least one type of target area is obtained, a weight value is determined for each type of target area. In one implementation, the weight value of each type of target area in the multi-frame images is calculated directly. In another implementation, the multi-frame images are checked for a scene switch every preset number of frames, and at least two weight values of each type of target area are calculated from the detection result. A group of continuously shot frames is called a shot; when the shot switches, the scene switches.
There are many algorithms for detecting scene switches, such as template matching, histogram comparison, and block-based methods. They all rest on the fact that adjacent video frames differ in chromaticity, brightness, and so on; when that difference exceeds a certain threshold, a shot switch is considered to have occurred between the two frames. The embodiments of the application use the histogram method: draw the histogram of the current frame image and the histogram of the frame image preceding it; compute the difference between the two histograms; and when the difference exceeds a preset threshold, determine that the scene has switched. For example, with a preset interval of 15 frames (i.e., scene-switch detection every 15 frames), the preceding frame image may be the 15th frame and the current frame image the 16th.
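A minimal sketch of the histogram comparison described above, assuming OpenCV; the Hue/Saturation histogram, the correlation metric, and the 0.5 threshold are illustrative choices rather than values fixed by the patent.

```python
import cv2

def is_scene_cut(prev_frame, curr_frame, threshold=0.5):
    """Histogram-based shot-change test between two adjacent sampled frames."""
    hists = []
    for frame in (prev_frame, curr_frame):
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        hists.append(hist)
    # Correlation is 1.0 for identical histograms; a low value means the
    # frames differ enough to count as a shot switch.
    similarity = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_CORREL)
    return similarity < threshold
```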
Take detecting whether the scene switches as an example. If the scene switches, the weight value of at least one type of target area is calculated over the multi-frame images between the previous switch and the current one; each of those frame images and its target areas, including coordinates and weight values, is stored, and the images are then edited. It should be understood that if the scene switches only once within the multi-frame images, the first group of frame images runs from the first frame to the frame where the switch occurs, and the second group runs from the switch to the last frame. A weight value of at least one type of target area is calculated from the first group of frame images, and another from the second group. If the scene switches several times, the multi-frame images further include a third group of frame images, a fourth group, and so on.
For example, if the scene switches between the 40th and 41st frame images, frames 1 through 40 form the first group, and the first weight value of at least one type of target area is calculated over them; if the scene then switches between the 80th and 81st frame images, frames 41 through 80 form the second group, and the second weight value is calculated over them. For a given type of target area, the first and second weight values may or may not be equal. A preset threshold is then applied per weight value: if the first weight value exceeds the threshold while the second does not, only the corresponding type of target area in the group of frame images behind the first weight value is cropped.
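Building on the is_scene_cut helper sketched earlier, splitting the sampled frames into continuously shot groups might look like the following; the function name and structure are illustrative assumptions.

```python
def split_into_shots(frames):
    """Group sampled frames into shot segments; each group is one shot."""
    groups, current = [], [frames[0]]
    for prev, curr in zip(frames, frames[1:]):
        if is_scene_cut(prev, curr):
            # Scene switch detected: close the current group.
            groups.append(current)
            current = []
        current.append(curr)
    groups.append(current)
    return groups
```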
If the shot does not switch, each frame image and the coordinates and initial weight values of its target areas are stored, the number of stored frames is counted, and it is checked against a preset threshold, e.g. 600 frames. Once 600 or more frames are stored, the weight value of at least one type of target area is calculated over those 600 frame images; each 600 frames form one group, the target-area weight values are stored, and the images are then edited. Detecting scene switches in this way keeps consecutive pictures in the edited video continuous and complete, avoiding screen flicker and shot jitter.
The calculation of the weight value of at least one type of target region is explained below. The weight calculation process is as follows:
determining an initial weight value of each target area in the multi-frame image;
counting the sum of initial weight values of at least one type of target area and the number of at least one type of target area;
calculating the average weight of at least one type of target area according to the sum and the number of the initial weight values;
and determining the weight value of at least one type of target area according to the average weight and configuration information, wherein the configuration information comprises the type of the target area, the highest weight score and the lowest weight score of each type of target area.
For an object target area, the initial weight value is set to a fixed value, such as 1.0 or 0.8. For a face target area, the initial weight value is calculated from the target area's area and color: initial weight value = area weight + color weight. The area specific gravity is set within (0.0-1.0) and the color specific gravity within (0.0-1.0); both values are preset, for example area specific gravity 0.6 and color specific gravity 0.4.
Here, area weight = (detection frame area × area specific gravity) / image area, where the image area is calculated from the image's width and height.
Color weight = detection frame color weight value × color specific gravity. The detection frame color weight value is computed as follows: convert the image inside the detection frame to the HSV (Hue, Saturation, Value) color space; generate a two-dimensional Hue/Saturation histogram; convert it to a color histogram and apply a weighting over saturation; compute the histogram's entropy from the weighted values; the resulting entropy is the detection frame color weight value. Pixels that are too dark or too bright inside the detection frame are masked out before the two-dimensional histogram is generated, improving the accuracy of the calculation. An over-bright pixel's value is greater than or equal to a first preset threshold, and an over-dark pixel's value is less than or equal to a second preset threshold; for example, over-bright pixels satisfy R ≥ 250, G ≥ 250, B ≥ 250, and over-dark pixels satisfy R ≤ 5, G ≤ 5, B ≤ 5.
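The following is a sketch of the face-region initial weight under stated assumptions: it uses a plain Hue/Saturation histogram rather than the saturation-weighted variant described above, and the entropy normalization constant is an assumption, since the text only states that the histogram entropy serves as the color weight value.

```python
import cv2
import numpy as np

AREA_SPECIFIC_GRAVITY = 0.6   # preset, as in the text
COLOR_SPECIFIC_GRAVITY = 0.4  # preset, as in the text

def initial_face_weight(frame, rect):
    """Initial weight of a face target area: area weight + color weight."""
    x, y, w, h = rect
    img_h, img_w = frame.shape[:2]
    area_weight = (w * h) * AREA_SPECIFIC_GRAVITY / (img_w * img_h)

    roi = frame[y:y + h, x:x + w]
    # Mask out over-bright (all channels >= 250) and over-dark (<= 5)
    # pixels before building the Hue/Saturation histogram.
    valid = ~(np.all(roi >= 250, axis=2) | np.all(roi <= 5, axis=2))
    hsv = cv2.cvtColor(roi, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], valid.astype(np.uint8) * 255,
                        [30, 32], [0, 180, 0, 256])
    p = hist.ravel() / max(hist.sum(), 1e-9)
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    # Normalize entropy to (0, 1]; log2(30 * 32) is its maximum here.
    color_value = entropy / np.log2(30 * 32)
    return area_weight + color_value * COLOR_SPECIFIC_GRAVITY
```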
Count the number of target areas of the same type in the multi-frame images and add their initial weight values to obtain the initial weight value sum of that type of target area; the type's average weight is then calculated from the sum and the count: average weight = initial weight value sum / number.
The weight value of at least one type of target area is then determined from the average weight and the configuration information, which contains the types of target areas and each type's highest and lowest weight scores. The formula is: weight value = average weight × (highest score - lowest score) + lowest score.
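Written out directly from the formula above (a transcription, with the caveat that an average weight above 1 can push the result past the highest score, exactly as in the worked example that follows):

```python
def type_weight_value(average_weight, highest_score, lowest_score):
    """Map a type's average initial weight to its final weight value."""
    return average_weight * (highest_score - lowest_score) + lowest_score

# Worked example from the text: an average weight of 3 with scores
# 0.9 / 0.85 gives 3 * (0.9 - 0.85) + 0.85 = 1.0 for the front-face type.
```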
The configuration information may be stored in the device in various forms, such as a configuration table. An exemplary configuration table is shown in Table 1:
Table 1. Configuration table
Type         Highest score   Lowest score   Must-crop class
Front face   0.9             0.85           Yes
Side face    0.85            0.8            Yes
Pet          0.2             0.1            No
Car          0.2             0.1            No
For example, suppose the target-area type is front face and there are 10 such target areas in the multi-frame images (one frame image may contain several, e.g. 3). Because each target area differs in area and color, each has a different initial weight value. Suppose their sum is 30; the average weight is then 3. With a highest score of 0.9 and a lowest score of 0.85 for the front-face type, the weight value = 3 × (0.9 - 0.85) + 0.85 = 1, i.e., the weight value of the front-face target areas in the multi-frame images is 1.
After at least two weight values of at least one type of target area are obtained, one implementation directly edits the video to be edited according to any type of target area whose weight value is greater than or equal to the preset threshold, generating the target video.
In another implementation, target-area types are divided into a must-crop class and a non-must-crop class; in the embodiments of the application, whether a type belongs to the must-crop class is stored in the configuration information. The division depends on the actual video content: in videos with human subjects, face target-area types are must-crop and the other types are not; in zoo-themed videos, animal target-area types are must-crop and the other types are not. The highest and lowest scores of must-crop types are set larger, and those of non-must-crop types smaller.
According to the configuration information, the target area corresponding to the first type is determined to be a must-crop target area, and the target area corresponding to the second type a non-must-crop target area. For the multi-frame images, it is then determined whether the at least one type includes the first type: the target areas detected in a frame image may include both the first and second types, only the first type, or only the second type.
When the at least one type includes the first type, the video to be edited is edited according to the first-type target areas whose weight values are greater than or equal to the preset threshold, generating the target video. When the at least one type does not include the first type but includes the second type, the video is edited according to the second-type target areas whose weight values are greater than or equal to the preset threshold; alternatively, the second type with the largest weight value is selected and the video is edited according to its target areas. Either way a target video is generated. The method thus highlights the important targets in the video to be edited.
Before scene cropping, the width and height of the target video image are calculated from the aspect ratio of the original video image and a preset aspect ratio (e.g. 9:16 or 1:1, width:height). When the original aspect ratio is smaller than the preset one: target video image width = original video image width, and target video image height = target video image width / preset aspect ratio, rounded to an integer that is a multiple of 2. When the original aspect ratio is greater than the preset one: target video image height = original video image height, and target video image width = target video image height × preset aspect ratio, likewise rounded. For example, an original video image 1920 pixels wide and 1080 high has aspect ratio 16:9; with a preset aspect ratio of 9:16, the target video image height is 1080 and its width is 1080 × 9/16 = 607.5, rounded to 608. This maximizes the width and height of the target video image.
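A sketch of the target-dimension calculation as reconstructed above, assuming the preset aspect ratio is given as width:height; the function name and rounding helper are illustrative.

```python
def target_dimensions(src_w, src_h, preset_w=9, preset_h=16):
    """Target video width/height from the preset aspect ratio, rounded
    to an even integer as described in the text."""
    def round_even(x):
        return int(round(x / 2)) * 2

    if src_w / src_h < preset_w / preset_h:
        return src_w, round_even(src_w * preset_h / preset_w)
    return round_even(src_h * preset_w / preset_h), src_h

# Example from the text: a 1920x1080 source with a 9:16 preset gives
# width 1080 * 9 / 16 = 607.5, rounded to 608, and height 1080.
print(target_dimensions(1920, 1080))  # (608, 1080)
```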
Take the case where the at least one type includes the first type. For a frame image, the clipping region is determined from the coordinates of the first-type target areas whose weight values are greater than or equal to the preset threshold. There may be one such target area or several. If there is one, its width and height are the clipping region's width and height; if there are several, the bounding rectangle they form gives the clipping region's width and height.
As shown in fig. 3, suppose that in one frame image there are 3 first-type target-area types whose weight values are greater than or equal to the preset threshold, with one target area each, i.e. 3 target areas to crop: area 1, area 2, and area 3. The clipping region is the bounding rectangle determined from their coordinates: the left-most and right-most edges among the 3 target areas give the clipping region's width, and the top-most and bottom-most edges give its height.
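The bounding rectangle of several target areas, as described for fig. 3, reduces to taking the extreme edges; a small helper in the same illustrative style:

```python
def union_crop(rects):
    """Bounding rectangle of several (x, y, w, h) target areas: the
    left-most, top-most, right-most, and bottom-most edges."""
    left = min(x for x, y, w, h in rects)
    top = min(y for x, y, w, h in rects)
    right = max(x + w for x, y, w, h in rects)
    bottom = max(y + h for x, y, w, h in rects)
    return left, top, right - left, bottom - top
```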
The clipping region is cut out of the multi-frame images and stored. When the target video is produced, the clipping region's width is reduced or enlarged to the target video image width and its height is scaled by the same ratio, so that the scaled height is less than or equal to the target video image height; or the clipping region's height is reduced or enlarged to the target video image height and its width is scaled by the same ratio, so that the scaled width is less than or equal to the target video image width.
In one implementation, after the maximum extent of the clipping region is determined, the center point of the second-type target areas in the frame image is calculated from their coordinates. If a center point falls inside the clipping region, the clipping region's width and height are recalculated from the second-type target areas whose weight values are greater than or equal to the preset threshold, the clipping region's maximum extent is updated, and scene cropping uses the updated extent, making the scene more complete.
If the images were scaled, the actual coordinates of each target area in the original video image are recovered from the original image's width and height, the scaling ratio, and the scaling mode; the clipping region's maximum extent is determined, scene cropping is performed on the original video image, and the clipping region's width and height are converted to the target video image's width and height.
If the original video image has upper and lower 'black edge' borders, the upper and lower boundaries of each frame image are detected to obtain the border heights; the minimum upper and lower heights over all frames are taken, and a new image is cut out of the original after removing them. For example, with minimum upper and lower heights of 50 pixels and an original image height of 1080 pixels, the new image is 980 pixels high. The target-area coordinates are then adjusted to the new image's width and height, and the clipping region's maximum extent is determined from the adjusted coordinates. This displays the targets more clearly when the image has upper and lower 'black edge' borders.
For example, if the clipping region is 800 pixels wide and 600 pixels high while the target video image is 608 pixels wide and 1080 pixels high, the clipping region's width is reduced to the target width, i.e. from 800 to 608 pixels, and its height is scaled by the same ratio of 608/800, giving 600 × 608/800 = 456 pixels.
If the clipping region's aspect ratio differs from the target video image's, then after enlargement or reduction the clipping region's height is smaller than the target video image height, and the cropped image is filled with a margin color above and below so that its height equals the target height; or the scaled width is smaller than the target video image width, and the cropped image is filled with a margin color on the left and right so that its width equals the target width. One implementation fills with a fixed color. Another detects the dominant border color of the original video image and fills with it, making the video more aesthetically pleasing.
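A sketch of the scale-then-pad step, assuming OpenCV. It pads with a fixed color, whereas the text also suggests filling with the detected dominant border color; centering the padding is likewise an illustrative choice.

```python
import cv2

def fit_with_padding(crop, out_w, out_h, fill=(0, 0, 0)):
    """Scale a cropped region into an out_w x out_h frame, preserving its
    aspect ratio, and pad the leftover margins with a fill color."""
    h, w = crop.shape[:2]
    scale = min(out_w / w, out_h / h)
    new_w, new_h = int(w * scale), int(h * scale)
    resized = cv2.resize(crop, (new_w, new_h))
    pad_x, pad_y = out_w - new_w, out_h - new_h
    # Split the padding between both sides so the content stays centered.
    return cv2.copyMakeBorder(resized,
                              pad_y // 2, pad_y - pad_y // 2,
                              pad_x // 2, pad_x - pad_x // 2,
                              cv2.BORDER_CONSTANT, value=fill)
```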
The boundary detection flow is as follows. Detect the dominant color of the first row at the top of the image (height 1, width equal to the image width), where the dominant color's pixels account for more than a preset threshold of that row's pixels. Grow the boundary from the first row's dominant color: successively check whether the difference between the RGB value of the current row and that of the next row is below a preset threshold; if so, the next row's color is close to the first row's and it is added to the boundary. This yields the upper boundary height, and the lower boundary height is obtained in the same way. Then check the proportion of the dominant color in the non-boundary area relative to the image area; if it exceeds 60%, the frame image is considered a background image and is cropped directly according to the original video image and the preset aspect ratio, with no further processing. Images in which no face or object is detected are likewise cropped directly according to the original video image and the preset aspect ratio.
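A simplified sketch of the upper-border estimate, assuming NumPy; it uses the per-channel median of the first row as a stand-in for the dominant color, and the per-row difference threshold is illustrative.

```python
import numpy as np

def top_border_height(frame, max_diff=20):
    """Estimate the height of a top 'black edge' border by extending it
    downwards while each row stays close to the first row's color."""
    rows = frame.astype(np.int32)
    dominant = np.median(rows[0], axis=0)  # stand-in for the modal color
    height = 0
    for row in rows:
        if np.abs(row - dominant).mean() >= max_diff:
            break
        height += 1
    return height
```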
Different coding formats can be set; the processed clipping regions are video-encoded and format-encapsulated, and the edited video is then stored to obtain the target video.
The artificial intelligence-based video editing method according to the embodiments of the present application is described in detail above with reference to fig. 1 to 3, and the apparatus and device provided by the embodiments of the present application are described in detail below with reference to fig. 4 to 5.
Fig. 4 is a schematic block diagram of an apparatus 400 provided in an embodiment of the present application, including a receiving unit 401 and a processing unit 402.
A receiving unit 401, configured to acquire multi-frame images in a video to be clipped;
a processing unit 402, configured to determine an area including at least one type of object in each frame image as an object area, where the types of object areas corresponding to the same type of objects are the same; determining a weight value of at least one type of target area; when the weight value of at least one type of target area is larger than or equal to a preset threshold value, the video to be clipped is clipped according to the at least one type of target area, and the target video is generated.
In particular, the processing unit 402 is further configured to parse the video to be clipped to obtain all frame images included in it, and to extract part of the frame images from all the frame images to obtain the multi-frame images.
In particular, the processing unit 402 is further configured to determine an initial weight value of each target area in the multi-frame image; counting the sum of initial weight values of at least one type of target area and the number of at least one type of target area; calculating the average weight of at least one type of target area according to the sum and the number of the initial weight values; and determining the weight value of at least one type of target area according to the average weight and configuration information, wherein the configuration information comprises the type of the target area, the highest weight score and the lowest weight score of each type of target area.
In particular, the processing unit 402 is further configured to determine, according to the multiple frame images, that the multiple frame images include a first set of frame images and a second set of frame images, where the video corresponding to the first set of frame images is continuously shot, and the video corresponding to the second set of frame images is continuously shot; and determining at least two weight values corresponding to at least one type of target area according to the first group of frame images and the second group of frame images.
Specifically, the processing unit 402 is further configured to determine, according to the configuration information, that the target area corresponding to the first type is a target area that needs to be cut, and determine that the target area corresponding to the second type is a target area that does not need to be cut; determining, for the multi-frame image, whether at least one type includes a first type; when at least one type comprises a first type, editing the video to be clipped according to a target area corresponding to at least one type, wherein the weight value of the target area is larger than or equal to a preset threshold value, in the first type, so as to generate a target video; or when at least one type does not comprise the first type and at least one type comprises the second type, editing the video to be clipped according to a target area corresponding to at least one type, the weight value of which is greater than or equal to a preset threshold value, in the second type, so as to generate a target video.
In particular, the processing unit 402 is further configured to crop the at least one type of target region from the multi-frame images, and to generate a target video according to the cropped target region's width-height dimensions and the preset width-height dimensions.
Specifically, the processing unit 402 is further configured to process the at least one type of target area after clipping, where a dimension of at least one side in a width-height of the at least one type of target area after processing is equal to a dimension of a side corresponding to the at least one side in a preset width-height; color filling the width of the processed at least one type of target area when the width of the processed at least one type of target area is not equal to a preset width, or color filling the height of the processed at least one type of target area when the height of the processed at least one type of target area is not equal to a preset height; and generating the target video according to the at least one type of the filled target area.
It should be appreciated that the apparatus 400 of the embodiments of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The artificial-intelligence-based video editing method of fig. 1 may also be implemented in software, in which case the apparatus 400 and its modules may be software modules.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the device 500 includes a processor 501, a memory 502, a communication interface 503, and a bus 504. The processor 501, the memory 502, and the communication interface 503 communicate via the bus 504, or by other means such as wireless transmission. The memory 502 stores instructions, and the processor 501 executes the instructions stored in the memory 502. The memory 502 holds program code 5021, which the processor 501 can invoke to perform the artificial-intelligence-based video editing method shown in fig. 1.
It should be appreciated that in embodiments of the present application, the processor 501 may be a CPU, and the processor 501 may also be other general purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 502 may include read-only memory and random access memory and provides instructions and data to the processor 501. The memory 502 may also include non-volatile random access memory. The memory 502 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The bus 504 may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. But for clarity of illustration the various buses are labeled as bus 504 in fig. 5.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may take the form of a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center containing one or more available media. The available media may be magnetic (e.g., floppy disk, hard disk, magnetic tape), optical (e.g., DVD), or semiconductor media; a semiconductor medium may be a solid state drive (SSD).
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (8)

1. A video editing method based on artificial intelligence, the method comprising:
acquiring multi-frame images in a video to be clipped;
determining an area including at least one type of target in each frame of image as a target area, wherein the types of target areas corresponding to the targets of the same type are the same;
determining a weight value of each type of target area in the target areas;
when the weight value of the target area is larger than or equal to a preset threshold value, editing the video to be edited according to the target area to generate a target video;
the determining the weight value of each type of target area in the target areas comprises the following steps:
determining an initial weight value of each target area in the multi-frame image;
counting the sum of initial weight values of each type of target area in the target areas and the number of each type of target areas in the target areas;
calculating the average weight of each type of target area in the target areas according to the initial weight value sum and the number;
and determining the weight value of each type of target area in the target area according to the average weight and configuration information, wherein the configuration information comprises the type of the target area, the highest weight score and the lowest weight score of each type of target area.
2. The method of claim 1, wherein the acquiring multi-frame images in the video to be clipped comprises:
analyzing the video to be clipped to obtain all frame images included in the video to be clipped;
and extracting part of frame images from all the frame images to obtain the multi-frame image.
3. The method according to claim 1 or 2, wherein said determining a weight value for each type of target area of the target areas comprises:
according to the multi-frame images, determining that the multi-frame images comprise a first group of frame images and a second group of frame images, wherein videos corresponding to the first group of frame images are continuously shot, and videos corresponding to the second group of frame images are continuously shot;
and determining at least two weight values corresponding to each type of target area in the target area according to the first group of frame images and the second group of frame images.
4. The method of claim 1, wherein when the weight value corresponding to the target area is greater than or equal to a preset threshold, clipping the video to be clipped according to the target area, and generating a target video, includes:
according to the configuration information, determining the target area corresponding to the first type as a target area which is required to be cut, and determining the target area corresponding to the second type as a target area which is not required to be cut;
determining, for the multi-frame image, whether the at least one type includes the first type;
when the at least one type comprises the first type, editing the video to be edited according to a target area, in the first type, of which the weight value is greater than or equal to a preset threshold value, so as to generate a target video;
or when the at least one type does not comprise the first type and the at least one type comprises the second type, editing the video to be edited according to a target area with a weight value larger than or equal to a preset threshold value in the second type, and generating a target video.
5. The method of claim 1, wherein the editing the video to be edited according to the target area to generate a target video comprises:
clipping the target area from the multi-frame image;
and generating a target video according to the width and height dimensions of the cut target region and the preset width and height dimensions.
6. The method of claim 5, wherein generating the target video according to the cropped width-to-height dimension of the target region and the preset width-to-height dimension comprises:
processing the cut target area, wherein the size of at least one side in the width and height of the processed target area is equal to the size of the side corresponding to the at least one side in the preset width and height;
color filling the width of the processed target area when the width of the processed target area is not equal to a preset width, or color filling the height of the processed target area when the height of the processed target area is not equal to a preset height;
and generating a target video according to the filled target area.
7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the artificial intelligence based video editing method of any of claims 1 to 6 when the computer program is executed.
8. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the artificial intelligence based video editing method of any of claims 1 to 6.
CN202011011696.XA 2020-09-23 2020-09-23 Video editing method based on artificial intelligence Active CN112218005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011011696.XA CN112218005B (en) 2020-09-23 2020-09-23 Video editing method based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011011696.XA CN112218005B (en) 2020-09-23 2020-09-23 Video editing method based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN112218005A CN112218005A (en) 2021-01-12
CN112218005B 2023-06-16

Family

ID=74052257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011011696.XA Active CN112218005B (en) 2020-09-23 2020-09-23 Video editing method based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN112218005B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113163153A (en) * 2021-04-06 2021-07-23 游密科技(深圳)有限公司 Method, device, medium and electronic equipment for processing violation information in video conference
CN114584832B (en) * 2022-03-16 2024-03-08 中信建投证券股份有限公司 Video self-adaptive multi-size dynamic playing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10390082B2 (en) * 2016-04-01 2019-08-20 Oath Inc. Computerized system and method for automatically detecting and rendering highlights from streaming videos
CN108900896A (en) * 2018-05-29 2018-11-27 深圳天珑无线科技有限公司 Video clipping method and device
CN110505519B (en) * 2019-08-14 2021-12-03 咪咕文化科技有限公司 Video editing method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112218005A (en) 2021-01-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant