CN111901536B - Video editing method, system, device and storage medium based on scene recognition - Google Patents


Info

Publication number
CN111901536B
Authority
CN
China
Prior art keywords
picture, pictures, scene, frame, video
Prior art date
Legal status
Active
Application number
CN202010773076.3A
Other languages
Chinese (zh)
Other versions
CN111901536A (en)
Inventor
Fan Bo (范博)
Luo Chao (罗超)
Cheng Danni (成丹妮)
Hu Hong (胡泓)
Li Wei (李巍)
Current Assignee
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN202010773076.3A priority Critical patent/CN111901536B/en
Publication of CN111901536A publication Critical patent/CN111901536A/en
Application granted granted Critical
Publication of CN111901536B publication Critical patent/CN111901536B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/835Generation of protective data, e.g. certificates
    • H04N21/8358Generation of protective data, e.g. certificates involving watermark
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides a video clipping method, system, device and storage medium based on scene recognition, wherein the method comprises the following steps: extracting each frame of the original video as a first picture and generating a first picture set, in which each first picture is arranged according to the sequence of the pictures in the original video to form a frame linked list; cropping markers from the pictures in the first picture set to obtain second pictures and generate a second picture set; sequentially adding a shot label to each second picture in the second picture set according to a shot recognition model, and adding a scene label; clipping the second picture set according to a preset target frame number, a target scene label and a preset scene priority sequence, and outputting the third pictures in sequence to obtain a third picture set; and synthesizing all third pictures in the third picture set according to the sequence in the frame linked list to output a clipped video. The invention can automatically clip videos in batches, replacing manual video synthesis, which greatly saves operation cost and effectively improves operation efficiency.

Description

Video editing method, system, device and storage medium based on scene recognition
Technical Field
The present invention relates to the field of video editing, and in particular, to a method, system, device, and storage medium for video editing based on scene recognition.
Background
Video has recently emerged in the OTA (online travel agency) industry as a fast and effective means of capturing traffic. In the OTA industry, most selling points are displayed with pictures; however, a picture carries far less information than a video. At present, the various services of a travel platform generate a large number of OTA videos, including official hotel promotion videos, POI (point of interest) travel videos, and the like. These OTA videos suffer from non-uniform durations, unfocused content, and similar problems, which lead to redundant video content; presenting them directly in the app blurs the emphasis and degrades the user experience. Video clipping is therefore the main means of solving this problem. However, there has been little research on automatically clipping videos with deep learning methods.
Accordingly, the present invention provides a method, system, device and storage medium for video editing based on scene recognition.
Disclosure of Invention
Aiming at the problems in the prior art, the invention aims to provide a video editing method, system, device and storage medium based on scene recognition, which overcome the difficulties in the prior art: a target scene can be quickly segmented and extracted, highlight clipping by sub-scene is realized, and a short highlight video is finally generated from the original video, the short video replacing the original picture-based way of displaying selling points.
The embodiment of the invention provides a video clipping method based on scene identification, which comprises the following steps:
S110, extracting each frame of an original video as a first picture, generating a first picture set, arranging each first picture in the first picture set according to the sequence of the pictures in the original video, and forming a frame linked list;
S120, cropping markers from the pictures in the first picture set to obtain second pictures, and generating a second picture set, wherein the markers comprise a watermark and/or a subtitle;
S130, sequentially adding a shot label to each second picture in the second picture set according to a shot recognition model, and adding a scene label according to a scene recognition model;
S140, clipping the second picture set according to a preset target frame number, at least one target scene label and a preset scene priority sequence, and outputting third pictures in sequence to obtain a third picture set; and
S170, synthesizing all third pictures in the third picture set according to the sequence in the frame linked list, and outputting a clipped video according to the preset target video duration and frame rate.
Preferably, in step S110, a first frame number of the original video is obtained according to the duration and the frame rate of the original video, and the number of pictures in the first picture set is equal to the first frame number.
Preferably, the step S120 includes the steps of:
S121, obtaining a first area containing the watermark and a second area containing the subtitle in the picture through pattern recognition, and obtaining a first range where the first area is located and a second range where the second area is located;
S122, merging the first range and the second range of all the pictures in the first picture set to obtain an avoidance area;
S123, establishing a cropping region avoiding the avoidance area within the range of the picture;
S124, cropping all the first pictures according to the cropping region to obtain second pictures;
and S125, arranging the second pictures according to the frame linked list to obtain a second picture set.
Preferably, in step S121, a plane coordinate system is established for each picture in the first picture set, and a first coordinate range where the first area is located and a second coordinate range where the second area is located are obtained;
in step S122, merging the first coordinate ranges and the second coordinate ranges of all the pictures in the first picture set to obtain the coordinate range of an avoidance region;
in step S123, based on the range of the first picture, a first cropping frame with a maximum area and avoiding the coordinate range of the avoidance region is formed, a first length of the first cropping frame in the lateral direction of the picture is greater than a second length of the first cropping frame in the longitudinal direction of the picture, a second cropping frame with a maximum area and avoiding the coordinate range of the avoidance region is formed, a first length of the second cropping frame in the lateral direction of the picture is less than a second length of the second cropping frame in the longitudinal direction of the picture, and one of the first cropping frame and the second cropping frame with the maximum area is selected as a cropping region;
in step S124, all the first pictures are cropped through the coordinate range of the cropping region to obtain the second pictures.
Preferably, the first range, the second range, the first cropping frame and the second cropping frame are all rectangular areas.
Preferably, the step S130 further includes the steps of:
and obtaining the color parameters of each second picture in the second picture set, sequentially adding the same shot label to consecutive second pictures whose color-parameter difference is less than or equal to a preset threshold, and starting a shot label with an updated number when a second picture whose color-parameter difference is greater than the preset threshold is encountered.
Preferably, the color parameter is at least one of a luminance parameter, a histogram, an RGB value, an HSV value, and an HSL value.
Preferably, a scene classification model with multiple types of scene identification codes is used to classify each second picture in the second picture set, and the scene identification code corresponding to the classification result is added to each second picture.
Preferably, the step S130 further includes the steps of:
and performing cluster statistics on the scene identification codes in all second pictures corresponding to each shot label, and updating the scene identification codes of all second pictures under each shot label to the scene identification code with the highest occurrence frequency within that shot label.
Preferably, in step S140, the target frame number is obtained according to a preset target video duration and a preset frame rate.
Preferably, in step S140, all the second pictures with the target scene tags are selected as third pictures, a third picture set is generated, and the total number of the third pictures is compared with the target frame number;
when the total number of the third pictures is less than the target frame number, executing step S150;
when the total number of the third pictures is greater than the target frame number, executing step S160;
when the total number of the third pictures is equal to the target frame number, executing step S170;
the step S150 includes sequentially adding, from the unselected second pictures, the second pictures corresponding to one class of scene identification codes at a time to the third picture set, in the order of the preset scene identification code priority sequence;
the step S160 includes obtaining a retention ratio according to the ratio of the third picture set to the target frame number, and deleting a portion of the third pictures from the second pictures corresponding to each target scene label in the third picture set according to the retention ratio.
Preferably, the step S160 includes using a target frame selection box for each target scene label in the third picture set, where the frame count of the target frame selection box is the product of the number of third pictures corresponding to that target scene label and the retention ratio; the target frame selection box moves over the third pictures corresponding to each target scene label in time order with one frame as the step length, a video quality value is obtained through a video quality model after each movement, and finally the third pictures not selected by the target frame selection box with the highest video quality value are deleted.
The embodiment of the present invention further provides a video clipping system based on scene recognition, which is used for implementing the above video clipping method based on scene recognition, and the video clipping system based on scene recognition includes:
the video extraction module is used for extracting each frame of an original video as a first picture and generating a first picture set, wherein each first picture in the first picture set is arranged according to the sequence of the pictures in the original video and forms a frame linked list;
the picture cropping module is used for removing the markers from the pictures in the first picture set by cropping to obtain second pictures and generate a second picture set, wherein the markers comprise a watermark and/or a subtitle;
the label adding module is used for sequentially adding a shot label to each second picture in the second picture set according to a shot recognition model and adding a scene label according to the scene recognition model;
the video editing module is used for editing the second picture set according to the target frame number, at least one target scene label and a preset scene priority sequence and outputting third pictures in sequence to obtain a third picture set; and
and the video synthesis module synthesizes all third pictures in the third picture set according to the sequence in the frame linked list and outputs a clipped video according to the preset target video duration and frame rate.
An embodiment of the present invention further provides a video editing apparatus based on scene recognition, including:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the scene recognition based video clip method described above via execution of the executable instructions.
Embodiments of the present invention also provide a computer-readable storage medium storing a program that, when executed, implements the steps of the above-described scene recognition-based video clipping method.
The invention aims to provide a video editing method, system, device and storage medium based on scene recognition, which realize highlight clipping of an OTA video based on the scene recognition method: a target scene can be quickly segmented and extracted, highlight clipping by sub-scene is realized, and a short highlight video is finally generated from the original video, the short video replacing the original picture-based display of selling points. By automatically clipping videos in batches instead of synthesizing them manually, operation cost is greatly saved and operation efficiency is effectively improved.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, with reference to the accompanying drawings.
FIG. 1 is a flow chart of a scene recognition based video clipping method of the present invention.
Fig. 2 to 9 are process diagrams of the scene recognition-based video clipping method of the present invention.
FIG. 10 is a block diagram of a scene recognition based video clip system of the present invention.
Fig. 11 is a schematic structural diagram of a video editing device based on scene recognition according to the present invention; and
fig. 12 is a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar structures, and thus their repetitive description will be omitted.
FIG. 1 is a flow chart of a scene recognition based video clipping method of the present invention. As shown in fig. 1, an embodiment of the present invention provides a video clipping method based on scene recognition, including the following steps:
s110, extracting each frame of the original video as a first picture, producing a first picture set, arranging each first picture in the first picture set according to the sequence of the pictures in the original video, and forming a frame linked list.
And S120, cropping the pictures in the first picture set to remove the markers and obtain second pictures, and generating a second picture set, wherein the markers comprise a watermark and/or a subtitle, so that interference from watermarks and subtitles with the subsequent scene recognition is removed.
And S130, sequentially adding a shot label to each second picture in the second picture set according to the shot recognition model, and adding a scene label according to the scene recognition model, so that the frames can subsequently be processed by scene and by shot for video clipping.
S140, clipping third pictures from the second picture set in sequence according to the preset target frame number, at least one target scene label and the preset scene priority sequence to obtain a third picture set: all second pictures with the target scene labels are selected as third pictures to generate the third picture set, and the total number of third pictures is compared with the target frame number.
When the total number of the third pictures is less than the target frame number, step S150 is performed.
When the total number of the third pictures is greater than the target frame number, step S160 is performed.
When the total number of the third pictures is equal to the target frame number, step S170 is performed.
S150, sequentially adding, from the unselected second pictures, the second pictures corresponding to one class of scene identification codes at a time to the third picture set, in the order of the preset scene identification code priority sequence.
And S160, obtaining a retention ratio according to the ratio of the third picture set to the target frame number, and deleting part of the third pictures from the second pictures corresponding to each target scene label in the third picture set according to the retention ratio.
And S170, synthesizing all third pictures in the third picture set according to the sequence in the frame linked list, and outputting the clipped video according to the preset target video duration and frame rate.
The method is suitable for automatic editing of OTA short videos, reduces labor cost and improves video production efficiency compared with traditional manual editing.
In a preferred embodiment, in step S110, a first frame number of the original video is obtained according to the duration and the frame rate of the original video, and the number of pictures in the first picture set is equal to the first frame number.
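For illustration only (the patent specifies no implementation), step S110 and this frame-count relation can be sketched in Python with OpenCV; cv2 and the list-based stand-in for the frame linked list are assumptions of this sketch:

```python
import cv2

def extract_frames(video_path):
    """Step S110 sketch: extract every frame of the original video as a
    'first picture', preserving original order (the frame linked list)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []  # an ordered list stands in for the frame linked list
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # First frame number = duration x frame rate; e.g. a 2-minute video at
    # 24 frames/second yields 2 * 60 * 24 = 2880 first pictures.
    return frames, fps
```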
In a preferred embodiment, step S120 includes the steps of:
s121, obtaining a first area containing the watermark and a second area containing the subtitle in the picture through the existing pattern recognition model, and obtaining a first range where the first area is located and a second range where the second area is located.
And S122, merging the first range and the second range of all the pictures in the first picture set to obtain an avoidance area.
And S123, establishing a shearing area avoiding the avoidance area in the range of the picture.
And S124, cutting all the first pictures according to the cutting areas to obtain second pictures.
And S125, arranging the second pictures according to the frame linked list to obtain a second picture set. By obtaining the watermarks and subtitles of all pictures in the video, all of their positions within the picture are collected, and the avoidance region is obtained by merging the projections of all these positions in the picture; the avoidance region in the invention is thus the set of watermark and subtitle positions that might interfere with subsequent scene recognition and the like.
In a preferred embodiment, in step S121, a plane coordinate system is established in each picture in the first picture set, and a first coordinate range in which the first area is located and a second coordinate range in which the second area is located are obtained.
In step S122, the first coordinate ranges and the second coordinate ranges of all the pictures in the first picture set are merged to obtain the coordinate range of the avoidance region.
In step S123, based on the range of the first picture, a first cropping frame with a maximum area and avoiding the coordinate range of the avoidance region is formed, a first length of the first cropping frame along the lateral direction of the picture is greater than a second length of the first cropping frame along the longitudinal direction of the picture, a second cropping frame with a maximum area and avoiding the coordinate range of the avoidance region is formed, a first length of the second cropping frame along the lateral direction of the picture is less than a second length of the second cropping frame along the longitudinal direction of the picture, and one of the first cropping frame and the second cropping frame with the maximum area is selected as the cropping region.
In step S124, all the first pictures are cropped through the coordinate range of the cropping region to obtain the second pictures. By simulating the two cropping frames and comparing them, it is determined which one obtains the larger area, and all pictures are cropped with the larger cropping frame.
In a preferred embodiment, the first range, the second range, the first cropping frame and the second cropping frame are all rectangular areas, but not limited thereto.
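For illustration, steps S122 to S124 can be sketched in Python under the rectangle assumption above, with ranges given as (x0, y0, x1, y1) tuples. The sketch simplifies the two candidate cropping frames to a full-width horizontal band and a full-height vertical band that avoid every marker rectangle; the patent's maximum-area frames are more general, so this is an approximation, not the claimed construction:

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed rectangular

def largest_clear_band(blocked, size):
    """Largest interval within [0, size) avoiding every blocked interval."""
    best, cur = (0, 0), 0
    for lo, hi in sorted(blocked) + [(size, size)]:
        if lo - cur > best[1] - best[0]:
            best = (cur, lo)
        cur = max(cur, hi)
    return best  # may be empty if markers cover the whole axis

def cropping_region(avoid: List[Rect], width: int, height: int) -> Rect:
    """Steps S122-S123 sketch: merge marker ranges, then pick the larger of
    a landscape and a portrait cropping frame avoiding all of them."""
    # First cropping frame: full width, tallest horizontal band clear of markers.
    y0, y1 = largest_clear_band([(r[1], r[3]) for r in avoid], height)
    landscape = (0, y0, width, y1)
    # Second cropping frame: full height, widest vertical band clear of markers.
    x0, x1 = largest_clear_band([(r[0], r[2]) for r in avoid], width)
    portrait = (x0, 0, x1, height)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return max(landscape, portrait, key=area)  # step S124 crops with this box
```

Step S124 then reduces to slicing each frame array, e.g. frame[y0:y1, x0:x1].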
In a preferred embodiment, step S130 further comprises the steps of:
and obtaining the color parameters of each second picture in the second picture set, sequentially adding the same shot label to consecutive second pictures whose color-parameter difference is less than or equal to a preset threshold, and starting a shot label with an updated number when a second picture whose color-parameter difference is greater than the preset threshold is encountered. For example, the color parameter is at least one of a brightness parameter, a histogram, an RGB value, an HSV value, and an HSL value, but is not limited thereto.
In the invention, whether two frames belong to the same shot is judged from the color change between adjacent frames. If the color change between adjacent frames is very small, the lighting has not changed noticeably, indicating that the two frames belong to the same shot; for example, the two pictures are consecutive frames within the first shot. However, if the color change between adjacent frames exceeds the preset threshold, the lighting has changed significantly, indicating a transition or a cut; the two frames then belong to different shots, which is recorded by marking the next frame with a different shot code. For example, the previous frame belongs to the first shot and the subsequent frame belongs to the second shot.
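A minimal sketch of this adjacent-frame comparison, assuming OpenCV and NumPy; the mean absolute HSV difference and the threshold value are illustrative choices, since the patent leaves the color parameter and threshold configurable:

```python
import cv2
import numpy as np

def add_shot_labels(pictures, threshold=30.0):
    """Assign a shot label to every picture; start a new shot whenever the
    mean HSV difference between adjacent frames exceeds the threshold."""
    labels, shot_id, prev = [], 0, None
    for pic in pictures:
        hsv = cv2.cvtColor(pic, cv2.COLOR_BGR2HSV).astype(np.float32)
        if prev is not None and np.abs(hsv - prev).mean() > threshold:
            shot_id += 1  # significant lighting change: transition or cut
        labels.append(shot_id)
        prev = hsv
    return labels
```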
In a preferred embodiment, a scene classification model with multiple classes of scene identification codes is used to classify each second picture in the second picture set, and the scene identification code corresponding to the classification result is added to each second picture. In the invention, each second picture in the second picture set is classified through a preset scene classification model with 26 classes of scene identification codes; for example, a picture may belong to class 5 (a guest room) or class 9 (the exterior of a hotel).
In a preferred embodiment, step S130 further comprises the steps of: performing cluster statistics on the scene identification codes in all second pictures corresponding to each shot label, and updating the scene identification codes of all second pictures under each shot label to the scene identification code with the highest occurrence frequency within that shot. Since people easily walk through the scene while a room is being shot, the pictures corresponding to the frames in which a person passes by are prone to misidentified scene identification codes; the scene identification codes of all pictures in a shot therefore need to be corrected according to the dominant scene identification code of the whole shot.
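The cluster statistics amount to a per-shot majority vote over scene identification codes. A minimal sketch, assuming the codes and shot labels are parallel per-frame lists produced by the (unspecified) scene classification model:

```python
from collections import Counter

def correct_scene_codes(scene_codes, shot_labels):
    """Replace each picture's scene identification code with the code that
    occurs most frequently within its shot (majority vote per shot)."""
    by_shot = {}
    for code, shot in zip(scene_codes, shot_labels):
        by_shot.setdefault(shot, []).append(code)
    majority = {s: Counter(c).most_common(1)[0][0] for s, c in by_shot.items()}
    return [majority[s] for s in shot_labels]
```

For the fig. 7 example below, a shot with eight e1 pictures and two e5 pictures comes out uniformly labeled e1.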
In a preferred embodiment, in step S140, the target frame number is obtained according to the preset target video duration and frame rate, that is, the product of the target video duration and the frame rate is used as the target frame number.
In a preferred embodiment, step S160 includes using a target frame selection box for each target scene label in the third picture set, where the frame count of the target frame selection box is the product of the number of third pictures corresponding to that target scene label and the retention ratio. The target frame selection box moves over the third pictures corresponding to each target scene label in time order, one frame per step, and a video quality value is obtained through the video quality model after each movement; finally the third pictures not selected by the target frame selection box with the highest video quality value are deleted.
Fig. 2 to 9 are process diagrams of the scene recognition-based video clipping method of the present invention. The following describes the implementation of the present invention in detail by means of fig. 2 to 9.
The embodiment needs to automatically clip a video with a duration of 2 minutes and a frame rate of 24 frames/second into a clipped video with a duration of 5 seconds and a frame rate of 120 frames/second.
As shown in fig. 2, each frame of the original video is extracted as a first picture, a first picture set is generated, each first picture in the first picture set is arranged according to the sequence of the pictures in the original video, and a frame linked list is formed. The number of pictures in the first picture set is 2 × 60 × 24 = 2880.
As shown in figs. 3 to 4, a first region containing a watermark and a second region containing a subtitle are obtained in each picture through the existing pattern recognition model, and the first range in which the first region lies and the second range in which the second region lies are obtained; a planar coordinate system is established in each picture of the first picture set, giving a first coordinate range of the first region and a second coordinate range of the second region. For example, a first coordinate range F1 of a watermark is obtained in the first frame P1, a first coordinate range F8 of a watermark is obtained in the eighth frame P8, together with a second coordinate range F9 derived from a subtitle, and so on. The avoidance region is obtained by merging the first ranges and second ranges of all pictures in the first picture set; it comprises the union of area A (corresponding to the first coordinate range F1), area B (corresponding to the first coordinate range F8) and area C (corresponding to the second coordinate range F9), which gives the coordinate range of the avoidance region.
As shown in figs. 5 to 6, based on the range of the first picture, a first cropping frame Q1 of maximum area avoiding the coordinate range of the avoidance region is formed, whose first length along the horizontal direction of the picture is greater than its second length along the longitudinal direction; a second cropping frame Q2 of maximum area avoiding the coordinate range of the avoidance region is formed, whose first length along the horizontal direction of the picture is less than its second length along the longitudinal direction. The area occupied by the first cropping frame Q1 is compared with the area occupied by the second cropping frame Q2; by simulating the two cropping frames, it is determined which one obtains the larger area, and that one is selected as the cropping region. All first pictures are cropped with the larger cropping frame to obtain the second pictures, and all second pictures are arranged according to the time order of their corresponding frames to form the second picture set.
The HSV value of each second picture in the second picture set is obtained; the same shot label is sequentially added to consecutive second pictures whose HSV difference is less than or equal to a preset threshold, and a shot label with an updated number is started when a second picture whose HSV difference is greater than the preset threshold is encountered. In the invention, whether two frames belong to the same shot is judged from the change of the HSV value between adjacent frames: if the change is very small, the lighting has not changed noticeably, indicating that the two frames belong to the same shot, for example consecutive frames within the first shot. However, if the change of the HSV value between adjacent frames exceeds the preset threshold, the lighting has changed significantly; the two frames then belong to different shots and are distinguished by marking the next frame with a different shot code. For example, the previous frame belongs to the first shot and the subsequent frame belongs to the second shot. A scene classification model with multiple classes of scene identification codes is then used to classify each second picture in the second picture set, and the scene identification code corresponding to the classification result is added to each second picture.
Referring to fig. 7, cluster statistics are performed on the scene identification codes in all second pictures corresponding to each shot label, and the scene identification codes of all second pictures under each shot label are updated to the scene identification code with the highest occurrence frequency within that shot. For example, among the 10 pictures of the first shot I1, 8 carry the scene identification code e1 of a room but 2 carry the scene identification code e5 of a person, because people easily walk in front of the camera when a room is being shot, and the pictures of the frames in which a person passes are prone to misidentified scene identification codes. The scene identification codes of all pictures in the shot are therefore corrected according to the dominant scene identification code of the whole shot: clustering the scene identification codes of the 10 pictures in shot I1 shows that the room code e1 occurs most frequently, so all 10 pictures of shot I1 are uniformly marked with scene identification code e1 (that is, any scene identification code among the 10 pictures that is not e1 is replaced by e1), yielding the corrected shot I1' and avoiding interference from incorrect labels.
And selecting all second pictures with target scene labels as third pictures to generate a third picture set, wherein the target scene labels are e1 (representing rooms) and e2 (representing hotel lobbies), and comparing the total number of the third pictures with the target frame number.
And when the total number of the third pictures is equal to the target frame number, executing a subsequent merging and clipping step.
And when the total number of the third pictures is less than the target frame number, sequentially adding the second pictures corresponding to the scene identification codes of one class into the third picture set from the unselected second pictures according to the sequence of the preset scene identification code priority sequence, and executing the subsequent merging and editing step.
For example, if the order of the scene identification codes in the priority sequence is e1 (a room), e2 (a hotel lobby), e3 (a hotel restaurant), e4 (a hotel swimming pool), and so on, then when the total number of third pictures with e1 (a room) and e2 (a hotel lobby) is less than the target frame number, the pictures with e3 (a hotel restaurant) are preferentially added to the third picture set; if the total number of third pictures after adding the e3 pictures is still less than the target frame number, the pictures with e4 (a hotel swimming pool) are added to the third picture set, and so on until the target frame number is met.
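A sketch of this supplementation step; the index-based bookkeeping and the parameter names are assumptions of the illustration, not terms from the patent:

```python
def pad_by_priority(selected, scene_codes, target_scenes, priority, target_frames):
    """Add whole scene classes, in priority order, until the selection
    reaches the target frame number."""
    chosen = list(selected)  # indices of already-selected third pictures
    for scene in priority:
        if len(chosen) >= target_frames:
            break
        if scene in target_scenes:
            continue  # this class was already selected as a target scene
        chosen += [i for i, c in enumerate(scene_codes) if c == scene]
    return sorted(chosen)  # restore frame-linked-list order
```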
When the total number of third pictures is greater than the target frame number, a retention ratio is obtained according to the ratio of the third picture set to the target frame number, and part of the third pictures are deleted, according to the retention ratio, from the second pictures corresponding to each target scene label in the third picture set. A target frame selection box is used for each target scene label in the third picture set; its frame count is the product of the number of third pictures corresponding to that target scene label and the retention ratio. The target frame selection box moves over the third pictures corresponding to each target scene label in time order, one frame per step, and the video quality value of the selected consecutive frames after each movement is obtained through a video quality model. Referring to fig. 8, suppose that 10 pictures need to be clipped from W1 to W25 and the length of the target frame selection box H1 is 10 consecutive frames. The target frame selection box H1 moves backwards from picture W1, one frame at a time, giving 25 - 10 + 1 = 16 positions in sequence. The video quality values of the 16 consecutive-frame sets are ranked; the video quality value L14 of the target frame selection box H14 is the highest, so the consecutive frames W14 to W23 selected by H14 are kept as the pictures to be retained, the remaining 15 frames (W1 to W13 and W24 to W25) are deleted from the third picture set, and the subsequent merging and clipping step is executed.
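The moving target frame selection box is an exhaustive sliding-window search. A sketch, with an assumed `video_quality` callable standing in for the patent's video quality model:

```python
def best_window(frames, keep, video_quality):
    """Slide a window of `keep` frames one frame per step and return the
    contiguous run that the video quality model scores highest."""
    best_start, best_score = 0, float("-inf")
    for start in range(len(frames) - keep + 1):  # e.g. 25 - 10 + 1 = 16 positions
        score = video_quality(frames[start:start + keep])
        if score > best_score:
            best_start, best_score = start, score
    return frames[best_start:best_start + keep]
```

Here `keep` is the frame count of the selection box, i.e. the number of third pictures for the scene label multiplied by the retention ratio.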
Finally, referring to fig. 9, merging and clipping are performed: all third pictures in the third picture set are synthesized according to the order in the frame linked list. The pictures W1 to Wm and the pictures Z1 to Zo with the target scene label e1 (a room) are clipped; their total number is 600, and after merging and clipping they generate a clipped video with a duration of 5 seconds and a frame rate of 120 frames/second. Details are not repeated here.
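The merging step itself can be illustrated with OpenCV's VideoWriter; the mp4v codec and the output settings are assumptions of the sketch:

```python
import cv2

def synthesize(pictures, out_path, fps=120.0):
    """Write the retained third pictures, already in frame-linked-list
    order, to the clipped video at the target frame rate."""
    h, w = pictures[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for pic in pictures:
        writer.write(pic)
    writer.release()  # e.g. 600 pictures at 120 fps give the 5-second clip
```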
Through the above method, the invention can automatically and efficiently produce a clipped video G of the original video. In a preferred example, the clipped video G can serve as a thumbnail of the original video or as a short guide video, allowing users to quickly browse the highlights of the original video.
The video editing method based on scene recognition of the invention realizes highlight clipping of OTA videos based on the scene recognition method: a target scene can be quickly segmented and extracted, highlight clipping by sub-scene is realized, and a short highlight video is finally generated from the original video; with the short video replacing the picture-based display of selling points and automatic batch clipping replacing manual video synthesis, operation cost is greatly saved and operation efficiency is effectively improved.
FIG. 10 is a block diagram of a scene recognition based video clip system of the present invention. As shown in fig. 10, an embodiment of the present invention further provides a video clipping system based on scene recognition, for implementing the above-mentioned video clipping method based on scene recognition, where the video clipping system 9 based on scene recognition includes:
the video extraction module 91 extracts each frame of the original video as a first picture to produce a first picture set, wherein each first picture in the first picture set is arranged according to the sequence of the pictures in the original video and forms a frame linked list.
And the picture cutting module 92 is used for removing the identifier from the pictures in the first picture set by cutting to obtain a second picture to generate a second picture set, wherein the identifier comprises a watermark and/or a subtitle.
And the tag adding module 93 is used for sequentially adding a lens tag to each second picture in the second picture set according to the lens identification model and adding a scene tag according to the scene identification model.
The video clipping module 94 clips the third pictures in the second picture set according to the target frame number, the at least one target scene tag, and the preset scene priority sequence, and then outputs the third pictures in sequence to obtain a third picture set. And selecting all second pictures with the target scene labels as third pictures to generate a third picture set, and comparing the total number of the third pictures with the target frame number. When the total number of the third pictures is less than the target frame number, the supplementary picture module 95 is executed. When the total number of the third pictures is greater than the target frame number, the pruned pictures module 96 is executed. When the total number of the third pictures is equal to the target frame number, the video composition module 97 is executed.
The supplementary picture module 95 sequentially adds the second pictures corresponding to the scene identifiers of one category to the third picture set according to the order of the priority sequence of the preset scene identifiers from the unselected second pictures, and executes the video synthesis module 97.
The delete picture module 96 includes a module for obtaining a retention ratio according to a ratio of the third picture set to the target frame number, deleting a portion of the third picture from the second picture corresponding to each target scene tag in the third picture set according to the retention ratio, and executing a video composition module 97.
And the video synthesis module 97 synthesizes all the third pictures in the third picture set according to the sequence in the frame linked list, and outputs the clipped video according to the preset target video duration and frame rate.
According to the invention, by integrating the models developed in the image field and adding a dynamic picture selection strategy and an edge detection algorithm, problems in the video field are solved; the cooperation of the sub-modules completes scene segmentation, scene recognition and highlight clipping of videos, forms the capability of generating videos in batches, and can greatly reduce operation and maintenance cost. The scene selection and the clip duration are configurable, giving strong extensibility. The operating efficiency of services with video requirements in OTA scenarios is effectively improved.
The embodiment of the invention also provides a video editing device based on scene recognition, which comprises a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to perform the steps of the video clipping method based on scene recognition via execution of the executable instructions.
As shown above, this embodiment can realize highlight clipping of the OTA video based on the scene recognition method: it can quickly segment and extract the target scene, realize highlight clipping by sub-scene, and finally generate a short highlight video from the original video, with the short video replacing the original picture-based display of selling points and automatic batch clipping replacing manual video synthesis, which greatly saves operation cost and effectively improves operation efficiency.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "platform."
Fig. 11 is a schematic structural diagram of a video editing device based on scene recognition according to the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 11. The electronic device 600 shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 11, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including the memory unit 620 and the processing unit 610), a display unit 640, etc.
Wherein the storage unit stores program code executable by the processing unit 610, so that the processing unit 610 performs the steps of the video clipping method based on scene recognition described above in this specification. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile storage units, such as a random access memory unit (RAM) 6201 and/or a cache storage unit 6202, and may further include a read-only memory unit (ROM) 6203.
The memory unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. The network adapter 660 may communicate with the other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms, to name a few.
Embodiments of the present invention further provide a computer-readable storage medium for storing a program which, when executed, implements the steps of the video editing method based on scene recognition. In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code; when the program product is run on a terminal device, the program code causes the terminal device to perform the steps of the video clipping method based on scene recognition described above in this specification.
As shown above, this embodiment can realize highlight clipping of the OTA video based on the scene recognition method: it can quickly segment and extract the target scene, realize highlight clipping by sub-scene, and finally generate a short highlight video from the original video, with the short video replacing the original picture-based display of selling points and automatic batch clipping replacing manual video synthesis, which greatly saves operation cost and effectively improves operation efficiency.
Fig. 12 is a schematic structural diagram of a computer-readable storage medium of the present invention. Referring to fig. 12, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" programming language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In summary, the video editing method, system, device and storage medium based on scene recognition of the present invention realize highlight clipping of OTA videos based on the scene recognition method: a target scene can be quickly segmented and extracted, highlight clipping by sub-scene is realized, and a short highlight video is finally generated from the original video to replace the original picture-based display of selling points; by automatically clipping videos in batches instead of synthesizing them manually, operation cost is greatly saved and operation efficiency is effectively improved.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention is not to be considered limited to these descriptions. For those skilled in the art to which the invention pertains, numerous simple deductions or substitutions may be made without departing from the spirit of the invention, and these shall be deemed to fall within the scope of the invention.

Claims (13)

1. A video clipping method based on scene recognition is characterized by comprising the following steps:
S110, extracting each frame of an original video as a first picture, generating a first picture set, arranging each first picture in the first picture set according to the sequence of the pictures in the original video, and forming a frame linked list;
S120, cropping markers from the pictures in the first picture set to obtain second pictures, and generating a second picture set, wherein the markers comprise a watermark and/or a subtitle;
S130, sequentially adding a shot label to each second picture in the second picture set according to a shot recognition model and adding a scene label according to a scene recognition model, wherein color parameters of each second picture in the second picture set are obtained, the same shot label is sequentially added to consecutive second pictures whose color-parameter difference is less than or equal to a preset threshold, a shot label with an updated number is started when a second picture whose color-parameter difference is greater than the preset threshold is encountered, cluster statistics are performed on the scene identification codes in all second pictures corresponding to each shot label, and the scene identification codes of all second pictures under each shot label are updated to the scene identification code with the highest occurrence frequency within that shot label;
S140, clipping the second picture set according to a preset target frame number, at least one target scene label and a preset scene priority sequence, and outputting third pictures in sequence to obtain a third picture set; and
S170, synthesizing all third pictures in the third picture set according to the sequence in the frame linked list, and outputting a clipped video according to the preset target video duration and frame rate.
2. The method for video clipping based on scene recognition as claimed in claim 1, wherein in step S110, a first frame number of the original video is obtained according to the duration and the frame rate of the original video, and the number of pictures in the first picture set is equal to the first frame number.
3. The scene recognition based video clipping method according to claim 1, wherein said step S120 comprises the steps of:
S121, obtaining a first area containing the watermark and a second area containing the subtitle in the picture through pattern recognition, and obtaining a first range where the first area is located and a second range where the second area is located;
S122, merging the first range and the second range of all the pictures in the first picture set to obtain an avoidance area;
S123, establishing a cropping region avoiding the avoidance area within the range of the picture;
S124, cropping all the first pictures according to the cropping region to obtain second pictures;
and S125, arranging the second pictures according to the frame linked list to obtain a second picture set.
4. The scene recognition-based video clipping method according to claim 3, wherein in step S121, a planar coordinate system is established for each picture in the first picture set, and a first coordinate range where the first region is located and a second coordinate range where the second region is located are obtained;
in step S122, merging the first coordinate ranges and the second coordinate ranges of all the pictures in the first picture set to obtain the coordinate range of an avoidance region;
in step S123, based on the range of the first picture, a first cropping frame with a maximum area and avoiding the coordinate range of the avoidance region is formed, a first length of the first cropping frame in the lateral direction of the picture is greater than a second length of the first cropping frame in the longitudinal direction of the picture, a second cropping frame with a maximum area and avoiding the coordinate range of the avoidance region is formed, a first length of the second cropping frame in the lateral direction of the picture is less than a second length of the second cropping frame in the longitudinal direction of the picture, and one of the first cropping frame and the second cropping frame with the maximum area is selected as a cropping region;
in step S124, all the first pictures are clipped through the coordinate range of the clipping region to obtain a second picture.
5. The scene recognition based video clipping method of claim 3, wherein the first range, the second range, the first cropping frame, and the second cropping frame are all rectangular regions.
6. The scene recognition based video clipping method of claim 1, wherein the color parameter is at least one of a luminance parameter, a histogram, an RGB value, an HSV value, an HSL value.
7. The method for editing video based on scene recognition as claimed in claim 1, wherein a scene classification model with multiple classes of scene recognition codes is used to classify each second picture in the second picture set, and the scene recognition code corresponding to the classification result is added to each second picture.
8. The scene recognition-based video clipping method of claim 1, wherein in step S140, the target frame number is obtained from the preset target video duration and frame rate.
9. The scene recognition-based video clipping method of claim 1, wherein in step S140, all the second pictures carrying a target scene tag are selected as third pictures to generate a third picture set, and the total number of the third pictures is compared with the target frame number;
when the total number of the third pictures is less than the target frame number, step S150 is executed;
when the total number of the third pictures is greater than the target frame number, step S160 is executed;
when the total number of the third pictures is equal to the target frame number, step S170 is executed;
step S150 comprises sequentially adding, from the unselected second pictures, the second pictures corresponding to each class of scene identification code to the third picture set in the order of the preset scene-identification-code priority sequence;
step S160 comprises obtaining a retention ratio from the ratio of the target frame number to the total number of pictures in the third picture set, and deleting a portion of the third pictures from those corresponding to each target scene tag in the third picture set according to the retention ratio.
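
A minimal Python sketch of the claim-9 branch logic follows (select, then pad in priority order or thin by the retention ratio). All helper names are hypothetical, the priority sequence is assumed to cover every scene tag, and the uniform-stride thinning is a placeholder for the quality-driven trim that claim 10 refines.

```python
def edit_by_scene(scene_of, target_tags, priority, target_frames):
    """scene_of[i] is the scene tag of the i-th second picture; returns
    the indices of the third pictures (a sketch of steps S140-S160)."""
    third = [i for i, tag in enumerate(scene_of) if tag in target_tags]
    if len(third) < target_frames:                    # S150: pad by priority
        chosen = set(third)
        rest = sorted((i for i in range(len(scene_of)) if i not in chosen),
                      key=lambda i: priority.index(scene_of[i]))
        third = sorted(third + rest[:target_frames - len(third)])
    elif len(third) > target_frames:                  # S160: thin per tag
        keep_ratio = target_frames / len(third)       # the retention ratio
        kept = []
        for tag in target_tags:
            frames = [i for i in third if scene_of[i] == tag]
            n_keep = max(1, round(len(frames) * keep_ratio))
            stride = max(1, len(frames) // n_keep)
            kept += frames[::stride][:n_keep]
        third = sorted(kept)
    return third                                      # S170 synthesizes these
```
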
10. The scene recognition-based video clipping method of claim 9, wherein step S160 comprises: for each target scene tag in the third picture set, establishing a selection frame whose frame count is the product of the number of third pictures corresponding to that target scene tag and the retention ratio; moving the selection frame over the third pictures corresponding to that target scene tag, in their temporal order, with a step length of one frame; obtaining a video quality value after each movement through a video quality model; and finally deleting the third pictures not covered by the selection frame position with the highest video quality value.
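
The claim-10 trim amounts to a sliding-window search. Below is a sketch in which `quality_of_clip` stands in for the patent's video quality model, whose internals the claim does not specify; everything else follows the recited one-frame step length.

```python
def trim_by_quality(tag_frames, keep_ratio, quality_of_clip):
    """Slide a fixed-size selection frame over one tag's third pictures in
    temporal order and keep the highest-scoring window (claim 10 sketch)."""
    window = max(1, round(len(tag_frames) * keep_ratio))  # frames to retain
    best_start, best_score = 0, float("-inf")
    for start in range(len(tag_frames) - window + 1):
        score = quality_of_clip(tag_frames[start:start + window])
        if score > best_score:
            best_start, best_score = start, score
    # Pictures outside the best window are the ones deleted.
    return tag_frames[best_start:best_start + window]
```
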
11. A scene recognition-based video clipping system for implementing the scene recognition-based video clipping method of claim 1, comprising:
a video extraction module, used for extracting each frame of an original video as a first picture to generate a first picture set, wherein the first pictures in the first picture set are arranged according to their order in the original video and form a frame linked list;
a picture cropping module, used for cropping the pictures in the first picture set to remove the marker and obtain second pictures, and generating a second picture set, wherein the marker comprises a watermark and/or a subtitle;
a label adding module, used for sequentially adding a lens label to each second picture in the second picture set according to a lens identification model and adding a scene label according to the scene identification model; specifically, the color parameter of each second picture in the second picture set is obtained, consecutive second pictures whose color-parameter difference is less than or equal to a preset threshold are given the same lens label, a second picture whose color-parameter difference is greater than the preset threshold is given a lens label with an updated serial number, cluster statistics are performed on the scene identification codes of all the second pictures corresponding to each lens label, and the scene identification codes of all the second pictures carrying each lens label are updated to the class of scene identification code occurring most frequently under that lens label;
a video editing module, used for editing the second picture set according to the target frame number, at least one target scene label, and a preset scene priority sequence, and sequentially outputting third pictures to obtain a third picture set; and
a video synthesis module, used for synthesizing all the third pictures in the third picture set according to their order in the frame linked list and outputting a clipped video according to the preset target video duration and frame rate.
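
The cluster statistics in the label adding module reduce to a per-lens-label majority vote over scene identification codes. A minimal sketch, assuming the labels and codes are held in parallel lists:

```python
from collections import Counter

def smooth_scene_codes(lens_labels, scene_codes):
    """Within each lens label, overwrite every picture's scene
    identification code with the most frequent code for that label."""
    by_lens = {}
    for lens, code in zip(lens_labels, scene_codes):
        by_lens.setdefault(lens, []).append(code)
    majority = {lens: Counter(codes).most_common(1)[0][0]
                for lens, codes in by_lens.items()}
    return [majority[lens] for lens in lens_labels]
```
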
12. A scene recognition-based video clipping device, comprising:
a processor;
a memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the scene recognition-based video clipping method of any one of claims 1 to 10 via execution of the executable instructions.
13. A computer-readable storage medium storing a program which, when executed, performs the steps of the scene recognition-based video clipping method of any one of claims 1 to 10.
CN202010773076.3A 2020-08-04 2020-08-04 Video editing method, system, device and storage medium based on scene recognition Active CN111901536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010773076.3A CN111901536B (en) 2020-08-04 2020-08-04 Video editing method, system, device and storage medium based on scene recognition


Publications (2)

Publication Number Publication Date
CN111901536A (en) 2020-11-06
CN111901536B (en) 2023-03-24

Family

ID=73183350

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010773076.3A Active CN111901536B (en) 2020-08-04 2020-08-04 Video editing method, system, device and storage medium based on scene recognition

Country Status (1)

Country Link
CN (1) CN111901536B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112804586B (en) * 2021-04-13 2021-07-16 北京世纪好未来教育科技有限公司 Method, device and equipment for acquiring video clip
CN115690662B (en) * 2022-11-11 2024-03-08 百度时代网络技术(北京)有限公司 Video material generation method and device, electronic equipment and storage medium
CN115883818B (en) * 2022-11-29 2023-09-19 北京优酷科技有限公司 Video frame number automatic counting method and device, electronic equipment and storage medium
CN117156078B (en) * 2023-11-01 2024-02-02 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740425A (en) * 2018-11-23 2019-05-10 上海扩博智能技术有限公司 Image labeling method, system, equipment and storage medium based on augmented reality
CN109784252A (en) * 2019-01-04 2019-05-21 Oppo广东移动通信有限公司 Image processing method, device, storage medium and electronic equipment
CN110147722A (en) * 2019-04-11 2019-08-20 平安科技(深圳)有限公司 A kind of method for processing video frequency, video process apparatus and terminal device
CN111416950B (en) * 2020-03-26 2023-11-28 腾讯科技(深圳)有限公司 Video processing method and device, storage medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN111901536B (en) Video editing method, system, device and storage medium based on scene recognition
CN113709561B (en) Video editing method, device, equipment and storage medium
KR101289085B1 (en) Images searching system based on object and method thereof
JP5420199B2 (en) Video analysis device, video analysis method, digest automatic creation system and highlight automatic extraction system
CN106162223B (en) News video segmentation method and device
JP6240199B2 (en) Method and apparatus for identifying object in image
WO2021003825A1 (en) Video shot cutting method and apparatus, and computer device
EP3297272A1 (en) Method and system for video indexing and video synopsis
KR101611895B1 (en) Apparatus and Method of Automatic Text Design based on Emotion
US20200366965A1 (en) Method of displaying comment information, computing device, and readable storage medium
CN110019880A (en) Video clipping method and device
CN111553923B (en) Image processing method, electronic equipment and computer readable storage medium
CN111935528A (en) Video generation method and device
CN109964254B (en) Method and system for rendering objects in virtual views
KR20150112535A (en) Representative image managing apparatus and method
CN110856039A (en) Video processing method and device and storage medium
CN113411550B (en) Video coloring method, device, equipment and storage medium
CN112562056A (en) Control method, device, medium and equipment for virtual light in virtual studio
CN102098449A (en) Method for realizing automatic inside segmentation of TV programs by utilizing mark detection
CN111428590B (en) Video clustering segmentation method and system
JP7111873B2 (en) SIGNAL LAMP IDENTIFICATION METHOD, APPARATUS, DEVICE, STORAGE MEDIUM AND PROGRAM
CN114466222A (en) Video synthesis method and device, electronic equipment and storage medium
CN115063800B (en) Text recognition method and electronic equipment
CN116166125A (en) Avatar construction method, apparatus, device and storage medium
CN115660752A (en) Display screen display content configuration method, system, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant