CN113613065B - Video editing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113613065B
Authority
CN
China
Prior art keywords
video
target
shot
determining
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110883507.6A
Other languages
Chinese (zh)
Other versions
CN113613065A (en)
Inventor
马彩虹
叶芷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110883507.6A
Publication of CN113613065A
Priority to PCT/CN2022/104122 (WO2023011094A1)
Application granted
Publication of CN113613065B
Legal status: Active (current)
Anticipated expiration: not listed

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N21/4402: Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440227: Reformatting operations of video signals by decomposing into layers, e.g. base layer and one or more enhancement layers
    • H04N21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84: Generation or processing of descriptive data, e.g. content descriptors
    • H04N21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456: Structuring of content by decomposing the content in the time domain, e.g. in time segments

Abstract

The disclosure provides a video editing method and apparatus, an electronic device and a storage medium, and relates to the technical field of artificial intelligence, in particular to deep learning and video analysis. The specific implementation scheme is as follows: each event scene included in a feature video is classified according to first frame information corresponding to a respective first partial frame of the event scene, so as to obtain a scene classification result; when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment split point, the feature video is split into at least one video segment according to the start time information of the target event scene, each video segment comprising at least one event scene; and a video editing operation is performed based on the at least one video segment.

Description

Video editing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to the fields of deep learning and video analysis, and specifically to a video editing method and apparatus, an electronic device, and a storage medium.
Background
Video technology generally refers to the technology of capturing, recording, processing, storing, transmitting and reproducing a series of still images as electrical signals. In the related art, editing operations such as segmentation, classification, cataloguing and indexing can be performed on video data according to certain standards and rules.
Disclosure of Invention
The disclosure provides a video editing method, a video editing device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a video editing method including: classifying each event scene according to first frame information corresponding to a respective first partial frame of at least one event scene included in a feature video, to obtain a scene classification result; when the scene classification result indicates that a target event scene corresponding to the scene classification result is a segment split point, splitting the feature video into at least one video segment according to the start time information of the target event scene, wherein each video segment comprises at least one event scene; and performing a video editing operation based on the at least one video segment.
According to another aspect of the present disclosure, there is provided a video editing apparatus including: a first processing module, configured to classify each event scene according to first frame information corresponding to a respective first partial frame of at least one event scene included in a feature video, to obtain a scene classification result; a first splitting module, configured to split the feature video into at least one video segment according to start time information of a target event scene when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment split point, wherein each video segment comprises at least one event scene; and a video editing module, configured to perform a video editing operation based on the at least one video segment.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically illustrates an exemplary system architecture to which the video editing method and apparatus may be applied, according to an embodiment of the present disclosure;
fig. 2 schematically shows a flow chart of a video editing method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a video hierarchy diagram according to an embodiment of the disclosure;
FIG. 4 schematically illustrates an example diagram of partitioning video segments according to event scenes according to an embodiment of the present disclosure;
fig. 5 schematically shows a flow diagram of a video editing method according to an embodiment of the present disclosure;
fig. 6 schematically shows a block diagram of a video editing apparatus according to an embodiment of the present disclosure; and
FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage and application of the personal information of the users involved all comply with the provisions of the relevant laws and regulations, necessary security measures have been taken, and public order and good morals are not violated.
Video editing technology can use information such as images, audio and subtitles in a video to analyze, summarize and record video data based on its content and formal characteristics, and to organize and produce various retrieval directories or retrieval paths, covering requirements such as video labeling and video stripping. For example, during editing, the video may be subjected to timeline dotting at the program layer, segment layer, scene layer and shot layer, video keyword tag determination, video content description, video title description and the like.
In the course of realizing the concept of the present disclosure, the inventors found that video labeling, video stripping and video description all need to be combined with the contextual structure information of the video in order to extract high-level semantic information. Traditional processing has relied mainly on manual labor. At present, the amount of multimedia data is increasing sharply, and to keep manual processing speed in step with the growth of media data, ever more manpower is consumed on video editing to complete the editing and warehousing of videos.
Some other schemes for implementing video editing include, for example: (1) shot detection technology assisting manual video splitting: a video editing tool uses inter-frame difference information to achieve shot-layer segmentation of the video, or computer vision technology uses face detection to achieve shot splitting; (2) video description technology based on machine learning, for example video-caption technology, which uses video image and audio information to produce a simple scene description such as "someone does something in a certain space"; and (3) intelligent video labeling technology based on machine learning, such as image classification, image detection and video classification.
In the course of realizing the concept of the present disclosure, the inventors also found that scheme (1) can only achieve shot-granularity splitting from single-frame or short-time-sequence image information and cannot achieve higher-level semantic aggregation; for division at higher semantic levels (such as the scene level and the segment level), manual assistance is still needed. The AI model of scheme (2) needs a large amount of manual labeling for model training, and the model describes scenes in a way that is too simple and rigid to meet the requirements of practical applications. Because the temporal redundancy of video is high, especially for long dramas, scheme (3) processes all key frames indiscriminately and its processing efficiency is therefore low.
On this basis, the present disclosure introduces automatic editing technology and meets the requirements of automatic editing through a machine-learning-based multi-modal system. It mainly addresses the parts of video editing that currently depend heavily on human understanding: intelligent labeling, automatic stripping, video description generation and the like.
Fig. 1 schematically shows an exemplary system architecture to which the video editing method and apparatus may be applied, according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the video editing method and apparatus may be applied may include a terminal device, but the terminal device may implement the video editing method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a knowledge reading application, a web browser application, a search application, an instant messaging tool, a mailbox client, and/or social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having display screens and supporting video browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for content browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as a user request, and feed back a processing result (e.g., a webpage, information, or data obtained or generated according to the user request) to the terminal device. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system that remedies the defects of high management difficulty and weak service extensibility of conventional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that the video editing method provided by the embodiment of the present disclosure may be generally executed by the terminal device 101, 102, or 103. Accordingly, the video editing apparatus provided by the embodiment of the present disclosure may also be provided in the terminal device 101, 102, or 103.
Alternatively, the video editing method provided by the embodiment of the present disclosure may also be generally performed by the server 105. Accordingly, the video editing apparatus provided by the embodiment of the present disclosure may be generally disposed in the server 105. The video editing method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the video editing apparatus provided in the embodiment of the present disclosure may also be disposed in a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, when a video needs to be edited, the terminal devices 101, 102 and 103 may classify each event scene according to first frame information corresponding to a respective first partial frame of at least one event scene included in a feature video, so as to obtain a scene classification result. Then, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment split point, the feature video is split into at least one video segment according to the start time information of the target event scene, wherein each video segment comprises at least one event scene. A video editing operation is then performed based on the at least one video segment. Alternatively, a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 may process the feature video according to the scene classification results of the event scenes in the feature video and implement the video editing operation.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flow chart of a video editing method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S230.
In operation S210, each event scene is classified according to first frame information corresponding to a respective first partial frame of at least one event scene included in the feature video, so as to obtain a scene classification result.
In operation S220, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment split point, the feature video is split into at least one video segment according to the start time information of the target event scene. Each video segment comprises at least one event scene.
In operation S230, a video editing operation is performed based on at least one video clip.
According to the embodiment of the disclosure, a video can be divided into four levels from small to large according to the splitting granularity: the shot layer, the scene layer, the segment layer and the program layer. The shot layer refers to a continuous picture captured by the same camera in a single take, i.e., a shot picture. The scene layer refers to a continuous piece of video whose scene is unchanged, consisting of one or more shots in the same time and space. The segment layer consists of one or more associated event scenes. The program layer is typically the complete input segment of video and audio data.
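As a reading aid only, the following Python sketch models the four levels as nested data structures; the class and field names (Program, Segment, Scene, Shot, start/end times in seconds) are illustrative assumptions and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:            # shot layer: one continuous take by a single camera
    start: float
    end: float

@dataclass
class Scene:           # scene layer: one or more shots in the same time and space
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Segment:         # segment layer: one or more semantically related event scenes
    scenes: List[Scene] = field(default_factory=list)

@dataclass
class Program:         # program layer: the complete input video/audio
    segments: List[Segment] = field(default_factory=list)

# Example: one program with two segments; the first holds four scenes, the second one scene.
program = Program(segments=[
    Segment(scenes=[Scene(shots=[Shot(0.0, 3.5)]) for _ in range(4)]),
    Segment(scenes=[Scene(shots=[Shot(0.0, 2.0)])]),
])
print(len(program.segments), len(program.segments[0].scenes))  # 2 4
```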
Fig. 3 schematically shows a video hierarchy diagram according to an embodiment of the present disclosure.
As shown in fig. 3, at the program layer, a feature video 300 includes two segments 310, 320. At the segment layer, four scenes 330, 340, 350, 360 are included in the segment 310, and the segment 320 corresponds to one scene 370. At the scene layer, each scene includes a plurality of shots; for example, four shots 331, 332, 333, 334 are included in the scene 330.
According to an embodiment of the present disclosure, segment splitting is used to aggregate semantically continuous scenes, and the scenes of one continuous event are merged into one segment. The criteria for splitting between segments may include at least one of the following: (1) the two segments belong to different times or spaces; (2) the two segments are not semantically related in terms of events. In the video, a segment split may present itself in at least one of the following ways: (1) the split position is at a scene picture transition; (2) the split position usually shows an obvious audio change, such as an obvious silent pause, a change of background music, a change of speech, or an abrupt change of background noise such as whistles or traffic noise; (3) the split position is often accompanied by obvious buffer video segments, such as special shots without prominent characters, e.g., scenery shots, black-screen shots, and fade-in/fade-out transition shots; (4) the story theme changes before and after the split.
According to an embodiment of the present disclosure, one or more event scenes may be included in a video. The first partial frame may be a subset of the frames corresponding to one event scene. The first frame information may include at least one of an image feature vector, an audio feature vector, a text feature vector and the like corresponding to the first partial frame, and at least one of an inter-scene image difference vector, an inter-scene audio difference vector, an inter-scene text difference vector and the like.
According to the embodiment of the disclosure, the classification of event scenes that produces the scene classification result can be completed by a boundary classification model at scene granularity. Given the event scenes already contained in the current feature video, a partial frame can be extracted from each event scene to obtain the first partial frames. By feeding the first partial frames corresponding to the event scenes into the boundary classification model as input features, a boundary classification result for each scene in the video can be obtained. The boundary classification result may be represented as 0 or 1, where 1 indicates that the target event scene to which the corresponding frame belongs is a boundary point at scene granularity, i.e., a segment split point, and 0 indicates a non-boundary point, i.e., a non-split point.
It should be noted that a segment split point determined at scene granularity is embodied as an event scene; when the feature video includes a plurality of event scenes, the feature video at scene granularity can be further divided into video segments at segment granularity by the event scenes determined as segment split points.
Fig. 4 schematically illustrates an example diagram of dividing video segments according to an event scene according to an embodiment of the present disclosure.
As shown in fig. 4, the feature video 400 includes 5 event scenes: 410, 420, 430, 440, 450. For each event scene, partial frames can be extracted, and the number of frames extracted for different event scenes may differ; for example, m frames 411, ..., 41m are extracted for the event scene 410, and n frames 451, ..., 45n are extracted for the event scene 450. At least one of the image feature vector, audio feature vector, text feature vector and the like corresponding to each extracted frame, together with at least one of the image difference vector, audio difference vector, text difference vector and the like between two adjacent frames, is then input into the boundary classification model as input features, and a result such as 00000000100000000100 may be output. Based on the output result, two frames characterizing segment split points can be determined. For each frame characterizing a segment split point, the target event scene corresponding to that frame can further be determined, e.g., 430 and 440 respectively. The feature video can thus be split at the start times of 430 and 440, yielding three video segments: 410 and 420 form a first video segment 460, 430 forms a second video segment 470, and 440 and 450 form a third video segment 480.
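A minimal Python sketch of this splitting step, assuming the boundary classification model has already produced a 0/1 flag for every sampled frame and that the mapping from sampled frames to scenes and the scene start times are known; the function names and the toy numbers are hypothetical.

```python
from typing import List, Tuple

def split_into_segments(
    scene_starts: List[float],        # start time (s) of each event scene, in order
    feature_end: float,               # end time (s) of the feature video
    frame_scene_ids: List[int],       # scene index each sampled frame belongs to
    boundary_flags: List[int],        # 0/1 output of the boundary classification model per sampled frame
) -> List[Tuple[float, float]]:
    """Return (start, end) times of video segments at segment granularity."""
    # A scene is a segment split point if any of its sampled frames is flagged 1.
    split_scenes = sorted({s for s, flag in zip(frame_scene_ids, boundary_flags) if flag == 1})

    # Segment boundaries are the start times of the split-point scenes.
    cut_times = [scene_starts[s] for s in split_scenes if s != 0]

    starts = [scene_starts[0]] + cut_times
    ends = cut_times + [feature_end]
    return list(zip(starts, ends))

# Toy example mirroring fig. 4: 5 scenes, scenes 2 and 3 (430, 440) flagged as split points.
scene_starts = [0.0, 60.0, 150.0, 300.0, 420.0]
frame_scene_ids = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
boundary_flags  = [0, 0, 0, 0, 1, 0, 1, 0, 0, 0]
print(split_into_segments(scene_starts, 600.0, frame_scene_ids, boundary_flags))
# -> [(0.0, 150.0), (150.0, 300.0), (300.0, 600.0)]
```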
According to the embodiment of the disclosure, for both the feature film video and the divided video clips, corresponding commentary videos can be generated, and the commentary videos can include contents such as a text introduction, a voice description and a plurality of spliced shots for the feature film video or the video clips.
Through the embodiment of the disclosure, scene-layer segmentation information can be used to selectively sample partial frames, achieving efficient video segment editing with low resource consumption.
The method shown in fig. 2 is further described below with reference to specific embodiments.
According to an embodiment of the present disclosure, performing a video editing operation based on at least one video segment includes: determining a target video segment for which a commentary video is to be generated; determining the identification information of the target video segment according to the character features and text features in the target video segment; determining the summary information of the target video segment according to the text features of the target video segment; determining the target shots related to the summary information from the target video segment; determining the title information of the target video segment according to the summary information; and generating the commentary video according to the identification information, the summary information, the target shots and the title information.
According to the embodiment of the disclosure, the target video segment may be a video segment determined from the at least one video segment split from the feature video. The character features may include features characterized by the person information in the target video segment. The text features may include features characterized by at least one of the subtitle information, voice information and the like in the target video segment. The identification information may include the video name corresponding to the target video segment, such as the drama title of a film or TV drama segment. The summary information may represent a textual commentary on the target video segment. The target shots may be one or more shots in the target video segment. The title information may be a title or name redefined for the target video segment.
It should be noted that the target video segment may also be replaced by the feature video; that is, the above operations may be performed on the feature video to generate its corresponding commentary video.
According to the embodiment of the present disclosure, taking a film or TV drama segment as an example, determining the identification information of the target video segment according to the character features and text features in the target video segment may proceed as follows: according to the aggregated face detection result for the segment interval corresponding to the target video segment, the star names of the persons appearing in the interval are obtained; caption keywords are extracted to obtain the names of the drama characters in the interval; and by combining the star names with the drama character names, a film-and-TV knowledge graph is searched to obtain the drama title corresponding to the segment.
According to the embodiment of the present disclosure, when identification information needs to be determined for a plurality of segments within the same target video segment, the drama title of each segment may first be determined by the foregoing method. Because the drama titles recognized for individual segments may differ, the differing results can further be scored by voting, and the drama title with the most votes taken as the final output for every segment of the same target video segment.
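A possible sketch of this identification-by-voting logic. The face detection, caption keyword extraction and knowledge-graph lookup are left as stand-in callables since the patent does not specify their interfaces; only the majority vote is concrete, and the toy usage values are invented.

```python
from collections import Counter
from typing import Callable, List, Optional

def identify_drama_title(
    segments: List[dict],
    detect_stars: Callable[[dict], List[str]],          # face detection + aggregation -> star names
    extract_characters: Callable[[dict], List[str]],    # caption keyword extraction -> character names
    query_kg: Callable[[List[str], List[str]], Optional[str]],  # knowledge-graph lookup -> drama title
) -> Optional[str]:
    """Per-segment lookup followed by majority voting, as described above."""
    votes = Counter()
    for seg in segments:
        title = query_kg(detect_stars(seg), extract_characters(seg))
        if title:
            votes[title] += 1
    # The title with the most votes is used as the final result for every segment.
    return votes.most_common(1)[0][0] if votes else None

# Toy usage with stand-in callables (a real system would call face detection,
# subtitle keyword extraction and a film-and-TV knowledge graph here).
segs = [{"id": 0}, {"id": 1}, {"id": 2}]
print(identify_drama_title(
    segs,
    detect_stars=lambda s: ["Star A"],
    extract_characters=lambda s: ["Character X"],
    query_kg=lambda stars, chars: "Drama Title" if stars and chars else None,
))  # -> "Drama Title"
```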
Through the embodiment of the disclosure, commentary videos are generated for feature videos and video segments, which can further improve the richness of the video editing result.
According to an embodiment of the present disclosure, the text features include the line text in the target video segment. Determining the summary information of the target video segment according to the text features of the target video segment includes: determining the speaker identifier of each line text in the text features; and extracting information from each line text labeled with its speaker identifier to obtain the summary information.
According to the embodiment of the disclosure, the process of determining the summary information of the target video segment according to its text features can be completed by a chapter summary model. The chapter summary model can be trained with the line text of an input segment as input and the summary description of that segment as output. The summaries of the training segments can be derived from the plot introductions of the corresponding dramas and films on the web.
According to the embodiment of the disclosure, the trained chapter summary model can take the subtitle text of the target video segment as input to generate the summary information of the target video segment. The subtitle text may include a speaker and narrated content, and the speaker may be a drama character name. Drama character names may be determined by first obtaining a star through face detection and then obtaining the character name from the correspondence between stars and characters. The narrated content may include at least one of line content and caption content.
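A small sketch of how speaker-labeled subtitle lines might be assembled into the input of the chapter summary model; summarize_fn is a stand-in for the trained model, and the "speaker: line" formatting convention is an assumption rather than something specified by the patent.

```python
from typing import Callable, List, Tuple

def build_summary_input(lines: List[Tuple[str, str]]) -> str:
    """Format (speaker, line) pairs as 'speaker: line' text for the chapter summary model.

    The speaker is the drama character name, obtained by mapping the star recognised
    through face detection to the character he or she plays.
    """
    return "\n".join(f"{speaker}: {text}" for speaker, text in lines)

def summarize_segment(
    lines: List[Tuple[str, str]],
    summarize_fn: Callable[[str], str],   # stand-in for the trained chapter summary model
) -> str:
    return summarize_fn(build_summary_input(lines))

# Toy usage; a real summarize_fn would be a seq2seq model trained on
# (segment lines -> plot introduction) pairs crawled from the web.
lines = [("Li Lei", "We have to leave tonight."), ("Han Mei", "Then I am coming with you.")]
print(summarize_segment(lines, summarize_fn=lambda text: text.splitlines()[0]))
```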
With the above way of generating the summary information, the summary is determined based on the characters and line information of the target video segment, which can effectively improve the accuracy and completeness of the summary description.
According to an embodiment of the present disclosure, the text features include the line text in the target video segment. Determining the target shots related to the summary information from the target video segment includes: determining the voice broadcast duration for broadcasting the summary information as speech; determining at least one line text associated with the summary information; for each line text, determining the shot segment temporally matched with that line text, so as to obtain a plurality of shot segments; determining at least one target shot segment from the plurality of shot segments according to the voice broadcast duration, wherein the total duration of the at least one target shot segment matches the voice broadcast duration; and determining the at least one target shot segment as the target shots.
According to the embodiment of the disclosure, a self-attention operation can be introduced into the chapter summary model according to the temporal characteristics of the text, and the self-attention value can represent how much a given line text contributes to the finally output summary information. For each line text, the shot-layer video with the highest temporal overlap, or the shot-layer video associated with that line text, is selected as the shot segment corresponding to the line text. The voice broadcast duration of the summary information can be calculated according to the automatic AI broadcast speed. At least one target shot can then be selected in descending order of shot score until the selected shots fill the entire voice broadcast duration, where the shot score may be the normalized score corresponding to the self-attention value.
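A minimal sketch of this shot-selection step, assuming each candidate shot already carries a normalized self-attention score; the speaking-rate constant used to estimate the voice broadcast duration is an assumed value, not one given in the patent.

```python
from typing import List, Tuple

def broadcast_duration(summary: str, chars_per_second: float = 4.0) -> float:
    """Rough TTS duration estimate; the speaking rate is an assumed constant."""
    return len(summary) / chars_per_second

def select_target_shots(
    shots: List[Tuple[str, float, float]],   # (shot_id, duration_s, normalized self-attention score)
    duration_needed: float,                  # estimated voice broadcast duration, in seconds
) -> List[str]:
    """Pick shots by descending attention score until they cover the broadcast duration."""
    selected, total = [], 0.0
    for shot_id, duration, _score in sorted(shots, key=lambda s: s[2], reverse=True):
        if total >= duration_needed:
            break
        selected.append(shot_id)
        total += duration
    return selected

# Toy example: the summary takes about 12 s to read aloud at the assumed rate.
shots = [("s1", 5.0, 0.9), ("s2", 4.0, 0.2), ("s3", 6.0, 0.7), ("s4", 3.0, 0.5)]
print(select_target_shots(shots, duration_needed=12.0))  # -> ['s1', 's3', 's4']
```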
Through the above embodiment of the present disclosure, filling the voice broadcast with the extracted target shots can further increase the richness of the commentary video.
According to the embodiment of the disclosure, the title information of the target video segment can be predicted from the summary information by a chapter title generation model. A large number of drama introductions and the titles corresponding to episodes and segments can be crawled from the web, and with these data the chapter title generation model can be trained.
According to the embodiment of the disclosure, the title information can be predicted by inputting the summary information into the chapter title generation model.
According to an embodiment of the present disclosure, the title information of the target video segment is determined according to the summary information. Generating the commentary video according to the identification information, the summary information, the target shots and the title information can be embodied as follows: the selected target shots are played in time order and overlaid with the video name, the segment title, the text summary subtitles and the AI voice broadcast of the summary, so as to obtain the commentary video for the target video segment.
Through the embodiment of the disclosure, the obtained commentary video can efficiently reflect the target video clip, and the integrity, the accuracy and the richness are effectively ensured.
According to an embodiment of the present disclosure, the video editing method may further include: classifying each shot according to second frame information corresponding to a respective second partial frame of at least one shot included in the feature video, to obtain a shot classification result for each shot; and when the shot classification result indicates that the target shot corresponding to the shot classification result is a scene split point, splitting the feature video into at least one event scene according to the start time information of the target shot, wherein each event scene comprises at least one shot.
According to an embodiment of the present disclosure, the criterion for shot splitting may be camera switching. Shot splitting can be achieved using methods based on, for example, color histograms, inter-frame differences, image feature similarity, or video-splitter tools. Scene splitting is typically based on criteria such as a spatial or temporal transition between shots. For example, when a video turns from reality to a recalled narration, this is a temporal scene change; when the video moves from indoors to an airport, this is a spatial scene change.
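A rough OpenCV sketch of the color-histogram style of shot splitting mentioned above; the sampling step and correlation threshold are assumed values that would need tuning, and this is only one of the listed options, not the patent's own boundary model.

```python
import cv2

def detect_shot_boundaries(video_path: str, threshold: float = 0.6, step: int = 5):
    """Mark a shot boundary where the color-histogram correlation between
    consecutive sampled frames drops below `threshold` (an assumed value)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                                [0, 256, 0, 256, 0, 256])
            cv2.normalize(hist, hist)
            if prev_hist is not None:
                similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
                if similarity < threshold:
                    boundaries.append(idx / fps)   # boundary time in seconds
            prev_hist = hist
        idx += 1
    cap.release()
    return boundaries

# Example (hypothetical file name): print(detect_shot_boundaries("feature.mp4"))
```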
According to embodiments of the present disclosure, a video is typically stitched together from one or more shots. The second partial frame may be a subset of the frames corresponding to one shot. The second frame information may include at least one of image features, face features, audio features and the like corresponding to the second partial frame.
According to the embodiment of the disclosure, the classification of shots that produces the shot classification result can be completed by a boundary classification model at shot granularity. The boundary classification model can be trained jointly on the open-source MovieScenes dataset and labeled data in a local database, to encode whether each shot in the feature video is a boundary.
According to the embodiment of the present disclosure, when predicting whether a certain shot is a boundary, a partial frame can be extracted from each shot of the feature video to obtain the second partial frames. By feeding the second partial frames corresponding to the shots into the boundary classification model as input features, the shot classification result of each shot in the video can be obtained, i.e., a judgment of whether the shot is a boundary. The boundary classification result may also take the form of 0 or 1, where 1 indicates that the frame corresponding to the result is a boundary point at shot granularity, and 0 a non-boundary point. If a shot is determined to be a boundary, its start time can be used as the start time of a scene split, so that the feature video can be split into event scenes.
Through the embodiment of the disclosure, shot-layer segmentation information can be used to selectively sample partial frames, achieving efficient video scene splitting with low resource consumption.
According to the embodiment of the disclosure, after the multi-level splitting of the feature video, labels can be extracted from the video data of different levels by models with different strategies, based on the splitting result.
According to an embodiment of the present disclosure, the video editing method may further include, for each shot: acquiring a fourth partial frame corresponding to the shot; performing feature extraction on each frame in the fourth partial frame to obtain a second feature extraction result; and determining the shot label of the shot according to the second feature extraction result.
According to an embodiment of the present disclosure, the shot layer may employ an image-granularity model for extracting shot labels. For example, a face detection model is used to recognize different stars; an object detection model is used to detect different objects such as guns and flags; an object attribute model is used to detect attributes such as character appearance type and clothing type; and an image classification model is used to detect the attractiveness, style and the like of the picture. Because the video is long, if all key frames were sampled to generate labels, the image labels of different frames would still be highly redundant, and if every frame were fed into the image model for analysis, video-level analysis would be slow. Instead, based on shot granularity, a small number of frames are taken to obtain the fourth partial frame, and the second feature extraction result obtained by analyzing and averaging over the fourth partial frame can be used as the shot label.
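A minimal sketch of shot-label extraction by sampling a few frames per shot and averaging the image-model scores; score_frame stands in for an arbitrary image model, and the class names and sample count are illustrative assumptions.

```python
import numpy as np
from typing import Callable, List, Sequence

def shot_label(
    shot_frames: Sequence,                       # all frames of one shot
    score_frame: Callable[[object], np.ndarray], # image model -> per-class scores for one frame
    class_names: List[str],
    num_samples: int = 3,
) -> str:
    """Sample a few frames evenly from the shot, average the image-model scores,
    and take the top class as the shot label."""
    if len(shot_frames) == 0:
        raise ValueError("empty shot")
    idx = np.linspace(0, len(shot_frames) - 1,
                      num=min(num_samples, len(shot_frames)), dtype=int)
    scores = np.mean([score_frame(shot_frames[i]) for i in idx], axis=0)
    return class_names[int(np.argmax(scores))]

# Toy usage with a stand-in image model returning fixed class scores.
frames = list(range(30))  # placeholder frame objects
print(shot_label(frames, score_frame=lambda f: np.array([0.1, 0.7, 0.2]),
                 class_names=["gun", "flag", "car"]))  # -> "flag"
```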
Through the embodiment of the disclosure, based on shot granularity, part of frames are selectively sampled, and shot labels are determined by using lower resources, so that the label recall rate is ensured, and the overall video analysis speed is also improved.
According to an embodiment of the present disclosure, the video editing method may further include, for each event scene: acquiring a third partial frame corresponding to each target shot in the event scene; performing feature extraction on each frame in the third partial frame to obtain a first feature extraction result; and determining the scene label of the event scene according to the first feature extraction result.
According to the embodiment of the disclosure, the scene layer can adopt a video-granularity temporal model to provide labels for scenes, activities, actions and the like, such as airport, living room, napping and conversation. When determining the label of an event scene, partial frames are sampled for each target shot in the scene to obtain the third partial frame, which is used as the input of the scene model, so as to obtain the scene label of each event scene and thus determine information such as the place, activity and action of the event scene.
It should be noted that the number of frames sampled from different shots in an event scene may differ. In one embodiment, the number of frames sampled per shot may be positively correlated with the shot duration.
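One possible way to make the per-shot sample count grow with shot duration, as suggested above; the sampling rate and bounds are assumptions, not values given in the patent.

```python
import math

def frames_to_sample(shot_duration_s: float,
                     frames_per_second: float = 0.5,
                     min_frames: int = 1,
                     max_frames: int = 8) -> int:
    """Number of frames to sample from a shot, growing with shot duration.
    The rate and bounds are assumed values."""
    return max(min_frames, min(max_frames, math.ceil(shot_duration_s * frames_per_second)))

for d in (1.0, 4.0, 30.0):
    print(d, frames_to_sample(d))   # 1.0 -> 1, 4.0 -> 2, 30.0 -> 8
```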
Through the embodiment of the disclosure, partial frames are selectively sampled based on scene-layer segmentation information, and the scene label is determined with lower resource usage, which can further ensure the label recall rate and improve the overall video analysis speed.
According to an embodiment of the present disclosure, the video editing method may further include processing the video frame sequence information by at least one of the following: when a first target video frame sequence is detected, determining that the first target video frame sequence is a video frame sequence of the head or the tail of the initial video; and when a second target video frame sequence is detected, determining that the second target video frame sequence is a video frame sequence of the head or the tail of the initial video. The video frame sequence information comprises the video frame sequences in the initial video and the audio corresponding to those sequences; the first target video frame sequence comprises video frames whose subtitles are located at a first position of the video frame picture, and the audio corresponding to the second target video frame sequence is audio of a target type. The feature video is determined based on at least one of the first target video frame sequence and the second target video frame sequence.
According to an embodiment of the present disclosure, the initial video may be a video including a head (opening credits) and a tail (closing credits). A head-and-tail detection model can detect the head and tail of the video, and can be implemented with subtitle detection and audio feature classification. Regarding subtitles, in the feature part of a video, subtitles typically appear at the bottom of the picture, whereas in the head and tail, subtitles often appear within the picture area; the first position can thus be defined as lying within the picture area. Regarding audio, the head and tail generally contain no actor dialogue, and most of the audio is pure music or music mixed with some special-effect background sound; the target types can thus be defined to include pure music, special-effect background sound and the like.
According to the embodiment of the disclosure, for an initial video, the start and end positions of the feature video can be determined by the head-and-tail detection model. Specifically, the audio features within a sliding window, for example in units of 1 second, can be classified as belonging to the head or the tail. By combining the results of subtitle position detection and audio feature detection, the head and tail times of the video can be marked at second granularity, thereby determining the feature video.
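A toy sketch of fusing the two cues per one-second window, assuming an upstream subtitle-position detector and audio classifier have already produced boolean flags; the simple AND fusion and the trimming rule are assumptions, not the patent's exact model.

```python
from typing import List, Tuple

def head_tail_seconds(
    subtitle_in_picture: List[bool],   # per second: subtitle detected within the picture area
    audio_is_music: List[bool],        # per second: 1 s audio window classified as music/effects
) -> List[int]:
    """Seconds flagged as opening or closing credits when both cues agree."""
    return [t for t, (sub, music) in enumerate(zip(subtitle_in_picture, audio_is_music))
            if sub and music]

def feature_range(credit_seconds: List[int], total_seconds: int) -> Tuple[int, int]:
    """Derive the feature-video interval by trimming leading/trailing credit seconds."""
    start = 0
    while start in credit_seconds:
        start += 1
    end = total_seconds
    while (end - 1) in credit_seconds:
        end -= 1
    return start, end

# 6-second toy clip: seconds 0-1 are the head and second 5 is the tail.
flags = head_tail_seconds(
    subtitle_in_picture=[True, True, False, False, False, True],
    audio_is_music=     [True, True, False, False, True,  True],
)
print(flags, feature_range(flags, total_seconds=6))   # [0, 1, 5] (2, 5)
```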
Through this embodiment, a screening step is provided before video processing, so that the main video content is screened out for processing and the video editing efficiency can be effectively improved.
According to an embodiment of the present disclosure, the video editing method may further include: determining a plurality of third target video frame sequences in the feature video; performing feature extraction on each third target video frame sequence to obtain a third feature extraction result; and determining the type label of the feature video according to the plurality of third feature extraction results.
According to an embodiment of the present disclosure, the type label of the feature video, i.e., the program-layer type label, can be derived by a long-video temporal model based on the image and audio sequences. Through this model, composite labels such as "TV series - family ethics" or "film - science fiction" can be provided for the whole video. The long-video temporal model can set a maximum number of analyzed frames during training to keep the machine memory stable. If the number of analyzed video frames exceeds this threshold, a continuous run of at most the maximum number of frames can be randomly intercepted as the training input. During prediction, non-overlapping sliding windows can be adopted, and the maximum number of frames is selected sequentially to obtain the third target video frame sequences as input. By averaging the score results of all the obtained sliding windows, the type label of the feature video can be output.
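A minimal sketch of the prediction-time sliding-window averaging described above; score_window stands in for the long-video temporal model, and the window size and class names are assumed.

```python
import numpy as np
from typing import Callable, List, Sequence

def video_type_label(
    frames: Sequence,
    score_window: Callable[[Sequence], np.ndarray],  # temporal model -> per-class scores per window
    class_names: List[str],
    max_frames: int = 64,     # maximum frames analysed per window (assumed value)
) -> str:
    """Run the model over non-overlapping windows of at most `max_frames` frames
    and average the window scores into one program-layer label."""
    windows = [frames[i:i + max_frames] for i in range(0, len(frames), max_frames)]
    scores = np.mean([score_window(w) for w in windows], axis=0)
    return class_names[int(np.argmax(scores))]

# Toy usage with a stand-in window model.
frames = list(range(200))
print(video_type_label(frames,
                       score_window=lambda w: np.array([0.2, 0.8]),
                       class_names=["film - science fiction", "TV series - family ethics"]))
```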
Through the embodiment of the disclosure, the partial frame sequence is selectively sampled based on the program layer information, the type label of the video is determined by using lower resources, the label recall rate can be effectively ensured, and the overall analysis speed of the video can be improved.
According to an embodiment of the present disclosure, the first frame information may include at least one of an image feature, an audio feature, and a text feature corresponding to the first partial frame, and at least one of an image difference vector, an audio difference vector, and a text difference vector between two adjacent frames in the first partial frame. And/or the second frame information may include at least one of image features and audio features corresponding to the second portion of frames. And are not limited herein.
Through the embodiment of the disclosure, information of each dimension of the video is processed, which can effectively ensure the accuracy and richness of the editing result.
Fig. 5 schematically shows the flow of a video editing method according to an embodiment of the present disclosure.
As shown in fig. 5, for an initial video, the head-and-tail detection model 510 first performs detection to screen out the head and tail and obtain the feature video. The feature video is then processed by the image model 521, the video model 522 and the text model 523 in the video hierarchy module 520, dividing it into shot granularity, scene granularity and segment granularity. Based on the shot-granularity, scene-granularity and segment-granularity information obtained from this hierarchical division, the shot-layer image labels, the scene-layer spatio-temporal labels, the segment-layer video description and the program-layer type obtained by editing the initial video can be further determined.
Through the above embodiments of the present disclosure, a method for intelligent labeling, intelligent stripping and intelligent commentary generation is provided. The overall method can effectively reduce the dependence on manual processing and improve the speed of video editing. Because video processing is performed on partial frames or key frames, the whole video can be labeled at different levels, providing a basis for the warehousing and indexing of videos.
Fig. 6 schematically shows a block diagram of a video editing apparatus according to an embodiment of the present disclosure.
As shown in fig. 6, the video editing apparatus 600 includes a first processing module 610, a first splitting module 620, and a video editing module 630.
The first processing module 610 is configured to classify each event scene according to first frame information corresponding to a respective first partial frame of at least one event scene included in the feature video, so as to obtain a scene classification result.
The first splitting module 620 is configured to split the feature video into at least one video segment according to the start time information of the target event scene when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment split point. Each video segment comprises at least one event scene.
And a video editing module 630, configured to perform a video editing operation based on the at least one video segment.
According to an embodiment of the present disclosure, a video editing module includes a first determining unit, a second determining unit, a third determining unit, a fourth determining unit, a fifth determining unit, and a generating unit.
The first determining unit is configured to determine the target video segment for which a commentary video is to be generated.
And the second determining unit is used for determining the identification information of the target video clip according to the character features and the text features in the target video clip.
And the third determining unit is used for determining the summary information of the target video clip according to the text characteristics of the target video clip.
And the fourth determining unit is used for determining the target shot related to the summary information from the target video clip.
And the fifth determining unit is used for determining the title information of the target video clip according to the summary information.
The generating unit is configured to generate the commentary video according to the identification information, the summary information, the target shots and the title information.
According to an embodiment of the present disclosure, the text features include the line text in the target video segment. The third determining unit includes a first determining subunit and an obtaining subunit.
The first determining subunit is configured to determine the speaker identifier of each line text in the text features.
The obtaining subunit is configured to extract information from each line text labeled with its speaker identifier, so as to obtain the summary information.
According to an embodiment of the present disclosure, the text features include the line text in the target video segment. The fourth determining unit includes a second determining subunit, a third determining subunit, a fourth determining subunit, a fifth determining subunit and a sixth determining subunit.
The second determining subunit is configured to determine the voice broadcast duration for broadcasting the summary information as speech.
The third determining subunit is configured to determine at least one line text associated with the summary information.
The fourth determining subunit is configured to determine, for each line text, the shot segment temporally matched with that line text, so as to obtain a plurality of shot segments.
The fifth determining subunit is configured to determine at least one target shot segment from the plurality of shot segments according to the voice broadcast duration, wherein the total duration of the at least one target shot segment matches the voice broadcast duration.
The sixth determining subunit is configured to determine the at least one target shot segment as the target shots.
According to the embodiment of the disclosure, the video editing apparatus further includes a second processing module and a second splitting module.
The second processing module is configured to classify each shot according to second frame information corresponding to a respective second partial frame of at least one shot included in the feature video, to obtain the shot classification result of each shot.
The second splitting module is configured to split the feature video into at least one event scene according to the start time information of the target shot when the shot classification result indicates that the target shot corresponding to the shot classification result is a scene split point. Each event scene comprises at least one shot.
According to an embodiment of the present disclosure, the video editing apparatus further includes a first feature extraction module and a first determination module.
The first feature extraction module is configured to, for each event scene: acquire a third partial frame corresponding to each target shot in the event scene; and perform feature extraction on each frame in the third partial frame to obtain a first feature extraction result.
And the first determining module is used for determining the scene label of the event scene according to the first feature extraction result.
According to an embodiment of the present disclosure, the video editing apparatus further includes a second feature extraction module and a second determination module.
The second feature extraction module is configured to, for each shot: acquire a fourth partial frame corresponding to the shot; and perform feature extraction on each frame in the fourth partial frame to obtain a second feature extraction result.
The second determining module is configured to determine the shot label of the shot according to the second feature extraction result.
According to an embodiment of the present disclosure, the video editing apparatus further includes a third determining module, a fourth determining module, and a fifth determining module.
The third determining module is used for determining, in a case where a first target video frame sequence is detected, that the first target video frame sequence is a video frame sequence of the head or the tail of the initial video, wherein the first target video frame sequence includes video frames whose subtitles are located at a first position of the video frame picture.
The fourth determining module is used for determining, in a case where a second target video frame sequence is detected, that the second target video frame sequence is a video frame sequence of the head or the tail of the initial video, wherein audio corresponding to the second target video frame sequence is audio of a target type.
The fifth determining module is used for determining the video feature based on at least one of the first target video frame sequence and the second target video frame sequence.
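A rough sketch of the head/tail trimming idea, assuming per-frame subtitle positions and a coarse audio type are already available from upstream detectors. The FrameInfo fields, the "center" subtitle position, and the "music" audio type are placeholders for the first position and the target type referred to above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameInfo:
    t: float                     # timestamp in seconds
    subtitle_pos: Optional[str]  # where a caption was found: e.g. "center", "bottom", or None
    audio_type: str              # coarse audio class from an assumed classifier: "music", "speech", ...


def looks_like_credits(f: FrameInfo, credit_pos: str = "center", credit_audio: str = "music") -> bool:
    # First target sequence: subtitles at the assumed "first position" (centered credits).
    # Second target sequence: audio of the assumed target type (pure theme music).
    return f.subtitle_pos == credit_pos or f.audio_type == credit_audio


def trim_to_video_feature(frames: list[FrameInfo]) -> tuple[float, float]:
    """Return the (start, end) timestamps left after stripping leading and trailing
    runs of credit-like frames, i.e. the span kept as the video feature."""
    i, j = 0, len(frames) - 1
    while i <= j and looks_like_credits(frames[i]):
        i += 1
    while j >= i and looks_like_credits(frames[j]):
        j -= 1
    return (frames[i].t, frames[j].t) if i <= j else (0.0, 0.0)


if __name__ == "__main__":
    frames = [FrameInfo(0.0, "center", "music"), FrameInfo(5.0, None, "speech"),
              FrameInfo(10.0, None, "speech"), FrameInfo(15.0, "center", "music")]
    print(trim_to_video_feature(frames))  # (5.0, 10.0)
```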
According to an embodiment of the present disclosure, the video editing apparatus further includes a sixth determining module, a third feature extraction module, and a seventh determining module.
The sixth determining module is used for determining a plurality of third target video frame sequences in the video feature.
The third feature extraction module is used for performing feature extraction on each third target video frame sequence to obtain a third feature extraction result.
The seventh determining module is used for determining a type label of the video feature according to the plurality of third feature extraction results.
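One possible aggregation for the type label, assuming each third target video frame sequence has already been reduced to a fixed-length feature vector by the third feature extraction module. The label set and the randomly initialised linear head are placeholders for whatever trained classifier is actually used.

```python
import numpy as np

TYPE_LABELS = ["drama", "variety_show", "documentary"]  # assumed label set
_rng = np.random.default_rng(0)
_W = _rng.normal(size=(len(TYPE_LABELS), 512))          # stand-in for trained classifier weights


def sequence_scores(seq_feature: np.ndarray) -> np.ndarray:
    """Per-type scores for one sampled frame sequence (softmax over a linear head)."""
    logits = _W @ seq_feature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()


def video_type_label(seq_features: list[np.ndarray]) -> str:
    """Average the score vectors of all third target video frame sequences and take the argmax."""
    mean_scores = np.mean([sequence_scores(f) for f in seq_features], axis=0)
    return TYPE_LABELS[int(np.argmax(mean_scores))]


if __name__ == "__main__":
    clips = [np.random.rand(512) for _ in range(5)]      # stand-ins for extracted sequence features
    print(video_type_label(clips))
```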
According to an embodiment of the present disclosure, the first frame information includes at least one of an image feature, an audio feature, and a text feature corresponding to the first part of frames, and at least one of an image difference vector, an audio difference vector, and a text difference vector between two adjacent frames in the first part of frames; and/or the second frame information includes at least one of an image feature and an audio feature corresponding to the second part of frames.
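Once per-frame features exist, the difference vectors named here are a direct computation; a minimal sketch follows, in which the 128-dimensional random vectors stand in for real per-frame image (or audio, or text) features.

```python
import numpy as np


def difference_vectors(frame_features: list[np.ndarray]) -> list[np.ndarray]:
    """Difference vector between every pair of adjacent frames in the sampled part of frames;
    the same computation applies to audio and text feature sequences."""
    return [frame_features[i + 1] - frame_features[i] for i in range(len(frame_features) - 1)]


if __name__ == "__main__":
    feats = [np.random.rand(128) for _ in range(4)]  # stand-ins for per-frame image features
    diffs = difference_vectors(feats)
    print(len(diffs), diffs[0].shape)                # 3 difference vectors of dimension 128
```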
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer readable storage medium has stored thereon computer instructions for causing a computer to perform the method described above.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements the method described above.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as the video editing method. For example, in some embodiments, the video editing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the video editing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the video editing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor and can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A video editing method, comprising:
classifying each event scene according to first frame information corresponding to a first part of frames of at least one event scene included in a video feature, so as to obtain a scene classification result;
when the scene classification result represents that a target event scene corresponding to the scene classification result is a segment split point, splitting the video feature into at least one video segment according to start time information of the target event scene, wherein each video segment comprises at least one event scene; and
performing a video editing operation based on the at least one video segment;
wherein the performing a video editing operation based on the at least one video segment comprises:
determining a target video segment from the at least one video segment;
determining identification information of the target video segment according to a character feature and a text feature in the target video segment;
determining summary information of the target video segment according to the text feature of the target video segment;
determining a target shot related to the summary information from the target video segment;
determining title information of the target video segment according to the summary information; and
generating a commentary video corresponding to the target video segment according to the identification information, the summary information, the target shot, and the title information.
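Purely as an illustration of how the claimed steps chain together, and not as part of the claims, the flow could be orchestrated as below; every helper passed in is a hypothetical placeholder for the corresponding determining step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CommentaryVideo:
    identification: str
    summary: str
    title: str
    shot_spans: list[tuple[float, float]]


def edit_video(video_feature,
               classify_scenes: Callable, split_segments: Callable,
               pick_target_segment: Callable, extract_identification: Callable,
               summarize: Callable, pick_target_shots: Callable,
               make_title: Callable) -> CommentaryVideo:
    """Hypothetical orchestration: classify event scenes, split the video feature into
    video segments at segment split points, then assemble the commentary video for one
    target segment from its identification, summary, matched shots, and title."""
    scene_results = classify_scenes(video_feature)            # per-event-scene classification
    segments = split_segments(video_feature, scene_results)   # split at segment split points
    target = pick_target_segment(segments)                    # the segment to be explained
    identification = extract_identification(target)           # from character and text features
    summary = summarize(target)                               # from text features (line texts)
    shots = pick_target_shots(target, summary)                # shots related to the summary
    title = make_title(summary)
    return CommentaryVideo(identification, summary, title, shots)


if __name__ == "__main__":
    demo = edit_video(
        "feature.mp4",
        classify_scenes=lambda v: ["keep", "split", "keep"],
        split_segments=lambda v, r: [("seg0", 0.0, 300.0), ("seg1", 300.0, 900.0)],
        pick_target_segment=lambda segs: segs[-1],
        extract_identification=lambda seg: "Episode 3, scenes 12-18",
        summarize=lambda seg: "The heroes uncover the villain's plan.",
        pick_target_shots=lambda seg, s: [(310.0, 318.5), (402.0, 409.0)],
        make_title=lambda s: "The plan is revealed",
    )
    print(demo.title, demo.shot_spans)
```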
2. The method of claim 1, wherein the text feature comprises line texts in the target video segment, and the determining summary information of the target video segment according to the text feature of the target video segment comprises:
determining a generator identifier of each line text in the text feature; and
extracting information from each line text marked with its generator identifier, so as to obtain the summary information.
3. The method of claim 1, wherein the text feature comprises line texts in the target video segment, and the determining a target shot related to the summary information from the target video segment comprises:
determining a voice broadcast duration for voice-broadcasting the summary information;
determining at least one line text associated with the summary information;
determining, for each line text, a shot segment temporally matched with the line text, so as to obtain a plurality of shot segments;
determining at least one target shot segment from the plurality of shot segments according to the voice broadcast duration, wherein a total duration of the at least one target shot segment matches the voice broadcast duration; and
determining the at least one target shot segment as the target shot.
4. The method of any of claims 1 to 3, further comprising:
classifying each shot according to second frame information corresponding to a second part of frames of at least one shot included in the video feature, so as to obtain a shot classification result of each shot; and
when the shot classification result represents that a target shot corresponding to the shot classification result is a scene split point, splitting the video feature into at least one event scene according to start time information of the target shot, wherein each event scene comprises at least one shot.
5. The method of claim 4, further comprising:
for each of the event scenes:
acquiring a third part of frames corresponding to each target shot in the event scene;
performing feature extraction on each frame in the third part of frames to obtain a first feature extraction result; and
determining a scene label of the event scene according to the first feature extraction result.
6. The method of claim 4, further comprising:
for each of the shots:
acquiring a fourth part of frames corresponding to the shot;
performing feature extraction on each frame in the fourth part of frames to obtain a second feature extraction result; and
determining a shot label of the shot according to the second feature extraction result.
7. The method of claim 1, further comprising:
when a first target video frame sequence is detected, determining that the first target video frame sequence is a video frame sequence of the head or the tail of an initial video, wherein the first target video frame sequence comprises video frames whose subtitles are located at a first position of a video frame picture;
when a second target video frame sequence is detected, determining that the second target video frame sequence is a video frame sequence of the head or the tail of the initial video, wherein audio corresponding to the second target video frame sequence is audio of a target type; and
determining the video feature based on at least one of the first target video frame sequence and the second target video frame sequence.
8. The method of claim 1, further comprising:
determining a plurality of third target video frame sequences in the video feature;
performing feature extraction on each third target video frame sequence to obtain a third feature extraction result; and
determining a type label of the video feature according to the plurality of third feature extraction results.
9. The method of claim 4, wherein the first frame information comprises at least one of an image feature, an audio feature, and a text feature corresponding to the first part of frames, and at least one of an image difference vector, an audio difference vector, and a text difference vector between two adjacent frames in the first part of frames; and/or
the second frame information comprises at least one of an image feature and an audio feature corresponding to the second part of frames.
10. A video editing apparatus comprising:
a first processing module, configured to classify each event scene according to first frame information corresponding to a first part of frames of at least one event scene included in a video feature, so as to obtain a scene classification result;
a first splitting module, configured to split the video feature into at least one video segment according to start time information of a target event scene when the scene classification result represents that the target event scene corresponding to the scene classification result is a segment split point, wherein each video segment comprises at least one event scene; and
a video editing module, configured to perform a video editing operation based on the at least one video segment;
wherein the video editing module comprises:
a first determining unit, configured to determine a target video segment for which a commentary video is to be generated;
a second determining unit, configured to determine identification information of the target video segment according to a character feature and a text feature in the target video segment;
a third determining unit, configured to determine summary information of the target video segment according to the text feature of the target video segment;
a fourth determining unit, configured to determine a target shot related to the summary information from the target video segment;
a fifth determining unit, configured to determine title information of the target video segment according to the summary information; and
a generating unit, configured to generate the commentary video corresponding to the target video segment according to the identification information, the summary information, the target shot, and the title information.
11. The apparatus of claim 10, wherein the text feature comprises line texts in the target video segment, and the third determining unit comprises:
a first determining subunit, configured to determine a generator identifier of each line text in the text feature; and
an obtaining subunit, configured to extract information from each line text marked with its generator identifier, so as to obtain the summary information.
12. The apparatus of claim 10, wherein the text feature comprises line texts in the target video segment, and the fourth determining unit comprises:
a second determining subunit, configured to determine a voice broadcast duration for voice-broadcasting the summary information;
a third determining subunit, configured to determine at least one line text associated with the summary information;
a fourth determining subunit, configured to determine, for each line text, a shot segment temporally matched with the line text, so as to obtain a plurality of shot segments;
a fifth determining subunit, configured to determine at least one target shot segment from the plurality of shot segments according to the voice broadcast duration, wherein a total duration of the at least one target shot segment matches the voice broadcast duration; and
a sixth determining subunit, configured to determine the at least one target shot segment as the target shot.
13. The apparatus of any of claims 10 to 12, further comprising:
a second processing module, configured to classify each shot according to second frame information corresponding to a second part of frames of at least one shot included in the video feature, so as to obtain a shot classification result of each shot; and
a second splitting module, configured to split the video feature into the at least one event scene according to start time information of the target shot when the shot classification result represents that the target shot corresponding to the shot classification result is a scene split point, wherein each event scene comprises at least one shot.
14. The apparatus of claim 13, further comprising:
a first feature extraction module, configured to, for each event scene:
acquire a third part of frames corresponding to each target shot included in the event scene; and
perform feature extraction on each frame in the third part of frames to obtain a first feature extraction result; and
a first determining module, configured to determine a scene label of the event scene according to the first feature extraction result.
15. The apparatus of claim 10, further comprising:
a second feature extraction module, configured to, for each shot:
acquire a fourth part of frames corresponding to the shot; and
perform feature extraction on each frame in the fourth part of frames to obtain a second feature extraction result; and
a second determining module, configured to determine a shot label of the shot according to the second feature extraction result.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
17. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-9.
CN202110883507.6A 2021-08-02 2021-08-02 Video editing method and device, electronic equipment and storage medium Active CN113613065B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110883507.6A CN113613065B (en) 2021-08-02 2021-08-02 Video editing method and device, electronic equipment and storage medium
PCT/CN2022/104122 WO2023011094A1 (en) 2021-08-02 2022-07-06 Video editing method and apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110883507.6A CN113613065B (en) 2021-08-02 2021-08-02 Video editing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113613065A CN113613065A (en) 2021-11-05
CN113613065B true CN113613065B (en) 2022-09-09

Family

ID=78339126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110883507.6A Active CN113613065B (en) 2021-08-02 2021-08-02 Video editing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113613065B (en)
WO (1) WO2023011094A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113613065B (en) * 2021-08-02 2022-09-09 北京百度网讯科技有限公司 Video editing method and device, electronic equipment and storage medium
CN114245171B (en) * 2021-12-15 2023-08-29 百度在线网络技术(北京)有限公司 Video editing method and device, electronic equipment and medium
CN114222196A (en) * 2022-01-04 2022-03-22 阿里巴巴新加坡控股有限公司 Method and device for generating short video of plot commentary and electronic equipment
CN114257864B (en) * 2022-02-24 2023-02-03 易方信息科技股份有限公司 Seek method and device of player in HLS format video source scene
CN115174962A (en) * 2022-07-22 2022-10-11 湖南芒果无际科技有限公司 Rehearsal simulation method and device, computer equipment and computer readable storage medium
CN115460455B (en) * 2022-09-06 2024-02-09 上海硬通网络科技有限公司 Video editing method, device, equipment and storage medium
CN116405745B (en) * 2023-06-09 2023-11-17 深圳市信润富联数字科技有限公司 Video information extraction method and device, terminal equipment and computer medium

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4546762B2 (en) * 2004-05-20 2010-09-15 日本放送協会 Video event discriminating learning data generating device and program thereof, and video event discriminating device and program thereof
CN103440640B (en) * 2013-07-26 2016-02-10 北京理工大学 A kind of video scene cluster and browsing method
CN104394422B (en) * 2014-11-12 2017-11-17 华为软件技术有限公司 A kind of Video segmentation point acquisition methods and device
CN110121108B (en) * 2018-02-06 2022-01-04 阿里巴巴(中国)有限公司 Video value evaluation method and device
CN108777815B (en) * 2018-06-08 2021-04-23 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN111327945B (en) * 2018-12-14 2021-03-30 北京沃东天骏信息技术有限公司 Method and apparatus for segmenting video
CN111382620B (en) * 2018-12-28 2023-06-09 阿里巴巴集团控股有限公司 Video tag adding method, computer storage medium and electronic device
CN114666663A (en) * 2019-04-08 2022-06-24 百度(美国)有限责任公司 Method and apparatus for generating video
US10999566B1 (en) * 2019-09-06 2021-05-04 Amazon Technologies, Inc. Automated generation and presentation of textual descriptions of video content
CN111246287A (en) * 2020-01-13 2020-06-05 腾讯科技(深圳)有限公司 Video processing method, video publishing method, video pushing method and devices thereof
CN111538862B (en) * 2020-05-15 2023-06-20 北京百度网讯科技有限公司 Method and device for explaining video
CN112818906B (en) * 2021-02-22 2023-07-11 浙江传媒学院 Intelligent cataloging method of all-media news based on multi-mode information fusion understanding
CN113014988B (en) * 2021-02-23 2024-04-05 北京百度网讯科技有限公司 Video processing method, device, equipment and storage medium
CN112800278B (en) * 2021-03-30 2021-07-09 腾讯科技(深圳)有限公司 Video type determination method and device and electronic equipment
CN113613065B (en) * 2021-08-02 2022-09-09 北京百度网讯科技有限公司 Video editing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2023011094A1 (en) 2023-02-09
CN113613065A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
CN113613065B (en) Video editing method and device, electronic equipment and storage medium
CN110234037B (en) Video clip generation method and device, computer equipment and readable medium
US20200162799A1 (en) Auto-curation and personalization of sports highlights
CN108833973B (en) Video feature extraction method and device and computer equipment
US11436831B2 (en) Method and apparatus for video processing
US10013487B2 (en) System and method for multi-modal fusion based fault-tolerant video content recognition
US10108709B1 (en) Systems and methods for queryable graph representations of videos
US9646227B2 (en) Computerized machine learning of interesting video sections
CN112929744B (en) Method, apparatus, device, medium and program product for segmenting video clips
EP3813376A1 (en) System and method for generating localized contextual video annotation
US20160014482A1 (en) Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
CN112511854B (en) Live video highlight generation method, device, medium and equipment
WO2019042341A1 (en) Video editing method and device
CN109275047B (en) Video information processing method and device, electronic equipment and storage medium
CN109408672B (en) Article generation method, article generation device, server and storage medium
US8428360B2 (en) System and method for real-time new event detection on video streams
US20200194035A1 (en) Video data learning and prediction
CN111314732A (en) Method for determining video label, server and storage medium
CN113779381B (en) Resource recommendation method, device, electronic equipment and storage medium
CN113301382B (en) Video processing method, device, medium, and program product
CN113923479A (en) Audio and video editing method and device
CN114064968A (en) News subtitle abstract generating method and system
Tapu et al. TV news retrieval based on story segmentation and concept association
CN113407765B (en) Video classification method, apparatus, electronic device, and computer-readable storage medium
CN115858854B (en) Video data sorting method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant