WO2023011094A1 - Video editing method and apparatus, electronic device and storage medium - Google Patents

Video editing method and apparatus, electronic device and storage medium

Info

Publication number
WO2023011094A1
WO2023011094A1 (PCT/CN2022/104122)
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
shot
segment
scene
Prior art date
Application number
PCT/CN2022/104122
Other languages
English (en)
Chinese (zh)
Inventor
马彩虹
叶芷
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2023011094A1 publication Critical patent/WO2023011094A1/fr

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44016 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/4402 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N 21/440227 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by decomposing into layers, e.g. base layer and one or more enhancement layers
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/84 - Generation or processing of descriptive data, e.g. content descriptors
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 - Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 - Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 - Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 - Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, in particular to the fields of deep learning and video analysis, and more particularly to a video editing method and apparatus, an electronic device, and a storage medium.
  • video technology generally refers to the technology of capturing, recording, processing, storing, transmitting and reproducing a series of static images in the form of electrical signals.
  • editing operations such as segmentation, classification, description, and indexing can be performed on video data according to certain standards and rules.
  • the disclosure provides a video editing method, device, electronic equipment and storage medium.
  • a video editing method is provided, including: performing classification processing on each event scene according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, to obtain a scene classification result; when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, splitting the main video into at least one video clip according to the start time information of the target event scene, wherein each video clip includes at least one event scene; and performing a video editing operation based on the at least one video clip.
  • a video editing device is provided, including: a first processing module, configured to perform classification processing on each event scene according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, to obtain a scene classification result; a first splitting module, configured to split the main video into at least one video clip according to the start time information of the target event scene when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, wherein each video clip includes at least one event scene; and a video editing module, configured to perform a video editing operation based on the at least one video clip.
  • an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the above method.
  • a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
  • FIG. 1 schematically shows an exemplary system architecture to which a video editing method and device can be applied according to an embodiment of the present disclosure
  • Fig. 2 schematically shows a flow chart of a video editing method according to an embodiment of the present disclosure
  • Fig. 3 schematically shows a schematic diagram of video classification according to an embodiment of the present disclosure
  • Fig. 4 schematically shows an example diagram of dividing video clips according to event scenes according to an embodiment of the present disclosure
  • FIG. 5 schematically shows a schematic flowchart of a video editing method according to an embodiment of the present disclosure
  • Fig. 6 schematically shows a block diagram of a video editing device according to an embodiment of the present disclosure.
  • FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure.
  • the user's authorization or consent is obtained.
  • video editing technology can use image, audio, subtitle and other information in a video to analyze, summarize and catalog video materials based on their content and form characteristics, and to organize and create various retrieval catalogs or retrieval methods, covering requirements such as video tagging and video splitting. For example, during editing, it is possible to perform timeline processing on the program layer, segment layer, scene layer and shot layer, determine video keyword tags, describe video content, and explain video titles.
  • the inventors found that video tagging, video splitting and video description all need to combine the contextual structure information of the video to extract high-level semantic information.
  • Traditional processing mainly relies on manual work.
  • the amount of multimedia data is increasing rapidly.
  • video editors need to spend more manpower to complete video editing and storage.
  • some other solutions for implementing video editing include, for example: (1) shot detection technology that assists manual video splitting, in which video editing tools use inter-frame difference information to achieve shot-layer segmentation, and computer vision technology uses face detection to achieve shot splitting; (2) video description technology based on machine learning, for example video-caption technology, which uses the image and audio information of a video to produce a simple scene description such as "someone is doing something in a certain space"; and (3) video intelligent labeling technology based on machine learning, such as image classification, image detection and video classification.
  • the inventors found that scheme (1), which relies on single-frame or short-sequence image information, can only split the video at shot granularity and cannot achieve higher-level semantic aggregation; division at higher semantic levels (such as the scene layer and the segment layer) still needs to be completed with human assistance.
  • the AI model of scheme (2) requires a large amount of manual annotation for training, and its description of a scene is too simplistic and rigid to meet the needs of practical applications. Because video contains a large amount of temporally redundant information, especially in long dramas, adopting scheme (3) to process all key frames indiscriminately leads to low processing efficiency.
  • the present disclosure introduces an automatic editing technology that meets automatic editing requirements through a multi-modal, machine-learning-based system. It mainly addresses the parts of video editing that currently depend heavily on human understanding, such as intelligent labeling, automatic splitting, and video description generation.
  • Fig. 1 schematically shows an exemplary system architecture to which a video editing method and device can be applied according to an embodiment of the present disclosure.
  • the exemplary system architecture to which the video editing method and device can be applied may include a terminal device, and the terminal device may implement the video editing method and device provided by the embodiments of the present disclosure without interacting with a server.
  • a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wired and/or wireless communication links, among others.
  • users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, email clients and/or social platform software, etc. (only example).
  • the terminal devices 101, 102, 103 may be various electronic devices with display screens and supporting video browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers and the like.
  • the server 105 may be a server that provides various services, such as a background management server that supports content browsed by users using the terminal devices 101 , 102 , 103 (just an example).
  • the background management server can analyze and process received data such as user requests, and feed back processing results (such as webpages, information, or data obtained or generated according to user requests) to the terminal device.
  • the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability found in traditional physical hosts and VPS ("Virtual Private Server") services.
  • the server can also be a server of a distributed system, or a server combined with a blockchain.
  • the video editing method provided by the embodiment of the present disclosure may be executed by the terminal device 101 , 102 , or 103 .
  • the video editing apparatus provided by the embodiment of the present disclosure may also be set in the terminal device 101 , 102 , or 103 .
  • the video editing method provided by the embodiment of the present disclosure may also be executed by the server 105 .
  • the video editing apparatus provided by the embodiments of the present disclosure may generally be set in the server 105 .
  • the video editing method provided by the embodiments of the present disclosure may also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101 , 102 , 103 and/or the server 105 .
  • the video editing apparatus provided by the embodiments of the present disclosure may also be set in a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101 , 102 , 103 and/or the server 105 .
  • the terminal devices 101, 102, and 103 can classify each event scene according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, to obtain a scene classification result. Then, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, the main video is split into at least one video segment according to the start time information of the target event scene, wherein each video segment includes at least one event scene. Afterwards, a video editing operation is performed based on the at least one video segment.
  • a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or server 105 can process the main video according to the scene classification results of the video scenes in the main video, and implement video editing operations.
  • terminal devices, networks and servers in Fig. 1 are only illustrative. According to the implementation needs, there can be any number of terminal devices, networks and servers.
  • Fig. 2 schematically shows a flowchart of a video editing method according to an embodiment of the present disclosure.
  • the method includes operations S210-S230.
  • in operation S210, each event scene is classified according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, to obtain a scene classification result.
  • in operation S220, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, the main video is split into at least one video clip according to the start time information of the target event scene, wherein each video clip includes at least one event scene.
  • in operation S230, a video editing operation is performed based on the at least one video segment.
  • the video can be divided into four levels according to the split granularity from small to large: shot layer, scene layer, segment layer, and program layer.
  • the shot layer can refer to the continuous pictures recorded by the same camera in one take, that is, a shot picture.
  • the scene layer may refer to a continuous video consisting of one or more shots that share the same time, space and scene.
  • the segment layer can consist of one or more associated event scenes.
  • the program layer is generally a complete piece of video and audio data input.
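  • purely as an illustrative aid (not part of the original disclosure), the four-layer structure described above could be represented as nested data structures; the sketch below uses Python, and all class and field names are assumptions introduced here for clarity.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    """Continuous pictures recorded by one camera in one take."""
    start: float                                      # start time in seconds
    end: float                                        # end time in seconds
    labels: List[str] = field(default_factory=list)   # shot-level image labels

@dataclass
class Scene:
    """One or more shots sharing the same time and space."""
    shots: List[Shot]
    labels: List[str] = field(default_factory=list)   # scene-level spatio-temporal labels

    @property
    def start(self) -> float:
        return self.shots[0].start

@dataclass
class Segment:
    """One or more semantically associated event scenes."""
    scenes: List[Scene]

@dataclass
class Program:
    """A complete piece of video/audio data (the main video)."""
    segments: List[Segment]
    type_labels: List[str] = field(default_factory=list)  # program-level type labels
```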
  • Fig. 3 schematically shows a schematic view of video classification according to an embodiment of the present disclosure.
  • feature video 300 includes two segments 310 , 320 .
  • the segment 310 includes four scenes 330, 340, 350, and 360.
  • the segment 320 corresponds to one scene 370.
  • each scene includes multiple shots, for example, the scene 330 includes four shots 331 , 332 , 333 , and 334 .
  • segment splitting is used to aggregate semantically continuous scenes: scenes belonging to the same continuous event are merged into one segment.
  • the criteria for separating the segments may include at least one of the following: (1) the two segments belong to different time and space; (2) the two segments are not closely related in event semantics.
  • in the video, segment splitting may manifest as at least one of the following: (1) the split position is at a change of scene picture; (2) there are usually obvious audio changes at the split position, such as obvious silent pauses, background music changes, or sudden changes in background noise such as whistles or car sounds; (3) the split position is often accompanied by obvious buffer shots, such as scenery shots without prominent characters, black-screen shots, and special transition shots such as spin-in and spin-out; (4) the theme of the storyline changes before and after the split.
  • one or more event scenes may be included in the video.
  • the first partial frame may be a partial frame in multiple frames corresponding to an event scene.
  • the first frame information may include at least one of the image feature vector, audio feature vector and text feature vector corresponding to the first partial frame, as well as at least one of the inter-scene image difference vector, inter-scene audio difference vector and inter-scene text difference vector.
  • the process of classifying event scenes to obtain scene classification results may be completed by a scene granularity-based boundary classification model.
  • partial frames may be extracted from each event scene to obtain the first partial frame.
  • the boundary classification result can be expressed in the form of 0 or 1, where 1 indicates that the target event scene to which the corresponding frame belongs is a boundary point at the scene granularity, that is, a segment segmentation point, and 0 indicates a non-boundary point, that is, a non-segment segmentation point.
  • segment segmentation point determined at the scene granularity can be embodied as an event scene.
  • in this way, the scene-granularity divisions of the main video are regrouped into segment-granularity video clips.
  • Fig. 4 schematically shows an example diagram of dividing video segments according to event scenes according to an embodiment of the present disclosure.
  • feature video 400 includes five event scenes: 410 , 420 , 430 , 440 , 450 .
  • for each event scene, some frames can be extracted, and the number of frames extracted can differ between event scenes; for example, m frames 411, ..., 41m are extracted for event scene 410, and n frames 451, ..., 45n are extracted for event scene 450.
  • at least one of the image, audio and text feature vectors of the extracted frames, together with at least one of the corresponding difference vectors, is input into the boundary classification model as input features, and a result such as 00000000100000000100 can be output.
  • two frames representing segmentation points can be determined.
  • the target event scene corresponding to the frame representing the segmentation point may be further determined, for example, 430 and 440 respectively.
  • the main video can then be split according to the start time points of 430 and 440 to obtain three video clips: 410 and 420 form the first video segment 460, 430 forms the second video segment 470, and 440 and 450 form the third video segment 480.
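  • as a concrete illustration of the splitting step just described (this sketch and its timings are assumptions, not the disclosed implementation), scene start times and the indices of scenes classified as segmentation points can be turned into segment time ranges as follows.

```python
from typing import List, Sequence, Tuple

def split_into_segments(scene_start_times: Sequence[float],
                        boundary_scene_indices: Sequence[int],
                        video_end: float) -> List[Tuple[float, float]]:
    """Split the main video into segment-level clips.

    scene_start_times: start time (seconds) of each event scene, in order.
    boundary_scene_indices: indices of the scenes classified as segment
        segmentation points (e.g. scenes 430 and 440 in Fig. 4).
    Returns a list of (segment_start, segment_end) time ranges.
    """
    cut_times = sorted(scene_start_times[i] for i in boundary_scene_indices)
    starts = [scene_start_times[0]] + cut_times
    ends = cut_times + [video_end]
    return list(zip(starts, ends))

# Hypothetical timings: five scenes, the scenes at indices 2 and 3 (430, 440)
# are segmentation points, yielding three segments (460, 470, 480).
segments = split_into_segments([0.0, 40.0, 95.0, 130.0, 170.0], [2, 3], 210.0)
# -> [(0.0, 95.0), (95.0, 130.0), (130.0, 210.0)]
```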
  • for the main video or a video segment, a corresponding commentary video can be generated, and the commentary video can include a text introduction, a voice description, and multiple spliced shots.
  • scene layer segmentation information can be used to selectively sample some frames, so as to achieve efficient video clip editing effect with low resource consumption.
  • performing a video editing operation based on at least one video segment includes: determining a target video segment for which a commentary video needs to be generated; determining identification information of the target video segment according to the character features and text features in the target video segment; determining summary information of the target video segment according to the text features of the target video segment; determining a target shot related to the summary information from the target video segment; determining title information of the target video segment according to the summary information; and generating the commentary video based on the identification information, summary information, target shot and title information.
  • the target video segment may be a video segment determined from at least one video segment obtained by splitting the main video.
  • the character features may include features represented by character information in the target video segment.
  • the text features may include features represented by at least one of subtitle information, voice information, etc. in the target video segment.
  • the identification information may include the video name corresponding to the target video segment, such as the title of a film and television drama segment.
  • the summary information may represent textual commentary for the target video segment.
  • the target shot may be one or more shots of the target video segment.
  • the title information may be a title or name redefined for the target video segment.
  • the target video segment may also be replaced by the main video; that is, the above operations may also be performed on the main video to generate a corresponding commentary video.
  • determining the identification information of the target video segment can be specifically expressed as follows: the star names of the characters in the segment are obtained by aggregating face detection results over the segment interval corresponding to the target video segment; the film and television character names of those characters are obtained by extracting subtitle keywords; and by combining a star name with the corresponding film and television character name and searching a film and television drama knowledge graph, the name of the film and television drama corresponding to the clip can be obtained.
  • the drama name may first be determined for each part of the target video segment through the aforementioned method. Since the drama names identified for the different parts may differ, the scores of the different results can be combined as votes, and the drama name with the most votes can be used as the final output for the target video segment.
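  • the voting step described above could look like the following sketch (an illustration under assumptions, not the disclosed implementation; the function and parameter names are hypothetical): candidate drama names and their scores are accumulated across the parts of the segment, and the highest-voted name is returned.

```python
from collections import Counter
from typing import Dict, List, Optional

def identify_drama(per_scene_candidates: List[Dict[str, float]]) -> Optional[str]:
    """Combine per-scene drama-name candidates by score voting.

    per_scene_candidates: for each scene of the target video segment, a mapping
    from a candidate drama name (obtained by looking up star name + character
    name in the knowledge graph) to its confidence score.
    Returns the drama name with the highest accumulated vote, or None.
    """
    votes: Counter = Counter()
    for candidates in per_scene_candidates:
        for name, score in candidates.items():
            votes[name] += score
    return votes.most_common(1)[0][0] if votes else None
```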
  • the video commentary is generated for the main video and the video segment, which can further improve the richness of the video editing result.
  • the text feature includes line text in the target video segment.
  • determining the summary information of the target video segment according to the text features of the target video segment includes: determining the generator identifier of each line text in the text features.
  • Information extraction is performed on each line text marked with the generator ID to obtain summary information.
  • the process of determining the summary information of the target video segment according to the text features of the target video segment may be completed by a text summary model.
  • the chapter summary model can be trained by taking the line text corresponding to an input segment as input and the summary description corresponding to that segment as the target output.
  • the segment summaries for training can be derived from the corresponding plot introductions of TV dramas and movies on the Internet.
  • the text summarization model obtained through training may use the subtitle text of the target video segment as input to generate summary information of the target video segment.
  • the subtitle text may include the narrator and the narration content, and the narrator may include the names of the characters in the film and television drama.
  • the names of characters in film and television dramas can be determined by first obtaining the stars through face detection, and then obtaining the names of the characters from the corresponding relationship between stars and roles.
  • the narrative content may include at least one of lines and subtitle content.
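  • a minimal sketch of how the speaker-tagged subtitle text might be assembled before being passed to the text summary model is shown below; the function names and the summary_model callable are placeholders introduced here, not the disclosed implementation.

```python
from typing import Callable, List, Tuple

def build_summary_input(lines: List[Tuple[str, str]]) -> str:
    """Prefix each line of dialogue with its speaker (the generator identifier).

    lines: (speaker, line_text) pairs, where the speaker is the film/TV
    character name resolved via face detection plus the star-to-role mapping.
    The concatenated text is what would be fed to the text summary model.
    """
    return "\n".join(f"{speaker}: {text}" for speaker, text in lines)

def summarize(lines: List[Tuple[str, str]],
              summary_model: Callable[[str], str]) -> str:
    """summary_model stands in for the trained text summary model (assumed)."""
    return summary_model(build_summary_input(lines))
```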
  • an intelligent summary information generation method is provided, and the summary information is determined based on the character and line information of the target video segment, which can effectively improve the accuracy and completeness of the summary description.
  • the text feature includes line text in the target video segment. Determining the target shot related to the summary information from the target video segment includes: determining the voice broadcast duration of the summary information; determining at least one line text associated with the summary information; for each line text, determining the shot segment that matches the line text in time, so as to obtain multiple shot segments; determining at least one target shot segment from the multiple shot segments according to the voice broadcast duration, wherein the total duration of the at least one target shot segment matches the voice broadcast duration; and determining the at least one target shot segment as the target shot.
  • a self-attention operation can be introduced over the temporal text features, and the self-attention value can represent the contribution of a given line text to the final output summary information.
  • a shot-level video with the highest temporal overlap, or a shot-level video associated with the line text may be selected as a shot segment corresponding to the line text.
  • the voice broadcast duration of the summary information can be calculated and determined according to the AI automatic broadcast speech rate.
  • At least one target shot segment may be selected according to the shot score from high to low until the selected shot segment can fill the entire voice broadcast duration.
  • the shot score can be the normalized score corresponding to self-attention.
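  • the shot-selection logic just described can be sketched as follows (illustrative only; the input names are assumptions, and the scoring and shot-matching steps are assumed to have been computed elsewhere): shots matched to line texts are picked in descending order of normalized self-attention score until their total duration covers the voice broadcast duration, then re-ordered by time for playback.

```python
from typing import List, Tuple

def select_target_shots(shot_spans: List[Tuple[float, float]],
                        line_scores: List[float],
                        broadcast_duration: float) -> List[int]:
    """Pick shot segments, highest normalized self-attention score first,
    until their total duration covers the voice-broadcast duration.

    shot_spans: (start, end) time of the shot segment matched to each line text.
    line_scores: normalized self-attention score of the corresponding line text.
    Returns the indices of the selected shots, sorted into playback order.
    """
    order = sorted(range(len(shot_spans)),
                   key=lambda i: line_scores[i], reverse=True)
    selected, total = [], 0.0
    for i in order:
        if total >= broadcast_duration:
            break
        selected.append(i)
        total += shot_spans[i][1] - shot_spans[i][0]
    return sorted(selected, key=lambda i: shot_spans[i][0])
```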
  • the richness of the commentary video can be further increased.
  • the process of determining the title information of the target video segment according to the abstract information may be completed by prediction of a chapter title generation model.
  • a large number of movie and TV drama plot introductions, as well as the titles corresponding to episodes and segments, can be crawled from the Internet, and the chapter title generation model can be trained with these data.
  • the title information can be predicted by inputting the summary information into the above chapter title generation model.
  • the title information of the target video segment is determined according to the summary information.
  • the generation of the commentary video can be embodied as: playing the aforementioned selected target shots in time sequence, and adding the video name, segment title, text summary subtitles and an AI voice broadcast of the summary, so as to obtain a commentary video for the target video segment.
  • the obtained commentary video can efficiently reflect the target video segment, and effectively ensure completeness, accuracy and richness.
  • the video editing method may further include: classifying each shot according to the second frame information corresponding to the second partial frame of at least one shot included in the main video, to obtain a shot classification result for each shot.
  • when the shot classification result indicates that the target shot corresponding to the shot classification result is a scene segmentation point, the main video is split into at least one event scene according to the start time information of the target shot.
  • each event scene includes at least one shot.
  • the criteria for shot splitting may include camera switching. Methods such as color histograms, inter-frame differences, image feature similarity, or video-splitter tools can be used to split the video.
  • scene splitting is usually based on criteria such as spatial or temporal transitions between shots. For example, when a video changes from reality to a memory narrative, this is a temporal scene change; a change from an indoor setting to an airport is a spatial scene change.
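  • as one rough illustration of the color-histogram approach mentioned above (not the disclosed implementation; the threshold value and histogram settings are arbitrary assumptions), consecutive-frame HSV histograms can be compared and a cut reported where their similarity drops.

```python
import cv2  # OpenCV, assumed available
from typing import List

def detect_shot_cuts(video_path: str, threshold: float = 0.6) -> List[int]:
    """Rough shot-cut detection by HSV color-histogram difference.

    A cut is reported when the correlation between consecutive frame
    histograms drops below `threshold` (an illustrative value).
    Returns the frame indices at which new shots start.
    """
    cap = cv2.VideoCapture(video_path)
    cuts, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if similarity < threshold:
                cuts.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return cuts
```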
  • a video is usually spliced from one or more shots.
  • the second partial frame may be a partial frame in multiple frames corresponding to one shot.
  • the second frame information may include at least one of image features, face features, and audio features corresponding to the second partial frame.
  • the process of performing classification processing on shots to obtain a shot classification result may be completed by a shot granularity-based boundary classification model.
  • the open source dataset "MovieScenes dataset” and the labeled data in the local database can be combined to encode whether each shot in the feature film is a boundary, and jointly train the boundary classification model.
  • partial frames may be extracted from each shot of the main video to obtain a second partial frame.
  • the shot classification result of each shot in the video can be obtained, and the judgment of whether the shot is a boundary can be realized.
  • the boundary classification result can also be expressed in the form of 0 or 1, where 1 indicates that the frame corresponding to the result is a boundary point at the shot granularity and 0 indicates a non-boundary point. If a shot is judged to be a boundary, the start time of the shot can be used as the start time for scene splitting, so as to split the main video into event scenes.
  • different strategy models may be used to extract tags for different levels of video data.
  • the video editing method may further include: for each shot: acquiring a fourth partial frame corresponding to the shot; performing feature extraction on each frame in the fourth partial frame to obtain a second feature extraction result.
  • a shot label of the shot is determined according to the second feature extraction result.
  • the shot layer may use an image granularity model to extract shot labels.
  • the image granularity model may, for example, use a face detection model to identify different stars; use an object detection model to detect different objects, such as guns and flags; use an object attribute model to detect character attributes, such as character image type and clothing type; and detect aesthetics, picture style, and the like. Because videos are long, if labels were generated from all sampled key frames, the image labels of different frames would still be highly redundant, and sending every frame into the image model for analysis slows down video-level analysis. Instead, based on the shot granularity, a small number of frames are taken to obtain the fourth partial frame, and the second feature extraction result obtained by analyzing these frames and averaging the results can be used as the shot label.
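  • the frame-sampling-and-averaging strategy for shot labels could be sketched as below; the image_model callable, the top_k value and the score threshold are assumptions introduced for illustration, not the disclosed models or parameters.

```python
import numpy as np
from typing import Callable, List, Sequence

def shot_labels(frames: Sequence,
                image_model: Callable,
                label_names: List[str],
                top_k: int = 3,
                threshold: float = 0.5) -> List[str]:
    """Average the per-label scores of a few sampled frames of one shot and
    keep the highest-scoring labels as the shot labels.

    frames: the small frame sample of the shot (the "fourth partial frame").
    image_model: any callable returning a per-label score vector for one frame
    (standing in for the face / object / attribute models); assumed here.
    """
    scores = np.mean([image_model(f) for f in frames], axis=0)
    ranked = np.argsort(scores)[::-1][:top_k]
    return [label_names[i] for i in ranked if scores[i] >= threshold]
```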
  • some frames are selectively sampled, and shot tags are determined with relatively low resources, which not only ensures the tag recall rate but also improves the overall analysis speed of the video.
  • the video editing method may further include: for each event scene: acquiring a third partial frame corresponding to each target shot included in the event scene; performing feature extraction on each frame in the third partial frame, Obtain the first feature extraction result.
  • a scene label of the event scene is determined according to the first feature extraction result.
  • the scene layer may adopt a video-granularity temporal model to provide the video with scene, activity and action tags, such as airport, living room, sleeping, and conversation.
  • for each target shot included in an event scene, some frames can be sampled to obtain the third partial frame, which is used as the input of the scene model, so as to obtain the scene label of each event scene and thereby determine information such as the place where the event scene takes place, the activities, the actions, and so on.
  • the number of frames captured for different shots in an event scene may be different.
  • the number of frames sampled from each shot may be positively correlated with the shot duration.
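  • one simple way to realize the duration-proportional sampling described above is sketched here (the budget value, the minimum per shot and the example durations are illustrative assumptions, not disclosed parameters).

```python
from typing import List

def allocate_frames(shot_durations: List[float],
                    total_budget: int,
                    min_per_shot: int = 1) -> List[int]:
    """Allocate an event scene's frame-sampling budget across its shots,
    proportionally to shot duration (longer shots get more sampled frames)."""
    total = sum(shot_durations)
    return [max(min_per_shot, round(total_budget * d / total))
            for d in shot_durations]

# e.g. a 3-shot scene with durations 2 s, 10 s and 28 s and a budget of
# 20 frames yields roughly [1, 5, 14] sampled frames.
counts = allocate_frames([2.0, 10.0, 28.0], 20)
```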
  • the video editing method may further include performing at least one of the following processes on the video frame sequence information: when the first target video frame sequence is detected, determining that the first target video frame sequence is a video frame sequence of the opening or ending credits of the initial video; and when the second target video frame sequence is detected, determining that the second target video frame sequence is a video frame sequence of the opening or ending credits of the initial video.
  • the video frame sequence information includes the video frame sequence in the initial video and the audio corresponding to the video frame sequence.
  • the first target video frame sequence includes video frames whose subtitles are located at the first position of the video frame picture.
  • the audio corresponding to the second target video frame sequence is audio of the target type.
  • the main video is determined according to at least one of the first target video frame sequence and the second target video frame sequence.
  • the initial video may be a video that includes opening and ending credits.
  • the credits detection model can detect the opening and ending credits of the video, and the model can be realized through subtitle detection and audio feature classification.
  • for subtitles: in the feature film, subtitles usually appear at the bottom of the picture, whereas in the opening and ending credits, subtitles often appear within the picture. Therefore, the first position can be defined to include positions within the picture.
  • for audio: there is generally no actor dialogue audio in the opening and ending credits; most of it is pure music, or music mixed with special effects and background sounds. Therefore, the target type can be defined to include pure music, background sound with special effects, and the like.
  • the start and end positions of the main video may be determined through the credits detection model. Specifically, the audio features within a sliding window of one second can be classified as belonging to the opening or ending credits or not. By combining the results of subtitle position detection and audio feature detection, the beginning and end of the video can be time-marked at one-second granularity, so as to determine the main video.
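  • combining the two per-second cues could look like the following sketch; the OR fusion rule and the longest-span heuristic are assumptions made here for illustration, not the disclosed combination logic.

```python
from typing import List, Tuple

def mark_credits(subtitle_in_picture: List[bool],
                 audio_is_credit_type: List[bool]) -> List[bool]:
    """Second-granularity credits marking.

    subtitle_in_picture: one boolean per second, True when subtitles are
    detected within the picture (rather than at the bottom).
    audio_is_credit_type: one boolean per second from the 1-second
    sliding-window audio classifier (pure music / effects + background sound).
    The two cues are fused with a simple OR here, one possible rule.
    """
    return [s or a for s, a in zip(subtitle_in_picture, audio_is_credit_type)]

def main_video_span(credit_mask: List[bool]) -> Tuple[int, int]:
    """Return (start_second, end_second) of the longest non-credit span,
    taken here as the main video."""
    best, best_start, cur_start, cur_len = 0, 0, None, 0
    for i, is_credit in enumerate(credit_mask + [True]):  # sentinel at the end
        if not is_credit:
            cur_start = i if cur_start is None else cur_start
            cur_len += 1
            if cur_len > best:
                best, best_start = cur_len, cur_start
        else:
            cur_start, cur_len = None, 0
    return best_start, best_start + best
```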
  • a screening method is provided before video processing, which can screen relatively main video content, and can effectively improve video editing efficiency.
  • the video editing method may further include: determining a plurality of third target video frame sequences in the main video; performing feature extraction on each third target video frame sequence to obtain a third feature extraction result; and determining the type label of the main video according to the plurality of third feature extraction results.
  • the type label of the main video can be obtained based on a long video sequence model of images and audio sequences.
  • comprehensive labels can be provided for the entire video, such as TV drama-family ethics, movie-sci-fi, etc.
  • the long-video timing model can set a maximum number of video frames to analyze during training in order to keep machine memory stable. If the number of video frames to analyze exceeds this threshold, a consecutive run of at most the maximum number of frames can be randomly intercepted as training input. During prediction, a non-overlapping sliding window can be used, taking the maximum number of frames each time to obtain the third target video frame sequences as input. By averaging the score results of all sliding windows, the type label of the main video is output.
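  • the prediction-time behaviour described above can be sketched as follows (the model callable and its interface are assumptions standing in for the long-video timing model; this is not the disclosed model).

```python
import numpy as np
from typing import Callable, List, Sequence

def predict_type_label(frames: Sequence,
                       model: Callable,
                       max_frames: int,
                       label_names: List[str]) -> str:
    """Non-overlapping sliding-window prediction of the main-video type label.

    frames: the sampled frames of the main video.
    model: assumed to map a window of at most `max_frames` consecutive frames
    to a vector of per-class scores.
    Window scores are averaged and the highest-scoring class is returned.
    """
    window_scores = [model(frames[start:start + max_frames])
                     for start in range(0, len(frames), max_frames)]
    mean_scores = np.mean(window_scores, axis=0)
    return label_names[int(np.argmax(mean_scores))]
```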
  • the first frame information may include at least one of the image features, audio features and text features corresponding to the first partial frame, and at least one of an image difference vector, an audio difference vector and a text difference vector between two adjacent frames in the first partial frame.
  • the second frame information may include at least one of image features and audio features corresponding to the second partial frame. It is not limited here.
  • processing based on information from each dimension of the video can effectively ensure the accuracy and richness of the video editing results.
  • Fig. 5 schematically shows a schematic diagram of the flow of a video editing method according to an embodiment of the present disclosure.
  • for an initial video, the opening and ending credits may first be detected through the credits detection model 510 and screened out to obtain the main video.
  • the main video can then be further processed to divide it into shot granularity, scene granularity and segment granularity. Based on the shot-granularity, scene-granularity and segment-granularity information obtained through this hierarchical division, the shot-level image labels, scene-level spatio-temporal labels, segment-level commentary videos, and program-level type labels for the initial video can be further determined.
  • a method for generating smart labels, smart splitting, and smart commentary is provided.
  • the whole method can effectively reduce the dependence on manual processing and improve the editing and processing speed of the video.
  • Video processing based on partial frames or key frames can mark the entire video at different levels, providing a basis for video storage indexing.
  • Fig. 6 schematically shows a block diagram of a video editing device according to an embodiment of the present disclosure.
  • the video editing device 600 includes a first processing module 610 , a first splitting module 620 and a video editing module 630 .
  • the first processing module 610 is configured to classify each event scene according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, to obtain a scene classification result.
  • the first splitting module 620 is used to split the main video into at least one video clip according to the start time information of the target event scene when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, wherein each video clip includes at least one event scene.
  • a video editing module 630 configured to perform video editing operations based on at least one video segment.
  • the video editing module includes a first determining unit, a second determining unit, a third determining unit, a fourth determining unit, a fifth determining unit and a generating unit.
  • the first determination unit is configured to determine a target video segment that needs to generate a commentary video.
  • the second determination unit is configured to determine the identification information of the target video segment according to the character features and text features in the target video segment.
  • the third determination unit is configured to determine the summary information of the target video segment according to the text features of the target video segment.
  • the fourth determination unit is configured to determine a target shot related to the summary information from the target video segment.
  • the fifth determining unit is configured to determine the title information of the target video segment according to the summary information.
  • the generation unit is used to generate the commentary video according to the identification information, summary information, target shot and title information.
  • the text feature includes line text in the target video segment.
  • the third determining unit includes a first determining subunit and an obtaining subunit.
  • the first determination subunit is used to determine the generator identifier of each line text in the text feature.
  • the obtaining subunit is configured to extract information from each line text marked with the generator ID to obtain summary information.
  • the text feature includes line text in the target video segment.
  • the fourth determination unit includes a second determination subunit, a third determination subunit, a fourth determination subunit, a fifth determination subunit and a sixth determination subunit.
  • the second determining subunit is used to determine the duration of the voice broadcast of the summary information.
  • the third determining subunit is configured to determine at least one line text associated with the summary information.
  • the fourth determining subunit is configured to determine, for each line text, a shot segment that matches the line text in time to obtain a plurality of shot segments.
  • the fifth determining subunit is configured to determine at least one target shot segment from a plurality of shot segments according to the voice broadcast duration. Wherein, the total duration of at least one target shot segment matches the duration of the voice broadcast.
  • a sixth determining subunit configured to determine at least one target shot segment as the target shot.
  • the video editing device further includes a second processing module and a second splitting module.
  • the second processing module is configured to classify each shot according to the second frame information corresponding to the second partial frame of at least one shot included in the main video, and obtain a shot classification result of each shot.
  • the second splitting module is used to split the main video into at least one event scene according to the start time information of the target shot when the shot classification result indicates that the target shot corresponding to the shot classification result is a scene segmentation point.
  • each event scene includes at least one shot.
  • the video editing device further includes a first feature extraction module and a first determination module.
  • the first feature extraction module is used, for each event scene, to acquire the third partial frame corresponding to each target shot included in the event scene, and to perform feature extraction on each frame in the third partial frame to obtain a first feature extraction result.
  • the first determining module is configured to determine the scene label of the event scene according to the first feature extraction result.
  • the video editing device further includes a second feature extraction module and a second determination module.
  • the second feature extraction module is configured to, for each shot: acquire a fourth partial frame corresponding to the shot; perform feature extraction on each frame in the fourth partial frame to obtain a second feature extraction result.
  • the second determination module is configured to determine the shot label of the shot according to the second feature extraction result.
  • the video editing device further includes a third determining module, a fourth determining module and a fifth determining module.
  • the third determination module is configured to determine that the first target video frame sequence is a sequence of video frames at the beginning or end of the initial video when the first target video frame sequence is detected.
  • the first target video frame sequence includes a video frame whose subtitle is located at the first position of the video frame picture.
  • the fourth determination module is configured to determine that the second target video frame sequence is a sequence of video frames at the beginning or end of the initial video when the second target video frame sequence is detected.
  • the audio corresponding to the second target video frame sequence is the audio of the target type.
  • the fifth determination module is configured to determine the main video according to at least one of the first target video frame sequence and the second target video frame sequence.
  • the video editing device further includes a sixth determination module, a third feature extraction module, and a seventh determination module.
  • the sixth determination module is configured to determine a plurality of third target video frame sequences in the main video.
  • the third feature extraction module is configured to perform feature extraction for each third target video frame sequence to obtain a third feature extraction result.
  • the seventh determination module is used to determine the type label of the main video according to the multiple third feature extraction results.
  • the first frame information includes at least one of the image features, audio features and text features corresponding to the first partial frame, and at least one of an image difference vector, an audio difference vector and a text difference vector between two adjacent frames in the first partial frame.
  • the second frame information includes at least one of image features and audio features corresponding to the second partial frame.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein, the memory stores instructions executable by at least one processor, and the instructions are processed by at least one The processor is executed, so that at least one processor can perform the method as described above.
  • non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method as described above.
  • a computer program product includes a computer program, and the computer program implements the above method when executed by a processor.
  • FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data necessary for the operation of the device 700.
  • the computing unit 701, ROM 702, and RAM 703 are connected to each other through a bus 704.
  • An input/output (I/O) interface 705 is also connected to the bus 704 .
  • the I/O interface 705 includes: an input unit 706, such as a keyboard, a mouse, etc.; an output unit 707, such as various types of displays, speakers, etc.; a storage unit 708, such as a magnetic disk, an optical disk, etc. ; and a communication unit 709, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 709 allows the device 700 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • the computing unit 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any appropriate processors, controllers, microcontrollers, and the like.
  • the calculation unit 701 executes various methods and processes described above, such as video editing methods.
  • the video editing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708.
  • part or all of the computer program may be loaded and/or installed on the device 700 via the ROM 702 and/or the communication unit 709.
  • the computer program When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the video editing method described above may be performed.
  • the computing unit 701 may be configured to execute the video editing method in any other suitable manner (for example, by means of firmware).
  • various implementations of the systems and techniques described herein can be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • the programmable processor may be a special-purpose or general-purpose programmable processor, which can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, so that, when executed by the processor or controller, the program code causes the functions/actions specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may include or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer discs, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • to provide interaction with a user, the systems and techniques described herein can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • the systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, eg, a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • the server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
  • Steps may be reordered, added, or deleted using the various forms of flow shown above.
  • Each step described in the present disclosure may be executed in parallel, sequentially, or in a different order; as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

Abstract

The present invention relates to a video editing method and apparatus, an electronic device, and a storage medium, and relates to the technical field of artificial intelligence, in particular to the fields of deep learning and video analysis. A specific implementation solution comprises the following operations: classifying each event scene according to first frame information corresponding to respective first partial frames of at least one event scene comprised in a main video section, so as to obtain a scene classification result; when the scene classification result indicates that a target event scene corresponding to the scene classification result is a segment split point, dividing the main video section into at least one video segment according to start moment information of the target event scene, each video segment comprising at least one event scene; and performing a video editing operation according to the at least one video segment.
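
To make the segmentation flow described above concrete, the following is a minimal illustrative sketch in Python, not the claimed implementation: the EventScene record, the classify_scene function, and the SPLIT_POINT label are all hypothetical, and the threshold-based classifier merely stands in for the trained model a real system would apply to the scene's partial frames. The sketch classifies each event scene, collects the start moments of scenes classified as segment split points, and cuts the main video section into segments at those moments.

from dataclasses import dataclass
from typing import List, Sequence, Tuple

SPLIT_POINT = "split_point"  # hypothetical label for a scene that opens a new segment


@dataclass
class EventScene:
    start_time: float                # start moment of the event scene, in seconds
    frame_features: Sequence[float]  # stand-in for the "first frame information"


def classify_scene(scene: EventScene) -> str:
    """Stand-in classifier; a real system would run a trained model on the partial frames."""
    mean_val = sum(scene.frame_features) / max(len(scene.frame_features), 1)
    return SPLIT_POINT if mean_val > 0.5 else "ordinary"


def split_main_section(scenes: List[EventScene], section_end: float) -> List[Tuple[float, float]]:
    """Divide the main video section into (start, end) segments at split-point scenes."""
    cut_times = [s.start_time for s in scenes if classify_scene(s) == SPLIT_POINT]
    boundaries = sorted({0.0, *cut_times, section_end})
    return list(zip(boundaries[:-1], boundaries[1:]))


if __name__ == "__main__":
    scenes = [
        EventScene(0.0,  [0.1, 0.2]),
        EventScene(30.0, [0.8, 0.9]),   # classified as a split point
        EventScene(75.0, [0.7, 0.6]),   # classified as a split point
    ]
    for start, end in split_main_section(scenes, section_end=120.0):
        print(f"segment: {start:.1f}s - {end:.1f}s")

Each resulting (start, end) pair covers at least one event scene and can then be handed to whatever downstream editing operation (trimming, re-encoding, splicing) the application requires.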
PCT/CN2022/104122 2021-08-02 2022-07-06 Procédé et appareil de montage vidéo, dispositif électronique et support de stockage WO2023011094A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110883507.6 2021-08-02
CN202110883507.6A CN113613065B (zh) 2021-08-02 2021-08-02 视频编辑方法、装置、电子设备以及存储介质

Publications (1)

Publication Number Publication Date
WO2023011094A1 true WO2023011094A1 (fr) 2023-02-09

Family

ID=78339126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104122 WO2023011094A1 (fr) 2021-08-02 2022-07-06 Procédé et appareil de montage vidéo, dispositif électronique et support de stockage

Country Status (2)

Country Link
CN (1) CN113613065B (fr)
WO (1) WO2023011094A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116405745A (zh) * 2023-06-09 2023-07-07 深圳市信润富联数字科技有限公司 视频信息的提取方法、装置、终端设备及计算机介质

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113613065B (zh) * 2021-08-02 2022-09-09 北京百度网讯科技有限公司 视频编辑方法、装置、电子设备以及存储介质
CN114245171B (zh) * 2021-12-15 2023-08-29 百度在线网络技术(北京)有限公司 视频编辑方法、装置、电子设备、介质
CN114222196A (zh) * 2022-01-04 2022-03-22 阿里巴巴新加坡控股有限公司 一种剧情解说短视频的生成方法、装置及电子设备
CN114257864B (zh) * 2022-02-24 2023-02-03 易方信息科技股份有限公司 一种基于HLS格式视频源场景下播放器的seek方法及装置
CN115174962A (zh) * 2022-07-22 2022-10-11 湖南芒果无际科技有限公司 预演仿真方法、装置、计算机设备及计算机可读存储介质
CN115460455B (zh) * 2022-09-06 2024-02-09 上海硬通网络科技有限公司 一种视频剪辑方法、装置、设备及存储介质

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005332206A (ja) * 2004-05-20 2005-12-02 Nippon Hoso Kyokai <Nhk> 映像イベント判別装置及びそのプログラム、並びに、映像イベント判別用学習データ生成装置及びそのプログラム
CN110121108A (zh) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 视频价值评估方法及装置
CN111246287A (zh) * 2020-01-13 2020-06-05 腾讯科技(深圳)有限公司 视频处理方法、发布方法、推送方法及其装置
CN111327945A (zh) * 2018-12-14 2020-06-23 北京沃东天骏信息技术有限公司 用于分割视频的方法和装置
CN111798879A (zh) * 2019-04-08 2020-10-20 百度(美国)有限责任公司 用于生成视频的方法和装置
US10999566B1 (en) * 2019-09-06 2021-05-04 Amazon Technologies, Inc. Automated generation and presentation of textual descriptions of video content
CN112800278A (zh) * 2021-03-30 2021-05-14 腾讯科技(深圳)有限公司 视频类型的确定方法和装置及电子设备
CN113014988A (zh) * 2021-02-23 2021-06-22 北京百度网讯科技有限公司 视频处理方法、装置、设备以及存储介质
CN113613065A (zh) * 2021-08-02 2021-11-05 北京百度网讯科技有限公司 视频编辑方法、装置、电子设备以及存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440640B (zh) * 2013-07-26 2016-02-10 北京理工大学 一种视频场景聚类及浏览方法
CN104394422B (zh) * 2014-11-12 2017-11-17 华为软件技术有限公司 一种视频分割点获取方法及装置
CN108777815B (zh) * 2018-06-08 2021-04-23 Oppo广东移动通信有限公司 视频处理方法和装置、电子设备、计算机可读存储介质
CN111382620B (zh) * 2018-12-28 2023-06-09 阿里巴巴集团控股有限公司 视频标签添加方法、计算机存储介质和电子设备
CN111538862B (zh) * 2020-05-15 2023-06-20 北京百度网讯科技有限公司 用于解说视频的方法及装置
CN112818906B (zh) * 2021-02-22 2023-07-11 浙江传媒学院 一种基于多模态信息融合理解的全媒体新闻智能编目方法

Also Published As

Publication number Publication date
CN113613065A (zh) 2021-11-05
CN113613065B (zh) 2022-09-09

Similar Documents

Publication Publication Date Title
WO2023011094A1 (fr) Procédé et appareil de montage vidéo, dispositif électronique et support de stockage
US11830241B2 (en) Auto-curation and personalization of sports highlights
WO2022116888A1 (fr) Procédé et dispositif de traitement de données vidéo, équipement et support
US9253511B2 (en) Systems and methods for performing multi-modal video datastream segmentation
US10410679B2 (en) Producing video bits for space time video summary
US10108709B1 (en) Systems and methods for queryable graph representations of videos
US20160014482A1 (en) Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
US8948515B2 (en) Method and system for classifying one or more images
US11436831B2 (en) Method and apparatus for video processing
US9189137B2 (en) Method and system for browsing, searching and sharing of personal video by a non-parametric approach
US8503523B2 (en) Forming a representation of a video item and use thereof
US20190130185A1 (en) Visualization of Tagging Relevance to Video
Sundaram et al. A utility framework for the automatic generation of audio-visual skims
US9569428B2 (en) Providing an electronic summary of source content
US20120278824A1 (en) System and method for evaluating visual worthiness of video data in a network environment
CN109408672B (zh) 一种文章生成方法、装置、服务器及存储介质
US20210117471A1 (en) Method and system for automatically generating a video from an online product representation
CN112511854A (zh) 一种直播视频精彩片段生成方法、装置、介质和设备
US9525896B2 (en) Automatic summarizing of media content
WO2023173539A1 (fr) Procédé et système de traitement de contenu vidéo, terminal et support de stockage
US10990828B2 (en) Key frame extraction, recording, and navigation in collaborative video presentations
CN117336572A (zh) 视频摘要生成方法、装置、计算机设备以及存储介质
Valdés et al. On-line video abstract generation of multimedia news
CN113407765B (zh) 视频分类方法、装置、电子设备和计算机可读存储介质
Rajarathinam et al. Analysis on video retrieval using speech and text for content-based information

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22851809

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE