WO2023011094A1 - Video editing method and apparatus, electronic device, and storage medium - Google Patents

Video editing method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2023011094A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
shot
segment
scene
Prior art date
Application number
PCT/CN2022/104122
Other languages
French (fr)
Chinese (zh)
Inventor
马彩虹
叶芷
Original Assignee
北京百度网讯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 filed Critical 北京百度网讯科技有限公司
Publication of WO2023011094A1 publication Critical patent/WO2023011094A1/en

Links

Images

Classifications

    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440227Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by decomposing into layers, e.g. base layer and one or more enhancement layers
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/84Generation or processing of descriptive data, e.g. content descriptors
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04N — PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, in particular to the field of deep learning and video analysis, and in particular to a video editing method, device, electronic equipment, and storage medium.
  • Video (Video) technology generally refers to the technology of capturing, recording, processing, storing, transmitting and reproducing a series of static images in the form of electrical signals.
  • editing operations such as segmentation, classification, description, and indexing can be performed on video data according to certain standards and rules.
  • the disclosure provides a video editing method, device, electronic equipment and storage medium.
  • a video editing method, including: classifying each event scene according to first frame information corresponding to the respective first partial frames of at least one event scene included in the main video, to obtain a scene classification result; when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, splitting the main video into at least one video segment according to the start time information of the target event scene, wherein each video segment includes at least one event scene; and performing a video editing operation based on the at least one video segment.
  • a video editing device, including: a first processing module, configured to classify each event scene according to first frame information corresponding to the respective first partial frames of at least one event scene included in the main video, to obtain a scene classification result; a first splitting module, configured to split the main video into at least one video segment according to the start time information of the target event scene when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, wherein each video segment includes at least one event scene; and a video editing module, configured to perform a video editing operation based on the at least one video segment.
  • an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
  • a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the above method.
  • a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
  • FIG. 1 schematically shows an exemplary system architecture to which a video editing method and device can be applied according to an embodiment of the present disclosure
  • Fig. 2 schematically shows a flow chart of a video editing method according to an embodiment of the present disclosure
  • Fig. 3 schematically shows a schematic diagram of video classification according to an embodiment of the present disclosure
  • Fig. 4 schematically shows an example diagram of dividing video clips according to event scenes according to an embodiment of the present disclosure
  • FIG. 5 schematically shows a schematic flowchart of a video editing method according to an embodiment of the present disclosure
  • Fig. 6 schematically shows a block diagram of a video editing device according to an embodiment of the present disclosure.
  • FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure.
  • the user's authorization or consent is obtained.
  • Video editing technology can use image, audio, subtitle, and other information in a video to analyze, summarize, and record video material based on its content and formal characteristics, and to organize and create various retrieval catalogs or retrieval methods, covering requirements such as video tagging and video stripping. For example, during editing, the timeline can be processed at the program layer, segment layer, scene layer, and shot layer; video keyword tags can be determined; video content can be described; and video titles can be explained.
  • the inventors found that video tagging, video stripping, and video description all need to combine the contextual structure information of the video to extract high-level semantic information.
  • Traditional processing mainly relies on manual work.
  • the amount of multimedia data is increasing rapidly.
  • video editors need to spend more manpower to complete video editing and storage.
  • Other solutions for implementing video editing include, for example: (1) shot detection technology that assists manual video splitting, in which video editing tools use video frame difference information to segment the video at the shot layer, and computer vision techniques such as face detection are used to split shots; (2) video description technology based on machine learning, for example video-caption technology, which uses the image and audio information of a video to produce a simple scene description such as "someone is doing something in a certain space"; and (3) intelligent video labeling technology based on machine learning, such as image classification, image detection, and video classification.
  • the inventors found that scheme (1) can only split at shot granularity; relying on single-frame or short-sequence image information, it cannot achieve higher-level semantic aggregation. Division at higher semantic levels (such as the scene layer and the segment layer) still needs to be completed with human assistance.
  • the AI model of scheme (2) requires a large amount of manual annotation for training, and its description of a scene is too simple and blunt to meet the needs of practical applications. Because video sequences, especially long dramas, contain a large amount of redundant temporal information, adopting scheme (3) to process all key frames indiscriminately leads to low processing efficiency.
  • the present disclosure introduces an automatic editing technology that realizes automatic editing requirements through a multi-modal, machine-learning-based system. It mainly addresses the parts of video editing that currently depend heavily on human understanding: intelligent labeling, automatic stripping, video description generation, and other functions.
  • Fig. 1 schematically shows an exemplary system architecture to which a video editing method and device can be applied according to an embodiment of the present disclosure.
  • the exemplary system architecture to which the video editing method and device can be applied may include a terminal device, and the terminal device may implement the video editing method and device provided by the embodiments of the present disclosure without interacting with a server.
  • a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wired and/or wireless communication links, among others.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, email clients and/or social platform software, etc. (only example).
  • the terminal devices 101, 102, 103 may be various electronic devices with display screens and supporting video browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers and the like.
  • the server 105 may be a server that provides various services, such as a background management server that supports content browsed by users using the terminal devices 101 , 102 , 103 (just an example).
  • the background management server can analyze and process received data such as user requests, and feed back processing results (such as webpages, information, or data obtained or generated according to user requests) to the terminal device.
  • the server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of traditional physical hosts and VPS ("Virtual Private Server") services, such as difficult management and weak business scalability.
  • the server can also be a server of a distributed system, or a server combined with a blockchain.
  • the video editing method provided by the embodiment of the present disclosure may be executed by the terminal device 101 , 102 , or 103 .
  • the video editing apparatus provided by the embodiment of the present disclosure may also be set in the terminal device 101 , 102 , or 103 .
  • the video editing method provided by the embodiment of the present disclosure may also be executed by the server 105 .
  • the video editing apparatus provided by the embodiments of the present disclosure may generally be set in the server 105 .
  • the video editing method provided by the embodiments of the present disclosure may also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101 , 102 , 103 and/or the server 105 .
  • the video editing apparatus provided by the embodiments of the present disclosure may also be set in a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101 , 102 , 103 and/or the server 105 .
  • the terminal devices 101, 102, and 103 can classify each event scene according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, Get the scene classification result. Then, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, the main video is split into at least one video segment according to the start time information of the target event scene. Wherein, each video segment includes at least one event scene. Afterwards, a video editing operation is performed based on at least one video segment.
  • a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or server 105 can process the main video according to the scene classification results of the video scenes in the main video, and implement video editing operations.
  • terminal devices, networks and servers in Fig. 1 are only illustrative. According to the implementation needs, there can be any number of terminal devices, networks and servers.
  • Fig. 2 schematically shows a flowchart of a video editing method according to an embodiment of the present disclosure.
  • the method includes operations S210 to S230.
  • In operation S210, each event scene is classified according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, to obtain a scene classification result.
  • In operation S220, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, the main video is split into at least one video segment according to the start time information of the target event scene, wherein each video segment includes at least one event scene.
  • In operation S230, a video editing operation is performed based on the at least one video segment.
  • the video can be divided into four levels according to the split granularity from small to large: shot layer, scene layer, segment layer, and program layer.
  • shot layer can refer to the continuous pictures recorded by the same camera at one time, that is, the shot picture.
  • scene layer may refer to a continuous video consisting of one or more shots covered by the same time and space with the same scene.
  • the segment layer can consist of one or more associated event scenes.
  • the program layer is generally a complete piece of video and audio data input.
  • Fig. 3 schematically shows a schematic view of video classification according to an embodiment of the present disclosure.
  • the feature video 300 includes two segments 310 and 320.
  • the segment 310 includes four scenes 330, 340, 350, and 360,
  • and the segment 320 corresponds to one scene 370.
  • each scene includes multiple shots; for example, the scene 330 includes four shots 331, 332, 333, and 334.
  • segment splitting is used to aggregate semantically continuous scenes, and the same continuous event scene is merged into one segment.
  • the criteria for separating the segments may include at least one of the following: (1) the two segments belong to different time and space; (2) the two segments are not closely related in event semantics.
  • segment splitting may manifest in the video in at least one of the following ways: (1) the splitting position is at a change of scene picture; (2) there are usually obvious audio changes at the splitting position, such as obvious silent pauses, background-music changes, or sudden changes in background noise such as whistles and car sounds; (3) the splitting position is often accompanied by obvious buffer shots, such as scenery shots without obvious characters, black-screen shots, and special shots such as spin-in and spin-out transitions; (4) the theme of the storyline changes before and after the split.
  • one or more event scenes may be included in the video.
  • the first partial frame may be a partial frame in multiple frames corresponding to an event scene.
  • the first frame information may include at least one of the image feature vectors, audio feature vectors, and text feature vectors corresponding to the first partial frame, as well as at least one of the inter-scene image difference vectors, inter-scene audio difference vectors, inter-scene text difference vectors, and the like.
  • the process of classifying event scenes to obtain scene classification results may be completed by a scene granularity-based boundary classification model.
  • partial frames may be extracted from each event scene to obtain the first partial frame.
  • the boundary classification result can be expressed as 0 or 1: a 1 indicates that the target event scene to which the corresponding frame belongs is a boundary point at the scene granularity, i.e., the target event scene to which a frame with classification result 1 belongs is a segment segmentation point, while 0 indicates a non-boundary point, i.e., a non-segmentation point.
  • segment segmentation point determined at the scene granularity can be embodied as an event scene.
  • In this way, the scene-granularity video feature film is re-divided into segment-granularity video segments.
  • Fig. 4 schematically shows an example diagram of dividing video segments according to event scenes according to an embodiment of the present disclosure.
  • feature video 400 includes five event scenes: 410 , 420 , 430 , 440 , 450 .
  • for each event scene, some frames can be extracted, and the number of frames extracted can differ between event scenes; for example, m frames 411, ..., 41m are extracted for the event scene 410, and n frames 451, ..., 45n are extracted for the event scene 450.
  • at least one of the feature vectors and difference vectors described above is input into the boundary classification model as the input feature, and a result such as 00000000100000000100 can be output.
  • two frames representing segmentation points can be determined.
  • the target event scene corresponding to the frame representing the segmentation point may be further determined, for example, 430 and 440 respectively.
  • the feature video can then be split according to the starting time points of 430 and 440 to obtain three video segments: 410 and 420 form the first video segment 460, 430 forms the second video segment 470, and 440 and 450 form the third video segment 480.
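  • As a minimal illustration of this splitting step (the data layout and helper names below are assumptions for illustration, not part of the disclosure), a Python sketch that maps per-frame boundary bits back to scenes and cuts at the flagged scenes' start times:
```python
def split_by_boundaries(scenes, frame_scene_ids, frame_bits):
    """scenes: ordered list of (scene_id, start_time) for each event scene.
    frame_scene_ids[i]: the event scene that sampled frame i belongs to.
    frame_bits: per-frame boundary output such as "00000000100000000100",
    where '1' marks a frame whose scene is a segment segmentation point.
    The feature video is cut at the start time of every flagged scene."""
    boundary_scenes = {sid for sid, bit in zip(frame_scene_ids, frame_bits) if bit == "1"}
    segments, current = [], []
    for sid, start in scenes:
        if sid in boundary_scenes and current:
            segments.append(current)  # cut before the flagged scene starts
            current = []
        current.append((sid, start))
    if current:
        segments.append(current)
    return segments

# For Fig. 4, flagging scenes 430 and 440 yields three segments corresponding
# to 460 (410, 420), 470 (430), and 480 (440, 450).
```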
  • for the feature video or a video segment, a corresponding commentary video can be generated; the commentary video can include a text introduction, a voice description, and multiple spliced shots.
  • scene layer segmentation information can be used to selectively sample some frames, so as to achieve efficient video clip editing effect with low resource consumption.
  • performing a video editing operation based on at least one video segment includes: determining a target video segment for which a commentary video needs to be generated; determining identification information of the target video segment according to the character features and text features in the target video segment; determining summary information of the target video segment according to the text features of the target video segment; determining a target shot related to the summary information from the target video segment; determining title information of the target video segment according to the summary information; and generating the commentary video according to the identification information, summary information, target shot, and title information.
  • the target video segment may be a video segment determined from at least one video segment obtained by splitting the main video.
  • the character features may include features represented by character information in the target video segment.
  • the text features may include features represented by at least one of subtitle information, voice information, etc. in the target video segment.
  • the identification information may include the video name corresponding to the target video segment, such as the title of a film and television drama segment.
  • the summary information may represent textual commentary for the target video segment.
  • the target shot may be one or more shots of the target video segment.
  • the title information may be a title or name redefined for the target video segment.
  • the target video segment may also be replaced by the main video; that is, the above operations can also be performed on the main video to generate a corresponding commentary video.
  • determining the identification information of the target video segment can be embodied as: aggregating the face detection results within the segment interval corresponding to the target video segment to obtain the star names of the characters in the segment; extracting subtitle keywords to obtain the film-and-television character names of those characters; and searching a film-and-television knowledge graph with the combination of star names and character names to obtain the name of the drama corresponding to the segment.
  • the drama name of each scene segment may first be determined through the aforementioned method. Since the drama names identified for different segments may differ, the different results can be scored and voted on, and the drama name with the most votes can be used as the final output for every segment within the same target video segment.
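  • The voting step can be sketched as a simple majority vote over the per-scene recognition results (the exact scoring rule is not specified by the disclosure; this is an assumed simplification):
```python
from collections import Counter

def vote_drama_name(per_scene_names):
    """per_scene_names: the drama name recognised for each scene of one target
    video segment (None where recognition failed). The name with the most
    votes is returned as the final output for the whole segment."""
    votes = Counter(name for name in per_scene_names if name)
    return votes.most_common(1)[0][0] if votes else None

# vote_drama_name(["Drama A", "Drama A", None, "Drama B"]) -> "Drama A"
```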
  • the video commentary is generated for the main video and the video segment, which can further improve the richness of the video editing result.
  • the text feature includes line text in the target video segment.
  • Determining the abstract information of the target video segment according to the text feature of the target video segment includes: determining the generator identifier of each line text in the text feature.
  • Information extraction is performed on each line text marked with the generator ID to obtain summary information.
  • the process of determining the summary information of the target video segment according to the text features of the target video segment may be completed by a text summary model.
  • for training, the text summary model can take the line text corresponding to an input segment as input and the summary description corresponding to that segment as the target output.
  • the segment summaries for training can be derived from the corresponding plot introductions of TV dramas and movies on the Internet.
  • the text summarization model obtained through training may use the subtitle text of the target video segment as input to generate summary information of the target video segment.
  • the subtitle text may include the narrator and the narration content, and the narrator may include the names of the characters in the film and television drama.
  • the names of characters in film and television dramas can be determined by first obtaining the stars through face detection, and then obtaining the names of the characters from the corresponding relationship between stars and roles.
  • the narrative content may include at least one of lines and subtitle content.
  • an intelligent summary information generation method is provided, and the summary information is determined based on the character and line information of the target video segment, which can effectively improve the accuracy and completeness of the summary description.
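  • One way to assemble the speaker-tagged input described above is sketched below; the "speaker: line" serialization and the summarizer interface are assumptions rather than the disclosure's exact format:
```python
def build_summary_input(tagged_lines):
    """tagged_lines: list of (character_name, line_text) pairs, where the
    character name (the generator identifier) comes from face detection plus
    the star-to-role mapping. The pairs are serialised into one document for
    a trained text summary model to consume."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in tagged_lines)

# usage (the summarizer object is hypothetical):
# summary = summarizer.generate(build_summary_input(tagged_lines))
```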
  • the text feature includes line text in the target video segment. Determining the target shot related to the summary information from the target video segment includes: determining the duration of the voice broadcast for the summary information. At least one line text associated with the summary information is determined. For each line text, determine the shot fragments that match the line text in time, and obtain multiple shot fragments. According to the voice broadcast duration, at least one target shot segment is determined from the plurality of shot segments. Wherein, the total duration of at least one target shot segment matches the duration of the voice broadcast. At least one target shot segment is determined as a target shot.
  • a self-attention operation can be introduced over the temporal text features, and the self-attention value can represent the contribution of a given line text to the final output summary information.
  • the shot-level video with the highest temporal overlap with a line text, or a shot-level video otherwise associated with the line text, may be selected as the shot segment corresponding to that line text.
  • the voice broadcast duration of the summary information can be calculated and determined according to the AI automatic broadcast speech rate.
  • At least one target shot segment may be selected according to the shot score from high to low until the selected shot segment can fill the entire voice broadcast duration.
  • the shot score can be the normalized score corresponding to self-attention.
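  • The duration-matched selection can be sketched as a greedy pick by normalized self-attention score; the speech-rate constant and the data layout are illustrative assumptions:
```python
def pick_target_shots(shot_segments, summary_text, chars_per_second=4.0):
    """shot_segments: list of (start, end, score), where score is the normalized
    self-attention value of the line text matched to that shot. Shots are taken
    from the highest score downwards until their total length covers the
    estimated voice broadcast duration of the summary."""
    broadcast_duration = len(summary_text) / chars_per_second
    chosen, total = [], 0.0
    for start, end, _score in sorted(shot_segments, key=lambda s: s[2], reverse=True):
        if total >= broadcast_duration:
            break
        chosen.append((start, end))
        total += end - start
    return sorted(chosen)  # replay the selected target shots in time order
```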
  • the richness of the commentary video can be further increased.
  • the process of determining the title information of the target video segment according to the abstract information may be completed by prediction of a chapter title generation model.
  • a large number of movie and TV drama plot introductions, as well as titles corresponding to episodes and segments can be crawled through the Internet. Through these data, the chapter title generation model can be trained.
  • the title information can be predicted by inputting the summary information into the above chapter title generation model.
  • the title information of the target video segment is determined according to the summary information.
  • generating the commentary video can be embodied as: playing the previously selected target shots in time order, and adding the video name, the segment title, the text summary as subtitles, and an AI voice broadcast of the summary, to obtain the commentary video for the target video segment.
  • the obtained commentary video can efficiently reflect the target video segment, and effectively ensure completeness, accuracy and richness.
  • the video editing method may further include: according to the second frame information corresponding to the second partial frame of at least one shot included in the main video, classify each shot to obtain the Shot classification results.
  • when the shot classification result indicates that the target shot corresponding to the shot classification result is a scene segmentation point, the main video is split into at least one event scene according to the start time information of the target shot.
  • each event scene includes at least one shot.
  • the criteria for shot splitting may include camera switching. Methods such as color histograms, inter-frame differences, image feature similarity, or video-splitter tools can be used to split the video.
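  • A minimal color-histogram variant of such shot splitting, assuming OpenCV is available (the threshold and histogram settings are illustrative values, not parameters from the disclosure):
```python
import cv2

def detect_shot_boundaries(video_path, threshold=0.5):
    """Return frame indices at which a new shot is assumed to start, based on
    a drop in HSV colour-histogram correlation between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(index)  # large histogram change -> shot cut
        prev_hist, index = hist, index + 1
    cap.release()
    return boundaries
```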
  • scene splits are usually based on criteria such as spatial or temporal transitions between shots. For example, when a video changes from reality to a memory narrative, this is a temporal scene change; when a video changes from indoors to an airport, this is a spatial scene change.
  • a video is usually spliced from one or more shots.
  • the second partial frame may be a partial frame in multiple frames corresponding to one shot.
  • the second frame information may include at least one of image features, face features, and audio features corresponding to the second partial frame.
  • the process of performing classification processing on shots to obtain a shot classification result may be completed by a shot granularity-based boundary classification model.
  • the open source dataset "MovieScenes dataset” and the labeled data in the local database can be combined to encode whether each shot in the feature film is a boundary, and jointly train the boundary classification model.
  • partial frames may be extracted from each shot of the main video to obtain a second partial frame.
  • the shot classification result of each shot in the video can be obtained, and the judgment of whether the shot is a boundary can be realized.
  • the boundary classification result can also be expressed in the form of 0 or 1, 1 can indicate that the frame corresponding to the result is a boundary point under the shot granularity, and 0 is a non-boundary point. If it is judged that a shot is a boundary, the start time of the shot can be used as the start time of scene splitting, so as to split the main video into event scenes.
  • different strategy models may be used to extract tags for different levels of video data.
  • the video editing method may further include: for each shot: acquiring a fourth partial frame corresponding to the shot; performing feature extraction on each frame in the fourth partial frame to obtain a second feature extraction result.
  • a shot label of the shot is determined according to the second feature extraction result.
  • the shot layer may use an image granularity model to extract shot labels.
  • examples of image-granularity models: a face detection model to identify different stars; an object detection model to detect different objects, such as guns and flags; an object attribute model to detect character attributes, such as character image type and clothing type; and models that evaluate picture aesthetics, picture style, and so on. Because the video is long, sampling all key frames to generate labels would still leave great redundancy between the image labels of different frames, and video-level analysis would be slow if every frame went through the image model. Instead, based on the shot granularity, a small number of frames is taken to obtain the fourth partial frame, and the second feature extraction result obtained by analyzing these frames and averaging can be used as the shot label.
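  • A sketch of this sparse-sampling idea; the image model interface and the sampling constant are placeholders, and only the few-frames-then-average pattern reflects the paragraph above:
```python
import numpy as np

def shot_label(shot_frames, image_model, samples=3):
    """Take only a few frames from the shot (the fourth partial frame), run the
    image-granularity model on each, average the class scores, and use the best
    class as the shot label. image_model(frame) is assumed to return a score
    vector over the label set."""
    step = max(len(shot_frames) // samples, 1)
    sampled = shot_frames[::step][:samples]
    mean_scores = np.mean([image_model(f) for f in sampled], axis=0)
    return int(np.argmax(mean_scores))
```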
  • some frames are selectively sampled, and shot tags are determined with relatively low resources, which not only ensures the tag recall rate but also improves the overall analysis speed of the video.
  • the video editing method may further include: for each event scene: acquiring a third partial frame corresponding to each target shot included in the event scene; performing feature extraction on each frame in the third partial frame, Obtain the first feature extraction result.
  • a scene label of the event scene is determined according to the first feature extraction result.
  • the scene layer may adopt a video-granularity temporal model to provide the video with tags such as scene, activity, and action, for example tags such as airport, living room, sleeping, and conversation.
  • for each target shot included in the event scene, some frames can be sampled to obtain the third partial frame, which serves as the input of the scene model, so as to obtain the scene label of each event scene and thereby determine information about the event scene such as the place where it takes place, the activities, and the actions.
  • the number of frames captured for different shots in an event scene may be different.
  • the frame number of each shot may be positively correlated with the shot duration.
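  • One possible duration-proportional allocation is shown below; every constant here is an assumption, as the disclosure only states that the count grows with shot duration:
```python
def frames_per_shot(shot_durations, floor=2, per_second=0.5, cap=16):
    """Give every shot a small fixed floor of sampled frames plus a count that
    grows with the shot's duration in seconds, clipped to an upper bound, so
    longer shots contribute more frames to the scene model input."""
    return [min(cap, floor + int(duration * per_second)) for duration in shot_durations]

# frames_per_shot([1.0, 8.0, 40.0]) -> [2, 6, 16]
```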
  • the video editing method may further include performing at least one of the following on the video frame sequence information: when the first target video frame sequence is detected, determining that the first target video frame sequence is a sequence of video frames of the opening or closing credits of the initial video; and when the second target video frame sequence is detected, determining that the second target video frame sequence is a sequence of video frames of the opening or closing credits of the initial video.
  • the video frame sequence information includes the video frame sequences in the initial video and the audio corresponding to each video frame sequence,
  • the first target video frame sequence includes video frames whose subtitles are located at a first position of the video frame picture, and
  • the audio corresponding to the second target video frame sequence is audio of a target type.
  • the main video is determined according to at least one of the first target video frame sequence and the second target video frame sequence.
  • the initial video may be a video including credits and credits.
  • the opening-and-closing-credits detection model can detect the beginning and the end of the video; the model can be implemented through subtitle detection and audio feature classification.
  • regarding subtitles: in the feature film, subtitles usually appear at the bottom of the picture, while in the opening and closing credits, subtitles often appear within the picture itself. Therefore, the first position can be defined to include positions within the picture.
  • regarding audio: there is generally no actor dialogue at the beginning and end of the film; most of the audio is pure music, or music mixed with some special effects and background sounds. Therefore, the target types can be defined to include pure music, background sound with special effects, and the like.
  • the start and end positions of the main video may be determined through the credits detection model. Specifically, the audio features within a sliding window of one second can be classified as belonging to the opening or closing credits or not. By combining the results of subtitle position detection and audio feature detection, the beginning and end of the video can be time-marked at second granularity, so as to determine the main video.
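  • A simplified fusion of the two per-second cues might look like the sketch below; the AND rule and the boolean inputs are assumptions, as the disclosure only states that the two detections are combined:
```python
def mark_main_video(subtitle_in_picture, audio_is_credit_type):
    """subtitle_in_picture[i] / audio_is_credit_type[i]: per-second booleans from
    the subtitle-position detector and the 1-second sliding-window audio
    classifier. A second is treated as opening/closing credits when both cues
    agree; the main video is the span of the remaining seconds."""
    is_credit = [s and a for s, a in zip(subtitle_in_picture, audio_is_credit_type)]
    main_seconds = [i for i, credit in enumerate(is_credit) if not credit]
    if not main_seconds:
        return 0, 0
    return main_seconds[0], main_seconds[-1] + 1  # [start, end) in seconds
```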
  • a screening method is provided before video processing, which can screen relatively main video content, and can effectively improve video editing efficiency.
  • the video editing method may further include: determining a plurality of third target video frame sequences in the main video; performing feature extraction on each third target video frame sequence to obtain a third feature extraction result; and determining the type label of the main video according to the plurality of third feature extraction results.
  • the type label of the main video can be obtained based on a long video sequence model of images and audio sequences.
  • comprehensive labels can be provided for the entire video, such as TV drama-family ethics, movie-sci-fi, etc.
  • during training, the long-video temporal model can set a maximum number of video analysis frames to keep machine memory stable; if the number of analyzed video frames exceeds the threshold, a run of at most the maximum number of consecutive frames can be randomly intercepted as the training input. During prediction, a non-overlapping sliding window can be used, taking the maximum number of frames each time to obtain the third target video frame sequences as input; the average of all sliding-window scores is output as the type label of the main video.
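  • A sketch of that prediction-time sliding window (the model interface and the maximum frame count are placeholders):
```python
import numpy as np

def program_type_label(frames, window_model, max_frames=512):
    """Split the main video's frames into non-overlapping windows of at most
    max_frames, score each window with the long-video temporal model, and
    average the window scores; the best class is the program-level type label."""
    window_scores = [window_model(frames[i:i + max_frames])
                     for i in range(0, len(frames), max_frames)]
    return int(np.argmax(np.mean(window_scores, axis=0)))
```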
  • the first frame information may include at least one of image features, audio features, and text features corresponding to the first partial frame, and an image difference vector between two adjacent frames in the first partial frame, At least one of an audio delta vector and a text delta vector.
  • the second frame information may include at least one of image features and audio features corresponding to the second partial frame. It is not limited here.
  • processing based on information from each dimension of the video can effectively ensure the accuracy and richness of the video editing results.
  • Fig. 5 schematically shows a schematic diagram of the flow of a video editing method according to an embodiment of the present disclosure.
  • for an initial video, the opening and closing credits may first be detected through the credits detection model 510 and screened out to obtain the main video.
  • the main video can then be further processed and divided at shot granularity, scene granularity, and segment granularity. Based on the shot-granularity, scene-granularity, segment-granularity, and other information obtained by this hierarchical division, the shot-level image labels, scene-level spatio-temporal labels, segment-level video commentary, and program-level types produced by editing the initial video can be further determined.
  • a method for generating smart labels, smart splitting, and smart commentary is provided.
  • the whole method can effectively reduce the dependence on manual processing and improve the editing and processing speed of the video.
  • Video processing based on partial frames or key frames can mark the entire video at different levels, providing a basis for video storage indexing.
  • Fig. 6 schematically shows a block diagram of a video editing device according to an embodiment of the present disclosure.
  • the video editing device 600 includes a first processing module 610 , a first splitting module 620 and a video editing module 630 .
  • the first processing module 610 is configured to classify each event scene according to the first frame information corresponding to the first partial frame of at least one event scene included in the main video, to obtain a scene classification result.
  • the first splitting module 620 is configured to split the main video into at least one video segment according to the start time information of the target event scene when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point. Each video segment includes at least one event scene.
  • a video editing module 630 configured to perform video editing operations based on at least one video segment.
  • the video editing module includes a first determining unit, a second determining unit, a third determining unit, a fourth determining unit, a fifth determining unit and a generating unit.
  • the first determination unit is configured to determine a target video segment that needs to generate a commentary video.
  • the second determination unit is configured to determine the identification information of the target video segment according to the character features and text features in the target video segment.
  • the third determination unit is configured to determine the summary information of the target video segment according to the text features of the target video segment.
  • the fourth determination unit is configured to determine a target shot related to the summary information from the target video segment.
  • the fifth determining unit is configured to determine the title information of the target video segment according to the summary information.
  • the generation unit is used to generate the commentary video according to the identification information, summary information, target shot and title information.
  • the text feature includes line text in the target video segment.
  • the third determining unit includes a first determining subunit and an obtaining subunit.
  • the first determination subunit is used to determine the generator identifier of each line text in the text feature.
  • the obtaining subunit is configured to extract information from each line text marked with the generator ID to obtain summary information.
  • the text feature includes line text in the target video segment.
  • the fourth determination unit includes a second determination subunit, a third determination subunit, a fourth determination subunit, a fifth determination subunit and a sixth determination subunit.
  • the second determining subunit is used to determine the duration of the voice broadcast of the summary information.
  • the third determining subunit is configured to determine at least one line text associated with the summary information.
  • the fourth determining subunit is configured to determine, for each line text, a shot segment that matches the line text in time to obtain a plurality of shot segments.
  • the fifth determining subunit is configured to determine at least one target shot segment from a plurality of shot segments according to the voice broadcast duration. Wherein, the total duration of at least one target shot segment matches the duration of the voice broadcast.
  • a sixth determining subunit configured to determine at least one target shot segment as the target shot.
  • the video editing device further includes a second processing module and a second splitting module.
  • the second processing module is configured to classify each shot according to the second frame information corresponding to the second partial frame of at least one shot included in the main video, and obtain a shot classification result of each shot.
  • the second splitting module is configured to split the main video into at least one event scene according to the start time information of the target shot when the shot classification result indicates that the target shot corresponding to the shot classification result is a scene segmentation point.
  • each event scene includes at least one shot.
  • the video editing device further includes a first feature extraction module and a first determination module.
  • the first feature extraction module is configured to, for each event scene: acquire the third partial frame corresponding to each target shot included in the event scene, and perform feature extraction on each frame in the third partial frame to obtain a first feature extraction result.
  • the first determining module is configured to determine the scene label of the event scene according to the first feature extraction result.
  • the video editing device further includes a second feature extraction module and a second determination module.
  • the second feature extraction module is configured to, for each shot: acquire a fourth partial frame corresponding to the shot; perform feature extraction on each frame in the fourth partial frame to obtain a second feature extraction result.
  • the second determination module is configured to determine the shot label of the shot according to the second feature extraction result.
  • the video editing device further includes a third determining module, a fourth determining module and a fifth determining module.
  • the third determination module is configured to determine that the first target video frame sequence is a sequence of video frames at the beginning or end of the initial video when the first target video frame sequence is detected.
  • the first target video frame sequence includes a video frame whose subtitle is located at the first position of the video frame picture.
  • the fourth determination module is configured to determine that the second target video frame sequence is a sequence of video frames at the beginning or end of the initial video when the second target video frame sequence is detected.
  • the audio corresponding to the second target video frame sequence is the audio of the target type.
  • the fifth determination module is configured to determine the main video according to at least one of the first target video frame sequence and the second target video frame sequence.
  • the video editing device further includes a sixth determination module, a third feature extraction module, and a seventh determination module.
  • the sixth determination module is configured to determine a plurality of third target video frame sequences in the main video.
  • the third feature extraction module is configured to perform feature extraction for each third target video frame sequence to obtain a third feature extraction result.
  • the seventh determination module is used to determine the type label of the main video according to the multiple third feature extraction results.
  • the first frame information includes at least one of image features, audio features, and text features corresponding to the first partial frame, and image difference vectors, audio At least one of a delta vector and a text delta vector.
  • the second frame information includes at least one of image features and audio features corresponding to the second partial frame.
  • the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
  • an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method described above.
  • non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to execute the method as described above.
  • a computer program product includes a computer program, and the computer program implements the above method when executed by a processor.
  • FIG. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • the device 700 includes a computing unit 701, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the device 700.
  • the computing unit 701, ROM 702, and RAM 703 are connected to each other through a bus 704.
  • An input/output (I/O) interface 705 is also connected to the bus 704 .
  • components connected to the I/O interface 705 include: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disk; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver.
  • the communication unit 709 allows the device 700 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • the computing unit 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, central processing units (CPU), graphics processing units (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSP), and any appropriate processors, controllers, microcontrollers, and so on.
  • the calculation unit 701 executes various methods and processes described above, such as video editing methods.
  • the video editing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 708.
  • part or all of the computer program may be loaded and/or installed on the device 700 via the ROM 702 and/or the communication unit 709.
  • the computer program When the computer program is loaded into RAM 703 and executed by computing unit 701, one or more steps of the video editing method described above may be performed.
  • the computing unit 701 may be configured to execute the video editing method in any other suitable manner (for example, by means of firmware).
  • Various implementations of the systems and techniques described above can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof.
  • the programmable processor can be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may include or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • more specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • the systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, eg, a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.
  • A computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
  • The server can be a cloud server, a server of a distributed system, or a server combined with a blockchain.
  • Steps may be reordered, added, or deleted using the various forms of flow shown above.
  • Each step described in the present disclosure may be executed in parallel, sequentially, or in a different order; as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

Abstract

The present disclosure provides a video editing method and apparatus, an electronic device, and a storage medium, and relates to the technical field of artificial intelligence, in particular to the field of deep learning and video analysis. A specific implementation solution is as follows: classifying each event scene according to first frame information corresponding to respective first partial frames of at least one event scene comprised in a video main section, so as to obtain a scene classification result; when the scene classification result indicates that a target event scene corresponding to the scene classification result is a segment segmentation point, splitting the video main section according to start time information of the target event scene into at least one video segment, wherein each video segment comprises at least one event scene; and performing a video editing operation on the basis of the at least one video segment.

Description

Video editing method and apparatus, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202110883507.6, filed on August 2, 2021, the contents of which are hereby incorporated by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence technology, in particular to the fields of deep learning and video analysis, and specifically to a video editing method and apparatus, an electronic device, and a storage medium.
Background
Video technology generally refers to techniques for capturing, recording, processing, storing, transmitting, and reproducing a series of still images in the form of electrical signals. In the related art, editing operations such as segmentation, classification, cataloging, and indexing can be performed on video material according to certain standards and rules.
Summary
The present disclosure provides a video editing method and apparatus, an electronic device, and a storage medium.
According to one aspect of the present disclosure, a video editing method is provided, including: classifying each event scene according to first frame information corresponding to respective first partial frames of at least one event scene included in a main video (the feature portion of a video), to obtain a scene classification result; when the scene classification result indicates that a target event scene corresponding to the scene classification result is a segment segmentation point, splitting the main video into at least one video segment according to start time information of the target event scene, where each video segment includes at least one event scene; and performing a video editing operation based on the at least one video segment.
According to another aspect of the present disclosure, a video editing apparatus is provided, including: a first processing module configured to classify each event scene according to first frame information corresponding to respective first partial frames of at least one event scene included in a main video, to obtain a scene classification result; a first splitting module configured to, when the scene classification result indicates that a target event scene corresponding to the scene classification result is a segment segmentation point, split the main video into at least one video segment according to start time information of the target event scene, where each video segment includes at least one event scene; and a video editing module configured to perform a video editing operation based on the at least one video segment.
According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method described above.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the method described above.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method described above.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The accompanying drawings are used for a better understanding of the solution and do not constitute a limitation of the present disclosure. In the drawings:
FIG. 1 schematically shows an exemplary system architecture to which a video editing method and apparatus can be applied according to an embodiment of the present disclosure;
FIG. 2 schematically shows a flowchart of a video editing method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a schematic diagram of video levels according to an embodiment of the present disclosure;
FIG. 4 schematically shows an example diagram of dividing video segments according to event scenes according to an embodiment of the present disclosure;
FIG. 5 schematically shows a schematic flowchart of a video editing method according to an embodiment of the present disclosure;
FIG. 6 schematically shows a block diagram of a video editing apparatus according to an embodiment of the present disclosure; and
FIG. 7 shows a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved all comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and public order and good customs are not violated.
In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is acquired or collected.
Video editing technology can use the image, audio, subtitle, and other information in a video to analyze, summarize, and record video material based on its content and formal characteristics, and to organize and produce various retrieval catalogs or retrieval paths, covering requirements such as video tagging and video stripping (splitting a video into clips). For example, during editing, timeline marking at the program level, segment level, scene level, and shot level, determination of video keyword tags, video content description, and video title description can be performed on a video.
In the process of realizing the concept of the present disclosure, the inventors found that video tagging, video stripping, and video description need to combine the contextual structure information of the video to extract high-level semantic information. Traditional processing mainly relies on manual work. At present, the amount of multimedia data is growing rapidly; in order to keep the speed of manual processing in balance with the growth of media data, video editing requires ever more manpower to complete the editing and warehousing of videos.
Some other solutions for implementing video editing include, for example: (1) shot detection technology assisting manual video splitting, where video editing tools use inter-frame difference information to split a video at the shot level, and computer vision technology uses face detection to split shots; (2) video description technology based on machine learning, such as video-caption technology, which uses the image and audio information of a video to produce a simple scene description such as "someone is doing something in some place"; and (3) intelligent video tagging technology based on machine learning, such as image classification, image detection, and video classification.
In the process of realizing the concept of the present disclosure, the inventors found that solution (1), which uses single-frame image information or short-time-sequence image information, can only achieve shot-granularity splitting and cannot achieve higher-level semantic aggregation; division at higher semantic levels (such as the scene level and the segment level) still needs to be completed with manual assistance. The AI model of solution (2) requires a large amount of manual annotation for model training, and the model's description of a scene is too simple and rigid to meet the needs of practical applications. Because a video contains a large amount of temporally redundant information, especially in the case of long videos such as films and TV dramas, using solution (3) to process all key frames indiscriminately leads to low processing efficiency.
Based on this, the present disclosure introduces automatic editing technology and realizes automatic editing requirements through a multi-modal system based on machine learning. It mainly addresses the parts of video editing that currently depend heavily on human understanding: functions such as intelligent tagging, automatic stripping, and video description generation.
FIG. 1 schematically shows an exemplary system architecture to which a video editing method and apparatus can be applied according to an embodiment of the present disclosure.
It should be noted that FIG. 1 is only an example of a system architecture to which the embodiments of the present disclosure can be applied, to help those skilled in the art understand the technical content of the present disclosure, but it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios. For example, in another embodiment, the exemplary system architecture to which the video editing method and apparatus can be applied may include a terminal device, and the terminal device may implement the video editing method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.
Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, email clients, and/or social platform software (only examples).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting video browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, for example a background management server that supports content browsed by users using the terminal devices 101, 102, 103 (only an example). The background management server can analyze and process received data such as user requests, and feed back processing results (for example, web pages, information, or data obtained or generated according to the user requests) to the terminal devices. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system intended to overcome the defects of difficult management and weak business scalability that exist in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that the video editing method provided by the embodiments of the present disclosure can generally be executed by the terminal device 101, 102, or 103. Correspondingly, the video editing apparatus provided by the embodiments of the present disclosure can also be arranged in the terminal device 101, 102, or 103.
Alternatively, the video editing method provided by the embodiments of the present disclosure can generally also be executed by the server 105. Correspondingly, the video editing apparatus provided by the embodiments of the present disclosure can generally be arranged in the server 105. The video editing method provided by the embodiments of the present disclosure can also be executed by a server or server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the video editing apparatus provided by the embodiments of the present disclosure can also be arranged in a server or server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
For example, when a video needs to be edited, the terminal device 101, 102, or 103 can classify each event scene according to the first frame information corresponding to the respective first partial frames of at least one event scene included in the main video, to obtain a scene classification result. Then, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, the main video is split into at least one video segment according to the start time information of the target event scene, where each video segment includes at least one event scene. Afterwards, a video editing operation is performed based on the at least one video segment. Alternatively, a server or server cluster capable of communicating with the terminal devices 101, 102, 103 and/or the server 105 can process the main video according to the scene classification results of the video scenes in the main video and implement the video editing operation.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks, and servers according to implementation needs.
FIG. 2 schematically shows a flowchart of a video editing method according to an embodiment of the present disclosure.
As shown in FIG. 2, the method includes operations S210 to S230.
In operation S210, each event scene is classified according to first frame information corresponding to the respective first partial frames of at least one event scene included in the main video, to obtain a scene classification result.
In operation S220, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, the main video is split into at least one video segment according to the start time information of the target event scene, where each video segment includes at least one event scene.
In operation S230, a video editing operation is performed based on the at least one video segment.
According to an embodiment of the present disclosure, a video can be divided into four levels, from small to large split granularity: the shot level, the scene level, the segment level, and the program level. The shot level may refer to the continuous pictures recorded by the same camera in one take, that is, the shot footage. The scene level may refer to a continuous piece of video with an unchanged setting, composed of one or more shots covered by the same time and space. The segment level may be composed of one or more associated event scenes. The program level is generally a complete piece of input video and audio material.
FIG. 3 schematically shows a schematic diagram of video levels according to an embodiment of the present disclosure.
As shown in FIG. 3, at the program level, a feature video 300 includes two segments 310 and 320. At the segment level, the segment 310 includes four scenes 330, 340, 350, and 360, and the segment 320 corresponds to one scene 370. At the scene level, each scene includes multiple shots; for example, the scene 330 includes four shots 331, 332, 333, and 334.
According to an embodiment of the present disclosure, segment splitting is used to aggregate semantically continuous scenes, and scenes of the same continuous event are merged into one segment. The criteria for separating segments may include at least one of the following: (1) the two segments belong to different times and spaces; (2) the two segments are not closely related in event semantics. The manifestation of segment splitting in a video may include at least one of the following: (1) the split position is located where the scene picture changes; (2) there is usually an obvious audio change at the split position, such as an obvious silent pause, a change in background music, an audio change, or a sudden change in background noise such as a whistle or car sound; (3) the split position is often accompanied by an obvious buffer video segment, such as a scenery shot without prominent characters, a black-screen shot, or a special shot such as a spin-in or spin-out; (4) the theme of the storyline changes before and after the split.
According to an embodiment of the present disclosure, a video may include one or more event scenes. The first partial frames may be some of the multiple frames corresponding to an event scene. The first frame information may include at least one of an image feature vector, an audio feature vector, a text feature vector, and the like corresponding to the first partial frames, as well as at least one of an inter-scene image difference vector, an inter-scene audio difference vector, an inter-scene text difference vector, and the like.
According to an embodiment of the present disclosure, the process of classifying event scenes to obtain scene classification results can be completed by a boundary classification model based on scene granularity. On the premise that the event scenes contained in the current feature video have been determined, partial frames can be extracted from each event scene to obtain the first partial frames. By inputting the multiple first partial frames corresponding to the multiple event scenes into the boundary classification model as input features, the boundary classification result of each scene in the video can be obtained. The boundary classification result can be expressed in the form of 0 or 1, where 1 indicates that the target event scene to which the frame corresponding to the result belongs is a boundary point at the scene granularity, that is, the target event scene to which a frame with a classification result of 1 belongs is a segment segmentation point, and 0 indicates a non-boundary point, that is, not a segment segmentation point.
It should be noted that a segment segmentation point determined at the scene granularity can be embodied as an event scene; when the main video includes multiple video scenes, an event scene judged to be a segment segmentation point can be used to re-divide the scene-granularity main video into segment-granularity video segments.
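As a rough illustration of how such a scene-granularity boundary classifier could be organized, the following sketch assumes concatenated per-frame image/audio/text features and an arbitrary GRU encoder; the disclosure does not prescribe a specific architecture, feature dimension, or framework.

```python
import torch
import torch.nn as nn

class SceneBoundaryClassifier(nn.Module):
    """Illustrative scene-granularity boundary model: for each sampled frame of
    each event scene, predict 1 (the scene is a segment segmentation point)
    or 0 (it is not). Dimensions and the GRU encoder are assumptions."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, 1)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_sampled_frames, feat_dim); each feature may
        # concatenate image/audio/text vectors and inter-scene difference vectors.
        hidden, _ = self.encoder(frame_feats)
        logits = self.head(hidden).squeeze(-1)        # (batch, num_sampled_frames)
        return (torch.sigmoid(logits) > 0.5).long()   # 0/1 boundary decision per frame
```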
FIG. 4 schematically shows an example diagram of dividing video segments according to event scenes according to an embodiment of the present disclosure.
As shown in FIG. 4, a feature video 400 includes five event scenes: 410, 420, 430, 440, and 450. For each event scene, some of its frames can be extracted, and the number of partial frames extracted for different event scenes can be different; for example, m frames 411, ..., 41m are extracted for the event scene 410, and n frames 451, ..., 45n are extracted for the event scene 450. Then, at least one of the image feature vector, audio feature vector, text feature vector, and the like corresponding to each extracted frame, as well as at least one of the image difference vector, audio difference vector, and text difference vector between two adjacent frames, is input into the boundary classification model as input features, which may output a result such as 00000000100000000100. Based on the output result, for example, two frames representing segment segmentation points can be determined. For each frame representing a segment segmentation point, the target event scene corresponding to that frame can be further determined, for example 430 and 440 respectively. The feature video can then be split according to the start time points of 430 and 440, yielding three video segments: 410 and 420 form the first video segment 460, 430 forms the second video segment 470, and 440 and 450 form the third video segment 480.
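Given per-scene boundary decisions like the 0/1 string above, regrouping scenes into segments by their start times reduces to a small bookkeeping step; the sketch below uses hypothetical scene records and is not part of the disclosure.

```python
def split_into_segments(scenes, boundary_scene_ids):
    """scenes: time-ordered records such as {"id": 430, "start": 812.4, "end": 905.0}.
    boundary_scene_ids: scenes judged to be segment segmentation points.
    A new segment starts at the start time of each boundary scene."""
    segments, current = [], []
    for scene in scenes:
        if scene["id"] in boundary_scene_ids and current:
            segments.append(current)
            current = []
        current.append(scene)
    if current:
        segments.append(current)
    return segments

# With scenes 410-450 and boundaries {430, 440}, the scene ids group as
# [[410, 420], [430], [440, 450]], matching segments 460, 470, and 480 in FIG. 4.
```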
According to an embodiment of the present disclosure, a corresponding commentary video can be generated for both the feature video and the video segments obtained by division, and the commentary video can include content such as a text introduction, a voice description, and multiple spliced shots for the feature video or video segment.
Through the above embodiments of the present disclosure, scene-level segmentation information can be used to selectively sample partial frames, achieving an efficient video segment editing effect with low resource consumption.
The method shown in FIG. 2 is further described below with reference to specific embodiments.
According to an embodiment of the present disclosure, performing a video editing operation based on the at least one video segment includes: determining a target video segment for which a commentary video needs to be generated; determining identification information of the target video segment according to character features and text features in the target video segment; determining summary information of the target video segment according to the text features of the target video segment; determining a target shot related to the summary information from the target video segment; determining title information of the target video segment according to the summary information; and generating the commentary video according to the identification information, the summary information, the target shot, and the title information.
According to an embodiment of the present disclosure, the target video segment may be a video segment determined from the at least one video segment obtained by splitting the main video. The character features may include features represented by the character information in the target video segment. The text features may include features represented by at least one of the subtitle information, speech information, and the like in the target video segment. The identification information may include the video name corresponding to the target video segment, such as the title of the drama to which a film or TV drama clip belongs. The summary information may represent a textual commentary on the target video segment. The target shot may be one or more shots of the target video segment. The title information may be a title or name redefined for the target video segment.
It should be noted that the target video segment may also be replaced by the main video; that is, the above operations can be performed on the main video to generate a corresponding commentary video.
According to an embodiment of the present disclosure, taking a film or TV drama clip as an example, determining the identification information of the target video segment according to the character features and text features in the target video segment may specifically be as follows: according to the aggregated face detection results in the segment interval corresponding to the target video segment, the star names of the characters in the segment interval can be obtained; extracting subtitle keywords can obtain the on-screen character names of the characters in the segment interval; and by combining the star names and the character names and searching a film and TV drama knowledge graph, the name of the drama corresponding to the segment can be obtained.
According to an embodiment of the present disclosure, when there are multiple segments whose identification information needs to be determined within the same target video segment, the drama name of each segment can first be determined by the aforementioned method. Since the drama names identified for the individual segments may differ, score-based voting over the different results can further be combined, and the drama name with the most votes is taken as the final output result for each segment in the same target video segment.
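The voting described here can be written as a simple majority count over per-segment predictions; in the sketch below, the face-detection and knowledge-graph lookups are assumed to happen elsewhere and only their per-segment results are passed in.

```python
from collections import Counter

def vote_drama_name(per_segment_names):
    """per_segment_names: the drama name predicted for each sub-segment, e.g.
    from star names (face detection) plus character names (subtitle keywords)
    looked up in a film/TV knowledge graph. Returns the name with most votes."""
    votes = Counter(name for name in per_segment_names if name)
    return votes.most_common(1)[0][0] if votes else None

print(vote_drama_name(["Drama A", "Drama A", "Drama B", "Drama A"]))  # -> "Drama A"
```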
Through the above embodiments of the present disclosure, generating video commentary for the main video and the video segments can further improve the richness of the video editing results.
According to an embodiment of the present disclosure, the text features include the line (dialogue) text in the target video segment. Determining the summary information of the target video segment according to the text features of the target video segment includes: determining the generator identifier (i.e., the speaker) of each line text in the text features; and extracting information from each line text marked with its generator identifier, to obtain the summary information.
According to an embodiment of the present disclosure, the process of determining the summary information of the target video segment according to the text features of the target video segment can be completed by a document summarization model. The document summarization model can be trained by taking the line text corresponding to an input segment as input and the summary description corresponding to that segment as output. The segment summaries used for training can come from the plot introductions of TV dramas and films available on the web.
According to an embodiment of the present disclosure, the trained document summarization model can take the subtitle text of the target video segment as input and generate the summary information of the target video segment. The subtitle text can include the narrator and the narrated content, and the narrator can include the name of a character in the drama. The character name can be determined by first obtaining the star through face detection and then obtaining the character name from the correspondence between stars and roles. The narrated content can include at least one of lines and subtitle content.
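The input to such a summarization model could simply be the speaker-tagged transcript of the segment; the formatting below is one possible convention, not something fixed by the disclosure, and the final model call is shown only as a placeholder comment.

```python
def build_summary_input(lines):
    """lines: [(character_name, utterance), ...] for one target video segment,
    where character_name comes from face detection plus the star-to-role
    mapping. Produces one speaker-tagged document for the summarizer."""
    return "\n".join(f"{speaker}: {utterance}" for speaker, utterance in lines)

doc = build_summary_input([
    ("Character A", "We need to leave before dawn."),
    ("Character B", "Then we meet at the airport."),
])
# summary = summarization_model.generate(doc)  # hypothetical summarization model call
```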
Through the above embodiments of the present disclosure, an intelligent method for generating summary information is provided; determining summary information based on the characters and lines of the target video segment can effectively improve the accuracy and completeness of the summary description.
According to an embodiment of the present disclosure, the text features include the line text in the target video segment. Determining the target shot related to the summary information from the target video segment includes: determining the voice broadcast duration for broadcasting the summary information by speech; determining at least one line text associated with the summary information; for each line text, determining a shot segment that matches the line text in time, to obtain multiple shot segments; determining at least one target shot segment from the multiple shot segments according to the voice broadcast duration, where the total duration of the at least one target shot segment matches the voice broadcast duration; and determining the at least one target shot segment as the target shot.
According to an embodiment of the present disclosure, a self-attention operation can be introduced into the document summarization model for the temporal characteristics of the text, and the self-attention value can represent the contribution of a given line text to the finally output summary information. For each line text, a shot-level video with the highest temporal overlap, or a shot-level video associated with that line text, can be selected as the shot segment corresponding to the line text. The voice broadcast duration of the summary information can be calculated according to the speech rate of the automatic AI broadcast. The at least one target shot segment can be selected in descending order of shot score until the selected shot segments can fill the entire voice broadcast duration, where the shot score can be the normalized score corresponding to the self-attention value.
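A sketch of that shot-selection step is given below; the normalized attention scores, shot timings, and the characters-per-second speech rate are all placeholders chosen for illustration.

```python
def select_target_shots(shots, summary_text, chars_per_second=4.0):
    """shots: records such as {"start": seconds, "duration": seconds,
    "score": normalized self-attention score of the associated line text}.
    Shots are picked in descending score order until their total duration
    covers the estimated TTS broadcast time, then replayed in time order."""
    broadcast_duration = len(summary_text) / chars_per_second
    chosen, total = [], 0.0
    for shot in sorted(shots, key=lambda s: s["score"], reverse=True):
        if total >= broadcast_duration:
            break
        chosen.append(shot)
        total += shot["duration"]
    return sorted(chosen, key=lambda s: s["start"])
```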
Through the above embodiments of the present disclosure, filling the voice broadcast process with extracted target shots can further increase the richness of the commentary video.
According to an embodiment of the present disclosure, the process of determining the title information of the target video segment according to the summary information can be completed by prediction with a document title generation model. A large number of plot introductions of films and TV dramas, as well as the titles corresponding to episodes and segments, can be crawled from the web; with these data, the document title generation model can be trained.
According to an embodiment of the present disclosure, the title information can be obtained by prediction by inputting the summary information into the above document title generation model.
According to an embodiment of the present disclosure, the title information of the target video segment is determined according to the summary information, and generating the commentary video according to the identification information, the summary information, the target shot, and the title information may specifically be: playing the previously selected target shots in time order, accompanied by the drama name of the video, the segment title, the text summary subtitles, and the AI voice broadcast of the summary, to obtain the commentary video for the target video segment.
Through the above embodiments of the present disclosure, the obtained commentary video can efficiently reflect the target video segment while effectively ensuring completeness, accuracy, and richness.
According to an embodiment of the present disclosure, the video editing method may further include: classifying each shot according to second frame information corresponding to respective second partial frames of at least one shot included in the main video, to obtain a shot classification result for each shot; and, when the shot classification result indicates that the target shot corresponding to the shot classification result is a scene segmentation point, splitting the main video into at least one event scene according to the start time information of the target shot, where each event scene includes at least one shot.
According to an embodiment of the present disclosure, the criterion for shot splitting may include camera switching. Methods such as color histograms, inter-frame differences, image feature similarity, and video-splitter tools can be used to split a video into shots. Scene splitting is usually based on criteria such as a spatial or temporal change between shots. For example, a video switching from reality to a recollection is a temporal scene change, and a video switching from an indoor setting to an airport is a spatial scene change.
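As one concrete instance of the color-histogram approach to shot splitting mentioned above, the following OpenCV-based sketch flags a cut whenever the histogram correlation between consecutive frames drops below a threshold; the threshold value is an assumption, not taken from the disclosure.

```python
import cv2

def detect_shot_boundaries(video_path, threshold=0.6):
    """Returns frame indices at which a new shot is assumed to start, based on
    the color-histogram correlation between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, frame_idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256] * 3)
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
            boundaries.append(frame_idx)  # likely camera switch: a new shot starts here
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return boundaries
```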
According to an embodiment of the present disclosure, a video is usually spliced together from one or more shots. The second partial frames may be some of the multiple frames corresponding to one shot. The second frame information may include at least one of image features, face features, audio features, and the like corresponding to the second partial frames.
According to an embodiment of the present disclosure, the process of classifying shots to obtain shot classification results can be completed by a boundary classification model based on shot granularity. The open-source MovieScenes dataset and the labeled data in a local database can be combined to encode whether each shot in the feature video is a boundary, and the boundary classification model can be trained jointly on them.
According to an embodiment of the present disclosure, when predicting whether a certain shot is a boundary, partial frames can be extracted from each shot of the main video to obtain the second partial frames. By inputting the second partial frames corresponding to a shot into the boundary classification model as input features, the shot classification result of each shot in the video can be obtained, so as to judge whether the shot is a boundary. The boundary classification result can likewise be expressed in the form of 0 or 1, where 1 indicates that the frame corresponding to the result is a boundary point at the shot granularity and 0 indicates a non-boundary point. If a shot is judged to be a boundary, the start time of the shot can be used as the start time of a scene split, thereby splitting the main video into event scenes.
Through the above embodiments of the present disclosure, shot-level segmentation information can be used to selectively sample partial frames, achieving an efficient video scene splitting effect with low resource consumption.
According to an embodiment of the present disclosure, after multi-level splitting is performed on the feature video, models with different strategies can be used, based on the splitting results, to extract tags for the video data at different levels.
According to an embodiment of the present disclosure, the video editing method may further include: for each shot, acquiring the fourth partial frames corresponding to the shot, performing feature extraction on each frame in the fourth partial frames to obtain a second feature extraction result, and determining the shot tag of the shot according to the second feature extraction result.
According to an embodiment of the present disclosure, image-granularity models can be used at the shot level to extract shot tags. For example, a face detection model is used to identify different stars; an object detection model is used to detect different objects such as guns and flags; an object attribute model is used to detect character attributes such as character image type and clothing type; and an image classification model is used to detect, for example, picture aesthetics and picture style. Because the video duration is long, if all key frames are sampled to generate tags, the image tags of different frames remain highly redundant, and if every frame is fed into the image models for analysis, video-level analysis is slow. Based on the shot granularity, only a small number of frames are taken to obtain the fourth partial frames, and the second feature extraction result, obtained by analyzing the fourth partial frames and averaging, can be used as the shot tag.
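The sample-and-average idea can be sketched as follows; `classify_frame` stands in for whichever image-granularity model (face, object, attribute, or image classification) is being applied and is not a specific API.

```python
import numpy as np

def shot_tags(shot_frames, classify_frame, num_samples=3, top_k=3):
    """shot_frames: all decoded frames of one shot. classify_frame: any
    per-frame model returning {tag: probability}. Only a few evenly spaced
    frames are analyzed and their scores averaged to form the shot tags."""
    count = min(num_samples, len(shot_frames))
    idx = np.linspace(0, len(shot_frames) - 1, num=count, dtype=int)
    scores = {}
    for i in idx:
        for tag, p in classify_frame(shot_frames[i]).items():
            scores[tag] = scores.get(tag, 0.0) + p / count
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```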
Through the above embodiments of the present disclosure, selectively sampling partial frames based on the shot granularity and determining shot tags with fewer resources both guarantees the tag recall rate and improves the overall video analysis speed.
According to an embodiment of the present disclosure, the video editing method may further include: for each event scene, acquiring the third partial frames corresponding to each target shot included in the event scene, performing feature extraction on each frame in the third partial frames to obtain a first feature extraction result, and determining the scene tag of the event scene according to the first feature extraction result.
According to an embodiment of the present disclosure, a video-granularity temporal model can be used at the scene level to provide the video with tags such as scene, activity, and action, for example tags such as airport, living room, dozing, and conversation. When determining the tag of an event scene, partial frames can be sampled from each target shot in the scene to obtain the third partial frames, which serve as the input of the scene model; the scene tag of each event scene is thereby obtained, and information such as the location, activity, and action of the event scene can then be determined.
It should be noted that the number of frames taken from different shots in one event scene can differ. In one embodiment, the number of frames taken from each shot can be positively correlated with the shot duration.
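One way to realize the duration-proportional sampling described here is sketched below; the total sampling budget and the per-shot minimum are assumptions.

```python
def frames_per_shot(shot_durations, total_budget=32, min_per_shot=1):
    """shot_durations: duration (seconds) of each shot in one event scene.
    Distributes a fixed sampling budget roughly proportionally to duration,
    guaranteeing at least one sampled frame per shot."""
    total = sum(shot_durations)
    return [max(min_per_shot, round(total_budget * d / total)) for d in shot_durations]

print(frames_per_shot([2.0, 10.0, 4.0]))  # longer shots receive more sampled frames
```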
Through the above embodiments of the present disclosure, selectively sampling partial frames based on scene-level segmentation information and determining scene tags with fewer resources can further guarantee the tag recall rate and improve the overall video analysis speed.
According to an embodiment of the present disclosure, the video editing method may further include: performing at least one of the following processes on video frame sequence information: when a first target video frame sequence is detected, determining that the first target video frame sequence is the video frame sequence of the opening or closing credits of the initial video; and when a second target video frame sequence is detected, determining that the second target video frame sequence is the video frame sequence of the opening or closing credits of the initial video. The video frame sequence information includes a video frame sequence in the initial video and the audio corresponding to the video frame sequence, the first target video frame sequence includes video frames whose subtitles are located at a first position of the video frame picture, and the audio corresponding to the second target video frame sequence is audio of a target type. The main video is determined according to at least one of the first target video frame sequence and the second target video frame sequence.
According to an embodiment of the present disclosure, the initial video may be a video including opening and closing credits. A credits detection model can detect the opening and closing credits in the video, and this model can be implemented using subtitle detection and audio feature classification. Regarding subtitles: in the feature portion, subtitles usually appear at the bottom of the picture, whereas in the opening and closing credits, subtitles often appear within the picture, so the first position can be defined to include positions within the picture. Regarding audio: the opening and closing credits generally contain no actor narration and are mostly pure music or music mixed with some special-effect background sounds, so the target type can be defined to include types such as pure music and special-effect background sound.
According to an embodiment of the present disclosure, the initial video can first pass through the credits detection model to determine the positions where the main video starts and ends. Specifically, the audio features within a sliding window, for example in units of one second, can be classified as to whether they belong to the opening or closing credits. Combining the results of subtitle position detection and audio feature detection, second-granularity time marking of the opening and closing credits can be achieved, thereby determining the feature video.
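The combination of per-second subtitle-position and audio-type cues could look like the following sketch, where both per-second detectors are placeholders; how the two cues are actually fused is not specified by the disclosure, so the "either cue fires" rule below is only one plausible choice.

```python
def mark_credits(num_seconds, subtitle_in_picture, audio_is_music):
    """subtitle_in_picture(t) / audio_is_music(t): placeholder per-second
    detectors for 'subtitles overlaid inside the picture' and 'pure music or
    effects-only audio'. A second is flagged as credits if either cue fires;
    the main video spans the first through last non-credits second."""
    is_credits = [subtitle_in_picture(t) or audio_is_music(t) for t in range(num_seconds)]
    feature_seconds = [t for t, flag in enumerate(is_credits) if not flag]
    if not feature_seconds:
        return None
    return feature_seconds[0], feature_seconds[-1]  # (start_second, end_second) of the main video
```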
Through the above embodiments of the present disclosure, a screening step is provided before video processing, so that the more essential video content can be screened out, which can effectively improve video editing efficiency.
According to an embodiment of the present disclosure, the video editing method may further include: determining multiple third target video frame sequences in the main video; performing feature extraction on each third target video frame sequence to obtain a third feature extraction result; and determining the type tag of the main video according to the multiple third feature extraction results.
According to an embodiment of the present disclosure, the type tag of the main video, that is, the program-level type tag, can be obtained based on a long-video temporal model over image and audio sequences. This model can provide a comprehensive tag for the entire video, such as TV drama - family ethics or movie - science fiction. The long-video temporal model can set a maximum number of analyzed video frames during training to keep machine memory stable; if the number of video frames to be analyzed is greater than the threshold, the maximum number of consecutive frames can be randomly cropped as training input. During prediction, non-overlapping sliding windows can be used, taking the maximum number of frames in sequence to obtain the third target video frame sequences as input. The average of all the obtained sliding-window score results can be output as the type tag of the main video.
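A sketch of the non-overlapping sliding-window prediction and score averaging follows; the window length and the `model` callable returning class probabilities are placeholders.

```python
import numpy as np

def program_type_tag(frame_feats, model, max_frames=512):
    """frame_feats: per-frame (and audio) features of the whole main video,
    shape (T, D). model(window) returns class probabilities for one window of
    at most max_frames frames. Window scores are averaged into one type tag."""
    windows = [frame_feats[i:i + max_frames] for i in range(0, len(frame_feats), max_frames)]
    probs = np.mean([model(w) for w in windows], axis=0)
    return int(np.argmax(probs))  # index of the predicted program-level type tag
```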
Through the above embodiments of the present disclosure, selectively sampling partial frame sequences based on program-level information and determining the video type tag with fewer resources can effectively guarantee the tag recall rate while improving the overall video analysis speed.
According to an embodiment of the present disclosure, the first frame information may include at least one of image features, audio features, and text features corresponding to the first partial frames, and at least one of an image difference vector, an audio difference vector, and a text difference vector between two adjacent frames in the first partial frames. Additionally or alternatively, the second frame information may include at least one of image features and audio features corresponding to the second partial frames. No limitation is imposed here.
Through the above embodiments of the present disclosure, processing based on information from each dimension of the video can effectively guarantee the accuracy and richness of the temporal editing structure.
FIG. 5 schematically shows a schematic diagram of the flow of a video editing method according to an embodiment of the present disclosure.
As shown in FIG. 5, an initial video can first be detected by a credits detection model 510 to screen out the opening and closing credits and obtain the feature video. Combining the image model 521, the video model 522, and the text model 523 in a video level-division module 520, the feature video can be further processed so as to divide it at the shot granularity, the scene granularity, and the segment granularity. Based on the shot-granularity information, scene-granularity information, segment-granularity information, and other results obtained by the hierarchical division, the shot-level image tags, scene-level spatio-temporal tags, segment-level video commentary generation, and program-level type obtained by editing the initial video can be further determined.
Through the above embodiments of the present disclosure, a method for intelligent tagging, intelligent stripping, and intelligent commentary generation is provided. The whole method can effectively reduce the dependence on manual processing and increase the speed of video editing. Performing video processing based on partial frames or key frames makes it possible to annotate the entire video at different levels, providing a basis for indexing videos in a library.
FIG. 6 schematically shows a block diagram of a video editing apparatus according to an embodiment of the present disclosure.
As shown in FIG. 6, the video editing apparatus 600 includes a first processing module 610, a first splitting module 620, and a video editing module 630.
The first processing module 610 is configured to classify each event scene according to first frame information corresponding to the respective first partial frames of at least one event scene included in the main video, to obtain a scene classification result.
The first splitting module 620 is configured to, when the scene classification result indicates that the target event scene corresponding to the scene classification result is a segment segmentation point, split the main video into at least one video segment according to the start time information of the target event scene, where each video segment includes at least one event scene.
The video editing module 630 is configured to perform a video editing operation based on the at least one video segment.
According to an embodiment of the present disclosure, the video editing module includes a first determining unit, a second determining unit, a third determining unit, a fourth determining unit, a fifth determining unit, and a generating unit.
The first determining unit is configured to determine a target video segment for which a commentary video needs to be generated.
The second determining unit is configured to determine identification information of the target video segment according to character features and text features in the target video segment.
The third determining unit is configured to determine summary information of the target video segment according to the text features of the target video segment.
The fourth determining unit is configured to determine a target shot related to the summary information from the target video segment.
The fifth determining unit is configured to determine title information of the target video segment according to the summary information.
The generating unit is configured to generate the commentary video according to the identification information, the summary information, the target shot, and the title information.
According to an embodiment of the present disclosure, the text features include line text in the target video segment. The third determining unit includes a first determining subunit and an obtaining subunit.
The first determining subunit is configured to determine a generator identifier of each line text in the text features.
The obtaining subunit is configured to perform information extraction on each line text labeled with its generator identifier, so as to obtain the summary information.
According to an embodiment of the present disclosure, the text features include line text in the target video segment. The fourth determining unit includes a second determining subunit, a third determining subunit, a fourth determining subunit, a fifth determining subunit, and a sixth determining subunit.
The second determining subunit is configured to determine a voice broadcast duration for broadcasting the summary information by voice.
The third determining subunit is configured to determine at least one line text associated with the summary information.
The fourth determining subunit is configured to determine, for each line text, a shot segment that temporally matches the line text, so as to obtain a plurality of shot segments.
The fifth determining subunit is configured to determine at least one target shot segment from the plurality of shot segments according to the voice broadcast duration, where a total duration of the at least one target shot segment matches the voice broadcast duration.
The sixth determining subunit is configured to determine the at least one target shot segment as the target shot.
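One way to match the total shot duration to the voice broadcast duration is a simple greedy accumulation, sketched below under the assumption that shot segments are (start, end) tuples in seconds; the disclosure does not mandate this particular selection strategy, and pick_target_shots is a hypothetical name.

    def pick_target_shots(shot_segments, broadcast_duration, tolerance=1.0):
        # Greedily accumulate shot segments until their total duration
        # approximates the voice broadcast duration of the summary.
        picked, total = [], 0.0
        for start, end in shot_segments:
            length = end - start
            if total + length > broadcast_duration + tolerance:
                continue  # this segment would overshoot the narration
            picked.append((start, end))
            total += length
            if abs(total - broadcast_duration) <= tolerance:
                break
        return picked, total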
According to an embodiment of the present disclosure, the video editing apparatus further includes a second processing module and a second splitting module.
The second processing module is configured to classify each shot according to second frame information corresponding to the respective second partial frames of at least one shot included in the main video, so as to obtain a shot classification result of each shot.
The second splitting module is configured to, in a case where the shot classification result indicates that a target shot corresponding to the shot classification result is a scene split point, split the main video into the at least one event scene according to start time information of the target shot, where each event scene includes at least one shot.
According to an embodiment of the present disclosure, the video editing apparatus further includes a first feature extraction module and a first determining module.
The first feature extraction module is configured to, for each event scene: acquire third partial frames corresponding to each target shot included in the event scene; and perform feature extraction on each frame in the third partial frames to obtain a first feature extraction result.
The first determining module is configured to determine a scene label of the event scene according to the first feature extraction result.
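As a hedged sketch of how a scene label could be derived from the per-frame feature extraction result, mean pooling followed by an assumed classifier callable is shown below; none of these names or design choices are prescribed by the disclosure.

    import numpy as np

    def scene_label_from_features(frame_features, label_names, classifier):
        # Pool the per-frame feature vectors of the third partial frames into
        # one scene-level vector (mean pooling is only one possible choice).
        pooled = np.asarray(frame_features, dtype=np.float32).mean(axis=0)
        scores = classifier(pooled)  # assumed callable: vector -> per-label scores
        return label_names[int(np.argmax(scores))]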
According to an embodiment of the present disclosure, the video editing apparatus further includes a second feature extraction module and a second determining module.
The second feature extraction module is configured to, for each shot: acquire fourth partial frames corresponding to the shot; and perform feature extraction on each frame in the fourth partial frames to obtain a second feature extraction result.
The second determining module is configured to determine a shot label of the shot according to the second feature extraction result.
According to an embodiment of the present disclosure, the video editing apparatus further includes a third determining module, a fourth determining module, and a fifth determining module.
The third determining module is configured to, in a case where a first target video frame sequence is detected, determine that the first target video frame sequence is a video frame sequence of the opening or closing credits of an initial video, where the first target video frame sequence includes video frames whose subtitles are located at a first position of the video frame picture.
The fourth determining module is configured to, in a case where a second target video frame sequence is detected, determine that the second target video frame sequence is a video frame sequence of the opening or closing credits of the initial video, where audio corresponding to the second target video frame sequence is audio of a target type.
The fifth determining module is configured to determine the main video according to at least one of the first target video frame sequence and the second target video frame sequence.
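The two credit cues, subtitle position and audio type, could be combined as in the following sketch. The names frames, subtitle_position, audio_type, and the concrete labels "top" and "theme_music" are all assumptions made for illustration; the disclosure only requires that the two detections exist, not these interfaces.

    def find_credit_ranges(frames, subtitle_position, audio_type,
                           credit_position="top", credit_audio="theme_music"):
        # frames: list of dicts with a timestamp "t"; the two callables stand in
        # for the subtitle-layout and audio-type classifiers assumed here.
        ranges, run_start = [], None
        for frame in frames:
            is_credit = (subtitle_position(frame) == credit_position
                         or audio_type(frame) == credit_audio)
            if is_credit and run_start is None:
                run_start = frame["t"]
            elif not is_credit and run_start is not None:
                ranges.append((run_start, frame["t"]))
                run_start = None
        if run_start is not None:
            ranges.append((run_start, frames[-1]["t"]))
        return ranges  # frames outside these ranges form the main video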
According to an embodiment of the present disclosure, the video editing apparatus further includes a sixth determining module, a third feature extraction module, and a seventh determining module.
The sixth determining module is configured to determine a plurality of third target video frame sequences in the main video.
The third feature extraction module is configured to perform feature extraction on each third target video frame sequence to obtain a third feature extraction result.
The seventh determining module is configured to determine a type label of the main video according to the plurality of third feature extraction results.
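One plausible aggregation of the plurality of third feature extraction results into a single type label is a majority vote over per-sequence predictions, sketched below; classify_sequence is an assumed callable, and voting is only one of several reasonable combination rules.

    from collections import Counter

    def video_type_label(sequence_features, classify_sequence):
        # Predict a type label per third target video frame sequence, then
        # keep the most common prediction as the program-level type label.
        votes = Counter(classify_sequence(f) for f in sequence_features)
        label, _count = votes.most_common(1)[0]
        return label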
According to an embodiment of the present disclosure, the first frame information includes at least one of an image feature, an audio feature, and a text feature corresponding to the first partial frames, and at least one of an image difference vector, an audio difference vector, and a text difference vector between two adjacent frames in the first partial frames. Additionally or alternatively, the second frame information includes at least one of an image feature and an audio feature corresponding to the second partial frames.
According to the embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions are used to cause a computer to perform the method described above.
According to an embodiment of the present disclosure, a computer program product includes a computer program which, when executed by a processor, implements the method described above.
Fig. 7 shows a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in Fig. 7, the device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. Various programs and data required for the operation of the device 700 may also be stored in the RAM 703. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A plurality of components in the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays or speakers; a storage unit 708, such as a magnetic disk or an optical disc; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 701 performs the various methods and processes described above, such as the video editing method. For example, in some embodiments, the video editing method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the video editing method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the video editing method in any other appropriate manner (for example, by means of firmware).
Various implementations of the systems and techniques described above herein may be realized in a digital electronic circuit system, an integrated circuit system, a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, where the programmable processor may be a special-purpose or general-purpose programmable processor, may receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and may transmit data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatuses, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may be executed entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system including a back-end component (for example, as a data server), or a computing system including a middleware component (for example, an application server), or a computing system including a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system including any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of the communication network include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The relationship between the client and the server is generated by computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The specific implementations described above do not constitute a limitation on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall be included within the protection scope of the present disclosure.

Claims (20)

  1. A video editing method, comprising:
    classifying each of at least one event scene included in a main video according to first frame information corresponding to respective first partial frames of the at least one event scene, to obtain a scene classification result;
    in a case where the scene classification result indicates that a target event scene corresponding to the scene classification result is a segment split point, splitting the main video into at least one video segment according to start time information of the target event scene, wherein each of the video segments includes at least one of the event scenes; and
    performing a video editing operation based on the at least one video segment.
  2. The method according to claim 1, wherein performing the video editing operation based on the at least one video segment comprises:
    determining a target video segment from the at least one video segment;
    determining identification information of the target video segment according to character features and text features in the target video segment;
    determining summary information of the target video segment according to the text features of the target video segment;
    determining, from the target video segment, a target shot related to the summary information;
    determining title information of the target video segment according to the summary information; and
    generating a commentary video corresponding to the target video segment according to the identification information, the summary information, the target shot, and the title information.
  3. The method according to claim 2, wherein the text features comprise line text in the target video segment, and determining the summary information of the target video segment according to the text features of the target video segment comprises:
    determining a generator identifier of each line text in the text features; and
    performing information extraction on each line text labeled with the generator identifier, to obtain the summary information.
  4. The method according to claim 2, wherein the text features comprise line text in the target video segment, and determining, from the target video segment, the target shot related to the summary information comprises:
    determining a voice broadcast duration for broadcasting the summary information by voice;
    determining at least one line text associated with the summary information;
    determining, for each line text, a shot segment that temporally matches the line text, to obtain a plurality of shot segments;
    determining at least one target shot segment from the plurality of shot segments according to the voice broadcast duration, wherein a total duration of the at least one target shot segment matches the voice broadcast duration; and
    determining the at least one target shot segment as the target shot.
  5. The method according to any one of claims 1 to 4, further comprising:
    classifying each of at least one shot included in the main video according to second frame information corresponding to respective second partial frames of the at least one shot, to obtain a shot classification result of each shot; and
    in a case where the shot classification result indicates that a target shot corresponding to the shot classification result is a scene split point, splitting the main video into the at least one event scene according to start time information of the target shot, wherein each of the event scenes includes at least one of the shots.
  6. The method according to claim 5, further comprising:
    for each of the event scenes:
    acquiring third partial frames corresponding to each target shot included in the event scene;
    performing feature extraction on each frame in the third partial frames, to obtain a first feature extraction result; and
    determining a scene label of the event scene according to the first feature extraction result.
  7. The method according to claim 5 or 6, further comprising:
    for each of the shots:
    acquiring fourth partial frames corresponding to the shot;
    performing feature extraction on each frame in the fourth partial frames, to obtain a second feature extraction result; and
    determining a shot label of the shot according to the second feature extraction result.
  8. The method according to any one of claims 1 to 7, further comprising:
    in a case where a first target video frame sequence is detected, determining that the first target video frame sequence is a video frame sequence of opening or closing credits of an initial video, wherein the first target video frame sequence includes video frames whose subtitles are located at a first position of a video frame picture;
    in a case where a second target video frame sequence is detected, determining that the second target video frame sequence is a video frame sequence of the opening or closing credits of the initial video, wherein audio corresponding to the second target video frame sequence is audio of a target type; and
    determining the main video according to at least one of the first target video frame sequence and the second target video frame sequence.
  9. The method according to any one of claims 1 to 8, further comprising:
    determining a plurality of third target video frame sequences in the main video;
    performing feature extraction on each of the third target video frame sequences, to obtain a third feature extraction result; and
    determining a type label of the main video according to a plurality of the third feature extraction results.
  10. The method according to any one of claims 5 to 9, wherein the first frame information comprises at least one of an image feature, an audio feature, and a text feature corresponding to the first partial frames, and at least one of an image difference vector, an audio difference vector, and a text difference vector between two adjacent frames in the first partial frames; and/or
    the second frame information comprises at least one of an image feature and an audio feature corresponding to the second partial frames.
  11. A video editing apparatus, comprising:
    a first processing module, configured to classify each of at least one event scene included in a main video according to first frame information corresponding to respective first partial frames of the at least one event scene, to obtain a scene classification result;
    a first splitting module, configured to, in a case where the scene classification result indicates that a target event scene corresponding to the scene classification result is a segment split point, split the main video into at least one video segment according to start time information of the target event scene, wherein each of the video segments includes at least one of the event scenes; and
    a video editing module, configured to perform a video editing operation based on the at least one video segment.
  12. The apparatus according to claim 11, wherein the video editing module comprises:
    a first determining unit, configured to determine a target video segment for which a commentary video needs to be generated;
    a second determining unit, configured to determine identification information of the target video segment according to character features and text features in the target video segment;
    a third determining unit, configured to determine summary information of the target video segment according to text features of the target video segment;
    a fourth determining unit, configured to determine, from the target video segment, a target shot related to the summary information;
    a fifth determining unit, configured to determine title information of the target video segment according to the summary information; and
    a generating unit, configured to generate a commentary video corresponding to the target video segment according to the identification information, the summary information, the target shot, and the title information.
  13. The apparatus according to claim 12, wherein the text features comprise line text in the target video segment, and the third determining unit comprises:
    a first determining subunit, configured to determine a generator identifier of each line text in the text features; and
    an obtaining subunit, configured to perform information extraction on each line text labeled with the generator identifier, to obtain the summary information.
  14. The apparatus according to claim 12, wherein the text features comprise line text in the target video segment, and the fourth determining unit comprises:
    a second determining subunit, configured to determine a voice broadcast duration for broadcasting the summary information by voice;
    a third determining subunit, configured to determine at least one line text associated with the summary information;
    a fourth determining subunit, configured to determine, for each line text, a shot segment that temporally matches the line text, to obtain a plurality of shot segments;
    a fifth determining subunit, configured to determine at least one target shot segment from the plurality of shot segments according to the voice broadcast duration, wherein a total duration of the at least one target shot segment matches the voice broadcast duration; and
    a sixth determining subunit, configured to determine the at least one target shot segment as the target shot.
  15. The apparatus according to any one of claims 11 to 14, further comprising:
    a second processing module, configured to classify each of at least one shot included in the main video according to second frame information corresponding to respective second partial frames of the at least one shot, to obtain a shot classification result of each shot; and
    a second splitting module, configured to, in a case where the shot classification result indicates that a target shot corresponding to the shot classification result is a scene split point, split the main video into the at least one event scene according to start time information of the target shot, wherein each of the event scenes includes at least one of the shots.
  16. The apparatus according to claim 15, further comprising:
    a first feature extraction module, configured to, for each of the event scenes:
    acquire third partial frames corresponding to each target shot included in the event scene; and
    perform feature extraction on each frame in the third partial frames, to obtain a first feature extraction result; and
    a first determining module, configured to determine a scene label of the event scene according to the first feature extraction result.
  17. The apparatus according to any one of claims 11 to 16, further comprising:
    a second feature extraction module, configured to, for each of the shots:
    acquire fourth partial frames corresponding to the shot; and
    perform feature extraction on each frame in the fourth partial frames, to obtain a second feature extraction result; and
    a second determining module, configured to determine a shot label of the shot according to the second feature extraction result.
  18. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-10.
  19. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-10.
  20. A computer program product, comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-10.
PCT/CN2022/104122 2021-08-02 2022-07-06 Video editing method and apparatus, electronic device, and storage medium WO2023011094A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110883507.6A CN113613065B (en) 2021-08-02 2021-08-02 Video editing method and device, electronic equipment and storage medium
CN202110883507.6 2021-08-02

Publications (1)

Publication Number Publication Date
WO2023011094A1 true WO2023011094A1 (en) 2023-02-09

Family

ID=78339126

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104122 WO2023011094A1 (en) 2021-08-02 2022-07-06 Video editing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN113613065B (en)
WO (1) WO2023011094A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116405745A (en) * 2023-06-09 2023-07-07 深圳市信润富联数字科技有限公司 Video information extraction method and device, terminal equipment and computer medium

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113613065B (en) * 2021-08-02 2022-09-09 北京百度网讯科技有限公司 Video editing method and device, electronic equipment and storage medium
CN114245171B (en) * 2021-12-15 2023-08-29 百度在线网络技术(北京)有限公司 Video editing method and device, electronic equipment and medium
CN114222196A (en) * 2022-01-04 2022-03-22 阿里巴巴新加坡控股有限公司 Method and device for generating short video of plot commentary and electronic equipment
CN114257864B (en) * 2022-02-24 2023-02-03 易方信息科技股份有限公司 Seek method and device of player in HLS format video source scene
CN115174962A (en) * 2022-07-22 2022-10-11 湖南芒果无际科技有限公司 Rehearsal simulation method and device, computer equipment and computer readable storage medium
CN115460455B (en) * 2022-09-06 2024-02-09 上海硬通网络科技有限公司 Video editing method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005332206A (en) * 2004-05-20 2005-12-02 Nippon Hoso Kyokai <Nhk> Video event discrimination device, program thereof, learning data generation device for video event discrimination, and program thereof
CN110121108A (en) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 Video value appraisal procedure and device
CN111246287A (en) * 2020-01-13 2020-06-05 腾讯科技(深圳)有限公司 Video processing method, video publishing method, video pushing method and devices thereof
CN111327945A (en) * 2018-12-14 2020-06-23 北京沃东天骏信息技术有限公司 Method and apparatus for segmenting video
CN111798879A (en) * 2019-04-08 2020-10-20 百度(美国)有限责任公司 Method and apparatus for generating video
US10999566B1 (en) * 2019-09-06 2021-05-04 Amazon Technologies, Inc. Automated generation and presentation of textual descriptions of video content
CN112800278A (en) * 2021-03-30 2021-05-14 腾讯科技(深圳)有限公司 Video type determination method and device and electronic equipment
CN113014988A (en) * 2021-02-23 2021-06-22 北京百度网讯科技有限公司 Video processing method, device, equipment and storage medium
CN113613065A (en) * 2021-08-02 2021-11-05 北京百度网讯科技有限公司 Video editing method and device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440640B (en) * 2013-07-26 2016-02-10 北京理工大学 A kind of video scene cluster and browsing method
CN104394422B (en) * 2014-11-12 2017-11-17 华为软件技术有限公司 A kind of Video segmentation point acquisition methods and device
CN108777815B (en) * 2018-06-08 2021-04-23 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable storage medium
CN111382620B (en) * 2018-12-28 2023-06-09 阿里巴巴集团控股有限公司 Video tag adding method, computer storage medium and electronic device
CN111538862B (en) * 2020-05-15 2023-06-20 北京百度网讯科技有限公司 Method and device for explaining video
CN112818906B (en) * 2021-02-22 2023-07-11 浙江传媒学院 Intelligent cataloging method of all-media news based on multi-mode information fusion understanding


Also Published As

Publication number Publication date
CN113613065B (en) 2022-09-09
CN113613065A (en) 2021-11-05

Similar Documents

Publication Publication Date Title
WO2023011094A1 (en) Video editing method and apparatus, electronic device, and storage medium
US11830241B2 (en) Auto-curation and personalization of sports highlights
WO2022116888A1 (en) Method and device for video data processing, equipment, and medium
US9253511B2 (en) Systems and methods for performing multi-modal video datastream segmentation
US10410679B2 (en) Producing video bits for space time video summary
US10108709B1 (en) Systems and methods for queryable graph representations of videos
US20160014482A1 (en) Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
US11436831B2 (en) Method and apparatus for video processing
US8503523B2 (en) Forming a representation of a video item and use thereof
US20190130185A1 (en) Visualization of Tagging Relevance to Video
Sundaram et al. A utility framework for the automatic generation of audio-visual skims
US9569428B2 (en) Providing an electronic summary of source content
US8528018B2 (en) System and method for evaluating visual worthiness of video data in a network environment
US20120039539A1 (en) Method and system for classifying one or more images
US20110218997A1 (en) Method and system for browsing, searching and sharing of personal video by a non-parametric approach
CN109408672B (en) Article generation method, article generation device, server and storage medium
US20210117471A1 (en) Method and system for automatically generating a video from an online product representation
CN112511854A (en) Live video highlight generation method, device, medium and equipment
US9525896B2 (en) Automatic summarizing of media content
WO2023173539A1 (en) Video content processing method and system, and terminal and storage medium
US10990828B2 (en) Key frame extraction, recording, and navigation in collaborative video presentations
CN117336572A (en) Video abstract generation method, device, computer equipment and storage medium
Valdés et al. On-line video abstract generation of multimedia news
CN113923479A (en) Audio and video editing method and device
CN113407765B (en) Video classification method, apparatus, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22851809

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE