WO2024007898A1 - Video processing method, apparatus and electronic device - Google Patents

Video processing method, apparatus and electronic device

Info

Publication number
WO2024007898A1
WO2024007898A1 · PCT/CN2023/102818 · CN2023102818W
Authority
WO
WIPO (PCT)
Prior art keywords
video
features
transition
feature
fused
Application number
PCT/CN2023/102818
Other languages
English (en)
French (fr)
Inventor
靳潇杰
沈垚杰
徐凯
Original Assignee
脸萌有限公司
北京字跳网络技术有限公司
Application filed by 脸萌有限公司 and 北京字跳网络技术有限公司
Publication of WO2024007898A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; client middleware
    • H04N 21/439: Processing of audio elementary streams
    • H04N 21/4394: Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip

Definitions

  • Embodiments of the present disclosure relate to the technical fields of computer vision and artificial intelligence, and in particular, to a video processing method, apparatus and electronic device.
  • In electronic devices, transition effects can be added between multiple material videos through transition effect templates.
  • A transition effect template includes multiple preset transition effects.
  • The electronic device can insert the preset transition effects between the material videos in sequence, according to the arrangement order of the material videos, thereby obtaining a video in which the material videos are merged.
  • However, the content of each material video differs considerably, and the preset transition effects are added between the material videos in a fixed order, so the match between each transition effect and its adjacent material videos is low, resulting in poor video synthesis.
  • In view of this, the present disclosure provides a video processing method, apparatus and electronic device to solve the technical problem of poor video synthesis in the prior art.
  • In a first aspect, the present disclosure provides a video processing method, which includes:
  • acquiring a first video, where the first video includes multiple material videos;
  • determining the fused video features corresponding to each pair of adjacent material videos, where the fused video features are used to indicate the image features and audio features of the adjacent material videos;
  • acquiring multiple transition effect features corresponding to multiple video transition effects;
  • determining, among the multiple video transition effects and according to the fused video features and the multiple transition effect features, the target video transition effect between the adjacent material videos;
  • determining a second video based on the multiple material videos and the target video transition effects.
  • the present disclosure provides a video processing device, which includes a first acquisition module, a first determination module, a second acquisition module, a second determination module and a third determination module, wherein:
  • the first acquisition module is used to acquire a first video, where the first video includes multiple material videos;
  • the first determination module is used to determine the fused video features corresponding to each adjacent material video, and the fused video features are used to indicate the image features and audio features of the adjacent material videos;
  • the second acquisition module is used to acquire multiple transition special effect features corresponding to multiple video transition special effects
  • the second determination module is configured to determine, among the plurality of video transition effects and according to the fused video features and the plurality of transition effect features, the target video transition effect between the adjacent material videos;
  • the third determination module is configured to determine a second video based on the plurality of material videos and the target video transition effects.
  • embodiments of the present disclosure provide an electronic device, including: a processor and a memory;
  • the memory stores computer execution instructions
  • the processor executes the computer-executable instructions stored in the memory, so that the electronic device performs the video processing method of the above first aspect and its various possible implementations.
  • embodiments of the present disclosure provide a computer-readable storage medium.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • When a processor executes the computer-executable instructions, the video processing method of the above first aspect and its various possible implementations is implemented.
  • embodiments of the present disclosure provide a computer program product, including a computer program.
  • When the computer program is executed by a processor, it implements the video processing method of the above first aspect and its various possible implementations.
  • Embodiments of the present disclosure also provide a computer program that, when executed by a processor, implements the video processing method of the above first aspect and its various possible implementations.
  • Figure 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure
  • Figure 2 is a schematic flowchart of a video processing method provided by an embodiment of the present disclosure
  • Figure 3 is a schematic diagram of a material video provided by an embodiment of the present disclosure.
  • Figure 4 is a schematic diagram of a first video segment and a second video segment provided by an embodiment of the present disclosure
  • Figure 5 is a schematic diagram of a video transition special effect provided by an embodiment of the present disclosure.
  • Figure 6 is a schematic diagram of a transition special effect feature provided by an embodiment of the present disclosure.
  • Figure 7 is a schematic diagram of determining a target video transition special effect provided by an embodiment of the present disclosure.
  • Figure 8 is a schematic flowchart of a method for determining fused video features provided by an embodiment of the present disclosure
  • Figure 9 is a schematic process diagram of a video processing method provided by an embodiment of the present disclosure.
  • Figure 10 is a schematic structural diagram of a video processing device provided by an embodiment of the present disclosure.
  • FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • An electronic device is a device with wireless transmitting and receiving functions. Electronic devices can be deployed on land, including indoors or outdoors, handheld, wearable or vehicle-mounted; they can also be deployed on water (such as on ships).
  • The electronic device may be a mobile phone, a tablet computer (Pad), a computer with wireless transceiver functions, a virtual reality (VR) device, an augmented reality (AR) device, a wireless terminal in industrial control, a vehicle-mounted device, a wireless terminal in self-driving, a wireless device in remote medical care, a wireless device in smart grid, a wireless device in transportation safety, a wireless device in smart city, a wireless device in smart home, a wearable device, and so on.
  • The electronic device involved in the embodiments of the present disclosure may also be called a terminal, user equipment (UE), access device, vehicle-mounted terminal, industrial control terminal, UE unit, UE station, mobile station, remote station, remote device, mobile device, wireless communication device, UE agent, or UE device.
  • Electronic equipment may also be stationary or mobile.
  • Electronic devices can merge multiple captured material videos into one video. Since the content of each material video is different, transition effects need to be added between the material videos to improve the display effect of the merged video, so that each material video plays smoothly.
  • transition effects can be added between multiple material videos through transition effects templates in electronic devices.
  • the transition special effects template includes multiple transition special effects set in sequence, and the electronic device can sequentially add corresponding transition special effects between the material videos through the transition special effects template.
  • However, the content of each material video differs considerably, and the appropriate transition differs with the content; when preset transition effects are added between the material videos in a fixed order, the match between the transition effects and the adjacent material videos is low, which leads to poorer video synthesis.
  • In view of this, embodiments of the present disclosure provide a video processing method: acquire a first video that includes multiple material videos; determine the image features and audio features corresponding to each pair of adjacent material videos, obtaining multiple image features and multiple audio features; and, based on these, determine the fused video features corresponding to each pair of adjacent material videos.
  • The transition effect features of multiple video transition effects are obtained in advance.
  • The similarity between the fused feature of each pair of adjacent material videos and each transition effect feature is then computed, the video transition effect for each adjacent pair is determined from these similarities, and the corresponding video transition effect is set between each pair of adjacent material videos to determine the second video.
  • Because the fused video features of adjacent material videos combine image features, audio features and contextual information, they can accurately indicate the video features of the adjacent material videos.
  • Matching the fused video features against the transition effect features can therefore accurately determine the video transition effect that best matches the adjacent material video content, thereby improving the effect of video synthesis.
  • Figure 1 is a schematic diagram of an application scenario provided by an embodiment of the present disclosure. See Figure 1, including: First video.
  • the first video includes material video A, material video B and material video C.
  • Material video A is before material video B
  • material video B is before material video C.
  • Based on material video A and material video B, fused video feature A is obtained, and based on material video B and material video C, fused video feature B is obtained.
  • each transition effect feature corresponds to a unique video transition effect.
  • Obtain the similarity between fused video feature A and each transition effect feature, and the similarity between fused video feature B and each transition effect feature. Since fused video feature A is most similar to transition effect feature 1, and fused video feature B is most similar to transition effect feature N, transition effect 1 (corresponding to transition effect feature 1) and transition effect N (corresponding to transition effect feature N) are selected.
  • Transition effect 1 is added between material video A and material video B, and transition effect N is added between material video B and material video C, to determine the second video.
  • In this way, the electronic device can automatically add transition effects between the material videos of the first video. Because the fused video features combine the image features and audio features of the adjacent material videos, matching the fused video features against the transition effect features can accurately determine the video transition effect that best matches the adjacent material video content, thereby improving the effect of video synthesis.
  • FIG. 2 is a schematic flowchart of a video processing method provided by an embodiment of the present disclosure. See Figure 2, the method can include:
  • the execution subject of the embodiment of the present disclosure may be an electronic device, or may be a video processing device provided in the electronic device.
  • the video processing device can be implemented by software or a combination of software and hardware.
  • the first video includes multiple material videos.
  • the material video can be multiple videos shot by electronic devices.
  • the material video can be multiple videos with different video content shot by an electronic device.
  • the material videos may include sky videos, ocean videos, people videos, etc. After the electronic device shoots multiple material videos, the multiple material videos can be spliced to obtain the first video.
  • the electronic device can obtain the first video in the database.
  • the electronic device receives a video processing request, where the video processing request includes an identifier of the first video, and the electronic device obtains the first video from multiple videos stored in the database according to the identifier of the first video.
  • the electronic device can also receive the first video sent by other devices.
  • the electronic device may receive a video sent by the server and determine the video as the first video.
  • the electronic device may also receive a video sent by other electronic devices and determine the video as the first video.
  • the electronic device can obtain multiple material videos in the first video.
  • the electronic device can divide the first video into multiple videos based on the optical flow information of the first video, and each video is a material video corresponding to the first video.
  • The electronic device can also obtain the material videos in the first video through other methods, such as a trained model; the embodiments of the present disclosure do not limit this.
  • Figure 3 is a schematic diagram of a material video provided by an embodiment of the present disclosure. See Figure 3, including the first video.
  • the first video includes 3 frames of sky images and 3 frames of ocean images.
  • the first video is divided into two material videos based on the optical flow information of each frame of the image in the first video.
  • material video A includes 3 frames of sky images
  • material video B includes 3 frames of ocean images. In this way, images with similar content can be divided into the same material video through optical flow information, thereby improving the accuracy of obtaining the material video.
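As an illustration of the optical-flow-based splitting described above, the following is a minimal Python sketch assuming OpenCV. The thresholding heuristic, parameter values, and function name are illustrative assumptions, not part of the disclosure.

```python
import cv2
import numpy as np

def split_by_optical_flow(path: str, threshold: float = 8.0) -> list:
    """Return frame indices where a new material video likely begins."""
    cap = cv2.VideoCapture(path)
    ok, prev = cap.read()
    if not ok:
        return []
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    boundaries, idx = [0], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        idx += 1
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        # A large mean displacement suggests a content change between frames.
        if np.linalg.norm(flow, axis=2).mean() > threshold:
            boundaries.append(idx)
        prev_gray = gray
    cap.release()
    return boundaries
```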
  • the fused video features are used to indicate image features and audio features of adjacent material videos.
  • the fused video feature can be a feature obtained by fusing image features and audio features of adjacent material videos.
  • adjacent material videos may be determined based on the first video. For example, if the playback order of the material videos of the first video is: material video A, material video B, and material video C, then material video A and material video B are adjacent material videos, and material video B and material video C are adjacent material videos. Neighbor material video.
  • The fused video features corresponding to each pair of adjacent material videos can be determined as follows: determine the image features and audio features corresponding to each pair of adjacent material videos to obtain multiple image features and multiple audio features, and determine the fused video features from them. For example, if the first video includes material video A, material video B and material video C, then 2 image features and 2 audio features can be determined from adjacent material videos A and B, and another 2 image features and 2 audio features from adjacent material videos B and C. The electronic device can therefore determine, from the 4 image features and 4 audio features, the fused video feature between material video A and material video B, as well as the fused video feature between material video B and material video C.
  • The electronic device can obtain the multiple image features and multiple audio features as follows: obtain a first video segment in the first material video and a second video segment in the second material video, and determine the image features and audio features corresponding to the first material video and the second material video based on the two segments.
  • The first material video and the second material video can be any adjacent material videos among the multiple material videos in the first video; the first video segment is a video in the first material video, the second video segment is a video in the second material video, and the image features and audio features can be determined from these two segments.
  • the first material video is located before the second material video.
  • the first material video is adjacent to the second material video, and the playback order of the first material video is before the playback order of the second material video.
  • the first video segment is a video at the end of the first material video
  • the second video segment is a video at the beginning of the second material video.
  • the first video segment may be a 5-second video segment at the end of the first material video
  • the second video segment may be a 5-second video segment at the beginning of the second material video.
  • the first video segment is a video at the beginning of the first material video
  • the second video segment is a video at the end of the second material video.
  • the first video segment may be a 5-second video segment at the beginning of the first material video
  • the second video segment may be a 5-second video segment at the end of the second material video.
  • The lengths of the first video segment and the second video segment may be the same. For example, if the first video segment is a 5-second segment, the second video segment may also be a 5-second segment; if the first video segment is a 10-second segment, the second video segment may be a 10-second segment.
  • The lengths of the first video segment and the second video segment may also be different. For example, if the first video segment is a 5-second segment, the second video segment may be a 3-second segment; or if the first video segment is a 5-second segment, the second video segment may be a 10-second segment.
  • Optionally, the length of the first video segment may be determined based on the length of the first material video and a first preset ratio. For example, if the first material video is a 20-second video and the first preset ratio is 0.1, the first video segment is a 2-second video; if the first material video is a 30-second video and the first preset ratio is 0.5, the first video segment is a 15-second video.
  • Similarly, the length of the second video segment may be determined based on the length of the second material video and a second preset ratio. For example, if the second material video is a 10-second video and the second preset ratio is 0.3, the second video segment is a 3-second video; if the second material video is a 5-second video and the second preset ratio is 0.2, the second video segment is a 1-second video.
  • the lengths of the first video segment and the second video segment may also be preset lengths.
  • both the first video segment and the second video segment can be 5-second video segments.
  • The lengths of the first video segment and the second video segment can also be determined through other methods, which the embodiments of the present disclosure do not limit.
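For illustration, the segment-length rules above can be sketched as follows; the function and parameter names are hypothetical, not from the disclosure.

```python
from typing import Optional

def segment_seconds(material_seconds: float,
                    ratio: Optional[float] = None,
                    preset: float = 5.0) -> float:
    """Length of the clip taken from the end (or beginning) of a material
    video: a preset ratio of the material video's length if given,
    otherwise a preset length."""
    return material_seconds * ratio if ratio is not None else preset

# Examples from the text: 20 s * 0.1 = 2 s, 30 s * 0.5 = 15 s, fallback 5 s.
assert segment_seconds(20, 0.1) == 2.0
assert segment_seconds(30, 0.5) == 15.0
assert segment_seconds(42) == 5.0
```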
  • the image features and audio features corresponding to each adjacent material video can be obtained.
  • Figure 4 is a schematic diagram of a first video segment and a second video segment provided by an embodiment of the present disclosure. See Figure 4, including: First video.
  • the first video includes material video A and material video B.
  • Material video A is before material video B.
  • A video of a preset duration is taken from the end of material video A and determined as the first video segment.
  • A video of a preset duration is taken from the beginning of material video B and determined as the second video segment. Since the first video segment and the second video segment sit on either side of the junction between the two material videos, they can accurately reflect the content characteristics at the transition point, thereby improving the accuracy of determining the video transition effect and, in turn, the effect of video synthesis.
  • Video transition effects are special effects added when switching between different shots, for example where multiple material videos shot by different devices (or different content shot by the same device) are joined.
  • For example, the video transition effects may include effects such as wipe, dissolve and page curl.
  • the video transition special effects may also be any other special effects, which are not limited in the embodiments of the present disclosure.
  • FIG. 5 is a schematic diagram of a video transition effect provided by an embodiment of the present disclosure. See Figure 5, including: the first video. The first video plays the first material video, whose content is the letter A. When the first material video ends, it slides out to the left while the second material video, whose content is the letter B, slides in from the right. When the video transition effect (a slide effect) ends, the first video plays the second material video. In this way, the first material video and the second material video are connected through the slide effect, making playback smoother and improving the playback effect.
  • the transition effect characteristics are used to indicate the characteristics of the video transition effects.
  • the transition special effects feature can be a feature vector, and different video transition special effects correspond to different feature vectors.
  • Optionally, the multiple transition effect features corresponding to the multiple video transition effects can be obtained according to the following feasible implementation: obtain a special effects classification model corresponding to the multiple video transition effects.
  • the special effects classification model is used to classify multiple video transition special effects.
  • For example, a special effects classification model can classify 10 kinds of video transition effects, or 20 kinds. It should be noted that once the special effects classification model has been trained, the set of video transition effect categories it can classify is fixed; if new video transition effects need to be added, the model must be retrained.
  • the feature vector corresponding to each video transition special effect is obtained through the special effects classification model, and the feature vector is determined as the transition special effect feature.
  • the internal parameters of the special effects classification model include a feature vector corresponding to each video transition special effect, and the feature vector can be determined as the transition special effect feature.
  • Specifically, on the same data set, unit feature vectors are extracted for all video transition effects, and the feature vectors corresponding to each video transition effect are averaged to obtain a unique transition effect feature for each video transition effect. For example, if the special effects classification model can classify 30 types of video transition effects, the electronic device can obtain, through the model, the 30 feature vectors corresponding to those 30 types.
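A minimal sketch of this averaging step, assuming NumPy arrays of per-video embeddings; the names and the final re-normalization are illustrative assumptions.

```python
import numpy as np

def class_mean_features(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """embeddings: (N, D) per-video vectors; labels: (N,) class ids 0..K-1.
    Returns one unit feature vector per transition-effect class."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = int(labels.max()) + 1
    means = np.stack([unit[labels == c].mean(axis=0) for c in range(k)])
    # Re-normalize so each class feature is again a unit vector (an assumption).
    return means / np.linalg.norm(means, axis=1, keepdims=True)
```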
  • During training, the transition effects can be removed from already-edited videos (videos to which transition effects have been added), and the remaining videos can be used as training data (the transition effect labels can be obtained by parsing the editing template, or can be annotated manually, which the embodiments of the present disclosure do not limit) to train the special effects classification model.
  • The videos from before the transition effects were removed can also be used as training data, thereby reducing the workload of obtaining training samples and improving the efficiency of model training.
  • FIG. 6 is a schematic diagram of a transition effect feature provided by an embodiment of the present disclosure. See Figure 6, including: the special effects classification model. Multiple videos, each including a transition effect, are input into the model. A backbone network processes the videos to obtain features of each video. Optionally, the backbone network can be replaced with other network structures that can extract video features; the embodiments of the present disclosure do not limit this.
  • the features of the video are fused through a fully connected network, and the fused features are normalized and converted into unit vectors.
  • the unit vector is processed through a linear classifier to classify multiple transition effects.
  • the feature vectors are used as the transition special effects features of the associated video transition effects.
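The Figure 6 pipeline could be sketched in PyTorch roughly as follows; the layer sizes, dimensions, and all names are assumptions, and the backbone is left abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransitionClassifier(nn.Module):
    """Backbone -> fully connected fusion -> unit vector -> linear classifier."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 512,
                 embed_dim: int = 128, num_classes: int = 30):
        super().__init__()
        self.backbone = backbone              # any video feature extractor
        self.fuse = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, video: torch.Tensor):
        feat = self.backbone(video)                    # (B, feat_dim)
        embed = F.normalize(self.fuse(feat), dim=-1)   # unit vector per video
        return self.classifier(embed), embed           # logits + embedding
```

The unit-vector embeddings returned here are the ones that would be averaged per class, as in the previous sketch.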
  • The electronic device can determine the target video transition effect between adjacent material videos according to the following feasible implementation: obtain the first similarity between the fused video feature and each transition effect feature, obtaining multiple first similarities. For example, for the fused video feature corresponding to any pair of adjacent material videos, the cosine similarity or Euclidean distance between the fused video feature and each transition effect feature can be computed and determined as the first similarity. For example, the electronic device obtains transition effect feature A of video transition effect A and transition effect feature B of video transition effect B; the electronic device can then determine the cosine similarity between the fused video feature and transition effect feature A, and the cosine similarity between the fused video feature and transition effect feature B.
  • Then, the electronic device can obtain the largest first similarity among the multiple first similarities, and determine the video transition effect corresponding to it as the target video transition effect between the adjacent material videos corresponding to the fused video feature. For example, if the similarity between the fused video feature and transition effect A is 70%, and the similarity with transition effect B is 90%, then transition effect B is determined as the target video transition effect between the adjacent material videos corresponding to the fused video feature.
  • FIG. 7 is a schematic diagram of determining a target video transition special effect provided by an embodiment of the present disclosure. Please see Figure 7, including: transition effect A, transition effect B and transition effect C. It is determined that the characteristic of transition effect A is transition special effect characteristic A, the characteristic of transition special effect B is transition special effect characteristic B, and the characteristic of transition special effect C is transition special effect characteristic C. Obtain the similarity A between the transition special effect feature A and the fused video feature, the similarity B between the transition special effect feature B and the fused video feature, and the similarity C between the transition special effect feature C and the fused video feature. Since the similarity A is the maximum similarity, the transition effect A corresponding to the transition effect feature A is determined as the target video transition effect corresponding to the fused video feature.
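A minimal sketch of this similarity-and-argmax selection, assuming cosine similarity and NumPy; the function name is illustrative.

```python
import numpy as np

def pick_transition(fused: np.ndarray, effect_feats: np.ndarray) -> int:
    """fused: (D,) fused video feature; effect_feats: (K, D) transition
    effect features. Returns the index of the transition effect with the
    largest first similarity."""
    sims = effect_feats @ fused / (
        np.linalg.norm(effect_feats, axis=1) * np.linalg.norm(fused) + 1e-8)
    return int(np.argmax(sims))
```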
  • After the target video transition effects are determined, they can be set between the associated adjacent material videos, and the multiple material videos can then be spliced to determine the second video.
  • the first video includes material video A, material video B and material video C.
  • Material video A and material video B are adjacent videos
  • material video B and material video C are adjacent videos. If the target video transition effect between material video A and material video B is wipe, and the target video transition effect between material video B and material video C is page curl, then a wipe effect is added between material video A and material video B and a page curl effect is added between material video B and material video C, to obtain the second video.
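As a toy illustration of this splicing step, the following sketch interleaves material videos with the chosen transitions into an edit list; the list structure and names are assumptions, and actual rendering would be performed by a video editing engine.

```python
def build_edit_list(materials, transitions):
    """Interleave material video ids with the chosen transition names."""
    assert len(transitions) == len(materials) - 1
    timeline = []
    for i, clip in enumerate(materials):
        timeline.append(clip)
        if i < len(transitions):
            timeline.append("<transition:%s>" % transitions[i])
    return timeline

# build_edit_list(["A", "B", "C"], ["wipe", "page curl"])
# -> ["A", "<transition:wipe>", "B", "<transition:page curl>", "C"]
```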
  • Embodiments of the present disclosure provide a video processing method that obtains a first video including multiple material videos, determines the fused video features corresponding to each pair of adjacent material videos, obtains multiple transition effect features corresponding to multiple video transition effects, and obtains the first similarity between each fused video feature and each transition effect feature.
  • For each fused video feature, the largest first similarity is found, and the video transition effect corresponding to it is determined as the target video transition effect between the adjacent material videos corresponding to that fused video feature; the second video is then determined from the multiple material videos and the target video transition effects.
  • Because the fused video features can accurately indicate the video features of adjacent material videos, matching the fused video features against the transition effect features can accurately determine the video transition effects that best match the adjacent material video content, thereby improving the effect of video synthesis.
  • FIG. 8 is a schematic flowchart of a method for determining fused video features according to an embodiment of the present disclosure. Please refer to Figure 8. The method flow includes:
  • For the first material video and the second material video, the multiple image features and multiple audio features can be obtained according to the following feasible implementation: obtain the first video segment in the first material video and the second video segment in the second material video. It should be noted that the process of obtaining the first video segment and the second video segment may refer to step S202, and will not be described again in this embodiment of the disclosure.
  • the electronic device can determine the image features and audio features corresponding to the first material video and the second material video according to the following feasible implementation manner: obtain the first image feature and the first audio feature corresponding to the first video segment.
  • the first video segment includes images (video frames) and audio.
  • The images in the first video segment are processed through a feature extraction model (such as a backbone network or other neural network) to obtain the first image feature, and a feature extraction model processes the audio in the first video segment to obtain the first audio feature.
  • Similarly, the second video segment also includes images and audio. The images in the second video segment are processed through the feature extraction model to obtain the second image feature, and the audio in the second video segment is processed through the feature extraction model to obtain the second audio feature.
  • The first image feature and the second image feature are determined as the image features corresponding to the first material video and the second material video, and the first audio feature and the second audio feature are determined as the corresponding audio features.
  • For example, the electronic device obtains image feature A and audio feature A from the first video segment, and image feature B and audio feature B from the second video segment; image feature A and image feature B are then determined as the image features of the adjacent material videos, and audio feature A and audio feature B as their audio features. In this way, the image features and audio features corresponding to each pair of adjacent material videos can be obtained.
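A hedged sketch of extracting one segment's image and audio features with generic extractors; the model choices, the frame pooling, and the tensor shapes are assumptions for illustration.

```python
import torch

def segment_features(frames: torch.Tensor, audio: torch.Tensor,
                     image_model: torch.nn.Module,
                     audio_model: torch.nn.Module):
    """frames: (T, C, H, W) video frames; audio: (1, samples) waveform.
    Returns one image feature and one audio feature for the segment."""
    with torch.no_grad():
        img_feat = image_model(frames).mean(dim=0)   # average-pool over frames
        aud_feat = audio_model(audio).squeeze(0)
    return img_feat, aud_feat
```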
  • The fused video features corresponding to each pair of adjacent material videos can then be determined as follows: obtain the first position encoding of each image feature in the first video and the second position encoding of each audio feature in the first video.
  • the first position code is used to indicate the position of the image feature
  • the second position code is used to indicate the position of the audio feature.
  • the electronic device can obtain the associated position code based on the material video corresponding to the image features and audio features.
  • Based on these, the fused video features corresponding to each pair of adjacent material videos are determined. Specifically, a trained first model can process the multiple image features, the multiple audio features, the first position encodings and the second position encodings to obtain the fused feature corresponding to each pair of adjacent material videos.
  • the first model can be an encoder.
  • the encoder can fuse contextual information and multi-modal features, and merge features belonging to the same video transition effect to obtain fused video features corresponding to each adjacent material video.
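A minimal sketch of such an encoder in PyTorch, assuming the image and audio features have already been projected to a common dimension; the dimensions, the mean pooling, and all names are assumptions.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    """Transformer encoder that fuses image/audio tokens plus position encodings."""

    def __init__(self, dim: int = 128, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
        """tokens: (B, N, dim) image and audio features of an adjacent pair;
        pos: (B, N, dim) position encodings. Returns (B, dim) fused features."""
        out = self.encoder(tokens + pos)   # context-aware tokens
        return out.mean(dim=1)             # pool into one fused video feature
```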
  • Embodiments of the present disclosure provide a method for determining fused video features: determine the image features and audio features corresponding to each pair of adjacent material videos, obtaining multiple image features and multiple audio features, and determine the fused video features from them. Because the multiple image features and audio features reflect the multi-modal characteristics of the first video, and the first and second position encodings reflect its contextual information, the accuracy of the fused video features is relatively high, which can improve the effect of video synthesis.
  • FIG. 9 is a schematic process diagram of a video processing method provided by an embodiment of the present disclosure. See Figure 9, including: First video.
  • the first video includes earth images, sky images, ocean images, high-rise building images, etc.
  • The first video is split into multiple material videos according to its video content, and for each pair of adjacent material videos, the corresponding fused video features are obtained.
  • Multiple image features and multiple audio features are processed through the encoder to obtain multiple features that integrate contextual information.
  • the two image features and the two audio features corresponding to adjacent material videos are spliced to obtain the fused video features corresponding to each adjacent material video.
  • Transition effect A, transition effect B and transition effect C are the determined target video transition effects (other transition effects are not shown in the figure). Transition effect A is added between the earth image and the sky image, transition effect B between the sky image and the ocean image, and transition effect C between the ocean image and the high-rise building image, to obtain the second video.
  • In this way, the electronic device can automatically add transition effects between the material videos of the first video. Because the fused video features combine the image features and audio features of the adjacent material videos, matching them against the transition effect features provides a cross-modal retrieval capability that can accurately determine the video transition effects that best match the adjacent material video content and improve the effect of video synthesis.
  • FIG. 10 is a schematic structural diagram of a video processing device provided by an embodiment of the present disclosure.
  • the video processing device 10 includes a first acquisition module 11, a first determination module 12, a second acquisition module 13, a second determination module 14 and a third determination module 15, wherein:
  • the first acquisition module 11 is used to acquire a first video, where the first video includes multiple material videos;
  • the first determination module 12 is used to determine the fused video features corresponding to each adjacent material video, and the fused video features are used to indicate the image features and audio features of the adjacent material videos;
  • the second acquisition module 13 is used to acquire multiple transition special effect features corresponding to multiple video transition special effects
  • the second determination module 14 is configured to determine, among the multiple video transition effects and according to the fused video features and the multiple transition effect features, the target video transition effect between the adjacent material videos;
  • the third determination module 15 is configured to determine a second video based on the plurality of material videos and the target video transition effects.
  • the first determination module 12 is specifically used to:
  • the fused video features corresponding to each adjacent material video are determined.
  • the first determination module 12 is specifically used to:
  • image features and audio features corresponding to the first material video and the second material video are determined.
  • the first determination module 12 is specifically used to:
  • the first image feature and the second image feature are determined as the image features corresponding to the first material video and the second material video, and the first audio feature and the second audio feature are determined as the audio features corresponding to the first material video and the second material video.
  • the first determination module 12 is specifically used to:
  • the fused video feature corresponding to each adjacent material video is determined.
  • the first material video is located before the second material video
  • the first video segment is a video at the end of the first material video
  • the second video segment is a video at the beginning of the second material video.
  • the second determination module 14 is specifically used to:
  • the video transition special effect corresponding to the maximum first similarity is determined as the target video transition special effect between adjacent material videos corresponding to the fused video feature.
  • the second acquisition module 13 is specifically used to:
  • the feature vector corresponding to each video transition special effect is obtained through the special effects classification model, and the feature vector is determined as the transition special effect feature.
  • the video processing device provided in this embodiment can be used to execute the technical solutions of the above method embodiments. Its implementation principles and technical effects are similar, and will not be described again in this embodiment.
  • FIG. 11 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. See Figure 11, which shows a schematic structural diagram of an electronic device 1100 suitable for implementing an embodiment of the present disclosure.
  • the electronic device 1100 may be a terminal device or a server.
  • The terminal devices may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs) and vehicle-mounted terminals (such as vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • The electronic device 1100 may include a processing device (such as a central processing unit or a graphics processor) 1101, which may perform various appropriate actions and processing according to a program stored in read-only memory (ROM) 1102 or a program loaded from a storage device 1108 into random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for the operation of the electronic device 1100.
  • the processing device 1101, ROM 1102 and RAM 1103 are connected to each other via a bus 1104.
  • An input/output (I/O) interface 1105 is also connected to bus 1104.
  • Generally, the following devices can be connected to the I/O interface 1105: input devices 1106 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer and gyroscope; output devices 1107 including, for example, a liquid crystal display (LCD), a speaker and a vibrator; storage devices 1108 including, for example, a magnetic tape or a hard disk; and a communication device 1109.
  • the communication device 1109 may allow the electronic device 1100 to communicate wirelessly or wiredly with other devices to exchange data.
  • Although FIG. 11 illustrates an electronic device 1100 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided; more or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via communication device 1109, or from storage device 1108, or from ROM 1102.
  • When the computer program is executed by the processing device 1101, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code contained on a computer-readable medium can be transmitted using any appropriate medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • the computer-readable medium carries one or more programs.
  • When the one or more programs are executed by the electronic device, the electronic device performs the method shown in the above embodiments.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • The remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown one after another may actually execute substantially in parallel, or they may sometimes execute in the reverse order, depending on the functionality involved.
  • Each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure can be implemented in software or hardware.
  • the name of the unit does not constitute a limitation on the unit itself under certain circumstances.
  • the first acquisition unit can also be described as "the unit that acquires at least two Internet Protocol addresses.”
  • Exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGA), application specific integrated circuits (ASIC), application specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), and so on.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer disk, a hard drive, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • one or more embodiments of the present disclosure provide a video processing method, which method includes:
  • acquiring a first video, where the first video includes multiple material videos;
  • determining the fused video features corresponding to each pair of adjacent material videos, where the fused video features are used to indicate the image features and audio features of the adjacent material videos;
  • acquiring multiple transition effect features corresponding to multiple video transition effects;
  • determining, among the multiple video transition effects and according to the fused video features and the multiple transition effect features, the target video transition effect between the adjacent material videos;
  • determining a second video based on the multiple material videos and the target video transition effects.
  • determining the fused video features corresponding to each adjacent material video includes:
  • the fused video features corresponding to each adjacent material video are determined.
  • determining the image features and audio features corresponding to the first material video and the second material video include:
  • image features and audio features corresponding to the first material video and the second material video are determined.
  • determining the image features and audio features corresponding to the first material video and the second material video according to the first video segment and the second video segment includes:
  • the first image feature and the second image feature are determined as the image features corresponding to the first material video and the second material video, and the first audio feature and the second audio feature are determined as the audio features corresponding to the first material video and the second material video.
  • determining the fused video features corresponding to each adjacent material video based on the multiple image features and the multiple audio features includes:
  • the fused video feature corresponding to each adjacent material video is determined.
  • the first material video is located before the second material video
  • the first video segment is a video at the end of the first material video
  • the second video segment is a video at the beginning of the second material video.
  • determining, among the multiple video transition effects, the target video transition effect between the adjacent material videos includes:
  • the video transition special effect corresponding to the maximum first similarity is determined as the target video transition special effect between adjacent material videos corresponding to the fused video feature.
  • obtaining the transition effect characteristics corresponding to the video transition effect includes:
  • the feature vector corresponding to each video transition special effect is obtained through the special effects classification model, and the feature vector is determined as the transition special effect feature.
  • According to one or more embodiments of the present disclosure, a video processing device is provided, which includes a first acquisition module, a first determination module, a second acquisition module, a second determination module and a third determination module, wherein:
  • the first acquisition module is used to acquire a first video, where the first video includes multiple material videos;
  • the first determination module is used to determine the fused video features corresponding to each adjacent material video, and the fused video features are used to indicate the image features and audio features of the adjacent material videos;
  • the second acquisition module is used to acquire multiple transition special effect features corresponding to multiple video transition special effects
  • the second determination module is configured to determine the target video between the adjacent material videos among the plurality of video transition effects according to the fused video characteristics and the plurality of transition special effects characteristics. transition effects;
  • the third determination module is configured to determine a second video based on the plurality of material videos and the transition special effects of the target video.
  • the first determination module is specifically used to:
  • the fused video features corresponding to each adjacent material video are determined.
  • the first determination module is specifically used to:
  • image features and audio features corresponding to the first material video and the second material video are determined.
  • the first determination module is specifically used to:
  • the first image feature and the second image feature are determined as image features corresponding to the first material video and the second material video, and the first audio feature and the second audio feature are Determine the audio features corresponding to the first material video and the second material video.
  • the first determination module is specifically used to:
  • the fused video feature corresponding to each adjacent material video is determined.
  • the first material video is located before the second material video
  • the first video segment is a video at the end of the first material video
  • the second video segment is the Describe a video of the second material video title.
  • the second determination module is specifically used to:
  • the video transition special effect corresponding to the maximum first similarity is determined as the target video transition special effect between adjacent material videos corresponding to the fused video feature.
  • the second acquisition module is specifically used to:
  • the feature vector corresponding to each video transition special effect is obtained through the special effects classification model, and the feature vector is determined as the transition special effect feature.
  • embodiments of the present disclosure provide an electronic device, including: a processor and a memory;
  • the memory stores computer execution instructions
  • the processor executes the computer execution instructions stored in the memory, so that the at least one processor executes the above first aspect and the various possible video processing methods involved in the first aspect.
  • embodiments of the present disclosure provide a computer-readable storage medium.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • the processor executes the computer-executable instructions, the above first aspect and the first aspect are implemented.
  • Various aspects may involve the video processing methods.
  • embodiments of the present disclosure provide a computer program product, including a computer program.
  • the computer program When the computer program is executed by a processor, the computer program implements the above first aspect and various possible video processing methods involved in the first aspect.
  • embodiments of the present disclosure provide a computer program that, when executed by a processor, implements the above first aspect and various possible video processing methods involved in the first aspect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Studio Circuits (AREA)

Abstract

The present disclosure provides a video processing method and apparatus, and an electronic device. The method includes: acquiring a first video, where the first video includes a plurality of material videos; determining a fused video feature corresponding to each pair of adjacent material videos, where the fused video feature indicates image features and audio features of the adjacent material videos; acquiring a plurality of transition effect features corresponding to a plurality of video transition effects; determining, from the plurality of video transition effects according to the fused video features and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and determining a second video according to the plurality of material videos and the target video transition effects.

Description

Video processing method and apparatus, and electronic device
Cross-Reference to Related Applications
This application claims priority to Chinese Patent Application No. 202210806771.4, filed with the China National Intellectual Property Administration on July 8, 2022 and entitled "Video processing method and apparatus, and electronic device", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present disclosure relate to the technical fields of computer vision and artificial intelligence, and in particular to a video processing method and apparatus, and an electronic device.
Background
In video editing, an electronic device can merge a plurality of captured material videos into a single video, so transition effects need to be inserted between the material videos to improve the presentation of the video.
At present, transition effects can be added between a plurality of material videos through a transition effect template in the electronic device. For example, the transition effect template includes a plurality of preset transition effects, and the electronic device inserts the preset transition effects between the material videos one by one, in the order in which the material videos are arranged, to obtain the video into which the material videos are merged. However, the content of the material videos differs considerably, so inserting preset transition effects in a fixed order yields transitions that match the adjacent material videos poorly, which in turn degrades the quality of video synthesis.
Summary
The present disclosure provides a video processing method and apparatus, and an electronic device, to solve the technical problem of poor video synthesis quality in the prior art.
In a first aspect, the present disclosure provides a video processing method, including:
acquiring a first video, where the first video includes a plurality of material videos;
determining a fused video feature corresponding to each pair of adjacent material videos, where the fused video feature indicates image features and audio features of the adjacent material videos;
acquiring a plurality of transition effect features corresponding to a plurality of video transition effects;
determining, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and
determining a second video according to the plurality of material videos and the target video transition effects.
In a second aspect, the present disclosure provides a video processing apparatus, including a first acquisition module, a first determination module, a second acquisition module, a second determination module and a third determination module, where:
the first acquisition module is configured to acquire a first video, where the first video includes a plurality of material videos;
the first determination module is configured to determine a fused video feature corresponding to each pair of adjacent material videos, where the fused video feature indicates image features and audio features of the adjacent material videos;
the second acquisition module is configured to acquire a plurality of transition effect features corresponding to a plurality of video transition effects;
the second determination module is configured to determine, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and
the third determination module is configured to determine a second video according to the plurality of material videos and the target video transition effects.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor and a memory;
the memory stores computer-executable instructions; and
the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the video processing method according to the first aspect and its various possible implementations.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the video processing method according to the first aspect and its various possible implementations.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program which, when executed by a processor, implements the video processing method according to the first aspect and its various possible implementations.
In a sixth aspect, an embodiment of the present disclosure provides a computer program which, when executed by a processor, implements the video processing method according to the first aspect and its various possible implementations.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present disclosure or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present disclosure, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application scenario according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of material videos according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a first video segment and a second video segment according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a video transition effect according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of transition effect features according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of determining a target video transition effect according to an embodiment of the present disclosure;
FIG. 8 is a schematic flowchart of a method for determining a fused video feature according to an embodiment of the present disclosure;
FIG. 9 is a schematic process diagram of a video processing method according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
For ease of understanding, the concepts involved in the embodiments of the present disclosure are explained below.
Electronic device: a device having a wireless transceiver function. An electronic device may be deployed on land, including indoors or outdoors, handheld, wearable or vehicle-mounted, or deployed on the water surface (for example, on a ship). The electronic device may be a mobile phone, a tablet computer (Pad), a computer with a wireless transceiver function, a virtual reality (VR) electronic device, an augmented reality (AR) electronic device, a wireless terminal in industrial control, a vehicle-mounted electronic device, a wireless terminal in self driving, a wireless electronic device in remote medical, a wireless electronic device in a smart grid, a wireless electronic device in transportation safety, a wireless electronic device in a smart city, a wireless electronic device in a smart home, a wearable electronic device, or the like. The electronic device involved in the embodiments of the present disclosure may also be referred to as a terminal, user equipment (UE), an access electronic device, a vehicle-mounted terminal, an industrial control terminal, a UE unit, a UE station, a mobile station, a remote station, a remote electronic device, a mobile device, a UE electronic device, a wireless communication device, a UE agent, a UE apparatus, or the like. The electronic device may be fixed or mobile.
In the related art, an electronic device can merge a plurality of captured material videos into one video. Because the content of the material videos differs, transition effects need to be added between the material videos to improve the display quality of the video, so that the material videos play back smoothly. At present, transition effects can be added between material videos through a transition effect template in the electronic device. For example, the template includes a plurality of transition effects arranged in order, and the electronic device adds the corresponding transition effects between the material videos one by one according to the template. However, the content of the material videos differs considerably, and different content calls for different transitions; adding preset transition effects in a fixed order yields transitions that match the adjacent material videos poorly, which degrades the quality of video synthesis.
To solve the above technical problem, embodiments of the present disclosure provide a video processing method: acquire a first video including a plurality of material videos; determine the image feature and audio feature corresponding to each pair of adjacent material videos to obtain a plurality of image features and a plurality of audio features; determine, according to the plurality of image features and audio features, the fused video feature corresponding to each pair of adjacent material videos; obtain in advance, through model training, the transition effect features of a plurality of video transition effects; determine the similarity between the fused video feature of each pair of adjacent material videos and each transition effect feature; determine, according to the similarities, the video transition effect between each pair of adjacent material videos; and set the corresponding video transition effect between each pair of adjacent material videos to determine a second video. In this way, because the fused video feature of adjacent material videos combines the image features, the audio features and context information, it can accurately indicate the video characteristics of the adjacent material videos; through the fused video features and the maintained transition effect features, the video transition effect that best matches the content of the adjacent material videos can be determined accurately, thereby improving the quality of video synthesis.
Below, the application scenario of the embodiments of the present disclosure is described with reference to FIG. 1.
FIG. 1 is a schematic diagram of an application scenario according to an embodiment of the present disclosure. Referring to FIG. 1, a first video is shown. The first video includes material video A, material video B and material video C. Material video A precedes material video B, and material video B precedes material video C. Fused video feature A is obtained from material video A and material video B, and fused video feature B is obtained from material video B and material video C.
Referring to FIG. 1, N transition effect features are acquired, where each transition effect feature corresponds to a unique video transition effect. The similarity between fused video feature A and each transition effect feature is obtained, as is the similarity between fused video feature B and each transition effect feature. Because fused video feature A is most similar to transition effect feature 1, and fused video feature B is most similar to transition effect feature N, transition effect 1 corresponding to transition effect feature 1 and transition effect N corresponding to transition effect feature N are acquired.
Referring to FIG. 1, transition effect 1 is added between material video A and material video B, and transition effect N is added between material video B and material video C, thereby determining the second video. In this way, the electronic device can automatically add transition effects between the material videos of the first video; and because the fused video feature fuses the image features and audio features of the adjacent material videos, the video transition effect that best matches the content of the adjacent material videos can be determined accurately through the fused video features and the maintained transition effect features, thereby improving the quality of video synthesis.
The technical solutions of the present disclosure, and how they solve the above technical problem, are described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. The embodiments of the present disclosure are described below with reference to the accompanying drawings.
FIG. 2 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure. Referring to FIG. 2, the method may include:
S201: Acquire a first video.
The execution body of this embodiment of the present disclosure may be an electronic device, or a video processing apparatus provided in the electronic device. The video processing apparatus may be implemented by software, or by a combination of software and hardware.
Optionally, the first video includes a plurality of material videos. Optionally, the material videos may be multiple video segments captured by the electronic device, for example, segments whose content differs, such as a sky video, an ocean video and a person video. After capturing the material videos, the electronic device may splice them to obtain the first video.
Optionally, the electronic device may acquire the first video from a database. For example, the electronic device receives a video processing request that includes an identifier of the first video, and acquires, according to the identifier, the first video from the plurality of videos stored in the database.
Optionally, the electronic device may also receive the first video from another device. For example, the electronic device may receive a video sent by a server and determine it as the first video, or receive a video sent by another electronic device and determine it as the first video.
Optionally, after receiving the first video, the electronic device may obtain the plurality of material videos in the first video. For example, the electronic device may split the first video into multiple segments according to the optical flow information of the first video, with each segment being a material video of the first video; the electronic device may also obtain the material videos in other ways, such as model training, which is not limited in this embodiment of the present disclosure.
Below, the material videos in the first video are described with reference to FIG. 3.
FIG. 3 is a schematic diagram of material videos according to an embodiment of the present disclosure. Referring to FIG. 3, a first video is shown, which includes 3 frames of sky images and 3 frames of ocean images. The first video is split into two material videos according to the optical flow information of each frame: material video A includes the 3 sky frames, and material video B includes the 3 ocean frames. In this way, frames with similar content can be grouped into the same material video through optical flow information, improving the accuracy of material video extraction.
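As an illustration of the optical-flow-based splitting mentioned above, the following minimal sketch (assuming OpenCV with Farneback dense optical flow; the threshold value is an arbitrary assumption, not a value prescribed by the disclosure) cuts a video wherever the mean flow magnitude between consecutive frames spikes:

    # Minimal sketch: split a video into material videos by optical flow.
    import cv2
    import numpy as np

    def split_into_material_videos(path, flow_threshold=4.0):
        """Return a list of (start_frame, end_frame) spans, one per material video."""
        cap = cv2.VideoCapture(path)
        ok, prev = cap.read()
        prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
        cuts, idx = [0], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            idx += 1
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            # A large mean flow magnitude suggests a content change between frames.
            if np.linalg.norm(flow, axis=2).mean() > flow_threshold:
                cuts.append(idx)
            prev_gray = gray
        cap.release()
        cuts.append(idx + 1)
        # Each pair of consecutive cut points bounds one material video.
        return [(cuts[i], cuts[i + 1] - 1) for i in range(len(cuts) - 1)
                if cuts[i] <= cuts[i + 1] - 1]

A model-based shot detector could replace the thresholding without changing the rest of the pipeline.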
S202: Determine a fused video feature corresponding to each pair of adjacent material videos.
Optionally, the fused video feature indicates the image features and audio features of the adjacent material videos. For example, the fused video feature may be a feature obtained by fusing the image features and audio features of the adjacent material videos.
Optionally, the adjacent material videos may be determined from the first video. For example, if the playback order of the material videos of the first video is material video A, material video B, material video C, then material video A and material video B are adjacent, and material video B and material video C are adjacent.
Optionally, the fused video feature corresponding to each pair of adjacent material videos may be determined in the following feasible implementation: determine the image feature and audio feature corresponding to each pair of adjacent material videos to obtain a plurality of image features and a plurality of audio features, and determine, according to the plurality of image features and audio features, the fused video feature corresponding to each pair of adjacent material videos. For example, if the first video includes material video A, material video B and material video C, then 2 image features and 2 audio features can be determined from adjacent material videos A and B, and 2 image features and 2 audio features from adjacent material videos B and C; the electronic device can therefore determine, from the 4 image features and 4 audio features, the fused video feature between adjacent material videos A and B and the fused video feature between adjacent material videos B and C.
Optionally, for any adjacent first material video and second material video, the electronic device may obtain the plurality of image features and audio features in the following feasible implementation: acquire a first video segment in the first material video and a second video segment in the second material video, and determine, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video. For example, the first material video and the second material video may be any adjacent material videos among the plurality of material videos in the first video, the first video segment is a segment of the first material video, the second video segment is a segment of the second material video, and the audio features and video features can be determined from the two segments.
Optionally, the first material video precedes the second material video. For example, in the first video, the first material video is adjacent to the second material video and is played before it.
Optionally, if the first material video precedes the second material video, the first video segment is a segment at the end of the first material video, and the second video segment is a segment at the beginning of the second material video. For example, the first video segment may be the last 5 seconds of the first material video, and the second video segment may be the first 5 seconds of the second material video.
Optionally, if the first material video follows the second material video, the first video segment is a segment at the beginning of the first material video, and the second video segment is a segment at the end of the second material video. For example, the first video segment may be the first 5 seconds of the first material video, and the second video segment may be the last 5 seconds of the second material video.
Optionally, the first video segment and the second video segment may have the same length: for example, both 5 seconds, or both 10 seconds. Optionally, the two segments may also differ in length: for example, a 5-second first video segment with a 3-second second video segment, or a 5-second first video segment with a 10-second second video segment.
Optionally, the length of the first video segment may be determined according to the length of the first material video and a first preset ratio. For example, if the first material video is 20 seconds long and the first preset ratio is 0.1, the first video segment is 2 seconds long; if the first material video is 30 seconds long and the first preset ratio is 0.5, the first video segment is 15 seconds long.
Optionally, the length of the second video segment may be determined according to the length of the second material video and a second preset ratio. For example, if the second material video is 10 seconds long and the second preset ratio is 0.3, the second video segment is 3 seconds long; if the second material video is 5 seconds long and the second preset ratio is 0.2, the second video segment is 1 second long.
Optionally, the lengths of the first video segment and the second video segment may also be preset lengths. For example, both may be 5-second segments, and when the first material video or the second material video is shorter than 5 seconds, other methods are used to determine the segment lengths; the lengths of the first video segment and the second video segment may also be determined in other ways, which is not limited in this embodiment of the present disclosure.
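Taken together, the segment rules above might be implemented as in the following sketch; the function name, the default values and the short-video fallback policy are illustrative assumptions rather than anything prescribed by the disclosure:

    # Illustrative computation of the first/second video segment boundaries
    # (times in seconds). Defaults and the short-video fallback are assumptions.
    def adjacent_segments(first_video, second_video,
                          first_ratio=0.1, second_ratio=0.1, preset_len=5.0):
        """first_video / second_video are (start, end) times of two adjacent
        material videos, with the first one played before the second."""
        def clip_len(duration, ratio):
            # Prefer the preset length; fall back to a ratio of the material
            # video when it is too short to supply the full preset length.
            return preset_len if duration >= preset_len else duration * ratio

        f_start, f_end = first_video
        s_start, s_end = second_video
        first_len = clip_len(f_end - f_start, first_ratio)
        second_len = clip_len(s_end - s_start, second_ratio)
        first_segment = (f_end - first_len, f_end)        # tail of first video
        second_segment = (s_start, s_start + second_len)  # head of second video
        return first_segment, second_segment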
According to the above method, the image features and audio features corresponding to each pair of adjacent material videos can be obtained.
Below, the process of determining the first video segment and the second video segment is described with reference to FIG. 4.
FIG. 4 is a schematic diagram of a first video segment and a second video segment according to an embodiment of the present disclosure. Referring to FIG. 4, a first video is shown, which includes material video A and material video B, with material video A preceding material video B. A segment of preset length is clipped from the end of material video A and determined as the first video segment, and a segment of preset length is clipped from the beginning of material video B and determined as the second video segment. Because the first video segment and the second video segment are close to each other in position, they can accurately reflect the content characteristics between the material videos, improving the accuracy of determining the video transition effect and the quality of video synthesis.
S203: Acquire a plurality of transition effect features corresponding to a plurality of video transition effects.
Optionally, a video transition effect is an effect added at the switch between different shots. For example, in video editing, multiple material videos are shot by different capture devices (or are different content shot by the same capture device); to avoid poor continuity when the material videos are merged, video transition effects may be added between different material videos to improve the merged video. For example, video transition effects may include wipes, dissolves, page curls and the like; a video transition effect may also be any other effect, which is not limited in this embodiment of the present disclosure.
Below, video transition effects are described with reference to FIG. 5.
FIG. 5 is a schematic diagram of a video transition effect according to an embodiment of the present disclosure. Referring to FIG. 5, a first video is shown. The first video plays the first material video, whose content is the letter A. When the first material video finishes playing, it slides to the left while the second material video, whose content is the letter B, slides in from the right. When the video transition effect (a slide effect) ends, the first video plays the second material video. In this way, the first and second material videos are joined by the slide effect, making playback smoother and improving the viewing experience.
Optionally, the transition effect feature indicates the characteristics of a video transition effect. For example, the transition effect feature may be a feature vector, with different video transition effects corresponding to different feature vectors.
Optionally, the plurality of transition effect features corresponding to the plurality of video transition effects may be acquired in the following feasible implementation: acquire an effect classification model corresponding to the plurality of video transition effects. Optionally, the effect classification model is used to classify the plurality of video transition effects. For example, the effect classification model may classify 10 kinds of video transition effects, or 20 kinds. It should be noted that once the effect classification model has been trained, the set of video transition effects it classifies is fixed; if a new video transition effect needs to be added, the effect classification model needs to be retrained.
The feature vector corresponding to each video transition effect is obtained through the effect classification model and determined as the transition effect feature. For example, after the effect classification model has been trained, its internal parameters include a feature vector corresponding to each video transition effect, and that feature vector may be determined as the transition effect feature. For example, after training, the intermediate unit vectors of all video transition effects are extracted as features on the same dataset, and the feature vectors corresponding to each video transition effect are averaged, yielding a single transition effect feature unique to each video transition effect. For example, if the effect classification model can classify 30 kinds of video transition effects, the electronic device can obtain, through the model, 30 feature vectors corresponding to the 30 video transition effects.
Optionally, training the effect classification model requires a sufficient amount of training data. Therefore, transition effects may be removed from already-edited videos (videos to which transition types have been added), and the remaining videos used as training data (the transition effect labels may be obtained by parsing the editing template, or by manual annotation, which is not limited in this embodiment of the present disclosure). Optionally, because removing the transition effects changes the video length only slightly, the videos before removal may also be used as training data, reducing the workload of obtaining training samples and improving the efficiency of model training.
Below, the process of determining the transition effect features is described with reference to FIG. 6.
FIG. 6 is a schematic diagram of transition effect features according to an embodiment of the present disclosure. Referring to FIG. 6, an effect classification model is shown. A plurality of videos, each containing a transition effect, are input into the effect classification model. A backbone network processes the videos to obtain the features of each video; optionally, the backbone may be replaced with any other network structure capable of extracting video features, which is not limited in this embodiment of the present disclosure.
Referring to FIG. 6, the video features are fused through a fully connected network, and the fused features are normalized into unit vectors. A linear classifier processes the unit vectors to classify the plurality of transition effects. After the effect classification model has been trained, the intermediate feature vectors are used as the transition effect features of the associated video transition effects.
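The per-class averaging of unit vectors described above might look like the following sketch; the array shapes and the use of penultimate-layer embeddings from the trained classifier are assumptions for illustration:

    # Illustrative extraction of one transition effect feature per class:
    # L2-normalize every training clip's embedding, then average per class.
    import numpy as np

    def transition_effect_features(embeddings, labels, num_classes):
        """embeddings: (N, D) clip embeddings from the trained model;
        labels: (N,) transition class ids. Returns (num_classes, D)."""
        unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        features = np.zeros((num_classes, unit.shape[1]))
        for c in range(num_classes):
            features[c] = unit[labels == c].mean(axis=0)
        # Re-normalize so each class feature is itself a unit vector, which
        # later makes cosine similarity a plain dot product.
        return features / np.linalg.norm(features, axis=1, keepdims=True)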
S204: Determine, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos.
Optionally, the electronic device may determine the target video transition effect between each pair of adjacent material videos in the following feasible implementation: obtain the first similarity between the fused video feature and each transition effect feature, to obtain a plurality of first similarities. For example, for the fused video feature corresponding to any pair of adjacent material videos, the cosine similarity or Euclidean distance between that fused video feature and each transition effect feature may be obtained and determined as the first similarity. For example, the electronic device acquires transition effect feature A of video transition effect A and transition effect feature B of video transition effect B, and can then determine the cosine similarity between the fused video feature and transition effect feature A, and between the fused video feature and transition effect feature B.
Optionally, the electronic device may obtain the maximum first similarity among the plurality of first similarities, and determine the video transition effect corresponding to the maximum first similarity as the target video transition effect between the adjacent material videos corresponding to the fused video feature. For example, if the similarity between the fused video feature and transition effect A is 70% and the similarity between the fused video feature and transition effect B is 90%, transition effect B is determined as the target video transition effect between the adjacent material videos corresponding to the fused video feature.
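A minimal sketch of this retrieval step, assuming the fused video feature and the transition effect features are unit-normalized so that cosine similarity reduces to a dot product:

    import numpy as np

    def pick_transition(fused_feature, effect_features):
        """fused_feature: (D,) unit vector for one pair of adjacent videos;
        effect_features: (C, D), one unit vector per video transition effect.
        Returns the index of the target video transition effect."""
        similarities = effect_features @ fused_feature  # cosine similarities
        return int(np.argmax(similarities))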
Below, the process of determining the target video transition effect is described with reference to FIG. 7.
FIG. 7 is a schematic diagram of determining a target video transition effect according to an embodiment of the present disclosure. Referring to FIG. 7, transition effect A, transition effect B and transition effect C are shown, and their features are determined as transition effect feature A, transition effect feature B and transition effect feature C respectively. Similarity A between transition effect feature A and the fused video feature, similarity B between transition effect feature B and the fused video feature, and similarity C between transition effect feature C and the fused video feature are obtained. Because similarity A is the largest, transition effect A corresponding to transition effect feature A is determined as the target video transition effect corresponding to the fused video feature.
S205: Determine a second video according to the plurality of material videos and the target video transition effects.
Optionally, each target video transition effect may be placed between its associated adjacent material videos, and the plurality of material videos then spliced to determine the second video. For example, the first video includes material video A, material video B and material video C, with A adjacent to B and B adjacent to C. If the target video transition effect between A and B is a wipe and the target video transition effect between B and C is a page curl, then when merging the three material videos, a wipe effect is added between material videos A and B and a page curl effect between material videos B and C, yielding the second video.
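The splicing step could be sketched as follows; the apply_transition helper and its signature are hypothetical placeholders for whatever rendering routine the editor actually uses:

    def assemble_second_video(material_videos, transitions, apply_transition):
        """material_videos: list of clips in playback order;
        transitions: chosen effects, one per adjacent pair, so
        len(transitions) == len(material_videos) - 1;
        apply_transition(clip_a, clip_b, effect): hypothetical renderer that
        joins two clips with the given effect and returns the merged clip."""
        result = material_videos[0]
        for clip, effect in zip(material_videos[1:], transitions):
            result = apply_transition(result, clip, effect)
        return result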
This embodiment of the present disclosure provides a video processing method: acquire a first video including a plurality of material videos; determine the fused video feature corresponding to each pair of adjacent material videos; acquire a plurality of transition effect features corresponding to a plurality of video transition effects; obtain the first similarity between each fused video feature and each transition effect feature; obtain the maximum first similarity among the plurality of first similarities; determine the video transition effect corresponding to the maximum first similarity as the target video transition effect between the adjacent material videos corresponding to the fused video feature; and determine a second video according to the plurality of material videos and the plurality of target video transition effects. In the above method, because the fused video feature accurately indicates the video characteristics of the adjacent material videos, the video transition effect that best matches the content of the adjacent material videos can be determined accurately through the fused video features and the transition effect features, thereby improving the quality of video synthesis.
On the basis of the embodiment shown in FIG. 2, the method for determining the fused video feature corresponding to each pair of adjacent material videos in the above video processing method is described below with reference to FIG. 8.
FIG. 8 is a schematic flowchart of a method for determining a fused video feature according to an embodiment of the present disclosure. Referring to FIG. 8, the method includes:
S801: Determine the image feature and audio feature corresponding to each pair of adjacent material videos, to obtain a plurality of image features and a plurality of audio features.
Optionally, for any adjacent first material video and second material video, the plurality of image features and audio features may be obtained in the following feasible implementation: acquire a first video segment in the first material video and a second video segment in the second material video. It should be noted that for the process of acquiring the first video segment and the second video segment, reference may be made to step S202, and details are not repeated here.
Determine, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video. Optionally, the electronic device may determine these features in the following feasible implementation: acquire the first image feature and the first audio feature corresponding to the first video segment. For example, the first video segment includes images (video frames) and audio; a feature extraction model (for example, a backbone network or a neural network) processes the images of the first video segment to obtain the first image feature, and processes the audio of the first video segment to obtain the first audio feature.
Acquire the second image feature and the second audio feature corresponding to the second video segment. For example, the second video segment also includes images and audio; the feature extraction model processes the images of the second video segment to obtain the second image feature, and processes the audio of the second video segment to obtain the second audio feature.
Determine the first image feature and the second image feature as the image features corresponding to the first material video and the second material video, and determine the first audio feature and the second audio feature as the corresponding audio features. For example, the electronic device obtains image feature A and audio feature A from the first video segment and image feature B and audio feature B from the second video segment, then determines image features A and B as the image features of the adjacent material videos and audio features A and B as their audio features. In this way, the image features and audio features corresponding to each pair of adjacent material videos can be obtained.
S802: Determine, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos.
Optionally, the fused video feature corresponding to each pair of adjacent material videos may be determined in the following feasible implementation: obtain the first positional encoding of each image feature in the first video and the second positional encoding of each audio feature in the first video. Optionally, the first positional encoding indicates the position of the image feature, and the second positional encoding indicates the position of the audio feature. For example, the electronic device may obtain the associated positional encoding according to the material video to which the image feature or audio feature corresponds.
Determine, according to the plurality of image features, the plurality of audio features, the first positional encodings and the second positional encodings, the fused video feature corresponding to each pair of adjacent material videos. Optionally, a trained first model may process the plurality of image features, the plurality of audio features, the first positional encodings and the second positional encodings to obtain the fused feature corresponding to each pair of adjacent material videos. For example, the first model may be an encoder; the encoder can fuse the context information and the multimodal features and merge the features belonging to the same video transition effect, thereby obtaining the fused video feature corresponding to each pair of adjacent material videos.
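One way to realize this fusion step is sketched below in PyTorch, assuming learned position and modality embeddings are added to the image/audio tokens before a Transformer encoder; the dimensions, layer counts and the four-token layout per adjacent pair are illustrative assumptions:

    import torch
    import torch.nn as nn

    class FusionEncoder(nn.Module):
        """Fuses image/audio segment features with position and modality
        embeddings, then concatenates each adjacent pair's tokens."""
        def __init__(self, dim=256, max_positions=64, num_layers=2):
            super().__init__()
            self.pos_embed = nn.Embedding(max_positions, dim)  # position in first video
            self.mod_embed = nn.Embedding(2, dim)              # 0 = image, 1 = audio
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, tokens, positions, modalities):
            """tokens: (B, T, dim); positions / modalities: (B, T) int tensors."""
            x = tokens + self.pos_embed(positions) + self.mod_embed(modalities)
            x = self.encoder(x)  # context-fused multimodal features
            # Concatenate the 2 image tokens and 2 audio tokens of each
            # adjacent pair into one fused video feature (assumes T == 4).
            return x.reshape(x.size(0), -1)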
This embodiment of the present disclosure provides a method for determining a fused video feature: determine the image feature and audio feature corresponding to each pair of adjacent material videos to obtain a plurality of image features and a plurality of audio features, and determine, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos. Because the plurality of image features and audio features reflect the multimodal characteristics of the first video, and the first and second positional encodings reflect the context information of the first video, the fused video feature is highly accurate, which in turn improves the quality of video synthesis.
On the basis of any one of the above embodiments, the process of the above video processing method is described below with reference to FIG. 9.
FIG. 9 is a schematic process diagram of a video processing method according to an embodiment of the present disclosure. Referring to FIG. 9, a first video is shown, which includes ground images, sky images, ocean images, building images and the like. The first video is split into a plurality of material videos according to its video content, and the fused video feature corresponding to each pair of adjacent material videos is acquired. Taking the sky images and the ocean images in FIG. 9 as an example (the sky images and the ocean images lie in different material videos), the images and audio of the sky video segment and the images and audio of the ocean video segment are acquired.
Referring to FIG. 9, a backbone network processes the 2 images and the 2 audio tracks, the processing results are linearly projected, and positional encodings and modality encodings are added to the output, yielding 2 image features corresponding to the 2 images and 2 audio features corresponding to the 2 audio tracks. It should be noted that in the embodiment shown in FIG. 9, the plurality of image features and audio features of the other adjacent material videos can be obtained by the same method.
Referring to FIG. 9, an encoder processes the plurality of image features and the plurality of audio features to obtain a plurality of context-fused features. The 2 image features and 2 audio features corresponding to each pair of adjacent material videos are concatenated to obtain the fused video feature of that pair. The transition effect features are acquired, and the similarity between each fused video feature and each transition effect feature is determined.
Referring to FIG. 9, according to the similarities between the fused video features and each transition effect feature, transition effect A, transition effect B and transition effect C are determined as the target video transition effects (the other transition effects are not shown in the figure). Transition effect A is added between the ground images and the sky images, transition effect B between the sky images and the ocean images, and transition effect C between the ocean images and the building images, yielding the second video. It should be noted that the above process may be implemented by a model; to optimize the similarity between the fused video features extracted from the material and the video transition effects, a triplet loss may be used as the loss function to update the parameters of the model.
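The triplet objective mentioned above might be sketched as follows, assuming the anchor is the fused video feature, the positive is the feature of the transition effect actually used in the training clip, the negative is the feature of a different effect, and the margin value is an assumption:

    import torch
    import torch.nn.functional as F

    def transition_triplet_loss(fused, positive, negative, margin=0.2):
        """fused: (B, D) fused video features (anchors);
        positive: (B, D) features of the transitions actually used;
        negative: (B, D) features of other transitions.
        Pulls each fused feature toward its matching effect in cosine space."""
        fused = F.normalize(fused, dim=1)
        positive = F.normalize(positive, dim=1)
        negative = F.normalize(negative, dim=1)
        pos_sim = (fused * positive).sum(dim=1)
        neg_sim = (fused * negative).sum(dim=1)
        return F.relu(margin - pos_sim + neg_sim).mean()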
According to the above method, the electronic device can automatically add transition effects between the material videos of the first video; and because the fused video feature fuses the image features and audio features of the adjacent material videos, cross-modal retrieval is achieved through the fused video features and the maintained transition effect features, so the video transition effect that best matches the content of the adjacent material videos can be determined accurately, improving the quality of video synthesis.
FIG. 10 is a schematic structural diagram of a video processing apparatus according to an embodiment of the present disclosure. Referring to FIG. 10, the video processing apparatus 10 includes a first acquisition module 11, a first determination module 12, a second acquisition module 13, a second determination module 14 and a third determination module 15, where:
the first acquisition module 11 is configured to acquire a first video, where the first video includes a plurality of material videos;
the first determination module 12 is configured to determine a fused video feature corresponding to each pair of adjacent material videos, where the fused video feature indicates image features and audio features of the adjacent material videos;
the second acquisition module 13 is configured to acquire a plurality of transition effect features corresponding to a plurality of video transition effects;
the second determination module 14 is configured to determine, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and
the third determination module 15 is configured to determine a second video according to the plurality of material videos and the target video transition effects.
In a possible implementation, the first determination module 12 is specifically configured to:
determine the image feature and audio feature corresponding to each pair of adjacent material videos, to obtain a plurality of image features and a plurality of audio features; and
determine, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos.
In a possible implementation, the first determination module 12 is specifically configured to:
acquire a first video segment in the first material video and a second video segment in the second material video; and
determine, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video.
In a possible implementation, the first determination module 12 is specifically configured to:
acquire a first image feature and a first audio feature corresponding to the first video segment;
acquire a second image feature and a second audio feature corresponding to the second video segment; and
determine the first image feature and the second image feature as the image features corresponding to the first material video and the second material video, and determine the first audio feature and the second audio feature as the audio features corresponding to the first material video and the second material video.
In a possible implementation, the first determination module 12 is specifically configured to:
obtain a first positional encoding of each image feature in the first video and a second positional encoding of each audio feature in the first video; and
determine, according to the plurality of image features, the plurality of audio features, the first positional encodings and the second positional encodings, the fused video feature corresponding to each pair of adjacent material videos.
In a possible implementation, the first material video precedes the second material video, the first video segment is a segment at the end of the first material video, and the second video segment is a segment at the beginning of the second material video.
In a possible implementation, the second determination module 14 is specifically configured to:
obtain a first similarity between the fused video feature and each transition effect feature, to obtain a plurality of first similarities;
obtain the maximum first similarity among the plurality of first similarities; and
determine the video transition effect corresponding to the maximum first similarity as the target video transition effect between the adjacent material videos corresponding to the fused video feature.
In a possible implementation, the second acquisition module 13 is specifically configured to:
acquire an effect classification model corresponding to the plurality of video transition effects, where the effect classification model is used to classify the plurality of video transition effects; and
obtain, through the effect classification model, the feature vector corresponding to each video transition effect, and determine the feature vector as the transition effect feature.
The video processing apparatus provided in this embodiment can be used to execute the technical solutions of the above method embodiments; its implementation principles and technical effects are similar and are not repeated here.
FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring to FIG. 11, it shows a schematic structural diagram of an electronic device 1100 suitable for implementing embodiments of the present disclosure. The electronic device 1100 may be a terminal device or a server. The terminal device may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (Portable Android Device, PAD), a portable multimedia player (PMP) and a vehicle-mounted terminal (for example, a vehicle-mounted navigation terminal), and fixed terminals such as a digital TV and a desktop computer. The electronic device shown in FIG. 11 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 11, the electronic device 1100 may include a processing apparatus (for example, a central processing unit or a graphics processing unit) 1101, which may perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 1102 or a program loaded from a storage apparatus 1108 into a random access memory (RAM) 1103. The RAM 1103 also stores various programs and data required for the operation of the electronic device 1100. The processing apparatus 1101, the ROM 1102 and the RAM 1103 are connected to one another via a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
Generally, the following apparatuses may be connected to the I/O interface 1105: an input apparatus 1106 including, for example, a touchscreen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer and a gyroscope; an output apparatus 1107 including, for example, a liquid crystal display (LCD), a speaker and a vibrator; a storage apparatus 1108 including, for example, a magnetic tape and a hard disk; and a communication apparatus 1109. The communication apparatus 1109 may allow the electronic device 1100 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 11 shows the electronic device 1100 with various apparatuses, it should be understood that not all of the illustrated apparatuses are required to be implemented or provided; more or fewer apparatuses may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication apparatus 1109, or installed from the storage apparatus 1108, or installed from the ROM 1102. When the computer program is executed by the processing apparatus 1101, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, apparatus or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; it can send, propagate or transmit a program for use by or in combination with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted by any suitable medium, including but not limited to: a wire, an optical cable, radio frequency (RF) and the like, or any suitable combination of the above.
The above computer-readable medium may be included in the above electronic device, or may exist alone without being assembled into the electronic device.
The above computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions and operations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment or a part of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by software or by hardware. The name of a unit does not in some cases constitute a limitation on the unit itself; for example, the first acquisition unit may also be described as "a unit for acquiring at least two Internet Protocol addresses".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in combination with an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the above. More specific examples of the machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
In a first aspect, one or more embodiments of the present disclosure provide a video processing method, including:
acquiring a first video, where the first video includes a plurality of material videos;
determining a fused video feature corresponding to each pair of adjacent material videos, where the fused video feature indicates image features and audio features of the adjacent material videos;
acquiring a plurality of transition effect features corresponding to a plurality of video transition effects;
determining, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and
determining a second video according to the plurality of material videos and the target video transition effects.
According to one or more embodiments of the present disclosure, the determining a fused video feature corresponding to each pair of adjacent material videos includes:
determining the image feature and audio feature corresponding to each pair of adjacent material videos, to obtain a plurality of image features and a plurality of audio features; and
determining, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos.
According to one or more embodiments of the present disclosure, for any adjacent first material video and second material video, the determining the image features and audio features corresponding to the first material video and the second material video includes:
acquiring a first video segment in the first material video and a second video segment in the second material video; and
determining, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video.
According to one or more embodiments of the present disclosure, the determining, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video includes:
acquiring a first image feature and a first audio feature corresponding to the first video segment;
acquiring a second image feature and a second audio feature corresponding to the second video segment; and
determining the first image feature and the second image feature as the image features corresponding to the first material video and the second material video, and determining the first audio feature and the second audio feature as the audio features corresponding to the first material video and the second material video.
According to one or more embodiments of the present disclosure, the determining, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos includes:
obtaining a first positional encoding of each image feature in the first video and a second positional encoding of each audio feature in the first video; and
determining, according to the plurality of image features, the plurality of audio features, the first positional encodings and the second positional encodings, the fused video feature corresponding to each pair of adjacent material videos.
According to one or more embodiments of the present disclosure, the first material video precedes the second material video, the first video segment is a segment at the end of the first material video, and the second video segment is a segment at the beginning of the second material video.
According to one or more embodiments of the present disclosure, the determining, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, the target video transition effect between each pair of adjacent material videos includes:
obtaining a first similarity between the fused video feature and each transition effect feature, to obtain a plurality of first similarities;
obtaining the maximum first similarity among the plurality of first similarities; and
determining the video transition effect corresponding to the maximum first similarity as the target video transition effect between the adjacent material videos corresponding to the fused video feature.
According to one or more embodiments of the present disclosure, the acquiring transition effect features corresponding to the video transition effects includes:
acquiring an effect classification model corresponding to the plurality of video transition effects, where the effect classification model is used to classify the plurality of video transition effects; and
obtaining, through the effect classification model, the feature vector corresponding to each video transition effect, and determining the feature vector as the transition effect feature.
In a second aspect, one or more embodiments of the present disclosure provide a video processing apparatus, including a first acquisition module, a first determination module, a second acquisition module, a second determination module and a third determination module, where:
the first acquisition module is configured to acquire a first video, where the first video includes a plurality of material videos;
the first determination module is configured to determine a fused video feature corresponding to each pair of adjacent material videos, where the fused video feature indicates image features and audio features of the adjacent material videos;
the second acquisition module is configured to acquire a plurality of transition effect features corresponding to a plurality of video transition effects;
the second determination module is configured to determine, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and
the third determination module is configured to determine a second video according to the plurality of material videos and the target video transition effects.
In a possible implementation, the first determination module is specifically configured to:
determine the image feature and audio feature corresponding to each pair of adjacent material videos, to obtain a plurality of image features and a plurality of audio features; and
determine, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos.
In a possible implementation, the first determination module is specifically configured to:
acquire a first video segment in the first material video and a second video segment in the second material video; and
determine, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video.
In a possible implementation, the first determination module is specifically configured to:
acquire a first image feature and a first audio feature corresponding to the first video segment;
acquire a second image feature and a second audio feature corresponding to the second video segment; and
determine the first image feature and the second image feature as the image features corresponding to the first material video and the second material video, and determine the first audio feature and the second audio feature as the audio features corresponding to the first material video and the second material video.
In a possible implementation, the first determination module is specifically configured to:
obtain a first positional encoding of each image feature in the first video and a second positional encoding of each audio feature in the first video; and
determine, according to the plurality of image features, the plurality of audio features, the first positional encodings and the second positional encodings, the fused video feature corresponding to each pair of adjacent material videos.
In a possible implementation, the first material video precedes the second material video, the first video segment is a segment at the end of the first material video, and the second video segment is a segment at the beginning of the second material video.
In a possible implementation, the second determination module is specifically configured to:
obtain a first similarity between the fused video feature and each transition effect feature, to obtain a plurality of first similarities;
obtain the maximum first similarity among the plurality of first similarities; and
determine the video transition effect corresponding to the maximum first similarity as the target video transition effect between the adjacent material videos corresponding to the fused video feature.
In a possible implementation, the second acquisition module is specifically configured to:
acquire an effect classification model corresponding to the plurality of video transition effects, where the effect classification model is used to classify the plurality of video transition effects; and
obtain, through the effect classification model, the feature vector corresponding to each video transition effect, and determine the feature vector as the transition effect feature.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including a processor and a memory;
the memory stores computer-executable instructions; and
the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the video processing method according to the first aspect and its various possible implementations.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the video processing method according to the first aspect and its various possible implementations.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product including a computer program which, when executed by a processor, implements the video processing method according to the first aspect and its various possible implementations.
In a sixth aspect, an embodiment of the present disclosure provides a computer program which, when executed by a processor, implements the video processing method according to the first aspect and its various possible implementations.
The above description is merely an illustration of the preferred embodiments of the present disclosure and the technical principles applied. A person skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment; conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (13)

  1. A video processing method, comprising:
    acquiring a first video, wherein the first video comprises a plurality of material videos;
    determining a fused video feature corresponding to each pair of adjacent material videos, wherein the fused video feature indicates image features and audio features of the adjacent material videos;
    acquiring a plurality of transition effect features corresponding to a plurality of video transition effects;
    determining, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and
    determining a second video according to the plurality of material videos and the target video transition effects.
  2. The method according to claim 1, wherein the determining a fused video feature corresponding to each pair of adjacent material videos comprises:
    determining the image feature and audio feature corresponding to each pair of adjacent material videos, to obtain a plurality of image features and a plurality of audio features; and
    determining, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos.
  3. The method according to claim 2, wherein for any adjacent first material video and second material video, the determining the image features and audio features corresponding to the first material video and the second material video comprises:
    acquiring a first video segment in the first material video and a second video segment in the second material video; and
    determining, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video.
  4. The method according to claim 3, wherein the determining, according to the first video segment and the second video segment, the image features and audio features corresponding to the first material video and the second material video comprises:
    acquiring a first image feature and a first audio feature corresponding to the first video segment;
    acquiring a second image feature and a second audio feature corresponding to the second video segment; and
    determining the first image feature and the second image feature as the image features corresponding to the first material video and the second material video, and determining the first audio feature and the second audio feature as the audio features corresponding to the first material video and the second material video.
  5. The method according to any one of claims 2 to 4, wherein the determining, according to the plurality of image features and the plurality of audio features, the fused video feature corresponding to each pair of adjacent material videos comprises:
    obtaining a first positional encoding of each image feature in the first video and a second positional encoding of each audio feature in the first video; and
    determining, according to the plurality of image features, the plurality of audio features, the first positional encodings and the second positional encodings, the fused video feature corresponding to each pair of adjacent material videos.
  6. The method according to claim 3 or 4, wherein the first material video precedes the second material video, the first video segment is a segment at the end of the first material video, and the second video segment is a segment at the beginning of the second material video.
  7. The method according to any one of claims 1 to 6, wherein the determining, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, the target video transition effect between each pair of adjacent material videos comprises:
    obtaining a first similarity between the fused video feature and each transition effect feature, to obtain a plurality of first similarities;
    obtaining the maximum first similarity among the plurality of first similarities; and
    determining the video transition effect corresponding to the maximum first similarity as the target video transition effect between the adjacent material videos corresponding to the fused video feature.
  8. The method according to any one of claims 1 to 7, wherein the acquiring a plurality of transition effect features corresponding to a plurality of video transition effects comprises:
    acquiring an effect classification model corresponding to the plurality of video transition effects, wherein the effect classification model is used to classify the plurality of video transition effects; and
    obtaining, through the effect classification model, the feature vector corresponding to each video transition effect, and determining the feature vector as the transition effect feature.
  9. A video processing apparatus, comprising a first acquisition module, a first determination module, a second acquisition module, a second determination module and a third determination module, wherein:
    the first acquisition module is configured to acquire a first video, wherein the first video comprises a plurality of material videos;
    the first determination module is configured to determine a fused video feature corresponding to each pair of adjacent material videos, wherein the fused video feature indicates image features and audio features of the adjacent material videos;
    the second acquisition module is configured to acquire a plurality of transition effect features corresponding to a plurality of video transition effects;
    the second determination module is configured to determine, from the plurality of video transition effects according to the fused video feature and the plurality of transition effect features, a target video transition effect between each pair of adjacent material videos; and
    the third determination module is configured to determine a second video according to the plurality of material videos and the target video transition effects.
  10. An electronic device, comprising a processor and a memory;
    wherein the memory stores computer-executable instructions; and
    the processor executes the computer-executable instructions stored in the memory, causing the processor to perform the video processing method according to any one of claims 1 to 8.
  11. A computer-readable storage medium, wherein the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the video processing method according to any one of claims 1 to 8.
  12. A computer program product, comprising a computer program which, when executed by a processor, implements the video processing method according to any one of claims 1 to 8.
  13. A computer program which, when executed by a processor, implements the video processing method according to any one of claims 1 to 8.
PCT/CN2023/102818 2022-07-08 2023-06-27 Video processing method and apparatus, and electronic device WO2024007898A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210806771.4 2022-07-08
CN202210806771.4A CN116055798A (zh) Video processing method and apparatus, and electronic device

Publications (1)

Publication Number Publication Date
WO2024007898A1 true WO2024007898A1 (zh) 2024-01-11

Family

ID=86114048

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/102818 WO2024007898A1 (zh) 2022-07-08 2023-06-27 视频处理方法、装置及电子设备

Country Status (2)

Country Link
CN (1) CN116055798A (zh)
WO (1) WO2024007898A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116055798A (zh) * 2022-07-08 2023-05-02 脸萌有限公司 Video processing method and apparatus, and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170125064A1 (en) * 2015-11-03 2017-05-04 Seastar Labs, Inc. Method and Apparatus for Automatic Video Production
CN111107392A (zh) * 2019-12-31 2020-05-05 北京百度网讯科技有限公司 Video processing method and apparatus, and electronic device
CN112702656A (zh) * 2020-12-21 2021-04-23 北京达佳互联信息技术有限公司 Video editing method and video editing apparatus
CN113938751A (zh) * 2020-06-29 2022-01-14 北京字节跳动网络技术有限公司 Video transition type determination method, device and storage medium
CN114615513A (zh) * 2022-03-08 2022-06-10 北京字跳网络技术有限公司 Video data generation method and apparatus, electronic device and storage medium
CN116055798A (zh) * 2022-07-08 2023-05-02 脸萌有限公司 Video processing method and apparatus, and electronic device


Also Published As

Publication number Publication date
CN116055798A (zh) 2023-05-02


Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23834666

Country of ref document: EP

Kind code of ref document: A1