WO2024040865A1 - Video editing method and electronic device - Google Patents

Video editing method and electronic device

Info

Publication number: WO2024040865A1
Authority: WIPO (PCT)
Prior art keywords: video, videos, information, music, similarity
Application number: PCT/CN2023/073141
Other languages: French (fr), Chinese (zh)
Inventor: 王龙 (Wang Long)
Original Assignee: 荣耀终端有限公司 (Honor Device Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 荣耀终端有限公司 (Honor Device Co., Ltd.)
Publication of WO2024040865A1


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H04N 21/4302: Content synchronisation processes, e.g. decoder synchronisation
    • H04N 21/4307: Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • H04N 21/43072: Synchronising the rendering of multiple content streams on the same device
    • H04N 21/439: Processing of audio elementary streams
    • H04N 21/44008: Processing of video elementary streams involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/81: Monomedia components thereof
    • H04N 21/8106: Monomedia components involving special audio data, e.g. different tracks for different languages
    • H04N 21/8113: Monomedia components involving special audio data comprising music, e.g. song in MP3 format

Definitions

  • The present application relates to the field of video, and specifically to a video editing method and electronic device.
  • Video mixing and editing refers to a video editing technique that splits multiple videos to select target segments, then reorganizes those segments and adds background music to generate a new video.
  • This application provides a video editing method and electronic device, which can prevent the edited video from containing image content irrelevant to the overall theme of the N videos, and improve the video quality of the edited video.
  • A first aspect provides a video editing method applied to an electronic device, including:
  • a first operation on N video icons among the video icons is detected;
  • the video topics of the N videos are obtained;
  • the first video is obtained.
  • M video clips can be selected from the N videos based on the similarity between the images in the N videos and the video themes; the processed video, that is, the first video, can be obtained based on the M video clips and the music. In the solution of this application, based on the similarity between the images included in the N videos and the video themes, M video clips that are highly relevant to the video themes can be determined from the N videos. Accordingly, video clips that are irrelevant to the overall video topic information can be effectively removed from the N videos, ensuring that the selected video clips are related to the video topic and improving the video quality of the edited first video.
  • selecting M video clips from the N videos based on the similarity between the images in the N videos and the video themes includes:
  • the pre-trained similarity matching model includes an image encoder, a text encoder, and a first similarity measurement module;
  • the image encoder is used to extract image features from the N videos;
  • the text encoder is used to extract text features from the video themes;
  • the first similarity measurement module is used to measure the similarity between the image features of the N videos and the text features of the video themes; the similarity confidence value is used to represent the probability that an image in the N videos is similar to the video theme;
  • M video clips are selected from the N videos.
  • The similarity between the image features in the videos and the text features of the video themes can be identified through a pre-trained similarity matching model. The pre-trained similarity matching model can be a multi-modal model that simultaneously supports two different types of input data, image and text; through the pre-trained similarity matching model, text features and image features can be mapped into a unified space, thereby improving visual and textual understanding; in the solution of this application, the similarity between the image features in the videos and the text features of the video themes can be intelligently identified based on the pre-trained similarity matching model.
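  • As an illustration of the structure described above, the following sketch pairs a stand-in image encoder and text encoder with a cosine-similarity measurement module in a shared embedding space. The encoders, dimensions, vocabulary size, and the 0.5 confidence threshold are illustrative assumptions, not the patent's actual implementation.

```python
# Minimal sketch of the described similarity matching model: an image
# encoder, a text encoder, and a similarity measurement module that maps
# both modalities into one space and scores image/theme similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityMatchingModel(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Stand-in image encoder: flattens a 3x64x64 frame into an embedding.
        self.image_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 64 * 64, embed_dim))
        # Stand-in text encoder: averages token embeddings of the theme text.
        self.token_embed = nn.Embedding(10000, embed_dim)

    def encode_image(self, frames):           # frames: (B, 3, 64, 64)
        return F.normalize(self.image_encoder(frames), dim=-1)

    def encode_text(self, token_ids):         # token_ids: (B, L)
        return F.normalize(self.token_embed(token_ids).mean(dim=1), dim=-1)

    def similarity(self, frames, token_ids):
        # Cosine similarity in the unified space; a sigmoid over these
        # scores stands in for the "similarity confidence value".
        return self.encode_image(frames) @ self.encode_text(token_ids).T

model = SimilarityMatchingModel()
frames = torch.randn(8, 3, 64, 64)            # sampled frames from N videos
theme = torch.randint(0, 10000, (1, 6))       # tokenized video theme
conf = model.similarity(frames, theme).sigmoid()
keep = conf.squeeze(-1) > 0.5                 # frames deemed on-theme
```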
  • the first video is obtained based on M video clips and music, including:
  • the sorted M video clips and music are synthesized into the first video.
  • the image content in the M video clips can be made more consistent with the rhythm of the music; for example, if the video image content is scenery, it can correspond to the prelude or a soothing part of the music; if the video image content is the user's sports scene, it can correspond to the climax of the background music; by sorting the M video clips, the M video clips can better match the rhythm points of the music; this solves the problem of video clips in the first video not matching the background music, that is, the problem of the video content of the first video not fully matching the rhythm of the music, and the video quality of the first video can be improved.
  • the M video clips are sorted to obtain the sorted M video clips, including:
  • background music can be selected based on the overall video theme information of the N videos, and the M video clips can be sorted based on the rhythm of the background music, so that the picture content of the video clips matches the rhythm of the music (see the sketch below); compared with matching the videos to the music directly in their input order, the solution of this application can improve the consistency between the image content in the video and the rhythm of the background music, and improve the video quality of the edited video.
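  • One plausible way to obtain the music's rhythm points and align sorted clips to them is sketched below; using librosa's beat tracker and cutting every few beats are assumptions, since the document does not specify how rhythm points are detected.

```python
# Extract beat timestamps ("rhythm points") from the background music and
# assign the sorted clips to the intervals between them, so clip
# transitions land on the music's rhythm.
import librosa

def rhythm_points(music_path):
    y, sr = librosa.load(music_path)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return librosa.frames_to_time(beat_frames, sr=sr)  # beat times in seconds

def assign_clips_to_beats(clips, beat_times, beats_per_clip=4):
    # Cut points every `beats_per_clip` beats; each sorted clip fills one
    # interval between consecutive cut points.
    cuts = beat_times[::beats_per_clip]
    return [(clip, cuts[i], cuts[i + 1])
            for i, clip in enumerate(clips) if i + 1 < len(cuts)]
```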
  • the M video clips are sorted to obtain the sorted M video clips, including:
  • the M video clips are sorted based on the video contents in the M video clips to obtain the sorted M video clips.
  • the N videos can be sorted based on their text description information to obtain the sorted N videos; from the sorted N videos, M video clips with high relevance to the video topic information are selected to obtain the sorted M video clips; based on the sorted M video clips and the video topic information, music matching the sorted M video clips is determined as the background music; when the picture content of N videos with a strong storyline matches the rhythm of the music, and the playback order of the video picture content conforms to the causal connections, the video quality of the edited video is improved.
  • A video with a strong storyline can refer to N videos with causal connections between them; the cause and effect between the N videos can be identified, and the N videos can be sorted based on the causal order; for example, videos with a strong storyline can include travel-themed videos.
  • the M video clips are sorted based on the rhythm of the music to obtain the sorted M video clips, including:
  • the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder, and a first similarity measurement module;
  • the audio encoder is used to extract features from the music to obtain audio features;
  • the video encoder is used to extract features from the M video clips to obtain video features;
  • the first similarity measurement module is used to measure the similarity between the audio features and the video features of the M video clips.
  • the similarity between the video features of the M video clips and the audio features of the music can be identified through a pre-trained audio-visual rhythm matching model;
  • the pre-trained audio-visual rhythm matching model can be a multi-modal model that simultaneously supports two different types of input data, video and audio; through the pre-trained audio-visual rhythm matching model, video features and audio features can be mapped into a unified space, thereby improving visual and audio understanding; in the solution of this application, the similarity between the video features of the M video clips and the audio features of the music can be intelligently identified based on the pre-trained audio-visual rhythm matching model.
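  • The sketch below illustrates this structure: an audio encoder and a video encoder map into a unified space, the similarity module scores clip/segment pairs, and a greedy pass orders the M clips against successive music segments. The encoders, feature dimensions, and the greedy ordering are illustrative assumptions (the greedy step assumes at least as many clips as segments).

```python
# Audio-visual rhythm matching sketch: score each clip against each music
# segment in a shared embedding space, then order clips greedily.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVRhythmMatcher(nn.Module):
    def __init__(self, embed_dim=128, audio_feat=64, video_feat=512):
        super().__init__()
        self.audio_encoder = nn.Linear(audio_feat, embed_dim)
        self.video_encoder = nn.Linear(video_feat, embed_dim)

    def forward(self, audio_segments, clip_feats):
        a = F.normalize(self.audio_encoder(audio_segments), dim=-1)
        v = F.normalize(self.video_encoder(clip_feats), dim=-1)
        return v @ a.T  # (M clips) x (segments) similarity matrix

def order_clips(sim):
    # For each successive music segment, pick the still-unused clip whose
    # features best match that segment.
    order, used = [], set()
    for seg in range(sim.shape[1]):
        ranked = torch.argsort(sim[:, seg], descending=True)
        pick = next(i.item() for i in ranked if i.item() not in used)
        used.add(pick)
        order.append(pick)
    return order

matcher = AVRhythmMatcher()
sim = matcher(torch.randn(5, 64), torch.randn(5, 512))  # 5 segments, 5 clips
print(order_clips(sim))
```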
  • video themes of the N videos are obtained, including:
  • the N pieces of text description information correspond one-to-one to the N videos; one piece of text description information among the N pieces is used to describe the image content information of one video among the N videos;
  • the topic information of the N videos is obtained, and the text description information is used to convert the video content in the N videos into text information.
  • the video theme information corresponding to the N videos is obtained through the text description information of the N videos; that is, the overall video topic information of the N videos can be obtained based on their text description information. Compared with obtaining video topic information from the image semantics of the N videos, text information carries more abstract semantic information than image information, and there are linguistic correlations among multiple pieces of text information, which helps to infer the theme information hidden behind the texts and improves the accuracy of the overall video theme corresponding to the N videos. For example, if the N videos include a video of a user packing luggage, a video of the user going out and taking a car to the airport, and a video of the user taking a plane, the video theme information of the N videos can be accurately obtained based on the linguistic and logical correlation among the N pieces of text description information. For example, if the text description information of the N videos is "a user is packing luggage", "a user is taking a plane", and "a user is walking on the beach", the video theme of the N videos can be abstracted as travel based on this text description information. Therefore, obtaining the video theme information of the N videos through their text description information can improve the accuracy of the theme information.
  • the topic information of the N videos is obtained, including:
  • the pre-trained topic classification model is a deep neural network used for text classification.
  • the video topic information corresponding to the text description information of the N videos can be obtained based on the pre-trained topic classification model; the video topic information corresponding to the text description information of the N videos is identified through the pre-trained topic classification model. Compared with obtaining video topic information from the image semantics of the N videos, text information carries more abstract semantic information than image information, and the linguistic correlations among multiple pieces of text help to infer the topic information implicit behind the texts, improving the accuracy of the overall video topic corresponding to the N videos; in addition, the pre-trained topic classification model can more intelligently identify the video topic information corresponding to the N pieces of text description information.
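  • A minimal sketch of such a topic classification model is given below: a small deep network for text classification that maps the N concatenated descriptions to a theme label. The vocabulary size, the THEMES label set, and the bag-of-words architecture are hypothetical stand-ins, not the patent's model.

```python
# Topic classification sketch: map tokenized text descriptions of the
# N videos to a video theme; top-2 themes can feed the prompt box below.
import torch
import torch.nn as nn

THEMES = ["travel", "scenery", "sports", "family"]  # hypothetical label set

class TopicClassifier(nn.Module):
    def __init__(self, vocab_size=10000, hidden=128):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)  # bag-of-words encoder
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, len(THEMES)))

    def forward(self, token_ids):             # token_ids: (1, total_tokens)
        return self.mlp(self.embed(token_ids))

clf = TopicClassifier()
# Stand-in tokens for "a user is packing luggage" + "a user is taking a
# plane" + "a user is walking on the beach".
tokens = torch.randint(0, 10000, (1, 24))
probs = clf(tokens).softmax(dim=-1)
top2 = probs.topk(2)  # if two themes score highly, show both to the user
```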
  • when the pre-trained topic classification model outputs at least two video topics, and the at least two video topics correspond to the N pieces of text description information, the method further includes:
  • displaying a second interface, where the second interface includes a prompt box, and the prompt box includes information on the at least two video topics;
  • inputting the N pieces of text description information into the pre-trained topic classification model to obtain the topic information of the N videos includes:
  • a second operation on the at least two video topics is detected;
  • the topic information of the N videos is obtained.
  • When the electronic device outputs at least two video themes, the electronic device can display a prompt box; based on detecting the user's operation on a candidate video theme in the prompt box, the video theme information of the N videos can be determined; to a certain extent, this avoids the situation where the electronic device cannot recognize the video themes of the N videos because their video content does not fully conform to the preset video themes.
  • music matching the video theme is obtained, including:
  • the total duration of the background music can be determined based on the duration of the M video clips.
  • the background music selected when performing music matching usually needs to have a duration greater than or equal to the total duration of the M video clips; the total duration required of the background music can be determined based on the durations of the M video clips, as in the snippet below.
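  • A small sketch of this duration constraint, assuming a hypothetical `candidates` list of (title, duration) pairs:

```python
# Candidate background music must be at least as long as the M clips
# combined; here we prefer the shortest track that still covers them.
def pick_music(candidates, clip_durations):
    total = sum(clip_durations)               # total duration of M clips
    usable = [(name, d) for name, d in candidates if d >= total]
    return min(usable, key=lambda c: c[1]) if usable else None

print(pick_music([("calm", 95.0), ("upbeat", 210.0)], [20.5, 33.0, 41.0]))
```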
  • the pre-trained similarity matching model is a Transformer model.
  • the pre-trained similarity matching model is obtained through the following training method:
  • a contrastive learning training method is used to train the similarity matching model to be trained, and the pre-trained similarity matching model is obtained; the first training data set includes positive example data pairs and negative example data pairs;
  • a positive example data pair includes first sample text description information and first sample video theme information, and the first sample text description information matches the first sample video theme information;
  • a negative example data pair includes the first sample text description information and second sample video theme information, and the first sample text description information does not match the second sample video theme information.
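  • A minimal sketch of this contrastive training objective, assuming a margin-based loss over encoder outputs; the patent specifies only contrastive learning over positive and negative data pairs, not this particular loss.

```python
# Pull matching description/theme pairs together and push mismatched
# pairs apart by at least `margin` in cosine similarity.
import torch
import torch.nn.functional as F

def contrastive_loss(desc_emb, pos_theme_emb, neg_theme_emb, margin=0.5):
    pos = F.cosine_similarity(desc_emb, pos_theme_emb)   # should be high
    neg = F.cosine_similarity(desc_emb, neg_theme_emb)   # should be low
    return F.relu(margin + neg - pos).mean()

# e.g. description "a user is packing luggage" vs. positive theme "travel"
# and negative theme "sports"; random embeddings stand in for encoder output.
print(contrastive_loss(torch.randn(4, 256), torch.randn(4, 256),
                       torch.randn(4, 256)))
```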
  • the pre-trained audio-visual rhythm matching model is a Transformer model.
  • the pre-trained audio-visual rhythm matching model is obtained through the following training method:
  • a contrastive learning training method is used to train the audio-visual rhythm matching model to be trained, and the pre-trained audio-visual rhythm matching model is obtained; the second training data set includes positive example data pairs and negative example data pairs;
  • a positive example data pair includes first sample music and a first sample video, and the rhythm of the first sample music matches the content of the first sample video;
  • a negative example data pair includes the first sample music and a second sample video, and the rhythm of the first sample music does not match the content of the second sample video.
  • In a second aspect, an electronic device is provided, including one or more processors and a memory; the memory is coupled to the one or more processors and is used to store computer program code; the computer program code includes computer instructions, and the one or more processors invoke the computer instructions to cause the electronic device to execute:
  • a first operation on N video icons among the video icons is detected;
  • the video topics of the N videos are obtained;
  • the first video is obtained.
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • the pre-trained similarity matching model includes an image encoder, a text encoder, and a first similarity measurement module;
  • the image encoder is used to extract image features from the N videos;
  • the text encoder is used to extract text features from the video themes;
  • the first similarity measurement module is used to measure the similarity between the image features of the N videos and the text features of the video themes; the similarity confidence value is used to represent the probability that an image in the N videos is similar to the video theme;
  • M video clips are selected from the N videos.
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • the sorted M video clips and music are synthesized into the first video.
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • the M video clips are sorted based on the video contents in the M video clips to obtain the sorted M video clips.
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder, and a first similarity measurement module;
  • the audio encoder is used to extract features from the music to obtain audio features;
  • the video encoder is used to extract features from the M video clips to obtain video features;
  • the first similarity measurement module is used to measure the similarity between the audio features and the video features of the M video clips.
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • convert the video content of the N videos into N pieces of text description information, where the N pieces of text description information correspond one-to-one to the N videos, and one piece of text description information among the N pieces is used to describe the image content information of one video among the N videos;
  • the topic information of the N videos is obtained, and the text description information is used to convert the video content in the N videos into text information.
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • the pre-trained topic classification model is a deep neural network used for text classification.
  • when the pre-trained topic classification model outputs at least two video topics, and the at least two video topics correspond to the N pieces of text description information, the one or more processors invoke the computer instructions to cause the electronic device to execute:
  • displaying a second interface, where the second interface includes a prompt box, and the prompt box includes information on the at least two video topics;
  • inputting the N pieces of text description information into the pre-trained topic classification model to obtain the topic information of the N videos includes:
  • a second operation on the at least two video topics is detected;
  • the topic information of the N videos is obtained.
  • one or more processors invoke computer instructions to cause the electronic device to execute:
  • the pre-trained similarity matching model is a Transformer model.
  • the pre-trained similarity matching model is obtained through the following training method:
  • a contrastive learning training method is used to train the similarity matching model to be trained, and the pre-trained similarity matching model is obtained; the first training data set includes positive example data pairs and negative example data pairs;
  • a positive example data pair includes first sample text description information and first sample video theme information, and the first sample text description information matches the first sample video theme information;
  • a negative example data pair includes the first sample text description information and second sample video theme information, and the first sample text description information does not match the second sample video theme information.
  • the pre-trained audio-visual rhythm matching model is a Transformer model.
  • the pre-trained audio-visual rhythm matching model is obtained through the following training method:
  • a contrastive learning training method is used to train the audio-visual rhythm matching model to be trained, and the pre-trained audio-visual rhythm matching model is obtained; the second training data set includes positive example data pairs and negative example data pairs;
  • a positive example data pair includes first sample music and a first sample video, and the rhythm of the first sample music matches the content of the first sample video;
  • a negative example data pair includes the first sample music and a second sample video, and the rhythm of the first sample music does not match the content of the second sample video.
  • In a third aspect, an electronic device is provided, including a module/unit for executing the video editing method in the first aspect or any implementation of the first aspect.
  • In a fourth aspect, an electronic device is provided, the electronic device including one or more processors and a memory; the memory is coupled to the one or more processors and is used to store computer program code; the computer program code includes computer instructions that are invoked by the one or more processors to cause the electronic device to perform the video editing method in the first aspect or any one of the implementations of the first aspect.
  • In a fifth aspect, a chip system is provided; the chip system is applied to an electronic device; the chip system includes one or more processors, and the processors are used to call computer instructions to cause the electronic device to execute the video editing method in the first aspect or any implementation of the first aspect.
  • In a sixth aspect, a computer-readable storage medium is provided; the computer-readable storage medium stores computer program code; when the computer program code is run by an electronic device, the electronic device is caused to execute the video editing method in the first aspect or any implementation of the first aspect.
  • In a seventh aspect, a computer program product is provided; the computer program product includes computer program code; when the computer program code is run by an electronic device, the electronic device is caused to execute the video editing method in the first aspect or any implementation of the first aspect.
  • M video clips can be selected from the N videos based on the similarity between the images in the N videos and the video themes; the processed video, that is, the first video, can be obtained based on the M video clips and the music. In the solution of this application, based on the similarity between the images included in the N videos and the video themes, M video clips that are highly relevant to the video themes can be determined from the N videos. Accordingly, video clips that are irrelevant to the overall video topic information can be effectively removed from the N videos, ensuring that the selected video clips are related to the video topic and improving the video quality of the edited first video.
  • The problem of mismatch between the video clips and the music in the edited video can be solved, that is, the problem of the edited video content not fully matching the rhythm points of the background music. In this application, the image content in the M video clips can be made more consistent with the rhythm of the music; for example, if the video image content is scenery, it can correspond to the prelude or a soothing part of the music; if the video image content is the user's sports scene, it can correspond to the climax of the background music. By sorting the M video clips, the M video clips can better match the rhythm points of the music, and the video quality of the edited video can be improved.
  • Figure 1 is a schematic diagram of a hardware system suitable for an electronic device of the present application.
  • Figure 2 is a schematic diagram of the structure of a Transformer model suitable for this application.
  • Figure 3 is a schematic diagram of the structure of the encoder and decoder in the Transformer model.
  • Figure 4 is a schematic diagram of a software system suitable for the electronic device of the present application.
  • Figure 5 is a schematic diagram of a graphical user interface suitable for embodiments of the present application.
  • Figure 6 is a schematic diagram of a graphical user interface suitable for embodiments of the present application.
  • Figure 7 is a schematic diagram of a graphical user interface suitable for embodiments of the present application.
  • Figure 8 is a schematic diagram of a graphical user interface suitable for embodiments of the present application.
  • Figure 9 is a schematic diagram of a graphical user interface suitable for embodiments of the present application.
  • Figure 10 is a schematic diagram of a graphical user interface suitable for embodiments of the present application.
  • Figure 11 is a schematic diagram of a graphical user interface suitable for embodiments of the present application.
  • Figure 12 is a schematic flow chart of a video editing method provided by an embodiment of the present application.
  • Figure 13 is a schematic flow chart of another video editing method provided by an embodiment of the present application.
  • Figure 14 is a schematic flow chart of a method for determining M video clips related to video theme information in N videos provided by an embodiment of the present application;
  • Figure 15 is a schematic diagram of the processing flow of a similarity evaluation model implemented in this application.
  • Figure 16 is a flow chart of a method for rhythm matching processing of M video clips and background music provided by an embodiment of the present application
  • Figure 17 is a schematic diagram of the processing flow of an audio-visual rhythm matching model provided by an embodiment of the present application.
  • Figure 18 is a schematic flow chart of another video editing method provided by an embodiment of the present application.
  • Figure 19 is a schematic flow chart of another video editing method provided by an embodiment of the present application.
  • Figure 20 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 21 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Image features refer to a set of attributes that characterize the characteristics or content of an image; for example, image features can include color features, texture features, shape features, and spatial relationship features of the image, or attribute expressions obtained through some kind of mapping.
  • Video features refer to a set of attributes that can characterize the characteristics of the video obtained from the image sequence in the video through some mapping.
  • Text features refer to the set of attributes that characterize a word or sentence's specific semantics obtained through vectorization and subsequent mapping.
  • The CLIP (Contrastive Language-Image Pre-training) model is a cross-modal pre-training model based on contrastive image-text learning.
  • A neural network refers to a network formed by connecting multiple single neural units together, that is, the output of one neural unit can be the input of another neural unit; the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, where the local receptive field can be an area composed of several neural units.
  • Contrastive learning is one of the self-supervised learning methods; it refers to a training method that learns knowledge from unlabeled images without relying on labeled data.
  • The goal of contrastive learning is to learn an encoder that encodes data of the same class similarly and makes the encoding results of data of different classes as different as possible, as in the loss sketched below.
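  • A compact sketch of this idea, assuming an InfoNCE-style batch loss in which matching pairs (the diagonal of the similarity matrix) must score higher than all mismatched combinations; this is one common instantiation, not the only one.

```python
# InfoNCE-style contrastive loss: the i-th embedding in `emb_a` should be
# most similar to the i-th embedding in `emb_b` and dissimilar to the rest.
import torch
import torch.nn.functional as F

def info_nce(emb_a, emb_b, temperature=0.07):
    a = F.normalize(emb_a, dim=-1)
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.T / temperature            # pairwise similarity matrix
    targets = torch.arange(len(a))            # i-th a matches i-th b
    return F.cross_entropy(logits, targets)

print(info_nce(torch.randn(16, 128), torch.randn(16, 128)))
```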
  • the Transformer model can be composed of an encoder and a decoder.
  • the encoder and decoder can include multiple sub-modules; for example, an encoder can include 6 encoding modules; a decoder can include 6 decoding modules.
  • an encoding module can include: an embedding layer, positional encoding, a multi-head attention mechanism module, residual connections with linear normalization, and a feed-forward network module. The embedding layer is used to represent each word in the input data with a vector; positional encoding is used to construct a matrix with the same vector dimension as the input data, so that the data input to the multi-head attention mechanism module contains position information; the multi-head attention mechanism module implements the work of multiple attention modules in parallel by using multiple different versions of the same query: the idea is to linearly transform the query with different weight matrices to obtain multiple queries, where each newly formed query essentially asks for a different type of relevant information, thus allowing the attention model to introduce more information into the context vector calculation; residual connections are used to prevent network degradation; linear normalization is used to normalize the activation values of each layer; the feed-forward network module further transforms the obtained word representations.
  • a decoding module can include: an embedding layer, positional encoding, a masked multi-head attention mechanism module, residual connections with linear normalization, a feed-forward network module, and a multi-head attention mechanism module;
  • the embedding layer is used to represent each word in the input data with a vector;
  • the positional encoding is used to construct a matrix with the same vector dimension as the input data, so that the data input to the multi-head attention mechanism module contains position information;
  • the masked multi-head attention mechanism module uses masks to ensure that earlier words carry no information about later words, thereby ensuring that the output data predicted by the Transformer model does not change with the number of input words;
  • the multi-head attention mechanism module implements the work of multiple attention modules in parallel by using multiple different versions of the same query: the idea is to linearly transform the query with different weight matrices to obtain multiple queries, where each newly formed query essentially asks for a different type of relevant information, thus allowing the attention model to introduce more information into the context vector calculation; residual connections are used to prevent network degradation. An encoder block following this structure is sketched below.
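  • A minimal sketch of one such encoding module, with illustrative hyperparameters; the embedding and positional-encoding steps are assumed to have been applied to the input already.

```python
# One Transformer encoder block: multi-head self-attention with a
# residual connection and layer normalization, then a feed-forward
# network with another residual connection and normalization.
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                     # x: (batch, seq, d_model)
        attn_out, _ = self.attn(x, x, x)      # multi-head self-attention
        x = self.norm1(x + attn_out)          # residual + normalization
        return self.norm2(x + self.ff(x))     # feed-forward + residual

seq = torch.randn(2, 10, 512)                 # embedded, position-encoded input
print(EncoderBlock()(seq).shape)              # torch.Size([2, 10, 512])
```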
  • A deep neural network (DNN), also called a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • The layers of a DNN are divided according to their positions: the layers inside a DNN can be divided into three categories: the input layer, hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in between are hidden layers.
  • the layers are fully connected; that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • During training, the neural network can use the error back propagation (BP) algorithm to modify the parameters in the initial neural network model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, forward propagation of the input signal up to the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges.
  • The back propagation algorithm is a back propagation movement dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrices.
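  • A small sketch of the training loop implied here, with an illustrative model and random data: the forward pass produces an error loss, and back propagation of that loss updates the parameters until the loss converges.

```python
# Forward pass -> error loss -> back propagation -> parameter update.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 4), torch.randn(64, 1)

for step in range(100):
    loss = nn.functional.mse_loss(model(x), y)  # forward pass -> error loss
    opt.zero_grad()
    loss.backward()                             # back-propagate error loss
    opt.step()                                  # update the weight matrices
```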
  • Transition effects, also called transitions, refer to the use of certain techniques between two scenes, such as wipes, dissolves, and page curls, to achieve a smooth transition between scenes or plot points, or to enrich the visual effect of the picture.
  • Figure 1 shows a hardware system suitable for the electronic device of the present application.
  • The electronic device 100 may be a mobile phone, a smart screen, a tablet, a wearable electronic device, a vehicle-mounted electronic device, an augmented reality (AR) device, a virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a projector, etc.
  • the embodiment of the present application does not place any restrictions on the specific type of the electronic device 100.
  • The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, etc.
  • The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure shown in FIG. 1 does not constitute a specific limitation on the electronic device 100.
  • The electronic device 100 may include more or fewer components than those shown in FIG. 1, or the electronic device 100 may include a combination of some of the components shown in FIG. 1, or the electronic device 100 may include sub-components of some of the components shown in FIG. 1.
  • the components shown in Figure 1 may be implemented in hardware, software, or a combination of software and hardware.
  • Processor 110 may include one or more processing units.
  • The processor 110 may include at least one of the following processing units: an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and a neural network processing unit (NPU).
  • different processing units can be independent devices or integrated devices.
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • The memory in processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated access and reduces the waiting time of the processor 110, thus improving the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • The processor 110 may include at least one of the following interfaces: an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM interface, and a USB interface.
  • The processor 110 may be configured to execute the video editing method provided by the embodiments of the present application; for example: display a first interface, where the first interface includes video icons and the videos indicated by the video icons are videos stored in the electronic device; detect a first operation on N of the video icons; in response to the first operation, obtain information of N videos, where N is an integer greater than 1; based on the information of the N videos, obtain the video themes of the N videos; based on the similarity between the images in the N videos and the video themes, select M video clips from the N videos; based on the video themes, obtain music that matches the video themes; based on the M video clips and the music, obtain the first video; and display the first video.
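  • The pipeline below restates these steps as a high-level sketch; every helper it calls (describe_video, classify_topic, select_clips, match_music, sort_clips, synthesize) is a hypothetical placeholder for the components described in this document, to be implemented with the models sketched earlier.

```python
# High-level sketch of the one-click editing flow the processor executes.
def one_click_edit(videos, music_library):
    # 1. Convert each video's image content into a text description.
    descriptions = [describe_video(v) for v in videos]
    # 2. Infer the overall video theme from the N descriptions.
    theme = classify_topic(descriptions)
    # 3. Keep the M clips whose images are similar to the theme.
    clips = select_clips(videos, theme)
    # 4. Pick background music matching the theme (and long enough).
    music = match_music(theme, clips, music_library)
    # 5. Sort the clips against the music's rhythm points.
    ordered = sort_clips(clips, music)
    # 6. Synthesize the sorted clips and the music into the first video.
    return synthesize(ordered, music)
```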
  • The connection relationships between the modules shown in FIG. 1 are only schematic illustrations and do not constitute a limitation on the connection relationships between the modules of the electronic device 100.
  • each module of the electronic device 100 may also adopt a combination of various connection methods in the above embodiments.
  • the wireless communication function of the electronic device 100 can be implemented through antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, modem processor, baseband processor and other components.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example: Antenna 1 can be reused as a diversity antenna for a wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the electronic device 100 may implement display functions through a GPU, a display screen 194, and an application processor.
  • The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor; GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • Display 194 may be used to display images or videos.
  • Display 194 includes a display panel.
  • The display panel can use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a mini light-emitting diode (Mini LED), a micro light-emitting diode (Micro LED), a micro OLED (Micro OLED), or quantum dot light-emitting diodes (QLED).
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the display screen 194 can display a video or photo selected by the user; and display the processed video.
  • the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the ISP is used to process data fed back by the camera 193 .
  • When the shutter is opened, light is transmitted to the camera sensor through the lens; the light signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye.
  • ISP can algorithmically optimize the noise, brightness and color of the image. ISP can also optimize parameters such as exposure and color temperature of the shooting scene.
  • the ISP may be provided in the camera 193.
  • a camera 193 (which may also be referred to as a lens) is used to capture still images or videos. It can be triggered by application instructions to realize the camera function, such as capturing images of any scene.
  • The camera can include an imaging lens, a filter, an image sensor, and other components. The light emitted or reflected by an object enters the imaging lens, passes through the filter, and finally converges on the image sensor.
  • The imaging lens is mainly used to collect and image the light emitted or reflected by all objects within the camera angle (which may also be called the scene to be shot, the target scene, or the scene image the user expects to shoot); the filter is mainly used to filter out excess light waves (such as light waves other than visible light, for example infrared).
  • The image sensor can be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the image sensor is mainly used to photoelectrically convert the received optical signal into an electrical signal, and then transfer the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other format image signals.
  • the digital signal processor is used to process digital signals. In addition to processing digital image signals, it can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • video codecs are used to compress or decompress digital video.
  • Electronic device 100 may support one or more video codecs.
  • The electronic device 100 can play or record videos in multiple encoding formats, such as moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, and MPEG-4.
  • the gyro sensor 180B may be used to determine the motion posture of the electronic device 100 .
  • In some embodiments, the angular velocity of the electronic device 100 about three axes (i.e., the x-axis, y-axis, and z-axis) can be determined through the gyro sensor 180B.
  • the gyro sensor 180B can be used for image stabilization. For example, when the shutter is pressed, the gyro sensor 180B detects the angle at which the electronic device 100 shakes, and calculates the distance that the lens module needs to compensate based on the angle, so that the lens can offset the shake of the electronic device 100 through reverse movement to achieve anti-shake.
  • the gyro sensor 180B can also be used in scenarios such as navigation and somatosensory games.
  • the acceleration sensor 180E can detect the acceleration of the electronic device 100 in various directions (generally the x-axis, y-axis, and z-axis). When the electronic device 100 is stationary, the magnitude and direction of gravity can be detected. The acceleration sensor 180E can also be used to identify the posture of the electronic device 100 as an input parameter for applications such as horizontal and vertical screen switching and pedometer.
  • distance sensor 180F is used to measure distance.
  • Electronic device 100 can measure distance via infrared or laser. In some embodiments, such as in a shooting scene, the electronic device 100 may utilize the distance sensor 180F to measure distance to achieve fast focusing.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • the electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness.
  • the ambient light sensor 180L can also be used to automatically adjust the white balance when taking pictures.
  • the ambient light sensor 180L can also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in the pocket to prevent accidental touching.
  • fingerprint sensor 180H is used to collect fingerprints.
  • the electronic device 100 can use the collected fingerprint characteristics to implement functions such as unlocking, accessing application locks, taking photos, and answering incoming calls.
  • the touch sensor 180K is also called a touch device.
  • the touch sensor 180K can be disposed on the display screen 194.
  • the touch sensor 180K and the display screen 194 form a touch screen.
  • The touch screen is also called a touch panel.
  • the touch sensor 180K is used to detect a touch operation acted on or near the touch sensor 180K.
  • the touch sensor 180K may pass the detected touch operation to the application processor to determine the touch event type.
  • Visual output related to the touch operation may be provided through display screen 194 .
  • the touch sensor 180K may also be disposed on the surface of the electronic device 100 and at a different position from the display screen 194 .
  • the hardware system of the electronic device 100 is described in detail above, and the software system of the electronic device 100 is introduced below.
  • FIG. 4 is a schematic diagram of a software system of an electronic device provided by an embodiment of the present application.
  • the system architecture may include an application layer 210 , an application framework layer 220 , a hardware abstraction layer 230 , a driver layer 240 and a hardware layer 250 .
  • Application layer 210 may include a gallery application.
  • the application layer 210 may also include application programs such as camera application, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer 220 provides an application programming interface (API) and programming framework for applications in the application layer; the application framework layer may include some predefined functions.
  • the application framework layer 220 may include a gallery access interface; the gallery access interface may be used to obtain relevant data of the gallery.
  • Hardware abstraction layer 230 is used to abstract hardware.
  • The hardware abstraction layer may include a video editing algorithm; based on the video editing algorithm, the video editing methods of the embodiments of the present application may be executed.
  • the driver layer 240 is used to provide drivers for different hardware devices.
  • The driver layer may include a display driver.
  • Hardware layer 250 may include a display screen and other hardware devices.
  • Embodiments of the present application provide a video editing method and electronic device. In the embodiments of the present application, the image content information of N videos can be converted into text description information; based on the text description information of the N videos, the video theme information of the N videos is obtained; based on the correlation between the images in the N videos and the video theme information, M video clips are selected from the N videos; and the processed video is obtained based on the M video clips and background music.
  • The video theme information of the N videos is obtained through the text description information of the N videos; compared with obtaining the video theme information based on the image information of the N videos, text information carries richer information than image information; in addition, there are linguistic correlations between multiple pieces of text information, and obtaining the video theme information based on the text description information of the N videos can improve the accuracy of the video theme information. Furthermore, in the embodiments of the present application, based on the correlation between the images in the N videos and the video theme information, M video clips in the N videos that are highly relevant to the video theme can be determined; based on the solution of this application, video clips that are irrelevant to the overall video theme information can be effectively removed from the N videos, ensuring that the selected video clips are related to the video theme information and improving the video quality of the edited video.
  • The problem of the video clips in the edited video not matching the background music can be solved, that is, the problem of the edited video content not completely matching the rhythm points of the background music.
  • Background music can be selected based on the overall video theme information of the N videos, and the M video clips can be sorted based on the rhythm of the background music, so that the picture content of the video clips matches the rhythm of the music; compared with matching the videos to the music directly in their input order, the solution of this application can improve the consistency between the image content in the video and the rhythm of the background music, and improve the video quality of the edited video.
  • The N videos can be sorted based on the text description information of the N videos to obtain the sorted N videos; from the sorted N videos, M video clips that are highly relevant to the video topic information are selected to obtain the sorted M video clips; based on the sorted M video clips and the video topic information, background music matching the sorted M video clips is determined; when the picture content of N videos with a strong storyline matches the music rhythm, and the playback order of the video picture content conforms to the causal relationships, the video quality of the edited video is improved.
  • The video editing method provided by the embodiments of the present application can effectively filter out the video clips in the N videos that are not relevant to the overall video theme based on the correlation between the video content in the N videos and the overall video theme; match background music based on the video content (for example, the emotions expressed in the video and the video picture); and reasonably connect multiple video clips in series based on the rhythm of the background music or the logical correlation of the video clips. As a result, the edited video does not include content irrelevant to the overall video theme, and the video content matches the rhythm of the background music, thereby improving the professionalism of video editing on the electronic device and the video quality of the edited video.
  • The video editing method provided by the embodiments of the present application is suitable for automatically generating mixed-cut videos on electronic devices; for example, the electronic device detects the user's selection operations on multiple videos, identifies the video themes of the multiple videos, matches background music related to the video themes, and automatically synthesizes the multiple videos and the background music to generate a mixed-cut video.
  • The method provided by the embodiments of the present application is applicable not only to videos saved in electronic devices but also to photos saved in electronic devices; for example, a mixed-cut video can be generated based on photos saved in the electronic device, where the photos include but are not limited to: gif animations, JPEG format images, PNG format images, etc.
  • the graphical user interface (GUI) shown in (a) of Figure 5 is the desktop 301 of the electronic device; the electronic device detects that the user clicks on the gallery application on the desktop.
  • the operation of the control 302 is as shown in (b) of Figure 5; after the electronic device detects that the user clicks on the control 302 of the gallery application on the desktop, the gallery display interface 303 as shown in (c) of Figure 5 is displayed.
• the gallery display interface 303 includes all photo icons, video icons and a control 304 for more options; the electronic device detects the user's operation of clicking the control 304 for more options, as shown in (d) of Figure 5; after the electronic device detects that the user clicks the control 304 for more options, the display interface 305 shown in (a) of Figure 6 is displayed; the display interface 305 includes a one-click blockbuster control 306; the electronic device detects the user's operation of clicking the one-click blockbuster control 306, as shown in (b) of Figure 6; after the electronic device detects that the user clicks the one-click blockbuster control 306, the display interface 307 shown in (c) of Figure 6 is displayed; the display interface 307 includes the icons of the videos saved in the electronic device and a multi-select control 308; the electronic device detects the user's operation of clicking the multi-select control 308, as shown in (d) of Figure 6; after the electronic device detects that the user clicks the multi-select control 308, the user can select multiple video icons.
  • the electronic device can execute the video editing method provided by the embodiment of the present application to perform video editing processing on multiple videos selected by the user, displaying the following:
  • the electronic device can pre-configure a template corresponding to the video theme information; therefore, the electronic device can display the display interface 317 as shown in FIG. 9 .
  • the electronic device can execute the video editing method provided by the embodiment of the present application to perform video editing processing on multiple videos selected by the user, displaying the following:
  • the electronic device can pre-configure multiple templates corresponding to the video theme information; therefore, the electronic device can display the display interface 318 as shown in FIG. 10 .
• the electronic device can execute the video editing method provided by the embodiment of the present application to perform video editing processing on multiple videos selected by the user, displaying the following: the display interface 319 shown in Figure 11; based on the solution of this application, if the electronic device obtains two or more pieces of video theme information, a prompt box 320 can be displayed on the electronic device; as shown in Figure 11, the prompt box 320 includes two video themes, namely scenery and travel; based on the user's operation on a video theme in the prompt box 320, the electronic device can determine one piece of video theme information from the two or more video themes.
• the video editing method provided in the embodiment of the present application is also applicable to performing video editing processing on photos saved in an electronic device to generate a mixed-cut video; where the photos include but are not limited to: gif animations, JPEG format images, PNG format images, etc.; this application does not impose any restrictions on this.
  • FIG 12 is a schematic flow chart of a video editing method provided by an embodiment of the present application.
• the video editing method 400 can be executed by the electronic device shown in Figure 1; the video editing method includes steps S410 to S480, and steps S410 to S480 are described in detail below respectively.
  • Step S410 Display the first interface.
  • the first interface includes a video icon, and the video indicated by the video icon is a video stored in the electronic device.
• the first interface may refer to the display interface of the gallery application in the electronic device, such as the display interface 307 shown in (c) of Figure 6; the display interface 307 includes 6 video icons, and the videos corresponding to the 6 video icons are videos stored in the electronic device.
  • Step S420 Detect the first operation on N video icons among the video icons.
• the first operation may be a click operation on the N video icons among the video icons, or the first operation may be another selection operation on the N video icons.
• for example, the electronic device detects a click operation on the icon 310 among the video icons; for another example, as shown in (d) of Figure 7, the electronic device detects a click operation on the icon 312 among the video icons.
• the first operations on the N video icons among the video icons may be performed one after another, or may be performed simultaneously.
• the above description takes the first operation being a click operation as an example.
• the first operation may also be an operation of selecting N video icons among the video icons by voice instruction, or the first operation may also be another operation instructing the selection of N video icons; this application does not impose any restrictions on the first operation.
  • Step S430 In response to the first operation, obtain information of N videos.
  • N is an integer greater than 1.
  • the electronic device can obtain information of three videos.
  • Step S440 Based on the information of the N videos, obtain the video themes of the N videos.
• the video theme can refer to the theme idea associated with the overall image content in the videos; for different video themes, the corresponding video processing methods can be different; for example, different video themes can use different music, different transition special effects, different image processing filters, or different video editing methods.
• based on the information of the N videos, the video topics of the N videos are obtained, including: obtaining N pieces of text description information based on the information of the N videos, and obtaining the video topics of the N videos based on the N pieces of text description information.
• the N pieces of text description information correspond to the N videos one-to-one.
• one piece of text description information among the N pieces of text description information is used to describe the images of one video among the N videos.
• the video theme information corresponding to the N videos is obtained through the text description information of the N videos; that is, the overall video topic information of the N videos can be obtained based on the text description information of the N videos; compared with obtaining the video topic information based on the image semantics of the N videos, text information has more abstract semantic information than image information, and there is language correlation between multiple pieces of text information, which helps to infer the theme information hidden behind multiple pieces of text and can improve the accuracy of the overall video theme corresponding to the N videos.
• for example, the N videos include a video of the user packing luggage, a video of the user taking a car to the airport, a video of the user taking a plane, and a video of the user walking on the beach; only some image tags may be obtained based on image semantics, including clothing, suitcases, users, seaside, etc.; based on these image tags, it may be difficult to accurately determine that the video theme of the N videos is travel; however, when identifying the video theme based on the text description information of the N videos, the video theme information of the N videos can be accurately obtained based on the linguistic logic correlation between the N pieces of video text description information; for example, the text description information included in the N videos is "a user is packing luggage", "a user is taking a plane", "a user is walking on the beach"; based on this text description information, the video theme information of the N videos can be abstracted as travel; therefore, obtaining the video theme information of the N videos through the text description information of the N videos can improve the accuracy of the theme information.
• the topic information of the N videos is obtained as follows:
• the pre-trained topic classification model is a deep neural network used for text classification.
• the N videos can be input into the image-text conversion model to obtain the text description information of the N videos (for example, N pieces of text description information); the text description information of the N videos can then be input into the pre-trained topic classification model to obtain the topic information of the N videos.
  • the implementation may refer to step S530 in FIG. 13 or the related description of step S620 and step S630 in FIG. 18 .
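• As an illustrative aid only (not part of the claimed embodiment), the following minimal Python sketch shows the shape of this step: the N text descriptions are pooled and fed to a small classifier that outputs one overall theme; the vocabulary, theme list and architecture here are invented for illustration, since the embodiment only describes the topic classification model as a pre-trained deep neural network for text classification.

```python
# Illustrative sketch: pool the N text descriptions and classify them
# into one overall theme. Vocabulary, themes and architecture are
# invented; the real model is a pre-trained deep neural network.
import torch
import torch.nn as nn

THEMES = ["travel", "party", "pets", "sports", "scenery", "parent-child", "work"]
VOCAB = {"packing": 0, "luggage": 1, "plane": 2, "beach": 3, "eating": 4, "games": 5}

def bag_of_words(descriptions):
    # Pool all N descriptions into one vector so the correlation across
    # the N pieces of text is preserved when inferring the overall theme.
    vec = torch.zeros(len(VOCAB))
    for sentence in descriptions:
        for word in sentence.lower().split():
            if word in VOCAB:
                vec[VOCAB[word]] += 1.0
    return vec

classifier = nn.Sequential(                    # pre-trained in practice
    nn.Linear(len(VOCAB), 32), nn.ReLU(), nn.Linear(32, len(THEMES)))

descriptions = ["A person is packing luggage",
                "A person is taking a plane",
                "A person is walking on the beach"]
logits = classifier(bag_of_words(descriptions))
print(THEMES[int(logits.argmax())])  # overall theme, e.g. "travel" once trained
```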
• when the pre-trained topic classification model outputs at least two video topics, where the at least two video topics correspond to the N pieces of text description information, the method also includes:
• displaying a second interface, where the second interface includes a prompt box, and the prompt box includes information on the at least two video topics;
• detecting a second operation on the at least two video topics;
• in response to the second operation, obtaining the theme information of the N videos.
• a prompt box can be displayed in the electronic device; the prompt box can include candidate video topic information, and the video topic information of the N videos is determined based on the user's operation on the candidate video topic information in the prompt box.
• the second interface can be displayed on the electronic device, such as the display interface 319 shown in Figure 11; the display interface 319 includes a prompt box 320, and the prompt box 320 includes two pieces of candidate video topic information, which are scenery and travel respectively; if the electronic device detects that the user clicks "Scenery", the video topic information of the N videos is scenery; if the electronic device detects that the user clicks "Travel", the video topic information of the N videos is travel.
• when the electronic device outputs at least two video themes, the electronic device can display a prompt box; based on detecting the user's operation on a candidate video theme in the prompt box, the video theme information of the N videos can be determined; to a certain extent, this can avoid the situation in which the electronic device cannot recognize the video themes of the N videos when the video contents of the N videos do not completely comply with the pre-set video themes.
  • Step S450 Based on the similarity between the images in the N videos and the video themes, select M video clips from the N videos.
• the similarity between the images in the N videos and the video theme can be represented by a similarity confidence value or a distance value; for example, if the similarity between an image feature in a video and the text feature of the video theme is higher, the similarity confidence value is larger and the distance measurement value is smaller; if the similarity between an image feature in a video and the text feature of the video theme is lower, the similarity confidence value is smaller and the distance measurement value is larger.
• based on the similarity between the images in the N videos and the video theme, the M video clips among the N videos that are highly relevant to the video theme can be determined; based on the solution of the present application, on the one hand, the video clips in the N videos that are not related to the video topic information can be effectively deleted, ensuring that the filtered video clips are related to the video topic information; on the other hand, when the similarity confidence values between some or all image features in the N videos and the video topic information are calculated, a video clip is obtained by selecting consecutive multiple frames of images in a video, so the continuity of the video clips is better.
  • all image features in a video can be traversed, and the similarity between each image feature in a video and the text information of the video topic information can be determined.
• alternatively, part of the image features in a video can be extracted; for example, for one video among the N videos, image frames can be selected at equal intervals, and feature extraction can be performed on the selected image frames to obtain image features.
• M may be greater than N, or M may be equal to N, or M may be less than N; the numerical size of M is determined based on the similarity confidence value between each video segment in the N videos and the video theme information.
  • M video clips from the N videos are selected based on the similarity between the images in the N videos and the video themes, including:
• the pre-trained similarity matching model includes an image encoder, a text encoder and a first similarity measurement module.
  • the image encoder is used to extract image features from N videos.
  • the text encoder is used to extract text features from video topics.
• the first similarity measurement module is used to measure the similarity between the image features in the N videos and the text features of the video theme; the similarity confidence value is used to represent the probability that an image in the N videos is similar to the video theme;
• based on the similarity confidence values, the M video clips are selected from the N videos.
  • the pre-trained similarity matching model is a Transformer model.
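• A hedged sketch of such a similarity matching model is shown below: a stand-in image encoder and text encoder project pre-extracted features into a shared space, and cosine similarity is mapped to a per-frame confidence in [0, 1]; the dimensions and linear encoders are assumptions, since the embodiment only states that the model may be a Transformer.

```python
# Hedged sketch of a similarity matching model: stand-in image/text
# encoders project features into a shared space; cosine similarity is
# mapped to a per-frame similarity confidence value in [0, 1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityMatcher(nn.Module):
    def __init__(self, img_dim=512, txt_dim=128, emb_dim=64):
        super().__init__()
        self.image_encoder = nn.Linear(img_dim, emb_dim)  # stand-in encoder
        self.text_encoder = nn.Linear(txt_dim, emb_dim)   # stand-in encoder

    def forward(self, frame_feats, theme_feat):
        img = F.normalize(self.image_encoder(frame_feats), dim=-1)
        txt = F.normalize(self.text_encoder(theme_feat), dim=-1)
        cos = img @ txt           # cosine similarity per frame, in [-1, 1]
        return (cos + 1) / 2      # similarity confidence value, in [0, 1]

frames = torch.randn(100, 512)   # pre-extracted features of 100 frames
theme = torch.randn(128)         # embedding of the theme text, e.g. "travel"
confidence = SimilarityMatcher()(frames, theme)   # shape: (100,)
```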
  • the pre-trained similarity matching model is obtained through the following training method:
• based on a first training data set, the contrastive learning training method is used to train the similarity matching model to be trained, and the pre-trained similarity matching model is obtained; the first training data set includes positive example data pairs and negative example data pairs.
• the positive example data pair includes the first sample text description information and the first sample video theme information, and the first sample text description information matches the first sample video theme information.
• the negative example data pair includes the first sample text description information and the second sample video theme information, and the first sample text description information does not match the second sample video theme information.
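• The following margin-based loss is one minimal way to realize the contrastive objective described above (positive pairs should score higher than negative pairs); the margin value and the loss form are assumptions, as the embodiment does not specify them.

```python
# One minimal realization of the contrastive objective: the similarity
# of a positive pair should exceed that of a negative pair by a margin.
import torch
import torch.nn.functional as F

def contrastive_loss(pos_sim, neg_sim, margin=0.2):
    # pos_sim: similarity of (first sample description, first sample theme)
    # neg_sim: similarity of (first sample description, second sample theme)
    return F.relu(margin - pos_sim + neg_sim).mean()

pos = torch.tensor([0.8, 0.7])     # matching pairs score high
neg = torch.tensor([0.3, 0.5])     # non-matching pairs score low
print(contrastive_loss(pos, neg))  # tensor(0.) -> margin already satisfied
```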
• the implementation of step S450 can refer to the following steps S540 and S550 in Figure 13, or Figure 14, or Figure 15, or steps S640 and S650 in Figure 18, or the related description of steps S750 and S760 in Figure 19.
  • Step S460 Based on the video theme, obtain music that matches the video theme.
  • music matching the video theme is obtained, including:
  • the total duration of the background music can be determined based on the duration of the M video clips.
• the duration of the background music selected when performing music matching usually needs to be greater than or equal to the total duration of the M video clips; based on the video theme information, the music style of the background music can be determined.
  • step S460 may refer to the subsequent step S560 in FIG. 13 , or step S660 in FIG. 18 , or the related description of step S770 in FIG. 19 .
  • Step S470 Obtain the first video based on the M video clips and music.
  • the first video is obtained based on M video clips and music, including:
• the M video clips are sorted to obtain the sorted M video clips;
• the sorted M video clips and the music are synthesized into the first video.
• by sorting the M video clips, the image content in the M video clips can be made more consistent with the music rhythm; for example, if the video picture content is scenery, it can correspond to the prelude or the soothing part of the music; if the video picture content is a scene of the user doing sports, it can correspond to the climax of the background music; by sorting the M video clips, the M video clips can better match the rhythm points of the music, thereby solving the problem that the video clips and the background music in the edited first video do not match, that is, the problem that the content of the edited first video does not fully match the rhythm of the music, and improving the video quality of the edited first video.
  • the M video clips are sorted to obtain the sorted M video clips, including:
  • the best positions of M video clips can be matched based on the rhythm of the music; a processed video can be generated.
• a video without a strong storyline may refer to a case in which the N videos are of equal order, that is, there is no strong causal correlation between the N videos; for example, videos without a strong storyline may include videos with a sports theme.
• for videos without a strong storyline, background music can be selected based on the overall video theme information of the N videos, and the M video clips can be sorted based on the rhythm of the background music, so that the picture content of the video clips matches the rhythm of the music; compared with matching the videos directly with the music according to the input order, the solution of this application can improve the consistency between the image content in the video and the rhythm of the background music, and improve the video quality of the edited video.
  • the M video clips are sorted based on the rhythm of the music to obtain the sorted M video clips, including:
  • the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder and a first similarity measurement module.
  • the audio encoder is used to extract features from music to obtain audio features
• the video encoder is used to extract features from the M video clips to obtain video features.
• the first similarity measurement module is used to measure the similarity between the audio features and the video features of the M video clips.
• the music and the M video clips are input into the pre-trained audio-visual rhythm matching model to obtain the sorted M video clips; through the pre-trained audio-visual rhythm matching model, the matching between the audio features and the video features can be realized.
  • M video clips are sorted to obtain sorted M video clips, including:
  • the M video clips are sorted based on the video contents in the M video clips to obtain the sorted M video clips.
• the sorted M video clips can be determined based on the similarity confidence values between the video clips included in the sorted N videos and the video theme; the background music matching the sorted M video clips is determined based on the sorted M video clips and the video theme information; and a processed video is generated.
• a video with a strong storyline can refer to a case in which there is a causal connection between the N videos.
• the cause and effect between the N videos can be identified, and the N videos can be sorted based on the order of the cause and effect; for example, videos with a strong storyline can include travel-themed videos.
• for videos with a strong storyline, the N videos can be sorted based on the text description information of the N videos to obtain the sorted N videos; M video clips that are highly relevant to the video topic information are selected from the sorted N videos to obtain the sorted M video clips; based on the sorted M video clips and the video theme information, the background music matching the sorted M video clips is determined; in this way, the picture content of the N videos with a strong storyline matches the music rhythm, the playback order of the video picture content conforms to the causal relationship, and the video quality of the edited video is improved.
  • Step S480 Display the first video.
  • the first video may be a mixed-cut video obtained based on M video clips and music; the mixed-cut video may be displayed on the electronic device.
• after generating the first video based on the M video clips and the music, the electronic device can save the first video; after the electronic device detects an operation indicating to display the first video, the first video is displayed.
• in addition, the solution of this application is also applicable to photos; the photos can include but are not limited to: gif animations, JPEG format images, PNG format images, etc.
• in the embodiment of the present application, the image content information of the N videos can be converted into text description information; the video theme information of the N videos can be obtained based on the text description information of the N videos; based on the relevance between the images in the N videos and the video theme information, M video clips are selected from the N videos; and the processed video is obtained based on the M video clips and the background music; in the solution of this application, the video topic information of the N videos is obtained through the text description information of the N videos; compared with obtaining the video topic information based on the image information of the N videos, text information is richer than image information, and there is language correlation between multiple pieces of text information, so obtaining the video theme information from the text description information of the N videos can improve the accuracy of the video theme information; in addition, in the embodiment of the present application, the M video clips related to the video theme can be determined based on the correlation between the images in the N videos and the video theme information; furthermore, the problem of mismatch between the video clips and the background music in the edited first video can be solved, that is, the problem that the image content of the edited first video and the background music are out of rhythm can be solved.
  • FIG 13 is a schematic flow chart of a video editing method provided by an embodiment of the present application.
  • the video editing method 500 can be executed by the electronic device shown in Figure 1; the video editing method includes steps S510 to S570, and steps S510 to S570 are described in detail below respectively.
  • Step S510 Obtain N videos.
• the N videos may be videos stored in the electronic device; the N videos may be videos collected by the electronic device, or some or all of the N videos may be downloaded videos; this application does not limit the source of the N videos in any way.
• for example, after the electronic device detects the user's click operations on N videos in the gallery application, the N videos are obtained.
  • Step S520 Obtain text description information of N videos.
• a video can correspond to one piece of text description information, and the text description information is used to describe the content information in the video; the image content in the video can be converted into text description information.
  • the text description information is used to describe the image content in a video, and the text description information and the subtitle content in the video may be different.
  • Video 1 is a video of a user packing his luggage
  • the text description information of Video 1 may be "A person is packing his luggage”
  • Video 2 is a video of the user taking a plane at the airport
  • the text description information of Video 2 may be "A person is taking a plane”
• Video 3 is a video of the user walking on the beach, then the text description information of Video 3 can be "A person is walking on the beach."
  • Step S530 Obtain the video theme information of the N videos based on the text description information of the N videos.
• the video theme can refer to the theme idea associated with the overall image content in the videos; for different video themes, the corresponding video processing methods can be different; for example, different video themes can use different music, different transition special effects, different image processing filters, or different video editing methods.
  • the video theme information of N videos is one theme information, that is, the video theme information is the video theme information corresponding to the N videos as a whole.
  • video themes may include but are not limited to: travel, parties, pets, sports, scenery, parent-child, work, etc.
  • the text description information of N videos can be input to a pre-trained video topic classification model to obtain the video topic information of N videos; wherein, the pre-trained video topic classification model can output video topic tags.
  • the pre-trained video topic classification model may refer to a text classification model.
  • the pre-trained video topic classification model may be used to classify input text description information and obtain classification labels corresponding to the text description information.
  • the pre-trained video topic classification model can be a neural network; for example, the pre-trained video topic classification model can be a deep neural network.
  • the pre-trained video topic classification model can be trained through the back propagation algorithm based on the following training data set; the training data set includes sample text description information and video topic text information, and the sample text description information is related to the video topic information.
• the sample text description information can be one or more sentence texts; the video topic text information can be a phrase text; by learning a large number of training data sets, the video topic classification model to be trained can be trained to obtain the trained video topic classification model.
  • the sample text description information may include: “Multiple people are eating”, “Multiple people are playing games”, and “Multiple people are talking”;
  • the video theme text information corresponding to the sample text description information may be "Party”;
  • the sample text description information may include “an adult and a child are taking pictures”, “an adult and a child are playing games”;
  • the video theme information corresponding to the sample text description information is "parent-child”.
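• A minimal training-loop sketch for such a text classifier is given below; the feature dimension, label count and optimizer settings are illustrative, and only the use of back-propagation on (description, theme) pairs reflects the description above.

```python
# Illustrative back-propagation training loop for the topic classifier;
# encoded descriptions, label count and optimizer settings are invented.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 7))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

text_feats = torch.randn(100, 64)        # encoded sample text descriptions
labels = torch.randint(0, 7, (100,))     # theme labels, e.g. 1 = "party"

for _ in range(10):                      # a few epochs, for illustration
    optimizer.zero_grad()
    loss = loss_fn(model(text_feats), labels)
    loss.backward()                      # the back-propagation step
    optimizer.step()
```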
• in the embodiment of the present application, the video theme information corresponding to the N videos is obtained through the text description information of the N videos; that is, the overall video topic information of the N videos can be obtained based on the text description information of the N videos; compared with obtaining the video topic information based on the image semantics of the N videos, text information has more abstract semantic information than image information, and there is language correlation between multiple pieces of text information, which helps to infer the theme information hidden behind multiple pieces of text and can improve the accuracy of the overall video theme corresponding to the N videos; for example, the N videos include videos of users packing their luggage, videos of users going out and taking a car to the airport, and videos of users taking a plane.
• the video theme information of the N videos can be accurately obtained based on the linguistic logical correlation between the N pieces of video text description information; for example, the text description information included in the N videos is "a user is packing luggage", "a user is taking a plane", "a user is walking on the beach"; based on this text description information, the video theme information of the N videos can be abstracted as travel; therefore, obtaining the video theme information of the N videos through the text description information of the N videos can improve the accuracy of the theme information.
• optionally, the electronic device determines the video theme information of the N videos based on the user's operation; for example, the electronic device can display candidate video topic information, and the video topic information corresponding to the multiple pieces of text description information is determined based on the user's operation.
• for example, the display interface 319 can be displayed on the electronic device; the display interface 319 includes a prompt box 320, and the prompt box 320 includes two pieces of candidate video theme information, which are scenery and travel respectively; if the electronic device detects that the user clicks "Scenery", the video theme information of the N videos is scenery; if the electronic device detects that the user clicks "Travel", the video theme information of the N videos is travel.
  • Step S540 Based on the similarity between the images in the N videos and the video topic information, obtain the similarity confidence values of the images in the N videos and the video topic information.
• the similarity between the image features in the N videos and the text features of the video topic information can be evaluated based on the similarity evaluation model, and the similarity confidence values between the image features in the N videos and the video topic information can be obtained.
  • all image features in a video can be traversed, and the similarity between each image feature in a video and the text information of the video topic information can be determined.
• alternatively, part of the image features in a video can be extracted; for example, for one video among the N videos, image frames can be selected at equal intervals, and feature extraction can be performed on the selected image frames to obtain image features.
• the image features in the N videos and the text features of the video topic information can be extracted based on the similarity evaluation model, the similarity between the image features in the N videos and the text features of the video topic information can be evaluated, and the similarity confidence values between the image features in the N videos and the video topic information can be obtained; the specific implementation method is as described in the subsequent Figure 14 and Figure 15.
  • Step S550 Obtain M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video topic information.
• for example, as shown in Figure 15, the N videos include video 310, video 312 and video 314; curve 561 is the similarity curve between the image features included in video 310 and the text features of the video topic information; curve 562 is the similarity curve between the image features included in video 312 and the text features of the video topic information; curve 563 is the similarity curve between the image features included in video 314 and the text features of the video topic information; based on curve 561, it can be determined to select image 3101 and image 3102 in video 310 to form video clip 1; based on curve 562, it can be determined to select image 3121, image 3122 and image 3123 in video 312 to form video clip 2; based on curve 563, it can be determined to select image 3141, image 3142, image 3143 and image 3144 in video 314 to form video clip 3.
• Figure 15 is only an example; two or more video clips can also be selected from one video, where the two video clips can be two consecutive video clips, or they can be two discontinuous video clips (for example, frames 1 to 5 make up one video clip and frames 10 to 13 make up another video clip); however, for one video clip, the multiple frames of images included in the video clip are consecutive multiple frames of images; alternatively, no video clip may be selected from a video; whether a video clip is selected from a video depends on the similarity confidence values between the image features included in the video and the video topic information; if a video has no image features related to the video theme, no video clip needs to be selected from the video.
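• One plausible reading of this selection rule is sketched below: runs of consecutive frames whose similarity confidence stays above a threshold become clips, and videos with no such frames contribute no clip; the threshold and minimum run length are assumptions.

```python
# Illustrative selection of video clips from per-frame similarity
# confidence values: keep runs of consecutive frames above a threshold.
def select_clips(confidences, threshold=0.6, min_len=3):
    clips, start = [], None
    for i, c in enumerate(confidences):
        if c >= threshold and start is None:
            start = i                      # a relevant run begins
        elif c < threshold and start is not None:
            if i - start >= min_len:       # keep only continuous runs
                clips.append((start, i - 1))
            start = None
    if start is not None and len(confidences) - start >= min_len:
        clips.append((start, len(confidences) - 1))
    return clips  # e.g. [(0, 4), (9, 12)] -> frames 1-5 and 10-13

print(select_clips([0.7, 0.8, 0.9, 0.7, 0.7, 0.1, 0.2, 0.1, 0.1,
                    0.8, 0.9, 0.8, 0.7]))
```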
• M may be greater than N, or M may be equal to N, or M may be less than N; the numerical size of M is determined based on the similarity confidence value between each video segment in the N videos and the video theme information.
• based on the similarity between the images in the N videos and the video theme, the M video clips among the N videos that are highly relevant to the video theme can be determined; based on the solution of the present application, on the one hand, the video clips in the N videos that are not related to the video topic information can be effectively deleted, ensuring that the filtered video clips are related to the video topic information; on the other hand, when the similarity confidence values between some or all images in the N videos and the video topic information are calculated, a video clip is obtained by selecting consecutive multiple frames of images in a video, so the continuity of the video clips is better.
  • the original sounds in some or all of the M video clips can be retained.
  • Step S560 Perform music matching processing based on M video clips and video theme information to obtain background music.
• the implementation of step S560 may refer to the related description of the music matching in step S460 of Figure 12.
  • the total duration of the background music can be determined based on the duration of the M video clips.
• the duration of the background music selected when performing music matching usually needs to be greater than or equal to the total duration of the M video clips; based on the video theme information, the music style of the background music can be determined.
• for example, if the video theme is sports, the background music can be a cheerful music style; if the video theme is scenery, the background music can be a soothing music style.
  • music matching processing can be performed in the candidate music library based on the M video clips and video theme information to obtain background music information; the candidate music library can include music of different music styles and music durations.
• for example, the total duration of the background music can be determined based on the duration of the M video clips; the music style of the background music can be determined based on the video theme information; based on the total duration and the music style, a random selection can be made from the candidate music of that music style in the candidate music library to obtain the background music.
• for example, the total duration of the background music can be determined based on the duration of the M video clips; the music style of the background music can be determined based on the video theme information; based on the total duration and the music style, a selection can be made according to the music popularity in the candidate music library to obtain the background music.
• for example, the total duration of the background music can be determined based on the duration of the M video clips; the music style of the background music can be determined based on the video theme information; based on the total duration and the music style, a selection can be made based on the user's preferences in the candidate music library to obtain the background music.
  • background music that satisfies the total duration and music style is selected from the candidate music library based on the frequency with which the user plays music.
  • the total duration of the background music can be determined based on the duration of M video clips; the music style of the background music can be determined based on the video theme information; the music with the highest matching degree to the video theme can be selected from the candidate music library as the background music.
• for example, the total duration of the background music can be determined based on the duration of the M video clips; the music style of the background music can be determined based on the video theme information; multiple pieces of music can be selected from the candidate music library and edited together to obtain the background music, where the weight or duration of each of the multiple pieces of music can be based on the user's preferences or preset fixed parameters.
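• The following sketch combines the duration constraint and style matching described above with one of the listed selection strategies (play frequency); the library schema and the theme-to-style mapping are assumptions for illustration.

```python
# Sketch: keep candidates whose style fits the theme and whose duration
# covers the M clips, then apply one listed strategy (play frequency).
THEME_TO_STYLE = {"sports": "cheerful", "travel": "cheerful", "scenery": "soothing"}

def match_music(clip_durations, theme, library):
    total = sum(clip_durations)              # total duration of the M clips
    style = THEME_TO_STYLE.get(theme, "soothing")
    candidates = [m for m in library
                  if m["style"] == style and m["duration"] >= total]
    # Strategy: prefer the music the user plays most frequently.
    return max(candidates, key=lambda m: m["play_count"]) if candidates else None

library = [{"name": "a", "style": "cheerful", "duration": 60, "play_count": 12},
           {"name": "b", "style": "cheerful", "duration": 45, "play_count": 30}]
print(match_music([10, 15, 20], "travel", library))  # picks "b" (45s, most played)
```

Any of the other listed strategies (random choice, popularity, multi-track editing) could replace the final ranking step without changing the duration and style filtering.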
  • Step S570 Match M video clips and background music to obtain a processed video (an example of the first video).
  • the order of the M video clips can be determined based on the music rhythm of the background music, so that the picture content and the music rhythm between the M video clips and the background music are consistent.
• the rhythm matching process is intended to achieve a better integration of the M video clips and the background music, so that the image content in the M video clips is more consistent with the music rhythm of the background music; for example, if the video picture content is scenery, it can correspond to the prelude or the soothing part of the background music; if the video picture content is a scene of the user doing sports, it can correspond to the climax part of the background music; through the rhythm matching process, the M video clips and the rhythm points of the background music are more closely matched, and the quality of the processed video is improved.
• for example, the M video clips and the background music can be input into a pre-trained audio-visual rhythm matching model to obtain the position information of all or part of the M video clips; the audio-visual rhythm matching model can include an audio encoder, a video encoder and a similarity measurement module; the audio encoder is used to extract the audio features of the background music; the video encoder can be used to extract the video features; the similarity measurement module is used to measure the similarity between the audio features and the video features.
• the network of the audio-visual rhythm matching model can be a deep neural network; for example, the audio-visual rhythm matching model can adopt the structure of the Transformer model as shown in Figure 2; when training the audio-visual rhythm matching model, a contrastive learning training method can be used.
  • the audio-visual rhythm matching model can be a neural network, and the audio-visual rhythm matching model to be trained can be trained by obtaining sample music clips to obtain a trained audio-visual rhythm matching model.
• the overall training architecture of the audio-visual rhythm matching model can use a contrastive learning model; when constructing training data pairs, data pairs in which the background music and the video content match can be used as positive examples, and data pairs in which the background music and the video content do not match can be used as negative examples; the video encoder and the audio encoder are trained so that the similarity of positive example data pairs is greater than the similarity of negative example data pairs.
• the audio-visual rhythm matching model can have a multi-modal pre-training architecture and can simultaneously support two different types of input data: images and text; text and images are mapped into a unified space through a cross-modal contrastive learning method, thereby improving visual and text understanding capabilities.
• by performing rhythm matching processing on the M video clips and the background music, the M video clips can be sorted according to the rhythm of the background music, so that the picture content of the video clips matches the music rhythm; that is to say, the image content of the M video clips and the background music are aligned at the rhythm points.
  • the solution of this application can improve the consistency of the image content in the video and the rhythm of the background music, and improve the user experience.
• in the embodiment of the present application, the duration of the background music is greater than or equal to the total duration of the M video clips; when the duration of the background music is greater than the total duration of the M video clips, the last video clip among the M video clips can be played in slow motion; or, transition special effects can be added so that the M video clips can be played repeatedly after the contents of the M videos are played.
  • the processed video can also be obtained based on the uploading order of the M video clips, or the order of the timestamp information of the M video clips and the background music.
• the above is an example of editing N videos included in the gallery application; the solution of this application can also be applied to editing photos in the gallery application; for example, the photos can include but are not limited to: gif animations, JPEG format images, PNG format images, etc.
• in the embodiment of the present application, the image content information of the N videos can be converted into text description information; the video theme information of the N videos can be obtained based on the text description information of the N videos; based on the relevance between the images in the N videos and the video theme information, M video clips are selected from the N videos; and the processed video is obtained based on the M video clips and the background music; in the solution of this application, the video topic information of the N videos is obtained through the text description information of the N videos; compared with obtaining the video topic information based on the image information of the N videos, text information is richer than image information, and there is language correlation between multiple pieces of text information, so obtaining the video theme information from the text description information of the N videos can improve the accuracy of the video theme information; in addition, in the embodiment of the present application, based on the correlation between the images in the N videos and the video theme information, the M video clips in the N videos that are highly relevant to the video theme can be determined; based on the solution of this application, on the one hand, the video clips that are irrelevant to the video topic information in the N videos can be effectively deleted, ensuring that the filtered video clips are related to the video topic information; on the other hand, when the similarity confidence value between each video clip in the N videos and the video topic information is calculated, a video clip is obtained by selecting consecutive multi-frame images in a video, so the continuity of the video clip is better; thereby improving the video quality of the edited video.
• in addition, the background music of the M video clips is selected based on the video theme information of the N videos; the M video clips can be sorted based on the rhythm of the background music, so that the picture content of the video clips matches the rhythm of the music; compared with matching the videos directly with the music according to the input order, the solution of this application can improve the consistency between the image content in the video and the rhythm of the background music, and improve the user experience.
  • step S540 and step S550 in FIG. 13 will be described in detail below in conjunction with FIG. 14 and FIG. 15 .
  • FIG. 14 is a schematic flowchart of a method for determining M video segments related to video theme information among N videos provided by an embodiment of the present application.
  • the method can be executed by the electronic device shown in Figure 1; the method includes steps S551 to S555, and steps S551 to S555 will be described in detail below.
  • Step S551 Extract features from N videos based on the image encoder in the similarity evaluation model to obtain image features in N videos.
  • a training data set can be obtained to train the similarity evaluation model to be trained, and a trained similarity evaluation model can be obtained; for example, the overall training architecture of the similarity evaluation model can adopt a contrastive learning model.
• the training data set includes sample videos, video theme information that matches the sample videos, and video theme information that does not match the sample videos; for example, the sample videos can include travel videos, the video theme information matching the sample videos is the text information of "travel", and the video theme information that does not match the video theme of the sample videos is the text information of "sports";
• the similarity evaluation model is trained so that when matching text features and image features are input, the distance measurement value output by the similarity measurement module in the similarity evaluation model to be trained is smaller, and when non-matching text features and image features are input, the distance measurement value output by the similarity measurement module in the similarity evaluation model to be trained is larger; in this way, the trained similarity evaluation model can identify matching text features and image features.
  • image features can be extracted from each frame of the N videos through the image encoder in the similarity evaluation model to obtain all image features included in the N videos.
• alternatively, the image features in the N videos can be extracted through the image encoder in the similarity evaluation model at equal frame intervals to obtain partial image features in the N videos.
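• A minimal sketch of equal-interval frame sampling is shown below; the interval value is illustrative.

```python
# Minimal sketch of sampling frames at equal intervals before feature
# extraction, so only part of the image features are computed.
def sample_frame_indices(num_frames, interval=5):
    return list(range(0, num_frames, interval))

print(sample_frame_indices(23, 5))  # [0, 5, 10, 15, 20]
```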
  • the similarity evaluation model can be as shown in Figure 15.
• the similarity evaluation model can include a text encoder, an image encoder and a similarity measurement module (an example of the first similarity measurement module); the text encoder is used to extract text features; the image encoder can be used to extract image features; the similarity measurement module is used to measure the similarity between the text features and the image features.
  • the similarity evaluation model may be a contrastive learning model.
  • Step S552 Extract features from the video topic information based on the text encoder in the similarity evaluation model to obtain text features of the video topic information.
  • text features refer to a set of attributes obtained by vectorization and subsequent mapping of words or sentences that can characterize their specific semantics.
  • Step S553 Obtain the similarity confidence value between the image feature and the text feature based on the similarity measurement module in the similarity evaluation model.
  • the image features in N videos and the text features of the video topic information can be extracted; the image features and text features are compared to obtain the similarity between the image features and the text features.
• the similarity evaluation model can output a distance measurement value, or the similarity evaluation model can output a similarity confidence value; if the similarity evaluation model outputs a distance measurement value, a smaller distance measurement value indicates a higher similarity between the image feature and the text feature, and the similarity confidence value between the image feature and the text feature can be obtained based on the distance measurement value; if the similarity evaluation model outputs a similarity confidence value, a greater similarity confidence value indicates a higher similarity between the image feature and the text feature.
  • the distance measurement value can be the cos value between the image feature and the text feature.
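• Since the description above says a smaller distance means a higher similarity while the measure can be based on the cosine, one consistent convention is distance = 1 - cos, as in this sketch.

```python
# One consistent convention: distance = 1 - cos, so a smaller distance
# measurement value corresponds to a higher similarity.
import torch
import torch.nn.functional as F

def distance(image_feats, text_feats):
    cos = F.cosine_similarity(image_feats, text_feats, dim=-1)
    return 1.0 - cos  # smaller value -> more similar

print(distance(torch.randn(8, 64), torch.randn(8, 64)))  # 8 distance values
```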
  • Step S554 Select consecutive multi-frame image features in the video based on the similarity confidence value of the image feature and the text feature to obtain a video clip.
• for one video among the N videos, the similarity curve between the image features in the video and the text features of the video topic information can be obtained; one or more video clips can be selected from a video based on the similarity curve, and a video clip consists of multiple consecutive frames of images.
  • multiple consecutive frames of images related to the video theme are selected to obtain a video clip; based on the solution in the present application, it can be ensured that the selected video clip is related to the overall video theme.
  • Step S555 Based on the similarity confidence value of the image feature and the text feature, select M video clips from the N videos.
• the similarity curve between the image features in the video and the text features of the video topic information can be obtained; based on the similarity curve, the images in the video related to the video topic can be determined from the entire video; then consecutive multiple frames of images are extracted from a video to obtain a video clip.
• for example, as shown in Figure 15, the N videos are video 310, video 312 and video 314; curve 561 is the similarity curve between the image features included in video 310 and the text features of the video topic information; curve 562 is the similarity curve between the image features included in video 312 and the text features of the video topic information; curve 563 is the similarity curve between the image features included in video 314 and the text features of the video topic information; based on curve 561, it is determined to select image 3101 and image 3102 in video 310 to form video clip 1; based on curve 562, it can be determined to select image 3121, image 3122 and image 3123 in video 312 to form video clip 2; based on curve 563, it can be determined to select image 3141, image 3142, image 3143 and image 3144 in video 314 to form video clip 3.
• Figure 15 is only an example; two or more video clips can also be selected from one video, where the two video clips can be two consecutive video clips or two discontinuous video clips (for example, frames 1 to 5 constitute a video clip and frames 10 to 13 constitute a video clip); however, for one video clip, the multiple frames of images included in the video clip are consecutive multiple frames of images; alternatively, no video clip may be selected from a video; whether to select a video clip from a video depends on the similarity confidence values between the image features in the video and the video topic information; if there are no image features related to the video theme in a video, the video clips in the video may not be selected.
• M may be greater than N, or M may be equal to N, or M may be less than N; the numerical size of M is determined based on the similarity confidence value between each video segment in the N videos and the video theme information.
• the image features related to the overall video theme in the N videos can be identified through the pre-trained similarity evaluation model; based on the image features related to the video theme, M video clips related to the video theme are filtered out from the N videos, and the video clips in the N videos that are not related to the video theme are eliminated; based on the solution of this application, the video clips in the N videos that are not related to the video topic information can be effectively deleted, ensuring that the filtered video clips are related to the video topic information; the edited video is obtained based on the filtered video clips and the background music, thereby improving the video quality of the edited video.
  • step S570 in Figure 13 will be described in detail below with reference to Figures 16 and 17 .
  • Figure 16 is a flow chart of a method for matching M video clips and background music provided by an embodiment of the present application.
  • the method can be executed by the electronic device shown in Figure 1; the method includes steps S571 to S574, and steps S571 to S574 are described in detail below respectively.
  • Step S571 Extract features of the background music based on the audio encoder in the audio-visual rhythm matching model to obtain audio features.
  • the audio-visual rhythm matching model can be as shown in Figure 17;
• the audio-visual rhythm matching model can include an audio encoder, a video encoder and a similarity measurement module; the audio encoder is used to extract the audio features of the background music; the video encoder can be used to extract video features; the similarity measurement module is used to measure the similarity between the audio features and the video features.
  • Step S572 Extract features from the M video clips based on the video encoder in the audio-visual rhythm matching model to obtain video features.
  • one video feature includes multi-frame image features; M video segments can correspond to M video features.
  • Step S573 Obtain the similarity confidence value between the audio feature and the video feature based on the similarity measurement module in the audio-visual rhythm matching model.
• the background music can be divided into multiple audio features; by traversing the correlation between each video feature among the M video features and each audio feature among the multiple audio features, the similarity confidence values between the M video features and the multiple audio features are obtained.
• for example, the distance measurement value between the audio feature and the video feature can be output; the larger the distance measurement value, the smaller the similarity between the audio feature and the video feature, and the smaller the similarity confidence value between the audio feature and the video feature.
  • the distance measurement value can be the cos value between the audio feature and the video feature.
  • Step S574 Based on the similarity confidence value, obtain the best matching positions corresponding to the background music of the M video clips.
  • the best positions for matching the M video clips and the background music can be obtained, so that the image content of the M video clips matches the music rhythm of the background music.
• for example, the M video clips include video clip 1, video clip 2 and video clip 3; the background music can be divided into 3 audio features, namely audio feature 1, audio feature 2 and audio feature 3; the correlation between audio feature 1 and each of video clip 1, video clip 2 and video clip 3 is determined to obtain the video clip among the 3 video clips that has the highest matching degree with audio feature 1; the correlation between audio feature 2 and each of video clip 1, video clip 2 and video clip 3 is determined to obtain the video clip among the 3 video clips that has the highest matching degree with audio feature 2; the correlation between audio feature 3 and each of video clip 1, video clip 2 and video clip 3 is determined to obtain the video clip among the 3 video clips that has the highest matching degree with audio feature 3; finally, the video clip corresponding to each audio feature can be output.
• for example, the audio-visual rhythm matching model can output audio feature 1 corresponding to video clip 3, audio feature 2 corresponding to video clip 2, and audio feature 3 corresponding to video clip 1, thus obtaining a sequence of the M video clips that matches the rhythm of the background music.
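• A hedged sketch of this matching step: given a model-produced similarity matrix between audio segments and video clips, each audio segment greedily takes the most similar unused clip, reproducing the example ordering above; the greedy strategy is an assumption, since the embodiment only specifies the output correspondence.

```python
# Illustrative greedy assignment matching each audio segment of the
# background music to the most similar video clip. Similarities are
# assumed to come from the audio-visual rhythm matching model.
import torch

def order_clips(similarity):                 # shape: (n_audio, n_clips)
    order, used = [], set()
    for audio_idx in range(similarity.shape[0]):
        ranked = torch.argsort(similarity[audio_idx], descending=True)
        for clip_idx in ranked.tolist():     # best unused clip wins
            if clip_idx not in used:
                used.add(clip_idx)
                order.append(clip_idx)
                break
    return order                             # clip index per music segment

sim = torch.tensor([[0.1, 0.2, 0.9],
                    [0.2, 0.8, 0.3],
                    [0.7, 0.1, 0.2]])
print(order_clips(sim))                      # [2, 1, 0], as in the example
```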
• for example, if the video image content is scenery, it may correspond to the prelude or the soothing music part of the background music; if the video image content is a user movement scene, it may correspond to the climax part of the background music.
• a training data set can be obtained to train the audio-visual rhythm matching model to be trained, and the trained audio-visual rhythm matching model is obtained; the training data set includes sample matching music shorts and sample non-matching music shorts; a sample matching music short refers to a music short whose music matches the image content; a sample non-matching music short refers to a music short whose music and image content do not match; for example, the background music of music short 1 and the images of music short 2 are mixed to obtain a sample non-matching music short; by learning a large number of training data sets, the audio-visual rhythm matching model can sort the input M video clips based on the rhythm of the input background music.
• by performing rhythm matching processing, the M video clips can be sorted according to the rhythm of the background music, so that the picture content of the video clips matches the music rhythm; compared with matching videos directly with music according to the input sequence, the solution of this application can improve the consistency between the image content in the video and the rhythm of the background music, and improve the user experience.
  • the electronic device detects N videos selected by the user.
• the N videos may be videos with strong storylines; or, the N videos may be videos without strong storylines.
• the editing method for videos without a strong storyline and the editing method for videos with a strong storyline will be described in detail below with reference to Figure 18 and Figure 19.
• a video with a strong storyline can refer to a case in which there is a causal connection between the N videos.
• the cause and effect between the N videos can be identified, and the N videos can be sorted based on the order of the cause and effect; for example, videos with a strong storyline can include travel-themed videos; videos with a non-strong storyline can refer to videos in which the N videos are of equal order, that is, there is no strong causal correlation between the N videos; for example, videos with a non-strong storyline can include sports-themed videos.
  • For example, videos with strong story lines can include videos whose video theme is travel. For example, the N videos include a video of packing luggage at home, a video of taking a taxi to the airport, a video of taking a plane, and a video of arriving at the destination and walking on the beach; these 4 videos are causally connected: one needs to pack luggage first, then take a plane to reach the destination, and then travel at the destination.
  • For example, videos with non-strong story lines may include videos whose video theme is sports. For example, the N videos include a video of running on the basketball court, a video of layups and shots, and a video of passing the ball on the basketball court; these 3 videos do not have a strong causal relationship.
  • Implementation method 1: For videos without strong story lines, obtain N videos; obtain the video theme of the N videos based on the text description information of the N videos; determine M video clips in the N videos based on the similarity confidence values between the images in the N videos and the video theme; determine background music based on the M video clips and the video theme; match the best positions of the M video clips based on the rhythm of the background music; and generate the processed video.
  • FIG. 18 is a schematic flow chart of a video editing method provided by an embodiment of the present application.
  • the video editing method 600 can be executed by the electronic device shown in FIG. 1; the video editing method 600 includes steps S610 to S680, and steps S610 to S680 are described in detail below respectively.
  • Step S610 Obtain N videos.
  • The N videos may be videos stored in the electronic device; the N videos may be videos collected by the electronic device, or some or all of the N videos may be downloaded videos; this application does not limit the source of the N videos in any way.
  • For example, when the electronic device detects the user's click operation on N videos in the gallery application, the N videos can be obtained.
  • The N videos that do not have a strong story line can be sorted based on the order in which the N videos are uploaded; or, the N videos can be sorted based on the timestamp information of the videos (for example, the time at which a video was recorded or downloaded).
  • Step S620 Obtain the text description information of N videos through the image-text conversion model.
  • One video can correspond to one piece of text description information; N pieces of text description information can be obtained from the N videos through the image-to-text conversion model. The image-to-text conversion model is used to convert a video into text information; that is, the image information included in the video can be converted into text description information, and the image content included in the video can be described by the text description information.
  • the image-to-text conversion model may include a CLIP model.
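  • As an illustration of how such an image-to-text step might be built on a CLIP-style model: CLIP scores image-text pairs rather than generating free text, so one common approach is to score a set of candidate captions against a frame and keep the best match. The Python sketch below makes this assumption; the open_clip package, model name, candidate captions and file name are illustrative, not the implementation of this application.

    import torch
    import open_clip
    from PIL import Image

    # Load a CLIP-style model (model name and weights are illustrative).
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="openai")
    tokenizer = open_clip.get_tokenizer("ViT-B-32")

    # Candidate captions for a frame; in practice these could come from a
    # caption generator or a predefined description vocabulary.
    candidates = ["a user is packing luggage",
                  "a user is taking an airplane",
                  "a user walks on the beach"]

    image = preprocess(Image.open("frame.jpg")).unsqueeze(0)  # illustrative file
    text = tokenizer(candidates)

    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        scores = (img_feat @ txt_feat.T).squeeze(0)  # cosine similarity per caption

    print(candidates[scores.argmax().item()])  # best-matching description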
  • Step S630 Input the text description information of N videos into the pre-trained video topic classification model to obtain video topic information.
  • The video theme can refer to the theme idea associated with the overall image content of the videos; for different video themes, the corresponding video processing methods can be different; for example, different video themes can use different music, different transition special effects, different image processing filters, or different video editing methods.
  • The video theme information of the N videos is one piece of theme information; that is, the video theme information is the theme information corresponding to the N videos as a whole.
  • the pre-trained video topic classification model may be a pre-trained text classification model
  • the text classification model may be a deep neural network
  • The video topic classification model can be obtained through training based on the following training data set: the training data set includes sample text description information and video topic text information, and the sample text description information corresponds to the video topic text information; the sample text description information can be one or more sentences of text; the video topic text information can be phrase text.
  • For example, the sample text description information may include: "Multiple people are eating", "Multiple people are playing games", and "Multiple people are talking"; the video topic text information corresponding to these sample description texts may be "party". As another example, the sample text description information may include "an adult and a child are taking pictures" and "an adult and a child are playing games"; the video topic information corresponding to these sample description texts is "parent-child".
  • Inputting one video into the image-to-text conversion model can obtain one piece of text description information, so N pieces of text description information can be obtained for the N videos; the N pieces of text description information are input into the pre-trained video topic classification model to obtain the video topic information corresponding to the N pieces of text description information; the video topic information can include but is not limited to: travel, party, pets, sports, scenery, parent-child, work, etc.
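  • A minimal sketch of this caption-to-theme step, assuming a public zero-shot text classifier stands in for the pre-trained video topic classification model (the actual model of this application is a trained deep neural network, so this is only an approximation for illustration):

    from transformers import pipeline

    # Per-video captions produced by the image-to-text step.
    captions = ["a user is packing luggage",
                "a user is taking a taxi to the airport",
                "a user is taking an airplane",
                "a user walks on the beach"]
    themes = ["travel", "party", "pets", "sports",
              "scenery", "parent-child", "work"]

    # Zero-shot classification over the joined captions; the model name is
    # an illustrative public checkpoint, not the model of this application.
    classifier = pipeline("zero-shot-classification",
                          model="facebook/bart-large-mnli")
    result = classifier(". ".join(captions), candidate_labels=themes)
    print(result["labels"][0])  # expected to be "travel" for these captions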
  • In the solution of this application, the video theme information of the N videos is obtained through the text description information of the N videos, rather than directly from the image information of the N videos. Text information carries richer information than image information, and there is language correlation between multiple pieces of text information; therefore, obtaining the video topic information based on the text description information of the N videos can improve the accuracy of the topic information.
  • For example, the N videos include a video of the user packing luggage, a video of the user going out and taking a car to the airport, a video of the user taking an airplane, and a video of the user's activities at the beach. Based on image information alone, only some labels may be obtained, such as clothing, suitcases, users, and seaside; from these image tags it is impossible to abstract that the theme of the N videos is travel. However, when identifying the theme of the N videos based on their text description information, the video theme information can be accurately obtained from the language and logic correlation between the pieces of text description information; for example, based on the text description information "a user is packing luggage", "a user is taking an airplane", and "a user walks on the beach", the video theme information of the N videos can be abstracted as travel. Therefore, obtaining the video theme information of the N videos through their text description information can improve the accuracy of the theme information.
  • In one example, a prompt box can be displayed in the electronic device; the prompt box can include candidate video topic information, and the video topic information of the N videos is determined based on the user's operation on the candidate video topic information in the prompt box.
  • For example, the display interface 319 can be displayed on the electronic device; the display interface 319 includes a prompt box 320, and the prompt box 320 includes two pieces of candidate video theme information, namely scenery and travel. If the electronic device detects that the user clicks "Scenery", the video theme information of the N videos is scenery; if the electronic device detects that the user clicks "Travel", the video theme information of the N videos is travel.
  • For details of step S630, please refer to the relevant description of step S530 in FIG. 13.
  • Step S640 Obtain similarity confidence values of image features and video topic information in N videos based on the similarity evaluation model.
  • the similarity evaluation model may be a pre-trained neural network model; the similarity evaluation model is used to output the correlation between the image features included in each of the N videos and the video topic information.
  • The similarity evaluation model can include an image encoder, a text encoder and a similarity measurement module; the image encoder is used to extract features from images in the videos to obtain image features; the text encoder is used to extract features from the video topic information to obtain text features; the similarity measurement module is used to evaluate the similarity between the image features and the text features.
  • the image features in N videos and the text features of the video topic information can be extracted; the image features and text features are compared to obtain the similarity between the image features and the text features.
  • The similarity evaluation model can output a distance measurement value, or the similarity evaluation model can output a similarity confidence value. If the similarity evaluation model outputs a distance measurement value, the smaller the distance measurement value, the higher the similarity between the image feature and the text feature; the similarity confidence value between the image feature and the text feature can be obtained based on the distance measurement value. If the similarity evaluation model outputs a similarity confidence value, the greater the similarity confidence value, the higher the similarity between the image feature and the text feature.
  • all image features in the N videos can be extracted; or some image features in the N videos can be extracted; this application does not impose any limitation on this.
  • image features can be extracted from each frame of the N videos through the image encoder in the similarity evaluation model to obtain all image features included in the N videos.
  • the image features in the N videos can be extracted through the image encoder in the similarity evaluation model based on the same number of interval frames to obtain partial image features in the N videos.
  • For example, image features can be extracted at a sampling step of 4 frames; that is, the 1st frame, the 5th frame, the 9th frame, the 13th frame, and so on, of one of the N videos can be extracted.
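  • The frame sampling and per-frame scoring described above could look like the following sketch, where encode_image and encode_text are hypothetical stand-ins for the similarity evaluation model's encoders:

    import numpy as np

    def frame_theme_confidences(frames, theme, encode_image, encode_text, step=4):
        """Return (frame_index, confidence) for each sampled frame."""
        t = encode_text(theme)               # text feature of the theme phrase
        t = t / np.linalg.norm(t)
        out = []
        for i in range(0, len(frames), step):
            f = encode_image(frames[i])      # image feature of the sampled frame
            f = f / np.linalg.norm(f)
            cos = float(f @ t)               # cosine similarity in [-1, 1]
            out.append((i, (cos + 1.0) / 2.0))  # rescale to a [0, 1] confidence
        return out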
  • For details of step S640, please refer to the related description of step S540 in Figure 13, the related description of steps S551 to S553 in Figure 14, or the related description of Figure 15.
  • Step S650 Obtain M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video topic information.
  • continuous multi-frame image features can be selected from the N videos to obtain a video clip.
  • the similarity curve of the image features in the video and the video theme information can be obtained; based on the similarity curve, multiple consecutive frames of images can be selected from the video to obtain a video clip.
  • Multiple consecutive frames of images related to the video theme can be selected from a video to obtain a video clip; based on the solution of the present application, it can be ensured that the selected video clips are related to the overall video theme.
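  • A sketch of turning the similarity curve into clips, keeping maximal runs of consecutive sampled frames whose confidence exceeds a preset threshold (the threshold and minimum run length are illustrative assumptions):

    def select_clips(confidences, threshold=0.7, min_len=2):
        """confidences: list of (frame_index, confidence) in frame order."""
        clips, run = [], []
        for idx, conf in confidences:
            if conf > threshold:
                run.append(idx)
                continue
            if len(run) >= min_len:
                clips.append((run[0], run[-1]))   # (start_frame, end_frame)
            run = []
        if len(run) >= min_len:
            clips.append((run[0], run[-1]))
        return clips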
  • Step S660 Based on the duration and video theme information of the M video clips, perform music matching processing in the candidate music library to obtain background music.
  • the total duration of the background music can be determined based on the duration of the M video clips.
  • Usually, the duration of the background music selected when performing music matching needs to be greater than or equal to the total duration of the M video clips; the style of the background music can be determined based on the video theme information.
  • For example, if the video theme is sports, the background music can be a cheerful music style; if the video theme is landscape, the background music can be a soothing music style.
  • the total duration of background music can be determined based on the duration of M video clips; the music style of the background music can be determined based on the video theme information; and the background music can be randomly selected from the candidate music library based on the total duration and music style.
  • the total duration of the background music can be determined based on the duration of M video clips; the music style of the background music can be determined based on the video theme information; and the background music can be selected from the candidate music library according to music popularity based on the total duration and music style.
  • the total duration of the background music can be determined based on the duration of M video clips; the music style of the background music can be determined based on the video theme information; and the background music can be selected based on the user's preferences in the candidate music library based on the total duration and music style.
  • For example, background music that satisfies the total duration and music style can be selected from the candidate music library based on the frequency with which the user plays music.
  • the total duration of the background music can be determined based on the duration of M video clips; the music style of the background music can be determined based on the video theme information; the music with the highest matching degree to the video theme can be selected from the candidate music library as the background music.
  • Alternatively, the total duration of the background music can be determined based on the duration of the M video clips; the music style of the background music can be determined based on the video theme information; multiple pieces of music can be selected from the candidate music library and edited together to obtain the background music, where the weight or duration of each of the multiple pieces of music can be based on the user's preferences or on preset fixed parameters.
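  • The music matching variants above share one shape: filter the candidate library by total duration and theme-derived style, then rank the survivors. A sketch, with the Track fields and the theme-to-style table as illustrative assumptions:

    from dataclasses import dataclass

    @dataclass
    class Track:
        name: str
        duration: float   # seconds
        style: str        # e.g. "cheerful", "soothing"
        play_count: int   # stands in for popularity / user play frequency

    # Illustrative mapping from video theme to music style.
    THEME_TO_STYLE = {"sports": "cheerful", "scenery": "soothing"}

    def pick_background_music(library, clip_durations, theme):
        total = sum(clip_durations)
        style = THEME_TO_STYLE.get(theme)
        # Keep only tracks long enough for all clips and matching the style.
        candidates = [t for t in library
                      if t.duration >= total and t.style == style]
        if not candidates:
            return None
        # Rank by popularity; random.choice(candidates) would give the
        # random-selection variant instead.
        return max(candidates, key=lambda t: t.play_count)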
  • Step S670 Input M video clips and background music to the pre-trained audio-visual rhythm matching model to obtain sorted M video clips.
  • the audio-visual rhythm matching model can be a neural network, and the audio-visual rhythm matching model to be trained can be trained by obtaining sample music clips to obtain a trained audio-visual rhythm matching model.
  • The audio-visual rhythm matching model can include an audio encoder, a video encoder and a similarity measurement module; the audio encoder is used to extract the audio features of the background music; the video encoder is used to extract video features; the similarity measurement module is used to measure the similarity between the audio features and the video features.
  • the audio-visual rhythm matching model can output a distance metric value, which is used to represent the distance between audio features and video features; the larger the distance metric value, the smaller the similarity between the audio features and video features; The smaller the distance metric value, the greater the similarity between audio features and video features; based on the distance metric value, the confidence value of the similarity between audio features and video features can be obtained.
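  • As an illustrative assumption (the application does not fix a specific formula), a distance metric value $d$ can be converted into a similarity confidence value $c$ by any monotonically decreasing mapping, for example

    $c = \dfrac{1}{1 + d}, \quad d \ge 0,$

  so that a distance close to 0 yields a confidence close to 1 (highly similar) and a large distance yields a confidence close to 0 (dissimilar).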
  • the audio-visual rhythm matching model can output a similarity confidence value.
  • The similarity confidence value is used to represent the probability that an audio feature is similar to a video feature; the larger the similarity confidence value, the higher the similarity between the audio feature and the video feature; the smaller the similarity confidence value, the lower the similarity between the audio feature and the video feature.
  • A training data set can be obtained to train the audio-visual rhythm matching model to be trained, obtaining a trained audio-visual rhythm matching model. The training data set includes sample matching music shorts and sample non-matching music shorts; a sample matching music short refers to a music short whose music matches its image content; a sample non-matching music short refers to a music short whose music and image content do not match; for example, the background music of music short 1 and the images of music short 2 can be mixed to obtain a sample non-matching music short. By learning from a large number of such training samples, the audio-visual rhythm matching model can sort the input M video clips based on the rhythm of the input background music.
  • Background music and M video clips can be input into the pre-trained audio-visual rhythm matching model, and the audio-visual rhythm matching model can output the sorting of the M video clips. Suppose the M video clips include video clip 1, video clip 2 and video clip 3, and the background music can be divided into three audio features, namely audio feature 1, audio feature 2 and audio feature 3. The correlation between audio feature 1 and video clip 1, video clip 2 and video clip 3 is determined to obtain the video clip, among the 3 video clips, with the highest matching degree with audio feature 1; the correlation between audio feature 2 and video clip 1, video clip 2 and video clip 3 is determined to obtain the video clip with the highest matching degree with audio feature 2; the correlation between audio feature 3 and video clip 1, video clip 2 and video clip 3 is determined to obtain the video clip with the highest matching degree with audio feature 3; finally, the video clip corresponding to each audio feature can be output.
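  • One way to realize this one-to-one matching between audio features and video clips is to treat the model's pairwise similarity scores as a matrix and solve an assignment problem; the Hungarian algorithm below is an illustrative choice, not necessarily the method of this application:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # sim[i, j]: similarity confidence between audio feature i and video
    # clip j, as would be produced by the rhythm matching model (the
    # values here are made up for illustration).
    sim = np.array([[0.2, 0.3, 0.9],
                    [0.4, 0.8, 0.1],
                    [0.7, 0.2, 0.3]])

    rows, cols = linear_sum_assignment(-sim)  # negate to maximize similarity
    print(cols)  # -> [2 1 0]: audio feature 1 -> clip 3, 2 -> clip 2, 3 -> clip 1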
  • In this way, the M video clips can be sorted according to the rhythm of the background music, so that the picture content of the video clips matches the music rhythm; compared with matching videos directly with music according to the input order, the solution of this application can improve the consistency between the image content in the video and the rhythm of the background music, and improve the user experience.
  • Step S680 Obtain the processed video based on the sorted M video clips and the background music.
  • a processed video can be obtained based on the video content of the sorted M video clips and the audio information of the background music.
  • The transition effect of a video refers to the use of certain techniques between two scenes (for example, between two pieces of material), such as wipes, dissolves, and page curls, to achieve a smooth transition between scenes or plots, or to enrich the picture and attract the audience.
  • Implementation method 2: For videos with strong story lines, obtain N videos; obtain the video theme of the N videos based on the text description information of the N videos; sort the N videos based on the text description information of the N videos to obtain the sorted N videos; determine the sorted M video clips based on the similarity confidence values between the video clips included in the sorted N videos and the video theme; determine background music matching the sorted M video clips based on the sorted M video clips and the video theme information; and generate the processed video.
  • FIG. 19 is a schematic flow chart of a video editing method provided by an embodiment of the present application.
  • the video editing method 700 can be executed by the electronic device shown in FIG. 1; the video editing method 700 includes steps S710 to S780, and steps S710 to S780 are described in detail below respectively.
  • Step S710 Obtain N videos.
  • The N videos may be videos stored in the electronic device; the N videos may be videos collected by the electronic device, or some or all of the N videos may be downloaded videos; this application does not limit the source of the N videos in any way.
  • For example, when the electronic device detects the user's click operation on N videos in the gallery application, the N videos can be obtained.
  • Step S720 Obtain the text description information of N videos through the image-text conversion model.
  • One video can correspond to one piece of text description information; N pieces of text description information can be obtained from the N videos through the image-to-text conversion model. The image-to-text conversion model is used to convert a video into text information; that is, the image information included in the video can be converted into text description information, and the image content included in the video can be described by the text description information.
  • the image-to-text conversion model may include a CLIP model.
  • Step S730 Input the text description information of N videos into the pre-trained video topic classification model to obtain video topic information.
  • The video theme can refer to the theme idea associated with the overall image content of the videos; for different video themes, the corresponding video processing methods can be different; for example, different video themes can use different music, different transition special effects, different image processing filters, or different video editing methods.
  • The video theme information of the N videos is one piece of theme information; that is, the video theme information is the theme information corresponding to the N videos as a whole.
  • the pre-trained video topic classification model may be a pre-trained text classification model
  • the text classification model may be a deep neural network
  • The video topic classification model can be obtained through training based on the following training data set: the training data set includes sample text description information and video topic text information, and the sample text description information corresponds to the video topic text information; the sample text description information can be one or more sentences of text; the video topic text information can be phrase text.
  • For example, the sample text description information may include: "Multiple people are eating", "Multiple people are playing games", and "Multiple people are talking"; the video topic text information corresponding to these sample description texts may be "party". As another example, the sample text description information may include "an adult and a child are taking pictures" and "an adult and a child are playing games"; the video topic information corresponding to these sample description texts is "parent-child".
  • Inputting one video into the image-to-text conversion model can obtain one piece of text description information, so N pieces of text description information can be obtained for the N videos; the N pieces of text description information are input into the pre-trained video topic classification model to obtain the video topic information corresponding to the N pieces of text description information; the video topic information can include but is not limited to: travel, party, pets, sports, scenery, parent-child, work, etc.
  • In the solution of this application, the video theme information of the N videos is obtained through the text description information of the N videos, rather than directly from the image information of the N videos. Text information carries richer information than image information, and there is language correlation between multiple pieces of text information; therefore, obtaining the video topic information based on the text description information of the N videos can improve the accuracy of the topic information.
  • For example, the N videos include a video of the user packing luggage, a video of the user going out and taking a car to the airport, a video of the user taking an airplane, and a video of the user's activities at the beach. Based on image information alone, only some labels may be obtained, such as clothing, suitcases, users, and seaside; from these image tags it is impossible to abstract that the theme of the N videos is travel. However, when identifying the theme of the N videos based on their text description information, the video theme information can be accurately obtained from the language and logic correlation between the pieces of text description information; for example, based on the text description information "a user is packing luggage", "a user is taking an airplane", and "a user walks on the beach", the video theme information of the N videos can be abstracted as travel. Therefore, obtaining the video theme information of the N videos through their text description information can improve the accuracy of the theme information.
  • In one example, a prompt box can be displayed in the electronic device; the prompt box can include candidate video topic information, and the video topic information of the N videos is determined based on the user's operation on the candidate video topic information in the prompt box.
  • For example, the display interface 319 can be displayed on the electronic device; the display interface 319 includes a prompt box 320, and the prompt box 320 includes two pieces of candidate video theme information, namely scenery and travel. If the electronic device detects that the user clicks "Scenery", the video theme information of the N videos is scenery; if the electronic device detects that the user clicks "Travel", the video theme information of the N videos is travel.
  • Step S740 Sort the N videos based on the text description information of the N videos to obtain the sorted N videos.
  • It should be understood that there is no restriction on the execution order of step S730 and step S740; step S730 can be executed first and then step S740; or step S740 can be executed first and then step S730; or step S730 and step S740 can be executed simultaneously.
  • For videos with strong story lines, the order of the N videos may be wrong; for example, suppose there are 3 downloaded videos:
  • Video 1: a person goes home from the amusement park
  • Video 2: a person plays on entertainment facilities in the amusement park
  • Video 3: a person goes to the amusement park by car
  • If sorted by timestamp, the order is: video 1, video 2 and video 3; however, the normal order of a person's day trip should be going out to the destination, arriving at the destination, and going home from the destination; the above sorting based on timestamps obviously does not conform to the reasonable logical sequence of a travel video. Therefore, for videos with strong story lines, sorting the N videos based on video timestamps may be wrong, resulting in a poor viewing experience for the user. In the solution of this application, for videos with strong story lines, the N videos can be sorted based on the text description information of the videos, ensuring that the sorted N videos conform to the normal causal sequence and improving the user's viewing experience.
  • The ranking of the N videos can be obtained based on the text description information of the N videos and the natural-language correlation between the pieces of text description information.
  • For example, the text description information of the N videos can be input into a pre-trained ranking model; the ranking model can be a neural network, and the pre-trained ranking model can be trained through the back-propagation algorithm based on a training data set.
  • The training data set can include a sample topic text and the ranking of multiple sample description texts, where the sample topic text corresponds to the multiple sample description texts.
  • For example, the sample topic text is "outing", and the order of the multiple sample description texts is: sample description text 1 is "a person goes out"; sample description text 2 is "a person is on the way to the destination"; sample description text 3 is "a person arrives at the destination"; sample description text 4 is "a person is active at the destination"; sample description text 5 is "a person leaves the destination on the way back to the departure place"; sample description text 6 is "a person arrives back at the departure place".
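  • A sketch of how such a ranking model might be applied at inference time; position_model is a hypothetical interface that scores how early a caption falls in a story line (lower score means earlier), not the actual interface of this application:

    def sort_videos_by_story(captions, position_model):
        """captions: one text description per video.
        position_model(caption) -> float; lower means earlier in the story."""
        return sorted(range(len(captions)),
                      key=lambda i: position_model(captions[i]))

    # captions = ["a person goes home from the amusement park",
    #             "a person plays in the amusement park",
    #             "a person goes to the amusement park by car"]
    # a well-trained model would yield the order [2, 1, 0].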
  • Step S750 Obtain similarity confidence values between images and video topic information in N videos based on the similarity evaluation model.
  • the similarity evaluation model may be a pre-trained neural network model; the similarity evaluation model is used to output the correlation between the image features included in each of the N videos and the video topic information.
  • The similarity evaluation model can include an image encoder, a text encoder and a similarity measurement module; the image encoder is used to extract features from images in the videos to obtain image features; the text encoder is used to extract features from the video topic information to obtain text features; the similarity measurement module is used to evaluate the similarity between the image features and the text features.
  • the image features in N videos and the text features of the video topic information can be extracted; the image features and text features are compared to obtain the similarity between the image features and the text features.
  • The similarity evaluation model can output a distance measurement value, or the similarity evaluation model can output a similarity confidence value. If the similarity evaluation model outputs a distance measurement value, the smaller the distance measurement value, the higher the similarity between the image feature and the text feature; the similarity confidence value between the image feature and the text feature can be obtained based on the distance measurement value. If the similarity evaluation model outputs a similarity confidence value, the greater the similarity confidence value, the higher the similarity between the image feature and the text feature.
  • all image features in the N videos can be extracted; or some image features in the N videos can be extracted; this application does not impose any limitation on this.
  • image features can be extracted from each frame of the N videos through the image encoder in the similarity evaluation model to obtain all image features included in the N videos.
  • the image features in the N videos can be extracted through the image encoder in the similarity evaluation model based on the same number of interval frames to obtain partial image features in the N videos.
  • For example, image features can be extracted at a sampling step of 4 frames; that is, the 1st frame, the 5th frame, the 9th frame, the 13th frame, and so on, of one of the N videos can be extracted.
  • Step S760 Based on the similarity confidence values between the images in the N videos and the video topic information, obtain the sorted M video clips.
  • continuous multi-frame image features can be selected from the N videos to obtain a video clip. For example, when the similarity confidence value of consecutive multi-frame images is greater than a preset threshold, a video segment composed of multi-frame images is obtained.
  • the M video clips selected from the sorted N videos are sequential video clips, that is, the sorted M video clips are obtained.
  • For example, the N videos include 3 videos, and the order of the 3 videos is video 2, video 1 and video 3. Suppose video clip 2-1 and video clip 2-2 are selected from video 2, with the time of video clip 2-1 before the time of video clip 2-2; video clip 1-1 is selected from video 1; video clip 3-1 and video clip 3-2 are selected from video 3, with the time of video clip 3-1 before the time of video clip 3-2; then the order of the five video clips is video clip 2-1, video clip 2-2, video clip 1-1, video clip 3-1, and video clip 3-2.
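  • This order-preserving selection can be expressed directly; the video names and frame ranges below are illustrative:

    # Gather clips video by video so the M clips inherit both the video
    # order and the in-video time order.
    sorted_videos = ["video2", "video1", "video3"]
    clips_per_video = {
        "video2": [(10, 60), (120, 180)],   # clip 2-1 precedes clip 2-2
        "video1": [(30, 90)],
        "video3": [(0, 50), (200, 260)],
    }

    ordered_clips = [(v, c) for v in sorted_videos
                     for c in sorted(clips_per_video[v])]
    # -> clip 2-1, clip 2-2, clip 1-1, clip 3-1, clip 3-2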
  • Step S770 Based on the duration and video theme information of the sorted M video clips, perform music matching processing in the candidate music library to obtain background music that matches the sorted M video clips.
  • Since the M video clips are sorted video clips, it is necessary to use the sorted M video clips as a benchmark and select appropriate background music to match them; the rhythm of the selected background music should match the styles of the images of the different video clips in the sorted M video clips.
  • For example, if the images at the beginning of the sorted M video clips are soothing and the images in the middle are cheerful, the selected background music should be music with a soothing intro and a cheerful middle tempo.
  • Step S780 Obtain the processed video based on the sorted M video clips and background music.
  • a processed video can be obtained based on the video content of the sorted M video clips and the audio information of the background music.
  • The transition effect of a video refers to the use of certain techniques between two scenes (for example, between two pieces of material), such as wipes, dissolves, and page curls, to achieve a smooth transition between scenes or plots, or to enrich the picture and attract the audience.
  • It should be noted that in implementation method 1, for videos without strong story lines, the sorted M video clips are obtained based on the rhythm of the background music; in implementation method 2, for M video clips with strong story lines, the sorted M video clips are obtained based on the forward and backward causal connections of the M video clips, and music whose rhythm matches the sorted M video clips is selected as the background music.
  • FIG 20 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 800 includes a display module 810 and a processing module 820 .
  • The display module 810 is used to display a first interface, where the first interface includes video icons, and the videos indicated by the video icons are videos stored in the electronic device; the processing module 820 is used to: detect a first operation on N video icons among the video icons; in response to the first operation, obtain information of N videos, where N is an integer greater than 1; obtain the video theme of the N videos based on the information of the N videos; select M video clips from the N videos based on the similarity between the images in the N videos and the video theme; obtain music matching the video theme based on the video theme; and obtain a first video based on the M video clips and the music; the display module 810 is also used to display the first video.
  • Optionally, the processing module 820 is specifically used to: input the N videos and the video theme into a pre-trained similarity matching model to obtain similarity confidence values between the images in the N videos and the video theme, where the pre-trained similarity matching model includes an image encoder, a text encoder and a first similarity measurement module; the image encoder is used to extract image features from the N videos; the text encoder is used to extract text features from the video theme; the first similarity measurement module is used to measure the similarity between the image features in the N videos and the text features of the video theme, and the similarity confidence value is used to represent the probability that the images in the N videos are similar to the video theme; and select the M video clips from the N videos based on the similarity confidence values.
  • Optionally, the processing module 820 is specifically used to: sort the M video clips to obtain sorted M video clips; and synthesize the sorted M video clips and the music into the first video.
  • Optionally, the processing module 820 is specifically used to: sort the M video clips based on the rhythm of the music to obtain the sorted M video clips.
  • Optionally, the processing module 820 is specifically used to: sort the M video clips based on the video contents in the M video clips to obtain the sorted M video clips.
  • Optionally, the processing module 820 is specifically used to: input the music and the M video clips into a pre-trained audio-visual rhythm matching model to obtain the sorted M video clips.
  • Optionally, the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder and a second similarity measurement module; the audio encoder is used to extract features of the music to obtain audio features; the video encoder is used to extract features of the M video clips to obtain video features; the second similarity measurement module is used to measure the similarity between the audio features and the video features.
  • Optionally, the processing module 820 is specifically used to: obtain N pieces of text description information of the N videos, where the N pieces of text description information correspond to the N videos one-to-one, one of the N pieces of text description information is used to describe the image content information of one video among the N videos, and the text description information is used to convert the video content in the N videos into text information; and obtain the theme information of the N videos based on the N pieces of text description information.
  • Optionally, the processing module 820 is specifically used to: input the N pieces of text description information into a pre-trained topic classification model to obtain the topic information of the N videos.
  • the pre-trained topic classification model is a deep neural network for text classification.
  • Optionally, the display module 810 is also configured to: display a second interface, where the second interface includes a prompt box, and the prompt box includes information on at least two video themes; the processing module 820 is specifically used to: in response to the user's operation on the prompt box, obtain the theme information of the N videos.
  • Optionally, obtaining music that matches the video theme based on the video theme includes: performing music matching processing in a candidate music library based on the video theme to obtain the music.
  • the pre-trained similarity matching model is a Transformer model.
  • the pre-trained similarity matching model is obtained through the following training method:
  • The similarity matching model to be trained is trained using the contrastive learning training method based on the first training data set to obtain the pre-trained similarity matching model; the first training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample text description information and first sample video theme information, where the first sample text description information matches the first sample video theme information; a negative example data pair includes the first sample text description information and second sample video theme information, where the first sample text description information does not match the second sample video theme information.
  • the pre-trained audio-visual rhythm matching model is a Transformer model.
  • the pre-trained audio-visual rhythm matching model is obtained through the following training method:
  • The audio-visual rhythm matching model to be trained is trained using the contrastive learning training method based on the second training data set to obtain the pre-trained audio-visual rhythm matching model; the second training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample music and a first sample video, where the rhythm of the first sample music matches the content of the first sample video; a negative example data pair includes the first sample music and a second sample video, where the rhythm of the first sample music does not match the content of the second sample video.
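  • A sketch of the contrastive training objective described above, using an InfoNCE-style loss in PyTorch; the encoders and batch construction are assumed placeholders, not the architecture of this application:

    import torch
    import torch.nn.functional as F

    def info_nce_loss(audio_emb, video_emb, temperature=0.07):
        """audio_emb, video_emb: (B, D) tensors; row i of each forms a
        positive (matching) pair, and all other rows in the batch act as
        negatives, mirroring the positive/negative pairs described above."""
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        logits = a @ v.T / temperature        # (B, B) similarity matrix
        targets = torch.arange(a.size(0))     # positives lie on the diagonal
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.T, targets)) / 2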
  • It should be understood that a module can be implemented in the form of software and/or hardware, which is not specifically limited.
  • a “module” may be a software program, a hardware circuit, or a combination of both that implements the above functions.
  • For example, the hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor, or a group processor) and memory for executing one or more software or firmware programs, merged logic circuitry, and/or other suitable components that support the described functionality.
  • the units of each example described in the embodiments of the present application can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each specific application, but such implementations should not be considered beyond the scope of this application.
  • Figure 21 shows a schematic structural diagram of an electronic device provided by this application.
  • the dotted line in Figure 21 indicates that this unit or module is optional; the electronic device 900 can be used to implement the video editing method described in the above method embodiment.
  • the electronic device 900 includes one or more processors 901, and the one or more processors 901 can support the electronic device 900 to implement the video editing method in the method embodiment.
  • Processor 901 may be a general-purpose processor or a special-purpose processor.
  • For example, the processor 901 may be a central processing unit (CPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic devices, such as discrete gates, transistor logic devices, or discrete hardware components.
  • the processor 901 can be used to control the electronic device 900, execute software programs, and process data of the software programs.
  • the electronic device 900 may also include a communication unit 905 to implement input (reception) and output (transmission) of signals.
  • For example, the electronic device 900 may be a chip, and the communication unit 905 may be an input and/or output circuit of the chip; or the communication unit 905 may be a communication interface of the chip, and the chip may be used as a component of a terminal device or another electronic device.
  • Alternatively, the electronic device 900 may be a terminal device, and the communication unit 905 may be a transceiver of the terminal device or a transceiver circuit of the terminal device. The electronic device 900 may include one or more memories 902, in which a program 904 is stored.
  • The program 904 may be run by the processor 901 to generate instructions 903, so that the processor 901 executes the video editing method described in the above method embodiments according to the instructions 903.
  • data may also be stored in the memory 902 .
  • the processor 901 can also read data stored in the memory 902.
  • the data can be stored at the same storage address as the program 904, or the data can be stored at a different storage address than the program 904.
  • the processor 901 and the memory 902 can be provided separately or integrated together, for example, integrated on a system on chip (SOC) of the terminal device.
  • the memory 902 can be used to store the related program 904 of the video editing method provided in the embodiment of the present application, and the processor 901 can be used to call the related program 904 of the video editing method stored in the memory 902 when executing the video editing method.
  • When executing the video editing method of the embodiment of the present application, the processor 901 may, for example: display a first interface, where the first interface includes video icons, and the videos indicated by the video icons are videos stored in the electronic device; detect a first operation on N video icons among the video icons; in response to the first operation, obtain information of N videos, where N is an integer greater than 1; obtain the video theme of the N videos based on the information of the N videos; select M video clips from the N videos based on the similarity between the images in the N videos and the video theme; obtain music matching the video theme based on the video theme; obtain a first video based on the M video clips and the music; and display the first video.
  • this application also provides a computer program product, which when executed by the processor 901 implements the video editing method in any method embodiment of this application.
  • the computer program product may be stored in the memory 902, such as the program 904.
  • the program 904 is finally converted into an executable object file that can be executed by the processor 901 through processes such as preprocessing, compilation, assembly, and linking.
  • this application also provides a computer-readable storage medium on which a computer program is stored.
  • When the computer program is executed by a computer, the video editing method of any method embodiment in this application is implemented.
  • the computer program may be a high-level language program or an executable object program.
  • the computer-readable storage medium is memory 902.
  • Memory 902 may be volatile memory or non-volatile memory, or memory 902 may include both volatile memory and non-volatile memory.
  • The non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which is used as an external cache.
  • For example, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the embodiments of the electronic equipment described above are only illustrative.
  • The division of the modules is only a logical function division; in actual implementation, there may be other division methods; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • It should be understood that, in the various embodiments of the present application, the size of the sequence numbers of the processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • The technical solution of the present application essentially, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which can be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • The aforementioned storage media include: USB flash drives, mobile hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present application relates to the field of video processing, and provides a video editing method and an electronic device. The video editing method is applied to an electronic device, and comprises: displaying a first interface, the first interface comprising video icons, and videos indicated by the video icons being videos stored in the electronic device; detecting a first operation on N video icons among the video icons; in response to the first operation, acquiring information of N videos, N being an integer greater than 1; obtaining a video theme of the N videos on the basis of the information of the N videos; selecting M video clips in the N videos on the basis of the similarities between images in the N videos and the video theme; on the basis of the video theme, obtaining music matching the video theme; obtaining a first video on the basis of the M video clips and the music; and displaying the first video. On the basis of the solution of the present application, the problem that an edited video has image content unrelated to the overall video theme of the N videos can be avoided, and the video quality of the edited video is improved.

Description

Video editing method and electronic device
This application claims priority to the Chinese patent application No. 202211024258.6, filed with the State Intellectual Property Office on August 25, 2022 and entitled "Video Editing Method and Electronic Device", the entire content of which is incorporated herein by reference.
Technical Field
This application relates to the field of video, and in particular, to a video editing method and an electronic device.
Background
With the development of short video technology in electronic devices, users have increasingly high demands for video editing functions. Video mixing and editing refers to a video editing technique in which multiple videos are divided, target segments are selected from them, and the video segments are then recombined and background music is added to generate a new video.
Currently, users can automatically edit multiple videos through existing applications to achieve video mixing and editing; however, existing applications lack professionalism when editing multiple videos, resulting in problems in the processed video; for example, the edited video may contain image content that is irrelevant to the overall video theme of the multiple videos.
Therefore, how to improve the professionalism of automatic video editing in electronic devices and improve the video quality of edited videos has become an urgent problem to be solved.
Summary
This application provides a video editing method and an electronic device, which can avoid the problem of the edited video containing image content irrelevant to the overall video theme of the N videos, and improve the video quality of the edited video.
In a first aspect, a video editing method is provided, applied to an electronic device, and including:
displaying a first interface, where the first interface includes video icons, and the videos indicated by the video icons are videos stored in the electronic device;
detecting a first operation on N video icons among the video icons;
in response to the first operation, obtaining information of N videos, where N is an integer greater than 1;
obtaining a video theme of the N videos based on the information of the N videos;
selecting M video clips from the N videos based on the similarity between images in the N videos and the video theme;
obtaining music matching the video theme based on the video theme;
obtaining a first video based on the M video clips and the music;
displaying the first video.
In the embodiments of this application, M video clips can be selected from the N videos based on the similarity between the images in the N videos and the video theme, and the processed video, that is, the first video, is obtained based on the M video clips and the music. In the solution of this application, based on the similarity between the images included in the N videos and the video theme, the M video clips in the N videos that are highly relevant to the video theme can be determined; video clips irrelevant to the overall video theme information can be effectively removed from the N videos, ensuring that the selected video clips are related to the video theme and improving the video quality of the edited first video.
With reference to the first aspect, in some implementations of the first aspect, selecting M video clips from the N videos based on the similarity between the images in the N videos and the video theme includes:
inputting the N videos and the video theme into a pre-trained similarity matching model to obtain similarity confidence values between the images in the N videos and the video theme, where the pre-trained similarity matching model includes an image encoder, a text encoder and a first similarity measurement module; the image encoder is used to extract image features from the N videos; the text encoder is used to extract text features from the video theme; the first similarity measurement module is used to measure the similarity between the image features in the N videos and the text features of the video theme; and the similarity confidence values are used to represent the probability that the images in the N videos are similar to the video theme;
selecting the M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video theme.
In the embodiments of this application, the similarity between the image features in a video and the text features of the video theme can be identified through a pre-trained similarity matching model. The pre-trained similarity matching model can be a multi-modal model that supports two different types of input data, images and text; through the pre-trained similarity matching model, text features and image features can be mapped into a unified space, thereby improving visual and textual understanding. In the solution of this application, the similarity between the image features in a video and the text features of the video theme can be identified intelligently based on the pre-trained similarity matching model.
结合第一方面,在第一方面的某些实现方式中,基于M个视频片段与音乐,得到第一视频,包括:Combined with the first aspect, in some implementations of the first aspect, the first video is obtained based on M video clips and music, including:
对M个视频片段进行排序,得到排序后的M个视频片段;Sort the M video clips to obtain the sorted M video clips;
将排序后的M个视频片段与音乐合成为第一视频。The sorted M video clips and music are synthesized into the first video.
在本申请的实施例中,能够使得M个视频片段中的图像内容与音乐中的音乐节奏更加吻合;例如,视频图像内容为风景,则可以对应于音乐的前奏或者舒缓的音乐部分;视频图像内容为用户的运动场景,则可以对应于背景音乐中的高潮部分;通过对M个视频片段进行排序,使得M个视频片段与音乐的节奏卡点更加匹配;从而解决第一视频中存在的视频片段与背景音乐不匹配的问题,即能够解决第一视频的视频内容与音乐的节奏卡点不完全匹配的问题,提高第一视频的视频质量。In the embodiments of the present application, the image content of the M video clips can be made to fit the rhythm of the music more closely; for example, if the video image content is scenery, it can correspond to the prelude or a soothing part of the music; if the video image content is a sports scene of the user, it can correspond to the climax of the background music. By sorting the M video clips, the M video clips better match the rhythmic beat points of the music, which solves the problem of video clips in the first video not matching the background music, that is, the problem of the video content of the first video not fully matching the rhythmic beat points of the music, and improves the video quality of the first video.
结合第一方面,在第一方面的某些实现方式中,对M个视频片段进行排序,得到排序后的M个视频片段,包括:Combined with the first aspect, in some implementations of the first aspect, the M video clips are sorted to obtain the sorted M video clips, including:
基于音乐的节奏对M个视频片段排序,得到排序后的M个视频片段。Sort the M video clips based on the rhythm of the music to obtain the sorted M video clips.
在本申请的方案中,可以基于N个视频的整体视频主题信息选取背景音乐;并且可以基于背景音乐的节奏对M个视频片段进行排序,实现按照背景音乐的节奏对M个视频片段进行视频排序,使得视频片段的画面内容与音乐节奏相符合;与视频直接按照输入顺序与音乐匹配相比,本申请的方案能够提高视频中图像内容与背景音乐节奏的一致性,提升编辑后视频的视频质量。In the solution of the present application, background music can be selected based on the overall video theme information of the N videos, and the M video clips can be sorted based on the rhythm of the background music, so that the picture content of the video clips matches the rhythm of the music; compared with matching the videos to the music directly in their input order, the solution of the present application can improve the consistency between the image content in the video and the rhythm of the background music, improving the video quality of the edited video.
结合第一方面,在第一方面的某些实现方式中,对M个视频片段进行排序,得到排序后的M个视频片段,包括:Combined with the first aspect, in some implementations of the first aspect, the M video clips are sorted to obtain the sorted M video clips, including:
基于M个视频片段中的视频内容对M个视频片段进行排序,得到排序后的M个视频片段。The M video clips are sorted based on the video contents in the M video clips to obtain the sorted M video clips.
在本申请的方案中,对于强故事线的N个视频,可以基于N个视频的文本描述信息对N个视频进行排序,得到排序后的N个视频;从排序后的N个视频中选取与视频主题信息相关度较高的M个视频片段,得到排序后的M个视频片段;基于排序后的M个视频片段与视频主题信息,确定与排序后的M个视频片段相匹配的音乐作为背景音乐;使得强故事线的N个视频的画面内容与音乐节奏相匹配,且视频的画面内容播放顺序符合因果联系,提升编辑后视频的视频质量。In the solution of the present application, for N videos with a strong storyline, the N videos can be sorted based on their text description information to obtain the sorted N videos; M video clips that are highly relevant to the video theme information are selected from the sorted N videos to obtain the sorted M video clips; based on the sorted M video clips and the video theme information, music matching the sorted M video clips is determined as the background music. In this way, the picture content of the N strong-storyline videos matches the rhythm of the music, the playback order of the picture content follows the causal connection, and the video quality of the edited video is improved.
应理解,强故事线的视频可以是指N个视频之间具有因果联系,基于视频编辑方法能够识别N个视频之间的前因后果并基于前因后果的顺序对N个视频排序;例如,强故事线的视频可以包括旅行主题的视频或者出行主题的视频。It should be understood that videos with a strong storyline may refer to N videos with causal connections between them; based on the video editing method, the cause-and-effect relationships among the N videos can be identified and the N videos can be sorted in cause-and-effect order. For example, strong-storyline videos may include travel-themed videos or trip-themed videos.
结合第一方面,在第一方面的某些实现方式中,基于音乐的节奏对M个视频片段排序,得到排序后的M个视频片段,包括:Combined with the first aspect, in some implementations of the first aspect, the M video clips are sorted based on the rhythm of the music to obtain the sorted M video clips, including:
将音乐与M个视频片段输入至预先训练的影音节奏匹配模型,得到排序后的M个视频片段,预先训练的影音节奏匹配模型中包括音频编码器、视频编码器与第一相似度度量模块,音频编码器用于对音乐进行特征提取得到音频特征,视频编码器用于对M个视频片段进行特征提取得到视频特征,第一相似度度量模块用于度量音频特征与M个视频片段的视频特征的相似性。Input the music and the M video clips into a pre-trained audio-visual rhythm matching model to obtain the sorted M video clips; the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder and a first similarity measurement module; the audio encoder is used to extract features from the music to obtain audio features, the video encoder is used to extract features from the M video clips to obtain video features, and the first similarity measurement module is used to measure the similarity between the audio features and the video features of the M video clips.
在本申请的实施例中,可以通过预先训练的影音节奏匹配模型识别M个视频片段的视频特征与音乐的音频特征之间的相似度;预先训练的影音节奏匹配模型可以为多模态的模型,同时支持视频和音频两种不同类型的输入数据;通过预先训练的影音节奏匹配模型可以将视频特征和音频特征映射到统一空间中,从而提升视觉和音频的理解能力;在本申请的方案中,基于预先训练的影音节奏匹配模型能够智能化地识别M个视频片段的视频特征与音乐的音频特征之间的相似度。In the embodiments of the present application, the similarity between the video features of the M video clips and the audio features of the music can be identified through a pre-trained audio-visual rhythm matching model; the pre-trained audio-visual rhythm matching model can be a multi-modal model that supports two different types of input data, video and audio, at the same time; through the pre-trained audio-visual rhythm matching model, video features and audio features can be mapped into a unified space, thereby improving the ability to understand vision and audio; in the solution of the present application, the pre-trained audio-visual rhythm matching model can intelligently identify the similarity between the video features of the M video clips and the audio features of the music.
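As an illustration of how such a model could drive the ordering, the sketch below greedily assigns each beat-aligned music segment the most similar unused clip; the encoders, the segmentation of the music and the greedy assignment are assumptions made for the example, not the application's actual procedure.

```python
import torch
import torch.nn.functional as F

def order_clips_to_music(audio_encoder, video_encoder, music_segments, clips):
    """Return clip indices in playback order so that clip content follows the
    music's rhythm. Assumes len(clips) >= len(music_segments).

    music_segments: list of audio tensors, one per rhythmic section
    clips:          list of video tensors (the frames of each candidate clip)
    """
    with torch.no_grad():
        a = F.normalize(torch.stack([audio_encoder(s) for s in music_segments]), dim=-1)
        v = F.normalize(torch.stack([video_encoder(c) for c in clips]), dim=-1)
    sim = a @ v.t()                                  # (segments, clips)
    order, used = [], set()
    for seg in range(sim.size(0)):                   # walk music start to end
        ranked = torch.argsort(sim[seg], descending=True).tolist()
        pick = next(i for i in ranked if i not in used)
        used.add(pick)
        order.append(pick)
    return order
```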
结合第一方面,在第一方面的某些实现方式中,基于N个视频的信息,得到N个视频的视频主题,包括:Combined with the first aspect, in some implementations of the first aspect, based on the information of the N videos, video themes of the N videos are obtained, including:
将N个视频的视频内容转换为N个文本描述信息,N个文本描述信息与N个视频一一对应,N个文本描述信息中的一个文本描述信息用于描述N个视频中一个视频的图像内容信息;Convert the video content of the N videos into N pieces of text description information, where the N pieces of text description information correspond one-to-one to the N videos, and each piece of text description information is used to describe the image content information of one of the N videos;
基于N个文本描述信息,得到N个视频的主题信息,文本描述信息用于将N个视频中的视频内容转换为文本信息。Based on the N pieces of text description information, the topic information of the N videos is obtained, and the text description information is used to convert the video content in the N videos into text information.
在本申请的实施例中,在识别N个视频的视频主题时,通过N个视频的文本描述信息得到N个视频对应的视频主题信息;即基于N个视频的文本描述信息可以得到N个视频的整体视频主题信息;与基于N个视频的图像语义得到视频主题信息相比,文本信息比图像信息具有更抽象的语义信息,多个文本信息之间具有语言关联性,有助于推测多个文本背后隐含的主题信息,从而能够提高N个视频对应的整体视频主题的准确性;例如,N个视频中包括用户收拾行李的视频、用户出门乘坐汽车前往机场的视频、用户乘坐飞机的视频,以及用户在海边散步的视频;基于图像语义可能只能得到一些图像标签,包括衣物、行李箱、用户、海边等,基于这些图像标签无法抽象出N个视频的视频主题为旅行;但是,基于N个视频的文本描述信息识别视频主题时,可以基于N个文本描述信息之间的语言逻辑关联性,准确地得到N个视频的视频主题信息;比如,基于N个视频包括的文本描述信息“一个用户在收拾行李”、“一个用户在乘坐飞机”、“一个用户在海边散步”,可以抽象出N个视频的视频主题信息为旅行;因此,通过N个视频的文本描述信息得到N个视频的视频主题信息,能够提高主题信息的准确性。In the embodiments of the present application, when identifying the video theme of the N videos, the video theme information corresponding to the N videos is obtained through the text description information of the N videos; that is, the overall video theme information of the N videos can be obtained based on their text description information. Compared with obtaining video theme information based on the image semantics of the N videos, text information carries more abstract semantic information than image information, and multiple pieces of text information have linguistic correlations between them, which helps infer the theme information implied behind the multiple texts, thereby improving the accuracy of the overall video theme corresponding to the N videos. For example, suppose the N videos include a video of a user packing luggage, a video of the user taking a car to the airport, a video of the user taking a plane, and a video of the user walking on the beach; based on image semantics alone, only some image tags may be obtained, such as clothing, suitcase, user and beach, and from these tags the video theme "travel" cannot be abstracted. However, when identifying the video theme based on the text description information of the N videos, the video theme information can be obtained accurately based on the linguistic and logical correlations among the N pieces of text description information; for example, from the text descriptions "a user is packing luggage", "a user is taking a plane" and "a user is walking on the beach", the video theme "travel" can be abstracted. Therefore, obtaining the video theme information of the N videos through their text description information can improve the accuracy of the theme information.
结合第一方面,在第一方面的某些实现方式中,基于N个文本描述信息,得到N个视频的主题信息,包括:Combined with the first aspect, in some implementations of the first aspect, based on the N pieces of text description information, the topic information of the N videos is obtained, including:
将N个文本描述信息输入至预先训练的主题分类模型,得到N个视频的主题信息,预先训练的主题分类模型为用于文本分类的深度神经网络。Input N pieces of text description information into a pre-trained topic classification model to obtain topic information of N videos. The pre-trained topic classification model is a deep neural network used for text classification.
在本申请的实施例中,基于预先训练的主题分类模型可以得到N个视频的文本描述信息对应的视频主题信息;即通过预先训练的主题分类模型识别N个视频的文本描述信息对应的视频主题信息;与基于N个视频的图像语义得到视频主题信息相比,文本信息比图像信息具有更抽象的语义信息,多个文本信息之间具有语言关联性,有助于推测多个文本背后隐含的主题信息,从而能够提高N个视频对应的整体视频主题的准确性;此外,预先训练的主题分类模型能够更加智能化地识别N个文本描述信息对应的视频主题信息。In the embodiments of the present application, the video theme information corresponding to the text description information of the N videos can be obtained based on a pre-trained topic classification model, that is, the video theme information corresponding to the text description information of the N videos is identified through the pre-trained topic classification model. Compared with obtaining video theme information based on the image semantics of the N videos, text information carries more abstract semantic information than image information, and multiple pieces of text information have linguistic correlations between them, which helps infer the theme information implied behind the multiple texts, thereby improving the accuracy of the overall video theme corresponding to the N videos; in addition, the pre-trained topic classification model can identify the video theme information corresponding to the N pieces of text description information more intelligently.
结合第一方面,在第一方面的某些实现方式中,在预先训练的主题分类模型输出至少两个视频主题时,至少两个视频主题与N个文本描述信息相对应,该方法还包括:Combined with the first aspect, in some implementations of the first aspect, when the pre-trained topic classification model outputs at least two video themes, the at least two video themes corresponding to the N pieces of text description information, the method further includes:
显示第二界面,第二界面中包括提示框,提示框中包括至少两个视频主题的信息;Display a second interface, the second interface includes a prompt box, and the prompt box includes information on at least two video topics;
将N个文本描述信息输入至预先训练的主题分类模型,得到N个视频的主题信息,包括:Input N text description information into the pre-trained topic classification model to obtain the topic information of N videos, including:
检测到对至少两个视频主题的第二操作;A second action on at least two video subjects is detected;
响应于第二操作,得到N个视频的主题信息。In response to the second operation, the theme information of N videos is obtained.
在本申请的实施例中,在电子设备输出至少两个视频主题时,电子设备可以显示提示框;基于检测到用户对提示框中候选视频主题的操作,能够确定N个视频的视频主题信息;在一定程度上能够避免在N个视频的视频内容不完全符合预设视频主题时,电子设备无法识别N个视频的视频主题的问题。In the embodiments of the present application, when the electronic device outputs at least two video themes, the electronic device can display a prompt box; based on a detected user operation on a candidate video theme in the prompt box, the video theme information of the N videos can be determined. To a certain extent, this avoids the situation in which the electronic device cannot identify the video theme of the N videos when their video content does not fully conform to a preset video theme.
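A toy sketch of these two steps (classifying the pooled descriptions, then falling back to a user prompt when the top themes are close) is given below; the network dimensions, theme list and ambiguity margin are invented for the example.

```python
import torch
import torch.nn as nn

class ThemeClassifier(nn.Module):
    """Toy stand-in for the pre-trained topic classification model: pools the
    embeddings of the N per-video text descriptions and maps them to a
    probability over candidate themes."""

    def __init__(self, embed_dim=512, num_themes=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(),
            nn.Linear(256, num_themes),
        )

    def forward(self, desc_embeddings):            # (N, embed_dim)
        pooled = desc_embeddings.mean(dim=0)        # fuse the N descriptions
        return self.mlp(pooled).softmax(dim=-1)     # probability per theme

themes = ["travel", "sports", "party", "food",
          "pets", "family", "nature", "city"]       # invented theme list
probs = ThemeClassifier()(torch.randn(4, 512))      # 4 fake description vectors
top2 = probs.topk(2)
if top2.values[0] - top2.values[1] < 0.1:           # ambiguous: ask the user
    print("candidates:", [themes[i] for i in top2.indices.tolist()])
else:
    print("theme:", themes[top2.indices[0].item()])
```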
结合第一方面,在第一方面的某些实现方式中,基于视频主题,得到与视频主题相匹配的音乐,包括:Combined with the first aspect, in some implementations of the first aspect, based on the video theme, music matching the video theme is obtained, including:
基于M个视频片段的时长与视频主题,得到与视频主题相匹配的音乐,音乐的时长大于或者等于M个视频片段的时长。Based on the duration and video theme of the M video clips, music that matches the video theme is obtained, and the duration of the music is greater than or equal to the duration of the M video clips.
在本申请的实施例中,基于M个视频片段的时长可以确定背景音乐的总时长,进行音乐匹配时通常选取的背景音乐的时长需要大于或者等于M个视频片段的总时长;基于视频主题信息,可以确定音乐的音乐风格;在本申请的方案中,能够基于M个视频片段的时长与视频主题更加准确地筛选出匹配M个视频片段的音乐作为背景音乐,提高编辑后视频的视频质量,即提高第一视频的视频质量。In the embodiments of the present application, the total duration of the background music can be determined based on the duration of the M video clips: the background music selected during music matching usually needs to be longer than or equal to the total duration of the M video clips; the music style can be determined based on the video theme information. In the solution of the present application, music matching the M video clips can be selected more accurately as background music based on the duration of the M video clips and the video theme, improving the video quality of the edited video, that is, the video quality of the first video.
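For example, the duration and style constraints could be applied as a simple filter over a music library; the library record format and the style score below are assumptions made for the sketch.

```python
def pick_music(library, theme, clips_duration_s):
    """Return the best matching song or None if nothing is long enough.

    library: iterable of dicts such as
        {"title": "...", "style": "travel", "duration": 95.0, "style_score": 0.8}
    where `style_score` is an assumed per-song affinity to its style tag.
    """
    candidates = [
        song for song in library
        if song["style"] == theme and song["duration"] >= clips_duration_s
    ]
    return max(candidates, key=lambda s: s["style_score"], default=None)
```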
结合第一方面,在第一方面的某些实现方式中,预先训练的相似度匹配模型为Transformer模型。Combined with the first aspect, in some implementations of the first aspect, the pre-trained similarity matching model is a Transformer model.
结合第一方面,在第一方面的某些实现方式中,预先训练的相似度匹配模型是通过以下训练方式得到的:Combined with the first aspect, in some implementations of the first aspect, the pre-trained similarity matching model is obtained through the following training method:
基于第一训练数据集采用对比学习的训练方法对待训练的相似度匹配模型进行训练,得到预先训练的相似度匹配模型;其中,第一训练数据集中包括正例数据对与负例数据对,正例数据对包括第一样本文本描述信息与第一样本视频主题信息,第一样本文本描述信息与第一样本视频主题信息相匹配;负例数据对包括第一样本文本描述信息与第二样本视频主题信息,第一样本文本描述信息与第二样本视频主题信息不匹配。The similarity matching model to be trained is trained with a contrastive learning method based on a first training data set to obtain the pre-trained similarity matching model, where the first training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample text description information and first sample video theme information, the first sample text description information matching the first sample video theme information; a negative example data pair includes the first sample text description information and second sample video theme information, the first sample text description information not matching the second sample video theme information.
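A standard way to realize this kind of training is a symmetric InfoNCE-style contrastive loss, sketched below; this is a common formulation offered for illustration, not necessarily the exact loss used here. In-batch pairings off the diagonal play the role of the negative example data pairs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_feat, theme_feat, temperature=0.07):
    """Symmetric InfoNCE-style objective over a batch of matched pairs:
    row i of `text_feat` matches row i of `theme_feat` (positive pairs);
    every off-diagonal pairing in the batch acts as a negative pair."""
    text_feat = F.normalize(text_feat, dim=-1)
    theme_feat = F.normalize(theme_feat, dim=-1)
    logits = text_feat @ theme_feat.t() / temperature   # (B, B) similarities
    targets = torch.arange(logits.size(0))              # diagonal = positives
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```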
结合第一方面,在第一方面的某些实现方式中,预先训练的影音节奏匹配模型为Transformer模型。Combined with the first aspect, in some implementations of the first aspect, the pre-trained audio-visual rhythm matching model is a Transformer model.
结合第一方面,在第一方面的某些实现方式中,预先训练的影音节奏匹配模型是通过以下训练方式得到的:Combined with the first aspect, in some implementations of the first aspect, the pre-trained audio-visual rhythm matching model is obtained through the following training method:
基于第二训练数据集采用对比学习的训练方法对待训练的影音节奏匹配模型进行训练,得到预先训练的影音节奏匹配模型;其中,第二训练数据集中包括正例数据对与负例数据对,正例数据对包括第一样本音乐与第一样本视频,第一样本音乐的节奏与第一样本视频的内容相匹配;负例数据对包括第一样本音乐与第二样本视频,第一样本音乐的节奏与第二样本视频的内容不匹配。The audio-visual rhythm matching model to be trained is trained with a contrastive learning method based on a second training data set to obtain the pre-trained audio-visual rhythm matching model, where the second training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample music and a first sample video, the rhythm of the first sample music matching the content of the first sample video; a negative example data pair includes the first sample music and a second sample video, the rhythm of the first sample music not matching the content of the second sample video.
第二方面,提供了一种电子设备,电子设备包括一个或多个处理器与存储器;存储器与一个或多个处理器耦合,存储器用于存储计算机程序代码,计算机程序代码包括计算机指令,一个或多个处理器调用计算机指令以使得电子设备执行:According to a second aspect, an electronic device is provided. The electronic device includes one or more processors and a memory; the memory is coupled to the one or more processors and is used to store computer program code, where the computer program code includes computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to perform the following:
显示第一界面,第一界面中包括视频图标,视频图标指示的视频为电子设备中存储的视频;Display a first interface, which includes a video icon, and the video indicated by the video icon is a video stored in the electronic device;
检测到对视频图标中N个视频图标的第一操作;The first operation on N video icons among the video icons is detected;
响应于第一操作,获取N个视频的信息,N为大于1的整数;In response to the first operation, obtain information of N videos, where N is an integer greater than 1;
基于N个视频的信息,得到N个视频的视频主题;Based on the information of N videos, the video topics of N videos are obtained;
基于N个视频中的图像与视频主题的相似度,选取N个视频中的M个视频片段;Based on the similarity between the images in the N videos and the video topics, select M video clips from the N videos;
基于视频主题,得到与视频主题相匹配的音乐;Based on the video theme, get music that matches the video theme;
基于M个视频片段与音乐,得到第一视频;Based on M video clips and music, the first video is obtained;
显示第一视频。Show the first video.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
将N个视频与视频主题输入至预先训练的相似度匹配模型,得到N个视频中的图像与视频主题的相似度置信值,其中,预先训练的相似度匹配模型中包括图像编码器、文本编码器与第一相似度度量模块,图像编码器用于对N个视频进行提取图像特征处理,文本编码器用于对视频主题进行提取文本特征处理,第一相似度度量模块用于度量N个视频中的图像特征与视频主题的文本特征之间的相似度,相似度置信值用于表示N个视频中的图像与视频主题相似的概率;Input the N videos and the video theme into a pre-trained similarity matching model to obtain similarity confidence values between the images in the N videos and the video theme, where the pre-trained similarity matching model includes an image encoder, a text encoder and a first similarity measurement module; the image encoder is used to extract image features from the N videos, the text encoder is used to extract text features from the video theme, and the first similarity measurement module is used to measure the similarity between the image features of the N videos and the text features of the video theme; the similarity confidence value represents the probability that an image in the N videos is similar to the video theme;
基于N个视频中的图像与视频主题的相似度置信值,选取N个视频中的M个视频片段。Based on the similarity confidence values between the images in the N videos and the video topics, M video clips from the N videos are selected.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
对M个视频片段进行排序,得到排序后的M个视频片段;Sort the M video clips to obtain the sorted M video clips;
将排序后的M个视频片段与音乐合成为第一视频。The sorted M video clips and music are synthesized into the first video.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
基于音乐的节奏对M个视频片段排序,得到排序后的M个视频片段。Sort the M video clips based on the rhythm of the music to obtain the sorted M video clips.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
基于M个视频片段中的视频内容对M个视频片段进行排序,得到排序后的M个视频片段。The M video clips are sorted based on the video contents in the M video clips to obtain the sorted M video clips.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
将音乐与M个视频片段输入至预先训练的影音节奏匹配模型,得到排序后的M个视频片段,预先训练的影音节奏匹配模型中包括音频编码器、视频编码器与第一相似度度量模块,音频编码器用于对音乐进行特征提取得到音频特征,视频编码器用于对M个视频片段进行特征提取得到视频特征,第一相似度度量模块用于度量音频特征与M个视频片段的视频特征的相似性。Input the music and the M video clips into a pre-trained audio-visual rhythm matching model to obtain the sorted M video clips; the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder and a first similarity measurement module; the audio encoder is used to extract features from the music to obtain audio features, the video encoder is used to extract features from the M video clips to obtain video features, and the first similarity measurement module is used to measure the similarity between the audio features and the video features of the M video clips.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
将N个视频的视频内容转换为N个文本描述信息,N个文本描述信息与N个视频一一对应,N个文本描述信息中的一个文本描述信息用于描述N个视频中一个视频的图像内容信息;Convert the video content of the N videos into N pieces of text description information, where the N pieces of text description information correspond one-to-one to the N videos, and each piece of text description information is used to describe the image content information of one of the N videos;
基于N个文本描述信息,得到N个视频的主题信息,文本描述信息用于将N个视频中的视频内容转换为文本信息。Based on the N pieces of text description information, the topic information of the N videos is obtained, and the text description information is used to convert the video content in the N videos into text information.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
将N个文本描述信息输入至预先训练的主题分类模型,得到N个视频的主题信息,预先训练的主题分类模型为用于文本分类的深度神经网络。Input N pieces of text description information into a pre-trained topic classification model to obtain topic information of N videos. The pre-trained topic classification model is a deep neural network used for text classification.
结合第二方面,在第二方面的某些实现方式中,在预先训练的主题分类模型输出至少两个视频主题时,至少两个视频主题与N个文本描述信息相对应,一个或多个处理器调用计算机指令以使得电子设备执行:Combined with the second aspect, in some implementations of the second aspect, when the pre-trained topic classification model outputs at least two video themes, the at least two video themes corresponding to the N pieces of text description information, the one or more processors invoke the computer instructions to cause the electronic device to perform:
显示第二界面,第二界面中包括提示框,提示框中包括至少两个视频主题的信息;Display a second interface, the second interface includes a prompt box, and the prompt box includes information on at least two video topics;
将N个文本描述信息输入至预先训练的主题分类模型,得到N个视频的主题信息,包括:Input N text description information into the pre-trained topic classification model to obtain the topic information of N videos, including:
检测到对至少两个视频主题的第二操作;A second action on at least two video subjects is detected;
响应于第二操作,得到N个视频的主题信息。In response to the second operation, the theme information of N videos is obtained.
结合第二方面,在第二方面的某些实现方式中,一个或多个处理器调用计算机指令以使得电子设备执行:In conjunction with the second aspect, in some implementations of the second aspect, one or more processors invoke computer instructions to cause the electronic device to execute:
基于M个视频片段的时长与视频主题,得到与视频主题相匹配的音乐,音乐的时长大于或者等于M个视频片段的时长。Based on the duration and video theme of the M video clips, music matching the video theme is obtained, and the duration of the music is greater than or equal to the duration of the M video clips.
结合第二方面,在第二方面的某些实现方式中,预先训练的相似度匹配模型为Transformer模型。Combined with the second aspect, in some implementations of the second aspect, the pre-trained similarity matching model is a Transformer model.
结合第二方面,在第二方面的某些实现方式中,预先训练的相似度匹配模型是通过以下训练方式得到的:Combined with the second aspect, in some implementations of the second aspect, the pre-trained similarity matching model is obtained through the following training method:
基于第一训练数据集采用对比学习的训练方法对待训练的相似度匹配模型进行训练,得到预先训练的相似度匹配模型;其中,第一训练数据集中包括正例数据对与负例数据对,正例数据对包括第一样本文本描述信息与第一样本视频主题信息,第一样本文本描述信息与第一样本视频主题信息相匹配;负例数据对包括第一样本文本描述信息与第二样本视频主题信息,第一样本文本描述信息与第二样本视频主题信息不匹配。The similarity matching model to be trained is trained with a contrastive learning method based on a first training data set to obtain the pre-trained similarity matching model, where the first training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample text description information and first sample video theme information, the first sample text description information matching the first sample video theme information; a negative example data pair includes the first sample text description information and second sample video theme information, the first sample text description information not matching the second sample video theme information.
结合第二方面,在第二方面的某些实现方式中,预先训练的影音节奏匹配模型为Transformer模型。Combined with the second aspect, in some implementations of the second aspect, the pre-trained audio-visual rhythm matching model is a Transformer model.
结合第二方面,在第二方面的某些实现方式中,预先训练的影音节奏匹配模型是通过以下训练方式得到的:Combined with the second aspect, in some implementations of the second aspect, the pre-trained audio-visual rhythm matching model is obtained through the following training method:
基于第二训练数据集采用对比学习的训练方法对待训练的影音节奏匹配模型进行训练,得到预先训练的影音节奏匹配模型;其中,第二训练数据集中包括正例数据对与负例数据对,正例数据对包括第一样本音乐与第一样本视频,第一样本音乐的节奏与第一样本视频的内容相匹配;负例数据对包括第一样本音乐与第二样本视频,第一样本音乐的节奏与第二样本视频的内容不匹配。The audio-visual rhythm matching model to be trained is trained with a contrastive learning method based on a second training data set to obtain the pre-trained audio-visual rhythm matching model, where the second training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample music and a first sample video, the rhythm of the first sample music matching the content of the first sample video; a negative example data pair includes the first sample music and a second sample video, the rhythm of the first sample music not matching the content of the second sample video.
第三方面,提供了一种电子设备,包括用于执行第一方面或者第一方面中的任意一种实现方式中的视频编辑方法的模块/单元。In a third aspect, an electronic device is provided, including a module/unit for executing the video editing method in the first aspect or any implementation of the first aspect.
第四方面,提供一种电子设备,所述电子设备包括一个或多个处理器和存储器;所述存储器与所述一个或多个处理器耦合,所述存储器用于存储计算机程序代码,所述计算机程序代码包括计算机指令,所述一个或多个处理器调用所述计算机指令以使得所述电子设备执行第一方面或者第一方面中的任意一种实现方式中的视频编辑方法。According to a fourth aspect, an electronic device is provided. The electronic device includes one or more processors and a memory; the memory is coupled to the one or more processors and is used to store computer program code, where the computer program code includes computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to perform the video editing method in the first aspect or any implementation of the first aspect.
第五方面,提供了一种芯片系统,所述芯片系统应用于电子设备,所述芯片系统包括一个或多个处理器,所述处理器用于调用计算机指令以使得所述电子设备执行第一方面或第一方面中的任一种视频编辑方法。According to a fifth aspect, a chip system is provided. The chip system is applied to an electronic device and includes one or more processors, where the processors are configured to invoke computer instructions to cause the electronic device to perform the video editing method in the first aspect or any implementation of the first aspect.
第六方面,提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序代码,当所述计算机程序代码被电子设备运行时,使得该电子设备执行第一方面或者第一方面中的任意一种实现方式中的视频编辑方法。According to a sixth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer program code, and when the computer program code is run by an electronic device, the electronic device is caused to perform the video editing method in the first aspect or any implementation of the first aspect.
第七方面,提供了一种计算机程序产品,所述计算机程序产品包括:计算机程序代码,当所述计算机程序代码被电子设备运行时,使得该电子设备执行第一方面或者第一方面中的任意一种实现方式中的视频编辑方法。According to a seventh aspect, a computer program product is provided. The computer program product includes computer program code, and when the computer program code is run by an electronic device, the electronic device is caused to perform the video editing method in the first aspect or any implementation of the first aspect.
在本申请的实施例中,可以基于N个视频中的图像与视频主题的相似度,从N个视频中选取M个视频片段;基于M个视频片段与音乐得到处理后视频,即第一视频;在本申请的方案中,基于N个视频中包括的图像与视频主题之间的相似度,能够确定N个视频中与视频主题相关度较高的M个视频片段;基于本申请的方案,能够有效删除N个视频中与整体视频主题信息无关的视频片段,确保筛选出的视频片段与视频主题相关,提升编辑后的第一视频的视频质量。In the embodiments of the present application, M video clips can be selected from the N videos based on the similarity between the images in the N videos and the video theme, and the processed video, that is, the first video, is obtained based on the M video clips and the music. In the solution of the present application, based on the similarity between the images included in the N videos and the video theme, the M video clips in the N videos that are highly relevant to the video theme can be determined. The solution can therefore effectively remove video clips that are irrelevant to the overall video theme information, ensure that the selected clips are related to the video theme, and improve the video quality of the edited first video.
此外,本申请的实施例中,能够解决编辑后的视频中存在的视频片段与音乐不匹配的问题,即能够解决编辑后的视频内容与背景音乐的节奏卡点不完全匹配的问题;在本申请的实施例中,能够使得M个视频片段中的图像内容与音乐中的音乐节奏更加吻合;例如,视频图像内容为风景,则可以对应于音乐的前奏或者舒缓的音乐部分;视频图像内容为用户的运动场景,则可以对应于背景音乐中的高潮部分;通过对M个视频片段进行排序,使得M个视频片段与音乐的节奏卡点更加匹配,提高编辑后视频的视频质量。In addition, the embodiments of the present application can solve the problem of video clips in the edited video not matching the music, that is, the problem of the edited video content not fully matching the rhythmic beat points of the background music. In the embodiments of the present application, the image content of the M video clips can be made to fit the rhythm of the music more closely; for example, if the video image content is scenery, it can correspond to the prelude or a soothing part of the music; if the video image content is a sports scene of the user, it can correspond to the climax of the background music. By sorting the M video clips, the M video clips better match the rhythmic beat points of the music, improving the video quality of the edited video.
附图说明Description of drawings
图1是一种适用于本申请的电子设备的硬件系统的示意图;Figure 1 is a schematic diagram of a hardware system suitable for electronic equipment of the present application;
图2是一种适用于本申请的变换器Transformer模型的结构的示意图;Figure 2 is a schematic diagram of the structure of a transformer model suitable for this application;
图3是一种Transformer模型中编码器与解码器的结构的示意图;Figure 3 is a schematic diagram of the structure of the encoder and decoder in the Transformer model;
图4是一种适用于本申请的电子设备的软件系统的示意图;Figure 4 is a schematic diagram of a software system suitable for the electronic device of the present application;
图5是一种适用于本申请实施例的图形用户界面的示意图;Figure 5 is a schematic diagram of a graphical user interface suitable for embodiments of the present application;
图6是一种适用于本申请实施例的图形用户界面的示意图;Figure 6 is a schematic diagram of a graphical user interface suitable for embodiments of the present application;
图7是一种适用于本申请实施例的图形用户界面的示意图;Figure 7 is a schematic diagram of a graphical user interface suitable for embodiments of the present application;
图8是一种适用于本申请实施例的图形用户界面的示意图;Figure 8 is a schematic diagram of a graphical user interface suitable for embodiments of the present application;
图9是一种适用于本申请实施例的图形用户界面的示意图;Figure 9 is a schematic diagram of a graphical user interface suitable for embodiments of the present application;
图10是一种适用于本申请实施例的图形用户界面的示意图;Figure 10 is a schematic diagram of a graphical user interface suitable for embodiments of the present application;
图11是一种适用于本申请实施例的图形用户界面的示意图;Figure 11 is a schematic diagram of a graphical user interface suitable for embodiments of the present application;
图12是本申请实施例提供的一种视频编辑方法的示意性流程图;Figure 12 is a schematic flow chart of a video editing method provided by an embodiment of the present application;
图13是本申请实施例提供的另一种视频编辑方法的示意性流程图;Figure 13 is a schematic flow chart of another video editing method provided by an embodiment of the present application;
图14是本申请实施例提供的一种确定N个视频中与视频主题信息相关的M个视频片段的方法的示意性流程图;Figure 14 is a schematic flow chart of a method for determining M video clips related to video theme information in N videos provided by an embodiment of the present application;
图15是本申请实施提供的一种相似度评估模型的处理流程的示意图; Figure 15 is a schematic diagram of the processing flow of a similarity evaluation model implemented in this application;
图16是本申请实施例提供的一种对M个视频片段与背景音乐进行节奏匹配处理的方法的流程图;Figure 16 is a flow chart of a method for rhythm matching processing of M video clips and background music provided by an embodiment of the present application;
图17是本申请实施例提供的一种影音节奏匹配模型的处理流程的示意图;Figure 17 is a schematic diagram of the processing flow of an audio-visual rhythm matching model provided by an embodiment of the present application;
图18是本申请实施例提供的另一种视频编辑方法的示意性流程图;Figure 18 is a schematic flow chart of another video editing method provided by an embodiment of the present application;
图19是本申请实施例提供的另一种视频编辑方法的示意性流程图;Figure 19 is a schematic flow chart of another video editing method provided by an embodiment of the present application;
图20是本申请实施例提供的一种电子设备的结构示意图;Figure 20 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
图21是本申请实施例提供的一种电子设备的结构示意图。Figure 21 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
具体实施方式Detailed description of embodiments
在本申请的实施例中,以下术语“第一”、“第二”等仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。In the embodiments of the present application, the following terms “first”, “second”, etc. are only used for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Therefore, features defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of this embodiment, unless otherwise specified, "plurality" means two or more.
为了便于对本申请实施例的理解,首先对本申请实施例中涉及的相关概念进行简要说明。In order to facilitate understanding of the embodiments of the present application, the relevant concepts involved in the embodiments of the present application are briefly described first.
1、图像特征1. Image features
图像特征是指对图像的特点或内容进行表征的一系列属性的集合;例如,图像特征可以包括图像的颜色特征、纹理特征、形状特征以及空间关系特征等,也可以是通过某种映射得到的隐式的属性表达。Image features refer to a set of attributes that characterize the characteristics or content of an image; for example, image features may include the color features, texture features, shape features and spatial relationship features of an image, or they may be implicit attribute expressions obtained through some kind of mapping.
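As one concrete example of an explicit (hand-crafted) image feature, a joint RGB color histogram can be computed as follows; the bin count is an arbitrary choice.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Joint RGB color histogram of an image as a normalized feature vector.
    image: uint8 array of shape (H, W, 3)."""
    hist, _ = np.histogramdd(image.reshape(-1, 3),
                             bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return (hist / hist.sum()).ravel()   # 512-dimensional for bins=8
```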
2、视频特征2. Video characteristics
视频特征是指由视频中的图像序列通过某种映射获得的能够表征视频特点的属性集合。Video features refer to a set of attributes that can characterize the characteristics of the video obtained from the image sequence in the video through some mapping.
3、文本特征3. Text features
文本特征是指词语或句子经过向量化以及后续的某种映射获得的能够表征其特定语义的属性集合。Text features refer to the set of attributes that characterize a word or sentence's specific semantics obtained through vectorization and subsequent mapping.
4、图像文本多模态(contrastive language–image pre-training,CLIP)模型4. Image and text multimodal (contrastive language–image pre-training, CLIP) model
CLIP模型是一种基于对比的图片-文本学习的跨模态预训练模型。The CLIP model is a cross-modal pre-training model based on contrastive picture-text learning.
5、神经网络5. Neural network
神经网络是指将多个单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入;每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。A neural network is a network formed by connecting multiple single neural units together, that is, the output of one neural unit can be the input of another neural unit; the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field, where the local receptive field can be a region composed of several neural units.
6、对比学习6. Comparative learning
对比学习属于自监督学习方式中的一种;对比学习是指在不依赖于标注数据的情况下,从无标注图像中学习知识的训练方式。Contrastive learning is one of the self-supervised learning methods; contrastive learning refers to a training method that learns knowledge from unlabeled images without relying on labeled data.
应理解,对比学习的目标是学习一个编码器,此编码器对同类数据进行相似的编码,并使不同类的数据的编码结果尽可能的不同。It should be understood that the goal of contrastive learning is to learn an encoder that encodes similar data similarly and makes the encoding results of different types of data as different as possible.
7、变换器(Transformer)模型7. Transformer model
Transformer模型可以由编码器与解码器两部分组成;如图2所示,编码器与解码器中可以包括多个子模块;例如,一个编码器中可以包括6个编码模块;一个解码器中可以包括6个解码模块。The Transformer model can be composed of an encoder and a decoder. As shown in Figure 2, the encoder and decoder can include multiple sub-modules; for example, an encoder can include 6 encoding modules; a decoder can include 6 decoding modules.
示例性地,如图3所示,一个编码模块中可以包括:嵌入层、位置编码、多头注意力机制模块、残差连接与线性归一化与前向网络模块;其中,嵌入层用于将输入数据中的每一个词用一个向量进行表示;位置编码用于构造一个与输入数据的向量维度相同的矩阵,使得输入至多头注意力机制模块的数据包含位置信息;多头注意力机制模块用于通过利用同一查询的多个不同版本并行实现多个注意力模块的工作;其思想是使用不同的权重矩阵对查询进行线性变换得到多个查询,每个新形成的查询本质上都需要不同类型的相关信息,从而允许注意力模型在上下文向量计算中引入更多信息;残差连接用于防止网络退化;线性归一化用于对每一层的激活值进行归一化;前向网络模块用于对得到的词表征做进一步的变换。For example, as shown in Figure 3, an encoding module may include an embedding layer, positional encoding, a multi-head attention mechanism module, residual connections with linear normalization, and a feed-forward network module. The embedding layer is used to represent each word in the input data with a vector; the positional encoding is used to construct a matrix with the same vector dimension as the input data, so that the data input to the multi-head attention mechanism module contains position information; the multi-head attention mechanism module implements the work of multiple attention modules in parallel by using multiple different versions of the same query; the idea is to linearly transform the query with different weight matrices to obtain multiple queries, each newly formed query essentially attending to a different type of relevant information, which allows the attention model to introduce more information into the context-vector computation; the residual connections are used to prevent network degradation; linear normalization is used to normalize the activation values of each layer; and the feed-forward network module is used to further transform the obtained word representations.
示例性地,如图3所示,一个解码模块中可以包括:嵌入层、位置编码、掩码多头注意力机制模块、残差连接与线性归一化、前向网络模块与多头注意力机制模块;其中,嵌入层用于将输入数据中的每一个词用一个向量进行表示;位置编码用于构造一个与输入数据的向量维度相同的矩阵,使得输入至注意力机制模块的数据包含位置信息;掩码多头注意力机制模块用于通过使用掩码,确保前面的词不会具备后面词的信息,从而保证Transformer模型预测的输出数据不会基于输入词的多少而发生改变;多头注意力机制模块用于通过利用同一查询的多个不同版本并行实现多个注意力模块的工作;残差连接用于防止网络退化;线性归一化用于对每一层的激活值进行归一化;前向网络模块用于对得到的词表征做进一步的变换。For example, as shown in Figure 3, a decoding module may include an embedding layer, positional encoding, a masked multi-head attention mechanism module, residual connections with linear normalization, a feed-forward network module and a multi-head attention mechanism module. The embedding layer is used to represent each word in the input data with a vector; the positional encoding is used to construct a matrix with the same vector dimension as the input data, so that the data input to the attention modules contains position information; the masked multi-head attention mechanism module uses a mask to ensure that earlier words cannot see the information of later words, so that the output predicted by the Transformer model does not change with the number of input words; the multi-head attention mechanism module implements the work of multiple attention modules in parallel by using multiple different versions of the same query; the residual connections are used to prevent network degradation; linear normalization is used to normalize the activation values of each layer; and the feed-forward network module is used to further transform the obtained word representations.
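Putting the encoder-side pieces together, a single encoder block along the lines described above could be sketched in PyTorch as follows (a post-norm layout with illustrative dimensions; not the exact configuration used in this application):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder sub-module: multi-head self-attention, then a position-wise
    feed-forward network, each wrapped in a residual connection followed by
    layer normalization. The d_model=512 / 8-head setup is just a common
    example configuration."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)          # self-attention: Q = K = V = x
        x = self.norm1(x + self.drop(attn_out))   # residual + normalization
        x = self.norm2(x + self.drop(self.ff(x)))
        return x
```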
8、深度神经网络(deep neural network,DNN)8. Deep neural network (DNN)
深度神经网络也可以称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的;也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。Deep neural network can also be called multi-layer neural network, which can be understood as a neural network with multiple hidden layers. DNN is divided according to the position of different layers. The neural network inside the DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected; that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
9、反向传播算法9. Backpropagation algorithm
神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的神经网络模型中参数的大小,使得神经网络模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的神经网络模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的神经网络模型的参数,例如权重矩阵。The neural network can use the error back propagation (BP) algorithm to modify the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, forward propagation of the input signal until the output will produce an error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss converges. The backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
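A single training step illustrating this loop (forward pass, error loss, backpropagated gradients, parameter update) looks as follows; the small network and data here are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 16)            # a mini-batch of 8 samples
labels = torch.randint(0, 4, (8,))

optimizer.zero_grad()
loss = loss_fn(model(inputs), labels)  # forward pass produces the error loss
loss.backward()                        # backpropagate: compute all gradients
optimizer.step()                       # adjust weights to reduce the loss
```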
10、转场效果10. Transition effect
转场效果又可以称为转场特效,转场效果是指两个场景之间,采用一定的技巧如划像、叠变、卷页等,实现场景或情节之间的平滑过渡,或达到丰富画面的效果。Transition effects, also called transition special effects, refer to using certain techniques between two scenes, such as wipes, dissolves and page curls, to achieve a smooth transition between scenes or plot points, or to enrich the picture.
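For instance, a dissolve transition can be implemented by linearly cross-blending the tail of one clip into the head of the next; the sketch below assumes raw frames as uint8 numpy arrays and an arbitrary blend length.

```python
import numpy as np

def crossfade(frames_a, frames_b, num_blend):
    """Dissolve transition sketch: linearly blend the last `num_blend` frames
    of clip A into the first `num_blend` frames of clip B.
    frames_a / frames_b: uint8 arrays of shape (T, H, W, 3),
    each with at least `num_blend` frames."""
    alphas = np.linspace(0.0, 1.0, num_blend)[:, None, None, None]
    tail = frames_a[-num_blend:].astype(float)
    head = frames_b[:num_blend].astype(float)
    blend = ((1 - alphas) * tail + alphas * head).astype(np.uint8)
    return np.concatenate([frames_a[:-num_blend], blend, frames_b[num_blend:]])
```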
下面将结合附图,对本申请实施例中提供的视频编辑方法和电子设备进行描述。The video editing method and electronic device provided in the embodiments of the present application will be described below with reference to the accompanying drawings.
图1示出了一种适用于本申请的电子设备的硬件系统。Figure 1 shows a hardware system suitable for the electronic device of the present application.
电子设备100可以是手机、智慧屏、平板电脑、可穿戴电子设备、车载电子设备、增强现实(augmented reality,AR)设备、虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)、投影仪等等,本申请实施例对电子设备100的具体类型不作任何限制。The electronic device 100 may be a mobile phone, a smart screen, a tablet, a wearable electronic device, a vehicle-mounted electronic device, an augmented reality (AR) device, a virtual reality (VR) device, a notebook computer, or a super mobile personal computer ( Ultra-mobile personal computer (UMPC), netbook, personal digital assistant (personal digital assistant, PDA), projector, etc. The embodiment of the present application does not place any restrictions on the specific type of the electronic device 100.
电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
需要说明的是,图1所示的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图1所示的部件更多或更少的部件,或者,电子设备100可以包括图1所示的部件中某些部件的组合,或者,电子设备100可以包括图1所示的部件中某些部件的子部件。图1所示的部件可以以硬件、软件、或软件和硬件的组合实现。It should be noted that the structure shown in FIG. 1 does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than those shown in FIG. 1, a combination of some of the components shown in FIG. 1, or sub-components of some of the components shown in FIG. 1. The components shown in FIG. 1 may be implemented in hardware, software, or a combination of software and hardware.
处理器110可以包括一个或多个处理单元。例如,处理器110可以包括以下处理单元中的至少一个:应用处理器(application processor,AP)、调制解调处理器、图形处理器(graphics processing unit,GPU)、图像信号处理器(image signal processor,ISP)、控制器、视频编解码器、数字信号处理器(digital signal processor,DSP)、基带处理器、神经网络处理器(neural-network processing unit,NPU)。其中,不同的处理单元可以是独立的器件,也可以是集成的器件。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。Processor 110 may include one or more processing units. For example, the processor 110 may include at least one of the following processing units: an application processor (application processor, AP), a modem processor, a graphics processing unit (GPU), an image signal processor (image signal processor) , ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, neural network processing unit (NPU). Among them, different processing units can be independent devices or integrated devices. The controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。The processor 110 may also be provided with a memory for storing instructions and data. In some embodiments, the memory in processor 110 is cache memory. This memory may hold instructions or data that have been recently used or recycled by processor 110 . If the processor 110 needs to use the instructions or data again, it can be called directly from the memory. Repeated access is avoided and the waiting time of the processor 110 is reduced, thus improving the efficiency of the system.
在一些实施例中,处理器110可以包括一个或多个接口。例如,处理器110可以包括以下接口中的至少一个:内部集成电路(inter-integrated circuit,I2C)接口、内部集成电路音频(inter-integrated circuit sound,I2S)接口、脉冲编码调制(pulse code modulation,PCM)接口、通用异步接收传输器(universal asynchronous receiver/transmitter,UART)接口、移动产业处理器接口(mobile industry processor interface,MIPI)、通用输入输出(general-purpose input/output,GPIO)接口、SIM接口、USB接口。In some embodiments, processor 110 may include one or more interfaces. For example, the processor 110 may include at least one of the following interfaces: an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, pulse code modulation, PCM) interface, universal asynchronous receiver/transmitter (UART) interface, mobile industry processor interface (MIPI), general-purpose input/output (GPIO) interface, SIM interface, USB interface.
示例性地,在本申请的实施例中,处理器110可以用于执行本申请实施例提供的视频编辑方法;例如,显示第一界面,第一界面中包括视频图标,视频图标指示的视频为电子设备中存储的视频;检测到对视频图标中N个视频图标的第一操作;响应于第一操作,获取N个视频的信息,N为大于1的整数;基于N个视频的信息,得到N个视频的视频主题;基于N个视频中的图像与视频主题的相似度,选取N个视频中的M个视频片段;基于视频主题,得到与视频主题相匹配的音乐;基于M个视频片段与音乐,得到第一视频;显示第一视频。Illustratively, in the embodiments of the present application, the processor 110 may be configured to execute the video editing method provided by the embodiments of the present application, for example: displaying a first interface, where the first interface includes video icons and the videos indicated by the video icons are videos stored in the electronic device; detecting a first operation on N of the video icons; in response to the first operation, obtaining information of the N videos, N being an integer greater than 1; obtaining the video theme of the N videos based on their information; selecting M video clips from the N videos based on the similarity between the images in the N videos and the video theme; obtaining music matching the video theme based on the video theme; obtaining the first video based on the M video clips and the music; and displaying the first video.
图1所示的各模块间的连接关系只是示意性说明,并不构成对电子设备100的各模块间的连接关系的限定。可选地,电子设备100的各模块也可以采用上述实施例中多种连接方式的组合。The connection relationship between the modules shown in FIG. 1 is only a schematic illustration and does not constitute a limitation on the connection relationship between the modules of the electronic device 100 . Optionally, each module of the electronic device 100 may also adopt a combination of various connection methods in the above embodiments.
电子设备100的无线通信功能可以通过天线1、天线2、移动通信模块150、无线通信模块160、调制解调处理器以及基带处理器等器件实现。The wireless communication function of the electronic device 100 can be implemented through antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, modem processor, baseband processor and other components.
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in electronic device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example: Antenna 1 can be reused as a diversity antenna for a wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
电子设备100可以通过GPU、显示屏194以及应用处理器实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。The electronic device 100 may implement the display function through the GPU, the display screen 194 and the application processor. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
显示屏194可以用于显示图像或视频。Display 194 may be used to display images or videos.
可选地,显示屏194可以用于显示图像或视频。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD)、有机发光二极管(organic light-emitting diode,OLED)、有源矩阵有机发光二极体(active-matrix organic light-emitting diode,AMOLED)、柔性发光二极管(flex light-emitting diode,FLED)、迷你发光二极管(mini light-emitting diode,Mini LED)、微型发光二极管(micro light-emitting diode,Micro LED)、微型OLED(Micro OLED)或量子点发光二极管(quantum dot light emitting diodes,QLED)。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。Optionally, display screen 194 may be used to display images or videos. Display 194 includes a display panel. The display panel can use liquid crystal display (LCD), organic light-emitting diode (OLED), active-matrix organic light-emitting diode (AMOLED), flexible Light-emitting diode (flex light-emitting diode, FLED), mini light-emitting diode (Mini LED), micro light-emitting diode (micro light-emitting diode, Micro LED), micro OLED (Micro OLED) or quantum dot light emitting Diodes (quantum dot light emitting diodes, QLED). In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
示例性地,在本申请的实施例中,显示屏194可以显示用户选择的视频或者照片;以及显示处理后的视频。For example, in the embodiment of the present application, the display screen 194 can display a video or photo selected by the user; and display the processed video.
示例性地,电子设备100可以通过ISP、摄像头193、视频编解码器、GPU、显示屏194以及应用处理器等实现拍摄功能。For example, the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
示例性地,ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过摄像头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将电信号传递给ISP处理,转化为肉眼可见的图像。ISP可以对图像的噪点、亮度和色彩进行算法优化,ISP还可以优化拍摄场景的曝光和色温等参数。在一些实施例中,ISP可以设置在摄像头193中。Illustratively, the ISP is used to process data fed back by the camera 193 . For example, when taking a photo, the shutter is opened, the light is transmitted to the camera sensor through the camera, the light signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye. ISP can algorithmically optimize the noise, brightness and color of the image. ISP can also optimize parameters such as exposure and color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
示例性地,摄像头193(也可以称为镜头)用于捕获静态图像或视频。可以通过应用程序指令触发开启,实现拍照功能,如拍摄获取任意场景的图像。摄像头可以包括成像镜头、滤光片、图像传感器等部件。物体发出或反射的光线进入成像镜头,通过滤光片,最终汇聚在图像传感器上。成像镜头主要是用于对拍照视角中的所有物体(也可以称为待拍摄场景、目标场景,也可以理解为用户期待拍摄的场景图像)发出或反射的光汇聚成像;滤光片主要是用于将光线中的多余光波(例如除可见光外的光波,如红外)滤去;图像传感器可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。图像传感器主要是用于对接收到的光信号进行光电转换,转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。Illustratively, the camera 193 (which may also be called a lens) is used to capture still images or videos. It can be triggered by an application instruction to realize the shooting function, for example capturing an image of any scene. The camera may include components such as an imaging lens, an optical filter and an image sensor. Light emitted or reflected by an object enters the imaging lens, passes through the optical filter, and finally converges on the image sensor. The imaging lens is mainly used to converge and image the light emitted or reflected by all objects in the shooting angle of view (which may also be called the scene to be shot or the target scene, and may be understood as the scene image the user expects to shoot); the optical filter is mainly used to filter out unwanted light waves (for example, light waves other than visible light, such as infrared); the image sensor may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The image sensor is mainly used to photoelectrically convert the received optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing, and the DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV.
示例性地,数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。For example, the digital signal processor is used to process digital signals. In addition to processing digital image signals, it can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
示例性地,视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1、MPEG2、MPEG3和MPEG4。Illustratively, video codecs are used to compress or decompress digital video. Electronic device 100 may support one or more video codecs. In this way, the electronic device 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3 and MPEG4.
Exemplarily, the gyro sensor 180B may be used to determine the motion posture of the electronic device 100. In some embodiments, the angular velocities of the electronic device 100 around three axes (namely, the x-axis, y-axis, and z-axis) may be determined through the gyro sensor 180B. The gyro sensor 180B may be used for image stabilization during shooting. For example, when the shutter is pressed, the gyro sensor 180B detects the angle at which the electronic device 100 shakes, calculates the distance that the lens module needs to compensate based on that angle, and lets the lens counteract the shake of the electronic device 100 through reverse movement, thereby achieving image stabilization. The gyro sensor 180B may also be used in scenarios such as navigation and somatosensory games.
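The compensation step described above can be illustrated with a minimal sketch, assuming small shake angles and a simple pinhole-camera model; the focal length value and function name are illustrative assumptions, not part of the embodiment.

```python
import math

def lens_compensation_mm(shake_angle_deg: float, focal_length_mm: float) -> float:
    """Distance the lens module moves to counteract a detected shake angle.

    Under a small-angle pinhole approximation, an angular shake of theta
    shifts the image by roughly f * tan(theta); moving the lens by the same
    amount in the opposite direction cancels the shake.
    """
    return -focal_length_mm * math.tan(math.radians(shake_angle_deg))

# Hypothetical usage: 0.1 degrees of shake with a 5 mm lens module.
print(lens_compensation_mm(0.1, 5.0))  # about -0.0087 mm, opposite the shake
```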
Exemplarily, the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally along the x-axis, y-axis, and z-axis). When the electronic device 100 is stationary, the magnitude and direction of gravity can be detected. The acceleration sensor 180E can also be used to identify the posture of the electronic device 100, serving as an input parameter for applications such as landscape/portrait switching and pedometers.

Exemplarily, the distance sensor 180F is used to measure distance. The electronic device 100 can measure distance by infrared or laser. In some embodiments, for example in a shooting scenario, the electronic device 100 may use the distance sensor 180F to measure distance so as to achieve fast focusing.

Exemplarily, the ambient light sensor 180L is used to sense the ambient light brightness. The electronic device 100 can adaptively adjust the brightness of the display screen 194 according to the perceived ambient light brightness. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking photos, and can further cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket, so as to prevent accidental touches.

Exemplarily, the fingerprint sensor 180H is used to collect fingerprints. The electronic device 100 can use the collected fingerprint characteristics to implement functions such as unlocking, accessing an application lock, taking photos, and answering incoming calls.

Exemplarily, the touch sensor 180K is also called a touch device. The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touchscreen. The touch sensor 180K is used to detect a touch operation acting on or near it, and may pass the detected touch operation to the application processor to determine the type of the touch event. Visual output related to the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a position different from that of the display screen 194.
The hardware system of the electronic device 100 has been described in detail above; the software system of the electronic device 100 is introduced below.

FIG. 4 is a schematic diagram of a software system of an electronic device provided by an embodiment of the present application.

As shown in FIG. 4, the system architecture may include an application layer 210, an application framework layer 220, a hardware abstraction layer 230, a driver layer 240, and a hardware layer 250.

The application layer 210 may include a gallery application.

Optionally, the application layer 210 may also include applications such as a camera application, calendar, phone, maps, navigation, WLAN, Bluetooth, music, video, and messaging.

The application framework layer 220 provides an application programming interface (API) and a programming framework for the applications in the application layer; the application framework layer may include some predefined functions.

For example, the application framework layer 220 may include a gallery access interface, which may be used to obtain data related to the gallery.

The hardware abstraction layer 230 is used to abstract the hardware.

For example, the hardware abstraction layer 230 may include a video editing algorithm; based on the video editing algorithm, the video editing methods of the embodiments of the present application may be executed.

The driver layer 240 is used to provide drivers for different hardware devices. For example, the driver layer may include a display screen driver.

The hardware layer 250 may include the display screen and other hardware devices.
At present, a user can automatically edit multiple videos through existing applications to produce a mixed-cut video; however, existing applications edit multiple videos with limited professionalism, which leads to problems in the processed video; for example, the edited video may contain image content that is irrelevant to the overall video theme of the multiple videos.
In view of this, embodiments of the present application provide a video editing method and an electronic device. In the embodiments of the present application, the image content information of N videos can be converted into text description information; the video theme information of the N videos is obtained based on the text description information of the N videos; M video clips are selected from the N videos based on the degree of correlation between the images in the N videos and the video theme information; and the processed video is obtained based on the M video clips and the background music. In the solution of the present application, the video theme information of the N videos is obtained through the text description information of the N videos; compared with obtaining the video theme information based on the image information of the N videos, text information carries richer information than image information, and there is linguistic relatedness among multiple pieces of text information, so obtaining the video theme information based on the text description information of the N videos can improve the accuracy of the video theme information. Furthermore, in the embodiments of the present application, M video clips that are highly relevant to the video theme can be determined from the N videos based on the correlation between the images in the N videos and the video theme information; based on the solution of the present application, video clips in the N videos that are irrelevant to the overall video theme information can be effectively removed, ensuring that the selected video clips are relevant to the video theme information and improving the video quality of the edited video.
Furthermore, the embodiments of the present application can solve the problem that video clips in the edited video do not match the background music, that is, the problem that the edited video content does not fully match the rhythm beat points of the background music.

Optionally, in the solution of the present application, the background music can be selected based on the overall video theme information of the N videos, and the M video clips can be sorted based on the rhythm of the background music, so that the video clips are ordered according to the rhythm of the background music and the picture content of the video clips matches the rhythm of the music. Compared with matching the videos to the music directly in their input order, the solution of the present application can improve the consistency between the image content in the video and the rhythm of the background music, improving the video quality of the edited video.

Optionally, in the solution of the present application, for N videos with a strong storyline, the N videos can be sorted based on their text description information to obtain N sorted videos; M video clips that are highly relevant to the video theme information are selected from the N sorted videos to obtain M sorted video clips; and the background music matching the M sorted video clips is determined based on the M sorted video clips and the video theme information. In this way, the picture content of the N strong-storyline videos matches the rhythm of the music, the playback order of the picture content conforms to the causal relationships, and the video quality of the edited video is improved.
The video editing method provided by the embodiments of the present application can, based on the correlation between the video content of the N videos and the overall video theme, effectively filter out the video clips in the N videos that are irrelevant to the overall video theme; match background music according to the video content (for example, the emotion expressed by the video and the pictures of the video); and reasonably concatenate multiple video clips based on the rhythm of the background music or on the logical relationships between the video clips. As a result, the edited video does not include content irrelevant to the overall video theme, and the video content matches the rhythm of the background music, thereby improving the professionalism of video editing on the electronic device and the video quality of the edited video.

Exemplarily, the video editing method provided by the embodiments of the present application is suitable for automatically generating a mixed-cut video in an electronic device; for example, the electronic device detects the user's selection operation on multiple videos, identifies the video theme of the multiple videos, matches background music related to the video theme, and automatically synthesizes the multiple videos with the background music to generate the mixed-cut video.

Optionally, the method provided by the embodiments of the present application is applicable not only to videos saved in the electronic device but also to photos saved in the electronic device; for example, a mixed-cut video can be generated based on photos saved in the electronic device, where the photos include but are not limited to GIF animations, JPEG images, PNG images, and the like.

The relevant interface diagrams of the video editing method provided by the embodiments of the present application are described in detail below with reference to FIG. 5 to FIG. 11.
Exemplarily, as shown in FIG. 5, the graphical user interface (GUI) shown in (a) of FIG. 5 is the desktop 301 of the electronic device. The electronic device detects the user's operation of clicking the control 302 of the gallery application on the desktop, as shown in (b) of FIG. 5, and then displays the gallery display interface 303 shown in (c) of FIG. 5. The gallery display interface 303 includes an all-photos icon, video icons, and a more-options control 304. The electronic device detects the user's operation of clicking the more-options control 304, as shown in (d) of FIG. 5, and then displays the display interface 305 shown in (a) of FIG. 6, which includes a one-click blockbuster control 306. The electronic device detects the user's operation of clicking the one-click blockbuster control 306, as shown in (b) of FIG. 6, and then displays the display interface 307 shown in (c) of FIG. 6, which includes the icons of the videos saved in the electronic device and a multi-select control 308. The electronic device detects the user's operation of clicking the multi-select control 308, as shown in (d) of FIG. 6, and then displays the display interface 309 shown in (a) of FIG. 7, which includes a video icon 310. The electronic device detects the user's operation of clicking the icon 310, as shown in (b) of FIG. 7, and then displays the display interface 311 shown in (c) of FIG. 7, which includes a video icon 312. The electronic device detects the user's operation of clicking the icon 312, as shown in (d) of FIG. 7, and then displays the display interface 313 shown in (a) of FIG. 8, which includes a video icon 314. The electronic device detects the user's operation of clicking the icon 314, as shown in (b) of FIG. 8, and then displays the display interface 315 shown in (c) of FIG. 8, which includes a one-click blockbuster control 316. The electronic device detects the user's operation of clicking the one-click blockbuster control 316, as shown in (d) of FIG. 8.
In one example, after the electronic device detects the user's operation of clicking the one-click blockbuster control 316, the electronic device may execute the video editing method provided by the embodiments of the present application, perform video editing processing on the multiple videos selected by the user, and display the display interface 317 shown in FIG. 9.

It should be understood that, for each piece of video theme information, the electronic device may pre-configure one template corresponding to that video theme information; therefore, the electronic device may display the display interface 317 shown in FIG. 9.

In one example, after the electronic device detects the user's operation of clicking the one-click blockbuster control 316, the electronic device may execute the video editing method provided by the embodiments of the present application, perform video editing processing on the multiple videos selected by the user, and display the display interface 318 shown in FIG. 10. The display interface 318 shows that the video theme information obtained based on the solution of the present application is "travel"; in addition, template 1, template 2, template 3, and template 4 corresponding to "travel" may be displayed in the display interface 318.

It should be understood that, for each piece of video theme information, the electronic device may pre-configure multiple templates corresponding to that video theme information; therefore, the electronic device may display the display interface 318 shown in FIG. 10.

In one example, after the electronic device detects the user's operation of clicking the one-click blockbuster control 316, the electronic device may execute the video editing method provided by the embodiments of the present application, perform video editing processing on the multiple videos selected by the user, and display the display interface 319 shown in FIG. 11. Based on the solution of the present application, if the electronic device obtains two or more pieces of video theme information, a prompt box 320 may be displayed on the electronic device; as shown in FIG. 11, the prompt box 320 includes two video themes, namely scenery and travel. Based on the user's operation on a video theme in the prompt box 320, the electronic device can determine one piece of video theme information from the two or more video themes.

It should be understood that the above takes selecting videos in the electronic device for video editing processing as an example; the video editing method provided in the embodiments of the present application is equally applicable to performing video editing processing on photos saved in the electronic device to generate a mixed-cut video, where the photos include but are not limited to GIF animations, JPEG images, PNG images, and the like; the present application does not impose any limitation on this.
The video editing method provided by the embodiments of the present application is described in detail below with reference to FIG. 12 to FIG. 19.

FIG. 12 is a schematic flowchart of a video editing method provided by an embodiment of the present application. The video editing method 400 may be executed by the electronic device shown in FIG. 1; the video editing method includes steps S410 to S480, which are described in detail below.
Step S410: display a first interface.

The first interface includes video icons, and the videos indicated by the video icons are videos stored in the electronic device.

Exemplarily, the first interface may refer to the display interface of the gallery application in the electronic device, such as the display interface 307 shown in (c) of FIG. 6; the display interface 307 includes 6 video icons, and the videos corresponding to these 6 video icons are videos stored in the electronic device.
Step S420: detect a first operation on N video icons among the video icons.

Exemplarily, the first operation may be a click operation on the N video icons, or the first operation may be another operation of selecting the N video icons.

For example, as shown in (b) of FIG. 7, the electronic device detects a click operation on the icon 310 among the video icons; for another example, as shown in (d) of FIG. 7, the electronic device detects a click operation on the icon 312 among the video icons.

Optionally, the first operations on the N video icons may be performed one after another, or may be performed simultaneously.

It should be understood that the above takes a click operation as an example of the first operation; the first operation may also be a voice-instructed operation of selecting N video icons among the video icons, or another operation for instructing the selection of N video icons among the video icons; the present application does not impose any limitation on this.
Step S430: in response to the first operation, obtain information of N videos.

Here, N is an integer greater than 1.

Exemplarily, as shown in (b) of FIG. 8, based on the first operation, the electronic device can obtain information of 3 videos.
Step S440: obtain the video theme of the N videos based on the information of the N videos.

It should be understood that the video theme may refer to the thematic idea associated with the overall image content of the videos; for different video themes, the corresponding video processing manners may be different; for example, different video themes may use different music, different transition effects, different image processing filters, or different video editing manners.

Optionally, in a possible implementation, obtaining the video theme of the N videos based on the information of the N videos includes:

converting the video content of the N videos into N pieces of text description information, where the N pieces of text description information are in one-to-one correspondence with the N videos, and one piece of text description information among the N pieces is used to describe the image content information of one video among the N videos; and obtaining the theme information of the N videos based on the N pieces of text description information, where the text description information is used to convert the video content of the N videos into text information.

In the embodiments of the present application, when identifying the video theme of the N videos, the video theme information corresponding to the N videos is obtained through the text description information of the N videos; that is, the overall video theme information of the N videos can be obtained based on their text description information. Compared with obtaining the video theme information based on the image semantics of the N videos, text information carries more abstract semantic information than image information, and multiple pieces of text information have linguistic relatedness among them, which helps infer the theme information implied behind the multiple texts, thereby improving the accuracy of the overall video theme corresponding to the N videos.

For example, the N videos include a video of a user packing luggage, a video of the user taking a car to the airport, a video of the user taking a plane, and a video of the user walking on the beach. Based on image semantics alone, only some image labels may be obtained, such as clothing, suitcase, user, and seaside, and the video theme "travel" of the N videos cannot be abstracted from these image labels. However, when identifying the video theme based on the text description information of the N videos, the video theme information can be obtained accurately from the linguistic and logical relatedness among the N pieces of text description information; for example, based on the text description information "a user is packing luggage", "a user is taking a plane", and "a user is walking on the beach", the video theme information "travel" of the N videos can be abstracted. Therefore, obtaining the video theme information of the N videos through their text description information can improve the accuracy of the theme information.
Optionally, in a possible implementation, obtaining the theme information of the N videos based on the N pieces of text description information includes:

inputting the N pieces of text description information into a pre-trained topic classification model to obtain the theme information of the N videos, where the pre-trained topic classification model is a deep neural network for text classification.

Optionally, the N videos may be input into an image-to-text conversion model to obtain the text description information of the N videos, namely the N pieces of text description information; the text description information of the N videos is then input into the pre-trained topic classification model to obtain the theme information of the N videos. Optionally, for the implementation, refer to the related description of step S530 in FIG. 13, or steps S620 and S630 in FIG. 18.
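As a minimal sketch of the two-stage pipeline described above (the embodiments do not prescribe specific models), an off-the-shelf image captioning model can stand in for the image-to-text conversion model and a zero-shot text classifier for the topic classification model; the Hugging Face checkpoint names below are illustrative assumptions.

```python
from transformers import pipeline

# Stage 1: video keyframes -> text description information (one caption per frame).
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")

# Stage 2: N pieces of text description information -> one overall theme.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

THEMES = ["travel", "party", "pets", "sports", "scenery", "parent-child", "work"]

def infer_video_theme(keyframe_paths: list[str]) -> str:
    captions = [captioner(path)[0]["generated_text"] for path in keyframe_paths]
    joined = ". ".join(captions)  # concatenate so cross-description cues are kept
    result = classifier(joined, candidate_labels=THEMES)
    return result["labels"][0]    # highest-confidence theme label

# Hypothetical usage with one keyframe per video:
# print(infer_video_theme(["packing.jpg", "airport.jpg", "beach.jpg"]))
```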
Optionally, in a possible implementation, when the pre-trained topic classification model outputs at least two video themes, the at least two video themes corresponding to the N pieces of text description information, the method further includes:

displaying a second interface, where the second interface includes a prompt box, and the prompt box includes information on the at least two video themes;

detecting a second operation on the at least two video themes; and

obtaining the theme information of the N videos in response to the second operation.

Optionally, if the theme information output in step S440 is a single piece of theme information, no user operation is needed; if two or more pieces of theme information are output in step S440, a prompt box may be displayed on the electronic device. The prompt box may include candidate video theme information, and the video theme information of the N videos is determined based on the user's operation on the candidate video theme information in the prompt box.

Exemplarily, if two pieces of theme information are output in step S440, a second interface may be displayed on the electronic device, such as the display interface 319 shown in FIG. 11; the display interface 319 includes the prompt box 320, and the prompt box 320 includes two pieces of candidate video theme information, namely scenery and travel. If the electronic device detects that the user clicks "scenery", the video theme information of the N videos is scenery; if the electronic device detects that the user clicks "travel", the video theme information of the N videos is travel.

In the embodiments of the present application, when the electronic device outputs at least two video themes, the electronic device may display a prompt box; based on the detected user operation on the candidate video themes in the prompt box, the video theme information of the N videos can be determined. To a certain extent, this avoids the situation in which the electronic device cannot identify the video theme of the N videos when the video content of the N videos does not fully conform to a preset video theme.
Step S450: select M video clips from the N videos based on the similarity between the images in the N videos and the video theme.

Exemplarily, the similarity between the images in the N videos and the video theme may be represented by a similarity confidence value or by a distance value. For example, the higher the similarity between an image feature in a video and the text feature of the video theme, the larger the similarity confidence value and the smaller the distance value; the lower the similarity between an image feature in a video and the text feature of the video theme, the smaller the similarity confidence value and the larger the distance value.
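A minimal sketch of the two representations mentioned above, assuming image and text features are plain vectors: cosine similarity plays the role of the similarity confidence value and Euclidean distance plays the role of the distance value.

```python
import numpy as np

def similarity_confidence(image_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """Cosine similarity: larger means the image is closer to the video theme."""
    a = image_feat / np.linalg.norm(image_feat)
    b = text_feat / np.linalg.norm(text_feat)
    return float(a @ b)

def distance_value(image_feat: np.ndarray, text_feat: np.ndarray) -> float:
    """Euclidean distance: smaller means the image is closer to the video theme."""
    return float(np.linalg.norm(image_feat - text_feat))
```

For unit-normalized features the two measures are monotonically related (the squared distance equals 2 minus twice the cosine similarity), which is consistent with the inverse relationship described above.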
In the embodiments of the present application, M video clips that are highly relevant to the video theme can be determined from the N videos based on the correlation between the images in the N videos and the video theme information. Based on the solution of the present application, video clips in the N videos that are irrelevant to the video theme information can be effectively removed, ensuring that the selected video clips are relevant to the video theme information. In addition, the similarity confidence values between some or all of the image features in the N videos and the video theme information can be calculated, and a video clip is obtained by selecting multiple consecutive frames from a video, so the continuity of the video clip is good.

In one example, for each of the N videos, all the image features in the video can be traversed, and the similarity between each image feature in the video and the text information of the video theme information can be determined.

In one example, for each of the N videos, only some of the image features in the video are extracted; for example, for one of the N videos, image frames can be selected at equal intervals, and feature extraction is performed on the selected image frames to obtain the image features.

Optionally, in the embodiments of the present application, M may be greater than N, equal to N, or less than N; the value of M is determined based on the similarity confidence values between the video clips in the N videos and the video theme information.

It should be understood that, in the solution of the present application, if the similarity confidence values between all the images in a video and the video theme information are less than or equal to a preset threshold, the video is irrelevant to the video theme information, and none of its video clips need to be retained; if the similarity confidence values between some or all of the images in a video and the video theme information are greater than the preset threshold, some or all of the video clips in that video can be retained.
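A minimal sketch of this retention rule, assuming per-frame similarity confidence values are already available: contiguous runs of frames above the preset threshold become candidate video clips, and a video whose frames all fall at or below the threshold contributes no clip. The function name and minimum-length parameter are illustrative assumptions.

```python
def select_clips(frame_scores: list[float], threshold: float,
                 min_len: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges whose similarity confidence values
    all exceed the threshold; runs shorter than min_len are discarded."""
    clips, start = [], None
    for i, score in enumerate(frame_scores):
        if score > threshold and start is None:
            start = i                        # a theme-relevant run begins
        elif score <= threshold and start is not None:
            if i - start >= min_len:
                clips.append((start, i - 1))
            start = None                     # the run ends
    if start is not None and len(frame_scores) - start >= min_len:
        clips.append((start, len(frame_scores) - 1))
    return clips

# Hypothetical usage: only frames 2..5 are relevant to the theme.
print(select_clips([0.1, 0.2, 0.8, 0.9, 0.85, 0.7, 0.2], threshold=0.5))  # [(2, 5)]
```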
Optionally, in a possible implementation, selecting M video clips from the N videos based on the similarity between the images in the N videos and the video theme includes:

inputting the N videos and the video theme into a pre-trained similarity matching model to obtain the similarity confidence values between the images in the N videos and the video theme, where the pre-trained similarity matching model includes an image encoder, a text encoder, and a first similarity measurement module; the image encoder is used to extract image features from the N videos, the text encoder is used to extract text features from the video theme, the first similarity measurement module is used to measure the similarity between the image features of the N videos and the text features of the video theme, and a similarity confidence value is used to represent the probability that an image in the N videos is similar to the video theme; and

selecting M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video theme.

Optionally, in a possible implementation, the pre-trained similarity matching model is a Transformer model.
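A minimal sketch of such a dual-encoder similarity matching model, using a CLIP-style Transformer architecture as a stand-in (the embodiments only require an image encoder, a text encoder, and a similarity measurement module); the openai/clip-vit-base-patch32 checkpoint is an illustrative assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def theme_confidences(frames: list[Image.Image], theme: str) -> list[float]:
    """Per-frame similarity confidence of the video frames against the theme text."""
    inputs = processor(text=[theme], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1).tolist()  # cosine similarity per frame
```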
Optionally, in a possible implementation, the pre-trained similarity matching model is obtained through the following training manner:

training the similarity matching model to be trained with a contrastive learning method based on a first training data set to obtain the pre-trained similarity matching model, where the first training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample text description information and first sample video theme information, the first sample text description information matching the first sample video theme information; and a negative example data pair includes the first sample text description information and second sample video theme information, the first sample text description information not matching the second sample video theme information.
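A minimal sketch of the contrastive objective described above, assuming batched embeddings in which the i-th text description matches the i-th theme (diagonal pairs are positive examples, off-diagonal pairs are negative examples); this is the symmetric InfoNCE loss commonly used for such training, offered as one plausible choice rather than the embodiment's exact loss.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, theme_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: row i of text_emb matches row i of theme_emb."""
    text_emb = F.normalize(text_emb, dim=-1)
    theme_emb = F.normalize(theme_emb, dim=-1)
    logits = text_emb @ theme_emb.T / temperature  # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0))         # diagonal entries are positives
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Hypothetical usage with a batch of 8 embedding pairs of dimension 512:
loss = contrastive_loss(torch.randn(8, 512, requires_grad=True),
                        torch.randn(8, 512, requires_grad=True))
loss.backward()  # pulls matched pairs together and pushes mismatched pairs apart
```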
Optionally, for the implementation of step S450, refer to the related descriptions of steps S540 and S550 in FIG. 13, or FIG. 14, or FIG. 15, or steps S640 and S650 in FIG. 18, or steps S750 and S760 in FIG. 19.
Step S460: obtain music matching the video theme based on the video theme.

Optionally, in a possible implementation, obtaining music matching the video theme based on the video theme includes:

obtaining music matching the video theme based on the durations of the M video clips and the video theme, where the duration of the music is greater than or equal to the total duration of the M video clips.

Exemplarily, the total duration of the background music can be determined based on the durations of the M video clips; when performing music matching, the selected background music usually needs to be longer than or equal to the total duration of the M video clips. Based on the video theme information, the music style of the background music can be determined.

Optionally, for the implementation of step S460, refer to the related description of step S560 in FIG. 13, or step S660 in FIG. 18, or step S770 in FIG. 19.
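A minimal sketch of the selection rule described above over a hypothetical music library; the track metadata and the preference for the shortest sufficient track are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Track:
    name: str
    style: str          # e.g. "travel", "sports"
    duration_s: float

def pick_background_music(library: list[Track], theme: str,
                          clips_total_s: float) -> Optional[Track]:
    """Shortest track whose style matches the theme and whose duration is at
    least the total duration of the M video clips."""
    candidates = [t for t in library
                  if t.style == theme and t.duration_s >= clips_total_s]
    return min(candidates, key=lambda t: t.duration_s, default=None)

# Hypothetical usage:
library = [Track("a", "travel", 95.0), Track("b", "travel", 240.0),
           Track("c", "sports", 120.0)]
print(pick_background_music(library, "travel", clips_total_s=90.0))  # Track "a"
```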
Step S470: obtain a first video based on the M video clips and the music.

Optionally, in a possible implementation, obtaining the first video based on the M video clips and the music includes:

sorting the M video clips to obtain M sorted video clips; and

synthesizing the M sorted video clips and the music into the first video.

In the embodiments of the present application, the image content of the M video clips can be made to fit the rhythm of the music more closely; for example, video image content showing scenery may correspond to the prelude or a soothing part of the music, and video image content showing the user's sports scenes may correspond to the climax of the background music. By sorting the M video clips, the M video clips better match the rhythm beat points of the music, which solves the problem that the video clips in the edited first video do not match the background music, that is, the problem that the content of the edited first video does not fully match the rhythm beat points of the music, thereby improving the video quality of the edited first video.
Optionally, in a possible implementation, sorting the M video clips to obtain the M sorted video clips includes:

sorting the M video clips based on the rhythm of the music to obtain the M sorted video clips.

Exemplarily, for videos without a strong storyline, the best positions of the M video clips can be matched based on the rhythm of the music to generate the processed video. Optionally, for the implementation, refer to the related description of FIG. 18 below.

It should be understood that videos without a strong storyline may refer to N videos of equal order, with no strong causal relationship among them; for example, videos without a strong storyline may include sports-themed videos.

For example, in the solution of the present application, the background music can be selected based on the overall video theme information of the N videos, and the M video clips can be sorted based on the rhythm of the background music, so that the M video clips are ordered according to the rhythm of the background music and the picture content of the video clips matches the rhythm of the music; a heuristic sketch of such ordering is given after this paragraph. Compared with matching the videos to the music directly in their input order, the solution of the present application can improve the consistency between the image content in the video and the rhythm of the background music, improving the video quality of the edited video.
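One plausible heuristic realization of rhythm-based ordering (not the embodiment's exact algorithm), assuming each clip carries a motion-intensity score and each beat interval of the music carries an energy score: calm clips are assigned to low-energy intervals and dynamic clips to high-energy ones.

```python
def order_clips_by_rhythm(clip_motion: list[float],
                          beat_energy: list[float]) -> list[int]:
    """Return clip indices so that the i-th played clip lands on the i-th beat
    interval, pairing low motion with low energy and high motion with high energy."""
    assert len(clip_motion) == len(beat_energy)
    clips_ranked = sorted(range(len(clip_motion)), key=lambda i: clip_motion[i])
    slots_ranked = sorted(range(len(beat_energy)), key=lambda j: beat_energy[j])
    order = [0] * len(clip_motion)
    for clip_idx, slot_idx in zip(clips_ranked, slots_ranked):
        order[slot_idx] = clip_idx          # align the two rankings
    return order

# Hypothetical usage: a soothing prelude, then a climax, then a mid-energy outro.
print(order_clips_by_rhythm([0.9, 0.1, 0.5], [0.2, 1.0, 0.6]))  # [1, 0, 2]
```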
Optionally, in a possible implementation, sorting the M video clips based on the rhythm of the music to obtain the M sorted video clips includes:

inputting the music and the M video clips into a pre-trained audio-visual rhythm matching model to obtain the M sorted video clips, where the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder, and a first similarity measurement module; the audio encoder is used to perform feature extraction on the music to obtain audio features, the video encoder is used to perform feature extraction on the M video clips to obtain video features, and the first similarity measurement module is used to measure the similarity between the audio features and the video features of the M video clips.

It should be noted that, for the above implementation, refer to the related descriptions of FIG. 16 or FIG. 17 below.

In the embodiments of the present application, the music and the M video clips are input into the pre-trained audio-visual rhythm matching model to obtain the M sorted video clips; matching between the audio features and the video features can be achieved through the pre-trained audio-visual rhythm matching model.
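A minimal sketch of such an audio-visual rhythm matching model, assuming pre-computed per-segment audio features and per-clip video features; the stand-in encoders are small multilayer perceptrons projecting into a shared space, cosine similarity serves as the similarity measurement module, and each music segment is greedily assigned its most similar unused clip. All dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RhythmMatcher(nn.Module):
    """Stand-in audio/video encoders plus a cosine similarity measure."""
    def __init__(self, audio_dim: int, video_dim: int, shared_dim: int = 128):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, shared_dim), nn.ReLU(),
                                       nn.Linear(shared_dim, shared_dim))
        self.video_enc = nn.Sequential(nn.Linear(video_dim, shared_dim), nn.ReLU(),
                                       nn.Linear(shared_dim, shared_dim))

    def order_clips(self, audio_feats: torch.Tensor,
                    video_feats: torch.Tensor) -> list[int]:
        a = F.normalize(self.audio_enc(audio_feats), dim=-1)  # (M, D) music segments
        v = F.normalize(self.video_enc(video_feats), dim=-1)  # (M, D) video clips
        sim = a @ v.T                                         # (M, M) cosine scores
        order, used = [], set()
        for seg in range(sim.size(0)):            # music segments in time order
            ranked = torch.argsort(sim[seg], descending=True)
            best = next(int(i) for i in ranked if int(i) not in used)
            used.add(best)
            order.append(best)                    # clip played at this segment
        return order

matcher = RhythmMatcher(audio_dim=64, video_dim=256)
print(matcher.order_clips(torch.randn(3, 64), torch.randn(3, 256)))
```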
Optionally, in a possible implementation, sorting the M video clips to obtain the M sorted video clips includes:

sorting the M video clips based on the video content of the M video clips to obtain the M sorted video clips.

Exemplarily, for videos with a strong storyline, the M sorted video clips can be determined based on the similarity confidence values between the video clips included in the N sorted videos and the video theme; the background music matching the M sorted video clips is then determined based on the M sorted video clips and the video theme information, and the processed video is generated.

It should be understood that videos with a strong storyline may refer to N videos having causal relationships among them; based on the video editing method, the cause-and-effect relationships among the N videos can be identified, and the N videos can be sorted according to the order of cause and effect; for example, videos with a strong storyline may include travel-themed videos or trip-themed videos.

For example, in the solution of the present application, for N videos with a strong storyline, the N videos can be sorted based on their text description information to obtain N sorted videos; M video clips that are highly relevant to the video theme information are selected from the N sorted videos to obtain M sorted video clips; and the background music matching the M sorted video clips is determined based on the M sorted video clips and the video theme information. In this way, the picture content of the N strong-storyline videos matches the rhythm of the music, the playback order of the picture content conforms to the causal relationships, and the video quality of the edited video is improved.
Step S480: display the first video.

Exemplarily, the first video may be a mixed-cut video obtained based on the M video clips and the music; the mixed-cut video can be displayed on the electronic device.

Optionally, in a possible implementation, after generating the first video based on the M video clips and the music, the electronic device may save the first video; after the electronic device detects an operation instructing it to display the first video, it displays the first video.

It should be understood that the above takes the editing of N videos as an example; the solution of the present application is also applicable to photos saved in the electronic device, where the photos may include but are not limited to GIF animations, JPEG images, PNG images, and the like.
In the embodiments of the present application, the image content information of the N videos can be converted into text description information; the video theme information of the N videos is obtained based on the text description information of the N videos; M video clips are selected from the N videos based on the degree of correlation between the images in the N videos and the video theme information; and the processed video is obtained based on the M video clips and the background music. In the solution of the present application, the video theme information of the N videos is obtained through their text description information; compared with obtaining the video theme information based on the image information of the N videos, text information carries richer information than image information, and there is linguistic relatedness among multiple pieces of text information, so obtaining the video theme information based on the text description information of the N videos can improve the accuracy of the video theme information. In addition, in the embodiments of the present application, M video clips that are highly relevant to the video theme can be determined from the N videos based on the correlation between the images in the N videos and the video theme information; based on the solution of the present application, video clips in the N videos that are irrelevant to the overall video theme information can be effectively removed, ensuring that the selected video clips are relevant to the video theme information and improving the video quality of the edited video.

In addition, the embodiments of the present application can solve the problem that the video clips in the edited first video do not match the background music, that is, the problem that the image content of the edited first video does not fully match the rhythm beat points of the background music; multiple video clips are reasonably concatenated based on the rhythm of the background music or on the logical relationships between the video clips, thereby improving the video quality of the edited video.
FIG. 13 is a schematic flowchart of a video editing method provided by an embodiment of the present application. The video editing method 500 may be executed by the electronic device shown in FIG. 1; the video editing method includes steps S510 to S570, which are described in detail below.

Step S510: obtain N videos.

Exemplarily, the N videos may be videos stored in the electronic device; the N videos may be videos collected by the electronic device, or some or all of the N videos may be downloaded videos; the present application does not impose any limitation on the source of the N videos.

For example, the electronic device detects the user's click operations on N videos in the gallery application and obtains the N videos.
Step S520: obtain text description information of the N videos.

It should be understood that one video may correspond to one piece of text description information, which is used to describe the content information of the video; through the text description information, the image content of the video can be converted into a textual description.

It should be noted that the text description information describes the image content of a video; the text description information may differ from the subtitle content of the video.

Exemplarily, if video 1 is a video of a user packing luggage, the text description information of video 1 may be "a person is packing luggage"; if video 2 is a video of the user taking a plane at the airport, the text description information of video 2 may be "a person is taking a plane"; if video 3 is a video of the user strolling on the beach, the text description information of video 3 may be "a person is strolling on the beach".
Step S530: obtain the video theme information of the N videos based on the text description information of the N videos.

It should be understood that the video theme may refer to the thematic idea associated with the overall image content of the videos; for different video themes, the corresponding video processing manners may be different; for example, different video themes may use different music, different transition effects, different image processing filters, or different video editing manners.

It should be noted that, in the embodiments of the present application, the video theme information of the N videos is a single piece of theme information, that is, the video theme information corresponding to the N videos as a whole.

Exemplarily, the video theme may include but is not limited to: travel, party, pets, sports, scenery, parent-child, work, and the like.

Optionally, the text description information of the N videos can be input into a pre-trained video topic classification model to obtain the video theme information of the N videos, where the pre-trained video topic classification model can output a video theme label.

Exemplarily, the pre-trained video topic classification model may refer to a text classification model, which can be used to classify the input text description information and obtain the classification label corresponding to the text description information.

For example, the pre-trained video topic classification model may be a neural network; for instance, it may be a deep neural network.

Optionally, the pre-trained video topic classification model may be obtained by training with a backpropagation algorithm based on the following training data set: the training data set includes sample text description information and video theme text information, where the sample text description information corresponds to the video theme text information; the sample text description information may be one or more sentence texts, and the video theme text information may be a phrase text. By learning from a large training data set, the video topic classification model to be trained becomes the trained video topic classification model.

For example, the sample text description information may include "multiple people are eating", "multiple people are playing games", and "multiple people are talking", and the video theme text information corresponding to this sample text description information may be "party"; for another example, the sample text description information may include "an adult and a child are taking photos" and "an adult and a child are playing games", and the video theme information corresponding to this sample text description information is "parent-child".

It should be understood that the above is merely an example; the embodiments of the present application do not impose any limitation on the sample text description information and the sample video theme information.
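A minimal sketch of training such a text classifier with backpropagation, using a bag-of-words encoder and a small feed-forward network as stand-ins; the vocabulary, labels, and examples mirror the ones in the paragraph above and are illustrative only.

```python
import torch
import torch.nn as nn

samples = [("multiple people are eating", "party"),
           ("multiple people are playing games", "party"),
           ("an adult and a child are taking photos", "parent-child"),
           ("an adult and a child are playing games", "parent-child")]
vocab = sorted({w for text, _ in samples for w in text.split()})
labels = sorted({lbl for _, lbl in samples})

def bag_of_words(text: str) -> torch.Tensor:
    vec = torch.zeros(len(vocab))
    for w in text.split():
        if w in vocab:
            vec[vocab.index(w)] += 1.0
    return vec

model = nn.Sequential(nn.Linear(len(vocab), 16), nn.ReLU(),
                      nn.Linear(16, len(labels)))
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
x = torch.stack([bag_of_words(t) for t, _ in samples])
y = torch.tensor([labels.index(l) for _, l in samples])

for _ in range(200):                       # backpropagation training loop
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()

print(labels[model(bag_of_words("multiple people are talking")).argmax()])
```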
In the embodiments of the present application, when identifying the video theme of the N videos, the video theme information corresponding to the N videos is obtained through the text description information of the N videos; that is, the overall video theme information of the N videos can be obtained based on their text description information. Compared with obtaining the video theme information based on the image semantics of the N videos, text information carries more abstract semantic information than image information, and multiple pieces of text information have linguistic relatedness among them, which helps infer the theme information implied behind the multiple texts, thereby improving the accuracy of the overall video theme corresponding to the N videos. For example, the N videos include a video of a user packing luggage, a video of the user taking a car to the airport, a video of the user taking a plane, and a video of the user walking on the beach; based on image semantics alone, only some image labels may be obtained, such as clothing, suitcase, user, and seaside, and the video theme "travel" of the N videos cannot be abstracted from these image labels. However, when identifying the video theme based on the text description information of the N videos, the video theme information of the N videos can be obtained accurately based on the linguistic and logical relatedness among the N pieces of video text description information; for example, based on the text description information "a user is packing luggage", "a user is taking a plane", and "a user is walking on the beach", the video theme information "travel" of the N videos can be abstracted. Therefore, obtaining the video theme information of the N videos through their text description information can improve the accuracy of the theme information.
Optionally, if the topic information output in step S530 is a single piece of video topic information, no user operation is required; if two or more pieces of video topic information are output, a prompt box may be displayed on the electronic device. The prompt box may include candidate video topic information, and the electronic device determines the video topic information of the N videos based on the user's operation on the candidates in the prompt box.
In one possible implementation, if the confidence values of the multiple pieces of text description information for every video topic are all low, the input text descriptions may not fully match any single topic; in this case, candidate video topic information may be displayed on the electronic device, and the video topic information corresponding to the text descriptions is determined based on the user's operation.
Exemplarily, as shown in Figure 11, if two pieces of video topic information are output in step S530, a display interface 319 may be shown on the electronic device; the display interface 319 includes a prompt box 320 with two candidate video topics, "scenery" and "travel". If the electronic device detects that the user taps "scenery", the video topic information of the N videos is scenery; if it detects that the user taps "travel", the video topic information of the N videos is travel.
Step S540: based on the similarity between the images in the N videos and the video topic information, obtain similarity confidence values between the images in the N videos and the video topic information.
Optionally, the similarity between the image features in the N videos and the text features of the video topic information may be obtained with a similarity evaluation model, yielding the similarity confidence values between the image features in the N videos and the video topic information. For the implementation, see the related descriptions of Figures 14 and 15 below.
In one example, for each of the N videos, all image features in the video may be traversed, and the similarity between each image feature and the text information of the video topic information is determined.
In one example, for each of the N videos, only some image features may be extracted; for instance, image frames may be selected at equal intervals, and feature extraction is performed on the selected frames to obtain image features.
For example, one frame may be extracted every 4 frames, so the 1st, 5th, 9th, and 13th frames of a video are selected, and so on.
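A minimal sketch of the equal-interval sampling just described; the function name and the 0-based frame indexing are illustrative assumptions.

```python
def sample_frames(num_frames: int, interval: int = 4) -> list[int]:
    """Return indices of frames sampled every `interval` frames.

    With interval=4 this yields frames 0, 4, 8, 12, ... (the 1st, 5th,
    9th, 13th frames in 1-based numbering), matching the equal-interval
    sampling described above.
    """
    return list(range(0, num_frames, interval))

print(sample_frames(16))  # [0, 4, 8, 12]
```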
Optionally, the similarity evaluation model may be used to extract the image features in the N videos and the text features of the video topic information, evaluate the similarity between them, and output the similarity confidence values between the image features in the N videos and the video topic information; for the specific implementation, see the related descriptions of Figures 14 and 15 below.
Step S550: based on the similarity confidence values between the images in the N videos and the video topic information, obtain M video clips from the N videos.
Exemplarily, as shown in Figure 15, assume the N videos include video 310, video 312, and video 314. Curve 561 is the similarity curve between the image features in video 310 and the text features of the video topic information; curve 562 is the similarity curve for video 312; curve 563 is the similarity curve for video 314. Based on curve 561, image 3101 and image 3102 in video 310 may be selected to form video clip 1; based on curve 562, image 3121, image 3122, and image 3123 in video 312 may be selected to form video clip 2; based on curve 563, image 3141, image 3142, image 3143, and image 3144 in video 314 may be selected to form video clip 3.
It should be understood that Figure 15 is an example; two or more video clips may also be selected from a single video, and the two clips may be consecutive or non-consecutive (for example, frames 1 to 5 form one clip and frames 10 to 13 form another). For any single clip, however, the multiple frames it includes are consecutive. Alternatively, no clip at all may be selected from a video; whether a clip is selected from a video depends on the similarity confidence values between the image features in that video and the video topic information. If a video contains no image features related to the video topic, no clip from that video need be selected.
Optionally, in the embodiments of this application, M may be greater than, equal to, or less than N; the value of M is determined by the similarity confidence value between each video clip in the N videos and the video topic information.
It should be understood that, in the solution of this application, if the similarity confidence values between all images in a video and the video topic information are less than or equal to a preset threshold, the video is unrelated to the video topic information and none of its clips need be retained; if the similarity confidence values of some or all images in a video exceed the preset threshold, some or all clips in that video may be retained, as sketched below.
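A minimal sketch of this thresholding rule, assuming per-frame similarity confidence values are already available; the function name and data layout are illustrative.

```python
def select_clips(confidences: list[float], threshold: float) -> list[tuple[int, int]]:
    """Return (start, end) frame ranges (inclusive) whose similarity
    confidence stays above `threshold`; each range is one candidate clip."""
    clips, start = [], None
    for i, c in enumerate(confidences):
        if c > threshold and start is None:
            start = i                      # a new clip begins
        elif c <= threshold and start is not None:
            clips.append((start, i - 1))   # the clip ends at the previous frame
            start = None
    if start is not None:
        clips.append((start, len(confidences) - 1))
    return clips

# A video whose every confidence is <= threshold yields no clips at all.
print(select_clips([0.2, 0.8, 0.9, 0.3, 0.7, 0.75], threshold=0.5))
# [(1, 2), (4, 5)]
```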
In the embodiments of this application, based on the correlation between the images in the N videos and the video topic information, M video clips that are highly relevant to the video topic can be determined from the N videos. The solution of this application can effectively delete the clips in the N videos that are irrelevant to the video topic information, ensuring that the selected clips are related to it. Moreover, when computing the similarity confidence values between some or all images in the N videos and the video topic information, each clip is obtained by selecting consecutive frames within one video, so the continuity of the clips is good.
Optionally, the original sound of some or all of the M video clips may be retained.
Step S560: perform music matching based on the M video clips and the video topic information to obtain background music.
It should be understood that the background music obtained in step S560 may refer to the music in step S460 of Figure 12.
Exemplarily, the total duration of the background music can be determined from the durations of the M video clips; the background music selected during matching usually needs to be longer than or equal to the total duration of the M video clips. The music style of the background music can be determined from the video topic information.
For example, if the video topic is a party, the background music may be in a cheerful style; if the video topic is scenery, the background music may be in a soothing style.
It should be understood that the above is merely illustrative; this application places no limitation on the video topic or the music style of the background music.
Optionally, music matching may be performed in a candidate music library based on the M video clips and the video topic information to obtain the background music information; the candidate music library may include music of different styles and durations.
Exemplarily, the total duration of the background music may be determined from the durations of the M video clips, and the music style of the background music may be determined from the video topic information; based on the total duration and the music style, the background music may be selected at random from the candidates of that style in the candidate music library.
Exemplarily, based on the total duration and the music style determined as above, the background music may be selected from the candidate music library according to music popularity.
Exemplarily, based on the total duration and the music style determined as above, the background music may be selected from the candidate music library based on the user's preferences.
For example, background music satisfying the total duration and the music style may be selected from the candidate music library based on the frequency with which the user plays each piece of music.
Exemplarily, the piece of music in the candidate music library that best matches the video topic may be selected as the background music.
Exemplarily, multiple pieces of music may be selected from the candidate music library and edited together to obtain the background music; the weights or durations of the pieces may be based on the user's preferences or on preset fixed parameters.
It should be understood that the above are examples; this application places no limitation on the specific implementation of the music matching process. A sketch of several of these strategies follows.
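A minimal sketch of the duration-and-style filtering plus three of the selection strategies above, assuming a hypothetical music-library schema with "style", "duration", "popularity", and "play_count" fields; none of these names come from the patent.

```python
import random

def pick_background_music(candidates, style, total_clip_duration, by="popularity"):
    """Pick one track whose style matches and whose duration covers the clips."""
    pool = [m for m in candidates
            if m["style"] == style and m["duration"] >= total_clip_duration]
    if not pool:
        return None
    if by == "random":                # strategy 1: random choice within the style
        return random.choice(pool)
    if by == "popularity":            # strategy 2: hottest track
        return max(pool, key=lambda m: m["popularity"])
    if by == "user_preference":       # strategy 3: most-played by this user
        return max(pool, key=lambda m: m["play_count"])
    raise ValueError(f"unknown strategy: {by}")

library = [
    {"style": "soothing", "duration": 95, "popularity": 7, "play_count": 3},
    {"style": "cheerful", "duration": 120, "popularity": 9, "play_count": 12},
]
print(pick_background_music(library, "cheerful", 90))
```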
Step S570: match the M video clips with the background music to obtain the processed video (an example of the first video).
Exemplarily, the ordering of the M video clips may be determined based on the rhythm of the background music, so that the picture content of the M video clips matches the music rhythm.
It should be understood that the rhythm matching process is intended to blend the M video clips and the background music better, so that the image content in the M video clips fits the rhythm of the music. For example, scenery content may correspond to the prelude or a soothing part of the background music, while a scene of the user exercising may correspond to the climax of the background music. Rhythm matching aligns the M video clips with the beats of the background music more closely and improves the quality of the processed video.
Optionally, the M video clips and the background music may be input into a pre-trained audio-visual rhythm matching model to obtain the position information of all or some of the M video clips. The audio-visual rhythm matching model may include an audio encoder, a video encoder, and a similarity measurement module, where the audio encoder is used to extract the audio features of the background music, the video encoder is used to extract video features, and the similarity measurement module is used to measure the similarity between the audio features and the video features. For the implementation, see the related descriptions of Figures 16 and 17 below.
It should be noted that, in the implementation of this application, the network of the audio-visual rhythm matching model may be a deep neural network; for example, the audio-visual rhythm matching model may adopt the structure of the Transformer model shown in Figure 2, and it may be trained with a contrastive learning approach.
Exemplarily, the audio-visual rhythm matching model may be a neural network, and the model to be trained may be trained on sample music clips to obtain the trained audio-visual rhythm matching model. For example, the overall training architecture may be a contrastive learning model; when constructing training data pairs, pairs in which the background music matches the video content serve as positive examples and pairs in which they do not match serve as negative examples, and the video encoder and the audio encoder are trained so that the similarity of positive pairs is greater than that of negative pairs.
It should be understood that the audio-visual rhythm matching model may use a multimodal pre-training architecture supporting two different types of input data at the same time; through cross-modal contrastive learning, the two modalities (here, video and audio) are mapped into a unified space, improving the model's cross-modal understanding.
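A minimal sketch of the contrastive training objective just described, assuming a symmetric InfoNCE-style loss over a batch of matched (video, audio) embeddings; the loss form and temperature value are illustrative assumptions, not the patent's specified objective.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb: torch.Tensor, audio_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched (video, audio) pairs.

    Row i of each tensor is one training pair: matched pairs (the diagonal)
    are positives, every other combination in the batch acts as a negative,
    so training pushes matched similarities above mismatched ones.
    """
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature          # pairwise cosine similarities
    targets = torch.arange(len(v))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```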
In the embodiments of this application, by performing rhythm matching between the M video clips and the background music, the M video clips can be ordered according to the rhythm of the background music so that their picture content fits the music rhythm, that is, the image content of the M video clips lands on the beats of the background music. Compared with matching the videos to the music directly in their input order, the solution of this application improves the consistency between the image content in the video and the rhythm of the background music, and improves the user experience.
Optionally, since the duration of the background music is greater than or equal to the total duration of the M video clips, when the background music is longer than the M video clips, the last of the M video clips may be played in slow motion; alternatively, transition effects may be added, or the M video clips may be replayed after their content finishes, and so on. A minimal sketch of the slow-motion strategy follows.
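This sketch assumes the playback rate of the last clip is stretched uniformly to fill the leftover music; the formula is an illustrative assumption, since the patent does not specify how the slow-motion factor is chosen.

```python
def slow_motion_factor(music_duration: float, clips_duration: float,
                       last_clip_duration: float) -> float:
    """Playback-rate factor (<= 1.0) for the last clip so that the clips
    exactly fill the background music; 1.0 means no slow motion is needed."""
    gap = music_duration - clips_duration
    if gap <= 0:
        return 1.0
    return last_clip_duration / (last_clip_duration + gap)

# 60 s of music, 55 s of clips, last clip 10 s: play it at ~0.67x speed
print(round(slow_motion_factor(60, 55, 10), 2))  # 0.67
```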
Optionally, in step S570 the M video clips may also be matched with the background music based on their upload order, or based on the order of their timestamp information, to obtain the processed video.
It should be noted that the above uses the editing of N videos included in the gallery application as an example; the solution of this application is also applicable to editing photos in the gallery application, where the photos may include, but are not limited to, GIF animations, JPEG images, PNG images, and the like.
In the embodiments of this application, the image content information of the N videos can be converted into text description information; the video topic information of the N videos is obtained from their text description information; M video clips are selected from the N videos based on the degree of correlation between the images in the N videos and the video topic information; and the processed video is obtained from the M video clips and the background music. In the solution of this application, the video topic information of the N videos is obtained from their text description information; compared with deriving it from the image information of the N videos, text carries richer information than images, and multiple pieces of text are linguistically related, so obtaining the video topic information from the text description information improves the accuracy of the topic information. In addition, in the embodiments of this application, the M video clips most relevant to the video topic can be determined based on the correlation between the images in the N videos and the video topic information. On the one hand, the solution of this application can effectively delete the clips in the N videos that are irrelevant to the video topic information, ensuring that the selected clips are related to it; on the other hand, when computing the similarity confidence value between each clip and the video topic information, each clip is obtained by selecting consecutive frames within a video, so the continuity of the clips is good, which improves the video quality of the edited video.
Further, in the embodiments of this application, the background music of the M video clips is selected based on the video topic information of the N videos, and the M video clips can be ordered based on the rhythm of the background music, so that the picture content of the clips fits the music rhythm. Compared with matching the videos to the music directly in their input order, the solution of this application improves the consistency between the image content in the video and the rhythm of the background music, and improves the user experience.
Exemplarily, the implementation of step S540 and step S550 in Figure 13 is described in detail below with reference to Figures 14 and 15.
Figure 14 is a schematic flowchart of a method for determining, among N videos, M video clips related to the video topic information, provided by an embodiment of this application. The method may be executed by the electronic device shown in Figure 1 and includes steps S551 to S555, which are described in detail below.
Step S551: perform feature extraction on the N videos based on the image encoder in the similarity evaluation model to obtain the image features in the N videos.
It should be noted that, in the implementation of this application, the network of the similarity evaluation model may be a deep neural network; for example, the similarity evaluation model may adopt the structure of the Transformer model shown in Figure 2, and it may be trained with a contrastive learning approach.
Optionally, in the embodiments of this application, a training data set may be obtained to train the similarity evaluation model to be trained, yielding the trained similarity evaluation model. For example, the overall training architecture of the similarity evaluation model may be a contrastive learning model; when constructing training data pairs, pairs in which the text description information matches the video topic information serve as positive examples and pairs in which they do not match serve as negative examples, and the image encoder and the text encoder are trained so that the similarity of positive pairs is greater than that of negative pairs.
For example, the training data set includes sample videos, video topic information that matches each sample video, and video topic information that does not match it. For instance, a sample video may be a travel video whose matching video topic information is the text "travel" and whose non-matching video topic information is the text "sports". Learning from a large training data set enables the similarity evaluation model to recognize matched text and image features: the distance value output by the similarity measurement module of the model to be trained becomes smaller when matched text and image features are input and larger when mismatched features are input; equivalently, the similarity confidence value output by the module becomes larger for matched text and image features and smaller for mismatched ones.
It should be understood that the trained similarity evaluation model can recognize matched text features and image features.
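For illustration, a sketch of how matched and mismatched training pairs might be assembled from a labeled sample set; the file names and topic labels are hypothetical.

```python
# Each sample pairs a video with its matching topic (a positive example);
# pairing the same video with any other topic yields a negative example.
samples = [("travel_video.mp4", "travel"), ("game_video.mp4", "sports")]
topics = ["travel", "sports", "party"]

positives = [(video, topic) for video, topic in samples]
negatives = [(video, t) for video, topic in samples
             for t in topics if t != topic]

print(len(positives), len(negatives))  # 2 positive pairs, 4 negative pairs
```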
Optionally, image feature extraction may be performed on every frame of the N videos through the image encoder in the similarity evaluation model to obtain all image features included in the N videos.
Optionally, the image features in the N videos may be extracted through the image encoder in the similarity evaluation model at a fixed frame interval, obtaining some of the image features in the N videos.
For example, one frame may be extracted every 4 frames, so the 1st, 5th, 9th, and 13th frames of a video are extracted, and so on.
It should be understood that the above is an example. In the embodiments of this application, for one of the N videos, all image features may be extracted by traversing every frame, or some image features may be extracted at equal frame intervals; this application places no limitation on this.
Optionally, the similarity evaluation model may be as shown in Figure 15; it may include a text encoder, an image encoder, and a similarity measurement module (an example of the first similarity measurement module), where the text encoder is used to extract text features, the image encoder is used to extract image features, and the similarity measurement module is used to measure the similarity between text features and image features.
Exemplarily, the similarity evaluation model may be a contrastive learning model.
Step S552: perform feature extraction on the video topic information based on the text encoder in the similarity evaluation model to obtain the text features of the video topic information.
It should be understood that a text feature is a set of attributes, obtained by vectorizing a word or sentence and applying some subsequent mapping, that can characterize its specific semantics.
Step S553: obtain the similarity confidence values between the image features and the text features based on the similarity measurement module in the similarity evaluation model.
Exemplarily, the similarity evaluation model may extract the image features in the N videos and the text features of the video topic information, and compare the image features with the text features to obtain their similarity. The similarity evaluation model may output a distance value or a similarity confidence value: if it outputs a distance value, a smaller distance indicates a higher similarity between the image features and the text features, and the similarity confidence value can be derived from the distance value; if it outputs a similarity confidence value directly, a larger confidence value indicates a higher similarity between the image features and the text features.
For example, the distance value may be the cosine value between the image features and the text features.
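A minimal sketch of the cosine computation and one possible mapping from it to a confidence in [0, 1]; the linear mapping is an illustrative assumption, not the patent's specified conversion.

```python
import math

def cosine_similarity(x: list[float], y: list[float]) -> float:
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def similarity_confidence(image_feat, text_feat) -> float:
    """Map the cosine value from [-1, 1] to a confidence in [0, 1]."""
    return (cosine_similarity(image_feat, text_feat) + 1) / 2

print(similarity_confidence([1.0, 0.0], [0.8, 0.6]))  # 0.9
```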
Step S554: select consecutive multi-frame image features in a video based on the similarity confidence values between the image features and the text features, obtaining one video clip.
Exemplarily, as shown in Figure 15, for one video, a similarity curve between the image features in the video and the text features of the video topic information can be obtained; based on the similarity curve, one or more video clips can be selected from the video, each clip comprising consecutive frames.
In the embodiments of this application, multiple consecutive frames related to the video topic are selected to obtain a video clip; the solution of this application ensures that the selected clip is related to the overall video topic.
Step S555: select M video clips from the N videos based on the similarity confidence values between the image features and the text features.
Exemplarily, for one video, the similarity curve between its image features and the text features of the video topic information can be obtained; based on the similarity curve, the images related to the video topic can be determined over the video as a whole, and consecutive frames can then be extracted from the video to obtain a clip.
Exemplarily, as shown in Figure 15, assume the N videos are video 310, video 312, and video 314. Curve 561 is the similarity curve between the image features in video 310 and the text features of the video topic information; curve 562 is the similarity curve for video 312; curve 563 is the similarity curve for video 314. Based on curve 561, image 3101 and image 3102 in video 310 may be selected to form video clip 1; based on curve 562, image 3121, image 3122, and image 3123 in video 312 may be selected to form video clip 2; based on curve 563, image 3141, image 3142, image 3143, and image 3144 in video 314 may be selected to form video clip 3.
It should be understood that Figure 15 is an example; two or more video clips may also be selected from a single video, and the two clips may be consecutive or non-consecutive (for example, frames 1 to 5 form one clip and frames 10 to 13 form another). For any single clip, however, the frames it includes are consecutive. Alternatively, no clip at all may be selected from a video; whether a clip is selected depends on the similarity confidence values between the image features in the video and the video topic information. If a video contains no image features related to the video topic, no clip from that video need be selected.
Optionally, in the embodiments of this application, M may be greater than, equal to, or less than N; the value of M is determined by the similarity confidence value between each video clip in the N videos and the video topic information.
It should be understood that, in the solution of this application, if the similarity confidence values between all images in a video and the video topic information are less than or equal to a preset threshold, the video is unrelated to the video topic information and none of its clips need be retained; if the similarity confidence values of some or all images in a video exceed the preset threshold, some or all clips in that video may be retained.
In the embodiments of this application, the pre-trained similarity evaluation model can identify the image features in the N videos that are related to the overall video topic; based on these image features, the M video clips related to the video topic are selected from the N videos, and the clips irrelevant to the video topic are discarded. On the one hand, the solution of this application can effectively delete the clips in the N videos that are irrelevant to the video topic information, ensuring that the selected clips are related to it; the edited video is then obtained from the selected clips and the background music, improving the video quality of the edited video.
Exemplarily, the implementation of step S570 in Figure 13 is described in detail below with reference to Figures 16 and 17.
Figure 16 is a flowchart of a method for matching M video clips with background music provided by an embodiment of this application. The method may be executed by the electronic device shown in Figure 1 and includes steps S571 to S574, which are described in detail below.
Step S571: perform feature extraction on the background music based on the audio encoder in the audio-visual rhythm matching model to obtain the audio features.
Exemplarily, the audio-visual rhythm matching model may be as shown in Figure 17; it may include an audio encoder, a video encoder, and a similarity measurement module, where the audio encoder is used to extract the audio features of the background music, the video encoder is used to extract video features, and the similarity measurement module is used to measure the similarity between the audio features and the video features.
Step S572: perform feature extraction on the M video clips based on the video encoder in the audio-visual rhythm matching model to obtain the video features.
It should be understood that one video feature includes multiple frames of image features; the M video clips correspond to M video features.
Step S573: obtain the similarity confidence values between the audio features and the video features based on the similarity measurement module in the audio-visual rhythm matching model.
Exemplarily, the background music may be divided into multiple segments of audio features; by traversing the correlation between each of the M video features and each of the audio segments, the audio segment with the highest similarity to each video feature is obtained. Based on the positions of the audio segments within the overall background music, the position of the corresponding video clip in the ordering of the M video clips can be determined.
Exemplarily, the similarity measurement module in the audio-visual rhythm matching model may output a distance value between the audio features and the video features: a larger distance value indicates a lower similarity between the audio and video features, hence a lower correlation and a smaller similarity confidence value; a smaller distance value indicates a higher similarity, hence a higher correlation and a larger similarity confidence value. For example, the distance value may be the cosine value between the audio features and the video features.
Step S574: obtain, based on the similarity confidence values, the best matching positions of the M video clips relative to the background music.
Exemplarily, based on the similarity confidence values, the best positions at which the M video clips match the background music can be obtained, so that the image content of the M video clips matches the rhythm of the background music.
For example, suppose the M video clips are video clip 1, video clip 2, and video clip 3, and the background music is divided into three audio segments: audio feature 1, audio feature 2, and audio feature 3. The correlations between audio feature 1 and video clips 1, 2, and 3 are evaluated to find, among the three clips, the one that best matches audio feature 1; likewise, the correlations between audio feature 2 and the three clips yield the clip that best matches audio feature 2, and the correlations between audio feature 3 and the three clips yield the clip that best matches audio feature 3. Finally, the video clip corresponding to each audio feature can be output.
For example, as shown in Figure 17, assuming the M video clips are three video clips, the audio-visual rhythm matching model may output audio feature 1 corresponding to video clip 3, audio feature 2 corresponding to video clip 2, and audio feature 3 corresponding to video clip 1, thereby obtaining an ordering of the M video clips that matches the rhythm of the background audio.
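A minimal sketch of this pairing step, assuming a precomputed similarity matrix and a greedy one-to-one assignment (and at least as many clips as audio segments); the greedy strategy is an illustrative assumption, since the patent does not specify the assignment algorithm.

```python
def match_clips_to_music(sim: list[list[float]]) -> list[int]:
    """Given sim[i][j] = similarity between audio segment i and video clip j,
    return for each audio segment the index of its best-matching clip,
    using each clip at most once."""
    used, order = set(), []
    for row in sim:
        best = max((j for j in range(len(row)) if j not in used),
                   key=lambda j: row[j])
        used.add(best)
        order.append(best)
    return order

# Audio segment 0 -> clip 2, segment 1 -> clip 1, segment 2 -> clip 0,
# as in the three-clip example above.
sim = [[0.1, 0.3, 0.9],
       [0.2, 0.8, 0.4],
       [0.7, 0.5, 0.2]]
print(match_clips_to_music(sim))  # [2, 1, 0]
```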
Optionally, if the video image content is scenery, it may correspond to the prelude or a soothing part of the background music; if the video image content is a scene of the user exercising, it may correspond to the climax of the background music.
Optionally, in the embodiments of this application, a training data set may be obtained to train the audio-visual rhythm matching model to be trained, yielding the trained audio-visual rhythm matching model. The training data set includes sample matched music clips and sample mismatched music clips: a matched music clip is one in which the music matches the image content, and a mismatched music clip is one in which the music and the image content do not match. For example, mixing the background music of music clip 1 with the images of music clip 2 yields a sample mismatched music clip. Learning from a large training data set enables the audio-visual rhythm matching model to order the input M video clips based on the rhythm of the input background music.
In the embodiments of this application, ordering the M video clips through the audio-visual rhythm matching model makes it possible to sort them according to the rhythm of the background music so that the picture content of the clips fits the music rhythm. Compared with matching the videos to the music directly in their input order, the solution of this application improves the consistency between the image content in the video and the rhythm of the background music, and improves the user experience.
Optionally, in the embodiments of this application, the electronic device detects the N videos selected by the user; the N videos may be videos with a strong storyline, or the N videos may be videos without a strong storyline. The video editing method for videos without a strong storyline and the method for videos with a strong storyline are described in detail below with reference to Figures 18 and 19, respectively.
It should be understood that videos with a strong storyline are N videos with causal relations among them; the video editing method can identify the cause-and-effect relations among the N videos and order them according to those relations. For example, videos with a strong storyline may include travel-themed or trip-themed videos. Videos without a strong storyline are N videos of equal standing, with no strong causal relation among them; for example, videos without a strong storyline may include sports-themed videos.
Exemplarily, videos with a strong storyline may include videos whose topic is travel. For example, the N videos include a video of packing luggage at home, a video of taking a taxi to the airport, a video of riding a plane, and a video of walking on the beach after arriving at the destination. These four videos are causally connected: one first packs the luggage, then takes a plane to the destination, and then travels around the destination.
Exemplarily, videos without a strong storyline may include videos whose topic is sports. For example, the N videos include a video of running on a basketball court, a video of a layup, and a video of passing the ball on a basketball court. These three videos have no strong causal relation: within one game there can be many layups, passes, and runs, so there is no unique required ordering of the three videos.
Implementation 1: for videos without a strong storyline, obtain N videos; obtain the video topic of the N videos based on their text description information; determine M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video topic; determine the background music based on the M video clips and the video topic; match the best positions of the M video clips based on the rhythm of the background music; and generate the processed video.
Figure 18 is a schematic flowchart of a video editing method provided by an embodiment of this application. The video editing method 600 may be executed by the electronic device shown in Figure 1 and includes steps S610 to S680, which are described in detail below.
Step S610: obtain N videos.
Exemplarily, the N videos may be videos stored in the electronic device; the N videos may be videos captured by the electronic device, or some or all of the N videos may be downloaded videos. This application places no limitation on the source of the N videos.
For example, the electronic device detects the user's tap operations on N videos in the gallery application and may thereby obtain the N videos.
Optionally, for N videos without a strong storyline, the N videos may be ordered based on the order in which they were uploaded, or based on the videos' timestamp information (for example, the time at which a video was recorded or downloaded).
Step S620: obtain the text description information of the N videos through the image-to-text conversion model.
Exemplarily, one video may correspond to one piece of text description information; N videos yield N pieces of text description information through the image-to-text conversion model.
Optionally, the image-to-text conversion model may be used to convert a video into text information; that is, the image information included in the video can be converted into text description information, and the image content included in the images is described by the text description information.
Exemplarily, the image-to-text conversion model may include a CLIP model.
Step S630: input the text description information of the N videos into the pre-trained video topic classification model to obtain the video topic information.
It should be understood that a video topic may refer to the theme associated with the overall image content of a video; different video topics may lead to different video processing, such as different music, different transition effects, different image-processing filters, or different video editing approaches.
It should be noted that, in the embodiments of this application, the video topic information of the N videos is a single piece of topic information, that is, the video topic information corresponding to the N videos as a whole.
Exemplarily, the pre-trained video topic classification model may be a pre-trained text classification model, and the text classification model may be a deep neural network.
Optionally, the video topic classification model may be trained on the following training data set: the training data set includes sample text description information and video topic text information, where the sample text description information corresponds to the video topic information; the sample text description information may be one or more sentences of text, and the video topic text information may be a phrase of text.
For example, the sample text description information may include "multiple people are eating", "multiple people are playing games", and "multiple people are talking", with the corresponding video topic text information being "party"; as another example, the sample text description information may include "an adult and a child are taking photos" and "an adult and a child are playing games", with the corresponding video topic information being "parent-child".
It should be understood that the above is merely illustrative; the embodiments of this application place no limitation on the sample text description information or the sample video topic information.
Exemplarily, inputting one video into the image-to-text conversion model yields one piece of text description information, so the N videos yield N pieces of text description information; inputting the N pieces of text description information into the pre-trained video topic classification model yields the corresponding video topic information, which may include, but is not limited to, travel, party, pets, sports, scenery, parent-child, work, and so on. In the embodiments of this application, when identifying the video topic information of the N videos, it is obtained from the text description information of the N videos; compared with deriving it from the image information of the N videos, text carries richer information than images, and multiple pieces of text are linguistically related, so obtaining the video topic information from the text description information improves the accuracy of the topic information. For example, the N videos include a video of a user packing luggage, a video of the user taking a car to the airport, a video of the user on a plane, and a video of the user walking on the beach; from image information alone, only tags such as clothing, suitcase, user, and seaside might be obtained, from which the topic "travel" cannot be abstracted. By contrast, when identifying the topic of the N videos from their text description information, the linguistic and logical relations among the descriptions, such as "a user is packing luggage", "a user is on a plane", and "a user is walking on the beach", allow the video topic "travel" to be abstracted accurately. Therefore, obtaining the video topic information of the N videos from their text description information improves the accuracy of the topic information.
Optionally, if the topic information output in step S630 is a single piece of topic information, no user operation is required; if two or more pieces of topic information are output, a prompt box may be displayed on the electronic device; the prompt box may include candidate video topic information, and the video topic information of the N videos is determined based on the user's operation on the candidates in the prompt box.
Exemplarily, as shown in Figure 11, if two pieces of topic information are output in step S630, a display interface 319 may be shown on the electronic device; the display interface 319 includes a prompt box 320 with two candidate video topics, "scenery" and "travel". If the electronic device detects that the user taps "scenery", the video topic information of the N videos is scenery; if it detects that the user taps "travel", the video topic information of the N videos is travel.
Optionally, for the implementation of step S630, see the related description of step S530 in Figure 13.
Step S640: obtain, based on the similarity evaluation model, similarity confidence values between the image features in the N videos and the video topic information.
It should be understood that the similarity evaluation model may be a pre-trained neural network model; it is used to output the correlation between the image features included in each of the N videos and the video topic information. As shown in Figure 15, the similarity evaluation model may include an image encoder, a text encoder, and a similarity measurement module; the image encoder performs feature extraction on the images in a video to obtain image features, the text encoder performs feature extraction on the video topic information to obtain text features, and the similarity measurement module evaluates the similarity between the image features and the text features.
Exemplarily, based on the similarity evaluation model, the image features in the N videos and the text features of the video topic information can be extracted, and the image features are compared with the text features to obtain the similarity between them. The similarity evaluation model may output a distance metric, or it may output a similarity confidence value. If the model outputs a distance metric, a smaller distance metric indicates a higher similarity between the image features and the text features, and a similarity confidence value can be derived from the distance metric. If the model outputs a similarity confidence value directly, a larger similarity confidence value indicates a higher similarity between the image features and the text features.
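A minimal sketch of such a dual-encoder similarity evaluation, assuming CLIP-style encoders that map images and text into a shared embedding space; the encoder objects and the distance-to-confidence mapping are illustrative assumptions, not the model structure fixed by this application:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def frame_topic_confidences(frames, topic: str, image_encoder, text_encoder):
    """Return one similarity confidence value per frame for a given topic."""
    text_feat = text_encoder.encode(topic)        # text feature of the topic
    confidences = []
    for frame in frames:
        img_feat = image_encoder.encode(frame)    # image feature of one frame
        distance = 1.0 - cosine_similarity(img_feat, text_feat)
        # One possible mapping from a distance metric to a confidence in (0, 1]:
        # the smaller the distance, the larger the confidence.
        confidences.append(1.0 / (1.0 + distance))
    return confidences
```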
In embodiments of this application, all image features in the N videos may be extracted, or only some of the image features in the N videos may be extracted; this application imposes no limitation on this.
Optionally, image feature extraction may be performed on every frame of the N videos by the image encoder in the similarity evaluation model, obtaining all the image features included in the N videos.
Optionally, image features in the N videos may be extracted by the image encoder in the similarity evaluation model at a fixed frame interval, obtaining a subset of the image features in the N videos.
For example, with one frame extracted every 4 frames, the 1st frame, 5th frame, 9th frame, 13th frame, and so on, of a video among the N videos would be extracted.
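A sketch of this fixed-interval sampling, assuming a simple stride rule with the example's interval:

```python
def sample_frames(frames: list, stride: int = 4) -> list:
    # Indices 0, 4, 8, 12, ... i.e. the 1st, 5th, 9th, 13th frames.
    return frames[::stride]
```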
It should be understood that the above is merely an example. In embodiments of this application, for any one of the N videos, all image features in the video may be extracted by traversing every frame, or a subset of the image features may be extracted at equal frame intervals; this application imposes no limitation on this.
Optionally, for the specific description of step S640, refer to the related description of step S540 in Figure 13, the related descriptions of steps S551 to S553 in Figure 14, or the related description of Figure 15.
Step S650: obtain M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video topic information.
Exemplarily, based on the similarity confidence values between the image features included in the N videos and the text features of the video topic information, the image features of multiple consecutive frames can be selected from the N videos to form one video clip.
Exemplarily, as shown in Figure 15, for one video, a similarity curve between the image features in that video and the video topic information can be obtained; based on the similarity curve, multiple consecutive frames can be selected from the video to obtain a video clip.
In embodiments of this application, multiple consecutive frames related to the video topic can be selected from a video to obtain a video clip; the solution of this application ensures that the selected video clips are related to the overall video topic.
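One simple way to turn a per-frame similarity curve into a clip, assuming a plain threshold rule over consecutive frames; the threshold value and the "longest run" choice are illustrative assumptions, not the selection rule specified here:

```python
def select_clip(confidences: list, threshold: float = 0.6) -> tuple:
    """Return (start, end) frame indices of the longest consecutive run
    whose similarity confidence stays above the threshold."""
    best = (0, 0)
    start = None
    for i, c in enumerate(confidences + [float("-inf")]):  # sentinel closes the last run
        if c > threshold and start is None:
            start = i                                      # run begins
        elif c <= threshold and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)                          # keep the longest run
            start = None
    return best
```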
Step S660: based on the durations of the M video clips and the video topic information, perform music matching processing in the candidate music library to obtain background music.
Exemplarily, the total duration of the background music can be determined from the durations of the M video clips; the background music selected during music matching usually needs to be longer than or equal to the total duration of the M video clips. The music style of the background music can be determined from the video topic information.
For example, if the video topic is party, the background music is in a cheerful music style; if the video topic is scenery, the background music is in a soothing music style.
It should be understood that the above is merely an example; this application imposes no limitation on the pairing of video topics and background music styles.
Exemplarily, the total duration of the background music can be determined from the durations of the M video clips, and the music style of the background music can be determined from the video topic information; background music can then be selected at random from the candidate music library based on the total duration and the music style.
Exemplarily, the total duration of the background music can be determined from the durations of the M video clips, and the music style can be determined from the video topic information; background music can be selected from the candidate music library in order of music popularity based on the total duration and the music style.
Exemplarily, the total duration of the background music can be determined from the durations of the M video clips, and the music style can be determined from the video topic information; background music can be selected from the candidate music library according to the user's preferences based on the total duration and the music style.
For example, background music satisfying the total duration and the music style may be selected from the candidate music library based on how frequently the user plays each piece of music.
Exemplarily, the total duration of the background music can be determined from the durations of the M video clips, and the music style can be determined from the video topic information; the music in the candidate music library with the highest matching degree to the video topic may be selected as the background music.
Exemplarily, the total duration of the background music can be determined from the durations of the M video clips, and the music style can be determined from the video topic information; multiple pieces of music may be selected from the candidate music library and edited together to obtain the background music, where the weights or durations of the multiple pieces of music may be based on the user's preferences or on preset fixed parameters.
It should be understood that the above are examples; this application imposes no limitation on the specific implementation of the music matching processing.
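A sketch combining the variants above (a duration filter, a style filter, then one of several ranking rules); the `Track` fields and the topic-to-style table are illustrative assumptions, not data structures defined by this application:

```python
from dataclasses import dataclass

@dataclass
class Track:
    title: str
    duration_s: float
    style: str         # e.g. "cheerful", "soothing"
    popularity: float  # e.g. aggregate play count in the library
    user_plays: int    # how often this user has played the track

TOPIC_TO_STYLE = {"party": "cheerful", "scenery": "soothing"}  # assumed mapping

def match_music(library: list, clip_durations: list,
                topic: str, rank_by: str = "popularity") -> Track:
    total = sum(clip_durations)
    style = TOPIC_TO_STYLE.get(topic, "soothing")
    # Keep tracks long enough to cover all clips and matching the style.
    candidates = [t for t in library if t.duration_s >= total and t.style == style]
    key = (lambda t: t.popularity) if rank_by == "popularity" else (lambda t: t.user_plays)
    return max(candidates, key=key)  # raises if no candidate fits; a real system would fall back
```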
Step S670: input the M video clips and the background music into the pre-trained audio-visual rhythm matching model to obtain the sorted M video clips.
Exemplarily, the audio-visual rhythm matching model may be a neural network; the model to be trained can be trained with sample music clips to obtain the trained audio-visual rhythm matching model. For example, as shown in Figure 17, the audio-visual rhythm matching model may include an audio encoder, a video encoder, and a similarity measurement module; the audio encoder is used to extract the audio features of the background music, the video encoder is used to extract video features, and the similarity measurement module is used to measure the similarity between the audio features and the video features.
In one example, the audio-visual rhythm matching model may output a distance metric, which represents the distance between an audio feature and a video feature; a larger distance metric indicates a lower similarity between the audio feature and the video feature, and a smaller distance metric indicates a higher similarity. A similarity confidence value between the audio feature and the video feature can be derived from the distance metric.
In one example, the audio-visual rhythm matching model may output a similarity confidence value, which represents the probability that an audio feature and a video feature are similar; a larger similarity confidence value indicates a higher similarity between the audio feature and the video feature, and a smaller similarity confidence value indicates a lower similarity.
Optionally, in embodiments of this application, a training data set may be obtained to train the audio-visual rhythm matching model to be trained, obtaining the trained audio-visual rhythm matching model. The training data set includes sample matched music clips and sample unmatched music clips; a sample matched music clip is a music clip whose music matches its image content, and a sample unmatched music clip is one whose music does not match its image content. For example, mixing the background music of music clip 1 with the image video of music clip 2 yields a sample unmatched music clip. By learning from a large training data set, the audio-visual rhythm matching model becomes able to sort the input M video clips according to the rhythm of the input background music.
For example, the background music and the M video clips can be input into the pre-trained audio-visual rhythm matching model, which outputs an ordering of the M video clips. Suppose the M video clips include video clip 1, video clip 2, and video clip 3, and the background music is divided into three audio feature segments: audio feature 1, audio feature 2, and audio feature 3. The correlations between audio feature 1 and video clip 1, video clip 2, and video clip 3 are evaluated to find, among the three video clips, the one that best matches audio feature 1; the correlations between audio feature 2 and the three video clips are evaluated to find the video clip that best matches audio feature 2; and the correlations between audio feature 3 and the three video clips are evaluated to find the video clip that best matches audio feature 3. Finally, the video clip corresponding to each audio feature segment can be output.
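A sketch of that segment-by-segment matching, assuming the model exposes a pairwise similarity score; the greedy one-to-one assignment is an illustrative reading of the example above, not the patent's stated algorithm:

```python
def order_clips(audio_segments: list, clips: list, similarity) -> list:
    """For each audio segment in playback order, pick the not-yet-used clip
    with the highest audio-video similarity; `similarity(a, v)` is assumed
    to return the model's confidence that audio segment a matches clip v."""
    remaining = list(range(len(clips)))
    ordered = []
    for audio in audio_segments:
        best = max(remaining, key=lambda i: similarity(audio, clips[i]))
        ordered.append(clips[best])
        remaining.remove(best)  # each clip is used at most once
    return ordered
```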
In embodiments of this application, sorting the M video clips with the audio-visual rhythm matching model makes it possible to order the M video clips according to the rhythm of the background music, so that the picture content of the video clips matches the music rhythm. Compared with matching the videos to the music directly in their input order, the solution of this application can improve the consistency between the image content in the video and the rhythm of the background music, improving the user experience.
Step S680: obtain the processed video based on the sorted M video clips and the background music.
Exemplarily, the processed video can be obtained from the video content of the sorted M video clips and the audio information of the background music.
Optionally, after the background music is added to the sorted M video clips, other editing processing may also be performed on the video to obtain the processed video; the other editing processing may include adding image special effects to the video, adding text to the video, adding transition animation effects to the video, and so on.
It should be understood that a video transition effect refers to applying certain techniques between two scenes (for example, two pieces of material), such as wipes, dissolves, or page curls, to achieve a smooth transition between scenes or plot points, or to enrich the picture and attract the audience.
It should be noted that, apart from the above description, for the parts of Figure 18 that are the same as in Figures 12 to 17, refer to the related descriptions of Figures 12 to 17; details are not repeated here.
Implementation 2: for videos with a strong story line, obtain N videos; obtain the video topic of the N videos based on the text description information of the N videos; sort the N videos based on their text description information to obtain the sorted N videos; determine the sorted M video clips based on the similarity confidence values between the video clips included in the sorted N videos and the video topic; determine background music matching the sorted M video clips based on the sorted M video clips and the video topic information; and generate the processed video.
Figure 19 is a schematic flow chart of a video editing method provided by an embodiment of this application. The video editing method 700 can be executed by the electronic device shown in Figure 1; the video editing method 700 includes steps S710 to S780, which are described in detail below.
Step S710: obtain N videos.
Exemplarily, the N videos may be videos stored in the electronic device; the N videos may be videos captured by the electronic device, or some or all of the N videos may be downloaded videos; this application imposes no limitation on the source of the N videos.
For example, the electronic device detects the user's tap operations on N videos in the gallery application and can thereby obtain the N videos.
Step S720: obtain the text description information of the N videos through the image-to-text conversion model.
Exemplarily, one video may correspond to one piece of text description information; N pieces of text description information can be obtained from the N videos through the image-to-text conversion model.
Optionally, the image-to-text conversion model can be used to convert a video into text information; that is, the image information included in the video can be converted into text description information, and the image content included in the images is described by the text description information.
Exemplarily, the image-to-text conversion model may include a CLIP model.
Step S730: input the text description information of the N videos into the pre-trained video topic classification model to obtain the video topic information.
It should be understood that a video topic may refer to the thematic idea associated with the overall image content of a video; different video topics may call for different video processing, for example different music, different transition effects, different image processing filters, or different video editing approaches.
It should be noted that, in embodiments of this application, the video topic information of the N videos is a single piece of topic information, that is, the video topic information corresponding to the N videos as a whole.
Exemplarily, the pre-trained video topic classification model may be a pre-trained text classification model, and the text classification model may be a deep neural network.
Optionally, the video topic classification model may be obtained by training on the following training data set: the training data set includes sample text description information and video topic text information, where the sample text description information corresponds to the video topic text information; the sample text description information may be one or more sentence texts, and the video topic text information may be a phrase text.
For example, the sample text description information may include "several people are eating", "several people are playing games", and "several people are talking", and the video topic text information corresponding to this sample description text may be "party". For another example, the sample text description information may include "an adult and a child are taking photos" and "an adult and a child are playing a game", and the video topic information corresponding to this sample description text is "parent-child".
It should be understood that the above are examples; the embodiments of this application impose no limitation on the sample text description information or the sample video topic information.
Exemplarily, inputting one video into the image-to-text conversion model yields one piece of text description information, so N videos yield N pieces of text description information; inputting the N pieces of text description information into the pre-trained video topic classification model yields the video topic information corresponding to the N pieces of text description information, where the video topic information may include, but is not limited to: travel, party, pets, sports, scenery, parent-child, work, and the like. In embodiments of this application, when identifying the video topic information of the N videos, the video topic information is obtained from the text description information of the N videos. Compared with obtaining the video topic information from the image information of the N videos, text information carries richer information than image information; in addition, multiple pieces of text information are linguistically related to one another, so obtaining the video topic information from the text description information of the N videos can improve the accuracy of the topic information. For example, suppose the N videos include a video of a user packing luggage, a video of the user leaving home and taking a car to the airport, a video of the user on an airplane, and a video of the user at the beach. Based on image information alone, only labels such as clothing, suitcase, user, and seaside may be obtained, and the topic "travel" cannot be abstracted from these image labels. However, when the topic of the N videos is identified from their text description information, the video topic information can be obtained accurately from the logical, linguistic relations among the N pieces of text description information. For instance, from the text descriptions "a user is packing luggage", "a user is taking an airplane", and "a user is strolling on the beach", the video topic "travel" can be abstracted. Therefore, obtaining the video topic information of the N videos from their text description information can improve the accuracy of the topic information.
Optionally, if the topic information output in step S730 is a single piece of topic information, no user operation is required; if step S730 outputs two or more pieces of topic information, a prompt box may be displayed on the electronic device. The prompt box may include candidate video topic information, and the video topic information of the N videos is determined based on the user's operation on the candidate video topic information in the prompt box.
Exemplarily, as shown in Figure 11, if two pieces of topic information are output in step S730, display interface 319 may be displayed on the electronic device; display interface 319 includes prompt box 320, and prompt box 320 includes two pieces of candidate video topic information, namely scenery and travel. If the electronic device detects that the user taps "Scenery", the video topic information of the N videos is scenery; if the electronic device detects that the user taps "Travel", the video topic information of the N videos is travel.
Step S740: sort the N videos based on the text description information of the N videos to obtain the sorted N videos.
It should be noted that the embodiments of this application impose no limitation on the execution order of step S730 and step S740: step S730 may be executed before step S740, step S740 may be executed before step S730, or steps S730 and S740 may be executed simultaneously.
It should be understood that, for N videos with a strong story line, ordering the N videos by the order in which the user uploaded them, or by the timestamp information of the N videos, may produce an incorrect ordering. For example, suppose the N videos are 3 downloaded videos which, in download order, are: video 1, a person going home from the amusement park; video 2, a person riding the attractions at the amusement park; video 3, a person taking a car to the amusement park. Sorting by the videos' timestamp information gives the order video 1, video 2, video 3; however, the natural order of a day's outing is leaving home for the destination, arriving at the destination, and returning home from the destination, so the timestamp-based ordering above clearly violates the reasonable logical order of an outing. Therefore, for videos with a strong story line, the ordering of the N videos obtained from the videos' timestamps may be wrong, giving the user a poor viewing experience of the processed video. In the solution of this application, for videos with a strong story line, the N videos can be sorted based on the text description information of the videos, ensuring that the sorted N videos follow a natural cause-and-effect sequence and improving the user's viewing experience.
Exemplarily, in embodiments of this application, the ordering of the N videos can be obtained from the text description information of the N videos, based on the natural-language relations among the pieces of text description information.
For example, the text information of the N videos can be input into a pre-trained ordering model, which may be a neural network. The pre-trained ordering model may be obtained by training on a training data set through the back-propagation algorithm; the training data set may include a sample topic text and an ordering of multiple sample description texts, where the sample topic text corresponds to the multiple sample description texts. For instance, the sample topic text is "outing", and the ordering of the sample description texts is: sample description text 1, "a person leaves home"; sample description text 2, "a person is on the way to the destination"; sample description text 3, "a person arrives at the destination"; sample description text 4, "a person is active at the destination"; sample description text 5, "a person leaves the destination on the way back"; sample description text 6, "a person arrives back at the starting point". By learning from a large set of sample training data, the pre-trained ordering model can output the ordering of multiple input description texts.
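A sketch of ordering descriptions with a pairwise scoring rule; the `comes_before` scorer is a hypothetical stand-in for the trained ordering network, and this comparison-based approach is only consistent if the learned scores are transitive:

```python
from functools import cmp_to_key

def sort_descriptions(descriptions: list, comes_before) -> list:
    """Order text descriptions into a story line. `comes_before(a, b)` is
    assumed to return the trained model's probability that description a
    precedes description b in the story."""
    def compare(a: str, b: str) -> int:
        p = comes_before(a, b)
        return -1 if p > 0.5 else (1 if p < 0.5 else 0)
    return sorted(descriptions, key=cmp_to_key(compare))
```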
Step S750: obtain, based on the similarity evaluation model, similarity confidence values between the images in the N videos and the video topic information.
It should be understood that the similarity evaluation model may be a pre-trained neural network model; it is used to output the correlation between the image features included in each of the N videos and the video topic information. As shown in Figure 15, the similarity evaluation model may include an image encoder, a text encoder, and a similarity measurement module; the image encoder performs feature extraction on the images in a video to obtain image features, the text encoder performs feature extraction on the video topic information to obtain text features, and the similarity measurement module evaluates the similarity between the image features and the text features.
Exemplarily, based on the similarity evaluation model, the image features in the N videos and the text features of the video topic information can be extracted, and the image features are compared with the text features to obtain the similarity between them. The similarity evaluation model may output a distance metric, or it may output a similarity confidence value. If the model outputs a distance metric, a smaller distance metric indicates a higher similarity between the image features and the text features, and a similarity confidence value can be derived from the distance metric. If the model outputs a similarity confidence value directly, a larger similarity confidence value indicates a higher similarity between the image features and the text features.
In embodiments of this application, all image features in the N videos may be extracted, or only some of the image features in the N videos may be extracted; this application imposes no limitation on this.
Optionally, image feature extraction may be performed on every frame of the N videos by the image encoder in the similarity evaluation model, obtaining all the image features included in the N videos.
Optionally, image features in the N videos may be extracted by the image encoder in the similarity evaluation model at a fixed frame interval, obtaining a subset of the image features in the N videos.
For example, with one frame extracted every 4 frames, the 1st frame, 5th frame, 9th frame, 13th frame, and so on, of a video among the N videos would be extracted.
It should be understood that the above is merely an example. In embodiments of this application, for any one of the N videos, all image features in the video may be extracted by traversing every frame, or a subset of the image features may be extracted at equal frame intervals; this application imposes no limitation on this.
Step S760: based on the similarity confidence values between the images in the N videos and the video topic information, obtain the sorted M video clips.
Exemplarily, based on the similarity confidence values between the image features included in the N videos and the text features of the video topic information, the image features of multiple consecutive frames can be selected from the N videos to form a video clip. For example, when the similarity confidence values of multiple consecutive frames are greater than a preset threshold, a video clip composed of those frames is obtained.
It should be understood that, since the N videos are sorted videos, the M video clips selected from the sorted N videos are ordered video clips; that is, the sorted M video clips are obtained.
For example, suppose the N videos are 3 videos ordered as video 2, video 1, video 3. If video clip 2-1 and video clip 2-2 are selected from video 2, with video clip 2-1 earlier in time than video clip 2-2; video clip 1-1 is selected from video 1; and video clip 3-1 and video clip 3-2 are selected from video 3, with video clip 3-1 earlier in time than video clip 3-2; then the ordering of the 5 video clips is video clip 2-1, video clip 2-2, video clip 1-1, video clip 3-1, video clip 3-2.
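A sketch of how the clip order falls out of the video order, assuming clips are kept per video in within-video time order:

```python
def flatten_clips(sorted_videos: list) -> list:
    """Given videos already in story order, each holding its selected clips
    in within-video time order, return the overall ordered clip list."""
    return [clip for video_clips in sorted_videos for clip in video_clips]

# Example: videos ordered [video 2, video 1, video 3] with clips
# [["2-1", "2-2"], ["1-1"], ["3-1", "3-2"]] flatten to
# ["2-1", "2-2", "1-1", "3-1", "3-2"].
```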
Step S770: based on the durations of the sorted M video clips and the video topic information, perform music matching processing in the candidate music library to obtain background music matching the sorted M video clips.
It should be understood that, since the M video clips are sorted video clips, the sorted M video clips must now serve as the reference, and suitable background music must be selected to match them; the rhythm of the selected background music should be matched to the image styles of the different video clips among the sorted M video clips.
For example, if the styles of the sorted M video clips are first soothing and then cheerful image content, the selected background music should have a soothing intro and a cheerful middle tempo.
Step S780: obtain the processed video based on the sorted M video clips and the background music.
Exemplarily, the processed video can be obtained from the video content of the sorted M video clips and the audio information of the background music.
Optionally, after the background music is added to the sorted M video clips, other editing processing may also be performed on the video to obtain the processed video; the other editing processing may include adding image special effects to the video, adding text to the video, adding transition animation effects to the video, and so on.
It should be understood that a video transition effect refers to applying certain techniques between two scenes (for example, two pieces of material), such as wipes, dissolves, or page curls, to achieve a smooth transition between scenes or plot points, or to enrich the picture and attract the audience.
It should be noted that, apart from the above description, for the parts of Figure 19 that are the same as in Figures 12 to 17, refer to the related descriptions of Figures 12 to 17; details are not repeated here.
It should be understood that, in implementation 1, for M video clips without a strong story line, the sorted M video clips are obtained based on the rhythm of the background music; in implementation 2, for M video clips with a strong story line, the sorted M video clips are obtained based on the cause-and-effect relations among the M video clips, and music whose rhythm matches the sorted M video clips is selected as the background music.
It should be understood that the above examples are intended to help those skilled in the art understand the embodiments of this application, not to limit the embodiments of this application to the specific numerical values or specific scenarios illustrated. Those skilled in the art can obviously make various equivalent modifications or changes based on the above examples, and such modifications or changes also fall within the scope of the embodiments of this application.
The video editing method provided by the embodiments of this application has been described in detail above with reference to Figures 1 to 19; the apparatus embodiments of this application are described in detail below with reference to Figures 20 and 21. It should be understood that the apparatuses in the embodiments of this application can perform the various methods of the foregoing embodiments of this application; for the specific working processes of the following products, refer to the corresponding processes in the foregoing method embodiments.
Figure 20 is a schematic structural diagram of an electronic device provided by an embodiment of this application. The electronic device 800 includes a display module 810 and a processing module 820.
The display module 810 is configured to display a first interface, where the first interface includes video icons, and the videos indicated by the video icons are videos stored in the electronic device. The processing module 820 is configured to: detect a first operation on N of the video icons; in response to the first operation, obtain information of N videos, where N is an integer greater than 1; based on the information of the N videos, obtain the video topic of the N videos; based on the similarity between the images in the N videos and the video topic, select M video clips from the N videos; based on the video topic, obtain music matching the video topic; and based on the M video clips and the music, obtain a first video. The display module 810 is further configured to display the first video.
Optionally, in an embodiment, the processing module 820 is specifically configured to:
input the N videos and the video topic into a pre-trained similarity matching model to obtain similarity confidence values between the images in the N videos and the video topic, where the pre-trained similarity matching model includes an image encoder, a text encoder, and a first similarity measurement module; the image encoder is configured to perform image feature extraction on the N videos, the text encoder is configured to perform text feature extraction on the video topic, the first similarity measurement module is configured to measure the similarity between the image features in the N videos and the text features of the video topic, and the similarity confidence values represent the probability that the images in the N videos are similar to the video topic;
select M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video topic.
Optionally, in an embodiment, the processing module 820 is specifically configured to:
sort the M video clips to obtain sorted M video clips;
synthesize the sorted M video clips and the music into the first video.
Optionally, in an embodiment, the processing module 820 is specifically configured to:
sort the M video clips based on the rhythm of the music to obtain the sorted M video clips.
Optionally, in an embodiment, the processing module 820 is specifically configured to:
sort the M video clips based on the video content in the M video clips to obtain the sorted M video clips.
Optionally, in an embodiment, the processing module 820 is specifically configured to:
input the music and the M video clips into a pre-trained audio-visual rhythm matching model to obtain the sorted M video clips, where the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder, and a first similarity measurement module; the audio encoder is configured to perform feature extraction on the music to obtain audio features, the video encoder is configured to perform feature extraction on the M video clips to obtain video features, and the first similarity measurement module is configured to measure the similarity between the audio features and the M video clips.
Optionally, in an embodiment, the processing module 820 is specifically configured to:
convert the video content of the N videos into N pieces of text description information, where the N pieces of text description information correspond one-to-one with the N videos, and one piece of text description information among the N pieces describes the image content information of one of the N videos;
obtain the topic information of the N videos based on the N pieces of text description information, where the text description information is used to convert the video content in the N videos into text information.
Optionally, in an embodiment, the processing module 820 is specifically configured to:
input the N pieces of text description information into a pre-trained topic classification model to obtain the topic information of the N videos, where the pre-trained topic classification model is a deep neural network for text classification.
Optionally, in an embodiment, when the pre-trained topic classification model outputs at least two video topics, the at least two video topics corresponding to the N pieces of text description information, the display module 810 is further configured to:
display a second interface, where the second interface includes a prompt box, and the prompt box includes information on the at least two video topics;
the processing module 820 is specifically configured to:
detect a second operation on the at least two video topics;
in response to the second operation, obtain the topic information of the N videos.
Optionally, in an embodiment, obtaining music matching the video topic based on the video topic includes:
obtaining music matching the video topic based on the durations of the M video clips and the video topic, where the duration of the music is greater than or equal to the total duration of the M video clips.
Optionally, in an embodiment, the pre-trained similarity matching model is a Transformer model.
Optionally, in an embodiment, the pre-trained similarity matching model is obtained through the following training process:
training a similarity matching model to be trained on a first training data set using a contrastive learning training method, obtaining the pre-trained similarity matching model; the first training data set includes positive example data pairs and negative example data pairs, where a positive example data pair includes first sample text description information and first sample video topic information, the first sample text description information matching the first sample video topic information, and a negative example data pair includes the first sample text description information and second sample video topic information, the first sample text description information not matching the second sample video topic information.
Optionally, in an embodiment, the pre-trained audio-visual rhythm matching model is a Transformer model.
Optionally, in an embodiment, the pre-trained audio-visual rhythm matching model is obtained through the following training process:
training an audio-visual rhythm matching model to be trained on a second training data set using a contrastive learning training method, obtaining the pre-trained audio-visual rhythm matching model; the second training data set includes positive example data pairs and negative example data pairs, where a positive example data pair includes first sample music and a first sample video, the rhythm of the first sample music matching the content of the first sample video, and a negative example data pair includes the first sample music and a second sample video, the rhythm of the first sample music not matching the content of the second sample video.
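A minimal sketch of contrastive training over such positive and negative pairs, written with PyTorch; the margin value and the margin-based form of the loss are illustrative assumptions, not the objective specified by this application:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb: torch.Tensor, pos_emb: torch.Tensor,
                     neg_emb: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """Pull matched pairs together, push unmatched pairs apart.
    anchor_emb: embedding of the sample text description (or sample music),
    pos_emb:    embedding of the matching topic (or matching video),
    neg_emb:    embedding of the non-matching topic (or non-matching video)."""
    pos_sim = F.cosine_similarity(anchor_emb, pos_emb, dim=-1)
    neg_sim = F.cosine_similarity(anchor_emb, neg_emb, dim=-1)
    # Positive similarity should exceed negative similarity by at least `margin`.
    return F.relu(margin - pos_sim + neg_sim).mean()
```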
It should be noted that the above electronic device 800 is embodied in the form of functional modules. The term "module" here can be implemented in the form of software and/or hardware, which is not specifically limited.
For example, a "module" may be a software program, a hardware circuit, or a combination of both that implements the above functions. The hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor for executing one or more software or firmware programs (such as a shared processor, a dedicated processor, or a group processor) and memory, merged logic circuitry, and/or other suitable components supporting the described functions.
Therefore, the units of the examples described in the embodiments of this application can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are executed in hardware or in software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
Figure 21 shows a schematic structural diagram of an electronic device provided by this application. The dashed lines in Figure 21 indicate that the unit or module is optional; the electronic device 900 can be used to implement the video editing method described in the above method embodiments.
The electronic device 900 includes one or more processors 901, which can support the electronic device 900 in implementing the video editing method in the method embodiments. The processor 901 may be a general-purpose processor or a special-purpose processor. For example, the processor 901 may be a central processing unit (CPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another programmable logic device such as a discrete gate, transistor logic device, or discrete hardware component.
Optionally, the processor 901 can be used to control the electronic device 900, execute software programs, and process data of the software programs. The electronic device 900 may also include a communication unit 905 to implement the input (reception) and output (transmission) of signals.
For example, the electronic device 900 may be a chip, in which case the communication unit 905 may be an input and/or output circuit of the chip, or the communication unit 905 may be a communication interface of the chip, and the chip may serve as a component of a terminal device or another electronic device.
For another example, the electronic device 900 may be a terminal device, in which case the communication unit 905 may be a transceiver or a transceiver circuit of the terminal device. The electronic device 900 may include one or more memories 902, on which a program 904 is stored; the program 904 can be run by the processor 901 to generate instructions 903, so that the processor 901 executes, according to the instructions 903, the video editing method described in the above method embodiments.
Optionally, data may also be stored in the memory 902.
Optionally, the processor 901 can also read the data stored in the memory 902; the data may be stored at the same storage address as the program 904, or at a different storage address.
Optionally, the processor 901 and the memory 902 may be provided separately or integrated together, for example, integrated on a system on chip (SOC) of the terminal device.
Exemplarily, the memory 902 can be used to store the program 904 related to the video editing method provided in the embodiments of this application, and the processor 901 can be used to call the program 904 stored in the memory 902 when executing the video editing method, so as to execute the video editing method of the embodiments of this application: for example, displaying a first interface, where the first interface includes video icons and the videos indicated by the video icons are videos stored in the electronic device; detecting a first operation on N of the video icons; in response to the first operation, obtaining information of N videos, where N is an integer greater than 1; based on the information of the N videos, obtaining the video topic of the N videos; based on the similarity between the images in the N videos and the video topic, selecting M video clips from the N videos; based on the video topic, obtaining music matching the video topic; based on the M video clips and the music, obtaining a first video; and displaying the first video.
可选地,本申请还提供了一种计算机程序产品,该计算机程序产品被处理器901执行时实现本申请中任一方法实施例中的视频编辑方法。Optionally, this application also provides a computer program product, which when executed by the processor 901 implements the video editing method in any method embodiment of this application.
例如,该计算机程序产品可以存储在存储器902中,例如是程序904,程序904经过预处理、编译、汇编和链接等处理过程最终被转换为能够被处理器901执行的可执行目标文件。For example, the computer program product may be stored in the memory 902, such as the program 904. The program 904 is finally converted into an executable object file that can be executed by the processor 901 through processes such as preprocessing, compilation, assembly, and linking.
可选地,本申请还提供了一种计算机可读存储介质,其上存储有计算机程序,该计算机程序被计算机执行时实现本申请中任一方法实施例的视频编辑方法。该计算机程序可以是高级语言程序,也可以是可执行目标程序。Optionally, this application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a computer, the video editing method of any method embodiment in this application is implemented. The computer program may be a high-level language program or an executable object program.
For example, the computer-readable storage medium may be the memory 902. The memory 902 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the embodiments of the electronic device described above are merely illustrative. For example, the division into modules is merely a division by logical function; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit.
It should be understood that, in the various embodiments of this application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific embodiments of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims. In short, the foregoing is merely preferred embodiments of the technical solutions of this application and is not intended to limit the protection scope of this application. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall be included in the protection scope of this application.

Claims (17)

1. A video editing method, characterized in that it is applied to an electronic device and comprises:
    displaying a first interface, wherein the first interface includes video icons, and the videos indicated by the video icons are videos stored in the electronic device;
    detecting a first operation on N of the video icons;
    in response to the first operation, obtaining information of N videos, where N is an integer greater than 1;
    obtaining a video theme of the N videos based on the information of the N videos;
    selecting M video clips from the N videos based on the correlation between the N videos and the video theme;
    obtaining, based on the durations of the M video clips and the video theme, music matching the video theme, wherein the duration of the music is greater than or equal to the duration of the M video clips;
    obtaining a first video based on the M video clips and the music; and
    displaying the first video.
2. The video editing method according to claim 1, characterized in that the selecting M video clips from the N videos based on the correlation between the N videos and the video theme comprises:
    inputting the N videos and the video theme into a pre-trained similarity matching model to obtain similarity confidence values between images in the N videos and the video theme, wherein the pre-trained similarity matching model includes an image encoder, a text encoder, and a first similarity measurement module; the image encoder is configured to extract image features from the N videos; the text encoder is configured to extract text features from the video theme; the first similarity measurement module is configured to measure the similarity between the image features of the N videos and the text features of the video theme; and a similarity confidence value represents the probability that an image in the N videos is similar to the video theme; and
    selecting the M video clips from the N videos based on the similarity confidence values between the images in the N videos and the video theme.
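As a rough illustration of the first similarity measurement module, the following sketch scores image features against the theme's text feature and normalizes the scores into confidence values. The image and text encoders are assumed to exist upstream (random vectors stand in for their outputs), and the softmax normalization is an assumption of this sketch; the claim only requires a probability-like score.

```python
import numpy as np

def frame_confidences(frame_feats: np.ndarray, theme_feat: np.ndarray) -> np.ndarray:
    # Cosine similarity of each image feature against the theme's text feature.
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    t = theme_feat / np.linalg.norm(theme_feat)
    sims = f @ t
    # Softmax turns the raw similarities into confidence values that sum to 1.
    e = np.exp(sims - sims.max())
    return e / e.sum()

rng = np.random.default_rng(0)
frames = rng.normal(size=(8, 512))   # stand-ins for image-encoder outputs
theme = rng.normal(size=512)         # stand-in for the text-encoder output
conf = frame_confidences(frames, theme)
best = np.argsort(conf)[-3:]         # indices of the images closest to the theme
```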
3. The video editing method according to claim 1 or 2, characterized in that the obtaining a first video based on the M video clips and the music comprises:
    sorting the M video clips to obtain M sorted video clips; and
    synthesizing the M sorted video clips and the music into the first video.
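One way to realize the synthesis step outside the claimed device is with the open-source moviepy library (1.x API), sketched below: concatenate the sorted clips, then mux the music over the result. The use of moviepy and the file paths are assumptions for illustration; the application names no particular tool.

```python
from moviepy.editor import AudioFileClip, VideoFileClip, concatenate_videoclips

def synthesize(ordered_clips: list[tuple[str, float, float]],
               music_path: str, out_path: str = "first_video.mp4") -> None:
    # ordered_clips holds (source_path, start_s, end_s) triples, already sorted.
    parts = [VideoFileClip(p).subclip(s, e) for p, s, e in ordered_clips]
    video = concatenate_videoclips(parts)
    # The music is at least as long as the clips combined, so trim it
    # to the assembled video's duration before muxing.
    music = AudioFileClip(music_path).subclip(0, video.duration)
    video.set_audio(music).write_videofile(out_path)

# Hypothetical usage:
# synthesize([("v1.mp4", 0.0, 3.0), ("v2.mp4", 5.0, 9.0)], "matched_music.mp3")
```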
4. The video editing method according to claim 3, characterized in that the sorting the M video clips to obtain M sorted video clips comprises:
    sorting the M video clips based on the rhythm of the music to obtain the M sorted video clips.
5. The video editing method according to claim 3, characterized in that the sorting the M video clips to obtain M sorted video clips comprises:
    sorting the M video clips based on the video content of the M video clips to obtain the M sorted video clips.
6. The video editing method according to claim 4, characterized in that the sorting the M video clips based on the rhythm of the music to obtain the M sorted video clips comprises:
    inputting the music and the M video clips into a pre-trained audio-visual rhythm matching model to obtain the M sorted video clips, wherein the pre-trained audio-visual rhythm matching model includes an audio encoder, a video encoder, and a second similarity measurement module; the audio encoder is configured to extract features from the music to obtain audio features; the video encoder is configured to extract features from the M video clips to obtain video features; and the second similarity measurement module is configured to measure the similarity between the audio features and the M video clips.
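The claim leaves the internals of the rhythm matching open. One plausible reading, sketched below under that assumption, is to segment the music by rhythm, encode segments and clips with the two encoders, and greedily assign the most similar remaining clip to each segment in turn; the encoded features are taken as given.

```python
import numpy as np

def order_clips_to_rhythm(clip_feats: np.ndarray, segment_feats: np.ndarray) -> list[int]:
    # clip_feats: (M, d) video-encoder outputs, one row per clip.
    # segment_feats: (M, d) audio-encoder outputs, one row per rhythm segment.
    order, remaining = [], list(range(len(clip_feats)))
    for seg in segment_feats:
        sims = [float(clip_feats[i] @ seg) for i in remaining]  # dot-product similarity
        order.append(remaining.pop(int(np.argmax(sims))))       # best remaining clip
    return order  # clip indices in playback order

rng = np.random.default_rng(1)
order = order_clips_to_rhythm(rng.normal(size=(4, 64)), rng.normal(size=(4, 64)))
```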
7. The video editing method according to any one of claims 1 to 6, characterized in that the obtaining a video theme of the N videos based on the information of the N videos comprises:
    converting the video content of the N videos into N pieces of text description information, wherein the N pieces of text description information correspond one-to-one to the N videos, and each piece of text description information describes the image content information of one of the N videos; and
    obtaining the video theme of the N videos based on the N pieces of text description information, wherein the text description information is used to convert the video content of the N videos into text information.
8. The video editing method according to claim 7, characterized in that the obtaining the video theme of the N videos based on the N pieces of text description information comprises:
    inputting the N pieces of text description information into a pre-trained topic classification model to obtain the video theme of the N videos, wherein the pre-trained topic classification model is a deep neural network for text classification.
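A minimal sketch of such a classifier, assuming the text descriptions have already been encoded into feature vectors: pool the N features, apply a linear classification head, and take the most probable theme. The label set and the mean pooling are hypothetical choices; the claim only requires a deep neural network for text classification.

```python
import numpy as np

THEMES = ["travel", "sports", "family", "food"]  # hypothetical label set

def classify_theme(desc_feats: np.ndarray, W: np.ndarray, b: np.ndarray):
    pooled = desc_feats.mean(axis=0)   # pool the N description features
    logits = pooled @ W + b            # linear classification head
    p = np.exp(logits - logits.max())
    p /= p.sum()                       # softmax over the candidate themes
    return THEMES[int(np.argmax(p))], p

rng = np.random.default_rng(2)
theme, probs = classify_theme(rng.normal(size=(3, 32)),   # N = 3 encoded descriptions
                              rng.normal(size=(32, 4)), np.zeros(4))
```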
9. The video editing method according to claim 8, characterized in that, when the pre-trained topic classification model outputs at least two video themes, the at least two video themes corresponding to the N pieces of text description information, the method further comprises:
    displaying a second interface, wherein the second interface includes a prompt box, and the prompt box includes information on the at least two video themes; and
    the inputting the N pieces of text description information into a pre-trained topic classification model to obtain the video theme of the N videos comprises:
    detecting a second operation on the at least two video themes; and
    in response to the second operation, obtaining the video theme of the N videos.
10. The video editing method according to any one of claims 2 to 9, characterized in that the pre-trained similarity matching model is a Transformer model.
11. The video editing method according to claim 10, characterized in that the pre-trained similarity matching model is obtained through the following training:
    training a similarity matching model to be trained with a contrastive-learning training method based on a first training data set to obtain the pre-trained similarity matching model, wherein the first training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample text description information and first sample video theme information, the first sample text description information matching the first sample video theme information; and a negative example data pair includes the first sample text description information and second sample video theme information, the first sample text description information not matching the second sample video theme information.
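The claim names contrastive learning over matched and mismatched (text description, video theme) pairs without fixing a loss. InfoNCE is a standard choice and is sketched below under that assumption, with the other pairings in a batch serving as the mismatched negatives.

```python
import numpy as np

def info_nce(text_feats: np.ndarray, theme_feats: np.ndarray, tau: float = 0.07) -> float:
    # Row i of each matrix is a matched (description, theme) pair; every other
    # row pairing in the batch is treated as a mismatched negative pair.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    m = theme_feats / np.linalg.norm(theme_feats, axis=1, keepdims=True)
    logits = (t @ m.T) / tau
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_sm = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_sm).mean())         # pull matched pairs together

rng = np.random.default_rng(3)
loss = info_nce(rng.normal(size=(16, 128)), rng.normal(size=(16, 128)))
```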
12. The video editing method according to any one of claims 6 to 11, characterized in that the pre-trained audio-visual rhythm matching model is a Transformer model.
13. The video editing method according to claim 12, characterized in that the pre-trained audio-visual rhythm matching model is obtained through the following training:
    training an audio-visual rhythm matching model to be trained with a contrastive-learning training method based on a second training data set to obtain the pre-trained audio-visual rhythm matching model, wherein the second training data set includes positive example data pairs and negative example data pairs; a positive example data pair includes first sample music and a first sample video, the rhythm of the first sample music matching the content of the first sample video; and a negative example data pair includes the first sample music and a second sample video, the rhythm of the first sample music not matching the content of the second sample video.
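For the audio-visual model, the training signal is the pairing itself. One simple, assumed way to build such a data set is to keep each matched (music, video) pair as a positive and swap in a video from a different sample as a negative, as sketched below; how the application actually curates matched pairs is not disclosed.

```python
import random

def make_pairs(samples: list[tuple[str, str]]) -> list[tuple[str, str, int]]:
    # samples holds matched (music, video) pairs whose rhythm and content agree.
    pairs = []
    for i, (music, video) in enumerate(samples):
        pairs.append((music, video, 1))                        # positive pair
        j = random.choice([k for k in range(len(samples)) if k != i])
        pairs.append((music, samples[j][1], 0))                # mismatched negative
    return pairs

dataset = make_pairs([("m1.mp3", "v1.mp4"), ("m2.mp3", "v2.mp4"), ("m3.mp3", "v3.mp4")])
```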
14. An electronic device, characterized by comprising:
    one or more processors and a memory;
    wherein the memory is coupled to the one or more processors and is configured to store computer program code, the computer program code includes computer instructions, and the one or more processors invoke the computer instructions to cause the electronic device to perform the video editing method according to any one of claims 1 to 13.
15. A chip system, characterized in that the chip system is applied to an electronic device, the chip system includes one or more processors, and the processors are configured to call computer instructions to cause the electronic device to perform the video editing method according to any one of claims 1 to 13.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor is caused to perform the video editing method according to any one of claims 1 to 13.
17. A computer program product, characterized in that the computer program product includes computer program code, and when the computer program code is run by an electronic device, the electronic device is caused to perform the video editing method according to any one of claims 1 to 13.
PCT/CN2023/073141 2022-08-25 2023-01-19 Video editing method and electronic device WO2024040865A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211024258.6 2022-08-25
CN202211024258.6A CN115134646B (en) 2022-08-25 2022-08-25 Video editing method and electronic equipment

Publications (1)

Publication Number Publication Date
WO2024040865A1 true WO2024040865A1 (en) 2024-02-29

Family

ID=83387701

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/073141 WO2024040865A1 (en) 2022-08-25 2023-01-19 Video editing method and electronic device

Country Status (2)

Country Link
CN (1) CN115134646B (en)
WO (1) WO2024040865A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115134646B (en) * 2022-08-25 2023-02-10 荣耀终端有限公司 Video editing method and electronic equipment
CN116193275B (en) * 2022-12-15 2023-10-20 荣耀终端有限公司 Video processing method and related equipment
CN117278801B (en) * 2023-10-11 2024-03-22 广州智威智能科技有限公司 AI algorithm-based student activity highlight instant shooting and analyzing method
CN117692676A (en) * 2023-12-08 2024-03-12 广东创意热店互联网科技有限公司 Video quick editing method based on artificial intelligence technology
CN117544822B (en) * 2024-01-09 2024-03-26 杭州任性智能科技有限公司 Video editing automation method and system
CN117830910B (en) * 2024-03-05 2024-05-31 沈阳云翠通讯科技有限公司 Automatic mixed video cutting method, system and storage medium for video retrieval

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140096002A1 (en) * 2012-09-28 2014-04-03 Frameblast Limited Video clip editing system
US20180174616A1 (en) * 2016-12-21 2018-06-21 Facebook, Inc. Systems and methods for compiled video generation
CN110495180A (en) * 2017-03-30 2019-11-22 格雷斯诺特公司 It generates for being presented with the video of audio
CN114286171A (en) * 2021-08-19 2022-04-05 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN115134646A (en) * 2022-08-25 2022-09-30 荣耀终端有限公司 Video editing method and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013187796A1 (en) * 2011-12-15 2013-12-19 Didenko Alexandr Sergeevich Method for automatically editing digital video files
WO2018045358A1 (en) * 2016-09-05 2018-03-08 Google Llc Generating theme-based videos
US10410060B2 (en) * 2017-12-14 2019-09-10 Google Llc Generating synthesis videos
CN109688463B (en) * 2018-12-27 2020-02-18 北京字节跳动网络技术有限公司 Clip video generation method and device, terminal equipment and storage medium
CN110602546A (en) * 2019-09-06 2019-12-20 Oppo广东移动通信有限公司 Video generation method, terminal and computer-readable storage medium
WO2021259322A1 (en) * 2020-06-23 2021-12-30 广州筷子信息科技有限公司 System and method for generating video
WO2022061806A1 (en) * 2020-09-27 2022-03-31 深圳市大疆创新科技有限公司 Film production method, terminal device, photographing device, and film production system
US11468915B2 (en) * 2020-10-01 2022-10-11 Nvidia Corporation Automatic video montage generation
CN114731458A (en) * 2020-12-31 2022-07-08 深圳市大疆创新科技有限公司 Video processing method, video processing apparatus, terminal device, and storage medium
CN114520931B (en) * 2021-12-31 2024-01-23 脸萌有限公司 Video generation method, device, electronic equipment and readable storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140096002A1 (en) * 2012-09-28 2014-04-03 Frameblast Limited Video clip editing system
US20180174616A1 (en) * 2016-12-21 2018-06-21 Facebook, Inc. Systems and methods for compiled video generation
CN110495180A (en) * 2017-03-30 2019-11-22 格雷斯诺特公司 It generates for being presented with the video of audio
CN114286171A (en) * 2021-08-19 2022-04-05 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN115134646A (en) * 2022-08-25 2022-09-30 荣耀终端有限公司 Video editing method and electronic equipment

Also Published As

Publication number Publication date
CN115134646B (en) 2023-02-10
CN115134646A (en) 2022-09-30

Similar Documents

Publication Publication Date Title
WO2024040865A1 (en) Video editing method and electronic device
CN111726536B (en) Video generation method, device, storage medium and computer equipment
US9208227B2 (en) Electronic apparatus, reproduction control system, reproduction control method, and program therefor
CN111866404B (en) Video editing method and electronic equipment
WO2023125335A1 (en) Question and answer pair generation method and electronic device
WO2021244457A1 (en) Video generation method and related apparatus
JP2021192222A (en) Video image interactive method and apparatus, electronic device, computer readable storage medium, and computer program
CN111480156A (en) System and method for selectively storing audiovisual content using deep learning
CN111465918B (en) Method for displaying service information in preview interface and electronic equipment
CN112532865B (en) Slow-motion video shooting method and electronic equipment
CN112214636A (en) Audio file recommendation method and device, electronic equipment and readable storage medium
CN113010740B (en) Word weight generation method, device, equipment and medium
WO2020119455A1 (en) Method for repeating word or sentence during video playback, and electronic device
WO2021180109A1 (en) Electronic device and search method thereof, and medium
US9525841B2 (en) Imaging device for associating image data with shooting condition information
WO2022073417A1 (en) Fusion scene perception machine translation method, storage medium, and electronic device
CN114242037A (en) Virtual character generation method and device
WO2023160170A1 (en) Photographing method and electronic device
WO2022037479A1 (en) Photographing method and photographing system
CN115529378A (en) Video processing method and related device
WO2023030098A1 (en) Video editing method, electronic device, and storage medium
WO2022206605A1 (en) Method for determining target object, and photographing method and device
WO2024082914A1 (en) Video question answering method and electronic device
CN116828099B (en) Shooting method, medium and electronic equipment
CN117119266A (en) Video score processing method, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23855962

Country of ref document: EP

Kind code of ref document: A1