WO2023197749A9 - Method, device, equipment and storage medium for determining the insertion time point of background music - Google Patents

Method, device, equipment and storage medium for determining the insertion time point of background music

Info

Publication number
WO2023197749A9
Authority
WO
WIPO (PCT)
Prior art keywords
features
video
audio
feature
sample
Prior art date
Application number
PCT/CN2023/077645
Other languages
English (en)
French (fr)
Other versions
WO2023197749A1 (zh)
Inventor
冯鑫
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023197749A1 publication Critical patent/WO2023197749A1/zh
Publication of WO2023197749A9 publication Critical patent/WO2023197749A9/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating

Definitions

  • the present application relates to the field of computer technology, and in particular to a method, device, equipment and storage medium for determining the insertion time point of background music.
  • Embodiments of the present application provide a method, device, equipment and storage medium for determining the insertion time point of background music, which can improve the efficiency of inserting background music into videos.
  • the technical solution is as follows.
  • a method for determining the insertion time point of background music includes:
  • the video features of the target video are encoded based on the attention mechanism to obtain multiple target parameters.
  • the multiple target parameters correspond to multiple time points of the target video.
  • the target parameters are used to represent the probability of inserting background music at the corresponding time points.
  • At least one candidate time point for inserting background music is determined, and the candidate time point is a time point among the plurality of time points at which the target parameter meets the target condition.
  • a device for determining the insertion time point of background music includes:
  • a feature extraction module, configured to extract audio features and image features of the target video;
  • a feature fusion module, configured to fuse the audio features and the image features to obtain the video features of the target video;
  • an encoding module, configured to encode the video features of the target video based on an attention mechanism to obtain multiple target parameters, where the multiple target parameters correspond to multiple time points of the target video, and the target parameters are used to represent the probability of inserting background music at the corresponding time point;
  • a candidate time point determination module is configured to determine at least one candidate time point for inserting background music, where the candidate time point is a time point in the plurality of time points at which the target parameter meets the target condition.
  • in one aspect, a computer device includes one or more processors and one or more memories. At least one computer program is stored in the one or more memories. The computer program is loaded and executed by the one or more processors to implement the above method for determining the insertion time point of background music.
  • a computer-readable storage medium is provided. At least one computer program is stored in the computer-readable storage medium. The computer program is loaded and executed by a processor to implement the above method for determining the insertion time point of background music.
  • a computer program product including a computer program that implements the above background music insertion time point determination method when executed by a processor.
  • Figure 1 is a schematic diagram of the implementation environment of a method for determining the insertion time point of background music provided by an embodiment of the present application;
  • Figure 2 is a flow chart of a method for determining the insertion time point of background music provided by an embodiment of the present application
  • Figure 3 is a flow chart of another method for determining the insertion time point of background music provided by an embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a feature extraction unit provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a residual construction subunit provided by an embodiment of the present application.
  • Figure 6 is a schematic structural diagram of a target parameter acquisition unit provided by an embodiment of the present application.
  • Figure 7 is a schematic diagram of an effect provided by an embodiment of the present application.
  • Figure 8 is a flow chart of another method for determining the insertion time point of background music provided by an embodiment of the present application.
  • Figure 9 is a flow chart of a training method for a time point determination model provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of an audio separation unit provided by an embodiment of the present application.
  • Figure 11 is a flow chart of another method for training a time point determination model provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a device for determining the insertion time point of background music provided by an embodiment of the present application
  • Figure 13 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • Figure 14 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science that attempts to understand the nature of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive subject that covers a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Machine Learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, teaching learning and other technologies.
  • Semantic features: features used to represent the semantics expressed by text. Different texts can correspond to the same semantic features. For example, the text "What is the weather like today" and the text "How is the weather today" can correspond to the same semantic feature.
  • the computer device can map the characters in the text into character vectors, and combine and operate the character vectors according to the relationship between the characters to obtain the semantic features of the text.
  • the computer device can use a codec such as Bidirectional Encoder Representations from Transformers (BERT).
  • Normalization: mapping sequences with different value ranges to the (0, 1) interval to facilitate data processing.
  • the normalized values can be directly interpreted as probabilities.
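  • As an illustration (not part of the patent text), the softmax function is one common way to realize such a normalization: it maps an arbitrary real-valued sequence x_1, ..., x_n into the (0, 1) interval so that the results sum to 1 and can be read as probabilities:

        \hat{x}_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}}, \qquad \hat{x}_i \in (0, 1), \qquad \sum_{i=1}^{n} \hat{x}_i = 1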
  • Embedding coding: mathematically represents a correspondence relationship, that is, data in the X space is mapped to the Y space through a function F, where the function F is an injective function and the mapping is structure-preserving.
  • the injective property indicates that each datum after mapping uniquely corresponds to a datum before mapping.
  • structure preservation means that the size relationship of the data before mapping and the size relationship of the data after mapping are the same. For example, suppose data X1 and X2 exist before mapping, and mapping yields Y1 corresponding to X1 and Y2 corresponding to X2. If X1 > X2 before mapping, then correspondingly Y1 is greater than Y2 after mapping. For words, embedding maps the words to another space to facilitate subsequent machine learning and processing.
  • Attention weight: represents the importance of certain data in the training or prediction process, where importance represents the impact of the input data on the output data. Data of high importance has a higher attention weight value, and data of low importance has a lower attention weight value. The importance of data differs across scenarios.
  • the process of training the attention weights of the model is the process of determining the importance of the data.
  • the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data comply with the relevant laws, regulations and standards of the relevant countries and regions.
  • the method for determining the insertion time point of background music provided by the embodiment of the present application can be executed by a computer device.
  • the computer device is a terminal or server.
  • the following is an introduction to the implementation environment of the method for determining the insertion time point of background music provided by the embodiment of the present application.
  • Figure 1 is a schematic diagram of the implementation environment of the method for determining the insertion time point of background music provided by the embodiment of the present application. See Figure 1.
  • the implementation environment may include a terminal 110 and a server 140.
  • the terminal 110 is connected to the server 140 through a wireless network or a wired network.
  • the terminal 110 is a vehicle-mounted terminal, a smart phone, a tablet, a laptop, a desktop computer, a smart speaker, a smart watch, a smart TV, etc., but is not limited thereto.
  • the terminal 110 is installed and runs with an application that supports determining the time point for inserting background music.
  • the server 140 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data and artificial intelligence platforms.
  • the server 140 provides background services for applications running on the terminal 110 .
  • the number of terminals 110 and servers 140 is not limited.
  • the terminal is also the terminal 110 in the above-mentioned implementation environment
  • the server is also Server 140 in the above implementation environment.
  • the video producer selects the film and television work to be inserted with background music through the terminal, and the film and television work is also the target video.
  • the terminal sends the film and television work to the server, and the server processes the film and television work to obtain a candidate time point in the film and television work.
  • the candidate time point is also a time point at which background music can be inserted into the film and television work.
  • the server sends the candidate time points of the film and television work to the terminal, and the terminal displays the candidate time points of the film and television work.
  • the video producer can select the target time point for inserting background music from the candidate time points displayed on the terminal.
  • the server can directly determine the candidate time point in the film and television work. There is no need for the video producer to watch the film and television work in full before determining the candidate time point, which greatly improves the efficiency of inserting background music into the film and television work.
  • the short video author selects the short video to be inserted with background music through the terminal, and the short video is also the target video.
  • the terminal sends the short video to the server, and the server processes the short video to obtain a candidate time point in the short video.
  • the candidate time point is also a time point at which background music can be inserted into the short video.
  • the server sends the candidate time point of the short video to the terminal, and the terminal displays the candidate time point of the short video.
  • the short video author can select the target time point to insert background music from the candidate time points displayed on the terminal.
  • the server can directly determine the candidate time points in the short video. There is no need for the short video author to select within the full duration of the short video, which greatly improves the efficiency of inserting background music into the short video.
  • the technical solutions provided by the embodiments of the present application are introduced below in combination with the above cases.
  • the technical solution provided by the embodiment of the present application can be executed by the terminal or the server, or can be executed by the terminal and the server together.
  • the following takes the server as the execution subject as an example for explanation.
  • the method includes the following steps.
  • the server extracts audio features and image features of the target video.
  • the target video is a video into which background music is to be inserted, such as a film and television work that has not yet inserted background music, or a video clip during secondary creation, etc. This is not limited in the embodiment of the present application.
  • Audio features can reflect the audio characteristics of the target video, and audio features are also called auditory features; image features can reflect the image characteristics of the target video, and image features are also called visual features.
  • the server fuses the audio features and the image features to obtain the video features of the target video.
  • the audio features and image features of the target video are integrated, so that the video features of the target video can reflect the characteristics of the target video from both auditory and visual dimensions.
  • the video features thus have strong expressive ability.
  • the server encodes the video features of the target video based on the attention mechanism and obtains multiple target parameters.
  • the multiple target parameters correspond to multiple time points of the target video.
  • the target parameters are used to represent the probability of inserting background music at the corresponding time points.
  • the information in the video features can be fully utilized to improve the accuracy of the determined target parameters.
  • the server determines at least one candidate time point for inserting background music.
  • the candidate time point is a time point among multiple time points at which the target parameter meets the target condition.
  • the candidate time points are time points with a high probability of inserting background music
  • the video producer can select a target time point for inserting background music among the determined candidate time points.
  • the audio features and image features of the target video are combined to determine the video features of the target video.
  • the video features can more accurately represent the content of the target video.
  • the video features are encoded based on the attention mechanism to obtain multiple target parameters, which represent the probability of inserting background music at the corresponding time point.
  • a candidate time point is determined from the multiple time points, which is also a time point at which background music can be inserted into the target video.
  • the determined candidate time points are more accurate.
  • the video producer does not need to watch the target video in full, and only needs to select among the identified candidate time points. This improves the efficiency of inserting background music into the video while ensuring accuracy.
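  • The following is a minimal sketch, in PyTorch-style Python, of the overall flow described above; the object names (model, audio_encoder, image_encoder, attention_encoder) and the threshold-based target condition are illustrative assumptions, not details from the patent text:

        import torch

        def find_candidate_time_points(video, model, threshold=0.5):
            # 1. Extract auditory and visual features of the target video (one row per time point).
            audio_feat = model.audio_encoder(video.audio_frames)   # shape [T, D]
            image_feat = model.image_encoder(video.video_frames)   # shape [T, D]

            # 2. Fuse the two modalities into the video features of the target video.
            video_feat = audio_feat + image_feat                   # shape [T, D]

            # 3. Attention-based encoding yields one target parameter per time point,
            #    interpreted as the probability of inserting background music there.
            probs = torch.sigmoid(model.attention_encoder(video_feat))  # shape [T]

            # 4. Candidate time points are those whose target parameter meets the
            #    target condition (here assumed to be exceeding a threshold).
            return (probs > threshold).nonzero(as_tuple=True)[0].tolist()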
  • the server obtains the target video.
  • the target video is a video into which background music is to be inserted.
  • the target video is a movie or TV series in a film and television work, or other types of videos such as short videos, etc., which are not limited in the embodiments of the present application.
  • the terminal in response to the operation on the target video, sends the target video to the server.
  • the video producer can control the terminal to send the target video to the server by operating the target video.
  • the video producer can select the target video by himself, and the efficiency of human-computer interaction is high.
  • the terminal displays a video selection page, which includes multiple candidate videos.
  • the terminal sends the target video to the server.
  • the server gets the target video.
  • the plurality of candidate videos are videos stored on the terminal.
  • the terminal in response to a click operation on the target video on the video selection page, the terminal sends a video selection instruction to the server, and the video selection instruction carries the identification of the target video.
  • the server receives the video selection instruction and obtains the identification of the target video from the video selection instruction.
  • the server performs a query based on the identification of the target video to obtain the target video.
  • the server performs feature extraction on multiple audio frames of the target video to obtain the audio features of the target video.
  • the server performs feature extraction on the time domain information of the multiple audio frames to obtain the time domain audio features of the multiple audio frames.
  • the server performs feature extraction on the frequency domain information of the multiple audio frames to obtain frequency domain audio features of the multiple audio frames.
  • the server obtains the audio features of the target video based on the time domain audio features and frequency domain audio features of the multiple audio frames.
  • the server can extract time-domain audio features and frequency-domain audio features of multiple audio frames of the target video, and the audio features can more accurately reflect the audio characteristics of the target video.
  • a time point determination model is deployed on the server, and the server implements the above implementation manner through the time point determination model.
  • the time point determination model includes an audio feature extraction unit, and the server obtains the audio features of the target video through the audio feature extraction unit of the time point determination model.
  • the audio feature of the target video is an audio feature sequence.
  • the audio feature sequence includes multiple audio sub-features. Each audio sub-feature corresponds to a time point of the target video, and each audio sub-feature is used to reflect the audio characteristics at the corresponding time point.
  • the server performs feature extraction on the time domain information of multiple audio frames to obtain the time domain audio features of multiple audio frames.
  • the multiple audio frames are temporally continuous audio frames in the target video.
  • the time domain information of the multiple audio frames is used to describe the changes in the amplitude of the multiple audio frames over time.
  • the time domain audio features can reflect the characteristics of the multiple audio frames in the time domain.
  • the time-domain audio features of multiple audio frames are a time-domain audio feature sequence.
  • the time-domain audio feature sequence includes multiple sub-features, each sub-feature corresponds to a time point of the target video, and each sub-feature is used to reflect the time-domain audio characteristics of the corresponding time point.
  • the server uses multiple one-dimensional convolution kernels to perform feature extraction on the time domain information of multiple audio frames to obtain the time domain audio features of the multiple audio frames.
  • the server extracts time-domain audio features through multiple one-dimensional convolution kernels, and multiple one-dimensional convolution kernels can more accurately extract time-domain audio features.
  • the server inputs the time domain information of multiple audio frames into a time point determination model, extracts features of the time domain information through the time point determination model, and obtains the time domain audio features of multiple audio frames.
  • the time point determination model includes an audio feature extraction unit
  • the audio feature extraction unit includes a time domain feature extraction branch and a frequency domain feature extraction branch.
  • the time domain feature extraction branch is used to extract time domain audio features of multiple audio frames
  • the frequency domain branch is used to extract frequency domain audio features of multiple audio frames.
  • the time domain feature extraction branch of the audio feature extraction unit includes multiple one-dimensional convolution sub-units and multiple pooling sub-units, and each one-dimensional convolution sub-unit includes at least one one-dimensional convolution kernel.
  • after the server inputs the time domain information of the multiple audio frames into the time point determination model, it extracts features from the time domain information of the multiple audio frames through the time domain feature extraction branch of the time point determination model, that is, the multiple one-dimensional convolution subunits on the time domain feature extraction branch convolve the time domain information to obtain multiple time domain feature maps.
  • the server pools the multiple time domain feature maps through the multiple pooling subunits on the time domain feature extraction branch to obtain the time domain audio features of the multiple audio frames.
  • in this way, the time-domain characteristics of the multiple audio frames can be extracted from their time-domain information; in particular, the loudness and sample amplitude of the multiple audio frames can be accurately extracted.
  • the pooling layer is used to reduce complexity and improve the extraction efficiency of time-domain audio features.
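  • For illustration, a time-domain feature extraction branch of the kind described above (one-dimensional convolution subunits interleaved with max-pooling subunits) could look as follows in PyTorch; the channel counts, kernel sizes and pooling sizes are assumptions, not values given in the patent text:

        import torch.nn as nn

        # Four one-dimensional convolution sub-units and three max-pooling sub-units,
        # applied to raw waveform samples (the time domain information).
        time_domain_branch = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=11, padding=5),    # first convolution sub-unit
            nn.Conv1d(64, 64, kernel_size=3, padding=1),    # second convolution sub-unit
            nn.MaxPool1d(kernel_size=4),                    # first max-pooling sub-unit
            nn.Conv1d(64, 128, kernel_size=3, padding=1),   # third convolution sub-unit
            nn.MaxPool1d(kernel_size=4),                    # second max-pooling sub-unit
            nn.Conv1d(128, 128, kernel_size=3, padding=1),  # fourth convolution sub-unit
            nn.MaxPool1d(kernel_size=4),                    # third max-pooling sub-unit
        )
        # Input: waveform of shape [batch, 1, num_samples];
        # output: time-domain audio features of shape [batch, 128, num_samples // 64].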
  • the server performs feature extraction on the frequency domain information of multiple audio frames to obtain the frequency domain audio features of multiple audio frames.
  • the frequency domain audio features of multiple audio frames are a frequency domain audio feature sequence.
  • the frequency domain audio feature sequence includes multiple sub-features. Each sub-feature corresponds to a time point of the target video, and each sub-feature is used to reflect the frequency domain audio characteristics at the corresponding time point.
  • the frequency domain information of the multiple audio frames is the frequency spectrum of the multiple audio frames, such as the Mel cepstrum of the multiple audio frames.
  • the frequency domain information of multiple audio frames is determined based on the time domain information of multiple audio frames. For example, Fourier transform is performed on the time domain information of multiple audio frames to obtain the Fourier spectrum of multiple audio frames.
  • the server maps the Fourier spectra of multiple audio frames to the Mel scale through the triangular window function to obtain the first Mel parameters of the multiple audio frames.
  • the server obtains the logarithm of the first Mel parameter of the multiple audio frames and obtains the second Mel parameter of the multiple audio frames.
  • the server performs discrete cosine transformation on the second Mel parameters of the multiple audio frames to obtain the Mel cepstrum of the multiple audio frames.
  • the Mel cepstrum is also the frequency domain information of the multiple audio frames. It should be noted that the above description is a method of obtaining the Mel cepstrum based on time domain information provided by the embodiment of the present application. In other possible implementations, the server can also obtain the Mel cepstrum based on time domain information through other methods, which is not limited in the embodiments of this application.
  • the above takes the case where the frequency domain information of the multiple audio frames is the Mel cepstrum of the multiple audio frames as an example. In other possible implementations, the frequency domain information of the multiple audio frames may also be another type of spectrum, which is not limited in the embodiments of this application.
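  • As an illustration of the chain described above (Fourier spectrum, Mel-scale mapping with triangular filters, logarithm, discrete cosine transform), the Mel cepstrum can be computed with the librosa library; the file name and parameter values are assumptions:

        import librosa

        # Load the audio track of the target video (assumed file name and sampling rate).
        waveform, sample_rate = librosa.load("target_video_audio.wav", sr=16000)

        # MFCC = Mel cepstrum: STFT -> triangular Mel filter bank -> log -> DCT.
        mel_cepstrum = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=40)
        print(mel_cepstrum.shape)  # (40, number_of_audio_frames)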
  • the server uses multiple two-dimensional convolution kernels to perform feature extraction on frequency domain information of multiple audio frames to obtain frequency domain audio features of multiple audio frames.
  • the server extracts frequency domain audio features through multiple two-dimensional convolution kernels, and multiple two-dimensional convolution kernels can more accurately extract frequency domain audio features.
  • the server inputs frequency domain information of multiple audio frames into a time point determination model, and performs feature extraction on the frequency domain information through the time point determination model to obtain frequency domain audio features of the multiple audio frames.
  • the time point determination model includes an audio feature extraction unit.
  • the audio feature extraction unit includes a time domain feature extraction branch and a frequency domain feature extraction branch.
  • the time domain feature extraction branch is used to extract the time domain audio features of the multiple audio frames.
  • the frequency domain feature extraction branch is used to extract the frequency domain audio features of the multiple audio frames.
  • the frequency domain feature extraction branch of the audio feature extraction unit includes a plurality of two-dimensional convolution sub-units, and each two-dimensional convolution sub-unit includes at least one two-dimensional convolution kernel.
  • after the server inputs the frequency domain information of the multiple audio frames into the time point determination model, it extracts features from the frequency domain information of the multiple audio frames through the frequency domain feature extraction branch of the time point determination model, that is, the multiple two-dimensional convolution subunits on the frequency domain feature extraction branch convolve the frequency domain information to obtain the frequency domain audio features of the multiple audio frames.
  • the frequency domain characteristics of multiple audio frames can be extracted from the frequency domain information of multiple audio frames.
  • the server obtains the audio features of the target video based on the time domain audio features and frequency domain audio features of the multiple audio frames.
  • the server fuses the time domain audio features and the frequency domain audio features of multiple audio frames to obtain the initial audio features of the target video.
  • the server convolves the initial audio features of the target video to obtain the audio features of the target video.
  • the server fuses the time domain audio features and frequency domain audio features of multiple audio frames by adding them to obtain the initial audio features of the target video, and further convolves the initial audio features. By fusing time-domain audio features and frequency-domain audio features, the resulting audio features can more accurately express the audio characteristics of the target video.
  • when the server extracts time-domain audio features through multiple one-dimensional convolution kernels and extracts frequency-domain audio features through multiple two-dimensional convolution kernels, the dimension of the obtained time-domain audio features is one-dimensional.
  • the dimension of frequency domain audio features is two-dimensional.
  • the server upsamples the time-domain audio features of multiple audio frames, changing the one-dimensional time-domain audio features into two-dimensional time-domain audio features.
  • the server adds the two-dimensional time domain audio features and the frequency domain audio features to obtain the initial audio features of the target video.
  • This addition process is the process of fusing the time domain audio features and the frequency domain audio features.
  • the server convolves the initial audio features with at least one two-dimensional convolution kernel to obtain the audio features of the target video.
  • the server obtains the audio features of the target video based on time-domain audio features and frequency-domain audio features of multiple audio frames through a time point determination model.
  • the time point determination model includes an audio feature fusion unit.
  • the server uses the audio feature fusion subunit of the time point determination model to fuse the time domain audio features and frequency domain audio features of the multiple audio frames into the audio features of the target video.
  • the audio feature fusion subunit belongs to the audio feature extraction unit.
  • the server fuses the time domain audio features and the frequency domain audio features of multiple audio frames to obtain the initial audio features of the target video.
  • the server performs maximum pooling and mean pooling on the initial audio features to obtain the first pooling feature and the second pooling feature of the target video.
  • the server fuses the first pooled features and the second pooled features to obtain the audio features of the target video.
  • the server uses maximum pooling and mean pooling to reduce the complexity of the initial audio features and improve the efficiency of subsequent operations.
  • when the server extracts time-domain audio features through multiple one-dimensional convolution kernels and extracts frequency-domain audio features through multiple two-dimensional convolution kernels, the dimension of the obtained time-domain audio features is one-dimensional.
  • the dimension of frequency domain audio features is two-dimensional.
  • the server upsamples the time-domain audio features of multiple audio frames, changing the one-dimensional time-domain audio features into two-dimensional time-domain audio features.
  • the server adds the two-dimensional time domain audio features and the frequency domain audio features and performs convolution to obtain the initial audio features of the target video.
  • this addition and convolution process is the process of fusing the time domain audio features and the frequency domain audio features.
  • the server performs maximum pooling and mean pooling on the initial audio features respectively to obtain the first pooling feature and the second pooling feature of the target video.
  • the first pooling feature is a pooling feature obtained by performing maximum pooling on the initial audio feature
  • the second pooling feature is a pooling feature obtained by performing mean pooling on the initial audio feature.
  • the server adds the first pooled feature and the second pooled feature to obtain the third pooled feature.
  • the server linearly rectifies the third pooled feature to obtain the audio feature of the target video. Among them, Rectified Linear is also called linear correction.
  • the server can linearly rectify the third pooling feature through a linear rectification function to obtain the audio features of the target video.
  • the linear rectification function is also called a ramp function.
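  • A sketch of the fusion just described (reshaping the one-dimensional time-domain features to two dimensions, adding them to the frequency-domain features, convolving, then combining maximum pooling and mean pooling with linear rectification); the tensor shapes and layer sizes are assumptions chosen only to make the example self-consistent:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def fuse_audio_features(time_feat_1d, freq_feat_2d, conv2d):
            # time_feat_1d: [batch, C, T]     one-dimensional time-domain audio features
            # freq_feat_2d: [batch, C, F, T]  two-dimensional frequency-domain audio features
            # Upsample / reshape the 1-D features so they can be added to the 2-D ones.
            time_feat_2d = time_feat_1d.unsqueeze(2)                   # [batch, C, 1, T]
            time_feat_2d = F.interpolate(time_feat_2d, size=freq_feat_2d.shape[2:])

            # Add and convolve to obtain the initial audio features of the target video.
            initial = conv2d(time_feat_2d + freq_feat_2d)

            # Maximum pooling and mean pooling, element-wise addition, linear rectification.
            pooled_max = F.max_pool2d(initial, kernel_size=2)          # first pooling feature
            pooled_avg = F.avg_pool2d(initial, kernel_size=2)          # second pooling feature
            return F.relu(pooled_max + pooled_avg)                     # audio features

        conv2d = nn.Conv2d(128, 128, kernel_size=3, padding=1)
        audio_features = fuse_audio_features(torch.randn(1, 128, 256),
                                             torch.randn(1, 128, 40, 256), conv2d)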
  • the server obtains the audio features of the target video based on time-domain audio features and frequency-domain audio features of multiple audio frames through a time point determination model.
  • the time point determination model includes an audio feature fusion unit.
  • the server uses the audio feature fusion subunit of the time point determination model to fuse the time domain audio features and frequency domain audio features of the multiple audio frames into the audio features of the target video. The audio feature fusion subunit belongs to the audio feature extraction unit.
  • the server inputs the time domain information 401 of multiple audio frames into the time point determination model.
  • the time domain feature extraction branch 402 of the audio feature extraction unit of the time point determination model performs feature extraction on the time domain information 401 of the multiple audio frames. That is, the server performs feature extraction on the time domain information 401 through multiple one-dimensional convolution subunits and multiple maximum pooling subunits to obtain the time domain audio features of the multiple audio frames.
  • each one-dimensional convolution subunit corresponds to a one-dimensional convolution kernel.
  • the number of one-dimensional convolution sub-units is four, which are respectively named the first convolution sub-unit, the second convolution sub-unit, the third convolution sub-unit and the fourth convolution sub-unit; the number of maximum pooling sub-units is three, which are respectively named the first maximum pooling sub-unit, the second maximum pooling sub-unit and the third maximum pooling sub-unit.
  • the server convolves the time domain information through the first convolution subunit to obtain the first time domain feature vector of the time domain information.
  • the server convolves the first time domain feature vector through the second convolution subunit to obtain the second time domain feature vector of the time domain information.
  • the server performs maximum pooling on the second time domain feature vector through the first maximum pooling subunit to obtain the first pooling vector of the time domain information.
  • the server convolves the first pooling vector through the third convolution subunit to obtain the third time domain feature vector of the time domain information.
  • the server performs maximum pooling on the third time domain feature vector through the second maximum pooling subunit to obtain the second pooling vector of the time domain information.
  • the server convolves the second pooling vector through the fourth convolution subunit to obtain the fourth time domain feature vector of the time domain information.
  • the server performs maximum pooling on the fourth time domain feature vector through the third maximum pooling subunit to obtain the time domain audio feature vector of the target video.
  • the time domain audio feature vector is used to represent the time domain audio features of the target video.
  • the server upsamples the time domain audio feature vector through the reshaping subunit 4021 of the audio feature extraction unit to obtain a two-dimensional time domain audio feature vector 4022.
  • after obtaining the two-dimensional time domain audio feature vector, the server performs feature extraction on the time domain information 401 of the multiple audio frames through the frequency domain feature extraction branch 403 of the audio feature extraction unit of the time point determination model to obtain the frequency domain audio features of the multiple audio frames.
  • that is, the server processes the time domain information 401 of the multiple audio frames through the frequency domain information acquisition subunit 4031 on the frequency domain feature extraction branch 403 to obtain the frequency domain information of the multiple audio frames.
  • the frequency domain information is a mel cepstrum.
  • the server convolves the frequency domain information through at least one two-dimensional convolution subunit on the frequency domain feature extraction branch 403 to obtain the frequency domain audio feature vector 4032 of the target video.
  • the server adds the two-dimensional time domain audio feature vector 4022 and the frequency domain audio feature vector 4032 through the audio feature fusion subunit 404 of the time point determination model, and then performs convolution through the two-dimensional convolution subunit 405 of the audio feature extraction unit to obtain the initial audio features of the target video.
  • the server processes the initial audio features through the maximum pooling subunit 406 and the mean pooling subunit 407 of the audio feature extraction unit to obtain the first pooling feature and the second pooling feature.
  • the server adds the first pooled feature and the second pooled feature to obtain the third pooled feature.
  • the server linearly rectifies the third pooling feature through a linear rectification subunit 408 (Rectified Linear Unit) to obtain the audio feature 409 of the target video.
  • the audio feature extraction unit of the time point determination model is a pretrained audio neural network (Pretrained Audio Neural Networks, PANNs).
  • the server can either perform step 302 first and then perform the following step 303, or it can perform the following step 303 first and then perform the step 302, or it can perform step 302 and the following step 303 at the same time.
  • the embodiment of the present application does not limit this. In this embodiment of the present application, the case where the server first performs step 302 and then performs the following step 303 is taken as an example for explanation.
  • the server performs feature extraction on multiple video frames of the target video to obtain image features of the target video.
  • the multiple video frames of the target video are temporally consecutive video frames in the target video.
  • the video feature of the target video is a video feature sequence
  • the video feature sequence includes multiple video sub-features
  • each video sub-feature corresponds to a time point of the target video
  • each video sub-feature is used to reflect the video characteristics at the corresponding time point.
  • the server inputs multiple video frames into a time point determination model, performs feature extraction on the multiple video frames through the time point determination model, and obtains image features of the multiple video frames.
  • the image features of the multiple video frames are also the image features of the target video.
  • feature extraction is performed on multiple video frames through a time point determination model to obtain image features of the target video, thereby achieving abstract expression of multiple video frames and improving subsequent computing efficiency.
  • Example 1 The server inputs multiple video frames into the time point determination model, and performs convolution, normalization and linear correction on the multiple video frames through the time point determination model to obtain the image features of the multiple video frames.
  • the server inputs multiple video frames into a time point determination model, and the time point determination model includes an image feature extraction unit.
  • the server convolves the multiple video frames through at least one two-dimensional convolution layer of the image feature extraction unit of the time point determination model to obtain feature maps of the multiple video frames.
  • the server normalizes and linearly corrects the feature maps of the multiple video frames through at least one normalization layer and at least one linear correction layer of the time point determination model to obtain the image features of the multiple video frames.
  • the server represents video frames in the form of matrices and image features in the form of vectors. In the process of convolving the video frames, the convolution kernel is slid over the video frames.
  • the image feature extraction unit includes three types of residual construction subunits, which are respectively recorded as the first type of residual construction subunit, the second type of residual construction subunit and the third type of residual construction subunit.
  • the image feature extraction unit is divided into multiple network stages, and each network stage includes the above three types of residual construction subunits.
  • the three types of residual construction sub-units include at least one convolution layer, at least one normalization layer and at least one linear correction layer.
  • but the numbers of convolution layers, normalization layers and linear correction layers and the ways they are connected vary among the three types.
  • the plurality of network phases include a beginning phase, an intermediate phase, and an end phase.
  • after the server inputs the multiple video frames into the image feature extraction unit of the time point determination model, the first type, second type and third type residual construction subunits in the multiple network stages of the image feature extraction unit perform convolution, normalization and linear correction on the multiple video frames to obtain the image features of the multiple video frames.
  • the first type of residual construction subunit is also called a start residual block (Start ResBlock)
  • the second type of residual construction subunit is also called a middle residual block (Middle ResBlock)
  • the third type of residual construction subunit is also called an end residual block (End ResBlock).
  • the first type of residual construction sub-unit 501 includes a one-dimensional convolution layer 5011, a normalization layer 5012, a linear correction layer 5013, a three-dimensional convolution layer 5014, a normalization layer 5015, a linear correction layer 5016, a one-dimensional convolution layer 5017 and a normalization layer 5018.
  • the second type of residual construction subunit 502 includes a normalization layer 5021, a linear correction layer 5022, a one-dimensional convolution layer 5023, a normalization layer 5024, a linear correction layer 5025, a three-dimensional convolution layer 5026, a normalization layer 5027, a linear correction layer 5028 and a one-dimensional convolution layer 5029.
  • the third type of residual construction subunit 503 sequentially includes a normalization layer 5031, a linear correction layer 5032, a one-dimensional convolution layer 5033, a normalization layer 5034, a linear correction layer 5035, a three-dimensional convolution layer 5036, a normalization layer 5037, a linear correction layer 5038 and a one-dimensional convolution layer 5039.
  • the convolution layer is used for convolution
  • the normalization layer is used for normalization
  • the linear correction layer is used for linear correction.
  • the image feature extraction unit is a neural network IResNet (Improved Residual Networks).
  • the output result of the neural network IResNet is the image feature of the target video.
  • taking IResNet with 50 network layers as an example, the 50-layer network includes three kinds of stages, namely a beginning stage, four intermediate stages and an end stage. Each of the four intermediate stages includes multiple residual construction subunits.
  • IResNet can surpass ResNet in both accuracy and learning convergence. For example, on the ImageNet dataset, replacing a 50-layer ResNet with IResNet under the same configuration improves top-1 accuracy by 1.19% to 2.33%, and these improvements are obtained without increasing model complexity.
  • the image feature extraction unit of the time point determination model is IResNet.
  • in other possible implementations, the image feature extraction unit of the time point determination model can also have other structures, which is not limited in the embodiments of this application.
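  • The following sketch shows a pre-activation residual building block in the spirit of the second ("middle") type of residual construction subunit listed above (normalization, linear correction and convolution repeated, with a residual connection); reading the convolutions as ordinary two-dimensional convolutions and the channel width are assumptions:

        import torch.nn as nn

        class MiddleResBlock(nn.Module):
            def __init__(self, channels):
                super().__init__()
                # normalization -> linear correction -> convolution, repeated three times
                self.body = nn.Sequential(
                    nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, kernel_size=1),
                    nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                    nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                    nn.Conv2d(channels, channels, kernel_size=1),
                )

            def forward(self, x):
                # Residual connection: the block's output is added to its input.
                return x + self.body(x)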
  • Example 2 The server inputs multiple video frames into the time point determination model, and encodes the multiple video frames based on the attention mechanism through the time point determination model to obtain the image features of the multiple video frames.
  • the image features obtained by the time point determination model are also the semantic features of the corresponding content items.
  • the time point determination model is a semantic feature encoder, such as a Transformer encoder.
  • the server inputs the multiple video frames into the image feature extraction unit of the time point determination model, and performs embedding coding on the multiple video frames through the image feature extraction unit of the time point determination model to obtain multiple embedded features.
  • An embedding feature corresponds to one video frame among multiple video frames.
  • Embedding features are used to represent the position of each video frame in multiple video frames and the content of each video frame.
  • the server inputs the multiple embedded features into the time point determination model, linearly transforms the multiple embedded features using three linear transformation matrices of the time point determination model, and obtains the query (Query) vector, key vector and value vector corresponding to each of the multiple video frames. The following takes the case where the server has generated the multiple embedded features as an example.
  • through the time point determination model, the server obtains the attention weights of the multiple video frames based on the query vectors and key vectors corresponding to the multiple video frames.
  • through the time point determination model, the server obtains the attention encoding vectors of the multiple video frames based on the attention weight of each of the multiple video frames and the value vector of each of the multiple video frames.
  • the attention encoding vector of a video frame is also the image feature of that video frame.
  • through the time point determination model, the server multiplies each embedded feature with the three linear transformation matrices to obtain the query vector, key vector and value vector corresponding to each of the multiple video frames.
  • through the time point determination model, the server determines multiple attention weights of multiple other video frames for the first video frame based on the query vector of the first video frame and the key vectors of the multiple other video frames among the multiple video frames.
  • through the time point determination model, the server performs a weighted summation of the attention weights of the multiple other video frames for the first video frame and the value vectors of the multiple other video frames to obtain the attention encoding vector of the first video frame.
  • the above describes how, through the time point determination model, the server encodes the first video frame of the multiple video frames to obtain the attention encoding vector of the first video frame.
  • the way in which the server encodes other video frames of the plurality of video frames and the above-mentioned method of encoding the first video frame belong to the same inventive concept.
  • for the implementation process, please refer to the above description, which will not be repeated here.
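  • A simplified sketch of the attention encoding described above: each embedded video frame is projected into query, key and value vectors by three linear transformation matrices, attention weights come from query-key similarity, and the value vectors are summed with those weights (the sketch attends over all frames at once rather than frame by frame; the dimensions and the scaling factor are assumptions):

        import torch
        import torch.nn.functional as F

        def attend(frame_embeddings, W_q, W_k, W_v):
            # frame_embeddings: [num_frames, dim] -- embedded features of the video frames
            q = frame_embeddings @ W_q                   # query vectors
            k = frame_embeddings @ W_k                   # key vectors
            v = frame_embeddings @ W_v                   # value vectors
            scores = q @ k.t() / (k.shape[-1] ** 0.5)    # query-key similarity
            weights = F.softmax(scores, dim=-1)          # attention weights
            return weights @ v                           # attention encoding vectors (image features)

        dim = 256
        W_q, W_k, W_v = (torch.randn(dim, dim) for _ in range(3))     # learned in practice
        image_features = attend(torch.randn(64, dim), W_q, W_k, W_v)  # shape [64, 256]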
  • Example 3 The server inputs multiple video frames into the time point determination model, and performs convolution, normalization and linear correction on the multiple video frames through the time point determination model to obtain the first image features of the multiple video frames.
  • through the time point determination model, the server encodes the multiple video frames based on the attention mechanism to obtain the second image features of the multiple video frames.
  • the server fuses the first image features and the second image features of the multiple video frames to obtain the image features of the multiple video frames.
  • the time point determination model includes a first image feature extraction unit and a second image feature extraction unit.
  • the first image feature extraction unit is used to extract the first image feature of the target video
  • the second image feature extraction unit is used to extract the second image feature of the target video.
  • the server inputs the multiple video frames into the time point determination model
  • the first image features of the multiple video frames are obtained through the first image feature extraction unit
  • the second image features of the multiple video frames are obtained through the second image feature extraction unit.
  • when the server fuses the first image features and the second image features of the multiple video frames, a weighted summation method may be used.
  • the weight of the weighted summation is set by technical personnel according to the actual situation, such as setting it to 0.3, 0.5 or 0.8, etc., which is not limited in the embodiments of this application.
  • Example 4 The server inputs multiple video frames into the time point determination model, and performs full connection and pooling on the multiple video frames through the time point determination model to obtain the image features of the multiple video frames.
  • the server inputs multiple video frames into a time point determination model, performs full connections on the multiple video frames through at least one fully connected layer of the time point determination model, and obtains fully connected features of the multiple video frames.
  • through the pooling layer of the time point determination model, the server performs either maximum pooling or average pooling on the fully connected features of the multiple video frames to obtain the image features of the multiple video frames.
  • these image features are also called deep features or low-level features.
  • the server represents the video frames in the form of matrices and the image features in the form of vectors. The full connection on the video frames is implemented by multiplying the fully connected matrix with the matrix of the video frame.
  • the time point determination model is a feature extractor based on Deep Neural Networks (DNN).
  • the server can also use other structured time point determination models to obtain image features, which is not limited in the embodiments of the present application.
  • the server can also extract the subtitle features of the target video, and determine the video features of the target video by combining the audio features, image features and subtitle features of the target video, which can improve the expressive ability of the video features.
  • the server extracts audio features, image features, and subtitle features of the target video.
  • the method for the server to extract the audio features and image features of the target video belongs to the same inventive concept as the above-mentioned steps 302 and 303.
  • for the implementation process, please refer to the description of the above-mentioned steps 302 and 303, which will not be repeated here.
  • the following describes the method for the server to extract subtitle features of the target video.
  • the server inputs the subtitles of the target video into the time point determination model, and performs feature extraction on the subtitles of the target video through the time point determination model to obtain the subtitle features of the target video.
  • the time point determination model includes a subtitle feature extraction unit, and the server can extract subtitle features of the target video through the subtitle feature extraction unit.
  • the server performs embedding coding on the subtitles of the target video through the subtitle feature extraction unit to obtain the subtitle embedding features of the target video.
  • the server uses the subtitle feature extraction unit to convolve and pool the subtitle embedding features of the target video to obtain the subtitle features of the target video.
  • the server can also obtain the subtitle features of the target video through other text feature extraction methods, which is not limited in the embodiments of this application.
  • the server fuses the audio features and image features to obtain the video features of the target video.
  • the server superimposes audio features and image features to obtain video features of the target video.
  • the server adds the audio feature sequence and the image feature sequence to obtain the video feature sequence of the target video.
  • the video features of the target video fuse the audio features and the image features.
  • the video features of the target video are also called audio and video high-level semantic features of the target video.
  • each sub-feature in the video feature sequence represents the video feature of the corresponding time point in the target video, that is, the semantic information of the corresponding time point. Since the audio features and image features of the target video are combined when determining the video features of the target video, the obtained video features can reflect the characteristics of the target video in both audio and image dimensions, and the accuracy of the video features is high.
  • the server adjusts the dimensions of the audio features or the image features so that after the adjustment, the dimensions of the audio features and the image features are the same.
  • when the server extracts the subtitle features of the target video, the server fuses the audio features, image features, and subtitle features of the target video to obtain the video features of the target video.
  • when the audio feature is an audio feature sequence, the image feature is an image feature sequence, and the subtitle feature is a subtitle feature sequence, the server adds the audio feature sequence, the image feature sequence, and the subtitle feature sequence to obtain the video feature sequence of the target video. Since the audio features, image features, and subtitle features of the target video are combined when determining the video features, the obtained video features can reflect the characteristics of the target video in the three dimensions of audio, image, and subtitles, and the accuracy of the video features is relatively high.
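The fusion by element-wise addition of aligned feature sequences described above can be sketched as follows; the sequence length, feature dimensions, and the projection used to match dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Feature sequences aligned by time point; all sizes here are assumed for illustration.
audio_seq = torch.randn(32, 256)       # one audio sub-feature per time point
image_seq = torch.randn(32, 128)       # one image sub-feature per time point
subtitle_seq = torch.randn(32, 256)    # one subtitle sub-feature per time point

# Adjust dimensions so the sequences can be added element-wise
# (cf. the dimension-adjustment step mentioned above); this linear layer is an assumption.
project_image = nn.Linear(128, 256)
video_seq = audio_seq + project_image(image_seq) + subtitle_seq   # (32, 256) video feature sequence
```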
  • the above steps 302-304 are implemented by the feature extraction sub-model of the time point determination model.
  • the server encodes the video features of the target video based on the attention mechanism and obtains multiple target parameters.
  • the multiple target parameters correspond to multiple time points of the target video.
  • the target parameters are used to represent the probability of inserting background music at the corresponding time points.
  • the video feature includes multiple sub-features.
  • the server, through the time point determination model, encodes every two of the multiple sub-features based on the attention mechanism to obtain the target parameter of each sub-feature.
  • the video feature includes multiple sub-features corresponding to multiple time points of the target video.
  • One sub-feature corresponds to one time point of the target video.
  • Different sub-features correspond to different time points.
  • Each sub-feature is used to represent the video features of the target video at the corresponding time point.
  • for a first sub-feature among the multiple sub-features, the server, through the time point determination model, determines multiple attention parameters of multiple second sub-features to the first sub-feature based on the attention mechanism.
  • the server, through the time point determination model, fuses the multiple attention parameters to obtain the target parameter of the first sub-feature.
  • the manner in which the server, through the time point determination model, determines the multiple attention parameters of the multiple second sub-features to the first sub-feature based on the attention mechanism is as follows.
  • the server, through the time point determination model, performs full connection on the first sub-feature to obtain the embedded feature of the first sub-feature. For any second sub-feature among the multiple second sub-features, the server, through the time point determination model, performs full connection on the second sub-feature to obtain the embedded feature of the second sub-feature. The server, through the time point determination model, determines the similarity parameter between the first sub-feature and the second sub-feature based on the embedded feature of the first sub-feature and the embedded feature of the second sub-feature, and determines the attention parameter of the second sub-feature to the first sub-feature based on the first sub-feature and the similarity parameter between the first sub-feature and the second sub-feature.
  • the similarity parameter between the first sub-feature and the second sub-feature is used to describe the degree of similarity between the first sub-feature and the second sub-feature.
  • the similarity parameter between the first sub-feature and the second sub-feature is positively related to the degree of similarity between the first sub-feature and the second sub-feature. That is to say, the higher the similarity parameter, the higher the similarity between the first sub-feature and the second sub-feature; the lower the similarity parameter, the lower the similarity between the first sub-feature and the second sub-feature.
  • Attention parameters are also called attention weights.
  • the time point determination model includes a target parameter acquisition unit.
  • the server, through the target parameter acquisition unit of the time point determination model, performs full connection on the first sub-feature to obtain the embedded feature of the first sub-feature. That is, the server inputs the first sub-feature into the fully connected layer of the target parameter acquisition unit and multiplies the first sub-feature with the fully connected matrix of that layer to obtain the embedded feature of the first sub-feature.
  • the server inputs the second sub-feature into the fully-connected layer of the target parameter acquisition unit, multiplies the second sub-feature with the fully-connected matrix of the fully-connected layer of the target parameter acquisition unit, and obtains the embedded feature of the second sub-feature.
  • the server determines the similarity parameter between the first sub-feature and the second sub-feature based on the embedded feature of the first sub-feature and the embedded feature of the second sub-feature through the target parameter acquisition unit.
  • the similarity parameter is the dot product of the first sub-feature and the second sub-feature, or the cosine similarity between the first sub-feature and the second sub-feature, which is not limited in the embodiment of the present application.
  • the server multiplies the first sub-feature and the similarity parameter between the first sub-feature and the second sub-feature through the target parameter acquisition unit to obtain the attention parameter of the second sub-feature to the first sub-feature.
  • Figure 6 provides an architectural diagram of a target parameter acquisition unit.
  • the server inputs the video feature sequence {a1–an} of the target video into the target parameter acquisition unit.
  • the server, through the target parameter acquisition unit, determines multiple attention parameters {c12–c1n} of the multiple second sub-features {a2–an} to the first sub-feature a1 based on the attention mechanism.
  • n is the number of sub-features in the video feature, and n is a positive integer.
  • the server performs full connection (FC) on the first sub-feature a1 and the second sub-feature ai through the target parameter acquisition unit to obtain the embedded feature of the first sub-feature a1 and the embedded feature of the second sub-feature ai.
  • the server multiplies the embedded feature of the first sub-feature a1 and the embedded feature of the second sub-feature ai through the target parameter acquisition unit to obtain the similarity parameter m1i between the embedded feature of the first sub-feature a1 and the embedded feature of the second sub-feature ai.
  • the server multiplies the similarity parameter m1i and the first sub-feature a1 through the target parameter acquisition unit to obtain the attention parameter c1i of the second sub-feature ai to the first sub-feature a1.
  • i is a positive integer, 2 ⁇ i ⁇ n.
  • the server, through the time point determination model, fuses the multiple attention parameters to obtain the target parameter of the first sub-feature.
  • the target parameter of the first sub-feature is also called the attention weight of the first sub-feature, or the confidence of inserting background music at the time point corresponding to the first sub-feature.
  • the server, through the target parameter acquisition unit of the time point determination model, adds the multiple attention parameters to obtain the target parameter of the first sub-feature. That is to say, the target parameter of the first sub-feature is obtained by fusing the multiple attention parameters of the multiple second sub-features to the first sub-feature.
  • the server, through the target parameter acquisition unit, adds the multiple attention parameters {c12–c1n} of the multiple second sub-features {a2–an} to the first sub-feature a1 to obtain the target parameter w1 of the first sub-feature a1.
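A compact sketch of the target parameter computation described above (full connection to obtain embedded features, dot-product similarity, attention parameters, and their fusion) is shown below. The reduction of each fused attention vector to a scalar and the sigmoid squashing are assumptions on top of the text, which leaves the final normalization step open.

```python
import torch
import torch.nn as nn

def target_parameters(video_seq: torch.Tensor, fc: nn.Linear) -> torch.Tensor:
    """video_seq: (n, d) sequence of sub-features a1..an, one per time point."""
    emb = fc(video_seq)                      # embedded features via full connection
    sim = emb @ emb.t()                      # dot-product similarity m_ij between embeddings
    sim.fill_diagonal_(0.0)                  # keep only the "second" sub-features j != i
    # attention parameter c_ij = m_ij * a_i; target parameter w_i = sum_j c_ij.
    # Reducing each a_i to a scalar score and squashing it to (0, 1) is an assumption.
    w = (sim.sum(dim=1, keepdim=True) * video_seq).mean(dim=1)
    return torch.sigmoid(w)                  # (n,) one target parameter per time point

fc = nn.Linear(256, 256)                     # assumed fully connected layer size
params = target_parameters(torch.randn(32, 256), fc)
```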
  • the server obtains the target parameter of the first sub-feature among multiple sub-features through a time point determination model.
  • the method for the server to obtain the target parameters of other sub-features among the multiple sub-features belongs to the same inventive concept as described above, and the implementation process will not be described again.
  • the target parameters of the multiple sub-features of the video features obtained in an experiment were plotted as a line chart, and the multiple video frames of the target video, the time domain information of the multiple audio frames, and the frequency domain information of the multiple audio frames were aligned by time point, resulting in Figure 7.
  • Figure 7 includes the multiple video frames 701 of the target video, the frequency domain information 702 of the multiple audio frames, the time domain information 703 of the multiple audio frames, and a polyline 704 of the target parameters of the multiple sub-features.
  • the polyline 704 reflects the overall changes in the target parameters of the multiple sub-features.
  • the above step 305 is implemented by the target parameter determination sub-model of the time point determination model.
  • the server determines at least one candidate time point for inserting background music.
  • the candidate time point is a time point among multiple time points at which the target parameters meet the target conditions.
  • the candidate time point is also a time point determined by the server to be suitable for inserting background music.
  • Video producers can choose from candidate time points to determine the target time point for inserting background music into the target video.
  • the number of candidate time points is one or more, which is not limited in the embodiments of this application.
  • the target parameter meeting the target condition means that the target parameter is greater than or equal to the parameter threshold.
  • the parameter threshold is set by technicians according to actual conditions, and is not limited in the embodiments of this application.
  • the server determines the time point at which the target parameter is greater than or equal to the parameter threshold among multiple time points as a candidate time point for inserting background music.
  • the video producer can select among the determined candidate time points to determine the target time point at which the background music is finally inserted. For example, the server sends the candidate time points of the target video to the terminal, and the terminal displays the candidate time points of the target video to the video producer. In response to any candidate time point being selected, the terminal inserts background music at the selected candidate time point, which is also the target time point.
  • the terminal after receiving the candidate time point of the target video sent by the server, the terminal can display the candidate time point on the timeline of the target video. For example, the terminal displays the candidate time point in the form of a dot on the timeline of the target video.
  • Video producers can control the terminal to play different parts of the target video by clicking different candidate time points, and select the target time point for inserting background music from the candidate time points according to the played content. By selecting from the candidate time points, the range within which the target time point needs to be determined is greatly narrowed, and the efficiency of inserting background music is improved.
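A minimal sketch of the candidate time point selection described above (keeping the time points whose target parameter meets the target condition) might look like this; the threshold value and the one-second spacing of time points are assumptions for illustration.

```python
# Keep the time points whose target parameter is greater than or equal to the threshold.
def candidate_time_points(target_params, threshold=0.5, seconds_per_step=1.0):
    return [i * seconds_per_step
            for i, p in enumerate(target_params)
            if p >= threshold]

print(candidate_time_points([0.1, 0.8, 0.3, 0.9]))   # -> [1.0, 3.0]
```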
  • the server obtains the target video 801.
  • the server performs feature extraction on the video track (multiple video frames) of the target video to obtain image features 802 of the target video.
  • the server uses the IResNet model to extract features from the video track of the target video.
  • the server performs feature extraction on the audio track (multiple audio frames) of the target video to obtain the audio features of the target video 803.
  • the server uses the PANNs model to extract features from the audio track of the target video.
  • the server, through the time point determination model, fuses the image features 802 and the audio features 803 of the target video to obtain the video features 804 of the target video. Based on the attention mechanism, the server encodes every two sub-features in the video features 804 to obtain the target parameter 805 of each sub-feature.
  • the above description takes the server as the execution subject as an example.
  • the technical solutions provided by the embodiments of the present application can also be executed by the terminal, and the embodiments of the present application do not limit this.
  • the audio features and image features of the target video are combined to determine the video features of the target video.
  • the video features can more accurately represent the content of the target video.
  • multiple target parameters can be obtained, which represent the probability of inserting background music at the corresponding time point.
  • candidate time points can be determined from multiple time points, which are the time points at which background music can be inserted into the target video.
  • the video producer does not need to watch the target video completely and only needs to select among the determined candidate time points, which improves the efficiency of inserting background music into the video.
  • a fully automatic method for determining the position of the interlude (background music) is provided.
  • This solution can automatically determine the interlude positions of a video from the high-level semantic features of its audio and video.
  • It provides alternative interlude positions for post-production or video re-creation, eliminating manual selection and greatly reducing the cost of video production.
  • the time point determination model is used to locate the positions where background music is inserted, which makes the computation modular and scientific and avoids time point differences caused by differences in human perception.
  • the audio features and image features of the target video are combined to determine the video features of the target video.
  • the video features can more accurately represent the content of the target video.
  • the video features are encoded based on the attention mechanism to obtain multiple target parameters, which represent the probability of inserting background music at the corresponding time point.
  • a candidate time point is determined from the multiple time points, which is also a time point at which background music can be inserted into the target video.
  • the determined candidate time points are more accurate.
  • the video producer does not need to watch the target video completely, and only needs to select among the identified candidate time points. This improves the efficiency of inserting background music into the video while ensuring accuracy.
  • the above steps 301-306 describe the process in which the server uses the time point determination model to obtain the candidate time points of the target video.
  • the following describes the method of training the time point determination model, taking the server as the execution subject as an example; see Figure 9.
  • the method includes the following steps.
  • the server inputs the sample video into the time point determination model, separates the audio of the sample video through the time point determination model, and obtains the original audio and background music of the sample video.
  • the server inputs the sample video into the time point determination model, and performs feature extraction on the sample frequency domain information of multiple sample audio frames of the sample video through the time point determination model to obtain the first audio feature of the sample video.
  • the server, through the time point determination model, pools the first audio features at multiple scales to obtain multiple second audio features of the sample video.
  • the server, through the time point determination model, fuses the multiple second audio features to obtain the audio separation features of the sample video.
  • the server, through the time point determination model, separates the sample frequency domain information based on the audio separation features to obtain the original audio and background music of the sample video.
  • the server inputs the sample video into the time point determination model, and performs feature extraction on the sample frequency domain information of multiple sample audio frames of the sample video through the time point determination model to obtain the first audio feature of the sample video.
  • the server inputs the time domain information of multiple sample audio frames of the sample video into the time point determination model, and converts the time domain information of the multiple sample audio frames into the frequency domain information of the multiple sample audio frames through the time point determination model.
  • the server, through the time point determination model, convolves the frequency domain information of the multiple sample audio frames to obtain the first audio feature of the sample video.
  • the time point determination model uses a dilated convolution kernel when convolving the frequency domain information of multiple sample audio frames. For example, referring to Figure 10, the server convolves the frequency domain information 1001 of multiple sample audio frames through the audio separation unit of the time point determination model to obtain the first audio feature 1002 of the sample video.
  • the server, through the time point determination model, pools the first audio features at multiple scales to obtain multiple second audio features of the sample video.
  • When the server pools the first audio features at different scales, it obtains second audio features of different sizes. That is, one scale corresponds to one size, and the multiple second audio features are multiple second audio features of different sizes.
  • This pooling method based on different scales is also called pyramid pooling.
  • the server pools the first audio features at multiple scales through multiple pooling kernels of the time point determination model to obtain multiple second audio features of the sample video.
  • the multiple pooling kernels correspond to the multiple scales.
  • the server pools the first audio features 1001 at multiple scales through the multiple pooling kernels of the audio separation unit of the time point determination model to obtain multiple second audio features 1002 of the sample video.
  • the sizes of the plurality of second audio features 1002 are all different.
  • the server, through the time point determination model, fuses the multiple second audio features to obtain the audio separation features of the sample video.
  • the server, through the time point determination model, convolves the multiple second audio features to obtain multiple third audio features of the sample video.
  • the server, through the time point determination model, upsamples the multiple third audio features to obtain multiple fourth audio features of the sample video.
  • the multiple fourth audio features are all the same size as the first audio features.
  • the server, through the time point determination model, fuses the multiple fourth audio features with the first audio features to obtain the audio separation features of the sample video. For example, referring to Figure 10, the server convolves the multiple second audio features 1002 through the audio separation unit of the time point determination model to obtain multiple third audio features 1003 of the sample video.
  • the server, through the time point determination model, upsamples the multiple third audio features 1003 to obtain multiple fourth audio features 1004 of the sample video.
  • the server, through the time point determination model, fuses the multiple fourth audio features 1004 with the first audio features 1001 and then performs convolution to obtain the audio separation features of the sample video.
  • the above implementation is implemented by an audio separation sub-model of the time point determination model.
  • the audio separation sub-model is Pyramid Scene Parsing Network (PSPnet).
  • feature maps of different scales generated by pyramid pooling are finally spliced together and then input to the fully connected layer for classification.
  • the pyramid structure fuses features at four different scales: the first layer is a single global pooling output at the coarsest scale; the other layers divide the first audio feature map into second audio features of different scales and form pooled representations for different positions of the first audio feature.
  • the pooling kernel covers all, half and small parts of the first audio feature.
  • if the pyramid structure has a total of N scales, a 1×1 convolution is used after each scale to reduce the number of channels of that scale to 1/N of the original, where N is a positive integer.
  • the low-dimensional features are then directly upsampled through bilinear interpolation to obtain features of the same size as the original features. Finally, the features of different scales are spliced together as the final audio separation features.
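The pyramid pooling described above can be sketched as a PSPNet-style module as follows; the channel count, the pooling bin sizes, and the use of average pooling at each scale are illustrative assumptions rather than the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pyramid pooling over the first audio feature map, in the spirit of PSPNet."""
    def __init__(self, in_channels=128, bins=(1, 2, 3, 6)):
        super().__init__()
        out_c = in_channels // len(bins)           # reduce each scale to 1/N of the channels
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),            # pooling kernel for this scale
                          nn.Conv2d(in_channels, out_c, 1))   # 1x1 convolution per scale
            for b in bins)

    def forward(self, x):                          # x: first audio feature, (B, C, H, W)
        feats = [x]
        for stage in self.stages:
            y = stage(x)                                          # pooled + convolved scale feature
            feats.append(F.interpolate(y, size=x.shape[2:],       # bilinear upsampling back to input size
                                       mode="bilinear", align_corners=False))
        return torch.cat(feats, dim=1)             # spliced features -> audio separation feature

sep_feat = PyramidPooling()(torch.randn(1, 128, 64, 64))
```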
  • the server, through the time point determination model, separates the sample frequency domain information based on the audio separation features to obtain the original audio and background music of the sample video.
  • the server, through the time point determination model, determines the boundary information of the sample frequency domain information based on the audio separation features; the boundary information is used to represent the boundary between the original audio and the background music in the sample frequency domain information. The server, through the time point determination model, processes the sample frequency domain information based on the boundary information to obtain the original audio and background music of the sample video.
  • the server adds labels to multiple time points of the sample video based on the appearance time of the background music of the sample video in the sample video. Since the labels of the time points are used to represent the appearance time of the background music in the sample video, after the server separates the background music from the original audio of the sample video, labels can be added to the multiple time points directly based on the appearance time of the separated background music in the sample video. There is no need for technicians to add labels manually, so labeling is more efficient.
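A minimal sketch of deriving time point labels from the separated background music, as described above, is shown below; representing the background music's appearance time as (start, end) intervals and spacing time points one second apart are assumptions for illustration.

```python
# A time point is labeled 1 if it falls inside any interval where the separated
# background music is present, else 0.
def labels_from_bgm(num_time_points, bgm_intervals, seconds_per_step=1.0):
    labels = []
    for i in range(num_time_points):
        t = i * seconds_per_step
        labels.append(1 if any(start <= t < end for start, end in bgm_intervals) else 0)
    return labels

print(labels_from_bgm(6, [(1.0, 3.0)]))   # -> [0, 1, 1, 0, 0, 0]
```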
  • step 901 is an optional step.
  • the server can remove the background music in the sample video by executing step 901, so that the time point determination model is not affected by the existing background music during the training phase.
  • the server does not need to perform step 901 and can directly perform the following step 902.
  • the original audio in the following step 902 is also the audio of the sample video.
  • the server, through the time point determination model, performs feature extraction on the original audio and multiple sample video frames of the sample video to obtain the sample audio features and sample image features of the sample video.
  • the original audio of the sample video includes multiple sample audio frames of the sample video.
  • the server performs feature extraction on the original audio and multiple sample video frames of the sample video, and obtains the sample audio features and sample image features of the sample video.
  • This method belongs to the same inventive concept as the above-mentioned steps 302 and 303.
  • For the implementation process please refer to the above-mentioned steps 302 and 303. The description will not be repeated here.
  • the server, through the time point determination model, fuses the sample audio features and the sample image features to obtain the video features of the sample video.
  • the manner in which the server, through the time point determination model, fuses the sample audio features and the sample image features to obtain the video features of the sample video belongs to the same inventive concept as the above-mentioned step 304.
  • for the implementation process, please refer to the description of the above-mentioned step 304, which will not be repeated here.
  • the server, through the time point determination model, encodes the video features of the sample video based on the attention mechanism to obtain multiple sample parameters.
  • the multiple sample parameters correspond to multiple time points of the sample video.
  • the sample parameters are used to represent the probability of inserting background music at the corresponding time points.
  • the process in which the server, through the time point determination model, encodes the video features of the sample video based on the attention mechanism to obtain the multiple sample parameters belongs to the same inventive concept as the above-mentioned step 305; for the implementation process, please refer to the description of step 305, which will not be repeated here.
  • the server trains a time point determination model based on the difference information between labels of multiple time points of the sample video and multiple sample parameters.
  • the labels are used to represent the appearance time of the background music in the sample video.
  • the sample parameters are used to represent the probability of inserting background music at the corresponding time point.
  • the sample parameters are positively related to the probability of inserting background music at the corresponding time point. That is to say, the larger the sample parameter, the higher the probability of inserting background music at the corresponding time point; the smaller the sample parameter, the lower the probability of inserting background music at the corresponding time point.
  • The labels are used to indicate when the background music appears in the sample video. Training the time point determination model based on the difference information between the labels and the sample parameters enables the model to learn the appearance time of background music in sample videos, so that it can output candidate time points during use.
  • the server constructs a target loss function based on difference information between labels at multiple time points of the sample video and multiple sample parameters.
  • the server uses the gradient descent method to train the time point determination model based on the target loss function.
  • the server normalizes multiple sample parameters so that the multiple sample parameters are within the target range.
  • the labels of multiple time points include the maximum and minimum values of the target range. The maximum value indicates that background music appears at the corresponding time point, and the minimum value indicates that no background music appears at the corresponding time point.
  • the purpose of training the time point determination model based on the normalized multiple sample parameters and multiple time point labels is to make the determined sample parameters as close as possible to the maximum or minimum value of the target range after normalization.
  • when background music appears at a time point, the purpose of training is to make the sample parameter of that time point as close as possible to the maximum value of the target range; when no background music appears at a time point, the purpose of training is to make the sample parameter of that time point as close as possible to the minimum value of the target range.
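A minimal sketch of the training objective described above: the sample parameters are normalized into the target range (0, 1) and pulled toward the labels (1 where background music appears, 0 where it does not). Binary cross-entropy is an assumed concrete choice of target loss function; the text only requires a loss built from the difference information.

```python
import torch
import torch.nn.functional as F

sample_params = torch.randn(32, requires_grad=True)   # raw per-time-point scores (assumed shape)
labels = torch.randint(0, 2, (32,)).float()           # 1 = background music appears at this time point

probs = torch.sigmoid(sample_params)                  # normalization into the target range (0, 1)
loss = F.binary_cross_entropy(probs, labels)          # assumed target loss function
loss.backward()                                       # a gradient descent update would follow
```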
  • the server obtains sample video 1101 from the sample video collection.
  • the server performs feature extraction on the video track (multiple video frames) of the sample video to obtain sample image features 1102 of the sample video.
  • the server uses the IResNet model to extract features from the video track of the sample video.
  • the server performs audio separation on the audio track (multiple audio frames) of the sample video to obtain the original audio 1103 and background music 1104 of the sample video.
  • the server performs feature extraction on the original audio to obtain sample audio features 1105 of the sample video.
  • the server uses the PANNs model to extract features from the audio track of the sample video.
  • the server, through the time point determination model, fuses the sample image features 1102 and the sample audio features 1105 of the sample video to obtain the video features 1106 of the sample video.
  • the server encodes every two sub-features in the video feature 1106 to obtain the sample parameters 1107 of each sub-feature.
  • the server adds tags to multiple time points of the sample video based on the appearance time of the background music 1104 in the sample video.
  • the server constructs a loss function based on the difference between the multiple sample parameters and the labels of the multiple time points, and trains the time point determination model based on the loss function.
  • in the related art, information about manually annotated time points is often used as labels to participate in model training.
  • the technical solution provided by the embodiments of this application uses an audio separation sub-model built on a semantic segmentation model to perform audio separation on the audio track of the sample video, separates out the background music originally present in the audio track, and computes its time positions, which directly participate in model training as the time point labels. This method allows the model to learn, from sample videos, the habits humans follow when adding interlude positions. At the same time, using the original audio obtained by separating out the background music for model training makes the original audio closer to the audio encountered at actual inference time, so that the time point determination model can learn more accurate audio features.
  • an attention-based mechanism is used to determine the target parameters based on the video feature sequence, that is, the confidence that each time point in the entire video feature sequence can be used as a candidate time point is calculated.
  • This mechanism allows the time point determination model to calculate the attention parameters between every two time points on the entire video feature sequence, and can more accurately train the positioning ability of the time point determination model.
  • Figure 12 is a schematic structural diagram of a device for determining the insertion time point of background music provided by an embodiment of the present application.
  • the device includes: a feature extraction module 1201, a feature fusion module 1202, an encoding module 1203, and a candidate time point determination module 1204.
  • Feature extraction module 1201 is used to extract audio features and image features of the target video.
  • the feature fusion module 1202 is used to fuse audio features and image features to obtain video features of the target video.
  • Encoding module 1203 is used to encode the video features of the target video based on the attention mechanism to obtain multiple target parameters.
  • the multiple target parameters correspond to multiple time points of the target video.
  • the target parameters are used to represent the probability of inserting background music at the corresponding time points.
  • the candidate time point determination module 1204 is used to determine at least one candidate time point for inserting background music.
  • the candidate time point is a time point among multiple time points at which the target parameter meets the target condition.
  • the feature extraction module 1201 is used to extract features from multiple audio frames of the target video to obtain the audio features of the target video. Feature extraction is performed on multiple video frames of the target video to obtain the image features of the target video.
  • the feature extraction module 1201 is used to extract features from the time domain information of multiple audio frames to obtain the time domain audio features of the multiple audio frames. Feature extraction is performed on frequency domain information of multiple audio frames to obtain frequency domain audio features of multiple audio frames. Based on the time domain audio features and frequency domain audio features of multiple audio frames, the audio features of the target video are obtained.
  • the feature extraction module 1201 is configured to use multiple one-dimensional convolution kernels to perform feature extraction on the time domain information of multiple audio frames to obtain the time domain audio features of the multiple audio frames. Extracting features from frequency domain information of multiple audio frames to obtain frequency domain audio features of multiple audio frames includes: using multiple two-dimensional convolution kernels to extract features from frequency domain information of multiple audio frames to obtain multiple audio frames. frequency domain audio characteristics.
  • the feature fusion module 1202 is used to fuse the time domain audio features and the frequency domain audio features of the multiple audio frames to obtain the initial audio features of the target video; perform maximum pooling and mean pooling on the initial audio features respectively to obtain the first pooling feature and the second pooling feature of the target video; and fuse the first pooling feature and the second pooling feature to obtain the audio features of the target video.
  • the video features include multiple sub-features, and the multiple sub-features correspond to multiple time points of the target video.
  • the encoding module 1203 is used to, through the time point determination model, encode every two of the multiple sub-features based on the attention mechanism to obtain the target parameter of each sub-feature.
  • the encoding module 1203 is configured to determine, for a first sub-feature among the plurality of sub-features, a plurality of attention parameters for the first sub-feature from a plurality of second sub-features among the plurality of sub-features based on an attention mechanism. Multiple attention parameters are fused to obtain the target parameters of the first sub-feature.
  • the encoding module 1203 is used to perform full connection on the first sub-feature to obtain the embedded feature of the first sub-feature. For any second sub-feature among the plurality of second sub-features, the second sub-feature is fully connected to obtain the embedded feature of the second sub-feature. Based on the embedded feature of the first sub-feature and the embedded feature of the second sub-feature, a similarity parameter between the first sub-feature and the second sub-feature is determined. Based on the first sub-feature and the similarity parameter between the first sub-feature and the second sub-feature, an attention parameter of the second sub-feature to the first sub-feature is determined.
  • the device further includes:
  • the training module is used to input the sample video into the time point determination model, extract features from the sample video through the time point determination model, and obtain the sample audio features and sample image features of the sample video.
  • the sample audio features and sample image features are fused to obtain the video features of the sample video.
  • the video features of the sample video are encoded based on the attention mechanism to obtain multiple sample parameters.
  • the multiple sample parameters correspond to multiple time points of the sample video.
  • the sample parameters are used to represent the probability of inserting background music at the corresponding time points.
  • the time point determination model is trained based on the difference information between the labels of multiple time points of the sample video and the multiple sample parameters; the labels are used to represent the appearance time of the background music in the sample video.
  • the device further includes:
  • the audio separation module is used to separate the audio of the sample video through the time point determination model to obtain the original audio and background music of the sample video.
  • the training module is also used to, through the time point determination model, perform feature extraction on the original audio and multiple sample video frames of the sample video to obtain the sample audio features and sample image features of the sample video.
  • the audio separation module is used to perform feature extraction on sample frequency domain information of multiple sample audio frames of the sample video through a time point determination model to obtain the first audio feature of the sample video.
  • the first audio features are pooled using multiple scales to obtain multiple second audio features of the sample video.
  • multiple second audio features are fused to obtain the audio separation features of the sample video.
  • the sample frequency domain information is separated based on the audio separation characteristics, and the original audio and background music of the sample video are obtained.
  • the audio separation module is used to convolve multiple second audio features to obtain multiple third audio features of the sample video.
  • the plurality of third audio features are upsampled to obtain a plurality of fourth audio features of the sample video, and the sizes of the plurality of fourth audio features are the same as the first audio features.
  • the plurality of fourth audio features are fused with the first audio features to obtain audio separation features of the sample video.
  • the audio separation module is used to determine the boundary information of the sample frequency domain information based on the audio separation feature, and the boundary information is used to represent the boundary between the original audio and the background music in the sample frequency domain information.
  • the sample frequency domain information is processed based on the boundary information to obtain the original audio and background music of the sample video.
  • the device further includes:
  • the tag adding module is used to add tags to multiple time points of the sample video based on the appearance time of the background music of the sample video in the sample video.
  • the feature extraction module 1201 is also used to extract audio features, image features and subtitle features of the target video.
  • the feature fusion module 1202 is also used to fuse the audio features, image features, and subtitle features of the target video to obtain the video features of the target video.
  • when the device for determining the insertion time point of background music provided in the above embodiment determines the insertion time point of background music, the division into the above functional modules is only used as an example. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the computer device is divided into different functional modules to complete all or part of the functions described above.
  • the device for determining the insertion time point of background music provided in the above embodiments and the embodiment of the method for determining the insertion time point of background music belong to the same concept. The specific implementation process can be found in the method embodiments and will not be described again here.
  • the audio features and image features of the target video are combined to determine the video features of the target video.
  • the video features can more accurately represent the content of the target video.
  • the video features are encoded based on the attention mechanism to obtain multiple target parameters, which represent the probability of inserting background music at the corresponding time point.
  • candidate time points are determined from multiple time points.
  • the candidate time points are also time points at which background music can be inserted into the target video.
  • the determined candidate time points are more accurate.
  • the video producer does not need to watch the target video completely, and only needs to select among the identified candidate time points. This improves the efficiency of inserting background music into the video while ensuring accuracy.
  • Embodiments of the present application provide a computer device for executing the above method.
  • the computer device can be implemented as a terminal or a server.
  • the structure of the terminal is first introduced below:
  • Figure 13 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • the terminal 1300 includes: one or more processors 1301 and one or more memories 1302.
  • the processor 1301 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc.
  • the processor 1301 can adopt at least one hardware form among DSP (Digital Signal Processing, digital signal processing), FPGA (Field-Programmable Gate Array, field programmable gate array), and PLA (Programmable Logic Array, programmable logic array).
  • the processor 1301 can also include a main processor and a co-processor.
  • the main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the co-processor is a low-power processor used to process data in the standby state.
  • the processor 1301 may be integrated with a GPU (Graphics Processing Unit, image processor), and the GPU is responsible for rendering and drawing content that needs to be displayed on the display screen.
  • the processor 1301 may also include an AI (Artificial Intelligence, artificial intelligence) processor, which is used to process computing operations related to machine learning.
  • Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. Memory 1302 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1302 is used to store at least one computer program, and the at least one computer program is to be executed by the processor 1301 to implement the method for determining the insertion time point of background music provided by the method embodiments in this application.
  • the terminal 1300 optionally further includes: a peripheral device interface 1303 and at least one peripheral device.
  • the processor 1301, the memory 1302 and the peripheral device interface 1303 may be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board.
  • the peripheral device includes: at least one of a display screen 1305, an audio circuit 1307, and a power supply 1308.
  • the peripheral device interface 1303 may be used to connect at least one I/O (Input/Output) related peripheral device to the processor 1301 and the memory 1302 .
  • in some embodiments, the processor 1301, the memory 1302, and the peripheral device interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral device interface 1303 can be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the display screen 1305 is used to display UI (User Interface, user interface).
  • the UI can include graphics, text, icons, videos, and any combination thereof.
  • the display screen 1305 also has the capability to collect touch signals on or above the surface of the display screen 1305.
  • the touch signal can be input to the processor 1301 as a control signal for processing.
  • the display screen 1305 can also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
  • Audio circuitry 1307 may include a microphone and speakers.
  • the microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals that are input to the processor 1301 for processing, or to the radio frequency circuit 1304 to implement voice communication.
  • the power supply 1308 is used to power various components in the terminal 1300.
  • Power source 1308 may be AC, DC, disposable batteries, or rechargeable batteries.
  • the structure shown in FIG. 13 does not constitute a limitation on the terminal 1300, which may include more or fewer components than shown, or combine certain components, or adopt a different component arrangement.
  • the above computer equipment can also be implemented as a server.
  • the structure of the server is introduced below:
  • FIG 14 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1400 may vary greatly due to different configurations or performance, and may include one or more processors (Central Processing Units, CPUs) 1401 and one or more memories 1402, where at least one computer program is stored in the one or more memories 1402, and the at least one computer program is loaded and executed by the one or more processors 1401 to implement the methods provided by each of the above method embodiments.
  • the server 1400 may also have components such as wired or wireless network interfaces, keyboards, and input and output interfaces for input and output.
  • the server 1400 may also include other components for implementing device functions, which will not be described again here.
  • a computer-readable storage medium is also provided. At least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the method for determining the insertion time point of background music in the above embodiments.
  • the computer-readable storage medium can be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, etc.
  • a computer program product is also provided, including a computer program that, when executed by a processor, implements the above method for determining the insertion time point of background music.
  • the computer program involved in the embodiments of this application may be deployed and executed on one computer device, or executed on multiple computer devices located at one site, or executed on multiple computer devices distributed at multiple sites and interconnected through a communication network; the multiple computer devices distributed at multiple sites and interconnected through a communication network can form a blockchain system.
  • the program can be stored in a computer-readable storage medium.
  • the storage medium can be read-only memory, magnetic disk or optical disk, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)
  • Studio Circuits (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)
  • Image Analysis (AREA)

Abstract

A method, apparatus, device and storage medium for determining the insertion time point of background music. The method includes: extracting audio features and image features of a target video (201); fusing the audio features and the image features to obtain video features of the target video (202); encoding the video features of the target video based on an attention mechanism to obtain multiple target parameters, the multiple target parameters corresponding to multiple time points of the target video, and the target parameters representing the probability of inserting background music at the corresponding time points (203); and determining at least one candidate time point for inserting background music, the candidate time point being a time point among the multiple time points at which the target parameter meets a target condition (204).

Description

Method, apparatus, device and storage medium for determining the insertion time point of background music
This application claims priority to Chinese patent application No. 202210393110.3, filed on April 15, 2022 and entitled "Method, apparatus, device and storage medium for determining the insertion time point of background music", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of computer technology, and in particular to a method, apparatus, device and storage medium for determining the insertion time point of background music.
Background
With the development of Internet technology, watching videos has become a common form of entertainment. During video production, background music is often inserted into a video to set the atmosphere of the video and thereby improve its viewing experience.
Summary
Embodiments of this application provide a method, apparatus, device and storage medium for determining the insertion time point of background music, which can improve the efficiency of inserting background music into a video. The technical solution is as follows.
In one aspect, a method for determining the insertion time point of background music is provided, the method including:
extracting audio features and image features of a target video;
fusing the audio features and the image features to obtain video features of the target video;
encoding the video features of the target video based on an attention mechanism to obtain multiple target parameters, the multiple target parameters corresponding to multiple time points of the target video, and the target parameters being used to represent the probability of inserting background music at the corresponding time points; and
determining at least one candidate time point for inserting background music, the candidate time point being a time point among the multiple time points at which the target parameter meets a target condition.
In one aspect, an apparatus for determining the insertion time point of background music is provided, the apparatus including:
a feature extraction module, configured to extract audio features and image features of a target video;
a feature fusion module, configured to fuse the audio features and the image features to obtain video features of the target video;
an encoding module, configured to encode the video features of the target video based on an attention mechanism to obtain multiple target parameters, the multiple target parameters corresponding to multiple time points of the target video, and the target parameters being used to represent the probability of inserting background music at the corresponding time points; and
a candidate time point determination module, configured to determine at least one candidate time point for inserting background music, the candidate time point being a time point among the multiple time points at which the target parameter meets a target condition.
In one aspect, a computer device is provided, the computer device including one or more processors and one or more memories, where at least one computer program is stored in the one or more memories, and the computer program is loaded and executed by the one or more processors to implement the method for determining the insertion time point of background music.
In one aspect, a computer-readable storage medium is provided, where at least one computer program is stored in the computer-readable storage medium, and the computer program is loaded and executed by a processor to implement the method for determining the insertion time point of background music.
In one aspect, a computer program product is provided, including a computer program that, when executed by a processor, implements the above method for determining the insertion time point of background music.
Brief Description of the Drawings
Figure 1 is a schematic diagram of an implementation environment of a method for determining the insertion time point of background music provided by an embodiment of this application;
Figure 2 is a flowchart of a method for determining the insertion time point of background music provided by an embodiment of this application;
Figure 3 is a flowchart of another method for determining the insertion time point of background music provided by an embodiment of this application;
Figure 4 is a schematic structural diagram of a feature extraction unit provided by an embodiment of this application;
Figure 5 is a schematic structural diagram of a residual construction subunit provided by an embodiment of this application;
Figure 6 is a schematic structural diagram of a target parameter acquisition unit provided by an embodiment of this application;
Figure 7 is a schematic diagram of an effect provided by an embodiment of this application;
Figure 8 is a flowchart of another method for determining the insertion time point of background music provided by an embodiment of this application;
Figure 9 is a flowchart of a method for training a time point determination model provided by an embodiment of this application;
Figure 10 is a schematic structural diagram of an audio separation unit provided by an embodiment of this application;
Figure 11 is a flowchart of another method for training a time point determination model provided by an embodiment of this application;
Figure 12 is a schematic structural diagram of an apparatus for determining the insertion time point of background music provided by an embodiment of this application;
Figure 13 is a schematic structural diagram of a terminal provided by an embodiment of this application;
Figure 14 is a schematic structural diagram of a server provided by an embodiment of this application.
Detailed Description
Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Machine Learning (ML) is a multidisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge sub-models to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration.
Semantic features: features used to represent the semantics expressed by text. Different texts can correspond to the same semantic feature; for example, the texts "how is the weather today" and "what is the weather like today" can correspond to the same semantic feature. A computer device can map the characters in the text into character vectors, and combine and operate on the character vectors according to the relationship between the characters to obtain the semantic features of the text. For example, the computer device can use Bidirectional Encoder Representations from Transformers (BERT).
Normalization: mapping sequences of numbers with different value ranges onto the interval (0, 1) to facilitate data processing. In some cases, the normalized values can be directly treated as probabilities.
Embedded coding: embedded coding mathematically represents a correspondence, that is, data in space X is mapped to space Y through a function F, where F is an injective function and the mapping is structure-preserving. Injective means that the mapped data corresponds uniquely to the data before mapping; structure-preserving means that the order relationship of the data before mapping is the same as the order relationship of the data after mapping. For example, data X1 and X2 exist before mapping, and Y1 corresponding to X1 and Y2 corresponding to X2 are obtained after mapping; if X1 > X2 before mapping, then correspondingly Y1 is greater than Y2 after mapping. For words, this means mapping words to another space to facilitate subsequent machine learning and processing.
Attention weight: can represent the importance of a certain piece of data in the training or prediction process, where importance indicates how much the input data influences the output data. Data with high importance has a high attention weight value, and data with low importance has a low attention weight value. The importance of data differs in different scenarios, and the process of training the attention weights of a model is the process of determining the importance of the data.
It should be noted that the information (including but not limited to user equipment information and user personal information), data (including but not limited to data used for analysis, stored data and displayed data) and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
The implementation environment of this application is introduced below.
The method for determining the insertion time point of background music provided by the embodiments of this application can be executed by a computer device. In some embodiments, the computer device is a terminal or a server. Figure 1 is a schematic diagram of an implementation environment of the method provided by an embodiment of this application; referring to Figure 1, the implementation environment may include a terminal 110 and a server 140.
The terminal 110 is connected to the server 140 through a wireless or wired network. Optionally, the terminal 110 is a vehicle-mounted terminal, a smartphone, a tablet computer, a laptop, a desktop computer, a smart speaker, a smart watch, a smart TV or the like, but is not limited thereto. The terminal 110 installs and runs an application that supports determining the time point at which to insert background music.
The server 140 is an independent physical server, or a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. The server 140 provides background services for the application running on the terminal 110.
Optionally, the numbers of terminals 110 and servers 140 are not limited.
After introducing the implementation environment of the embodiments of this application, the application scenarios of the embodiments are introduced below in combination with the above implementation environment. In the following description, the terminal is the terminal 110 in the above implementation environment, and the server is the server 140 in the above implementation environment.
The technical solutions provided by the embodiments of this application can be applied to scenarios where background music is inserted into various types of videos, for example, into film and television works, or into short videos.
In the scenario of inserting background music into a film or television work, the video producer selects, through the terminal, the film or television work into which background music is to be inserted; this work is the target video. The terminal sends the work to the server, and the server processes it to obtain candidate time points in the work, which are time points at which background music can be inserted. The server sends the candidate time points to the terminal, and the terminal displays them. The video producer can select the target time point for inserting background music from the candidate time points displayed on the terminal. After the video producer selects the work through the terminal, the server can directly determine the candidate time points, without the video producer having to watch the entire work before determining them, which greatly improves the efficiency of inserting background music into film and television works.
In the scenario of inserting background music into a short video, the short video creator selects, through the terminal, the short video into which background music is to be inserted; this short video is the target video. The terminal sends the short video to the server, and the server processes it to obtain candidate time points in the short video, which are time points at which background music can be inserted. The server sends the candidate time points to the terminal, and the terminal displays them. The short video creator can select the target time point for inserting background music from the candidate time points displayed on the terminal. After the short video creator selects the short video through the terminal, the server can directly determine the candidate time points, without the creator having to search over the entire short video, which greatly improves the efficiency of inserting background music into short videos.
It should be noted that, in addition to inserting background music into film and television works or short videos, the technical solutions provided by the embodiments of this application can also be applied to scenarios where background music is inserted into other types of videos, which is not limited in the embodiments of this application.
After introducing the implementation environment and application scenarios of the embodiments of this application, the technical solutions provided by the embodiments are introduced below. Referring to Figure 2, the technical solutions can be executed by a terminal or a server, or jointly by a terminal and a server. In this embodiment, the server is taken as the execution subject as an example, and the method includes the following steps.
201. The server extracts audio features and image features of a target video.
The target video is a video into which background music is to be inserted, for example a film or television work into which background music has not yet been inserted, or a video clip used in secondary creation, which is not limited in the embodiments of this application. The audio features can reflect the audio characteristics of the target video and are also called auditory features; the image features can reflect the image characteristics of the target video and are also called visual features.
202. The server fuses the audio features and the image features to obtain video features of the target video.
In the process of obtaining the video features of the target video, the audio features and image features of the target video are fused, so that the video features can reflect the characteristics of the target video in both the auditory and visual dimensions, and the video features have strong expressive power.
203. The server encodes the video features of the target video based on an attention mechanism to obtain multiple target parameters; the multiple target parameters correspond to multiple time points of the target video, and each target parameter is used to represent the probability of inserting background music at the corresponding time point.
When the video features of the target video are processed based on the attention mechanism, the information in the video features can be fully utilized, improving the accuracy of the determined target parameters.
204. The server determines at least one candidate time point for inserting background music; a candidate time point is a time point among the multiple time points at which the target parameter meets a target condition.
A candidate time point is a time point with a relatively high probability of inserting background music, and the video producer can select the target time point for inserting background music from the determined candidate time points.
With the technical solutions provided by the embodiments of this application, the audio features and image features of the target video are combined to determine the video features of the target video, and the video features can represent the content of the target video more accurately. The video features are encoded based on the attention mechanism to obtain multiple target parameters, which represent the probability of inserting background music at the corresponding time points. Based on the target parameters of the multiple time points, candidate time points are determined from the multiple time points, which are time points at which background music can be inserted into the target video. Because the attention mechanism is incorporated in the process of determining the candidate time points, the determined candidate time points are relatively accurate. At the same time, when inserting background music, the video producer does not need to watch the entire target video and only needs to select among the determined candidate time points, which improves the efficiency of inserting background music into the video while ensuring accuracy.
It should be noted that the above steps 201-204 are a brief description of the technical solution provided by the embodiments of this application. The technical solution is described in more detail below with some examples; see Figure 3. The technical solution can be executed by a terminal or a server, or jointly by a terminal and a server. In this embodiment, the technical solution being executed jointly by the terminal and the server is taken as an example, and the method includes the following steps.
301. The server obtains the target video.
The target video is a video into which background music is to be inserted. In some embodiments, the target video is a film or television work such as a movie or a TV series, or another type of video such as a short video, which is not limited in the embodiments of this application.
In some embodiments, in response to an operation on the target video, the terminal sends the target video to the server. In this implementation, the video producer can control the terminal to send the target video to the server by operating on the target video and can select the target video by themselves, so the efficiency of human-computer interaction is high.
For example, the terminal displays a video selection page that includes multiple candidate videos. In response to a click operation on the target video among the multiple candidate videos, the terminal sends the target video to the server, and the server obtains the target video. In this case, the multiple candidate videos are videos stored on the terminal. When the multiple candidate videos are videos stored on the server, in response to a click operation on the target video on the video selection page, the terminal sends a video selection instruction carrying the identifier of the target video to the server. After receiving the video selection instruction, the server obtains the identifier of the target video from the instruction, performs a query based on the identifier, and obtains the target video.
302、服务器对目标视频的多个音频帧进行特征提取,得到目标视频的音频特征。
在一些实施例中,服务器对该多个音频帧的时域信息进行特征提取,得到该多个音频帧的时域音频特征。服务器对该多个音频帧的频域信息进行特征提取,得到该多个音频帧的频域音频特征。服务器基于该多个音频帧的时域音频特征和频域音频特征,获取该目标视频的音频特征。
在这种实施方式下,服务器能够提取该目标视频的多个音频帧的时域音频特征和频域音频特征,该音频特征能够更加准确地反映目标视频的音频特性。
在一些实施例中,服务器上部署有时间点确定模型,服务器通过该时间点确定模型来实现上述实施方式。其中,时间点确定模型包括音频特征提取单元。服务器通过该时间点确定模型的音频特征提取单元来获取该目标视频的音频特征。
在一些实施例中,目标视频的音频特征为一个音频特征序列,该音频特征序列包括多个音频子特征,每个音频子特征对应于该目标视频的一个时间点,每个音频子特征用于反映对应时间点的音频特性。
为了对上述实施方式进行更加清楚的说明,下面分为三个部分对上述实施方式进行说明。
第一部分、服务器对多个音频帧的时域信息进行特征提取,得到多个音频帧的时域音频特征。
其中,多个音频帧为目标视频中在时间上连续的音频帧,多个音频帧的时域信息用于描述多个音频帧的幅值在时间上的变化情况,时域音频特征能够反映多个音频帧在时域上的特性。
在一些实施例中,多个音频帧的时域音频特征为一个时域音频特征序列。时域音频特征序列包括多个子特征,每个子特征对应于目标视频的一个时间点,每个子特征用于反映对应时间点的时域音频特性。
在一些实施例中,服务器采用多个一维卷积核对多个音频帧的时域信息进行特征提取,得到多个音频帧的时域音频特征。在这种实施方式下,服务器通过多个一维卷积核来提取时域音频特征,多个一维卷积核能够较为准确地提取时域音频特征。
举例来说,服务器将多个音频帧的时域信息输入时间点确定模型,通过时间点确定模型对时域信息进行特征提取,得到多个音频帧的时域音频特征。其中,时间点确定模型包括音频特征提取单元,音频特征提取单元包括时域特征提取支路和频域特征提取支路。时域特征提取支路用于提取多个音频帧的时域音频特征,频域支路用于提取多个音频帧的频域音频特征。音频特征提取单元的时域特征提取支路包括多个一维卷积子单元和多个池化子单元,每个一维卷积子单元包括至少一个一维卷积核。服务器将多个音频帧的时域信息输入时间点确定模型之后,通过时间点确定模型的时域特征提取支路对多个音频帧的时域信息进行特征提取,也即是通过时域特征提取支路上的多个一维卷积子单元对时域信息进行卷积,得到多个时域特征图。服务器通过时域特征提取支路上的多个池化子单元,对多个时域特征图进行池化,得到多个音频帧的频域音频特征。
通过在提取多个音频帧的时域音频特征的过程中使用多个一维卷积核,能够从多个音频帧的时域信息中提取到多个音频帧的时域特性,特别是多个音频帧的响度和采样幅度能够被准确提取。在提取时域音频特征时通过池化层来降低复杂度,提高时域音频特征的提取效率。
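为便于理解上述时域特征提取支路的结构，下面给出一个示意性的代码草图（基于PyTorch的假设性实现，其中的类名、卷积核尺寸、通道数、池化步长等超参数均为本示例自行设定，并非本申请限定的具体实现）：

```python
# 示意性草图：时域特征提取支路（假设性实现，超参数仅为示例）
import torch
import torch.nn as nn

class TimeDomainBranch(nn.Module):
    """对原始波形（时域信息）做一维卷积和最大值池化，得到时域音频特征。"""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=11, padding=5),         # 第一卷积子单元
            nn.Conv1d(channels, channels, kernel_size=11, padding=5),  # 第二卷积子单元
            nn.MaxPool1d(kernel_size=4),                               # 第一最大值池化子单元
            nn.Conv1d(channels, channels, kernel_size=11, padding=5),  # 第三卷积子单元
            nn.MaxPool1d(kernel_size=4),                               # 第二最大值池化子单元
            nn.Conv1d(channels, channels, kernel_size=11, padding=5),  # 第四卷积子单元
            nn.MaxPool1d(kernel_size=4),                               # 第三最大值池化子单元
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, num_samples) 的时域信息
        return self.net(waveform)  # (batch, channels, T'): 时域音频特征序列

if __name__ == "__main__":
    feat = TimeDomainBranch()(torch.randn(2, 1, 16000))
    print(feat.shape)
```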
第二部分、服务器对多个音频帧的频域信息进行特征提取,得到多个音频帧的频域音频特征。
在一些实施例中,多个音频帧的频域音频特征为一个频域音频特征序列,频域音频特征序列包括多个子特征,每个子特征对应于该目标视频的一个时间点,每个子特征用于反映对应时间点的频域音频特性。
其中,多个音频帧的频域信息为多个音频帧的频谱,比如为多个音频帧的梅尔倒频谱。多个音频帧的频域信息是基于多个音频帧的时域信息确定的,比如,对多个音频帧的时域信息进行傅里叶变换,得到多个音频帧的傅里叶频谱。服务器通过三角窗函数将多个音频帧的傅里叶频谱映射至梅尔刻度,得到多个音频帧的第一梅尔参数。服务器获取多个音频帧的第一梅尔参数的对数,得到多个音频帧的第二梅尔参数。服务器对多个音频帧的第二梅尔参数进行离散余弦变换,得到多个音频帧的梅尔倒频谱,梅尔倒频谱也即是多个音频帧的频域信息。需要说明的是,上述说明是本申请实施例提供的一种基于时域信息获取梅尔倒频谱的方式。在其他可能的实施方式中,服务器还能够通过其他方法来基于时域信息获取梅尔倒频谱,本申请实施例对此不做限定。另外,在下述说明过程中,以多个音频帧的频域信息为多个音频帧的梅尔倒频谱为例进行说明,在其他可能的实施方式中,该多个音频帧的频域信息也可以为其他类型的频谱,本申请实施例对此不做限定。
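作为上述梅尔倒频谱计算过程的一个示意，下面给出一个假设性的代码草图（使用numpy、librosa与scipy，帧长、滤波器数量等参数仅为示例），按照"傅里叶变换、三角窗函数映射至梅尔刻度、取对数、离散余弦变换"的顺序进行计算：

```python
# 示意性草图：由时域信息计算梅尔倒频谱（假设性实现，参数仅为示例）
import numpy as np
import librosa
from scipy.fftpack import dct

def mel_cepstrum(waveform: np.ndarray, sr: int = 16000,
                 n_fft: int = 1024, hop: int = 512, n_mels: int = 64) -> np.ndarray:
    # 1. 对时域信息做短时傅里叶变换，得到傅里叶频谱（这里取功率谱）
    spec = np.abs(librosa.stft(waveform, n_fft=n_fft, hop_length=hop)) ** 2
    # 2. 通过三角窗函数（梅尔滤波器组）映射到梅尔刻度，得到第一梅尔参数
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_1 = mel_fb @ spec
    # 3. 取对数，得到第二梅尔参数
    mel_2 = np.log(mel_1 + 1e-10)
    # 4. 沿频率维做离散余弦变换，得到梅尔倒频谱（频域信息）
    return dct(mel_2, type=2, axis=0, norm="ortho")

if __name__ == "__main__":
    y = np.random.randn(16000).astype(np.float32)
    print(mel_cepstrum(y).shape)  # (n_mels, 帧数)
```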
在一些实施例中,服务器采用多个二维卷积核对多个音频帧的频域信息进行特征提取,得到多个音频帧的频域音频特征。在这种实施方式下,服务器通过多个二维卷积核来提取频域音频特征,多个二维卷积核能够较为准确地提取频域音频特征。
举例来说,服务器将多个音频帧的频域信息输入时间点确定模型,通过时间点确定模型对该频域信息进行特征提取,得到该多个音频帧的频域音频特征。其中,时间点确定模型包括音频特征提取单元。如上述第一部分的描述,音频特征提取单元包括时域特征提取支路和频域特征提取支路,时域特征提取支路用于提取该多个音频帧的时域音频特征,频域支路用于提取该多个音频帧的频域音频特征。音频特征提取单元的频域特征提取支路包括多个二维卷积子单元,每个二维卷积子单元包括至少一个二维卷积核。服务器将多个音频帧的频域信息输入时间点确定模型之后,通过时间点确定模型的频域特征提取支路对多个音频帧的频域信息进行特征提取,也即是通过频域特征提取支路上的多个二维卷积子单元对频域信息进行卷积,得到多个音频帧的频域音频特征。
通过在提取多个音频帧的频域音频特征的过程中使用多个二维卷积核,能够从多个音频帧的频域信息中提取到多个音频帧的频域特性。
第三部分、服务器基于该多个音频帧的时域音频特征和频域音频特征,获取该目标视频的音频特征。
在一些实施例中,服务器将多个音频帧的时域音频特征和频域音频特征进行融合,得到目标视频的初始音频特征。服务器对该目标视频的初始音频特征进行卷积,得到该目标视频的音频特征。在这种实施方式下,服务器通过相加这种方式将多个音频帧的时域音频特征和频域音频特征进行融合,得到目标视频的初始音频特征,通过对初始音频特征进行卷积来进一步融合时域音频特征和频域音频特征,得到的音频特征能够更加准确地表达目标视频的音频特性。
举例来说,在服务器通过多个一维卷积核来提取时域音频特征,通过多个二维卷积核来提取频域音频特征的情况下,得到的时域音频特征的维度为一维,频域音频特征的维度为二维。在这种情况下,服务器对多个音频帧的时域音频特征进行上采样,将一维的时域音频特征变为二维的时域音频特征。服务器将二维的时域音频特征与频域音频特征相加,得到该目标视频的初始音频特征,这个相加过程也即是融合时域音频特征和频域音频特征的过程。服务器通过至少一个二维卷积核对该初始音频特征进行卷积,得到该目标视频的音频特征。在一些实施例中,服务器通过时间点确定模型来基于多个音频帧的时域音频特征和频域音频特征来获取目标视频的音频特征。时间点确定模型包括音频特征融合单元,服务器通过时间点确定模型的音频特征融合子单元,来将该多个音频帧的时域音频特征和频域音频特征融合为该目标视频的音频特征,该音频特征融合子单元属于该音频特征提取单元。
在一些实施例中，服务器将多个音频帧的时域音频特征和频域音频特征进行融合，得到目标视频的初始音频特征。服务器分别对初始音频特征进行最大值池化和均值池化，得到目标视频的第一池化特征和第二池化特征。服务器将第一池化特征以及第二池化特征进行融合，得到目标视频的音频特征。在这种实施方式下，服务器通过最大值池化和均值池化两种方式来降低初始音频特征的复杂度，提高了后续运算的效率。
举例来说，在服务器通过多个一维卷积核来提取时域音频特征，通过多个二维卷积核来提取频域音频特征的情况下，得到的时域音频特征的维度为一维，频域音频特征的维度为二维。在这种情况下，服务器对多个音频帧的时域音频特征进行上采样，将一维的时域音频特征变为二维的时域音频特征。服务器将二维的时域音频特征与频域音频特征相加后进行卷积，得到目标视频的初始音频特征，这个相加和卷积的过程也即是融合时域音频特征和频域音频特征的过程。服务器分别对初始音频特征进行最大值池化和均值池化，得到目标视频的第一池化特征和第二池化特征。其中，第一池化特征为对该初始音频特征进行最大值池化得到的池化特征，第二池化特征为对该初始音频特征进行均值池化得到的池化特征。服务器将第一池化特征和第二池化特征相加，得到第三池化特征。服务器对第三池化特征进行线性整流，得到该目标视频的音频特征。其中，线性整流(Rectified Linear)也被称为线性修正，服务器能够通过线性整流函数来对第三池化特征进行线性整流，得到该目标视频的音频特征，该线性整流函数也被称为斜坡函数。在一些实施例中，服务器通过时间点确定模型基于多个音频帧的时域音频特征和频域音频特征来获取目标视频的音频特征。时间点确定模型包括音频特征融合单元，服务器通过时间点确定模型的音频特征融合子单元，来将多个音频帧的时域音频特征和频域音频特征融合为目标视频的音频特征，音频特征融合子单元属于该音频特征提取单元。
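下面给出一个融合时域音频特征与频域音频特征的示意性代码草图（基于PyTorch的假设性实现，上采样方式、通道数、池化输出尺寸等均为本示例的假设）：

```python
# 示意性草图：融合时域与频域音频特征（假设性实现，参数仅为示例）
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioFeatureFusion(nn.Module):
    def __init__(self, channels: int = 64, out_len: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # 最大值池化与均值池化：沿频率维压缩，保留时间维
        self.max_pool = nn.AdaptiveMaxPool2d((1, out_len))
        self.avg_pool = nn.AdaptiveAvgPool2d((1, out_len))

    def forward(self, time_feat: torch.Tensor, freq_feat: torch.Tensor) -> torch.Tensor:
        # time_feat: (B, C, T) 一维时域音频特征; freq_feat: (B, C, F, T') 二维频域音频特征
        # 1. 上采样/重塑：把一维时域特征变为与频域特征同尺寸的二维特征
        time_2d = F.interpolate(time_feat.unsqueeze(2), size=freq_feat.shape[2:],
                                mode="bilinear", align_corners=False)
        # 2. 相加融合后卷积，得到初始音频特征
        init_feat = self.conv(time_2d + freq_feat)
        # 3. 分别做最大值池化和均值池化，相加后线性整流，得到音频特征
        pooled = self.max_pool(init_feat) + self.avg_pool(init_feat)
        return F.relu(pooled.squeeze(2))  # (B, C, out_len): 音频特征序列

if __name__ == "__main__":
    fuse = AudioFeatureFusion()
    out = fuse(torch.randn(2, 64, 250), torch.randn(2, 64, 32, 128))
    print(out.shape)  # torch.Size([2, 64, 128])
```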
下面将结合上述实施方式以及图4,对上述步骤302进行说明。
参见图4，服务器将多个音频帧的时域信息401输入时间点确定模型。通过该时间点确定模型的音频特征提取单元的时域特征提取支路402对该多个音频帧的时域信息401进行特征提取。也即是服务器通过多个一维卷积子单元和多个最大值池化子单元对该时域信息401进行特征提取，得到该多个音频帧的时域音频特征。其中，每个一维卷积子单元对应于一个一维卷积核。在一些实施例中，一维卷积子单元的数量为四个，分别命名为第一卷积子单元、第二卷积子单元、第三卷积子单元以及第四卷积子单元；最大值池化子单元的数量为三个，分别命名为第一最大值池化子单元、第二最大值池化子单元以及第三最大值池化子单元。服务器通过第一卷积子单元对该时域信息进行卷积，得到该时域信息的第一时域特征向量。服务器通过第二卷积子单元对该第一时域特征向量进行卷积，得到该时域信息的第二时域特征向量。服务器通过第一最大值池化子单元对该第二时域特征向量进行最大值池化，得到该时域信息的第一池化向量。服务器通过第三卷积子单元对该第一池化向量进行卷积，得到该时域信息的第三时域特征向量。服务器通过第二最大值池化子单元对该第三时域特征向量进行最大值池化，得到该时域信息的第二池化向量。服务器通过第四卷积子单元对该第二池化向量进行卷积，得到该时域信息的第四时域特征向量。服务器通过第三最大值池化子单元对该第四时域特征向量进行最大值池化，得到该目标视频的时域音频特征向量。时域音频特征向量用于表示该目标视频的时域音频特征。服务器通过音频特征提取单元的重塑子单元4021，对该时域音频特征向量进行上采样，得到二维的时域音频特征向量4022。
获取二维的时域音频特征向量之后，服务器通过时间点确定模型的音频特征提取单元的频域特征提取支路403对多个音频帧的时域信息401进行特征提取，得到该多个音频帧的频域音频特征。也即是服务器通过频域特征提取支路403上的频域信息获取子单元4031对该多个音频帧的时域信息401进行处理，得到该多个音频帧的频域信息。在一些实施例中，该频域信息为梅尔倒频谱。服务器通过频域特征提取支路403上的至少一个二维卷积子单元对该频域信息进行卷积，得到该目标视频的频域音频特征向量4032。服务器通过时间点确定模型的音频特征融合子单元404，将该二维的时域音频特征向量4022和该频域音频特征向量4032相加后通过该音频特征提取单元的二维卷积子单元405进行卷积，得到该目标视频的初始音频特征。服务器通过音频特征提取单元的最大池化子单元406和均值池化子单元407分别对该初始音频特征进行处理，得到第一池化特征和第二池化特征。服务器将第一池化特征和第二池化特征相加，得到第三池化特征。服务器通过线性整流子单元408(Rectified Linear Unit)对该第三池化特征进行线性整流，得到该目标视频的音频特征409。
在一些实施例中,时间点确定模型的音频特征提取单元为预训练音频神经网络(Pretrained Audio Neural Networks,PANNs)。
需要说明的是,在步骤301之后,服务器既能够先执行步骤302,再执行下述步骤303,也能够先执行下述步骤303,再执行步骤302,或者能够同时执行步骤302和下述步骤303,本申请实施例对此不做限定。在本申请实施例中,以服务器先执行步骤302,再执行下述步骤303为例进行说明。
303、服务器对目标视频的多个视频帧进行特征提取,得到目标视频的图像特征。
其中,目标视频的多个视频帧为该目标视频中在时间上连续的视频帧。在一些实施例中,目标视频的视频特征为一个视频特征序列,该视频特征序列包括多个视频子特征,每个视频子特征对应于该目标视频的一个时间点,每个视频子特征用于反映对应时间点的视频特性。
在一些实施例中,服务器将多个视频帧输入时间点确定模型,通过时间点确定模型对多个视频帧进行特征提取,得到多个视频帧的图像特征,该多个视频帧的图像特征也即是该目标视频的图像特征。在这种实施方式下,通过时间点确定模型对多个视频帧进行特征提取,得到目标视频的图像特征,从而实现了对多个视频帧的抽象表达,提高了后续的运算效率。
下面通过四个例子对上述实施方式进行说明。
例1、服务器将多个视频帧输入时间点确定模型,通过时间点确定模型对多个视频帧进行卷积、归一化和线性修正,得到多个视频帧的图像特征。
举例来说,服务器将多个视频帧输入时间点确定模型,该时间点确定模型包括图像特征提取单元。服务器通过时间点确定模型的图像特征提取单元的至少一个二维卷积层,对多个视频帧进行卷积,得到该多个视频帧的特征图。服务器通过时间点确定模型的至少一个归一化层和至少一个线性修正层,对多个视频帧的特征图进行归一化和线性修正,得到该多个视频帧的图像特征。在一些实施例中,服务器以矩阵的形式来表示视频帧,以向量的形式来表示图像特征,在对视频帧进行卷积的过程中,采用卷积核在视频帧上进行滑动的方式来实现。
例如,图像特征提取单元包括三种类型的残差构建子单元,分别记作第一类残差构建子单元、第二类残差构建子单元以及第三类残差构建子单元。图像特征提取单元分为多个网络阶段,每个网络阶段均包括上述三种类型的残差构建子单元。其中,三种类型的残差构建子单元均包括至少一个卷积层、至少一个归一化层以及至少一个线性修正层,不同类型的残差构建子单元中卷积层、归一化层以及线性修正层的数量和连接方式有所不同。在一些实施例中,多个网络阶段包括开始阶段、中间阶段以及结束阶段。服务器将多个视频帧输入时间点确定模型的图像特征提取单元之后,通过图像特征提取单元的多个网络阶段,也即是多个网络阶段中第一类残差构建子单元、第二类残差构建子单元以及第三类残差构建子单元对多个视频帧进行卷积、归一化以及线性修正,得到该多个视频帧的图像特征。
在一些实施例中，第一类残差构建子单元也被称为开始残差块(Start ResBlock)，第二类残差构建子单元也被称为中间残差块(Middle ResBlock)，第三类残差构建子单元也被称为结束残差块(End ResBlock)。参见图5，示出了一种第一类残差构建子单元501、第二类残差构建子单元502以及第三类残差构建子单元503的结构示意图。在图5中，第一类残差构建子单元501依次包括一维卷积层5011、归一化层5012、线性修正层5013、三维卷积层5014、归一化层5015、线性修正层5016、一维卷积层5017以及归一化层5018。第二类残差构建子单元502依次包括归一化层5021、线性修正层5022、一维卷积层5023、归一化层5024、线性修正层5025、三维卷积层5026、归一化层5027、线性修正层5028以及一维卷积层5029。第三类残差构建子单元503依次包括归一化层5031、线性修正层5032、一维卷积层5033、归一化层5034、线性修正层5035、三维卷积层5036、归一化层5037、线性修正层5038以及一维卷积层5039。其中，卷积层用于进行卷积，归一化层用于进行归一化，线性修正层用于进行线性修正。
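作为对上述残差构建子单元结构的一个示意，下面给出一个假设性的代码草图（基于PyTorch，仅示意"归一化、线性修正、卷积"的排列方式以及残差连接；其中把"一维卷积层/三维卷积层"理解为1×1/3×3的二维卷积，属于本示例的假设，并非本申请限定的结构）：

```python
# 示意性草图：一种"中间残差块"的可能结构（假设性实现）
import torch
import torch.nn as nn

class MiddleResBlock(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),                # "一维"卷积层(1x1)
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1, bias=False),    # "三维"卷积层(3x3)
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, in_ch, kernel_size=1, bias=False),                # "一维"卷积层(1x1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # 残差连接

if __name__ == "__main__":
    print(MiddleResBlock(64, 16)(torch.randn(1, 64, 56, 56)).shape)
```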
在一些实施例中，图像特征提取单元为神经网络IResNet(Improved Residual Networks，改进残差网络)。神经网络IResNet的输出结果为目标视频的图像特征。在IResNet中，以网络层数为50为例，该50层网络包括三种类型的阶段，分别是一个开始阶段、四个中间阶段以及一个结束阶段。四个中间阶段的每个中间阶段包括多个残差构建子单元。IResNet能够在准确性和学习收敛性方面都超过ResNet。例如，在ImageNet数据集上，使用具有50层的ResNet，同时在相同配置下使用IResNet，top-1精度的提升范围在1.19%到2.33%之间。同时，这些改进是在不增加模型复杂性的情况下获得的。
需要说明的是,上述是以时间点确定模型的图像特征提取单元为IResNet为例进行说明的,在其他可能的实施方式中,该时间点确定模型的图像特征提取单元还可以为其他结构,本申请实施例对此不做限定。
例2、服务器将多个视频帧输入时间点确定模型,通过时间点确定模型,基于注意力机制对多个视频帧进行编码,得到多个视频帧的图像特征。其中,通过时间点确定模型获取的图像特征也即是对应内容项的语义特征。在这种实施方式下,时间点确定模型为语义特征编码器,比如为Transformer编码器。
在一些实施例中，服务器将该多个视频帧输入该时间点确定模型的图像特征提取单元，通过时间点确定模型的图像特征提取单元，对多个视频帧进行嵌入编码，得到多个嵌入特征。一个嵌入特征对应于多个视频帧的一个视频帧。嵌入特征用于表示各个视频帧在多个视频帧中的位置以及各个视频帧的内容。服务器将多个嵌入特征输入时间点确定模型，通过时间点确定模型的三个线性变换矩阵，对多个嵌入特征进行线性变换，得到该多个视频帧的每个视频帧对应的查询(Query)向量、键(Key)向量以及值(Value)向量。服务器通过时间点确定模型，基于多个视频帧对应的查询向量以及键向量，获取多个视频帧中各个视频帧的注意力权重。服务器通过时间点确定模型，基于多个视频帧的各个视频帧的注意力权重和多个视频帧的各个视频帧的值向量，获取多个视频帧的注意力编码向量，注意力编码向量也即是视频帧的图像特征。
例如,服务器通过时间点确定模型,将每个嵌入特征分别与三个线性变换矩阵相乘,得到多个视频帧中每个视频帧分别对应的查询向量、键向量以及值向量。对于多个视频帧中的第一个视频帧,服务器通过时间点确定模型,基于第一个视频帧的查询向量,与多个视频帧中多个其他视频帧的键向量,确定多个其他视频帧对第一个视频帧的多个注意力权重。服务器通过时间点确定模型,将多个其他视频帧对第一个视频帧的注意力权重,与多个其他视频帧的值向量进行加权求和,得到该第一个视频帧的注意力编码向量。需要说明的是,上述是以服务器通过时间点确定模型,对多个视频帧的第一个视频帧进行编码,得到第一个视频帧的注意力编码向量为例进行说明的。服务器对多个视频帧的其他视频帧进行编码的方式与上述对该第一个视频帧进行编码的方法属于同一发明构思,实现过程参见上述描述,在此不再赘述。
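为便于理解例2中基于注意力机制对视频帧进行编码的过程，下面给出一个示意性的代码草图（基于PyTorch的假设性实现，特征维度、缩放因子等均为示例）：

```python
# 示意性草图：基于注意力机制（自注意力）对视频帧的嵌入特征进行编码（假设性实现）
import torch
import torch.nn as nn

class FrameSelfAttention(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # 三个线性变换矩阵，分别得到查询/键/值向量
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.scale = dim ** -0.5

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (B, N, dim)，N为视频帧数量，每行是一个视频帧的嵌入特征
        q, k, v = self.to_q(embeddings), self.to_k(embeddings), self.to_v(embeddings)
        # 注意力权重：由查询向量与键向量计算
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, N, N)
        # 用注意力权重对值向量加权求和，得到各帧的注意力编码向量（图像特征）
        return attn @ v  # (B, N, dim)

if __name__ == "__main__":
    print(FrameSelfAttention()(torch.randn(2, 10, 256)).shape)
```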
例3、服务器将多个视频帧输入时间点确定模型,通过时间点确定模型对多个视频帧进行卷积、归一化和线性修正,得到多个视频帧的第一图像特征。服务器通过时间点确定模型,基于注意力机制对多个视频帧进行编码,得到多个视频帧的第二图像特征。服务器将多个视频帧的第一图像特征和第二图像特征进行融合,得到多个视频帧的图像特征。
举例来说，时间点确定模型包括第一图像特征提取单元和第二图像特征提取单元。第一图像特征提取单元用于提取目标视频的第一图像特征，第二图像特征提取单元用于提取目标视频的第二图像特征。服务器将多个视频帧输入时间点确定模型之后，通过第一图像特征提取单元来获取多个视频帧的第一图像特征，通过第二图像特征提取单元来获取多个视频帧的第二图像特征。服务器将多个视频帧的第一图像特征和第二图像特征进行融合时，可以采用加权求和的方式。加权求和的权重由技术人员根据实际情况进行设置，如设置为0.3、0.5或者0.8等，本申请实施例对此不做限定。
例4、服务器将多个视频帧输入时间点确定模型,通过时间点确定模型对多个视频帧进行全连接和池化,得到多个视频帧的图像特征。
举例来说,服务器将多个视频帧输入时间点确定模型,通过时间点确定模型的至少一个全连接层,对多个视频帧进行全连接,得到多个视频帧的全连接特征。服务器通过时间点确定模型的池化层,对多个视频帧的全连接特征进行最大值池化或者平均池化中的任一项,得到多个视频帧的图像特征,该图像特征也被称为深度特征或者底层特征。在一些实施例中,服务器以矩阵的形式来表示视频帧,以向量的形式来表示图像特征,在对视频帧进行全连接的过程中,采用将全连接矩阵与视频帧的矩阵进行相乘的方式来实现。在一些实施例中,时间点确定模型为基于深度神经网络(Deep Neural Networks,DNN)的特征提取器。
需要说明的是,上述是以时间点确定模型提取内容项的底层特征和语义特征为例进行说明的。随着科学技术的发展,服务器还能够采用其他结构的时间点确定模型来获取图像特征,本申请实施例对此不做限定。
另外,上述步骤302-303是分别对提取目标视频的音频特征和视频特征进行说明的。在其他可能的实施方式中,服务器还能够提取该目标视频的字幕特征,通过结合目标视频的音频特征、图像特征以及字幕特征来确定该目标视频的视频特征,能够提高视频特征的表达能力。
在一些实施例中,服务器提取目标视频的音频特征、图像特征以及字幕特征。其中,服务器提取目标视频的音频特征和图像特征的方法与上述步骤302和303属于同一发明构思,实现过程参见上述步骤302和303的说明,在此不再赘述。
下面对服务器提取目标视频的字幕特征的方法进行说明。
在一些实施例中,服务器将目标视频的字幕输入时间点确定模型,通过时间点确定模型对目标视频的字幕进行特征提取,得到该目标视频的字幕特征。在一些实施例中,时间点确定模型包括字幕特征提取单元,服务器通过字幕特征提取单元能够提取该目标视频的字幕特征。
例如,服务器通过字幕特征提取单元,对目标视频的字幕进行嵌入编码,得到该目标视频的字幕嵌入特征。服务器通过该字幕特征提取单元,对目标视频的字幕嵌入特征进行卷积以及池化,得到该目标视频的字幕特征。
当然,服务器除了能够采用卷积和池化的方式来获取目标视频的字幕特征之外,还能够通过其他文本特征提取方法来获取目标视频的字幕特征,本申请实施例对此不做限定。
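作为上述"嵌入编码、卷积、池化"提取字幕特征的一个示意，下面给出一个假设性的代码草图（基于PyTorch，词表大小、嵌入维度等均为本示例的假设）：

```python
# 示意性草图：字幕特征提取（嵌入编码 + 卷积 + 池化，假设性实现）
import torch
import torch.nn as nn

class SubtitleEncoder(nn.Module):
    def __init__(self, vocab_size: int = 30000, emb_dim: int = 128, out_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)                      # 嵌入编码
        self.conv = nn.Conv1d(emb_dim, out_dim, kernel_size=3, padding=1)   # 卷积
        self.pool = nn.AdaptiveMaxPool1d(1)                                 # 池化

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (B, L) 的字幕分词序号
        x = self.embed(token_ids).transpose(1, 2)   # (B, emb_dim, L)
        x = self.pool(torch.relu(self.conv(x)))     # (B, out_dim, 1)
        return x.squeeze(-1)                        # (B, out_dim): 字幕特征

if __name__ == "__main__":
    print(SubtitleEncoder()(torch.randint(0, 30000, (2, 50))).shape)
```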
304、服务器将音频特征以及图像特征进行融合,得到目标视频的视频特征。
在一些实施例中,服务器将音频特征和图像特征进行叠加,得到该目标视频的视频特征。在音频特征为音频特征序列,图像特征为图像特征序列的情况下,服务器将音频特征序列和该图像特征序列相加,得到该目标视频的视频特征序列。
在一些实施例中,由于目标视频的视频特征融合了音频特征和图像特征,目标视频的视频特征也被称为目标视频的音视频高级语义特征。在视频特征为视频特征序列的情况下,视频特征序列中的每个子特征表示该目标视频中对应时间点的视频特征,也即是对应时间点的语义信息。由于确定目标视频的视频特征时结合了目标视频的音频特征和图像特征,使得得到的视频特征能够在音频和图像两个维度上体现目标视频的特性,视频特征的准确性较高。
在一些实施例中,在音频特征和图像特征的维度不同的情况下,服务器对音频特征或图像特征的维度进行调整,以使得在调整之后,音频特征和图像特征的维度相同。
在一些实施例中，在服务器提取该目标视频的字幕特征的情况下，服务器将目标视频的音频特征、图像特征以及字幕特征进行融合，得到目标视频的视频特征。在音频特征为音频特征序列，图像特征为图像特征序列，字幕特征为字幕特征序列的情况下，服务器将音频特征序列、图像特征序列以及字幕特征序列相加，得到目标视频的视频特征序列。由于确定目标视频的视频特征时结合了目标视频的音频特征、图像特征和字幕特征，使得得到的视频特征能够在音频、图像以及字幕三个维度上体现目标视频的特性，视频特征的准确性较高。
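下面给出一个把音频特征、图像特征以及字幕特征对齐维度后相加融合的示意性代码草图（基于PyTorch的假设性实现，其中用线性层进行维度调整属于本示例的一种可选做法）：

```python
# 示意性草图：多模态特征序列融合（假设性实现）
import torch
import torch.nn as nn

class ModalityFusion(nn.Module):
    def __init__(self, audio_dim: int, image_dim: int, subtitle_dim: int, video_dim: int):
        super().__init__()
        # 当各模态特征维度不同时，先把维度调整成一致
        self.audio_proj = nn.Linear(audio_dim, video_dim)
        self.image_proj = nn.Linear(image_dim, video_dim)
        self.sub_proj = nn.Linear(subtitle_dim, video_dim)

    def forward(self, audio_seq, image_seq, subtitle_seq):
        # 各输入形状均为 (B, T, dim)，T为时间点数量；相加得到视频特征序列
        return self.audio_proj(audio_seq) + self.image_proj(image_seq) + self.sub_proj(subtitle_seq)

if __name__ == "__main__":
    fuse = ModalityFusion(128, 512, 256, 256)
    out = fuse(torch.randn(2, 100, 128), torch.randn(2, 100, 512), torch.randn(2, 100, 256))
    print(out.shape)  # torch.Size([2, 100, 256])
```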
在一些实施例中,上述步骤302-304由时间点确定模型的特征提取子模型实现。
305、服务器基于注意力机制对目标视频的视频特征进行编码,得到多个目标参数,多个目标参数对应于目标视频的多个时间点,目标参数用于表示在对应时间点插入背景音乐的概率。
在一些实施例中,视频特征包括多个子特征,服务器通过时间点确定模型,基于注意力机制对多个子特征中每两个子特征进行编码,得到各个子特征的目标参数。其中,视频特征包括的多个子特征对应于目标视频的多个时间点,一个子特征对应于目标视频的一个时间点,不同子特征对应的时间点不同,各个子特征用于表示对应时间点的视频特征。
举例来说,对于多个子特征中的第一子特征,服务器通过时间点确定模型,基于注意力机制确定多个子特征中的多个第二子特征对第一子特征的多个注意力参数。服务器通过时间点确定模型,将多个注意力参数进行融合,得到第一子特征的目标参数。
为了对上述举例进行更加清楚的说明,下面将分为两个部分对上述举例进行进一步说明,参见部分A和部分B。
部分A、服务器通过时间点确定模型,基于注意力机制确定多个子特征中的多个第二子特征对第一子特征的多个注意力参数。
在一些实施例中,服务器通过时间点确定模型,对第一子特征进行全连接,得到第一子特征的嵌入特征。对于多个第二子特征中的任一第二子特征,服务器通过时间点确定模型,对第二子特征进行全连接,得到第二子特征的嵌入特征。服务器通过时间点确定模型,基于第一子特征的嵌入特征和第二子特征的嵌入特征,确定第一子特征和第二子特征之间的相似度参数。服务器通过时间点确定模型,基于第一子特征以及第一子特征和第二子特征之间的相似度参数,确定第二子特征对第一子特征的注意力参数。
其中,第一子特征和第二子特征之间的相似度参数用于描述第一子特征和第二子特征之间的相似程度。在一些实施例中,第一子特征和第二子特征之间的相似度参数与第一子特征和第二子特征之间的相似程度正相关。也即是相似度参数越高,表示第一子特征和第二子特征之间的相似程度越高;相似度参数越低,表示第一子特征和第二子特征之间的相似程度越低。注意力参数也被称为注意力权重。
举例来说,时间点确定模型包括目标参数获取单元,服务器通过时间点确定模型的目标参数获取单元,对第一子特征进行全连接,得到第一子特征的嵌入特征。也即是,服务器将第一子特征输入该目标参数获取单元的全连接层,将第一子特征与目标参数获取单元的全连接层的全连接矩阵相乘,得到第一子特征的嵌入特征。服务器将第二子特征输入该目标参数获取单元的全连接层,将第二子特征与目标参数获取单元的全连接层的全连接矩阵相乘,得到第二子特征的嵌入特征。服务器通过目标参数获取单元,基于第一子特征的嵌入特征和第二子特征的嵌入特征,确定第一子特征和第二子特征之间的相似度参数。其中,相似度参数为第一子特征和第二子特征的点积,或者为第一子特征和第二子特征之间的余弦相似度,本申请实施例对此不做限定。服务器通过目标参数获取单元,将第一子特征以及第一子特征和第二子特征之间的相似度参数相乘,得到第二子特征对第一子特征的注意力参数。
需要说明的是,上述是以确定多个第二子特征中的一个第二子特征对第一子特征的注意力参数为例进行说明的。服务器通过时间点确定模型确定其他第二子特征对第一子特征的注意力参数的方法与上述描述属于同一发明构思,实现过程不再赘述。
例如，图6提供了一种目标参数获取单元的架构图，参见图6，服务器将该目标视频的视频特征序列{a1-an}输入目标参数获取单元。服务器通过目标参数获取单元，基于注意力机制确定多个第二子特征{a2-an}对第一子特征(a1)的多个注意力参数{c12-c1n}。其中，n为视频特征中子特征的数量，n为正整数。以服务器确定第二子特征ai对第一子特征a1的注意力参数为例，服务器通过目标参数获取单元，对第一子特征a1和第二子特征ai进行全连接(FC)，得到第一子特征a1的嵌入特征和第二子特征ai的嵌入特征。服务器通过目标参数获取单元，将第一子特征a1的嵌入特征和第二子特征ai的嵌入特征相乘，得到第一子特征a1的嵌入特征和第二子特征ai的嵌入特征之间的相似度参数m1i。服务器通过目标参数获取单元，将相似度参数m1i与第一子特征a1相乘，得到第二子特征ai对第一子特征a1的注意力参数c1i。其中，i为正整数，2≤i≤n。
部分B、服务器通过时间点确定模型,将多个注意力参数进行融合,得到第一子特征的目标参数。
在一些实施例中,第一子特征的目标参数也被称为第一子特征的注意力权重或者在第一子特征对应的时间点插入背景音乐的置信度。
在一些实施例中,服务器通过时间点确定模型的目标参数获取单元,将多个注意力参数进行相加,得到第一子特征的目标参数。也就是说,第一子特征的目标参数是由多个第二子特征对第一子特征的多个注意力参数融合后得到的。
例如,参见图6,服务器通过目标参数获取单元,将多个第二子特征{a2-an}对第一子特征(a1)的多个注意力参数{c12-c1n}相加,得到第一子特征(a1)的目标参数w1。
需要说明的是,上述是以服务器通过时间点确定模型获取多个子特征中的第一子特征的目标参数为例进行说明。服务器获取多个子特征中其他子特征的目标参数的方法与上述描述属于同一发明构思,实现过程不再赘述。
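结合部分A和部分B，下面给出目标参数获取单元的一个示意性代码草图（基于PyTorch的假设性实现）。其中，用嵌入特征的点积作为相似度参数、将相似度参数与第一子特征相乘得到注意力参数、再将多个注意力参数相加，与上文描述对应；而在计算时排除子特征与自身的相似度，以及最后通过一个线性层和sigmoid把融合结果压缩为标量目标参数，属于本示例自行补充的假设：

```python
# 示意性草图：目标参数获取单元（假设性实现）
import torch
import torch.nn as nn

class TargetParameterUnit(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(dim, dim)        # 全连接层：子特征 -> 嵌入特征
        self.to_scalar = nn.Linear(dim, 1)   # 假设性的输出层：把融合后的向量映射为标量目标参数

    def forward(self, sub_features: torch.Tensor) -> torch.Tensor:
        # sub_features: (T, dim)，视频特征序列 {a1..aT}，每行为一个时间点的子特征
        emb = self.fc(sub_features)                  # 各子特征的嵌入特征
        sim = emb @ emb.t()                          # (T, T): 每两个子特征的嵌入特征相乘得到的相似度参数
        sim = sim - torch.diag(torch.diag(sim))      # 排除子特征与自身的相似度（本示例的假设）
        # 注意力参数 c_ij = m_ij * a_i；将多个注意力参数相加，等价于用相似度之和缩放第一子特征
        fused = sim.sum(dim=1, keepdim=True) * sub_features   # (T, dim)
        # 假设性的输出层：压缩为各时间点的标量目标参数（可视作插入背景音乐的概率）
        return torch.sigmoid(self.to_scalar(fused)).squeeze(-1)  # (T,)

if __name__ == "__main__":
    params = TargetParameterUnit()(torch.randn(100, 256))
    print(params.shape)  # torch.Size([100])
```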
为了直观地体现本申请实施例提供的技术方案所带来的效果，将实验过程中得到的视频特征的多个子特征的目标参数绘制为折线图，将目标视频的多个视频帧、多个音频帧的时域信息以及多个音频帧的频域信息以时间点为基准进行对齐，得到图7。参见图7，包括目标视频的多个视频帧701、目标视频的多个音频帧的频域信息702，目标视频的多个音频帧的时域信息703以及多个子特征的目标参数绘制成的折线图704。其中，折线图704能够从整体上反映多个子特征的目标参数的变化情况。
在一些实施例中,上述步骤305由时间点确定模型的目标参数确定子模型实现。
306、服务器确定插入背景音乐的至少一个候选时间点,候选时间点为多个时间点中目标参数符合目标条件的时间点。
其中,候选时间点也即是服务器确定出的适合插入背景音乐的时间点。视频制作人员能够在候选时间点中进行选择,确定在目标视频中插入背景音乐的目标时间点。候选时间点的数量为一个或者多个,本申请实施例对此不做限定。
在一些实施例中,目标参数符合目标条件是指目标参数大于或等于参数阈值。参数阈值由技术人员根据实际情况进行设置,本申请实施例对此不做限定。服务器将多个时间点中,目标参数大于或等于参数阈值的时间点,确定为插入背景音乐的候选时间点。
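下面给出按参数阈值从多个时间点中筛选候选时间点的一个示意性代码草图（阈值0.5仅为示例，实际的参数阈值由技术人员根据情况设置）：

```python
# 示意性草图：按参数阈值筛选候选时间点（阈值仅为示例）
def select_candidates(target_params, timestamps, threshold=0.5):
    # target_params: 各时间点的目标参数; timestamps: 对应的时间点（例如以秒为单位）
    return [t for p, t in zip(target_params, timestamps) if p >= threshold]

if __name__ == "__main__":
    print(select_candidates([0.1, 0.8, 0.6, 0.2], [0.0, 1.0, 2.0, 3.0]))  # [1.0, 2.0]
```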
在一些实施例中,在服务器确定出候选时间点之后,视频制作人员能够在确定出的候选时间点中进行选择,以确定最终插入背景音乐的目标时间点。例如,服务器将目标视频的候选时间点发送给终端,由终端将目标视频的候选时间点展示给视频制作人员。响应于任一候选时间点被选中,终端在被选中的候选时间点插入背景音乐,被选中的候选时间点也即是目标时间点。
在一些实施例中，终端接收到服务器发送的目标视频的候选时间点之后，能够将该候选时间点显示在该目标视频的时间轴上。例如，终端在目标视频的时间轴上以圆点的形式显示该候选时间点。视频制作人员能够通过点击不同的候选时间点来控制终端播放目标视频的不同内容，并根据播放的内容来从候选时间点中选择插入背景音乐的目标时间点。通过基于候选时间点进行选择，大大缩小了确定插入背景音乐的目标时间点的范围，提高了背景音乐的插入效率。
下面将结合图8以及上述步骤301-306中各个可能的实施方式,对本申请实施例提供的背景音乐的插入时间点确定方法进行说明。
参见图8,服务器获取目标视频801。服务器对目标视频的视频轨道(多个视频帧)进行特征提取,得到目标视频的图像特征802。服务器在对目标视频的视频轨道进行特征提取时,采用了IResNet模型来实现。服务器对目标视频的音频轨道(多个音频帧)进行特征提取,得到目标视频的音频特征803。服务器在对目标视频的音频轨道进行特征提取时,采用了PANNs模型来实现。服务器通过时间点确定模型,将目标视频的图像特征802和音频特征803进行融合,得到目标视频的视频特征804。服务器基于注意力机制,对视频特征804中每两个子特征进行编码,得到各个子特征的目标参数805。
需要说明的是,在上述说明过程中,是以服务器为执行主体为例进行说明的。在其他可能的实施方式中,本申请实施例提供的技术方案也能够由终端执行,本申请实施例对此不做限定。
通过本申请实施例提供的技术方案,结合了目标视频的音频特征和图像特征来确定目标视频的视频特征,该视频特征能够较为准确地表示目标视频的内容。通过基于注意力机制对视频特征进行编码,能够得到多个目标参数,多个目标参数表示在对应时间点插入背景音乐的概率。通过基于多个时间点的目标参数,能够从多个时间点中确定出候选时间点,候选时间点也即是可以在目标视频中插入背景音乐的时间点。在上述确定候选时间点的过程中,无需视频制作人员完整地观看目标视频,只需在确定出的候选时间点中进行选择即可,提高了在视频中插入背景音乐的效率。
在本申请实施例提供的技术方案中，提供了一种全自动的插曲(背景音乐)位置的确定方法，该方案能够通过音视频的高级语义特征来自动确定视频的插曲位置，然后为视频的后期制作或者视频的二次创作提供插曲位置备选，能够摆脱人工选择的方式，大大减少了视频的制作成本。同时使用时间点确定模型来对插入背景音乐的位置进行定位，能够以模块化的方式进行科学的数据计算，避免了因为人类感官差异而造成的时间点差异。
上述步骤301-306包括服务器利用时间点确定模型来获取目标视频的候选时间点的实施方式,为了进行更加清楚的说明,下面以执行主体为服务器为例,对训练该时间点确定模型的方法进行说明,参见图9,方法包括下述步骤。
901、服务器将样本视频输入时间点确定模型,通过时间点确定模型对样本视频进行音频分离,得到样本视频的原始音频和背景音乐。
在一些实施例中,服务器将样本视频输入该时间点确定模型,通过时间点确定模型对样本视频的多个样本音频帧的样本频域信息进行特征提取,得到该样本视频的第一音频特征。服务器通过时间点确定模型,采用多种尺度对第一音频特征进行池化,得到样本视频的多个第二音频特征。服务器通过时间点确定模型,将多个第二音频特征进行融合,得到样本视频的音频分离特征。服务器通过时间点确定模型,基于音频分离特征对样本频域信息进行分离,得到样本视频的原始音频和背景音乐。
为了对上述实施方式进行更加清楚的说明,下面将分为四个部分对上述实施方式进行说明,参见部分M、部分N、部分O以及部分P。
部分M、服务器将样本视频输入该时间点确定模型,通过该时间点确定模型对该样本视频的多个样本音频帧的样本频域信息进行特征提取,得到该样本视频的第一音频特征。
在一些实施例中,服务器将样本视频的多个样本音频帧的时域信息输入时间点确定模型,通过时间点确定模型将多个样本音频帧的时域信息转化为多个样本音频帧的频域信息。服务器通过时间点确定模型,对多个样本音频帧的频域信息进行卷积,得到样本视频的第一音频特征。在一些实施例中,时间点确定模型对多个样本音频帧的频域信息进行卷积时采用的是空洞卷积核。例如,参见图10,服务器通过时间点确定模型的音频分离单元,对多个样本音频帧的频域信息1001进行卷积,得到样本视频的第一音频特征1002。
部分N、服务器通过时间点确定模型，采用多种尺度对该第一音频特征进行池化，得到样本视频的多个第二音频特征。
其中,服务器采用不同尺度对第一音频特征进行池化时,得到的是不同尺寸的第二音频特征。也即是,一个尺度对应于一个尺寸,多个第二音频特征为多个不同尺寸的第二音频特征。这种基于不同尺度的池化方法也被称为金字塔池化。
在一些实施例中,服务器通过时间点确定模型的多个池化核,采用多种尺度对第一音频特征进行池化,得到样本视频的多个第二音频特征,多个池化核对应于多种尺度。例如,参见图10,服务器通过时间点确定模型的音频分离单元的多个池化核,采用多种尺度对第一音频特征1001进行池化,得到样本视频的多个第二音频特征1002。从图10中可以看出,多个第二音频特征1002的尺寸均不相同。
部分O、服务器通过时间点确定模型,将多个第二音频特征进行融合,得到样本视频的音频分离特征。
在一些实施例中,服务器通过时间点确定模型,对多个第二音频特征进行卷积,得到样本视频的多个第三音频特征。服务器通过时间点确定模型,对多个第三音频特征进行上采样,得到样本视频的多个第四音频特征。多个第四音频特征的尺寸均与第一音频特征相同。服务器通过时间点确定模型,将多个第四音频特征与第一音频特征进行融合,得到样本视频的音频分离特征。例如,参见图10,服务器通过时间点确定模型的音频分离单元,对多个第二音频特征1002进行卷积,得到样本视频的多个第三音频特征1003。服务器通过时间点确定模型,对多个第三音频特征1003进行上采样,得到样本视频的多个第四音频特征1004。服务器通过时间点确定模型,将多个第四音频特征1004与第一音频特征1001进行融合后再进行卷积,得到样本视频的音频分离特征。
在一些实施例中,上述实施方式由时间点确定模型的音频分离子模型实现。音频分离子模型为金字塔结构网络(Pyramid Scene Parsing Network,PSPnet)。在PSPnet中,金字塔池化生成的不同尺度的特征图最终被拼接起来,再输入到全连接层以进行分类。在一些实施例中,金字塔结构可以融合四种不同尺度的特征:第一层突出显示的是最粗糙尺度的单个全局池化输出。其他层将第一音频特征映射划分为不同的尺度的第二音频特征,并形成针对不同第一音频特征中不同位置的集合表示。使用4层金字塔结构,池化核覆盖了第一音频特征的全部、一半和小部分。为了维护全局特性的权重,如果金字塔结构共有N个尺度,则在每个尺度后使用1×1卷积,将对应尺度的通道数量降为原本的1/N,N为正整数。然后通过双线性插值直接对低维特征进行上采样,得到与原始特征相同尺寸的特征。最后,将不同尺度的特征拼接起来,作为最终的音频分离特征。
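下面给出一个参考上述金字塔池化思路的示意性代码草图（基于PyTorch的假设性实现，尺度集合、通道数等均为示例；为简洁起见，这里用自适应平均池化近似不同尺度的池化核）：

```python
# 示意性草图：金字塔池化融合（假设性实现，参数仅为示例）
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    def __init__(self, in_ch: int = 256, scales=(1, 2, 3, 6)):
        super().__init__()
        n = len(scales)
        self.scales = scales
        # 每个尺度后使用1x1卷积，把通道数降为原本的1/N
        self.reduce = nn.ModuleList([nn.Conv2d(in_ch, in_ch // n, kernel_size=1) for _ in scales])

    def forward(self, first_feat: torch.Tensor) -> torch.Tensor:
        # first_feat: (B, C, H, W) 第一音频特征
        outs = [first_feat]
        for scale, conv in zip(self.scales, self.reduce):
            pooled = F.adaptive_avg_pool2d(first_feat, scale)                 # 不同尺寸的第二音频特征
            reduced = conv(pooled)                                            # 第三音频特征（降通道）
            upsampled = F.interpolate(reduced, size=first_feat.shape[2:],
                                      mode="bilinear", align_corners=False)   # 第四音频特征
            outs.append(upsampled)
        return torch.cat(outs, dim=1)  # 与第一音频特征拼接，作为音频分离特征

if __name__ == "__main__":
    print(PyramidPooling()(torch.randn(1, 256, 64, 64)).shape)  # torch.Size([1, 512, 64, 64])
```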
部分P、服务器通过时间点确定模型,基于音频分离特征对样本频域信息进行分离,得到样本视频的原始音频和背景音乐。
在一些实施例中，服务器通过时间点确定模型，基于音频分离特征，确定样本频域信息的边界信息。边界信息用于表示样本频域信息中原始音频和背景音乐之间的边界。服务器通过时间点确定模型，基于边界信息对样本频域信息进行处理，得到样本视频的原始音频和背景音乐。
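作为"基于边界信息对样本频域信息进行分离"的一个简化示意，下面给出一个假设性的代码草图（这里用逐时频点的掩码来近似边界信息，属于本示例的假设，并非本申请限定的分离方式）：

```python
# 示意性草图：基于掩码把样本频域信息分离为原始音频与背景音乐（假设性实现）
import torch

def separate(freq_info: torch.Tensor, boundary_mask: torch.Tensor):
    # freq_info: (F, T) 样本频域信息; boundary_mask: (F, T)，取值0~1，表示各时频点属于背景音乐的程度
    bgm = freq_info * boundary_mask                 # 背景音乐部分
    original = freq_info * (1.0 - boundary_mask)    # 原始音频部分
    return original, bgm

if __name__ == "__main__":
    spec = torch.rand(64, 100)
    orig, bgm = separate(spec, torch.rand(64, 100))
    print(orig.shape, bgm.shape)
```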
在一些实施例中,服务器基于样本视频的背景音乐在样本视频中的出现时间,为样本视频的多个时间点添加标签。由于时间点的标签用于表示样本视频中背景音乐的出现时间,那么在服务器将样本视频中的背景音乐和原始音频分离之后,基于分离出的背景音乐在样本视频中的出现时间为多个时间点添加标签即可,无需技术人员再手动添加标签,标签添加的效率较高。
需要说明的是,上述步骤901是可选步骤。在样本视频中存在背景音乐的情况下,服务器通过执行该步骤901能够去除该样本视频中的背景音乐,这样可以让时间点确定模型在训练阶段不受已有背景音乐的影响。在样本视频中不存在背景音乐的情况下,服务器无须执行步骤901,直接执行下述步骤902即可。在直接执行下述步骤902的情况下,下述步骤902中的原始音频也即是样本视频的音频。
902、服务器通过时间点确定模型,对样本视频的原始音频和多个样本视频帧进行特征提取,得到样本视频的样本音频特征以及样本图像特征。
其中,样本视频的原始音频包括样本视频的多个样本音频帧。服务器对样本视频的原始音频和多个样本视频帧进行特征提取,得到样本视频的样本音频特征以及样本图像特征的方法,与上述步骤302和303属于同一发明构思,实现过程参见上述步骤302和303的描述,在此不再赘述。
903、服务器通过时间点确定模型,将样本音频特征以及样本图像特征进行融合,得到样本视频的视频特征。
其中,服务器通过时间点确定模型,将样本音频特征以及样本图像特征进行融合,得到样本视频的视频特征的方法与上述步骤304属于同一发明构思,实现过程参见上述步骤304的描述,在此不再赘述。
904、服务器通过时间点确定模型,基于注意力机制对样本视频的视频特征进行编码,得到多个样本参数,多个样本参数对应于样本视频的多个时间点,样本参数用于表示在对应时间点插入背景音乐的概率。
其中,服务器通过时间点确定模型,基于注意力机制对样本视频的视频特征进行编码,得到多个样本参数与上述步骤305属于同一发明构思,实现过程参见上述步骤305的描述,在此不再赘述。
905、服务器基于样本视频的多个时间点的标签与多个样本参数之间的差异信息,对时间点确定模型进行训练,标签用于表示样本视频中背景音乐的出现时间。
其中,样本参数用于表示在对应时间点插入背景音乐的概率。在一些实施例中,样本参数与在对应时间点插入背景音乐的概率正相关。也即是样本参数越大,表示在对应时间点插入背景音乐的概率越高;样本参数越小,表示在对应时间点插入背景音乐的概率越低。标签用于表示样本视频中背景音乐的出现时间。基于标签和样本参数之间的差异信息对时间点模型进行训练,就能够使得时间点确定模型学习到背景音乐在样本视频中的出现时间,从而在使用过程中输出候选时间点。
在一些实施例中,服务器基于样本视频的多个时间点的标签与多个样本参数之间的差异信息,构建目标损失函数。服务器采用梯度下降法,基于目标损失函数对时间点确定模型进行训练。
例如，服务器将多个样本参数进行归一化，使得多个样本参数处于目标范围内。多个时间点的标签包括目标范围的最大值和最小值，最大值表示对应时间点出现了背景音乐，最小值表示对应时间点没有出现背景音乐。基于归一化后的多个样本参数与多个时间点的标签对时间点确定模型进行训练的目的是，使得确定出的样本参数在归一化后尽量接近目标范围的最大值或者最小值。其中，在某个时间点出现了背景音乐的情况下，训练的目的也即是使得该时间点的样本参数尽量接近目标范围的最大值；在该时间点没有出现背景音乐的情况下，训练的目的也即是使得该时间点的样本参数尽量接近目标范围的最小值。
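下面给出一个基于标签与样本参数之间差异信息训练时间点确定模型的示意性代码草图（基于PyTorch的假设性实现，以二值交叉熵作为目标损失函数、以随机梯度下降作为优化方式，均属于本示例的假设）：

```python
# 示意性草图：基于标签与样本参数的差异训练模型（假设性实现）
import torch
import torch.nn as nn

def training_step(model, video_feature, labels, optimizer):
    # video_feature: (T, dim) 样本视频的视频特征序列
    # labels: (T,) 多个时间点的标签，1表示该时间点出现了背景音乐，0表示没有出现
    sample_params = model(video_feature)        # (T,): 归一化到(0,1)的样本参数
    # 基于标签与样本参数之间的差异信息构建目标损失函数（此处以二值交叉熵为例）
    loss = nn.functional.binary_cross_entropy(sample_params, labels)
    optimizer.zero_grad()
    loss.backward()                             # 采用梯度下降法更新时间点确定模型的参数
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # 用一个玩具模型演示：把每个时间点的子特征映射为(0,1)内的样本参数
    toy_model = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid(), nn.Flatten(0))
    opt = torch.optim.SGD(toy_model.parameters(), lr=0.1)
    feats = torch.randn(100, 256)
    labels = torch.randint(0, 2, (100,)).float()
    print(training_step(toy_model, feats, labels, opt))
```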
下面将结合图11以及上述步骤901-905中各个可能的实施方式，对本申请实施例提供的背景音乐的插入时间点确定方法进行说明。参见图11，服务器从样本视频集合中获取样本视频1101。服务器对样本视频的视频轨道(多个视频帧)进行特征提取，得到样本视频的样本图像特征1102。服务器在对样本视频的视频轨道进行特征提取时，采用了IResNet模型来实现。服务器对样本视频的音频轨道(多个音频帧)进行音频分离，得到样本视频的原始音频1103和背景音乐1104。服务器对原始音频进行特征提取，得到样本视频的样本音频特征1105。服务器在对样本视频的音频轨道进行特征提取时，采用了PANNs模型来实现。服务器通过时间点确定模型，将样本视频的样本图像特征1102和样本音频特征1105进行融合，得到样本视频的视频特征1106。服务器基于注意力机制，对视频特征1106中每两个子特征进行编码，得到各个子特征的样本参数1107。服务器基于背景音乐1104在样本视频中的出现时间，为样本视频的多个时间点添加标签。服务器基于多个样本参数与多个时间点的标签之间的差异信息构建损失函数，基于损失函数对时间点确定模型进行训练。
相关技术中，往往会通过人工标注的时间点的相关信息作为标签参与到模型的训练中。本申请实施例提供的技术方案使用了基于语义分割模型搭建的音频分离子模型，对样本视频的音轨进行音频分离，把音轨中原有的背景音乐分离开，计算出其时间位置作为时间点的标签直接参与到模型的训练中。该方法能够让模型通过样本视频学习到人类添加插曲位置的习惯信息。同时使用分离背景音乐后得到的原始音频进行模型训练，能够让训练时的音频更趋向于实际应用推理时的音频，从而让时间点确定模型学习到更准确的音频特征。
在本申请实施例提供的技术方案中，使用了基于注意力机制根据视频特征序列确定目标参数的方式，也就是对整个视频特征序列中每个时间点可以当作候选时间点的置信度进行计算。该机制能够让时间点确定模型在整个视频特征序列上计算出每两个时间点之间的注意力参数，能够更准确地训练时间点确定模型的定位能力。
图12是本申请实施例提供的一种背景音乐的插入时间点确定装置的结构示意图,参见图12,装置包括:特征提取模块1201、特征融合模块1202、编码模块1203以及候选时间点确定模块1204。
特征提取模块1201,用于提取目标视频的音频特征以及图像特征。
特征融合模块1202,用于将音频特征以及图像特征进行融合,得到目标视频的视频特征。
编码模块1203,用于基于注意力机制对目标视频的视频特征进行编码,得到多个目标参数,多个目标参数对应于目标视频的多个时间点,目标参数用于表示在对应时间点插入背景音乐的概率。
候选时间点确定模块1204,用于确定插入背景音乐的至少一个候选时间点,候选时间点为多个时间点中目标参数符合目标条件的时间点。
在一些实施例中,特征提取模块1201,用于对目标视频的多个音频帧进行特征提取,得到目标视频的音频特征。对目标视频的多个视频帧进行特征提取,得到目标视频的图像特征。
在一些实施例中,特征提取模块1201,用于对多个音频帧的时域信息进行特征提取,得到多个音频帧的时域音频特征。对多个音频帧的频域信息进行特征提取,得到多个音频帧的频域音频特征。基于多个音频帧的时域音频特征和频域音频特征,获取目标视频的音频特征。
在一些实施例中,特征提取模块1201,用于采用多个一维卷积核对多个音频帧的时域信息进行特征提取,得到多个音频帧的时域音频特征。对多个音频帧的频域信息进行特征提取,得到多个音频帧的频域音频特征包括:采用多个二维卷积核对多个音频帧的频域信息进行特征提取,得到多个音频帧的频域音频特征。
在一些实施例中，特征融合模块1202，用于将多个音频帧的时域音频特征和频域音频特征进行融合，得到目标视频的初始音频特征。分别对初始音频特征进行最大值池化和均值池化，得到目标视频的第一池化特征和第二池化特征。将第一池化特征以及第二池化特征进行融合，得到目标视频的音频特征。
在一些实施例中,视频特征包括多个子特征,多个子特征对应于目标视频的多个时间点,编码模块1203用于通过时间点确定模型,基于注意力机制对多个子特征中每两个子特征进行编码,得到各个子特征的目标参数。
在一些实施例中,编码模块1203用于对于多个子特征中的第一子特征,基于注意力机制确定多个子特征中的多个第二子特征对第一子特征的多个注意力参数。将多个注意力参数进行融合,得到第一子特征的目标参数。
在一些实施例中,编码模块1203用于对第一子特征进行全连接,得到第一子特征的嵌入特征。对于多个第二子特征中的任一第二子特征,对第二子特征进行全连接,得到第二子特征的嵌入特征。基于第一子特征的嵌入特征和第二子特征的嵌入特征,确定第一子特征和第二子特征之间的相似度参数。基于第一子特征以及第一子特征和第二子特征之间的相似度参数,确定第二子特征对第一子特征的注意力参数。
在一些实施例中,装置还包括:
训练模块,用于将样本视频输入时间点确定模型,通过时间点确定模型对样本视频进行特征提取,得到样本视频的样本音频特征以及样本图像特征。通过时间点确定模型,将样本音频特征以及样本图像特征进行融合,得到样本视频的视频特征。通过时间点确定模型,基于注意力机制对样本视频的视频特征进行编码,得到多个样本参数,多个样本参数对应于样本视频的多个时间点,样本参数用于表示在对应时间点插入背景音乐的概率。基于样本视频的多个时间点的标签与多个样本参数之间的差异信息,对时间点确定模型进行训练,标签用于表示样本视频中背景音乐的出现时间。
在一些实施例中,装置还包括:
音频分离模块,用于通过时间点确定模型对样本视频进行音频分离,得到样本视频的原始音频和背景音乐。
训练模块还用于通过时间点确定模型,对样本视频的原始音频和多个样本视频帧进行特征提取,得到样本视频的样本音频特征以及样本图像特征。
在一些实施例中,音频分离模块用于通过时间点确定模型对样本视频的多个样本音频帧的样本频域信息进行特征提取,得到样本视频的第一音频特征。通过时间点确定模型,采用多种尺度对第一音频特征进行池化,得到样本视频的多个第二音频特征。通过时间点确定模型,将多个第二音频特征进行融合,得到样本视频的音频分离特征。通过时间点确定模型,基于音频分离特征对样本频域信息进行分离,得到样本视频的原始音频和背景音乐。
在一些实施例中,音频分离模块用于对多个第二音频特征进行卷积,得到样本视频的多个第三音频特征。对多个第三音频特征进行上采样,得到样本视频的多个第四音频特征,多个第四音频特征的尺寸均与第一音频特征相同。将多个第四音频特征与第一音频特征进行融合,得到样本视频的音频分离特征。
在一些实施例中,音频分离模块用于基于音频分离特征,确定样本频域信息的边界信息,边界信息用于表示样本频域信息中原始音频和背景音乐之间的边界。基于边界信息对样本频域信息进行处理,得到样本视频的原始音频和背景音乐。
在一些实施例中,装置还包括:
标签添加模块,用于基于样本视频的背景音乐在样本视频中的出现时间,为样本视频的多个时间点添加标签。
在一些实施例中,特征提取模块1201还用于提取目标视频的音频特征、图像特征以及字幕特征。
特征融合模块1202，还用于将目标视频的音频特征、图像特征以及字幕特征进行融合，得到目标视频的视频特征。
需要说明的是:上述实施例提供的背景音乐的插入时间点确定装置在确定背景音乐的插入时间点时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将计算机设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的背景音乐的插入时间点确定装置与背景音乐的插入时间点确定方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
通过本申请实施例提供的技术方案,结合了目标视频的音频特征和图像特征来确定目标视频的视频特征,视频特征能够较为准确地表示目标视频的内容。基于注意力机制对视频特征进行编码,得到多个目标参数,多个目标参数表示在对应时间点插入背景音乐的概率。基于多个时间点的目标参数,从多个时间点中确定出候选时间点,候选时间点也即是可以在目标视频中插入背景音乐的时间点。在上述确定候选时间点的过程中,由于结合了注意力机制,使得确定出的候选时间点较为准确。同时在插入背景音乐时,无需视频制作人员完整地观看目标视频,只需在确定出的候选时间点中进行选择即可,在保证准确性的前提下,提高了在视频中插入背景音乐的效率。
本申请实施例提供了一种计算机设备,用于执行上述方法,该计算机设备可以实现为终端或者服务器,下面先对终端的结构进行介绍:
图13是本申请实施例提供的一种终端的结构示意图。
通常,终端1300包括有:一个或多个处理器1301和一个或多个存储器1302。
处理器1301可以包括一个或多个处理核心，比如4核心处理器、8核心处理器等。处理器1301可以采用DSP(Digital Signal Processing，数字信号处理)、FPGA(Field-Programmable Gate Array，现场可编程门阵列)、PLA(Programmable Logic Array，可编程逻辑阵列)中的至少一种硬件形式来实现。处理器1301也可以包括主处理器和协处理器，主处理器是用于对在唤醒状态下的数据进行处理的处理器，也称CPU(Central Processing Unit，中央处理器)；协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中，处理器1301可以集成有GPU(Graphics Processing Unit，图像处理器)，GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中，处理器1301还可以包括AI(Artificial Intelligence，人工智能)处理器，该AI处理器用于处理有关机器学习的计算操作。
存储器1302可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1302还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1302中的非暂态的计算机可读存储介质用于存储至少一个计算机程序,该至少一个计算机程序用于被处理器1301所执行以实现本申请中方法实施例提供的背景音乐的插入时间点确定方法。
在一些实施例中,终端1300还可选包括有:外围设备接口1303和至少一个外围设备。处理器1301、存储器1302和外围设备接口1303之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口1303相连。具体地,外围设备包括:显示屏1305、音频电路1307和电源1308中的至少一种。
外围设备接口1303可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器1301和存储器1302。在一些实施例中,处理器1301、存储器1302和外围设备接口1303被集成在同一芯片或电路板上;在一些其他实施例中,处理器1301、存储器1302和外围设备接口1303中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
显示屏1305用于显示UI(User Interface，用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏1305是触摸显示屏时，显示屏1305还具有采集在显示屏1305的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器1301进行处理。此时，显示屏1305还可以用于提供虚拟按钮和/或虚拟键盘，也称软按钮和/或软键盘。
音频电路1307可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器1301进行处理,或者输入至射频电路1304以实现语音通信。
电源1308用于为终端1300中的各个组件进行供电。电源1308可以是交流电、直流电、一次性电池或可充电电池。
本领域技术人员可以理解,图13中示出的结构并不构成对终端1300的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
上述计算机设备还可以实现为服务器,下面对服务器的结构进行介绍:
图14是本申请实施例提供的一种服务器的结构示意图,该服务器1400可因配置或性能不同而产生比较大的差异,可以包括一个或多个处理器(Central Processing Units,CPU)1401和一个或多个的存储器1402,其中,所述一个或多个存储器1402中存储有至少一条计算机程序,所述至少一条计算机程序由所述一个或多个处理器1401加载并执行以实现上述各个方法实施例提供的方法。当然,该服务器1400还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器1400还可以包括其他用于实现设备功能的部件,在此不做赘述。
在示例性实施例中,还提供了一种计算机可读存储介质,该计算机可读存储介质中存储有至少一条计算机程序,该计算机程序由处理器加载并执行以实现上述实施例中的背景音乐的插入时间点确定方法。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品,包括计算机程序,该计算机程序被处理器执行时实现上述背景音乐的插入时间点确定方法。
在一些实施例中,本申请实施例所涉及的计算机程序可被部署在一个计算机设备上执行,或者在位于一个地点的多个计算机设备上执行,又或者,在分布在多个地点且通过通信网络互连的多个计算机设备上执行,分布在多个地点且通过通信网络互连的多个计算机设备可以组成区块链系统。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来控制相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
上述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (19)

  1. 一种背景音乐的插入时间点确定方法,由计算机设备执行,所述方法包括:
    提取目标视频的音频特征以及图像特征;
    将所述音频特征以及所述图像特征进行融合,得到所述目标视频的视频特征;
    基于注意力机制对所述目标视频的视频特征进行编码,得到多个目标参数,所述多个目标参数对应于所述目标视频的多个时间点,所述目标参数用于表示在对应时间点插入背景音乐的概率;
    确定插入背景音乐的至少一个候选时间点,所述候选时间点为所述多个时间点中目标参数符合目标条件的时间点。
  2. 根据权利要求1所述的方法,其中,所述提取目标视频的音频特征以及图像特征包括:
    对所述目标视频的多个音频帧进行特征提取,得到所述目标视频的音频特征;
    对所述目标视频的多个视频帧进行特征提取,得到所述目标视频的图像特征。
  3. 根据权利要求2所述的方法,其中,所述对所述目标视频的多个音频帧进行特征提取,得到所述目标视频的音频特征包括:
    对所述多个音频帧的时域信息进行特征提取,得到所述多个音频帧的时域音频特征;
    对所述多个音频帧的频域信息进行特征提取,得到所述多个音频帧的频域音频特征;
    基于所述多个音频帧的时域音频特征和频域音频特征,获取所述目标视频的音频特征。
  4. 根据权利要求3所述的方法,其中,所述对所述多个音频帧的时域信息进行特征提取,得到所述多个音频帧的时域音频特征包括:
    采用多个一维卷积核对所述多个音频帧的时域信息进行特征提取,得到所述多个音频帧的时域音频特征;
    所述对所述多个音频帧的频域信息进行特征提取,得到所述多个音频帧的频域音频特征包括:
    采用多个二维卷积核对所述多个音频帧的频域信息进行特征提取,得到所述多个音频帧的频域音频特征。
  5. 根据权利要求3所述的方法,其中,所述基于所述多个音频帧的时域音频特征和频域音频特征,获取所述目标视频的音频特征包括:
    将所述多个音频帧的时域音频特征和频域音频特征进行融合,得到所述目标视频的初始音频特征;
    分别对所述初始音频特征进行最大值池化和均值池化,得到所述目标视频的第一池化特征和第二池化特征;
    将所述第一池化特征以及所述第二池化特征进行融合,得到所述目标视频的音频特征。
  6. 根据权利要求1所述的方法,其中,所述视频特征包括多个子特征,所述多个子特征对应于所述目标视频的多个时间点,所述基于注意力机制对所述目标视频的视频特征进行编码,得到多个目标参数包括:
    基于注意力机制对所述多个子特征中每两个子特征进行编码,得到各个所述子特征的目标参数。
  7. 根据权利要求6所述的方法，其中，所述基于注意力机制对所述多个子特征中每两个子特征进行编码，得到各个所述子特征的目标参数包括：
    对于所述多个子特征中的第一子特征,基于注意力机制确定所述多个子特征中的多个第二子特征对所述第一子特征的多个注意力参数;
    将所述多个注意力参数进行融合,得到所述第一子特征的目标参数。
  8. 根据权利要求7所述的方法,其中,所述基于注意力机制确定所述多个子特征中的多个第二子特征对所述第一子特征的多个注意力参数包括:
    对所述第一子特征进行全连接,得到所述第一子特征的嵌入特征;
    对于所述多个第二子特征中的任一第二子特征,对所述第二子特征进行全连接,得到所述第二子特征的嵌入特征;
    基于所述第一子特征的嵌入特征和所述第二子特征的嵌入特征,确定所述第一子特征和所述第二子特征之间的相似度参数;
    基于所述第一子特征以及所述第一子特征和所述第二子特征之间的相似度参数,确定所述第二子特征对所述第一子特征的注意力参数。
  9. 根据权利要求6-8任一项所述的方法,其中,所述方法还包括:
    将样本视频输入时间点确定模型,通过所述时间点确定模型对所述样本视频进行特征提取,得到所述样本视频的样本音频特征以及样本图像特征;
    通过所述时间点确定模型,将所述样本音频特征以及所述样本图像特征进行融合,得到所述样本视频的视频特征;
    通过所述时间点确定模型,基于注意力机制对所述样本视频的视频特征进行编码,得到多个样本参数,所述多个样本参数对应于所述样本视频的多个时间点,所述样本参数用于表示在对应时间点插入背景音乐的概率;
    基于所述样本视频的多个时间点的标签与所述多个样本参数之间的差异信息,对所述时间点确定模型进行训练,所述标签用于表示所述样本视频中背景音乐的出现时间。
  10. 根据权利要求9所述的方法,其中,所述通过所述时间点确定模型对所述样本视频进行特征提取,得到所述样本视频的样本音频特征以及样本图像特征之前,所述方法还包括:
    通过所述时间点确定模型对所述样本视频进行音频分离,得到所述样本视频的原始音频和背景音乐;
    所述通过所述时间点确定模型对所述样本视频进行特征提取,得到所述样本视频的样本音频特征以及样本图像特征包括:
    通过所述时间点确定模型,对所述样本视频的所述原始音频和多个样本视频帧进行特征提取,得到所述样本视频的样本音频特征以及样本图像特征。
  11. 根据权利要求10所述的方法,其中,所述通过所述时间点确定模型对所述样本视频进行音频分离,得到所述样本视频的原始音频和背景音乐包括:
    通过所述时间点确定模型对所述样本视频的多个样本音频帧的样本频域信息进行特征提取,得到所述样本视频的第一音频特征;
    通过所述时间点确定模型,采用多种尺度对所述第一音频特征进行池化,得到所述样本视频的多个第二音频特征;
    通过所述时间点确定模型,将所述多个第二音频特征进行融合,得到所述样本视频的音频分离特征;
    通过所述时间点确定模型,基于所述音频分离特征对所述样本频域信息进行分离,得到所述样本视频的原始音频和背景音乐。
  12. 根据权利要求11所述的方法,其中,所述将所述多个第二音频特征进行融合,得到所述样本视频的音频分离特征包括:
    对所述多个第二音频特征进行卷积,得到所述样本视频的多个第三音频特征;
    对所述多个第三音频特征进行上采样,得到所述样本视频的多个第四音频特征,所述多个第四音频特征的尺寸均与所述第一音频特征相同;
    将所述多个第四音频特征与所述第一音频特征进行融合,得到所述样本视频的音频分离特征。
  13. 根据权利要求11所述的方法,其中,所述基于所述音频分离特征对所述样本频域信息进行分离,得到所述样本视频的原始音频和背景音乐包括:
    基于所述音频分离特征,确定所述样本频域信息的边界信息,所述边界信息用于表示所述样本频域信息中原始音频和背景音乐之间的边界;
    基于所述边界信息对所述样本频域信息进行处理,得到所述样本视频的原始音频和背景音乐。
  14. 根据权利要求10所述的方法,其中,所述方法还包括:
    基于所述样本视频的背景音乐在所述样本视频中的出现时间,为所述样本视频的多个时间点添加标签。
  15. 根据权利要求1所述的方法,其中,所述提取目标视频的音频特征以及图像特征包括:
    提取所述目标视频的所述音频特征、所述图像特征以及字幕特征;
    所述将所述音频特征以及所述图像特征进行融合,得到所述目标视频的视频特征包括:
    将所述目标视频的所述音频特征、所述图像特征以及所述字幕特征进行融合,得到所述目标视频的视频特征。
  16. 一种背景音乐的插入时间点确定装置,配置于计算机设备中,所述装置包括:
    特征提取模块,用于提取目标视频的音频特征以及图像特征;
    特征融合模块,用于将所述音频特征以及所述图像特征进行融合,得到所述目标视频的视频特征;
    编码模块,用于基于注意力机制对所述目标视频的视频特征进行编码,得到多个目标参数,所述多个目标参数对应于所述目标视频的多个时间点,所述目标参数用于表示在对应时间点插入背景音乐的概率;
    候选时间点确定模块,用于确定插入背景音乐的至少一个候选时间点,所述候选时间点为所述多个时间点中目标参数符合目标条件的时间点。
  17. 一种计算机设备,所述计算机设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条计算机程序,所述计算机程序由所述一个或多个处理器加载并执行以实现如权利要求1至权利要求15任一项所述的背景音乐的插入时间点确定方法。
  18. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条计算机程序,所述计算机程序由处理器加载并执行以实现如权利要求1至权利要求15任一项所述的背景音乐的插入时间点确定方法。
  19. 一种计算机程序产品,包括计算机程序,该计算机程序被处理器执行时实现权利要求1至权利要求15任一项所述的背景音乐的插入时间点确定方法。
PCT/CN2023/077645 2022-04-15 2023-02-22 背景音乐的插入时间点确定方法、装置、设备和存储介质 WO2023197749A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210393110.3 2022-04-15
CN202210393110.3A CN114495916B (zh) 2022-04-15 2022-04-15 背景音乐的插入时间点确定方法、装置、设备和存储介质

Publications (2)

Publication Number Publication Date
WO2023197749A1 WO2023197749A1 (zh) 2023-10-19
WO2023197749A9 true WO2023197749A9 (zh) 2024-01-04

Family

ID=81489589

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/077645 WO2023197749A1 (zh) 2022-04-15 2023-02-22 背景音乐的插入时间点确定方法、装置、设备和存储介质

Country Status (2)

Country Link
CN (1) CN114495916B (zh)
WO (1) WO2023197749A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495916B (zh) * 2022-04-15 2022-07-12 腾讯科技(深圳)有限公司 背景音乐的插入时间点确定方法、装置、设备和存储介质
CN117854535B (zh) * 2024-03-08 2024-05-07 中国海洋大学 基于交叉注意力的视听语音增强方法及其模型搭建方法

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107682642A (zh) * 2017-09-19 2018-02-09 广州艾美网络科技有限公司 识别视频特效触发时间点的方法、装置和终端设备
CN111198958A (zh) * 2018-11-19 2020-05-26 Tcl集团股份有限公司 匹配背景音乐的方法、装置及终端
CN109462776B (zh) * 2018-11-29 2021-08-20 北京字节跳动网络技术有限公司 一种视频特效添加方法、装置、终端设备及存储介质
CN109862393B (zh) * 2019-03-20 2022-06-14 深圳前海微众银行股份有限公司 视频文件的配乐方法、系统、设备及存储介质
CN110335625A (zh) * 2019-07-08 2019-10-15 百度在线网络技术(北京)有限公司 背景音乐的提示及识别方法、装置、设备以及介质
CN112565882A (zh) * 2019-09-26 2021-03-26 北京字节跳动网络技术有限公司 视频生成方法、装置、电子设备和计算机可读介质
CN110740262A (zh) * 2019-10-31 2020-01-31 维沃移动通信有限公司 背景音乐的添加方法、装置及电子设备
US10841666B1 (en) * 2020-03-31 2020-11-17 Amazon Technologies, Inc. Generation of points of insertion of directed content into a video asset
CN111970579A (zh) * 2020-08-14 2020-11-20 苏州思萃人工智能研究所有限公司 基于ai视频理解的视频音乐适配方法与系统
CN111988663B (zh) * 2020-08-28 2022-09-06 北京百度网讯科技有限公司 视频播放节点的定位方法、装置、设备以及存储介质
CN113569088B (zh) * 2021-09-27 2021-12-21 腾讯科技(深圳)有限公司 一种音乐推荐方法、装置以及可读存储介质
CN114495916B (zh) * 2022-04-15 2022-07-12 腾讯科技(深圳)有限公司 背景音乐的插入时间点确定方法、装置、设备和存储介质

Also Published As

Publication number Publication date
CN114495916A (zh) 2022-05-13
CN114495916B (zh) 2022-07-12
WO2023197749A1 (zh) 2023-10-19

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23787397

Country of ref document: EP

Kind code of ref document: A1