CN111741233B - Video dubbing method and device, storage medium and electronic equipment - Google Patents


Publication number
CN111741233B
Authority
CN
China
Prior art keywords
audio
time length
video
target
duration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010687225.4A
Other languages
Chinese (zh)
Other versions
CN111741233A (en)
Inventor
余自强
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010687225.4A priority Critical patent/CN111741233B/en
Publication of CN111741233A publication Critical patent/CN111741233A/en
Application granted granted Critical
Publication of CN111741233B publication Critical patent/CN111741233B/en


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00: Details of television systems
    • H04N 5/222: Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262: Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265: Mixing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G06F 16/7834: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content, using audio features

Abstract

The disclosure provides a video dubbing method, a video dubbing apparatus, a storage medium, and an electronic device. The method comprises the following steps: acquiring at least two video durations of at least two video materials, and generating a video duration set of the at least two video durations; acquiring soundtrack audio, and performing drumbeat detection on the soundtrack audio to determine the drumbeats in it; dividing the soundtrack audio into at least two audio segments at the drumbeats; acquiring the durations of the at least two audio segments, and generating an audio duration set of the at least two audio durations; and matching the video duration set with the audio duration set so that each video duration in the video duration set corresponds to an audio duration in the audio duration set, and generating, according to the matching result, the soundtrack video corresponding to the at least two video materials. The method and apparatus greatly reduce the time and difficulty of making a soundtrack video, and the generated video follows the musical rhythm more closely, so the dubbing quality is higher.

Description

Video dubbing method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of video processing technologies, and in particular, to a video dubbing method, a video dubbing apparatus, a computer-readable medium, and an electronic device.
Background
With the development of the internet, people have become used to sharing videos they shoot through the network. Among these, beat-synchronized videos, in which the rhythm of the audio matches the rhythm of the video cuts, have become the most popular video form thanks to their striking effect. When the pictures in a video are effectively combined with the music, viewers can feel the atmosphere of the video and enjoy an immersive experience.
To bring the video materials close to the rhythm of the background music, a user can reorder the video materials one by one and adjust the playing speed of the materials or of the background music so that the two rhythms align. However, adjusting the materials one by one is time-consuming and laborious, which makes video production harder for users, and changing the playing speed makes playback look unnatural and spoils the musical feel.
In view of the above, there is a need in the art to develop a new video dubbing method and apparatus.
It should be noted that the information disclosed in the above background section is only for enhancement of understanding of the technical background of the present application, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a video dubbing method, a video dubbing apparatus, a computer-readable medium, and an electronic device, so as to overcome, at least to some extent, the technical problems of high production difficulty and poor dubbing effect.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of an embodiment of the present disclosure, there is provided a video dubbing method, including: acquiring at least two video durations of at least two video materials, and generating a video duration set of the at least two video durations;
acquiring soundtrack audio, and performing drumbeat detection on the soundtrack audio to determine the drumbeats in the soundtrack audio;
dividing the soundtrack audio into at least two audio segments according to the drumbeats;
acquiring at least two audio durations of the at least two audio segments, and generating an audio duration set of the at least two audio durations;
and matching the video duration set with the audio duration set so that each video duration in the video duration set corresponds to an audio duration in the audio duration set, and generating the soundtrack video corresponding to the at least two video materials according to the matching result.
According to an aspect of an embodiment of the present disclosure, there is provided a video dubbing apparatus, including:
a duration acquisition module configured to acquire at least two video durations of at least two video materials and generate a video duration set of the at least two video durations;
a drumbeat detection module configured to acquire soundtrack audio and perform drumbeat detection on the soundtrack audio to determine the drumbeats in the soundtrack audio;
a segment dividing module configured to divide the soundtrack audio into at least two audio segments according to the drumbeats;
a set generation module configured to acquire at least two audio durations of the at least two audio segments and generate an audio duration set of the at least two audio durations;
a video generation module configured to match the video duration set with the audio duration set, so that each video duration in the video duration set corresponds to an audio duration in the audio duration set, and to generate the soundtrack video corresponding to the at least two video materials according to the matching result.
In some embodiments of the present disclosure, based on the above technical solutions, the video generation module includes: the material determining submodule is configured to determine a target video time length in the video time length set and determine a target video material in the at least two video materials according to the target video time length;
a segment determination submodule configured to determine a target audio duration in the audio duration set according to the target video duration, and determine a target audio segment in the at least two audio segments according to the target audio duration;
a segment alignment sub-module configured to align the target video material with the target audio segment such that each video duration in the set of video durations corresponds to each audio duration in the set of audio durations.
In some embodiments of the present disclosure, based on the above technical solutions, the segment determining sub-module includes: a difference calculation unit configured to determine a first audio duration in the at least two audio durations of the set of audio durations and calculate a duration difference between the target video duration and the first audio duration;
and a duration difference determining unit configured to obtain a duration threshold corresponding to the duration difference, and to determine the first audio duration as the target audio duration when the duration difference is smaller than the duration threshold.
In some embodiments of the present disclosure, based on the above technical solutions, the segment determining sub-module includes: a difference calculation unit configured to determine a first audio duration in the at least two audio durations of the set of audio durations and calculate a duration difference between the target video duration and the first audio duration;
a duration difference judging unit configured to obtain a duration threshold corresponding to the duration difference, and, when the duration difference is greater than or equal to the duration threshold, to determine a second audio duration among the audio durations other than the first audio duration;
a duration merging unit configured to merge the first audio duration and the second audio duration as a target audio duration corresponding to the target video duration, wherein a duration difference between a sum of the first audio duration and the second audio duration and the target video duration is smaller than the duration threshold.
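As an illustrative sketch of the selection logic described by these sub-modules, the threshold test and the merge fallback can be expressed in a few lines of Python; the function name and the threshold value are assumptions, not values from the disclosure:

```python
def pick_target_audio_duration(target_video_dur, audio_durations, threshold=0.2):
    """Return the audio durations (one, or two merged) matched to a video duration.

    If no single segment duration is within `threshold` seconds of the clip
    length, try merging a first and a second segment so their sum is close
    enough. The 0.2 s threshold is illustrative only.
    """
    for i, first in enumerate(audio_durations):
        if abs(target_video_dur - first) < threshold:
            return [first]                      # a single segment is close enough
        for second in audio_durations[i + 1:]:
            if abs((first + second) - target_video_dur) < threshold:
                return [first, second]          # a merged pair matches
    return []                                   # no match under this threshold
```

For example, a 1.0 s clip with no single close segment can still be matched by merging a 0.3 s and a 0.8 s segment, whose sum differs from the clip length by only 0.1 s.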
In some embodiments of the present disclosure, based on the above technical solutions, the segment determining sub-module includes: a sequence obtaining unit configured to sort the at least two video materials by their video durations to obtain a video duration sequence, and to sort the at least two audio segments by their audio durations to obtain an audio duration sequence;
a sequence determining unit configured to determine the position of the target video duration in the video duration sequence, and to determine, according to that position, the target audio duration corresponding to the target video duration in the audio duration sequence.
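The order-based matching of this embodiment can be sketched as follows (a minimal illustration; the function name is hypothetical):

```python
def match_by_order(video_durations, audio_durations):
    """Sort both sides by duration and pair them rank-for-rank.

    A clip's position in the sorted video duration sequence determines which
    audio segment it receives: the one at the same position in the sorted
    audio duration sequence.
    """
    video_ranked = sorted(video_durations)
    audio_ranked = sorted(audio_durations)
    return list(zip(video_ranked, audio_ranked))
```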
In some embodiments of the present disclosure, based on the above technical solutions, the duration difference judging unit includes: a sequence acquisition subunit configured to determine the audio order of the first audio duration in the audio duration sequence, and to determine the other audio durations in the audio duration sequence that have not been used as target audio durations;
a duration determination subunit configured to determine a second audio duration among the other audio durations according to the audio order.
In some embodiments of the present disclosure, based on the above technical solution, the segment alignment sub-module includes: a ratio calculation unit configured to calculate the ratio of the target video duration to the target audio duration to obtain a duration ratio, and to obtain a ratio threshold corresponding to the duration ratio;
a ratio comparison unit configured to compare the duration ratio with the ratio threshold and align the target audio clip with the target video material according to a ratio comparison result.
In some embodiments of the present disclosure, based on the above technical solutions, the ratio comparing unit includes: a material clipping subunit configured to clip the target video material to align the target audio segment with the target video material if the duration ratio is greater than the ratio threshold;
a speed adjustment subunit configured to adjust a playing speed of the target video material to align the target audio segment with the target video material if the duration ratio is less than or equal to the ratio threshold.
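A minimal sketch of the trim-versus-retime decision described by these subunits follows; the ratio threshold of 1.2 is an illustrative guess, not a value from the disclosure:

```python
def align_clip(video_dur, audio_dur, ratio_threshold=1.2):
    """Decide how to align a video clip with its audio segment.

    If the clip is much longer than the segment (duration ratio above the
    threshold), trim the clip; otherwise retime it slightly so that its
    playback length matches the segment.
    """
    ratio = video_dur / audio_dur
    if ratio > ratio_threshold:
        return ("trim", audio_dur)     # cut the clip down to the segment length
    return ("retime", ratio)           # playback-speed factor to apply
```

Trimming a long clip avoids the unnatural look of heavy speed changes, while small retiming factors stay imperceptible; this is the trade-off the two subunits implement.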
In some embodiments of the present disclosure, based on the above technical solutions, the drum point detecting module includes: an audio conversion sub-module configured to perform Fourier transform on the soundtrack audio to obtain a spectrum of the soundtrack audio;
the frequency spectrum difference submodule is configured to perform difference calculation on the frequency spectrum to obtain a frequency spectrum flux mean value of the frequency spectrum;
a peak detection sub-module configured to perform peak detection on the spectral flux average value and determine the drumbeats in the soundtrack audio.
In some embodiments of the present disclosure, based on the above technical solutions, the peak detection sub-module includes: a parameter determining unit configured to determine a parameter corresponding to the spectral flux average value, and to calculate a spectral flux threshold from the spectral flux average value and the parameter;
a spectrum comparison unit configured to compare the spectral flux with the spectral flux threshold and determine the drumbeats in the soundtrack audio according to the comparison result.
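Threshold-based peak picking of this kind might be sketched as follows; the local-maximum test and the multiplier standing in for the "parameter" are assumptions for illustration:

```python
import numpy as np

def detect_drumbeats(flux, multiplier=1.5, hop_seconds=1024 / 44100):
    """Mark a drumbeat where the spectral flux is a local peak above
    mean(flux) * multiplier. Returns drumbeat times in seconds.

    `multiplier` plays the role of the "parameter" above; its value here
    is a guess, not taken from the disclosure.
    """
    flux = np.asarray(flux, dtype=float)
    threshold = flux.mean() * multiplier
    beats = []
    for i in range(1, len(flux) - 1):
        # above threshold AND a local maximum of the flux curve
        if flux[i] > threshold and flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]:
            beats.append(i * hop_seconds)
    return beats
```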
In some embodiments of the present disclosure, based on the above technical solutions, the spectrum difference sub-module includes: the sound spectrum generating unit is configured to splice the frequency spectrums to generate sound spectrums corresponding to the frequency spectrums, and filter the sound spectrums by using a Mel filter to obtain Mel frequency spectrums;
and the frequency spectrum calculating unit is configured to perform difference calculation on the Mel frequency spectrum to obtain frequency spectrum flux, and calculate the average value of the frequency spectrum flux to obtain a frequency spectrum flux average value.
In some embodiments of the present disclosure, based on the above technical solutions, the audio conversion sub-module includes: an audio framing unit configured to frame the soundtrack audio to obtain audio frames;
a spectrum generating unit configured to perform Fourier transform on the audio frame to obtain a spectrum corresponding to the soundtrack audio.
According to an aspect of the embodiments of the present disclosure, there is provided a computer readable medium, on which a computer program is stored, the computer program, when executed by a processor, implementing a video dubbing method as in the above technical solution.
According to an aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute the video dubbing method as in the above technical solution via executing the executable instructions.
In the technical solution provided by the embodiments of the present disclosure, the soundtrack video for the video materials is generated by matching the video duration set with the audio duration set. On the one hand, this greatly reduces the time and difficulty of making a soundtrack video and increases the user's interest in video creation; on the other hand, because the video duration set and the audio duration set match well, the generated soundtrack video follows the musical rhythm more closely and the soundtrack quality is higher, which encourages users to share the video afterwards.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
fig. 1 schematically shows an exemplary system architecture diagram to which the disclosed solution is applied.
Fig. 2 schematically illustrates a flow chart of steps of a video soundtrack method in some embodiments of the present disclosure.
Fig. 3 schematically illustrates a flow chart of steps of a method of drumbeat detection of soundtrack audio in some embodiments of the present disclosure.
Fig. 4 schematically illustrates a flow chart of steps of a method of deriving a spectrum of soundtrack audio in some embodiments of the present disclosure.
Fig. 5 schematically illustrates a flow chart of steps of a method of obtaining a mean value of spectral flux in some embodiments of the present disclosure.
Fig. 6 schematically illustrates the use of a triangular filter as a mel filter in some embodiments of the present disclosure.
Fig. 7 schematically illustrates a flow chart of steps of a method of peak detection in some embodiments of the present disclosure.
Fig. 8 schematically illustrates a flow chart of steps of a method of matching a set of video durations and a set of audio durations in some embodiments of the present disclosure.
FIG. 9 schematically illustrates a flow chart of steps of a method of determining a target audio duration from a target video duration in some embodiments of the present disclosure.
FIG. 10 schematically illustrates a flow chart of steps in another method of determining a target audio duration based on a target video duration in some embodiments of the present disclosure.
FIG. 11 is a flow chart that schematically illustrates steps in yet another method for determining a target audio duration based on a target video duration in some embodiments of the present disclosure.
Fig. 12 schematically illustrates a flow chart of steps of a method of determining a second audio duration in some embodiments of the present disclosure.
Fig. 13 schematically illustrates a flow chart of steps of a method of aligning a target audio segment with a target video material in some embodiments of the present disclosure.
Fig. 14 schematically illustrates a flow chart of steps of a method of aligning a target audio segment with a target video material according to a ratio comparison result in some embodiments of the present disclosure.
Fig. 15 schematically shows a flowchart of steps of a video dubbing method in an application scenario according to an embodiment of the present disclosure.
Fig. 16 schematically shows a flowchart of steps of a drum point detection method in an application scenario according to an embodiment of the present disclosure.
Fig. 17 schematically shows a spectrogram obtained by stitching the frequency spectra in an embodiment of the present disclosure.
Fig. 18 schematically shows an effect obtained by performing filtering processing using a mel filter in the embodiment of the present disclosure.
Fig. 19 schematically shows an effect diagram of the spectral flux average obtained in the embodiment of the present disclosure.
Fig. 20 schematically shows a drum point position diagram obtained by performing peak detection in the embodiment of the present disclosure.
Fig. 21 schematically illustrates a flowchart of the steps of a method of aligning video material with an audio clip in an application scenario according to an embodiment of the present disclosure.
Fig. 22 schematically shows an effect diagram of aligning an audio segment with a video material in the embodiment of the present disclosure.
Fig. 23 schematically illustrates an effect diagram of generating a score video in an application scenario according to an embodiment of the present disclosure.
Fig. 24 schematically illustrates a block diagram of a video soundtrack apparatus in some embodiments of the present disclosure.
FIG. 25 schematically illustrates a structural diagram of a computer system suitable for use with an electronic device that implements an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
To address the problems of the above schemes, the present disclosure provides a video dubbing method, a video dubbing apparatus, a computer-readable medium, and an electronic device.
Fig. 1 shows an exemplary system architecture diagram to which the disclosed solution is applied.
As shown in fig. 1, the system architecture 100 may include a terminal 110, a network 120, and a server 130, where the terminal 110 and the server 130 are connected through the network 120.
The terminal 110 may be a desktop terminal or a mobile terminal; the mobile terminal may be at least one of a smartphone, a tablet computer, a laptop computer, a smart speaker, a smartwatch, and the like, and the desktop terminal may be a desktop computer, but neither is limited thereto. Network 120 may be any type of communication medium capable of providing a communication link between the terminal 110 and the server 130, such as a wired communication link, a wireless communication link, or a fiber-optic cable, and the disclosure is not limited in this respect. The server 130 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and big data and artificial intelligence platforms.
Specifically, the user selects the video materials and the soundtrack audio corresponding to them through the terminal 110, and the server 130 receives the video materials and the soundtrack audio through the network 120. Further, the server 130 may obtain the video durations of the video materials and generate a video duration set of at least two video durations; correspondingly, the server 130 may determine the audio durations of the audio segments delimited by the drumbeats obtained from drumbeat detection, and generate an audio duration set of at least two audio durations. The two sets are then matched to generate the soundtrack video corresponding to the video materials, which is sent to the terminal 110 for the user to view.
In addition, the video dubbing method in the embodiments of the present disclosure may be applied to a terminal or to a server, which is not particularly limited in this disclosure. The embodiments of the present disclosure are mainly illustrated with the video dubbing method applied to the server 130.
The video dubbing method, the video dubbing apparatus, the computer-readable medium, and the electronic device provided by the present disclosure are described in detail below with reference to the specific embodiments.
Fig. 2 schematically illustrates a flow chart of steps of a video soundtrack method in some embodiments of the present disclosure. As shown in fig. 2, the video dubbing method may mainly include the following steps:
step S210, at least two video time lengths of at least two video materials are obtained, and a video time length set of at least two video time lengths is generated.
And S220, acquiring the audio of the soundtrack, and carrying out drumbeat detection on the soundtrack to determine drumbeats in the soundtrack.
And step S230, dividing the dubbing music audio into at least two audio segments according to the drumbeat.
And S240, acquiring at least two audio durations of at least two audio segments and generating an audio duration set of the at least two audio durations.
And S250, matching the video time length set with the audio time length set to enable each video time length in the video time length set to correspond to each audio time length in the audio time length set, and generating the dubbing music video corresponding to at least two video materials according to a matching result.
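As a non-limiting sketch, the flow of steps S210 to S250 can be condensed as follows; drumbeat detection is stubbed out as a list of beat times, all helper names are hypothetical, and the greedy closest-match strategy is only one possible realization of the matching step:

```python
def split_by_drumbeats(drumbeat_times, total_duration):
    """Step S230: segment boundaries are the detected drumbeat times (seconds)."""
    bounds = [0.0] + sorted(drumbeat_times) + [total_duration]
    return [round(b - a, 3) for a, b in zip(bounds, bounds[1:])]

def make_score_video(video_durations, drumbeat_times, audio_total):
    """Steps S240-S250: pair each video duration with the closest unused
    audio segment duration (greedy sketch)."""
    audio_durations = split_by_drumbeats(drumbeat_times, audio_total)
    matches, remaining = [], list(enumerate(audio_durations))
    for v in video_durations:
        # choose the unused audio segment whose duration is closest to this clip
        idx, dur = min(remaining, key=lambda p: abs(p[1] - v))
        matches.append((v, dur))
        remaining.remove((idx, dur))
    return matches
```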
In an exemplary embodiment of the present disclosure, the soundtrack video for the video materials is generated by matching the video duration set with the audio duration set. On the one hand, this greatly reduces the time and difficulty of making a soundtrack video and increases the user's interest in video creation; on the other hand, because the two duration sets match well, the generated soundtrack video follows the musical rhythm more closely and the soundtrack quality is higher, which encourages users to share the video afterwards.
The following describes each step of the video dubbing music method in detail.
In step S210, at least two video durations of at least two video materials are obtained, and a video duration set of the at least two video durations is generated.
In an exemplary embodiment of the present disclosure, the video material may be a video, a picture, or other material to be dubbed or edited, which is not particularly limited in this exemplary embodiment. The video material may be selected by the user, received from other users, or generated through processing. The video duration may be the playing duration of each video material, in units such as seconds or milliseconds, which is likewise not particularly limited here.
Further, after at least two video durations corresponding to at least two video materials are acquired one by one, a corresponding video duration set can be generated according to the video durations, that is, the video duration set includes values of the video durations. For example, the video duration set includes "0.2, 0.6, 0.7, 0.9, 1.2, and 4.1", and each value represents the video duration in seconds.
In step S220, the soundtrack audio is acquired and subjected to drumhead detection to determine the drumhead in the soundtrack audio.
In the exemplary embodiment of the present disclosure, the soundtrack audio may be selected and transmitted by the user, or may be selected by the user from candidate audio, or may be determined by the server in the audio library according to the user requirement, which is not particularly limited in this exemplary embodiment.
Further, drumbeat detection is performed on the soundtrack audio to determine the drumbeats in it. A drumbeat may be a strike or tap on a drum, or the beat of the percussion part in an orchestra.
In an alternative embodiment, fig. 3 shows a flowchart of the steps of a method for drumbeat detection of soundtrack audio. As shown in fig. 3, the method includes at least the following steps: in step S310, a Fourier transform is performed on the soundtrack audio to obtain the spectrum of the soundtrack audio.
Before the Fourier transform is applied, the soundtrack audio may first be framed.
In an alternative embodiment, fig. 4 shows a flowchart of the steps of a method for deriving the spectrum of soundtrack audio. As shown in fig. 4, the method includes at least the following steps: in step S410, audio frames are obtained by framing the soundtrack audio.
Specifically, the soundtrack audio may be framed with a sliding window: each time the window advances by one hop length over the soundtrack audio, one audio frame is produced, and the hop length equals the configured window width.
For example, with a window width of 1024 samples, the divided audio frames would be: the first audio frame covers samples 0 to 1024; the second, 1025 to 2048; the third, 2049 to 3072; and so on. Thus, at a sampling frequency of 44.1 kHz, 43 audio frames are obtained per second of audio.
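A minimal sketch of this framing step (non-overlapping frames whose hop equals the window width; the function name is hypothetical):

```python
import numpy as np

def frame_audio(samples, frame_len=1024):
    """Split soundtrack samples into non-overlapping frames of `frame_len`.

    Matches the example above: the hop equals the window width, so 44.1 kHz
    audio yields 44100 // 1024 = 43 frames per second.
    """
    samples = np.asarray(samples)
    n_frames = len(samples) // frame_len
    # drop the trailing partial frame, then reshape into (n_frames, frame_len)
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)
```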
In step S420, fourier transform is performed on the audio frame to obtain a spectrum corresponding to the soundtrack audio.
Since it is difficult to determine the characteristics of the signal in the time domain, the audio frame can be observed by converting the audio frame into an energy distribution in the frequency domain. Different energy distributions represent the characteristics of different audio frames.
In particular, converting an audio frame from the time domain to the frequency domain can be achieved using a Fourier transform. The Fourier transform expresses a function satisfying certain conditions as a combination of trigonometric functions or of their integrals.
In the present disclosure, a spectrum corresponding to the soundtrack audio may be derived using a fast fourier transform. The fast fourier transform is a discrete fourier transform calculated by a computer, and has an advantage of a small amount of calculation. In addition, a short-time fourier transform may also be used, and the exemplary embodiment is not particularly limited thereto.
In the exemplary embodiment, the audio frame obtained by framing processing can be converted from the time domain to the frequency domain through fourier transform, so that a corresponding frequency spectrum is obtained, the calculation mode is simple, the calculation amount is small, and the calculation resources are saved.
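A minimal sketch of this transform step, assuming NumPy's real-input FFT; the Hann window applied before the transform is an assumption of the sketch, not stated in the embodiment:

```python
import numpy as np

def frame_spectra(frames: np.ndarray) -> np.ndarray:
    # Window each frame (Hann) to reduce spectral leakage, then take the
    # magnitude of the real-input FFT, giving one spectrum per frame.
    window = np.hanning(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

spectra = frame_spectra(np.random.randn(43, 1024))
print(spectra.shape)  # (43, 513): 1024 // 2 + 1 frequency bins per frame
```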
In step S320, a spectrum flux average of the spectrum is obtained by performing a difference calculation on the spectrum.
In an alternative embodiment, fig. 5 shows a flow chart of the steps of a method of calculating the difference of the frequency spectrum, as shown in fig. 5, the method comprising at least the following steps: in step S510, the frequency spectrums are spliced to generate a sound spectrum corresponding to the frequency spectrums, and the sound spectrum is filtered by a mel filter to obtain a mel frequency spectrum.
The concatenation of the spectra may be performed along the time axis. Specifically, the spectrogram of each spectrum is rotated 90 degrees to the left, the amplitude values in the rotated spectrogram are quantized according to the number of gray levels, and the quantized amplitude values are then represented by gray levels to generate the sound spectrum. A time dimension is thereby added to the original frequency spectra, that is, the sound spectrum corresponding to the spectra is obtained. The larger the amplitude value, the smaller the corresponding gray level.
The mel filter may be a mel filter bank including a plurality of filters. The number of filters may be determined according to the number of divisions of the mel-frequency interval, which is not particularly limited in the present exemplary embodiment.
The arrangement of the Mel filter bank corresponds to the auditory model of human ears, only some specific frequencies are concerned, and signals of specific frequencies are allowed to pass through. The Mel filter can filter redundant data in the sound spectrum, and ensure effective data therein to obtain corresponding Mel frequency spectrum.
In the present exemplary embodiment, the mel filter may adopt a triangular filter, that is, each data in each frame of the sound spectrum corresponds to a gain, and all the data multiplied by the gain in one frame are added to obtain the mel spectrum.
Fig. 6 shows a schematic diagram of using triangular filters as the mel filter. As shown in fig. 6, when the mel filter bank is composed of triangular filters, the filters can be set dense with a large threshold value at low frequencies, and sparse with a low threshold value at high frequencies. This setting follows the objective rule that the human ear is duller to sounds of higher frequency.
In addition to this, a rectangular filter may be employed. When there is no overlapping part between every two consecutive audio frames in the framing process, a rectangular filter can be used as a mel filter to ensure effective data in each frame of audio frame.
The mel scale describes frequencies as they are perceived by the human ear. The ear's perception of different frequencies in a speech signal is not linear in the frequency domain: generally, perception of low frequencies is finer and perception of high frequencies is coarser. For example, two sound signals may differ in frequency by a factor of two, yet the human ear does not perceive them as being twice as far apart.
Specifically, the relationship between the mel scale and frequency accords with formula (1):

mel(f) = 2595 × log10(1 + f / 700)    (1)

where mel(f) represents the mel value and f is the frequency in Hz. When the frequency is small, the mel value changes quickly with the frequency; when the frequency is large, the mel value rises very slowly.
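The standard mel mapping of formula (1) can be checked numerically; the helper name `hz_to_mel` is illustrative:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    # Formula (1): mel(f) = 2595 * log10(1 + f / 700)
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Fast growth at low frequencies, much slower growth at high frequencies:
for f in (100.0, 1000.0, 10000.0):
    print(f, round(hz_to_mel(f), 1))
```

With the constants above, 1000 Hz maps to almost exactly 1000 mel, which is the usual calibration point of the scale.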
In step S520, the difference between the mel frequency spectra is calculated to obtain the spectral flux, and the average value of the spectral flux is calculated to obtain the average value of the spectral flux.
To extract the dynamic features of the soundtrack audio, the information in the mel spectrum may be compressed into a manageable one-dimensional floating-point array. Specifically, a difference calculation is adopted, with reference to formula (2):

SF(i) = Σj ( Xi(j) − Xi−1(j) )    (2)

where SF(i) is the spectral flux of the i-th mel spectrum, Xi(j) is the amplitude corresponding to the j-th frequency in the i-th mel spectrum, and Xi−1(j) is the amplitude corresponding to the j-th frequency in the (i−1)-th mel spectrum. That is, the amplitude of each band of the previous spectrum is subtracted from the amplitude of the corresponding band in the current spectrum, and the calculated differences are added to obtain the spectral flux.
Negative values of spectral flux may be culled in view of the interest in only the rise of spectral flux, but not the fall of spectral flux.
It should be noted that, for the convenience of subsequent peak detection, the obtained spectral flux may be subjected to secondary difference calculation according to formula (2), so as to make the rising trend of the spectral flux more prominent.
After the spectral flux is obtained, the audio tempo is basically already visible, but for subsequent peak detection, the spectral flux may be averaged.
For example, for a window size of 1024 at a sample rate of 44.1 kHz, each audio frame after framing spans approximately 43 ms. When the mean value of the spectral flux is to be taken over a time span of 0.5 s, about 0.5/0.043 ≈ 11 sample windows can be used. That is, for each spectral flux value, the spectral fluxes of the preceding 5 samples, the following 5 samples, and the current sample may be selected to obtain the average for the current sample.
In the present exemplary embodiment, the difference calculation and the averaging process performed on the mel spectrum may obtain a spectrum flux average value, so that the peak value is more prominent, thereby facilitating the subsequent peak value detection.
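The difference and averaging steps above can be sketched as follows; the function names and the edge-truncated averaging window are assumptions of this sketch:

```python
import numpy as np

def spectral_flux(mel_frames: np.ndarray) -> np.ndarray:
    # Per formula (2): subtract each band of the previous frame from the
    # corresponding band of the current frame and sum the differences.
    # Negative differences are discarded, since only rises matter.
    diff = np.diff(mel_frames, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)

def flux_average(flux: np.ndarray, half_window: int = 5) -> np.ndarray:
    # Average each flux sample with its 5 neighbours on either side
    # (11 samples in total), truncating the window at the array edges.
    avg = np.empty_like(flux)
    for i in range(len(flux)):
        lo, hi = max(0, i - half_window), min(len(flux), i + half_window + 1)
        avg[i] = flux[lo:hi].mean()
    return avg

# Three toy 2-band "mel frames":
flux = spectral_flux(np.array([[1.0, 2.0], [3.0, 1.0], [3.0, 5.0]]))
print(flux)  # [2. 4.]  (rise of 2 in band 0, then rise of 4 in band 1)
```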
In step S330, peak detection is performed on the spectral flux average value, and a drum point in the dubbing music audio is determined.
In an alternative embodiment, fig. 7 shows a flow chart of the steps of a method of performing peak detection, as shown in fig. 7, the method comprising at least the steps of: in step S710, a parameter corresponding to the mean value of the spectral flux is determined, and the mean value of the spectral flux and the parameter are calculated to obtain a spectral flux threshold.
Wherein the parameter may be a constant that is manually determined or manually adjusted. In general, the value may be 1.2, or other constants that may be determined according to actual conditions, and this exemplary embodiment is not particularly limited to this.
Specifically, the spectral flux average value may be multiplied by the parameter to obtain a corresponding spectral flux threshold.
In step S720, the spectral flux is compared with the spectral flux threshold, and the drum points in the soundtrack audio are determined according to the comparison result.
A specific way of comparison may be to compare the spectral flux with a calculated spectral flux threshold.
When the spectral flux is larger than the spectral flux threshold, the sampling point corresponding to that spectral flux can be determined as a drum point in the soundtrack audio. The amplitude and position of the drum point are saved for subsequent audio segment division.
In the exemplary embodiment, the drumhead of the dubbing music audio can be determined according to the definition of the spectral flux threshold, and the determination of the drumhead position is very accurate, so that a basis is provided for the division and determination of the subsequent audio segment.
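A sketch of this thresholded peak detection, with the 1.2 multiplier from the text (the function name and the returned (position, amplitude) pairs are illustrative):

```python
def detect_drum_points(flux, flux_avg, multiplier=1.2):
    # A sample is a drum point when its spectral flux exceeds the local
    # spectral-flux average scaled by the hand-tuned multiplier.
    return [
        (i, f)
        for i, (f, a) in enumerate(zip(flux, flux_avg))
        if f > a * multiplier
    ]

# The sample at index 1 rises well above its local average, so it is the
# only one reported as a drum point:
points = detect_drum_points([1.0, 5.0, 1.2], [2.0, 2.0, 2.0])
print(points)  # [(1, 5.0)]
```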
In step S230, the dubbing music audio is divided into at least two audio pieces according to the drumbeat.
In an exemplary embodiment of the present disclosure, when a drum point is at a start point or an end point of an audio segment, the audio segment may be dubbing audio determined by two adjacent drum points. Therefore, after determining the drumhead of the soundtrack audio, the soundtrack audio between two adjacent drumheads may be determined as one audio clip.
In addition, when the drum point is at the middle position of the audio segment, the dubbing music audio can be divided into two audio segments according to one drum point. Therefore, the manner of dividing the score audio according to the drumbeat may be determined according to actual circumstances, and the present exemplary embodiment is not particularly limited thereto.
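Under the first convention (drum points at segment boundaries), the division can be sketched like this; treating the audio before the first drum point and after the last one as segments of their own is an assumption of the sketch:

```python
def split_by_drum_points(total_duration, drum_times):
    # Adjacent drum points bound one audio segment; the audio before the
    # first drum point and after the last one form the outer segments.
    bounds = [0.0] + sorted(drum_times) + [total_duration]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

segments = split_by_drum_points(10.0, [2.5, 4.0, 7.2])
print(segments)  # [(0.0, 2.5), (2.5, 4.0), (4.0, 7.2), (7.2, 10.0)]
```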
In step S240, at least two audio durations of at least two audio segments are obtained, and an audio duration set of the at least two audio durations is generated.
In an exemplary embodiment of the present disclosure, after the audio clip is determined, a play time period of the audio clip may be determined as an audio time period. For example, the audio time duration may be in units of seconds or milliseconds, which is not particularly limited in the present exemplary embodiment.
Further, after at least two audio durations of at least two audio segments are acquired one by one, a corresponding audio duration set may be generated according to the multiple audio durations, that is, the audio duration set includes values of the multiple audio durations. For example, "0.2, 0.3, 0.6, 0.9, and 3.3" are included in the audio duration set, and each value represents the audio duration in seconds.
In step S250, the video duration set and the audio duration set are matched, so that each video duration in the video duration set corresponds to each audio duration in the audio duration set, and a dubbing music video corresponding to at least two video materials is generated according to a matching result.
In an exemplary embodiment of the present disclosure, after the video duration set and the audio duration set are obtained, the two sets may be matched to generate a corresponding soundtrack video.
It should be noted that when a user desires a specific sequence for the video materials corresponding to certain video durations in the video duration set, those video materials may be frozen according to the specific sequence.
For example, the specific-sequence requirement may be that the opening clip must be a certain video material, or that two particular video materials must appear as a combination. The freezing process may remove such video material while the audio duration set is being matched against the video duration set, perform the subsequent operations without it, and then insert the video material back into the corresponding position after the matching is completed. Other freezing processing modes may also exist, and this exemplary embodiment is not particularly limited in this respect.
In an alternative embodiment, fig. 8 shows a flow chart of the steps of a method of matching a set of video durations and a set of audio durations, as shown in fig. 8, the method comprising at least the steps of: in step S810, a target video duration is determined in the video duration set, and a target video material is determined in at least two video materials according to the target video duration.
When the video time length in the video time length set is obtained, a target video time length can be determined from the video time lengths, and a video material corresponding to the target video time length is determined as a target video material.
The method for determining the target video duration may be arbitrarily selected, may be sequentially selected, or may be determined according to other manners capable of implementing traversal selection, which is not particularly limited in this exemplary embodiment.
For example, when the target video duration is determined from the video durations, the video durations may be sorted first, and the target video durations may be selected in order from short to long. In addition, the target video duration may be selected in order from long to short, which is not particularly limited in the present exemplary embodiment.
In step S820, a target audio duration is determined in the audio duration set according to the target video duration, and a target audio segment is determined in at least two audio segments according to the target audio duration.
In some embodiments, the target video duration is the same as the target audio duration, or the duration difference between them is less than a preset duration threshold. A corresponding target audio duration may be determined one by one for each video duration in the video duration set, so that each video duration corresponds to an audio duration in the audio duration set. The duration difference between each video duration and its corresponding audio duration is less than the preset threshold, and the audio durations corresponding to different video durations in the video duration set are different from one another.
In some embodiments, when a certain video duration in the video duration set is too short for a corresponding audio duration to be found in the audio duration set (that is, no audio duration yields a duration difference smaller than the preset threshold), or when a user intends to play two or more video durations in the video duration set together, two or more video durations in the set may be combined into a new video duration for determining a corresponding audio duration. In this case, the two or more video materials corresponding to those video durations are combined into a new video material, and the duration difference between the new video duration and the corresponding audio duration is smaller than the preset threshold.
Similarly, when a certain audio duration in the audio duration set is too short for a corresponding video duration to be found in the video duration set, or when a user intends to play two or more audio durations in the audio duration set together, two or more audio durations in the set can be combined into a new audio duration to be matched with the video durations in the video duration set, so as to determine the video duration corresponding to the new audio duration. The duration difference between the new audio duration and the corresponding video duration is smaller than the preset threshold.
Fig. 9 and 10 show flowcharts of steps of two methods for determining the target audio duration, respectively. Fig. 9 may be a flowchart of steps for determining one audio duration as the target audio duration, and fig. 10 is a flowchart of steps for determining at least two audio durations as the target audio duration.
In an alternative embodiment, fig. 9 shows a flowchart of the steps of a method for determining a target audio duration based on a target video duration, as shown in fig. 9, the method at least comprises the following steps: in step S910, a first audio duration is determined in at least two audio durations in the audio duration set, and a duration difference between the target video duration and the first audio duration is calculated.
After determining the target video duration, a first audio duration may be selected at the audio duration to match the target video duration. Wherein the selection mode is not particularly limited. And after matching, subtracting the audio time length from the target video time length to obtain a time length difference value.
In step S920, a duration threshold corresponding to the duration difference is obtained, and when the duration difference is smaller than the duration threshold, the first audio duration is determined as the target audio duration.
In order to determine whether the time length difference value meets the matching condition, a time length threshold value can be preset for judgment. The time length threshold may be set to 0.2 or 0.3, or may be set to other values, which is not particularly limited in the present exemplary embodiment.
After the duration difference and the duration threshold are obtained, the duration difference and the duration threshold may be compared in magnitude.
When the time length difference is smaller than the time length threshold, it may be determined that the audio time length satisfies a matching condition with the target video time length (i.e., the audio time length corresponds to the target video time length), and thus the audio time length is determined to be the target audio time length.
In the exemplary embodiment, the target audio time length is determined according to the matching condition that the time length difference is smaller than the time length threshold, the determination mode is simple and feasible, and the practical operability is extremely strong.
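The single-duration matching of fig. 9 can be sketched as follows, with the 0.2 s threshold mentioned in the text; the function name and first-match selection order are illustrative assumptions:

```python
def match_single_duration(target_video_dur, audio_durs, threshold=0.2):
    # Return the index of the first audio duration whose difference from
    # the target video duration is below the threshold, or None.
    for i, dur in enumerate(audio_durs):
        if abs(target_video_dur - dur) < threshold:
            return i
    return None

# The 0.9 s video material matches the 0.9 s audio clip exactly:
print(match_single_duration(0.9, [0.2, 0.3, 0.6, 0.9, 3.3]))  # 3
```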
In addition, in an alternative embodiment, fig. 10 is a flow chart illustrating steps of another method for determining a target audio duration according to a target video duration, as shown in fig. 10, the method at least includes the following steps: in step S1010, a first audio duration is determined in at least two audio durations in the audio duration set, and a duration difference between the target video duration and the first audio duration is calculated.
After determining the target video duration, a first audio duration may be selected at the audio duration to match the target video duration. Wherein the selection mode is not particularly limited. And after matching, subtracting the audio time length from the target video time length to obtain a time length difference value.
In step S1020, a duration threshold corresponding to the duration difference is obtained, and when the duration difference is greater than or equal to the duration threshold, a second audio duration is determined in other audio durations except the first audio duration.
To determine whether the time length difference satisfies the matching condition, a time length threshold that is the same as that in step S920 may be preset for determination, or may be set to other values, which is not particularly limited in this exemplary embodiment.
After the duration difference and the duration threshold are obtained, the duration difference and the duration threshold may be compared in magnitude.
When the time length difference is greater than or equal to the time length threshold, a second audio time length adapted to the first audio time length may be determined in the audio time length first to perform subsequent combination processing, and then the combined audio time length is determined as the target audio time length. The manner of selecting the second audio duration from the audio duration set is not particularly limited.
In step S1030, the first audio duration and the second audio duration are combined as a target audio duration corresponding to the target video duration, where a duration difference between a sum of the first audio duration and the second audio duration and the target video duration is less than a duration threshold.
After the second audio duration is selected, the second audio duration may be supplemented to the previously selected first audio duration to obtain the target audio duration. The second audio duration is combined with the previously selected audio duration, and may be supplemented to the front of the first audio duration, or may be supplemented to the back of the first audio duration, or may have other set supplementing manners, which is not particularly limited in this exemplary embodiment.
It should be noted that the process of selecting the second audio duration for merging is not completed in a single step. Rather, after a second audio duration is selected, it is checked whether the duration difference between the merged first and second audio durations and the target video duration meets the requirement of the duration threshold.
When the combined audio time length is determined to meet the matching requirement, determining the supplemented audio time length as the target audio time length; and when the supplemented audio time length does not meet the matching requirement, continuing to select other audio time lengths as second audio time lengths to carry out combination and judgment until the second audio time length is selected, wherein the time length difference between the sum of the first audio time length and the second audio time length and the target video time length is less than the time length threshold value.
In addition, when all the two other audio durations do not satisfy the matching condition after being supplemented, the third audio duration may be continuously supplemented, or other processing manners may also be available, which is not particularly limited in this exemplary embodiment.
In the present exemplary embodiment, the target audio time length is determined by the matching condition that the time length difference is greater than or equal to the time length threshold, the determination is simple and feasible, and a plurality of drumbeats can be aligned on a piece of video material subsequently, so that the generated dubbing music video is closer to the music tempo.
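The combining fallback of fig. 10 can be sketched as follows; trying only pairs (rather than continuing to a third duration) and the search order are simplifying assumptions of this sketch:

```python
def match_combined_duration(target_video_dur, audio_durs, threshold=0.2):
    # Fall back to combining a second duration with the first whenever no
    # single audio duration is close enough to the target video duration.
    for i, first in enumerate(audio_durs):
        if abs(target_video_dur - first) < threshold:
            return [i]
        for j, second in enumerate(audio_durs):
            if j != i and abs(target_video_dur - (first + second)) < threshold:
                return [i, j]
    return None

# No single clip matches 1.2 s, but 0.2 s + 0.9 s does (difference 0.1 s):
print(match_combined_duration(1.2, [0.2, 0.3, 0.6, 0.9, 3.3]))  # [0, 3]
```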
In addition to the manner of determining the target audio duration from the target video duration shown in fig. 9 and 10, there may be other manners of determining in the case where the video material constitutes a video duration sequence and the audio clip generates an audio duration sequence.
In an alternative embodiment, fig. 11 is a flow chart illustrating steps of a method for determining a target audio duration based on a target video duration, as shown in fig. 11, the method at least comprising the steps of: in step S1110, at least two video materials are sorted according to at least two video durations to obtain a video duration sequence, and at least two audio segments are sorted according to at least two audio durations to obtain an audio duration sequence.
Further, the video materials may be sorted in the order from short to long video duration, or may be sorted in the order from long to short video duration, which is not particularly limited in this exemplary embodiment.
After sorting, a video duration sequence can be obtained. For example, the video duration sequence may be: 0.2 seconds of video material, 0.6 seconds of video material, 0.7 seconds of video material, 0.9 seconds of video material, 1.2 seconds of video material, and 4.1 seconds of video material.
On the other hand, the manner of sorting the audio segments may be determined according to the order of the audio duration from short to long, or may be arranged according to the order of the audio duration from long to short, which is not particularly limited in this exemplary embodiment.
After sorting, a corresponding audio duration sequence can be obtained. For example, the audio duration sequence may be: a 0.2-second audio clip, a 0.3-second audio clip, a 0.6-second audio clip, a 0.9-second audio clip, and a 3.3-second audio clip.
In step S1120, a video sequence of the target video duration in the video duration sequence is determined, and a target audio duration corresponding to the target video duration is determined in the audio duration sequence according to the video sequence.
The video durations in the video duration sequence are arranged in order, so that after the target video duration is determined, the video order of the target video duration can be further determined. The video order may characterize the position of the target video duration in the video duration sequence. For example, the video sequence may be the 1 st or the largest sequence, and the representation manner of the video sequence in the present exemplary embodiment is not particularly limited.
After determining the video order of the target video time duration, the target video time duration may be determined in the audio time duration sequence in the same order. For example, when the video sequence of the target video duration is the 5 th in the video duration sequence, the audio duration ranked at the 5 th bit in the audio duration sequence may be selected as the target audio duration. In addition, there may be other corresponding relations between the video duration sequences and the audio duration sequences, and this exemplary embodiment is not particularly limited to this.
In the exemplary embodiment, a way of determining the target audio duration in the video duration sequence and the audio duration sequence is provided, and the way of determining in the sequences is more careful and rigorous, and the determining way is more accurate and efficient.
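The order-based matching of fig. 11 can be sketched like this, assuming both sequences are sorted short-to-long and paired position by position (the function name is illustrative):

```python
def match_by_order(video_durs, audio_durs):
    # Sort both duration lists from short to long and pair them position
    # by position; surplus entries on the longer side stay unmatched.
    video_order = sorted(range(len(video_durs)), key=lambda i: video_durs[i])
    audio_order = sorted(range(len(audio_durs)), key=lambda i: audio_durs[i])
    return list(zip(video_order, audio_order))

# Shortest video material (index 1, 0.2 s) pairs with the shortest audio
# clip (index 2, 0.2 s), and so on up the sorted sequences:
pairs = match_by_order([0.9, 0.2, 4.1], [0.3, 3.3, 0.2])
print(pairs)  # [(1, 2), (0, 0), (2, 1)]
```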
It should be noted that the manner of determining the target audio time length shown in fig. 9 and fig. 10 is applicable to fig. 11, that is, after the target audio time length is determined in the audio time length sequence, it may also be determined whether the target audio time length needs to be updated by calculating a time length difference between the target video time length and the target audio time length, and the determination manner is the same as that in fig. 9 and fig. 10, and is not described again here.
Furthermore, after the audio duration sequence is generated, there is also a way to determine a second audio duration in the audio duration sequence.
In an alternative embodiment, fig. 12 shows a flow chart of the steps of a method of determining a second audio duration, as shown in fig. 12, the method comprising at least the steps of: in step S1210, an audio sequence of the first audio duration is determined in the audio duration sequence, and other audio durations that are not the target audio duration are determined in the audio duration sequence.
The audio durations in the sequence of audio durations are arranged in order, so that after the first audio duration is determined, the audio order of the first audio duration may be further determined. The audio order may characterize a position of the first audio duration in the sequence of audio durations. For example, the audio sequence may be the 1 st, or the largest, and the like, and the representation manner of the audio sequence in the present exemplary embodiment is not particularly limited.
The first audio duration is then excluded from the audio duration sequence to obtain the other audio durations. Moreover, the other audio durations may further exclude any duration previously used as a target audio duration; that is, the other audio durations are obtained by rejecting all audio durations that have already been matched with a target video duration.
In step S1220, a second audio time period is determined among the other audio time periods according to the audio order.
Specifically, the second audio duration may be selected according to the order of the other audio durations in the audio duration sequence, or the other audio durations may be reordered to select. Whether the original order or the new order is adopted, the ordering mode can be arranged according to the order of other audio time lengths from short to long. In addition, the data may be arranged in other sequences, and the exemplary embodiment is not particularly limited thereto.
For example, if the other audio durations are arranged in the order from short to long, the shortest other audio duration may be selected as the second audio duration, and whether the duration difference between the audio duration obtained by combining the first audio duration and the second audio duration and the target video duration meets the requirement of the duration threshold is calculated. And when the supplemented audio time length does not meet the matching requirement, continuing to select a second short audio time length for combination and judgment until the second audio time length meeting the matching requirement after combination is selected.
In the present exemplary embodiment, the target audio time length is determined by the matching condition that the time length difference is greater than or equal to the time length threshold, the determination is simple and feasible, and a plurality of drumbeats can be aligned on a piece of video material subsequently, so that the generated dubbing music video is closer to the music tempo.
After the target audio duration is determined, the audio segment corresponding to the target audio duration may be determined to be the target audio segment. Therefore, when the time length difference value is smaller than the time length threshold value, one section of audio clip can be determined as the target audio clip; when the time length difference is greater than or equal to the time length threshold, at least two audio segments can be determined as the target audio segment.
In step S830, the target video material is aligned with the target audio segment such that each video duration in the set of video durations corresponds to each audio duration in the set of audio durations.
After determining the target audio duration and the order of the target audio segments corresponding to the target audio duration based on the target video duration, the target audio segments may be aligned with the target video material.
For example, the alignment may adopt either of two modes, cropping the target video material or adjusting its playing speed; either mode may be used alone, or both may be used simultaneously, which is not particularly limited in this exemplary embodiment. Further, a preferred alignment is shown in fig. 13.
In an alternative embodiment, fig. 13 shows a flow chart of the steps of a method of aligning a target audio segment and a target video material, as shown in fig. 13, the method comprising at least the steps of: in step S1310, a time length ratio is calculated between the target video time length and the target audio time length, and a ratio threshold corresponding to the time length ratio is obtained.
Specifically, the duration ratio may be obtained by dividing the target audio duration by the target video duration. Other calculation manners are also possible, and the present exemplary embodiment is not particularly limited in this regard.
Further, a ratio threshold corresponding to the duration ratio is obtained. The ratio threshold may be preset, typically to 0.8 or 0.9, although other values are possible; this exemplary embodiment is not limited in this respect.
In step S1320, the duration ratio is compared to the ratio threshold, and the target audio segment is aligned with the target video material according to the comparison result of the ratio.
After the duration ratio and the ratio threshold are obtained, the duration ratio and the ratio threshold can be compared, and a ratio comparison result is obtained.
In an alternative embodiment, fig. 14 shows a flow chart of the steps of a method of aligning a target audio segment with a target video material according to a ratio comparison, as shown in fig. 14, the method comprising at least the steps of: in step S1410, if the duration ratio is greater than the ratio threshold, the target video material is cropped to align the target audio segment with the target video material.
When the duration ratio is greater than the ratio threshold, the target video material can be cropped, either automatically or manually, so that the target audio segment and the target video material are completely aligned.
In step S1420, if the duration ratio is less than or equal to the ratio threshold, the playing speed of the target video material is adjusted to align the target audio segment with the target video material.
When the duration ratio is less than or equal to the ratio threshold, the playing speed of the target video material can be adjusted, automatically or manually, to completely align the target audio segment with the target video material. The playing speed may be 1.5 times, 2.0 times, or another multiple, which is not limited in this exemplary embodiment.
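A minimal sketch of the two branches, assuming that cropping trims the video to the audio duration and that the speed factor is chosen so the adjusted video exactly fills the audio duration (the embodiment leaves both details open):

```python
def choose_alignment(video_len, audio_len, ratio_threshold=0.8):
    """Return the alignment action for one (video, audio) pair.

    duration ratio = target audio duration / target video duration;
    above the threshold we crop, otherwise we adjust the playing speed."""
    ratio = audio_len / video_len
    if ratio > ratio_threshold:
        return ("crop", audio_len)            # trim the video to the audio length
    return ("speed", video_len / audio_len)   # e.g. 2.0 = play twice as fast

print(choose_alignment(10.0, 9.0))  # ('crop', 9.0)   ratio 0.9 > 0.8
print(choose_alignment(10.0, 4.0))  # ('speed', 2.5)  ratio 0.4 <= 0.8
```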
In this exemplary embodiment, the target video material is cropped or its playing speed is adjusted according to the comparison between the duration ratio and the ratio threshold, so that the target audio segment and the target video material are completely aligned, improving the quality of the produced video.
After the target video material in the video duration sequence has been aligned with the target audio segment in the audio duration sequence, another video material can be selected as the target video material and the matching procedure of fig. 8 repeated, until the entire video duration sequence is aligned with the audio duration sequence.
Aligning each of the at least two video materials with its corresponding audio segment ensures that the video duration of each video material in the video duration set corresponds to the audio duration of its audio segment in the audio duration set.
The at least two video materials are then spliced with the soundtrack audio according to the order of the audio durations corresponding to the at least two video durations, generating the soundtrack video for the at least two video materials. The order of the audio durations is the playing order, within the soundtrack video, of the audio segments corresponding to those durations.
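As an illustration of this splicing order (the data layout is assumed, not specified by the patent), the clips can simply be sorted by the start time of their matched audio segments:

```python
def splice_order(matches):
    """matches: (video_id, audio_segment_start) pairs produced by the matching
    step. The montage plays clips in the order their matched audio segments
    occur in the soundtrack."""
    return [video for video, start in sorted(matches, key=lambda m: m[1])]

print(splice_order([("b", 4.0), ("a", 0.0), ("c", 7.5)]))  # ['a', 'b', 'c']
```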
Finally, after the soundtrack video with the video materials matched and aligned to the soundtrack audio is generated, the user may export it locally or share it with others.
The soundtrack video may be a beat-synced ("card point") video: a video with a striking effect, obtained by cutting the video materials to the drum points of rhythmic music. The soundtrack video may also be another type of video; this exemplary embodiment is not particularly limited in this respect.
The following describes the video dubbing method provided in the embodiments of the present disclosure in detail with reference to a specific application scenario.
Fig. 15 is a flowchart illustrating the steps of the video dubbing method in an application scenario. As shown in fig. 15, in step S1510, drum point detection is performed on the acquired soundtrack audio, and the drum point positions in the soundtrack audio are extracted.
The audio drum points serve as a representation of the rhythm of the soundtrack audio. Keeping the soundtrack audio consistent with the tempo of the video material relies on accurately extracting the drum points from the soundtrack audio.
Specifically, fig. 16 is a schematic diagram illustrating the effect of the drum point detection method in an application scenario. As shown in fig. 16, the soundtrack audio corresponding to the video material is obtained and its initial signal can be seen; the abscissa of the diagram is time, whose unit may be seconds or another unit, which is not particularly limited in this exemplary embodiment.
In step S1610, the soundtrack audio is preprocessed.
Specifically, audio frames may be obtained by framing the soundtrack audio. A Fourier transform may then be applied to each audio frame to obtain the spectrum corresponding to the soundtrack audio.
Further, the spectrogram of the dubbing music audio is obtained by splicing the frequency spectrum along the time domain.
Fig. 17 shows another spectrogram obtained by stitching the frequency spectrums, and as shown in fig. 17, the horizontal direction of the spectrogram represents the time dimension, and the vertical direction of the spectrogram represents the frequency dimension.
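A minimal NumPy sketch of this framing-and-transform step, using non-overlapping frames for brevity (practical implementations typically use overlapping, windowed frames; all names here are illustrative):

```python
import numpy as np

def magnitude_spectrogram(signal, frame_size=1024):
    """Split the signal into non-overlapping frames and take the magnitude of
    the real FFT of each frame; stacking the spectra along the time axis gives
    the spectrogram (time on the horizontal axis, frequency on the vertical)."""
    n_frames = len(signal) // frame_size
    frames = signal[: n_frames * frame_size].reshape(n_frames, frame_size)
    return np.abs(np.fft.rfft(frames, axis=1))  # shape (n_frames, frame_size//2 + 1)

spec = magnitude_spectrogram(np.random.randn(44100), frame_size=1024)
print(spec.shape)  # (43, 513)
```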
In step S1620, the spectrogram is filtered by a Mel filter bank to obtain a Mel spectrum, and a difference calculation is then performed to obtain the spectral flux.
Fig. 18 shows an effect diagram obtained by the filtering process using the mel filter, and as shown in fig. 18, the effect diagram is obtained by filtering using 24 mel filters, so that a mel spectrum diagram reduced to 24 dimensions in the longitudinal direction can be obtained.
Further, a difference calculation and an average calculation are carried out on the Mel spectrum using a recognition function, which may include a difference calculation formula and an average calculation formula.
After the difference calculation is performed on the Mel spectrum, the spectral flux is obtained; the spectral flux average is then obtained by averaging the spectral flux.
Fig. 19 shows an effect graph of the spectral flux average. As shown in fig. 19, with a framing window of 1024 samples at a sampling rate of 44.1 kHz, each audio frame is approximately 43 ms.
To average the spectral flux over a time span of 0.5 s, roughly 0.5/0.043 ≈ 11 sample windows can be used. That is, for each spectral flux value, the 5 samples before it, the 5 samples after it, and the current sample are selected to obtain the spectral flux average at the current sample. After the spectral flux average of every sample is obtained, a distribution chart can be drawn.
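The "difference calculation" and the 11-sample averaging can be sketched as follows. The rectified (positive-only) flux is a common convention and an assumption here, since the text only says "difference calculation":

```python
import numpy as np

def spectral_flux(mel_spec):
    """Positive first difference along the time axis, summed over the mel
    bands, for a mel spectrogram of shape (time, bands)."""
    diff = np.diff(mel_spec, axis=0)
    return np.maximum(diff, 0.0).sum(axis=1)

def flux_moving_average(flux, radius=5):
    """Average each flux value with the 5 samples before and 5 after it,
    i.e. an 11-sample window (~0.5 s at ~43 ms per frame)."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    return np.convolve(flux, kernel, mode="same")

mel = np.zeros((100, 24))
mel[50] = 1.0                 # a single energy burst across all 24 bands
flux = spectral_flux(mel)
print(flux[49], flux.max())   # 24.0 24.0 — the burst shows up one diff earlier
avg = flux_moving_average(flux)
```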
In step S1630, peak detection is performed on the spectral flux average.
Peak detection is carried out on the calculated spectral flux average. The spectral flux average is multiplied by a corresponding parameter to obtain a spectral flux threshold, and all peak points greater than the spectral flux threshold are determined to be drum points.
Fig. 20 is a schematic diagram showing the drum point positions obtained by peak detection. As shown in fig. 20, the amplitudes and positions of all the drum points are saved once they are detected, yielding a drum point sequence chart after peak detection.
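A sketch of the peak-detection step; the multiplier stands in for the "corresponding parameter" the text mentions, and its value of 1.5 is purely illustrative:

```python
import numpy as np

def detect_drum_points(flux, flux_avg, multiplier=1.5):
    """A frame is taken as a drum point when it is a local peak of the flux
    and exceeds the threshold (smoothed flux scaled by a preset multiplier)."""
    threshold = flux_avg * multiplier
    peaks = []
    for i in range(1, len(flux) - 1):
        if flux[i] > flux[i - 1] and flux[i] >= flux[i + 1] and flux[i] > threshold[i]:
            peaks.append(i)  # save the position; the amplitude is flux[i]
    return peaks

flux = np.array([0.0, 1.0, 0.0, 3.0, 0.0])
avg = np.full(5, 1.0)
print(detect_drum_points(flux, avg))  # [3]: only index 3 exceeds 1.0 * 1.5
```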
In step S1520, the video materials and the audio clips are sorted by duration from short to long, giving a video duration sequence and an audio duration sequence.
Notably, when a user requires certain video materials to appear in a specific order, those video materials may be frozen in that order.
For example, the specific-order requirement may be that the opening clip must be a certain video material, or that two particular video materials must appear as a combination. The freezing process may remove such video material before the audio duration sequence is matched against the video duration sequence, then insert it back at its designated position after matching. Other freezing approaches are also possible; this exemplary embodiment is not particularly limited in this respect.
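One possible way to implement this freezing (names and data layout assumed): remove the frozen clips before sorting by duration, then re-slot them at their original positions:

```python
def sort_with_frozen(durations, frozen):
    """durations: clip durations in their original order; frozen: indices of
    clips that must keep their position (e.g. a mandatory opening clip).
    The movable clips are sorted by duration and re-slotted around them."""
    movable = iter(sorted(d for i, d in enumerate(durations) if i not in frozen))
    return [d if i in frozen else next(movable) for i, d in enumerate(durations)]

print(sort_with_frozen([5.0, 3.0, 1.0, 4.0], frozen={0}))  # [5.0, 1.0, 3.0, 4.0]
```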
In step S1530, the video material in the video duration sequence and the audio clip in the audio duration sequence are aligned in order.
Fig. 21 is a flowchart illustrating the steps of the method for aligning video materials and audio clips in an application scenario. As shown in fig. 21, in step S2110, the audio clips are matched one by one with the sorted video materials so as to minimize the duration difference.
In step S2120, the duration difference between the video duration of the video material and the audio duration of the audio clip is calculated and tested against formula (3):

| video duration of video material − audio duration of audio clip | < duration threshold (3)
In step S2130, when the time length difference is smaller than the time length threshold, it may be determined whether the video time length and the audio time length at this time are completely equal.
When the video duration and the audio duration are exactly equal, the video material and the audio segment are aligned and no further processing is needed; when they are not exactly equal, the video material corresponding to the video duration can be cropped or its speed adjusted.
In step S2140, when the duration difference is greater than or equal to the duration threshold, the other, temporarily unmatched audio durations are arranged in order from short to long.
Further, other audio durations are merged into the current audio duration until the duration difference between the combined audio duration and the current video duration is smaller than the duration threshold.
In step S1540, when the video duration and the combined audio duration are not exactly aligned, the video material corresponding to the video duration may be cropped or its playing speed adjusted.
Fig. 22 is a schematic diagram illustrating the effect of aligning audio segments with video materials. As shown in fig. 22, after the audio segments aligned with the video materials are determined, the excess portion of each video material relative to its corresponding audio segment is cropped, or the material is played at an adjusted speed, to achieve complete alignment.
In step S1550, the score video is synthesized.
From the finally aligned video materials and audio clips, a beat-synced video whose durations are completely aligned can be generated, and the user can export it locally or share it with others.
Fig. 23 is a schematic diagram illustrating the effect of generating a soundtrack video in an application scenario. As shown in fig. 23, 2310 is an alignment line between the audio clips and the video materials, positioned at the starting point of each video material; 2320 marks the drum points of the soundtrack audio, i.e., the peak points of the audio waveform; 2330 is a double-speed play flag.
It can be seen that the video duration of each video material coincides with an audio duration of the soundtrack audio, and that one video material can be aligned with a combination of multiple audio segments. In addition, the playing speed of a video material can be adjusted via the control at the double-speed play flag 2330.
Based on the above application scenarios, the video dubbing method provided by the embodiments of the present disclosure generates the soundtrack video for the video materials by matching the video duration set against the audio duration set. On the one hand, the time and difficulty of producing a soundtrack video are greatly reduced, and the user's interest in producing videos is increased; on the other hand, the good match between the video duration set and the audio duration set brings the generated video closer to the music rhythm, raises the quality of the soundtrack video, and increases the user's motivation to share it.
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
The following describes embodiments of the apparatus of the present disclosure, which may be used to execute the video dubbing method in the above embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the video dubbing method described above in the present disclosure.
Fig. 24 schematically illustrates a block diagram of a video soundtrack apparatus in some embodiments of the present disclosure. As shown in fig. 24, the video dubbing apparatus 2400 may mainly include: the time duration obtaining module 2410, the drum point detecting module 2420, the segment dividing module 2430, the set generating module 2440 and the video generating module 2450.
A duration obtaining module 2410, configured to obtain at least two video durations of at least two video materials, and generate a video duration set of the at least two video durations;
a drumhead detection module 2420 configured to acquire the soundtrack audio and perform drumhead detection on the soundtrack audio to determine drumheads in the soundtrack audio;
a segment determination module 2430 configured to divide the soundtrack audio into at least two audio segments according to the drumbeat;
a set generating module 2440 configured to obtain at least two audio durations of the at least two audio segments and generate an audio duration set of the at least two audio durations;

a video generating module 2450 configured to match the set of video durations with the set of audio durations such that each video duration in the set of video durations corresponds to each audio duration in the set of audio durations, and generate a soundtrack video corresponding to the at least two video materials according to a matching result.
In some embodiments of the present disclosure, the video generation module comprises: the material determining submodule is configured to determine a target video time length in the video time length set and determine a target video material in at least two video materials according to the target video time length;
the segment determining submodule is configured to determine a target audio time length in the audio time length set according to the target video time length and determine a target audio segment in at least two audio segments according to the target audio time length;
a segment alignment sub-module configured to align the target video material with the target audio segment such that each video duration in the set of video durations corresponds to each audio duration in the set of audio durations.
In some embodiments of the present disclosure, the segment determination submodule comprises: the difference calculation unit is configured to determine a first audio time length in at least two audio time lengths in the audio time length set, and calculate a time length difference between the target video time length and the first audio time length;
and the time length difference value determining unit is configured to obtain a time length threshold value corresponding to the time length difference value, and determine the first audio time length as the target audio time length when the time length difference value is smaller than the time length threshold value.
In some embodiments of the present disclosure, the segment determination submodule comprises: the difference calculation unit is configured to determine a first audio time length in at least two audio time lengths in the audio time length set, and calculate a time length difference between the target video time length and the first audio time length;
the time length difference value judging unit is configured to obtain a time length threshold value corresponding to the time length difference value, and when the time length difference value is larger than or equal to the time length threshold value, a second audio time length is determined in other audio time lengths except the first audio time length;
and the time length merging unit is configured to merge the first audio time length and the second audio time length as a target audio time length corresponding to the target video time length, wherein the time length difference between the sum of the first audio time length and the second audio time length and the target video time length is less than the time length threshold value.
In some embodiments of the present disclosure, the segment determination submodule comprises: the sequence acquisition unit is configured to sequence the at least two video materials according to the at least two video durations to obtain a video duration sequence, and sequence the at least two audio clips according to the at least two audio durations to obtain an audio duration sequence;
and the sequence determining unit is configured to determine the video sequence of the target video time length in the video time length sequence and determine the target audio time length corresponding to the target video time length in the audio time length sequence according to the video sequence.
In some embodiments of the present disclosure, the time length difference value judging unit includes: the sequence acquisition subunit is configured to determine an audio sequence of the first audio duration in the audio duration sequence, and determine other audio durations which are not used as the target audio duration in the audio duration sequence;
a duration determination subunit configured to determine a second audio duration among the other audio durations according to the audio order.
In some embodiments of the present disclosure, the segment alignment submodule comprises: the ratio calculation unit is configured to calculate the target video time length and the target audio time length to obtain a time length ratio, and obtain a ratio threshold corresponding to the time length ratio;
a ratio comparison unit configured to compare the duration ratio with a ratio threshold and align the target audio segment with the target video material according to a ratio comparison result.
In some embodiments of the present disclosure, the ratio comparing unit includes: a material clipping subunit configured to clip the target video material to align the target audio segment with the target video material if the duration ratio is greater than the ratio threshold;
a speed adjustment subunit configured to adjust the play speed of the target video material to align the target audio segment with the target video material if the duration ratio is less than or equal to the ratio threshold.
In some embodiments of the present disclosure, the drum point detection module comprises: the audio conversion sub-module is configured to perform Fourier transform on the dubbing music audio to obtain a frequency spectrum of the dubbing music audio;
the frequency spectrum difference submodule is configured to perform difference calculation on the frequency spectrum to obtain a frequency spectrum flux mean value of the frequency spectrum;
and the peak detection submodule is configured to perform peak detection on the spectral flux average value and determine a drum point in the dubbing music audio.
In some embodiments of the present disclosure, the peak detection sub-module comprises: the parameter determining unit is configured to determine a parameter corresponding to the average value of the spectral flux, and calculate the average value of the spectral flux and the parameter to obtain a threshold value of the spectral flux;
and the spectrum comparison unit is configured to compare the spectral flux with the spectral flux threshold and determine the drum points in the soundtrack audio according to the spectrum comparison result.
In some embodiments of the disclosure, the spectral difference sub-module comprises: the sound spectrum generating unit is configured to splice the frequency spectrums to generate sound spectrums corresponding to the frequency spectrums, and filter the sound spectrums by utilizing a Mel filter to obtain Mel frequency spectrums;
and the frequency spectrum calculating unit is configured to perform difference calculation on the Mel frequency spectrum to obtain frequency spectrum flux, and calculate the average value of the frequency spectrum flux to obtain the average value of the frequency spectrum flux.
In some embodiments of the disclosure, the audio conversion sub-module comprises: the audio framing unit is configured to frame the dubbing music audio to obtain an audio frame;
and the frequency spectrum generating unit is configured to perform Fourier transform on the audio frame to obtain a frequency spectrum corresponding to the dubbing music audio.
The specific details of the video dubbing apparatus provided in the embodiments of the present disclosure have been described in detail in the corresponding method embodiments, and therefore are not described herein again.
FIG. 25 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present disclosure.
It should be noted that the computer system 2500 of the electronic device shown in fig. 25 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 25, the computer system 2500 includes a Central Processing Unit (CPU) 2501 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 2502 or a program loaded from a storage section 2508 into a Random Access Memory (RAM) 2503. In the RAM 2503, various programs and data necessary for system operation are also stored. The CPU 2501, ROM 2502, and RAM 2503 are connected to each other via a bus 2504. An Input/Output (I/O) interface 2505 is also connected to the bus 2504.
The following components are connected to the I/O interface 2505: an input portion 2506 including a keyboard, a mouse, and the like; an output portion 2507 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, a speaker, and the like; a storage portion 2508 including a hard disk and the like; and a communication section 2509 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 2509 performs communication processing via a network such as the internet. A driver 2510 is also connected to the I/O interface 2505 as needed. A removable medium 2511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 2510 as necessary, so that a computer program read out therefrom is mounted in the storage portion 2508 as necessary.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications portion 2509, and/or installed from removable media 2511. The computer program executes various functions defined in the system of the present application when executed by a Central Processing Unit (CPU) 2501.
It should be noted that the computer readable medium shown in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s).
It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (8)

1. A video dubbing method, the method comprising:
acquiring at least two video durations of at least two video materials, and generating a video duration set from the at least two video durations;
acquiring a soundtrack audio, and performing drumbeat detection on the soundtrack audio to determine the drumbeats in the soundtrack audio;
dividing the soundtrack audio into at least two audio segments according to the drumbeats;
acquiring at least two audio durations of the at least two audio segments, and generating an audio duration set from the at least two audio durations;
removing, from the at least two video materials, any video material required to serve as a title or any two video materials required to be merged, determining a target video duration in the video duration set corresponding to the remaining video materials, and determining a target video material from the remaining video materials according to the target video duration;
sorting the remaining video materials by their video durations to obtain a video duration sequence, and sorting the at least two audio segments by their audio durations to obtain an audio duration sequence, wherein the audio segments are sorted either from the shortest audio duration to the longest or from the longest to the shortest;
determining the rank of the target video duration in the video duration sequence, determining, according to that rank, a first audio duration corresponding to the target video duration among the at least two audio durations of the audio duration sequence, and calculating the duration difference between the target video duration and the first audio duration;
acquiring a duration threshold corresponding to the duration difference, and determining the first audio duration as the target audio duration when the duration difference is smaller than the duration threshold;
when the duration difference is greater than or equal to the duration threshold, determining the rank of the first audio duration in the audio duration sequence, and determining the other audio durations in the audio duration sequence that have not been used as target audio durations;
determining a second audio duration among the other audio durations according to the audio rank, wherein the second audio duration is selected in order of the other audio durations in the audio duration sequence: the shortest of the other audio durations is taken as the candidate second audio duration, and it is checked whether the difference between the target video duration and the combined duration of the first audio duration and the candidate is smaller than the duration threshold; when that difference is greater than or equal to the duration threshold, the next-shortest of the other audio durations is tried, and so on, until a second audio duration is found whose combination with the first audio duration differs from the target video duration by less than the duration threshold;
combining the first audio duration and the second audio duration as the target audio duration corresponding to the target video duration, wherein the difference between the sum of the first audio duration and the second audio duration and the target video duration is smaller than the duration threshold, the second audio duration is joined before or after the first audio duration, and a target audio segment is determined among the at least two audio segments according to the target audio duration;
calculating the ratio of the target video duration to the target audio duration, and acquiring a ratio threshold corresponding to the duration ratio;
comparing the duration ratio with the ratio threshold, and aligning the target audio segment with the target video material according to the comparison result, so that each video duration in the video duration set corresponds to an audio duration in the audio duration set, and inserting the video material required as a title, or the two video materials required to be merged, at the corresponding positions according to the matching between the video duration set and the audio duration set, so as to generate a soundtrack video corresponding to the at least two video materials;
wherein if the duration ratio is greater than the ratio threshold, the target video material is cut so as to align the target audio segment with the target video material;
and if the duration ratio is smaller than or equal to the ratio threshold, the playback speed of the target video material is adjusted so as to align the target audio segment with the target video material.
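Purely as an illustration of the duration-matching and alignment logic recited in claim 1 (not part of the claims; the function names, the 0.5-second duration threshold, and the 1.0 ratio threshold are assumed values):

```python
def pick_target_audio(target_video_len, audio_lens, len_threshold=0.5):
    """Choose a first audio duration closest to the target video duration;
    if the difference is not below the threshold, try combining it with the
    other durations from shortest to longest until the combined duration
    falls within the threshold of the target."""
    ordered = sorted(audio_lens)
    first_idx = min(range(len(ordered)),
                    key=lambda i: abs(target_video_len - ordered[i]))
    first = ordered.pop(first_idx)
    if abs(target_video_len - first) < len_threshold:
        return [first]
    for second in ordered:  # candidates from shortest to longest
        if abs(target_video_len - (first + second)) < len_threshold:
            return [first, second]
    return [first]  # no combination falls within the threshold

def align(video_len, audio_len, ratio_threshold=1.0):
    """Cut the video when it is proportionally longer than the audio,
    otherwise adjust its playback speed."""
    ratio = video_len / audio_len
    if ratio > ratio_threshold:
        return ("cut", audio_len)   # trim the video to the audio length
    return ("speed", ratio)         # play at `ratio` of normal speed
```

For example, matching a 10-second clip against segments of 3, 4, and 6.2 seconds first tries 6.2 seconds alone, then combines it with the shortest remaining segment until the combined duration is close enough.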
2. The video dubbing method of claim 1, wherein performing drumbeat detection on the soundtrack audio to determine the drumbeats in the soundtrack audio comprises:
performing a Fourier transform on the soundtrack audio to obtain the spectrum of the soundtrack audio;
performing a difference calculation on the spectrum to obtain a spectral flux mean of the spectrum;
and performing peak detection based on the spectral flux mean to determine the drumbeats in the soundtrack audio.
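The three-step chain of claim 2 can be sketched end to end as follows (illustrative only; the frame size, hop size, and mean-based peak rule are assumptions rather than values taken from the patent):

```python
import numpy as np

def detect_drumbeats(samples, sr, frame_len=1024, hop=512):
    """Frame the soundtrack samples, take the magnitude spectrum of each
    frame, compute the spectral flux (rectified first difference between
    consecutive spectra), and report the times of flux values that are
    local peaks above the flux mean."""
    window = np.hanning(frame_len)
    frames = [samples[i:i + frame_len] * window
              for i in range(0, len(samples) - frame_len, hop)]
    spectra = np.abs(np.fft.rfft(frames, axis=1))   # one spectrum per frame
    flux = np.maximum(spectra[1:] - spectra[:-1], 0).sum(axis=1)
    mean_flux = flux.mean()
    beats = []
    for i in range(1, len(flux) - 1):
        # flux[i] measures the spectral change entering frame i + 1
        if flux[i] > mean_flux and flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]:
            beats.append((i + 1) * hop / sr)        # drumbeat time in seconds
    return beats
```

Rectifying the difference keeps only energy increases, so the onset of a drum hit produces a flux spike while its decay does not.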
3. The video dubbing method of claim 2, wherein performing peak detection based on the spectral flux mean to determine the drumbeats in the soundtrack audio comprises:
determining a parameter corresponding to the spectral flux mean, and computing a spectral flux threshold from the spectral flux mean and the parameter;
and comparing the spectral flux with the spectral flux threshold, and determining the drumbeats in the soundtrack audio according to the comparison result.
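A minimal sketch of this thresholding rule, assuming the "parameter" of claim 3 is a simple multiplier applied to the spectral flux mean (the multiplier value is illustrative):

```python
def flux_threshold_peaks(flux, multiplier=1.5):
    """Derive a threshold from the spectral flux mean (mean times a
    tunable parameter) and keep the indices of local maxima above it."""
    threshold = multiplier * sum(flux) / len(flux)
    return [i for i in range(1, len(flux) - 1)
            if flux[i] > threshold
            and flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]]
```

Because the threshold is derived from the mean, louder or quieter soundtracks automatically get a proportionally scaled cutoff.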
4. The video dubbing method of claim 2, wherein performing a difference calculation on the spectrum to obtain the spectral flux mean of the spectrum comprises:
stitching the spectra together to generate a spectrogram corresponding to the spectra, and filtering the spectrogram with a mel filterbank to obtain a mel spectrum;
and performing a difference calculation on the mel spectrum to obtain the spectral flux, and averaging the spectral flux to obtain the spectral flux mean.
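The mel filtering and differencing of claim 4 can be sketched as follows (the filterbank construction and its parameters are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def mel_filterbank(n_bins, sr, n_mels=40):
    """Triangular mel filters; the bin and mel counts are illustrative."""
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                     # rising edge of triangle
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                     # falling edge of triangle
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_flux_mean(spectrogram, sr):
    """Filter the stitched spectrogram (frames x bins) with a mel
    filterbank, take the rectified first difference along the frame axis
    as the spectral flux, and return the flux and its mean."""
    fb = mel_filterbank(spectrogram.shape[1], sr)
    mel_spec = spectrogram @ fb.T                 # frames x mel bands
    flux = np.maximum(np.diff(mel_spec, axis=0), 0).sum(axis=1)
    return flux, flux.mean()
```

Working on the mel spectrum compresses the many linear frequency bins into a few perceptual bands, so one loud drum hit registers as a single broad flux spike rather than many tiny per-bin changes.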
5. The video dubbing method of claim 2, wherein performing a Fourier transform on the soundtrack audio to obtain the spectrum of the soundtrack audio comprises:
framing the soundtrack audio to obtain audio frames;
and performing a Fourier transform on the audio frames to obtain the spectrum corresponding to the soundtrack audio.
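A short sketch of the framing and Fourier transform of claim 5 (the frame length, hop size, and Hann window are illustrative choices):

```python
import numpy as np

def frame_audio(samples, frame_len=1024, hop=512):
    """Split the soundtrack samples into overlapping, Hann-windowed
    frames and Fourier-transform each frame, returning one magnitude
    spectrum per frame."""
    window = np.hanning(frame_len)
    starts = range(0, len(samples) - frame_len + 1, hop)
    frames = np.stack([samples[s:s + frame_len] * window for s in starts])
    return np.abs(np.fft.rfft(frames, axis=1))
```

With a real-valued input of 1024 samples per frame, `rfft` yields 513 non-redundant frequency bins per frame; the 50% overlap keeps transient drum hits from falling between frames.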
6. A video dubbing apparatus, the apparatus comprising:
a duration acquisition module configured to acquire at least two video durations of at least two video materials and generate a video duration set from the at least two video durations;
a drumbeat detection module configured to acquire a soundtrack audio and perform drumbeat detection on the soundtrack audio to determine the drumbeats in the soundtrack audio;
a segment division module configured to divide the soundtrack audio into at least two audio segments according to the drumbeats;
a set generation module configured to acquire at least two audio durations of the at least two audio segments and generate an audio duration set from the at least two audio durations;
a video generation module configured to:
remove, from the at least two video materials, any video material required to serve as a title or any two video materials required to be merged, determine a target video duration in the video duration set corresponding to the remaining video materials, and determine a target video material from the remaining video materials according to the target video duration;
sort the remaining video materials by their video durations to obtain a video duration sequence, and sort the at least two audio segments by their audio durations to obtain an audio duration sequence, wherein the audio segments are sorted either from the shortest audio duration to the longest or from the longest to the shortest;
determine the rank of the target video duration in the video duration sequence, determine, according to that rank, a first audio duration corresponding to the target video duration among the at least two audio durations of the audio duration sequence, and calculate the duration difference between the target video duration and the first audio duration;
acquire a duration threshold corresponding to the duration difference, and determine the first audio duration as the target audio duration when the duration difference is smaller than the duration threshold;
when the duration difference is greater than or equal to the duration threshold, determine the rank of the first audio duration in the audio duration sequence, and determine the other audio durations in the audio duration sequence that have not been used as target audio durations;
determine a second audio duration among the other audio durations according to the audio rank, wherein the second audio duration is selected in order of the other audio durations in the audio duration sequence: the shortest of the other audio durations is taken as the candidate second audio duration, and it is checked whether the difference between the target video duration and the combined duration of the first audio duration and the candidate is smaller than the duration threshold; when that difference is greater than or equal to the duration threshold, the next-shortest of the other audio durations is tried, and so on, until a second audio duration is found whose combination with the first audio duration differs from the target video duration by less than the duration threshold;
combine the first audio duration and the second audio duration as the target audio duration corresponding to the target video duration, wherein the difference between the sum of the first audio duration and the second audio duration and the target video duration is smaller than the duration threshold, the second audio duration is joined before or after the first audio duration, and a target audio segment is determined among the at least two audio segments according to the target audio duration;
calculate the ratio of the target video duration to the target audio duration, and acquire a ratio threshold corresponding to the duration ratio;
compare the duration ratio with the ratio threshold, and align the target audio segment with the target video material according to the comparison result, so that each video duration in the video duration set corresponds to an audio duration in the audio duration set, and insert the video material required as a title, or the two video materials required to be merged, at the corresponding positions according to the matching between the video duration set and the audio duration set, so as to generate a soundtrack video corresponding to the at least two video materials;
wherein if the duration ratio is greater than the ratio threshold, the target video material is cut so as to align the target audio segment with the target video material;
and if the duration ratio is smaller than or equal to the ratio threshold, the playback speed of the target video material is adjusted so as to align the target audio segment with the target video material.
7. A computer-readable medium on which a computer program is stored, the computer program, when executed by a processor, implementing the video dubbing method of any one of claims 1 to 5.
8. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the video dubbing method of any one of claims 1 to 5 via execution of the executable instructions.
CN202010687225.4A 2020-07-16 2020-07-16 Video dubbing method and device, storage medium and electronic equipment Active CN111741233B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010687225.4A CN111741233B (en) 2020-07-16 2020-07-16 Video dubbing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111741233A CN111741233A (en) 2020-10-02
CN111741233B (en)

Family

ID=72654735

Country Status (1)

Country Link
CN (1) CN111741233B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030675

Country of ref document: HK

GR01 Patent grant