CN116127125A - Multimedia data processing method, device, equipment and computer readable storage medium - Google Patents


Info

Publication number
CN116127125A
Authority
CN
China
Prior art keywords
singing
determining
target
track
music
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111346983.0A
Other languages
Chinese (zh)
Inventor
冯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202111346983.0A
Publication of CN116127125A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/686 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually, using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H5/00 Instruments in which the tones are generated by means of electronic generators
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The application provides a multimedia data processing method, apparatus, device and computer readable storage medium; the method comprises the following steps: acquiring video data to be matched with music and music data to be processed, and performing source separation on the music data to obtain a singing track and an accompaniment track; determining non-singing time information based on the singing track, and determining a reference start-stop point set based on the accompaniment track; determining a target start-stop point set from the reference start-stop point set based on the non-singing time information; determining a soundtrack start point and a soundtrack end point from the target start-stop point set based on a first playing duration of the video data to be matched with music; and determining the target music data between the soundtrack start point and the soundtrack end point in the music data as target soundtrack data of the video data. Through the present application, the efficiency, accuracy and universality of video soundtrack selection can be improved.

Description

Multimedia data processing method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to information processing technologies, and in particular, to a method, an apparatus, a device, and a computer readable storage medium for processing multimedia data.
Background
In recent years, with the development of mobile network and internet technologies, intelligent terminals have become necessities in people's daily life and work. Along with this come the development and prosperity of film works and short videos, which in turn have driven intelligent video understanding and editing technologies; for example, a soundtrack, a cartoon avatar and the like can be added to a short video being produced.
In an application scenario of adding a score to a video, selecting the start and stop points of the music is particularly important. In the related art, when determining the start and stop points of the score, a manual labeling method may be adopted, or the points may be determined from lyric paragraphs. However, manual labeling is inefficient, and start/stop point determination based on lyric paragraphs is not applicable to music without lyrics, such as self-created songs, and thus has poor universality.
Disclosure of Invention
The embodiment of the application provides a multimedia data processing method, apparatus, device and computer readable storage medium, which can improve the efficiency, accuracy and universality of video soundtrack selection.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a multimedia data processing method, which comprises the following steps:
acquiring video data to be matched with music and music data to be processed, and performing information source separation on the music data to obtain a singing track and an accompaniment track;
Determining non-singing time information based on the singing track, and determining a reference start-stop point set based on the accompaniment track;
determining a target starting and ending point set from the reference starting and ending point set based on the non-singing time information;
determining a soundtrack starting point and a soundtrack ending point from the target start-stop point set based on a first playing time length of the video data to be soundtrack;
and determining target music data between the music starting point and the music ending point in the music data as target soundtrack data of the video data.
An embodiment of the present application provides a multimedia data processing apparatus, including:
the information source separation module is used for acquiring video data to be assembled and music data to be processed, and carrying out information source separation on the music data to obtain a singing sound track and an accompaniment sound track;
the first determining module is used for determining non-singing time information based on the singing voice track and determining a reference start-stop point set based on the accompaniment voice track;
a second determining module, configured to determine a target start-stop point set from the reference start-stop point set based on the non-singing time information;
a third determining module, configured to determine a soundtrack start point and a soundtrack end point from the target start-stop point set based on a first playing duration of the video data to be soundtrack;
And a fourth determining module for determining target music data between the soundtrack start point and the soundtrack end point in the music data as target soundtrack data of the video data.
In some embodiments, the source separation module is further configured to:
performing time-frequency conversion on the music data to obtain a spectrum amplitude spectrum of the music data;
extracting features of the spectrum amplitude spectrum to obtain singing features and accompaniment features, and combining the singing features and the accompaniment features to obtain combined features;
determining a singing mask and an accompaniment mask based on the combined features, the singing features, and the accompaniment features;
performing mask calculation on the spectrum amplitude spectrum by using the singing mask and the accompaniment mask respectively to correspondingly obtain singing spectrum amplitude and accompaniment spectrum amplitude;
and respectively performing frequency-time conversion on the singing frequency spectrum amplitude and the accompaniment frequency spectrum amplitude to correspondingly obtain a singing sound track and an accompaniment sound track.
In some embodiments, the first determining module is further configured to:
performing voice activity detection on the singing voice track, positioning according to impulse signals in the singing voice track, and determining singing time information in the singing voice track;
And determining non-singing time information in the singing audio track based on the singing time information.
In some embodiments, the first determining module is further configured to:
acquiring a trained audio event detection model;
inputting the singing voice track into the audio event detection model to obtain singing time information in the singing voice track;
and determining non-singing time information in the singing audio track based on the singing time information.
In some embodiments, the first determining module is further configured to:
acquiring a frequency spectrum characteristic sequence of the accompaniment track and a preset sliding window length;
carrying out sliding window processing on the frequency spectrum characteristic sequence based on the sliding window length to obtain a plurality of sliding window results;
dividing each sliding window result into N sub-results, and determining the energy average value of the N sub-results in each sliding window result;
determining a target sliding window result from a plurality of sliding window results based on the energy mean value of N sub-results in each sliding window result;
a reference set of starting and ending points is determined based on the target sliding window result.
In some embodiments, the first determining module is further configured to:
determining a sliding window result of which the energy mean value of the N sub-results meets a decreasing condition as a first target sliding window result;
Determining a sliding window result of which the energy mean value of the N sub-results meets the increasing condition as a second target sliding window result;
determining a right boundary point of the first target sliding window result as a reference termination point;
and determining a left boundary point of the second target sliding window result as a reference starting point.
In some embodiments, the non-singing time information includes a non-singing time interval, and the second determining module is further configured to:
and determining a target starting and ending point set based on the reference starting and ending points which fall into the non-singing time interval in the reference starting and ending point set, wherein the target starting and ending point set comprises a target starting point and a target ending point.
In some embodiments, the third determining module is further configured to:
determining a first playing time length of the video data to be matched;
determining each interval duration between a target starting point and each target ending point located after the target starting point in a target starting point set;
determining each time length difference between each interval time length and the first playing time length;
when there is at least one target interval duration whose duration difference is smaller than a preset threshold, determining the target starting point corresponding to the minimum value among the at least one target interval duration as the soundtrack starting point, and determining the target ending point corresponding to the minimum value as the soundtrack ending point.
In some embodiments, the apparatus further comprises:
the first acquisition module is used for acquiring the track type of the music data to be processed when at least one target interval duration with a duration difference smaller than a preset threshold value does not exist;
an output module configured to determine a plurality of candidate music data based on the track type, and output a plurality of music identifications of the plurality of candidate music data;
a fifth determining module for determining a selected target music identifier based on receiving a selecting operation for the music identifier;
and a sixth determining module, configured to determine music data to be processed based on the target music identifier.
In some embodiments, the apparatus further comprises:
a seventh determining module, configured to determine a second playing duration of the target score data based on the score start point and the score end point;
the adjusting module is used for adjusting the playing time length of the video data to be assembled based on the second playing time length to obtain adjusted video data;
and the synthesis module is used for synthesizing the adjusted video data and the target score data to obtain the score video data.
An embodiment of the present application provides a computer device, including:
a memory for storing executable instructions;
and the processor is used for realizing the multimedia data processing method provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions that when executed by a processor, implement the multimedia data processing method provided by the embodiment of the application.
The embodiment of the application provides a computer program product, which comprises a computer program or instructions, and the computer program or instructions, when executed by a processor, implement the multimedia data processing method provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
after obtaining video data to be assembled and music data to be processed, firstly carrying out information source separation on the music data to obtain a singing sound track and an accompaniment sound track, then determining non-singing time information based on the singing sound track, determining a reference starting and ending point set based on the accompaniment sound track, further determining a target starting and ending point set from the reference starting and ending point set based on the non-singing time information, determining a soundtrack starting point and a soundtrack ending point from the target starting and ending point set based on first playing time of the video data to be assembled, and finally determining target music data between the soundtrack starting point and the soundtrack ending point in the music data as target soundtrack data of the video data; thus, the singing track and the accompaniment track are separated from the music data to be processed, so that the non-singing time information in the whole piece of music can be accurately positioned in a scene without lyrics, and the universality of video soundtracks is improved; and finally, screening out the final music starting point from the reference starting point based on the detected non-singing time section, thereby ensuring singing integrity and positioning the position of energy fade-in and fade-out gradual change of song rhythm and improving the efficiency and accuracy of video music.
Drawings
FIG. 1 is a diagram of a network architecture of a multimedia data processing system 100 provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a terminal 400 provided in an embodiment of the present application;
fig. 3 is a flow chart of a multimedia data processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of an implementation process of source separation for music data according to an embodiment of the present application;
fig. 5 is a schematic flow chart of another implementation of the multimedia data processing method according to the embodiment of the present application;
fig. 6 is an audio map and spectrogram of a voice audio track extracted from a short video of an initial version according to an embodiment of the present application;
fig. 7 is an audio map and a spectrogram of a BGM audio track extracted from a short video of an initial version according to an embodiment of the present application;
FIG. 8 is an audio map of a final short video clip finished product generated after a soundtrack provided by an embodiment of the present application;
fig. 9 is a schematic flowchart of still another implementation of the multimedia data processing method according to the embodiment of the present application;
FIG. 10 is a diagram of a Unet network architecture;
fig. 11 is a schematic diagram of an implementation process of source separation using a Unet model according to an embodiment of the present application;
fig. 12 is a schematic diagram of a non-silence segment located by VAD according to an embodiment of the present application;
FIG. 13 is a schematic diagram of end point positioning based on energy detection according to an embodiment of the present application;
fig. 14 is a schematic diagram of non-Vocal paragraph screening points according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) Source separation: a whole piece of audio may be a mixture of multiple audio signals. Source separation separates the mixed audio signal through signal processing or other algorithms, extracts the audio signal sequences of specified types from the mixed signal, and finally generates individual audio files.
2) Vocals: the audio track of the singing part. After source separation from the whole mixed audio, the singer's vocal signal separated from the whole piece of music is written into a separate waveform sound file (wav, wave).
3) Background music (BGM, Background Music): the accompaniment track. The accompaniment signal separated by source separation from the track of the whole song is written into a wav file.
4) Unet: one of the earlier algorithms to use a fully convolutional network for semantic segmentation. Its symmetrical U-shaped structure, consisting of a contracting (compression) path and an expanding path, was very innovative at the time and to some extent influenced the design of several later segmentation networks; its name also comes from its U shape.
5) Voice activity detection (VAD, Voice Activity Detection): widely used in speech coding, noise reduction and ASR scenarios. It refers to speech/non-speech (silence) detection; a VAD system typically consists of two parts, feature extraction and a speech/non-speech decision.
6) Mel frequency: a nonlinear frequency scale based on the human ear's perception of equidistant pitch changes; it is an artificially defined frequency scale used in signal processing to match the auditory perception characteristics of the human ear. In the audio processing field, many basic audio features are computed on the Mel frequency scale (a short conversion sketch is given after this list of terms).
7) webrtcvad: the Python interface to the WebRTC Voice Activity Detector (VAD). It supports Python 2 and Python 3 and can distinguish silent frames from non-silent frames within a speech segment.
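As an illustration of term 6) above, the following is a minimal sketch (not part of the patent) of the commonly used HTK-style conversion between Hertz and Mel; the formula and the function names are standard conventions rather than anything prescribed by this application.

    import math

    def hz_to_mel(f_hz: float) -> float:
        # HTK-style formula: equal steps in mel roughly correspond to
        # perceptually equal steps in pitch.
        return 2595.0 * math.log10(1.0 + f_hz / 700.0)

    def mel_to_hz(m: float) -> float:
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    print(hz_to_mel(1000.0))  # ~1000 mel, by construction of the scale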
In order to better understand the multimedia data processing method provided in the embodiments of the present application, first, a multimedia data processing method for determining a score start and end point in the related art and the existing drawbacks are described.
In the related art, there are at least three kinds of multimedia data processing methods for making a score start and stop point determination:
first, depending on the method of manual annotation, after determining the highlight or the chorus segment of music, the beginning and ending point of the selected score is manually performed before and after the determined segment, and then the music is cut according to the position.
Secondly, audio energy detection is carried out on the whole original music audio to determine the position of a music playing dead point, positioning is carried out on the whole original music audio, and fade-in fade-out point selection is carried out on the original audio track.
Thirdly, in some intelligent production short video music distribution schemes, in order to avoid singing cut-off phenomenon when selecting a music distribution starting point, the time of lyrics is often considered, and according to the time segments of the lyrics, which time segments of the whole song do not have singers to sing, so that the cut-off phenomenon is avoided.
The disadvantages of the three technical schemes include the following aspects:
first, the score start and end points using human-labeled are effectively more adaptable to intelligently produce short video clip score start and end point selections. However, the manual labeling method is low in efficiency for tasks, increases the cost of the whole intelligent production, cannot realize industrial production, and cannot guarantee real-time performance.
Second, the starting and ending point positions calculated by audio energy detection based on the original audio track of the whole song are positions capable of positioning the fade-in and fade-out effects on the whole energy, but the situation that the selected starting and ending point positions are just in the condition that one lyric is not singed cannot be avoided, and if the song is cut according to the selected starting and ending point positions, the phenomenon that the music singing of the short video is truncated at the beginning and the end of the dubbing can occur.
Thirdly, not all songs firstly carry lyric information, especially on the app of a short video clip, the user uploads own music or songs recorded by the user as background music, and the lyric information is lacking, so that the universality of the third technical scheme cannot be achieved. Furthermore, only the non-singing paragraphs are used as starting and ending point selection, so that the fade-in and fade-out effects at the beginning and the end of the soundtrack cannot be ensured, and better audience audiovisual experience cannot be achieved.
Based on the above problems, the embodiments of the present application provide a multimedia data processing method, apparatus, device, and computer readable storage medium, which perform point location determination of a soundtrack start point by combining a source separation model and audio energy detection, so as to not only avoid singing cut-off phenomenon generated by the soundtrack start point, but also ensure experience of beginning fade-in and ending fade-out when a short video clip produced intelligently is soundtrack.
Referring to fig. 1, fig. 1 is a schematic diagram of a network architecture of a multimedia data processing system 100 provided in an embodiment of the present application, where, as shown in fig. 1, the network architecture includes a server 200, a network 300, and a terminal 400, where the terminal 400 is connected to the server 200 through the network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 400 first acquires the video data to be soundtracked. The video data may be captured by an image acquisition device of the terminal 400, generated from a plurality of pictures with a video production tool, downloaded from the network, or transmitted to the terminal 400 by another terminal. The terminal 400 also acquires the music data to be processed. After obtaining the video data to be soundtracked and the music data, the terminal 400 can automatically perform source separation on the music data to obtain a singing track and an accompaniment track, determine the time information of the non-singing parts based on the singing track, and determine a reference start-stop point set for fade-in and fade-out based on the accompaniment track. It then screens out, from the reference start-stop point set, a target start-stop point set that falls within the non-singing time intervals, and finally determines a soundtrack start point and a soundtrack end point from the target start-stop point set according to the first playing duration of the video data to be soundtracked. Once the soundtrack start point and end point are obtained, the target music data can be clipped from the music data and synthesized with the video data to be soundtracked to obtain the soundtracked video data. The terminal 400 may then transmit the soundtracked video data to the server 200, so as to distribute it on a video viewing platform.
The network architecture shown in fig. 1 may also support implementing the multimedia data processing by the server 200. In this case, after obtaining the video data to be soundtracked and the music data to be processed, the terminal 400 sends them to the server 200. The server 200 performs source separation on the music data to obtain a singing track and an accompaniment track, determines the time information of the non-singing parts based on the singing track, determines a reference start-stop point set for fade-in and fade-out based on the accompaniment track, screens out from the reference start-stop point set a target start-stop point set that falls within the non-singing time intervals, and finally determines a soundtrack start point and a soundtrack end point from the target start-stop point set according to the first playing duration of the video data to be soundtracked. Once the soundtrack start point and end point are obtained, the target music data can be clipped from the music data and synthesized with the video data to be soundtracked. The server 200 then sends the soundtracked video data to the terminal 400; after the terminal 400 confirms that the soundtracked video data meets requirements, the video data can be stored and published on the video viewing platform.
In some embodiments, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart television, a smart car terminal, etc. The terminal 400 may be provided with an Application client capable of implementing a video distribution function, for example, a video play Application (APP), a video editing APP, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
Fig. 2 is a schematic structural diagram of a computer device provided in the embodiment of the present application, where the computer device may be the server 200 or the terminal 400 shown in fig. 1, and in the embodiment of the present application, the computer device is taken as an example of the terminal 400 to be described.
As shown in fig. 2, the terminal 400 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities such as a general purpose processor, such as a microprocessor or any conventional processor, or the like, a digital signal processor (DSP, digital Signal Processor), or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile memory may be read only memory (ROM, Read Only Memory) and the volatile memory may be random access memory (RAM, Random Access Memory). The memory 450 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
network communication module 452 for reaching other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.;
A presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows a multimedia data processing apparatus 455 stored in a memory 450, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the source separation module 4551, the first determination module 4552, the second determination module 4553, the third determination module 4554 and the fourth determination module 4555 are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be described hereinafter.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware, and by way of example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the multimedia data processing method provided by the embodiments of the present application, e.g., the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field programmable gate arrays (FPGA, Field-Programmable Gate Array), or other electronic components.
In some embodiments, the terminal or the server may implement the multimedia data processing method provided in the embodiments of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; the method can be a local (Native) Application program (APP), namely a program which can be run only by being installed in an operating system, such as video play APP, video editing APP and the like; the method can also be an applet, namely a program which can be run only by being downloaded into a browser environment; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in.
The multimedia data processing method provided by the embodiment of the present application will be described with reference to exemplary applications and implementations of the terminal provided by the embodiment of the present application.
The embodiment of the application provides a multimedia data processing method for video scoring, which can be executed by a terminal or a server, or jointly by the terminal and the server. Fig. 3 is a flowchart of a multimedia data processing method according to an embodiment of the present application, and each step of the method will be described with reference to fig. 3.
Step S101, obtaining video data to be matched and music data to be processed, and carrying out information source separation on the music data to obtain a singing sound track and an accompaniment sound track.
When the step S101 is implemented by the terminal, the video data may be acquired and obtained by the image acquisition device of the terminal itself, or may be generated by the terminal by using a plurality of pictures, or may be downloaded from a network server, or may be sent to the terminal by another terminal. The music data to be processed may be stored locally by the terminal, downloaded from a network server, recorded by the terminal user itself, or the like. When step S101 is implemented by the server, the video data and the music data may be transmitted by the terminal, and may be obtained based on the video identification and the music identification search query transmitted by the terminal.
The video data includes at least a plurality of video frame images, and may include audio data or may not include audio data.
When source separation of the music data is implemented, the music data is input into a trained source separation model, and the trained source separation model separates the accompaniment track and the singing track from the music data. Each track defines its own properties, such as timbre, number of channels, input/output ports, and volume. The accompaniment track contains only background music data, and the singing track contains only vocal (Vocal) data.
Step S102, determining non-singing time information based on the singing voice track, and determining a reference start-stop point set based on the accompaniment voice track.
When determining the non-singing time information based on the singing track, voice activity detection (VAD) may be used to detect impulse signals, and the positions where no impulse signal appears are determined as non-singing segments. The non-singing time information includes the non-singing time intervals, that is, the start time and end time of each non-singing segment.
When the reference starting and ending point set is determined based on the accompaniment tracks, the audio energy detection can be carried out on the accompaniment tracks to determine the fade-in and fade-out positions in the music data, namely the reference starting and ending point set is determined, so that the experience of the beginning fade-in and ending fade-out of the video data after the accompaniment can be ensured. The reference start-stop point set includes a reference start point and a reference stop point, which may each be represented by a time point.
Step S103, determining a target start-stop point set from the reference start-stop point set based on the non-singing time information.
Since the reference start-stop points are determined by performing energy detection on the accompaniment track without considering the singing part, in order to avoid singing truncation, step S103 may be implemented by determining, as the target start-stop point set, the reference start-stop points in the reference start-stop point set that fall within the non-singing time intervals, where the target start-stop point set includes target start points and target end points. The target start-stop point set is a subset of the reference start-stop point set.
Step S104, determining a soundtrack start point and a soundtrack end point from the target set of starting and ending points based on the first playing time length of the video data to be soundtrack.
When this step is implemented, the interval duration between each target start point and each target end point located after it is calculated in turn, and the duration difference between each interval duration and the first playing duration is determined in turn; the target start point and target end point whose duration difference is the smallest and is less than a preset threshold are then determined as the soundtrack start point and soundtrack end point. This ensures that the playing duration of the target music data between the determined soundtrack start point and soundtrack end point matches the playing duration of the video data. A simplified sketch of this selection logic is given below.
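The following sketch is a hedged illustration only: the function name and threshold value are assumptions, and the example time points are taken from the worked example later in this description.

    from typing import List, Optional, Tuple

    def pick_soundtrack_points(starts: List[float],
                               ends: List[float],
                               video_duration: float,
                               threshold: float = 5.0) -> Optional[Tuple[float, float]]:
        # Choose the (start, end) pair whose interval length is closest to the
        # video duration, within the threshold (seconds); return None otherwise.
        best = None
        for s in starts:
            for e in ends:
                if e <= s:
                    continue  # only consider end points located after the start point
                diff = abs((e - s) - video_duration)
                if diff < threshold and (best is None or diff < best[0]):
                    best = (diff, s, e)
        return (best[1], best[2]) if best else None

    # Example using the time points from the worked example below:
    starts = [52.0, 63.0, 118.0]    # 0:52, 1:03, 1:58
    ends = [102.0, 140.0, 156.0]    # 1:42, 2:20, 2:36
    print(pick_soundtrack_points(starts, ends, video_duration=50.0))  # -> (52.0, 102.0)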
Step S105 of determining target music data between the soundtrack start point and soundtrack end point in the music data as target soundtrack data of the video data.
The soundtrack start point and soundtrack end point are each a time point; for example, if the soundtrack start point is 1 minute 50 seconds and the soundtrack end point is 2 minutes 40 seconds, the music data between 1 minute 50 seconds and 2 minutes 40 seconds is determined as the target music data. The target music data is used to score the video data. In some embodiments, after the target soundtrack data of the video data is determined, the target music data and the video data are synthesized to obtain the soundtracked video data, which improves the entertainment and viewing value of the video.
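As an illustrative sketch of the synthesis mentioned here, the clipped target music data could be muxed with the video using a command-line tool such as ffmpeg invoked from Python; the file names, time points and exact ffmpeg options below are assumptions for illustration, not part of the patent.

    import subprocess

    def synthesize_scored_video(video_path: str, music_path: str,
                                start_s: float, end_s: float, out_path: str) -> None:
        # Cut the music between the soundtrack start and end points and use it
        # as the audio track of the video (paths are placeholders).
        cmd = [
            "ffmpeg", "-y",
            "-i", video_path,
            "-ss", str(start_s), "-to", str(end_s), "-i", music_path,
            "-map", "0:v:0", "-map", "1:a:0",
            "-c:v", "copy", "-shortest",
            out_path,
        ]
        subprocess.run(cmd, check=True)

    # e.g. music from 1:50 (110 s) to 2:40 (160 s), as in the example above:
    # synthesize_scored_video("clip.mp4", "song.mp3", 110.0, 160.0, "scored.mp4")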
In the multimedia data processing method provided by the embodiment of the application, after obtaining video data to be dubbed and music data to be processed, firstly, carrying out information source separation on the music data to obtain a singing voice track and an accompaniment voice track, then determining non-singing time information based on the singing voice track, determining a reference starting and ending point set based on the accompaniment voice track, further determining a target starting and ending point set from the reference starting and ending point set based on the non-singing time information, determining a dubbing starting point and a dubbing ending point from the target starting and ending point set based on first playing duration of the video data to be dubbed, and finally determining target music data between the dubbing starting point and the dubbing ending point in the music data as target dubbing data of the video data; thus, the singing track and the accompaniment track are separated from the music data to be processed, so that the non-singing time information in the whole piece of music can be accurately positioned in a scene without lyrics, and the universality of video soundtracks is improved; and finally, screening out the final music starting point from the reference starting point based on the detected non-singing time section, thereby ensuring singing integrity and positioning the position of energy fade-in and fade-out gradual change of song rhythm and improving the efficiency and accuracy of video music.
In some embodiments, the "source separation is performed on the music data to obtain the singing track and the accompaniment track" in the above step S101 may be performed through steps S1011 to S1015 shown in fig. 4, and each step is described below in connection with fig. 4.
Step S1011, performing time-frequency conversion on the music data to obtain a spectrum magnitude spectrum of the music data.
In this embodiment of the present application, the acquired music data is a time-domain audio signal, so in this step, the music data is subjected to time-frequency conversion, for example, the time-domain audio signal may be converted into a frequency domain by fourier transformation, so as to obtain a frequency domain signal of the music data. The frequency domain signal includes phase information and amplitude information, and in this step, the phase information in the frequency domain signal needs to be removed to obtain a spectrum amplitude spectrum of the music data.
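A minimal sketch of this time-frequency conversion, assuming the librosa library (a third-party tool not named in the patent) and a placeholder file name:

    import librosa
    import numpy as np

    # Load the music data and convert it to the frequency domain (STFT);
    # discarding the phase leaves the spectrum magnitude spectrum.
    y, sr = librosa.load("song.wav", sr=44100, mono=True)   # path is a placeholder
    stft = librosa.stft(y, n_fft=2048, hop_length=512)      # complex spectrogram
    magnitude = np.abs(stft)                                 # spectrum magnitude spectrum
    phase = np.angle(stft)                                   # kept for later reconstruction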
Step S1012, extracting features from the spectrum amplitude spectrum to obtain singing features and accompaniment features, and combining the singing features and the accompaniment features to obtain combined features.
When the step is realized, the spectrum amplitude spectrum is respectively input into an accompaniment feature extraction model and a singing feature extraction model to respectively obtain the accompaniment feature and the singing feature. The accompaniment feature extraction model and the singing feature extraction model can be a Unet model or other types of neural network models, and the accompaniment feature extraction model is used for carrying out convolution, pooling and other treatments on the spectrum amplitude spectrum to obtain the accompaniment feature; similarly, the spectrum amplitude spectrum is convolved, pooled and the like through a singing feature extraction model to obtain singing features.
The accompaniment feature and the singing feature are feature vectors, for example vectors of dimension 1*N. The accompaniment feature and the singing feature can be combined; in implementation, the corresponding feature values of the accompaniment feature and the singing feature can be added to obtain the combined feature.
For example, for an accompaniment feature [a_1, a_2, ..., a_N] and a singing feature [b_1, b_2, ..., b_N], the combined feature is [a_1+b_1, a_2+b_2, ..., a_N+b_N].
Step S1013, determining a singing mask and an accompaniment mask based on the combined feature, the singing feature, and the accompaniment feature.
Here, the proportion of the accompaniment features in the combined features may be determined as an accompaniment mask, and the proportion of the singing features in the combined features may be determined as a singing mask.
Following the above example, the singing feature and the accompaniment feature are both 1*N vectors, so the accompaniment mask is the 1*N vector [a_1/(a_1+b_1), a_2/(a_2+b_2), ..., a_N/(a_N+b_N)], and the singing mask is the 1*N vector [b_1/(a_1+b_1), b_2/(a_2+b_2), ..., b_N/(a_N+b_N)].
Step S1014, performing mask calculation on the spectrum magnitude spectrum by using the singing mask and the accompaniment mask, so as to obtain a singing spectrum magnitude and an accompaniment spectrum magnitude.
Continuing the above example, the spectrum magnitude spectrum is also a 1*N vector, assumed to be [L_1, L_2, ..., L_N]. In this step, the mask calculation is a point-wise (dot) multiplication: multiplying the accompaniment mask with the spectrum magnitude spectrum gives the accompaniment spectrum magnitude, i.e. [L_1*a_1/(a_1+b_1), L_2*a_2/(a_2+b_2), ..., L_N*a_N/(a_N+b_N)], and multiplying the singing mask with the spectrum magnitude spectrum gives the singing spectrum magnitude, i.e. [L_1*b_1/(a_1+b_1), L_2*b_2/(a_2+b_2), ..., L_N*b_N/(a_N+b_N)].
Step S1015, performing frequency-time conversion on the singing frequency spectrum amplitude and the accompaniment frequency spectrum amplitude respectively, so as to obtain a singing audio track and an accompaniment audio track correspondingly.
When the step is realized, the singing frequency spectrum amplitude and the accompaniment frequency spectrum amplitude are respectively subjected to inverse Fourier transform, so that the conversion from a frequency domain to a time domain is realized, and the singing track and the accompaniment track are correspondingly obtained.
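Steps S1012 to S1015 can be sketched as follows, assuming the magnitude and phase computed in the earlier sketch and stand-in feature maps in place of the trained accompaniment/singing feature extraction models; this is an illustrative ratio-mask sketch, not the patent's exact model.

    import numpy as np
    import librosa

    def separate_with_masks(magnitude: np.ndarray, phase: np.ndarray,
                            vocal_feat: np.ndarray, accomp_feat: np.ndarray,
                            hop_length: int = 512):
        # vocal_feat / accomp_feat stand in for the outputs of the singing and
        # accompaniment feature extraction models (e.g. Unet-style networks)
        # and are assumed to have the same shape as the magnitude spectrum.
        combined = vocal_feat + accomp_feat + 1e-8     # avoid division by zero
        vocal_mask = vocal_feat / combined             # proportion of the singing feature
        accomp_mask = accomp_feat / combined           # proportion of the accompaniment feature

        vocal_mag = vocal_mask * magnitude             # singing spectrum magnitude
        accomp_mag = accomp_mask * magnitude           # accompaniment spectrum magnitude

        # Frequency-to-time conversion, reusing the mixture phase.
        vocal_track = librosa.istft(vocal_mag * np.exp(1j * phase), hop_length=hop_length)
        accomp_track = librosa.istft(accomp_mag * np.exp(1j * phase), hop_length=hop_length)
        return vocal_track, accomp_track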
In some embodiments, the above-mentioned "determining the non-singing time information based on the singing track" in step S102 may be implemented in the following two ways, which are described below, respectively.
The first way can be achieved by the following steps:
step S1021A, voice activity detection is carried out on the singing voice track, positioning is carried out according to impulse signals in the singing voice track, and singing time information in the singing voice track is determined.
Signals other than the vocal have already been separated out of the singing track by the source separation process, so VAD is used to locate the impulse signals of the vocal in the singing track and obtain the singing time information of the singing track. In the embodiment of the present application, the singing time information, i.e. the singing time intervals, includes a singing start time and a singing end time.
Step S1022A, determining non-singing time information in the singing track based on the singing time information.
After the singing time information is determined, the singing time interval can be removed from the singing voice track, and the non-singing time interval can be obtained, so that the non-singing time information is determined.
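A hedged sketch of this first implementation, using the webrtcvad package mentioned in the terms section; the sample rate, frame length and aggressiveness level are illustrative choices, and a real pipeline would first resample the separated singing track to 16-bit mono PCM.

    import webrtcvad

    def non_singing_intervals(pcm16: bytes, sample_rate: int = 16000,
                              frame_ms: int = 30, aggressiveness: int = 2):
        # Return (start_s, end_s) intervals of the separated singing track in
        # which no vocal activity is detected.
        vad = webrtcvad.Vad(aggressiveness)
        frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per 16-bit sample
        intervals, start = [], None
        n_frames = len(pcm16) // frame_bytes
        for i in range(n_frames):
            frame = pcm16[i * frame_bytes:(i + 1) * frame_bytes]
            voiced = vad.is_speech(frame, sample_rate)
            t = i * frame_ms / 1000.0
            if not voiced and start is None:
                start = t                                      # non-singing segment begins
            elif voiced and start is not None:
                intervals.append((start, t))                   # non-singing segment ends
                start = None
        if start is not None:
            intervals.append((start, n_frames * frame_ms / 1000.0))
        return intervals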
The second implementation may be achieved by:
step S1021B, a trained audio event detection model is obtained.
The audio event detection model is a neural network model, and can be obtained by performing repeated iterative training by using training singing track data marked with singing parts and non-singing parts.
Step S1022B, inputting the singing audio track into the audio event detection model to obtain singing time information in the singing audio track.
After the singing audio track is input into the audio event detection model, the audio event detection model can detect the singing event of the singing audio track and output singing time information of each singing paragraph in the singing audio track.
Step S1023B, determining non-singing time information in the singing audio track based on the singing time information.
Similar to step S1022A, after determining the singing time information, the singing time interval may be removed from the singing track, so that the non-singing time interval may be obtained, and thus the non-singing time information is determined.
Through the above two implementations, after the singing track is extracted, the parts of the whole singing track where the vocal appears can be located; the located singing paragraphs are the paragraphs where the singer is vocalizing in the music data. The parts of the whole song that contain no vocal are thereby determined, which prevents the vocal and lyrics from being cut off when the song is clipped.
In some embodiments, "determining a reference start-stop set based on the accompaniment tracks" in step S102 may be achieved by:
step S1024, the frequency spectrum characteristic sequence of the accompaniment track and the preset sliding window length are obtained.
When the step is implemented, fourier transformation may be performed on the separated accompaniment tracks, for example, short-time fourier transformation may be performed to obtain a spectrum of the accompaniment tracks, and then mel transformation and logarithmic calculation are performed to obtain a spectrum feature sequence of the accompaniment tracks. The sliding window length W may be preset, and W is a positive integer, for example, 18, 24, 36, etc.
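A minimal sketch of step S1024, again assuming librosa; the FFT size, hop length and number of Mel bands are illustrative defaults rather than values specified by the application.

    import numpy as np
    import librosa

    def accompaniment_feature_sequence(accomp: np.ndarray, sr: int = 44100,
                                       n_mels: int = 64, hop_length: int = 512) -> np.ndarray:
        # STFT -> Mel transform -> logarithm, giving one feature vector per frame.
        mel = librosa.feature.melspectrogram(y=accomp, sr=sr,
                                             n_fft=2048, hop_length=hop_length,
                                             n_mels=n_mels)
        log_mel = librosa.power_to_db(mel)      # logarithmic compression
        return log_mel.T                        # shape: (num_frames, n_mels)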
And step S1025, carrying out sliding window processing on the frequency spectrum characteristic sequence based on the sliding window length to obtain a plurality of sliding window results.
In some embodiments, a sliding step S also needs to be obtained. In practical implementations, the sliding step may be determined from the sliding window length, for example one half or one third of the sliding window length, or equal to the sliding window length W. When step S1025 is implemented, sliding window processing may be performed on the spectral feature sequence according to the sliding window length and the sliding step, so as to obtain a plurality of sliding window results. Each sliding window result comprises W spectral feature values, and adjacent sliding window results overlap by (W-S) spectral features.
In step S1026, each sliding window result is divided into N sub-results, and the energy average of the N sub-results in each sliding window result is determined.
Here N is a positive integer, for example 2, 3 or 4. To ensure that each sliding window result can be divided into equal parts, W can be set to an integer multiple of N when the sliding window length is chosen; this is not a strict requirement, however, and W does not have to be an integer multiple of N. In the embodiment of the present application, W may be set to 18, N to 3, and S to 6.
In the implementation of step S1026, all the frequency domain energies in each sliding window result may be accumulated to generate a one-dimensional energy sequence, and then the one-dimensional energy sequence is divided into N sub-results, so as to determine an energy average value in each sub-result.
Step S1027, determining a target sliding window result from a plurality of sliding window results based on the energy average value of the N sub-results in each sliding window result.
Through the foregoing steps S1024 to S1026, the N sub-results corresponding to each sliding window result can be determined. The energy means of the N sub-results of each sliding window are then compared: if the energy means of the N sub-results satisfy the decreasing condition, the sliding window result corresponding to those N sub-results is determined as a first target sliding window result; if the energy means of the N sub-results satisfy the increasing condition, the sliding window result corresponding to those N sub-results is determined as a second target sliding window result. If the N sub-results satisfy neither the increasing condition nor the decreasing condition, the corresponding sliding window result is not a target sliding window result.
It should be noted that the energy means of the N sub-results satisfying the decreasing condition means that, when the N sub-results are compared in chronological order, P_1 > P_2 > P_3 > ... > P_N holds; in other words, the sub-results are ordered by time for the comparison of energy means. Similarly, the energy means of the N sub-results satisfying the increasing condition means that, comparing the N sub-results in chronological order, P_1 < P_2 < P_3 < ... < P_N holds.
Step S1028, determining a reference start-stop point set based on the target sliding window result.
Continuing from the implementation of step S1027: since the N sub-results in the first target sliding window result satisfy the decreasing condition, the first target sliding window result corresponds to an energy fade-out effect, and in step S1028 the right boundary point of the first target sliding window result is determined as a reference end point; since the N sub-results in the second target sliding window result satisfy the increasing condition, the second target sliding window result corresponds to an energy fade-in effect, and its left boundary point is determined as a reference start point.
Through the steps S1024 to S1028, energy detection may be performed on the accompaniment track to determine the positions of energy fade-in and energy fade-out in the accompaniment track, with the fade-in positions determined as reference start points and the fade-out positions as reference end points, so that the time information of reference start-stop positions in the accompaniment track suitable for scoring can be accurately located.
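The detection described in steps S1024 to S1028 can be summarized in the following minimal sketch, assuming the accompaniment's spectral feature sequence is a non-negative NumPy array of shape [n_bins, n_frames]; the parameter values W=18, S=6, N=3 follow the example above, and the function name and frame duration are illustrative assumptions rather than values from the embodiment.

```python
import numpy as np

def reference_start_stop_points(spec, W=18, S=6, N=3, frame_dur=0.032):
    """Sketch of steps S1024-S1028: slide a window of W frames with step S over
    the spectral feature sequence `spec` ([n_bins, n_frames]), split each window
    into N equal sub-segments, and test whether the sub-segment energy means
    increase (fade-in) or decrease (fade-out)."""
    frame_energy = spec.sum(axis=0)           # accumulate frequency-domain energy per frame
    starts, stops = [], []
    for left in range(0, len(frame_energy) - W + 1, S):
        window = frame_energy[left:left + W]
        sub_means = window.reshape(N, W // N).mean(axis=1)   # assumes W is a multiple of N
        if all(sub_means[i] > sub_means[i + 1] for i in range(N - 1)):
            stops.append((left + W) * frame_dur)   # fade-out: right boundary is a reference end point
        elif all(sub_means[i] < sub_means[i + 1] for i in range(N - 1)):
            starts.append(left * frame_dur)        # fade-in: left boundary is a reference start point
    return starts, stops
```

With W=18, S=6 and N=3, adjacent windows overlap by W-S=12 frames, matching the description above.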
It should be noted that, step S102 may be implemented by step S1021A, step S1022A, step S1024 to step S1028, or may be implemented by step S1021B to step S1023B, step S1024 to step S1028, but the above two implementations are not shown in the drawings.
Based on the foregoing embodiments, the embodiments of the present application further provide a multimedia data processing method, which is applied to the network architecture shown in fig. 1, and fig. 5 is a schematic flow chart of another implementation of the multimedia data processing method provided in the embodiments of the present application, as shown in fig. 5, where the flow includes:
in step S201, the terminal acquires video data to be dubbed and music data to be processed.
When the terminal is realized, the terminal can determine the video data to be matched and the music data to be processed based on the received operation instruction for video matching.
In step S202, the terminal transmits the video data and the music data to the server.
In step S203, the server performs source separation on the music data to obtain a singing track and an accompaniment track.
In step S204, the server determines non-singing time information based on the singing track, and determines a reference start-stop point set based on the accompaniment track.
In step S205, the server determines the target start-stop point set based on the reference start-stop points in the reference start-stop point set that fall into the non-singing time interval.
The implementation process of the above steps S203 to S205 is similar to the implementation process of the steps S101 to S103, and reference may be made to the steps S101 to S103 in the actual application process.
In step S206, the server determines a first playing duration of the video data to be assembled.
The first playing duration is the duration required to play the video data to be assembled at normal speed (1x speed).
In step S207, the server determines respective interval durations between the target start point and respective target end points located after the target start point in the target start-stop point set.
In this step, the server determines a respective interval duration between each target start point in the set of target start points and a respective target end point located after the target start point.
For example, suppose that, in time order, the first start point is at 52 seconds, the second start point at 1 minute 3 seconds, the first end point at 1 minute 42 seconds, the third start point at 1 minute 58 seconds, the second end point at 2 minutes 20 seconds, and the third end point at 2 minutes 36 seconds. It is then necessary to calculate the interval duration between the first start point and the first end point (50 seconds), between the first start point and the second end point (1 minute 28 seconds), between the first start point and the third end point (1 minute 44 seconds), between the second start point and the second end point (1 minute 17 seconds), between the second start point and the third end point (1 minute 33 seconds), and between the third start point and the third end point (38 seconds).
In step S208, the server determines each time difference between each interval time and the first playing time.
In this step, the duration difference between each interval duration and the first playing duration may be obtained as the absolute value of the interval duration minus the first playing duration.
Continuing the above example, assuming that the first playing duration is 1 minute and 30 seconds, the duration differences between the interval durations and the first playing duration are 40 seconds, 2 seconds, 14 seconds, 13 seconds, 3 seconds, and 52 seconds, respectively.
In step S209, the server determines whether there is at least one target interval duration in which the duration difference is smaller than a preset threshold.
When there is at least one target interval duration whose duration difference is smaller than the preset threshold, the process proceeds to step S210; when there is no such target interval duration, the process proceeds to step S215.
In step S210, the server determines the target start point corresponding to the minimum duration difference among the at least one target interval duration as the score start point, and determines the corresponding target end point as the score end point.
Assuming that the preset threshold is 5 seconds, only the duration differences of 2 seconds and 3 seconds are less than 5 seconds, so the process proceeds from step S209 to step S210. In step S210, the target start point corresponding to the 2-second difference is determined as the score start point and the corresponding target end point as the score end point; that is, the first start point is determined as the score start point and the second end point as the score end point.
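The selection logic of steps S207 to S210 can be sketched as follows; the function name is illustrative, times are in seconds, and unlike the worked example the sketch enumerates every end point located after each start point.

```python
def choose_score_points(start_points, end_points, first_play_duration, threshold=5.0):
    """Sketch of steps S207-S210: for every (target start, later target end) pair,
    compute the interval duration and its absolute difference from the video's
    first playing duration, then pick the pair with the smallest difference
    below the threshold (seconds)."""
    best = None
    for s in start_points:
        for e in end_points:
            if e <= s:
                continue
            diff = abs((e - s) - first_play_duration)
            if diff < threshold and (best is None or diff < best[0]):
                best = (diff, s, e)
    return None if best is None else (best[1], best[2])   # (score start, score end) or None

# Example values from the text (seconds): start points 52, 63, 118; end points 102, 140, 156
print(choose_score_points([52, 63, 118], [102, 140, 156], first_play_duration=90))
# -> (52, 140): the first start point and the second end point, interval 88 s, difference 2 s
```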
In step S211, the server determines the target music data between the beginning point of the score and the ending point of the score as target score data, and determines the second playing duration of the target score data.
In this step, the second playing duration of the target score data can be determined by subtracting the time point corresponding to the score start point from the time point corresponding to the score end point.
Continuing the above example, the score start point is at 52 seconds and the score end point at 2 minutes 20 seconds, so the second playing duration is 1 minute 28 seconds.
In step S212, the server adjusts the playing duration of the video data to be assembled based on the second playing duration, so as to obtain adjusted video data.
When this step is implemented, the playing duration may be adjusted by changing the playback speed of the video data to be assembled, so that the playing duration of the adjusted video data equals the second playing duration. Alternatively, when the second playing duration is shorter than the first playing duration, the video data to be assembled may be clipped down to the second playing duration; when the second playing duration is longer than the first playing duration, supplementary footage may be appended to the video data to be assembled so that the playing duration of the adjusted video data reaches the second playing duration.
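A minimal sketch of the decision logic in step S212 follows; the speed-change threshold is an illustrative heuristic not specified above, and the actual editing (re-timing, trimming, or appending footage) is left to a video tool.

```python
def plan_video_adjustment(first_play_duration, second_play_duration, max_speed_change=0.1):
    """Sketch of step S212: decide how to make the video's playing duration
    match the second playing duration of the target score data.
    Returns a (mode, value) plan; the editing itself is done elsewhere."""
    ratio = first_play_duration / second_play_duration
    if abs(ratio - 1.0) <= max_speed_change:
        # small mismatch: change playback speed so the durations match exactly
        return ("speed", ratio)
    if second_play_duration < first_play_duration:
        # score is shorter: clip the video down to the second playing duration
        return ("trim", second_play_duration)
    # score is longer: append supplementary footage to reach the second playing duration
    return ("pad", second_play_duration - first_play_duration)

print(plan_video_adjustment(90.0, 88.0))   # -> ('speed', 1.0227...)
```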
In step S213, the server synthesizes the adjusted video data and the target score data to obtain the score-matched video data.
When this step is implemented, the target score data and the adjusted video data may be synthesized directly to obtain the scored video data, and different processing may be adopted depending on whether the adjusted video data contains audio data. For example, when the adjusted video data contains no audio data, it may be synthesized with the target score data directly. When the adjusted video data does contain audio data, several implementations are possible: the audio data in the adjusted video data may be filtered out and the resulting silent video data synthesized with the target score data; alternatively, the volume of the audio data in the adjusted video data may be reduced before it is synthesized with the target score data.
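A minimal sketch of the audio options in step S213 follows, assuming both signals are mono floating-point NumPy arrays at the same sample rate; the ducking gain and function name are illustrative assumptions.

```python
import numpy as np

def mix_score(video_audio, score_audio, mode="duck", duck_gain=0.3):
    """Sketch of step S213's audio options: either drop the video's own audio
    ("mute") or lower its volume before adding the score ("duck")."""
    n = min(len(video_audio), len(score_audio)) if video_audio is not None else len(score_audio)
    if video_audio is None or mode == "mute":
        return score_audio[:n]                   # silent video: use the score alone
    mixed = duck_gain * video_audio[:n] + score_audio[:n]
    return np.clip(mixed, -1.0, 1.0)             # keep the mix within full scale
```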
In step S214, the server transmits the video data after the dubbing to the terminal.
In step S215, the server acquires the track type of the music data to be processed.
In this embodiment of the present application, if none of the obtained duration differences is smaller than the preset threshold, it is determined that the music data to be processed contains no segment suitable as score music for the video data to be scored. In the subsequent steps, the server may make a music recommendation to the terminal based on the track type of the music data to be processed.
In step S216, the server determines a plurality of candidate music data based on the track type.
In step S217, the server outputs a plurality of music identifications of the plurality of candidate music data.
In the embodiment of the application, the server outputs a plurality of music identifications of a plurality of candidate music data, that is, the server transmits the plurality of music identifications to the terminal.
In step S218, the terminal determines the selected target music identifier based on receiving the selection operation for the music identifier.
In some embodiments, after receiving the music identifier sent by the server, the terminal may present a plurality of music identifiers, and may also present a music playing control, where each music identifier corresponds to one music playing control, and if the terminal receives a touch operation for a certain music playing control, the terminal plays the corresponding music data, so that the user may determine whether to select the music identifier. When the terminal receives a selection operation for a certain music identification, the selected target music identification is determined.
In step S219, the terminal sends the selected target music identifier to the server.
In step S220, the server determines music data to be processed based on the target music identification.
When the method is realized, the server determines the music data corresponding to the target music identifier as updated to-be-processed music data. After this step, the above steps S203 to S220 are repeatedly performed until the dubbing process of the video data is completed.
It should be noted that the music data processing and video scoring procedure provided in the embodiments of the present application is described here as being implemented by a server; in actual implementation it may also be implemented by a terminal.
In the multimedia data processing method provided by the embodiment of the application, after the video data to be assembled and the music data to be processed are obtained, source separation is first performed on the music data to obtain a singing track and an accompaniment track; non-singing time information is then determined based on the singing track, a reference start-stop point set is determined based on the accompaniment track, and a target start-stop point set is further determined from the reference start-stop point set based on the non-singing time information. Next, each interval duration between a target start point and each target end point located after it is calculated, each duration difference between these interval durations and the first playing duration is calculated, and the target start and end points whose duration difference is both smaller than a preset threshold and smallest are determined as the score start and end points. In this way, the matching degree between the finally determined music playing duration and the video playing duration can be ensured, as can the integrity of the score music. In addition, in the embodiment of the application, if no final score start point is obtained by the screening, the server can select music matching the track type of the music data from the music library as new music data to be processed and send it to the terminal for confirmation; if a notification message in which the terminal confirms the use of this music data to be processed is received, scoring is performed again for the video data with the new music data to be processed, so that target music data matching the video data can still be determined and the robustness of video scoring is ensured.
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
The multimedia data processing method provided by the embodiment of the application can be applied to a scene of intelligently producing music for short video clips: the positions of score start and stop points are located by combining the two modules of the source separation model and audio energy detection, so as to provide reference score cut points for clip scoring.
In the multimedia data processing method provided by the embodiment of the application, the audio of the original short video is input into the source separation model to obtain the vocal audio track (corresponding to the singing track in other embodiments) and the BGM audio track (corresponding to the accompaniment track in other embodiments), and the position information of the singing paragraphs in the music is then determined through a VAD algorithm. The singing positions across the whole song are located using the position information of the singing paragraphs, the fade-in and fade-out positions of the whole song are determined by performing audio energy detection on the BGM audio track, and the score start and stop points are finally determined through a double selection over the non-singing paragraphs and the energy detection results. In this way the phenomenon of cutting off singing can be avoided, the time points of gradual audio change can be accurately located, and reference score cut points can be produced fully automatically for intelligent clip scoring.
Fig. 6 shows the audio map and spectrogram of the vocal audio track extracted from the original short video according to an embodiment of the present application, where 601 is the audio map and 602 is the spectrogram. Fig. 7 shows the audio map and spectrogram of the BGM audio track extracted from the original short video, where 701 is the audio map and 702 is the spectrogram. Fig. 8 shows the audio map of the final short video clip finished product generated after scoring, as provided in the embodiment of the present application.
The following describes the technical implementation process of the multimedia data processing method provided in the embodiment of the present application. When implemented, the method can comprise the following three stages: non-vocal paragraph detection based on source separation and VAD, start-stop point positioning based on energy detection, and final screening of the start-stop points against the non-vocal paragraphs.
Fig. 9 is a schematic flowchart of still another implementation of the multimedia data processing method provided in the embodiment of the present application, where the multimedia data processing method may be applied to a server or a terminal, and in the embodiment of the present application, the application is described by taking the application to the terminal as an example, and as shown in fig. 9, the flowchart includes:
in step S901, the terminal performs source separation on the input music to obtain a singer voice track and an accompaniment track.
When this is implemented, the accompaniment track and the vocal track can be extracted from the whole input music through a source separation model built from two U-Net networks.
Fig. 10 is a diagram of the network architecture of the U-Net. As shown in fig. 10, the U-Net (U-shaped network) has a U-shaped symmetrical structure, with convolution layers on the left side 1001 and upsampling layers on the right side 1002. The U-Net structure shown in fig. 10 includes 4 convolution layers and 4 corresponding upsampling layers. When the model is implemented, the network may be built from scratch, with the weights initialized and the model then trained; alternatively, the convolution-layer structure and the corresponding trained weight files of an existing network, such as a residual network (ResNet, Residual Network) or a VGG network (VGGNet, Visual Geometry Group Network), may be reused, with the subsequent upsampling layers added for training, because reusing existing weight files can greatly accelerate deep learning model training. Another characteristic of the U-Net is that the feature map obtained by each convolution layer is concatenated to the corresponding upsampling layer through a skip connection, so that the feature map of each layer can be used effectively in subsequent computation. In this way, compared with other network structures such as a fully convolutional network (FCN, Fully Convolutional Network), the U-Net avoids performing supervision and loss calculation only on the high-level feature maps; it combines features from the low-level feature maps, so that the finally obtained feature map contains both high-level and many low-level features, realizing feature fusion across different scales and improving the accuracy of the model results.
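As a rough illustration only (not the architecture actually used in the patent), the following PyTorch sketch shows a U-Net-style network with four convolution stages, four upsampling stages, and skip connections; channel counts and layer details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SmallUNet(nn.Module):
    """Illustrative U-Net: 4 convolution (encoder) stages and 4 upsampling (decoder)
    stages, with each encoder feature map concatenated to the matching decoder
    stage through a skip connection."""
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        chs = [base, base * 2, base * 4, base * 8]
        self.encoders = nn.ModuleList()
        prev = in_ch
        for c in chs:
            self.encoders.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.pool = nn.MaxPool2d(2)
        self.ups, self.decoders = nn.ModuleList(), nn.ModuleList()
        for c in reversed(chs):
            self.ups.append(nn.ConvTranspose2d(prev, c, 2, stride=2))
            self.decoders.append(nn.Sequential(
                nn.Conv2d(c * 2, c, 3, padding=1), nn.ReLU(inplace=True)))
            prev = c
        self.head = nn.Conv2d(prev, in_ch, 1)   # per-bin mask logits

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
            x = self.pool(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))   # skip connection
        return torch.sigmoid(self.head(x))          # soft mask in [0, 1]

# mask = SmallUNet()(torch.randn(1, 1, 64, 128))   # input dims must be divisible by 16
```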
Fig. 11 is a schematic diagram of an implementation process of source separation using a Unet model according to an embodiment of the present application, where, as shown in fig. 11, the process includes:
in step S1101, the terminal acquires music to be processed and determines the track spectrum of the music.
In step S1102, the terminal eliminates the phase in the audio track spectrum to obtain the spectrum amplitude spectrum of the audio track.
In step S1103, the terminal inputs the spectrum magnitude spectrum of the audio track into the background music U-Net (BGM-Unet, BackGround Music U-Net) and the singing U-Net (VOCAL-Unet).
The accompaniment features and singing voice features in music can be extracted by utilizing the BGM-Unet and the VOCAL-Unet.
In step S1104, the terminal combines the accompaniment features and the singing voice features to obtain the combined features.
In step S1105, the terminal performs mask calculation on the combined features to calculate an accompaniment mask and a singing mask, respectively.
In step S1106, the terminal applies the accompaniment mask map and the singing mask map to the corresponding positions of the spectrum amplitude spectrum of the original audio track, and then generates the corresponding accompaniment track and singing track through the inverse spectrum transformation.
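The steps above can be summarized in the following hedged sketch, which uses librosa for the STFT and its inverse and treats the two trained networks as black-box callables; the names bgm_unet and vocal_unet, and all parameter values, are assumptions.

```python
import numpy as np
import librosa

def separate_sources(y, sr, bgm_unet, vocal_unet, n_fft=2048, hop=512):
    """Sketch of steps S1101-S1106: split music into accompaniment and singing
    tracks with two mask-predicting networks (bgm_unet / vocal_unet are assumed
    to map a magnitude spectrogram to a mask of the same shape in [0, 1])."""
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(stft), np.angle(stft)            # S1102: drop the phase, keep the magnitude
    bgm_mask = bgm_unet(mag)                              # S1103-S1105: accompaniment mask
    vocal_mask = vocal_unet(mag)                          #              and singing mask
    bgm_spec = bgm_mask * mag * np.exp(1j * phase)        # S1106: apply masks at corresponding positions,
    vocal_spec = vocal_mask * mag * np.exp(1j * phase)    # reusing the original phase
    accompaniment = librosa.istft(bgm_spec, hop_length=hop)
    vocals = librosa.istft(vocal_spec, hop_length=hop)
    return vocals, accompaniment
```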
In step S902, the terminal inputs the singer voice track into the VAD model to locate the position of the non-Vocal paragraph.
The accompaniment track and the singing track can be obtained through source separation, and the singing track does not contain any audio signal other than the human voice, so VAD is used to locate where the human voice appears in the vocal track. In practical applications, an open-source VAD algorithm can be used to locate the positions of the non-silent segments. Fig. 12 is a schematic diagram of non-silent segments located by VAD in the embodiment of the present application. As can be seen from fig. 12, in the whole singing track, higher audio impulse signals appear only in the segments marked by boxes, and the remaining parts are silent. The impulse signals are located by the VAD algorithm, the remaining areas of the whole track form the set of non-vocal segments, and the time positions of the non-vocal segments in the original music audio are then calculated.
It should be noted that, in some embodiments, the time position information of the non-Vocal paragraphs may also be determined by the audio event detection positioning model, and the singing paragraphs in the song may be positioned by positioning the singing positions, so as to position the non-Vocal paragraph time set.
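As an illustration only, the following sketch replaces the open-source VAD algorithm mentioned above with a simple energy threshold to turn the singing track into a list of non-vocal time intervals; the threshold and frame sizes are assumptions, and a real system would use a proper VAD.

```python
import numpy as np

def non_vocal_intervals(vocals, sr, frame_len=2048, hop=512, db_threshold=-40.0):
    """Sketch of step S902: mark frames of the singing track whose energy is below
    a threshold as silent, then merge consecutive silent frames into non-vocal
    time intervals (start, end) in seconds."""
    n_frames = 1 + max(0, (len(vocals) - frame_len) // hop)
    rms = np.array([np.sqrt(np.mean(vocals[i * hop:i * hop + frame_len] ** 2) + 1e-12)
                    for i in range(n_frames)])
    silent = 20 * np.log10(rms + 1e-12) < db_threshold
    intervals, start = [], None
    for i, s in enumerate(silent):
        if s and start is None:
            start = i                                   # a silent run begins
        elif not s and start is not None:
            intervals.append((start * hop / sr, i * hop / sr))
            start = None
    if start is not None:
        intervals.append((start * hop / sr, n_frames * hop / sr))
    return intervals
```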
In step S903, the terminal performs audio energy detection on the accompaniment tracks to determine a reference start point set of the accompaniment tracks.
When this step is implemented, a short-time Fourier transform (STFT, Short-Time Fourier Transform) is performed on the accompaniment track obtained through source separation to obtain the spectrum of the accompaniment track, and the log-mel spectrogram feature required for the final detection is then generated through a Mel transformation followed by taking the logarithm.
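A minimal sketch of this feature pipeline using librosa is shown below; the FFT size, hop length, and number of mel bands are illustrative assumptions.

```python
import numpy as np
import librosa

def log_mel_features(accompaniment, sr, n_fft=2048, hop=512, n_mels=64):
    """Sketch of the feature pipeline in step S903: STFT -> Mel filter bank ->
    logarithm, producing the log-mel spectrogram used for energy detection."""
    stft = librosa.stft(accompaniment, n_fft=n_fft, hop_length=hop)
    power = np.abs(stft) ** 2
    mel = librosa.feature.melspectrogram(S=power, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)      # shape: [n_mels, n_frames]
```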
A sliding window of duration T is set over the generated spectrum feature sequence; this sliding window acts as a fixed filter for judging gradual energy change. Fig. 13 is a schematic diagram of end point positioning based on energy detection according to the embodiment of the present application. As shown in fig. 13, within each sliding window all the frequency-domain energy is accumulated to generate a one-dimensional energy sequence, the whole sequence is then divided equally into three segments of duration T/3, and the average energy of each segment is obtained. Fig. 13 illustrates the location of termination points: in the task of intelligently producing music for short video clips, music termination points with a fade-out effect are preferred, so the three average energy values within the whole sliding window should decrease from left to right (for starting points the opposite holds, increasing from left to right), that is, E1 > E2 > E3. If the energy within the window fully satisfies this successively decreasing behavior, the right boundary of the window qualifies as a termination point (if the increasing behavior is satisfied, the left boundary qualifies as a starting point).
When this is implemented, the sliding distance of the sliding window over the whole detection process is T/3, that is, detection is performed at a granularity of T/3.
In step S904, the terminal uses the non-vocal paragraph time set to screen the reference starting and ending points of the accompaniment to obtain the final score starting and ending point set.
Through the steps S901 to S904, the set of non-vocal time segments is located on the singing track and the preliminary start-stop time positions are located on the accompaniment track; the points on the accompaniment track are then screened using the time ranges of the non-vocal segments, points not falling within a non-vocal segment are removed, and the points remaining within the non-vocal segments are the score cut points detected by the system.
Fig. 14 is a schematic diagram of screening points by non-vocal segments provided in the embodiment of the present application; it shows the non-vocal time positions 1401 in the singing track and the reference start-stop points 1402 screened from the accompaniment track. In the final screening process, the start-stop points falling into the non-vocal time intervals are retained, giving the retained target start-stop points shown in fig. 14, including a start point 1403, an end point 1404, an end point 1405, a start point 1406 and an end point 1407. The interval durations between start point 1403 and end point 1404, between start point 1403 and end point 1405, between start point 1403 and end point 1407, and between start point 1406 and end point 1407 are then calculated, the duration differences between these interval durations and the playing duration of the video are determined, and the minimum duration difference is found; if the minimum duration difference is smaller than a preset threshold, the start point and end point corresponding to the minimum duration difference are determined as the final score start and end points.
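A minimal sketch of this screening step follows; interval boundaries are assumed to be given in seconds, and the surviving points can then be fed into the pair-selection logic sketched after step S210 above.

```python
def screen_points(reference_starts, reference_stops, non_vocal_intervals):
    """Sketch of step S904: keep only reference start/stop points (seconds) that
    fall inside one of the non-vocal time intervals of the singing track."""
    def in_non_vocal(t):
        return any(lo <= t <= hi for lo, hi in non_vocal_intervals)
    target_starts = [t for t in reference_starts if in_non_vocal(t)]
    target_stops = [t for t in reference_stops if in_non_vocal(t)]
    return target_starts, target_stops
```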
In the multimedia data processing method provided by the embodiment of the application, the singing track and the accompaniment track of the whole song are separated by using a source separation model; the method can therefore be applied even when no lyrics are available, and the set of non-singing time paragraphs in the whole song can be accurately located, avoiding the phenomenon of cutting off singing during selection. Audio energy detection is then performed on the accompaniment track to locate the time positions where the accompaniment energy fades in and out, so that the time information of positions in the song accompaniment suitable for score cut points can be accurately located. Finally, the start and stop points are screened jointly using the detected non-singing time paragraphs and the detected score cut point positions, so that the phenomenon of cutting off singing is avoided, the positions where the song's rhythm and energy gradually fade in and out are located, and music for short video clips can be produced intelligently.
Continuing with the description of exemplary structures of the multimedia data processing apparatus 455 provided by embodiments of the present application implemented as software modules, in some embodiments, as shown in fig. 2, the software modules of the multimedia data processing apparatus 455 stored in the memory 450 may include:
The information source separation module 4551 is configured to obtain video data to be assembled and music data to be processed, and perform information source separation on the music data to obtain a singing track and an accompaniment track;
a first determining module 4552 configured to determine non-singing time information based on the singing track, and determine a reference start-stop set based on the accompaniment track;
a second determining module 4553, configured to determine a target start-stop point set from the reference start-stop point set based on the non-singing time information;
a third determining module 4554, configured to determine a soundtrack start point and a soundtrack end point from the target start-stop point set based on the first play duration of the video data to be soundtrack;
a fourth determining module 4555 for determining target music data between the score start point and the score end point in the music data as target score data of the video data.
In some embodiments, the source separation module is further configured to:
performing time-frequency conversion on the music data to obtain a spectrum amplitude spectrum of the music data;
extracting features of the spectrum amplitude spectrum to obtain singing features and accompaniment features, and combining the singing features and the accompaniment features to obtain combined features;
Determining a singing mask and an accompaniment mask based on the combined features, the singing features, and the accompaniment features;
performing mask calculation on the spectrum amplitude spectrum by using the singing mask and the accompaniment mask respectively to correspondingly obtain singing spectrum amplitude and accompaniment spectrum amplitude;
and respectively performing frequency-time conversion on the singing frequency spectrum amplitude and the accompaniment frequency spectrum amplitude to correspondingly obtain a singing sound track and an accompaniment sound track.
In some embodiments, the first determining module is further configured to:
performing voice activity detection on the singing voice track, positioning according to impulse signals in the singing voice track, and determining singing time information in the singing voice track;
and determining non-singing time information in the singing audio track based on the singing time information.
In some embodiments, the first determining module is further configured to:
acquiring a trained audio event detection model;
inputting the singing voice track into the audio event detection model to obtain singing time information in the singing voice track;
and determining non-singing time information in the singing audio track based on the singing time information.
In some embodiments, the first determining module is further configured to:
Acquiring a frequency spectrum characteristic sequence of the accompaniment track and a preset sliding window length;
carrying out sliding window processing on the frequency spectrum characteristic sequence based on the sliding window length to obtain a plurality of sliding window results;
dividing each sliding window result into N sub-results, and determining the energy average value of the N sub-results in each sliding window result;
determining a target sliding window result from a plurality of sliding window results based on the energy mean value of N sub-results in each sliding window result;
a reference set of starting and ending points is determined based on the target sliding window result.
In some embodiments, the first determining module is further configured to:
determining a sliding window result of which the energy mean value of the N sub-results meets a decreasing condition as a first target sliding window result;
determining a sliding window result of which the energy mean value of the N sub-results meets the increasing condition as a second target sliding window result;
determining a right boundary point of the first target sliding window result as a reference termination point;
and determining a left boundary point of the second target sliding window result as a reference starting point.
In some embodiments, the non-singing time information includes a non-singing time interval, and the second determining module is further configured to:
and determining a target start-stop point set based on the reference start-stop points in the reference start-stop point set that fall into the non-singing time interval, wherein the target start-stop point set comprises a target start point and a target end point.
In some embodiments, the third determining module is further configured to:
determining a first playing time length of the video data to be matched;
determining each interval duration between a target starting point and each target ending point located after the target starting point in a target starting point set;
determining each time length difference between each interval time length and the first playing time length;
when there is at least one target interval duration whose duration difference is smaller than a preset threshold, determining the target start point corresponding to the minimum duration difference among the at least one target interval duration as the soundtrack start point, and determining the corresponding target end point as the soundtrack end point.
In some embodiments, the apparatus further comprises:
the first acquisition module is used for acquiring the track type of the music data to be processed when there is no target interval duration whose duration difference is smaller than the preset threshold;
an output module configured to determine a plurality of candidate music data based on the track type, and output a plurality of music identifications of the plurality of candidate music data;
a fifth determining module for determining a selected target music identifier based on receiving a selecting operation for the music identifier;
And a sixth determining module, configured to determine music data to be processed based on the target music identifier.
In some embodiments, the apparatus further comprises:
a seventh determining module, configured to determine a second playing duration of the target score data based on the score start point and the score end point;
the adjusting module is used for adjusting the playing time length of the video data to be assembled based on the second playing time length to obtain adjusted video data;
and the synthesis module is used for synthesizing the adjusted video data and the target score data to obtain the score video data.
It should be noted here that: the description of the embodiments of the multimedia data processing apparatus above, similar to the description of the method above, has the same advantageous effects as the embodiments of the method. For technical details not disclosed in the embodiments of the multimedia data processing apparatus of the present application, those skilled in the art will understand with reference to the description of the embodiments of the method of the present application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the multimedia data processing method according to the embodiment of the present application.
The embodiments of the present application provide a computer readable storage medium storing executable instructions which, when executed by a processor, cause the processor to perform the multimedia data processing method provided by the embodiments of the present application, for example the multimedia data processing method shown in fig. 3, fig. 4, and fig. 5.
In some embodiments, the computer readable storage medium may be an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; or it may be any device including one of the above memories or any combination thereof.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, HyperText Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computer device or on multiple computer devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (14)

1. A method of multimedia data processing, the method comprising:
acquiring video data to be matched with music and music data to be processed, and performing information source separation on the music data to obtain a singing track and an accompaniment track;
determining non-singing time information based on the singing track, and determining a reference start-stop point set based on the accompaniment track;
determining a target starting and ending point set from the reference starting and ending point set based on the non-singing time information;
determining a soundtrack starting point and a soundtrack ending point from the target starting point set based on a first playing time length of the video data to be soundtrack;
And determining target music data between the music starting point and the music ending point in the music data as target music data of the video data.
2. The method of claim 1, wherein said separating the music data from the source to obtain a singing track and an accompaniment track comprises:
performing time-frequency conversion on the music data to obtain a spectrum amplitude spectrum of the music data;
extracting features of the spectrum amplitude spectrum to obtain singing features and accompaniment features, and combining the singing features and the accompaniment features to obtain combined features;
determining a singing mask and an accompaniment mask based on the combined features, the singing features, and the accompaniment features;
performing mask calculation on the spectrum amplitude spectrum by using the singing mask and the accompaniment mask respectively to correspondingly obtain singing spectrum amplitude and accompaniment spectrum amplitude;
and respectively performing frequency-time conversion on the singing frequency spectrum amplitude and the accompaniment frequency spectrum amplitude to correspondingly obtain a singing sound track and an accompaniment sound track.
3. The method of claim 1, wherein said determining non-singing time information based on the singing track comprises:
Performing voice activity detection on the singing voice track, positioning according to impulse signals in the singing voice track, and determining singing time information in the singing voice track;
and determining non-singing time information in the singing audio track based on the singing time information.
4. The method of claim 1, wherein said determining non-singing time information based on the singing track comprises:
acquiring a trained audio event detection model;
inputting the singing voice track into the audio event detection model to obtain singing time information in the singing voice track;
and determining non-singing time information in the singing audio track based on the singing time information.
5. The method of claim 1, wherein said determining a set of reference start-stop points based on said accompaniment tracks comprises:
acquiring a frequency spectrum characteristic sequence of the accompaniment track and a preset sliding window length;
carrying out sliding window processing on the frequency spectrum characteristic sequence based on the sliding window length to obtain a plurality of sliding window results;
dividing each sliding window result into N sub-results, and determining the energy average value of the N sub-results in each sliding window result;
Determining a target sliding window result from a plurality of sliding window results based on the energy mean value of N sub-results in each sliding window result;
a reference set of starting and ending points is determined based on the target sliding window result.
6. The method of claim 5, wherein determining a target sliding window result from a plurality of sliding window results based on an energy mean of N sub-results in the respective sliding window result comprises:
determining a sliding window result of which the energy mean value of the N sub-results meets a decreasing condition as a first target sliding window result;
determining a sliding window result of which the energy mean value of the N sub-results meets the increasing condition as a second target sliding window result;
the determining a reference start-stop point set based on the target sliding window result includes:
determining a right boundary point of the first target sliding window result as a reference termination point;
and determining a left boundary point of the second target sliding window result as a reference starting point.
7. The method of claim 1, wherein the non-singing time information includes a non-singing time interval, wherein the determining a target start-stop set from the reference start-stop set based on the non-singing time information comprises:
and determining a target start-stop point set based on the reference start-stop points in the reference start-stop point set that fall into the non-singing time interval, wherein the target start-stop point set comprises a target start point and a target end point.
8. The method of claim 7, wherein determining a soundtrack start point and a soundtrack end point from the target set of starting points based on the first play duration of the video data to be soundtrack comprises:
determining a first playing time length of the video data to be matched;
determining each interval duration between a target starting point and each target ending point located after the target starting point in a target starting point set;
determining each time length difference between each interval time length and the first playing time length;
when there is at least one target interval duration whose duration difference is smaller than a preset threshold, determining the target start point corresponding to the minimum duration difference among the at least one target interval duration as the soundtrack start point, and determining the corresponding target end point as the soundtrack end point.
9. The method of claim 8, wherein the method further comprises:
when there is no target interval duration whose duration difference is smaller than the preset threshold, obtaining the track type of the music data to be processed;
Determining a plurality of candidate music data based on the track type, and outputting a plurality of music identifications of the plurality of candidate music data;
determining a selected target music identifier based on receiving a selection operation for the music identifier;
and determining the music data to be processed based on the target music identification.
10. The method according to any one of claims 1 to 8, further comprising:
determining a second playing duration of the target score data based on the score start point and the score end point;
adjusting the playing time length of the video data to be matched based on the second playing time length to obtain adjusted video data;
and synthesizing the adjusted video data and the target match data to obtain match video data.
11. A multimedia data processing apparatus, the apparatus comprising:
the information source separation module is used for acquiring video data to be assembled and music data to be processed, and carrying out information source separation on the music data to obtain a singing sound track and an accompaniment sound track;
the first determining module is used for determining non-singing time information based on the singing voice track and determining a reference start-stop point set based on the accompaniment voice track;
a second determining module, configured to determine a target start-stop point set from the reference start-stop point set based on the non-singing time information;
a third determining module, configured to determine a soundtrack start point and a soundtrack end point from the target start-stop point set based on a first playing duration of the video data to be soundtrack;
and a fourth determining module for determining target music data between the soundtrack start point and the soundtrack end point in the music data as target soundtrack data of the video data.
12. A computer device, the computer device comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 10 when executing executable instructions stored in said memory.
13. A computer readable storage medium storing executable instructions which when executed by a processor implement the method of any one of claims 1 to 10.
14. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the method of any one of claims 1 to 10.
CN202111346983.0A 2021-11-15 2021-11-15 Multimedia data processing method, device, equipment and computer readable storage medium Pending CN116127125A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111346983.0A CN116127125A (en) 2021-11-15 2021-11-15 Multimedia data processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111346983.0A CN116127125A (en) 2021-11-15 2021-11-15 Multimedia data processing method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116127125A true CN116127125A (en) 2023-05-16

Family

ID=86310481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111346983.0A Pending CN116127125A (en) 2021-11-15 2021-11-15 Multimedia data processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116127125A (en)

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN106486128B (en) Method and device for processing double-sound-source audio data
CN110970014B (en) Voice conversion, file generation, broadcasting and voice processing method, equipment and medium
EP3824461B1 (en) Method and system for creating object-based audio content
CN111091800B (en) Song generation method and device
CN110675886B (en) Audio signal processing method, device, electronic equipment and storage medium
US20210335364A1 (en) Computer program, server, terminal, and speech signal processing method
CN111370024B (en) Audio adjustment method, device and computer readable storage medium
CN110136729B (en) Model generation method, audio processing method, device and computer-readable storage medium
CN108877766A (en) Song synthetic method, device, equipment and storage medium
CN113691909B (en) Digital audio workstation with audio processing recommendations
CN104575487A (en) Voice signal processing method and device
CN111192594B (en) Method for separating voice and accompaniment and related product
WO2015092492A1 (en) Audio information processing
US20240004606A1 (en) Audio playback method and apparatus, computer readable storage medium, and electronic device
CN112885326A (en) Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN116127125A (en) Multimedia data processing method, device, equipment and computer readable storage medium
WO2023005193A1 (en) Subtitle display method and device
CN107025902B (en) Data processing method and device
CN113573136B (en) Video processing method, video processing device, computer equipment and storage medium
US11195511B2 (en) Method and system for creating object-based audio content
US9412395B1 (en) Narrator selection by comparison to preferred recording features
CN114783408A (en) Audio data processing method and device, computer equipment and medium
JP2006189799A (en) Voice inputting method and device for selectable voice pattern
CN113781989A (en) Audio animation playing and rhythm stuck point identification method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40086461

Country of ref document: HK