CN112289344A

CN112289344A - Method and device for determining drum point waveform and computer storage medium

Info

Publication number: CN112289344A
Application number: CN202011195349.7A
Authority: CN
Inventors: 王征韬
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2021-01-29

Abstract

The embodiments of the present application disclose a method and a device for determining a drum point waveform, and a computer storage medium, which are used to determine the audio segment corresponding to a drum point waveform more accurately. The method comprises: determining a candidate frame of a target drum point waveform; determining, among k audio frames, the audio frame with the highest energy value as a reference point; and then, relative to the reference point, determining an audio frame whose energy value is smaller than a first threshold as the start frame of the drum point waveform and an audio frame whose energy value is smaller than a second threshold as its end frame, thereby determining the drum point waveform. Because the drum point waveform is determined from the energy values of the audio frames in the candidate frame, the method adds a determination criterion and method compared with fixed-length selection, so drum point waveforms are selected more accurately.

Description

Method and device for determining drum point waveform and computer storage medium

Technical Field

The embodiments of the present application relate to the field of data processing, and in particular to a method and a device for determining a drum point waveform, and a computer storage medium.

Background

A drum point waveform is the waveform of the short local audio segment corresponding to a drum point in a song. Drum points are the repeated beats in a song and are audio segments with large amplitude; a drum point waveform is usually 100 to 500 ms long. Different kinds of drums produce different drum point waveforms, so the kind of drum can be determined by recognizing its drum point waveforms, and the waveforms can serve as training data for a drum classification model, that is, a network model that recognizes the kind of drum.

The existing method for determining drum point waveforms is fixed-length selection, that is, drum point waveforms are cut from the source song with a candidate frame of a specified length. However, drum points in different song genres and performance scenarios last for different durations, so the drum point waveforms in a source song are not all the same length. A fixed-length cut may therefore break the integrity of a waveform, leaving part of the drum point waveform outside the cut, or it may include audio segments that do not belong to the drum point waveform. Fixed-length selection therefore cannot select drum point waveforms accurately.

Disclosure of Invention

The embodiments of the present application provide a method and a device for determining a drum point waveform, and a computer storage medium, which are used to determine the audio segment corresponding to a drum point waveform more accurately.

A first aspect of the embodiments of the present application provides a method for determining a drum point waveform, including:

acquiring a target audio, wherein the target audio is the drum track of a source song and comprises audio segments corresponding to a plurality of drum point waveforms in the source song;

performing drum point detection on the target audio according to a drum point detection algorithm to obtain a drum point sequence of the target audio, wherein the drum point sequence comprises a plurality of drum points;

determining a candidate frame of a target drum point waveform of the target drum point according to the target drum point, wherein the target drum point is any one of the plurality of drum points;

determining, among k audio frames in the candidate frame centered on the audio frame corresponding to the target drum point, the audio frame with the highest energy value as a reference point, wherein k is a positive integer;

determining that an audio frame with an energy value smaller than a first threshold exists in the audio frames before the reference point in the candidate frame, and an audio frame with an energy value smaller than a second threshold exists in the audio frames after the reference point in the candidate frame, wherein the first threshold is m% of the energy value of the reference point, the second threshold is n% of the energy value of the reference point, 0 < m < 100, and 0 < n < 100;

and determining the audio frame whose energy value is smaller than the first threshold as a start frame of the target drum point waveform, and determining the audio frame whose energy value is smaller than the second threshold as an end frame of the target drum point waveform.

A second aspect of the embodiments of the present application provides a drum point waveform determining apparatus, including:

an acquisition unit, configured to acquire a target audio, wherein the target audio is the drum track of a source song and comprises audio segments corresponding to a plurality of drum point waveforms in the source song;

a drum point detection unit, configured to perform drum point detection on the target audio according to a drum point detection algorithm to obtain a drum point sequence of the target audio, wherein the drum point sequence comprises a plurality of drum points;

a first determining unit, configured to determine a candidate frame of a target drum point waveform of a target drum point according to the target drum point, where the target drum point is any one of the drum points;

a second determining unit, configured to determine, among k audio frames in the candidate frame centered on the audio frame corresponding to the target drum point, the audio frame with the highest energy value as a reference point, where k is a positive integer;

a third determining unit, configured to determine that there is an audio frame with an energy value smaller than a first threshold in the audio frames before the reference point in the candidate frame, and an audio frame with an energy value smaller than a second threshold in the audio frames after the reference point in the candidate frame, where the first threshold is m% of the energy value of the reference point, the second threshold is n% of the energy value of the reference point, 0 < m < 100, and 0 < n < 100;

a fourth determining unit, configured to determine the audio frame whose energy value is smaller than the first threshold as a start frame of the target drum point waveform, and determine the audio frame whose energy value is smaller than the second threshold as an end frame of the target drum point waveform.

A third aspect of the embodiments of the present application provides a drum point waveform determining apparatus, including:

a processor, a memory, a bus, and an input/output device;

the processor is connected to the memory and the input/output device;

the bus is connected to the processor, the memory, and the input/output device respectively;

the processor is configured to: acquire a target audio, wherein the target audio is the drum track of a source song and comprises audio segments corresponding to a plurality of drum point waveforms in the source song; perform drum point detection on the target audio according to a drum point detection algorithm to obtain a drum point sequence of the target audio, wherein the drum point sequence comprises a plurality of drum points; determine, according to a target drum point, a candidate frame of the target drum point waveform of the target drum point; determine, among k audio frames in the candidate frame centered on the audio frame corresponding to the target drum point, the audio frame with the highest energy value as a reference point, wherein k is a positive integer; determine that an audio frame with an energy value smaller than a first threshold exists among the audio frames before the reference point in the candidate frame, and that an audio frame with an energy value smaller than a second threshold exists among the audio frames after the reference point in the candidate frame, wherein the first threshold is m% of the energy value of the reference point, the second threshold is n% of the energy value of the reference point, 0 < m < 100, and 0 < n < 100; and determine the audio frame whose energy value is smaller than the first threshold as a start frame of the target drum point waveform, and the audio frame whose energy value is smaller than the second threshold as an end frame of the target drum point waveform.

A fourth aspect of embodiments of the present application provides a computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of the first aspect.

According to the foregoing technical solutions, the embodiments of the present application have the following advantages:

In the embodiments of the present application, a candidate frame of a target drum point waveform is determined, and the audio frame with the highest energy value among k audio frames is determined as a reference point. Relative to the reference point, an audio frame whose energy value is smaller than a first threshold is then determined as the start frame of the drum point waveform, and an audio frame whose energy value is smaller than a second threshold as its end frame, thereby determining the drum point waveform. Because the drum point waveform is determined from the energy values of the audio frames in the candidate frame, this approach adds a determination criterion and method compared with fixed-length selection, so drum point waveforms are selected more accurately.

Drawings

FIG. 1 is a schematic flow chart of a method for determining a drum point waveform according to an embodiment of the present application;

FIG. 2 is another schematic flow chart of a method for determining a drum point waveform according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a drum point sequence of target audio in an embodiment of the present application;

FIG. 4 is a schematic diagram of two adjacent drum point waveforms in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a drum point waveform determining apparatus according to an embodiment of the present application;

FIG. 6 is another schematic structural diagram of the drum point waveform determining apparatus according to an embodiment of the present application.

Detailed Description

Referring to FIG. 1, an embodiment of the method for determining a drum point waveform in the embodiments of the present application includes:

101. Acquiring a target audio;

The drum point waveform determination method of this embodiment may be applied to a drum point waveform determining apparatus, which may be any computer device capable of data processing, such as a server or a terminal.

The target audio is the drum track of the source song. The drum track can be separated from the source song with a source separation algorithm such as Spleeter to obtain the target audio. The target audio therefore has the same duration as the source song and comprises the audio segments corresponding to the drum point waveforms in the source song.

102. Performing drum point detection on the target audio according to a drum point detection algorithm to obtain a drum point sequence of the target audio;

In this embodiment, the positions of the drum points in the target audio need to be determined; that is, drum point detection is performed on the target audio with a drum point detection algorithm to obtain its drum point sequence. The drum point sequence is a list of the drum points in the target audio arranged in time order, from which the time point of each drum point can be read.

The drum point detection algorithm may combine envelope extraction with peak detection. The envelope may be the energy envelope or the waveform envelope of the drum points, and it may be extracted with the Hilbert transform or by downsampling combined with smoothing. Peak detection may be based on a threshold rule, or on a scheme combining the wavelet transform with saliency.
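The paragraph above leaves the concrete detector open. The following Python sketch shows one possible instance of the "downsampling plus smoothing" envelope combined with threshold-rule peak picking. The function names, the max-absolute-value downsampling, the moving-average smoothing and the `threshold`/`min_gap` parameters are illustrative assumptions, not the detector used in this application.

```python
def smoothed_envelope(samples, hop, win):
    """Hypothetical envelope: downsample |x| (max over each hop), then smooth with a moving average."""
    # Downsampling step: one envelope value per `hop` input samples
    rect = [max(abs(s) for s in samples[i:i + hop]) for i in range(0, len(samples), hop)]
    # Smoothing step: centered moving average over `win` envelope values (clipped at the edges)
    half = win // 2
    env = []
    for i in range(len(rect)):
        lo, hi = max(0, i - half), min(len(rect), i + half + 1)
        env.append(sum(rect[lo:hi]) / (hi - lo))
    return env


def pick_peaks(env, threshold, min_gap):
    """Threshold-rule peak picking: local maxima >= threshold, at least `min_gap` values apart."""
    peaks = []
    for i in range(1, len(env) - 1):
        is_local_max = env[i] >= env[i - 1] and env[i] > env[i + 1]
        if env[i] >= threshold and is_local_max:
            # Keep the earlier peak when two candidates are closer than min_gap
            if not peaks or i - peaks[-1] >= min_gap:
                peaks.append(i)
    return peaks
```

A peak index `p` corresponds to the time `p * hop / sample_rate` seconds in the target audio, which gives the drum point sequence.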

103. Determining a candidate frame of a target drum point waveform of the target drum point according to the target drum point;

A drum point is a beat in the rhythm of a song and generally corresponds to an audio segment of a certain length, namely the drum point waveform. The target drum point is any drum point in the drum point sequence. To delimit the interval in which the target drum point waveform lies, a candidate frame of the target drum point waveform is determined first, and the actual extent of the target drum point waveform is then determined within the candidate frame. In this embodiment, the length of the candidate frame may be preset; for example, since a drum point waveform is generally 100 to 500 ms long, the length of the candidate frame may be limited to a value in the range of 100 to 500 ms.

To ensure that the candidate frame corresponds to only one drum point waveform, the candidate frame contains only one drum point, so that the waveforms of other drum points are not selected when the actual interval of the drum point waveform is determined.

104. Determining, among the k audio frames in the candidate frame centered on the audio frame corresponding to the target drum point, the audio frame with the highest energy value as a reference point;

Because a drum point is a beat in the music and has a large amplitude, the audio frames near a drum point generally have high energy, and the highest-energy audio frame of a drum point waveform is generally located near the drum point. Therefore, after the candidate frame is determined, the energy value of each of the k audio frames in the candidate frame centered on the audio frame corresponding to the target drum point is calculated, and the audio frame with the highest energy value among them is determined as the reference point. The reference point then serves as the reference for determining the start frame and the end frame of the target drum point waveform.
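A minimal sketch of the reference point selection, assuming per-frame energy values have already been computed and that frames are addressed by integer index. Clipping the k-frame window to the candidate frame and breaking ties in favor of the earliest frame are assumptions the text does not specify; `reference_point` is a hypothetical name.

```python
def reference_point(energies, drum_frame, k, box_start, box_end):
    """Index of the highest-energy frame among the k frames centered on drum_frame.

    box_start/box_end are inclusive frame indices of the candidate frame.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    if not box_start <= drum_frame <= box_end:
        raise ValueError("the drum point frame must lie inside the candidate frame")
    half = k // 2
    lo = max(box_start, drum_frame - half)
    hi = min(box_end, drum_frame + (k - 1 - half))  # k frames in total before clipping
    # max() returns the first index with the highest energy, i.e. the earliest on ties
    return max(range(lo, hi + 1), key=lambda i: energies[i])
```

With k = 5 (the preferred value in the second embodiment), the frames from `drum_frame - 2` to `drum_frame + 2` are examined.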

105. Determining that audio frames with energy values smaller than a first threshold exist in the audio frames before the reference point in the candidate frame, and audio frames with energy values smaller than a second threshold exist in the audio frames after the reference point in the candidate frame;

Because the audio frames in a drum point waveform show a pronounced energy decay, this embodiment determines the extent of the target drum point waveform from the energy of the audio frames. After the reference point is determined, it is determined whether any audio frame before the reference point in the candidate frame has an energy value smaller than the first threshold, and whether any audio frame after the reference point in the candidate frame has an energy value smaller than the second threshold. If both such audio frames exist, step 106 is performed.

The audio frames before the reference point in the candidate frame are those earlier in time than the reference point, and the audio frames after it are those later in time. For example, if the candidate frame starts at 30.2 s and ends at 30.5 s and the reference point is at 30.25 s, the audio frames before the reference point span 30.2 s to 30.25 s, and the audio frames after it span 30.25 s to 30.5 s.

Because the reference point is the audio frame with the highest energy value among the k frames around the drum point and the drum point waveform decays in energy, the start frame and the end frame are found by searching outward from the reference point for audio frames whose energy is sufficiently lower. In this embodiment, the audio frames in the candidate frame are searched for an audio frame before the reference point whose energy value is smaller than a first threshold and an audio frame after the reference point whose energy value is smaller than a second threshold, where the first threshold is m% of the energy value of the reference point, the second threshold is n% of the energy value of the reference point, 0 < m < 100, and 0 < n < 100. m and n are empirical values that can be derived from a number of real drum point waveform samples, namely from the lowest and highest energy values of the audio frames in those samples. Preferably, m and n may both be 15.
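The search described above can be sketched as follows, assuming frame energies are given as a list and the candidate frame spans frame indices `box_start` to `box_end` inclusive. The text does not say which qualifying frame to pick when several fall below a threshold; this sketch assumes the one nearest the reference point, consistent with the frame-by-frame scan of the second embodiment. `find_boundaries` is a hypothetical name.

```python
def find_boundaries(energies, ref, box_start, box_end, m=15.0, n=15.0):
    """Steps 105-106: return (start_frame, end_frame), or None if either threshold is never crossed."""
    first_threshold = energies[ref] * m / 100.0   # m% of the reference point energy
    second_threshold = energies[ref] * n / 100.0  # n% of the reference point energy
    # Scan backward from the reference point toward the start of the candidate frame
    start = next((i for i in range(ref - 1, box_start - 1, -1)
                  if energies[i] < first_threshold), None)
    # Scan forward from the reference point toward the end of the candidate frame
    end = next((i for i in range(ref + 1, box_end + 1)
                if energies[i] < second_threshold), None)
    if start is None or end is None:
        return None  # the condition of step 105 is not met
    return start, end
```

With energies `[0.05, 0.3, 0.8, 1.0, 0.6, 0.3, 0.12, 0.1]` and the reference point at index 3, both thresholds are 0.15, so the drum point waveform runs from frame 0 to frame 6.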

106. Determining the audio frame whose energy value is smaller than the first threshold as the start frame of the target drum point waveform, and determining the audio frame whose energy value is smaller than the second threshold as the end frame of the target drum point waveform;

If both audio frames are found to exist in step 105, the audio frame whose energy value is smaller than the first threshold is determined as the start frame of the drum point waveform, and the audio frame whose energy value is smaller than the second threshold is determined as its end frame. The target drum point waveform is then determined by the start frame and the end frame: the audio segment from the start frame to the end frame is the audio segment corresponding to the target drum point waveform.

In this embodiment, a candidate frame of the target drum point waveform is determined, and the audio frame with the highest energy value among the k audio frames is determined as a reference point. Relative to the reference point, an audio frame whose energy value is smaller than a first threshold is then determined as the start frame of the drum point waveform, and an audio frame whose energy value is smaller than a second threshold as its end frame, thereby determining the drum point waveform. Because the drum point waveform is determined from the energy values of the audio frames in the candidate frame, this embodiment adds a determination criterion and method compared with fixed-length selection, so drum point waveforms are selected more accurately.

The embodiment shown in FIG. 1 is described in further detail below. Referring to FIG. 2, another embodiment of the method for determining a drum point waveform in the embodiments of the present application includes:

201. Acquiring a target audio;

In this embodiment, to further reduce interference from audio in the target audio that is unrelated to the drum points, the target audio may be downsampled, preferably to a sampling rate of 8 kHz, which removes content above 4 kHz and thereby filters out audio content unrelated to the drum points.
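As a rough illustration, the sketch below decimates by an integer factor after a box-filter (block-average) low-pass step. It assumes a 48 kHz source, which divides evenly into 8 kHz; a 44.1 kHz source would need a proper rational resampler, and a box filter is a much weaker anti-aliasing filter than a production resampler would use. `downsample` is a hypothetical name.

```python
def downsample(samples, src_rate, dst_rate=8000):
    """Crude integer-factor downsampling: average each block of `factor` samples."""
    if src_rate % dst_rate != 0:
        raise ValueError("this sketch only handles integer rate ratios")
    factor = src_rate // dst_rate
    out = []
    for i in range(0, len(samples), factor):
        block = samples[i:i + factor]
        # Block averaging acts as a simple low-pass filter before keeping one value per block
        out.append(sum(block) / len(block))
    return out
```

One second of 48 kHz audio yields 8000 output samples; a tone alternating at the source Nyquist rate averages out to zero.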

202. Performing drum point detection on the target audio according to a drum point detection algorithm to obtain a drum point sequence of the target audio;

the operation performed in this step is similar to the operation performed in step 102 in the embodiment shown in fig. 1, and is not repeated here.

203. Determining a candidate frame of a target drum point waveform of the target drum point according to the target drum point;

In this embodiment, the candidate frame may specifically be determined from the relative positions of the drum points. When the target drum point is not the first drum point in the drum point sequence, the first candidate start point of the candidate frame is the time point of the previous drum point plus a first correction value; when the target drum point is the first drum point, the first candidate start point is zero.

Observation of many drum point waveform samples shows that the drum point is generally located at the head of its waveform and that a drum point waveform is generally not shorter than 50 ms. To keep the start point of the candidate frame from falling within the waveform of the previous drum point, the first candidate start point is corrected by adding the first correction value to the time point of the previous drum point, that is, by shifting that time point later by a certain amount, and the result is taken as the first candidate start point. Since a drum point waveform is generally not shorter than 50 ms, the first correction value is preferably 50 ms.

For example, FIG. 3 shows the drum point sequence of the target audio, in which each drum point is denoted by its time point. When drum point t is the target drum point, its previous drum point is tpre, and the first candidate start point of the candidate frame is the time point of tpre plus the first correction value, that is, tpre + 50 ms. When tpre is the target drum point, the first candidate start point is zero, since tpre is the first drum point.

Similarly, to determine the end point of the candidate frame, when the target drum point is not the last drum point in the drum point sequence, the first candidate end point of the candidate frame is the time point of the next drum point minus a second correction value; when the target drum point is the last drum point, the first candidate end point is the last time point of the target audio.

Observation of many drum point waveform samples also shows that the interval between drum points in a song is not shorter than 20 ms, even at the fastest tempo. To keep the end point of the candidate frame from falling within the waveform of the next drum point, the first candidate end point is corrected by subtracting the second correction value from the time point of the next drum point, that is, by shifting that time point earlier by a certain amount, and the result is taken as the first candidate end point. Since the drum point interval is not shorter than 20 ms, the second correction value is preferably 20 ms.

For example, in FIG. 3, when drum point t is the target drum point and its next drum point is tnxt, the first candidate end point of the candidate frame is the time point of tnxt minus the second correction value, that is, tnxt - 20 ms. If the last drum point in the target audio is the target drum point, the first candidate end point is the last time point of the target audio.
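The first candidate start and end points can be sketched as below, assuming the drum point sequence is a sorted list of drum point times in milliseconds and using the preferred 50 ms and 20 ms correction values as defaults. `first_candidates` is a hypothetical helper.

```python
def first_candidates(drum_times_ms, idx, audio_end_ms,
                     first_correction_ms=50, second_correction_ms=20):
    """First candidate (start, end), in ms, of the candidate frame for drum_times_ms[idx]."""
    if idx == 0:
        start = 0  # first drum point: start at the beginning of the target audio
    else:
        start = drum_times_ms[idx - 1] + first_correction_ms  # tpre + 50 ms
    if idx == len(drum_times_ms) - 1:
        end = audio_end_ms  # last drum point: end at the last time point of the target audio
    else:
        end = drum_times_ms[idx + 1] - second_correction_ms  # tnxt - 20 ms
    return start, end
```

With drum points at 1000, 1400 and 2100 ms in a 3000 ms track, the middle drum point gets the first candidates (1050, 2080).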

Determining the first candidate start point and the first candidate end point in this way narrows the selection range of the candidate frame, so the finally determined interval of the drum point waveform is more accurate.

After the first candidate start point and the first candidate end point are determined, it can be checked whether the duration between them falls within a preset duration range, which may be the general length of a drum point waveform mentioned above (100 to 500 ms). If it does, the candidate frame defined by the two points conforms to the general characteristics of a drum point waveform, so the first candidate start point is taken as the target start point of the candidate frame and the first candidate end point as its target end point. If it does not, the candidate frame defined by the two points is too long to conform to those characteristics, and its selection range needs to be narrowed further.

Therefore, when the duration is not within the preset duration range, a second candidate start point and a second candidate end point of the candidate frame can be determined, in this embodiment from the general length of a drum point waveform: the target time point of the target drum point minus a first limit value is taken as the second candidate start point, that is, the target time point shifted earlier by a certain amount; and the target time point plus a second limit value is taken as the second candidate end point, that is, the target time point shifted later by a certain amount.

The first limit value and the second limit value are empirical values determined from the general length of drum point waveforms. From many drum point waveform samples, the typical duration from the start of the waveform to the drum point and from the drum point to the end of the waveform can be summarized; the former is the first limit value and the latter is the second limit value.

For example, in FIG. 3, if the length of the drum point waveform is set to 300 ms, the duration from the second candidate start point to the target time point plus the duration from the target time point to the second candidate end point is 300 ms. With an experimentally summarized first limit value of 100 ms and second limit value of 200 ms, when t is the target drum point, the second candidate start point is t - 100 ms and the second candidate end point is t + 200 ms.

Then the later of the first and second candidate start points (their maximum) is taken as the target start point of the candidate frame, and the earlier of the first and second candidate end points (their minimum) is taken as the target end point, which further narrows the selection range of the candidate frame.
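A sketch of the final candidate frame, assuming times in milliseconds, the 100 to 500 ms preset duration range, and the experimentally summarized limit values of 100 ms and 200 ms as defaults. `candidate_box` is a hypothetical name; the first candidates are passed in as arguments.

```python
def candidate_box(first_start, first_end, target_ms,
                  min_len=100, max_len=500, first_limit=100, second_limit=200):
    """Target (start, end), in ms, of the candidate frame for a drum point at target_ms."""
    if min_len <= first_end - first_start <= max_len:
        # The first candidates already match the general drum point waveform length
        return first_start, first_end
    second_start = target_ms - first_limit   # t - 100 ms
    second_end = target_ms + second_limit    # t + 200 ms
    # Later of the two start points, earlier of the two end points
    return max(first_start, second_start), min(first_end, second_end)
```

For t = 1400 ms with first candidates (1050, 2080), the 1030 ms span is outside the range, the second candidates are (1300, 1600), and the resulting candidate frame is (1300, 1600).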

Determining the start point and the end point of the candidate frame in this way brings the candidate frame closer to the actual interval of the drum point waveform, so the finally determined drum point waveform is more accurate.

204. Determining, among the k audio frames in the candidate frame centered on the audio frame corresponding to the target drum point, the audio frame with the highest energy value as a reference point;

In this embodiment, the energy values of the audio frames in the candidate frame can be calculated in various ways, for example, from an envelope extracted with the Hilbert transform or from an envelope extracted with a low-pass filter.
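For the low-pass variant, one simple possibility is to rectify the signal, run it through a one-pole low-pass filter, and average the envelope over each frame. The filter form, the `alpha` smoothing coefficient and the use of non-overlapping frames are assumptions made for illustration; both function names are hypothetical. For the Hilbert variant, `scipy.signal.hilbert` returns the analytic signal, whose magnitude gives the envelope.

```python
def lowpass_envelope(samples, alpha=0.05):
    """Rectify, then apply a one-pole low-pass filter: y[n] = y[n-1] + alpha * (|x[n]| - y[n-1])."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    envelope, y = [], 0.0
    for x in samples:
        y += alpha * (abs(x) - y)
        envelope.append(y)
    return envelope


def frame_energies(envelope, frame_len):
    """Energy value of each non-overlapping frame, taken here as its mean envelope level."""
    energies = []
    for i in range(0, len(envelope), frame_len):
        frame = envelope[i:i + frame_len]
        energies.append(sum(frame) / len(frame))
    return energies
```

A smaller `alpha` gives a smoother envelope that reacts more slowly to the drum attack.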

Preferably, in this embodiment, the reference point is determined in 5 audio frames centered on the audio frame corresponding to the target drum point, so that the audio frame with the highest energy value can be accurately located.

205. Judging whether an audio frame with an energy value smaller than a first threshold exists in the audio frames positioned before the reference point in the candidate frame, if so, executing a step 206; if not, go to step 207;

the present embodiment determines the energy values of the audio frames before and after the reference point frame by frame based on the reference point determined in step 204, that is, with the reference point as a starting point, toward the target starting point of the candidate frame, sequentially calculates the energy value of each audio frame, that is, determines the energy value of the audio frame before the reference point frame by frame, and determines whether the energy value of the audio frame is less than the first threshold.

206. Determining the audio frame with the energy value smaller than a first threshold value as a starting frame of the target drumhead waveform;

when the audio frame with the energy value smaller than the first threshold value is obtained through calculation, the audio frame with the energy value smaller than the first threshold value is determined as the starting frame.

207. Taking the audio frame with the minimum energy value among the audio frames located before the reference point in the candidate frame as the starting frame;

Because the energy of the audio frames in a drum point waveform decays, when no audio frame before the reference point has an energy value smaller than the first threshold (that is, the energy values of all audio frames before the reference point in the candidate frame are not smaller than the first threshold), the audio frame with the smallest energy value among the audio frames before the reference point may be taken as the starting frame.
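Steps 205 to 207 can be sketched as a backward scan that computes energies on demand and stops at the first frame below the first threshold, falling back to the minimum-energy frame otherwise. This is an illustration, not the application's code: `energy_of` is a hypothetical callable returning the energy of frame `i`, `target_start` is the frame index of the candidate frame's target starting point, and the first threshold is taken as m% of the reference energy (0 < m < 100), as described for the third determining unit 505. Returning the reference frame when no frame precedes it is an added assumption.

```python
def find_start_frame(energy_of, ref, target_start, m):
    """Steps 205-207: locate the starting frame by scanning back from ref."""
    if not 0 < m < 100:
        raise ValueError("m must satisfy 0 < m < 100")
    threshold = energy_of(ref) * m / 100.0  # first threshold = m% of reference energy
    # ref is returned if no frame precedes it (assumption).
    best_frame, best_energy = ref, float("inf")
    for i in range(ref - 1, target_start - 1, -1):
        e = energy_of(i)
        if e < threshold:
            return i  # step 206: nearest frame below the threshold; stop computing
        if e < best_energy:  # remember the minimum for step 207
            best_frame, best_energy = i, e
    return best_frame  # step 207: minimum-energy frame before ref
```

The ending frame (steps 208 to 210) would be found symmetrically, scanning forward toward the target end point and comparing against the second threshold.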

208. Judging whether any audio frame located after the reference point in the candidate frame has an energy value smaller than a second threshold; if so, executing step 209; if not, executing step 210;

Similarly, starting from the reference point and moving toward the target end point of the candidate frame, the energy value of each audio frame after the reference point is calculated in turn, and it is judged whether that energy value is smaller than the second threshold.

209. Determining the audio frame whose energy value is smaller than the second threshold as the ending frame of the target drum point waveform;

As soon as an audio frame whose energy value is smaller than the second threshold appears, that audio frame is determined as the ending frame. Once the starting frame and the ending frame have been determined, the calculation of energy values can stop, and the indices of the starting frame and the ending frame can be returned to identify their positions.

In a preferred embodiment, the energy value of an audio frame may be represented by the root mean square error (RMSE) of the audio frame, that is, the root mean square of its sample values. Specifically, within the candidate frame established in step 203, a sliding window of a certain size is moved repeatedly by a preset step length, where the size of the sliding window is expressed as the number of sample points it covers. After each move, the root mean square error of the preset number of sample points covered by the sliding window is calculated, and the result represents the energy of the audio frame. These sample points constitute one audio frame; that is, the size of the sliding window equals the number of sample points in an audio frame.

For example, if an audio frame contains 128 sample points, the size of the sliding window (window size) is 128; that is, after each move the root mean square error of the 128 sample points covered by the window is calculated and represents the energy value of that audio frame. If the step length (hop_length), measured in sample points, is set to 64, the sliding window moves 64 sample points each time, so consecutive audio frames overlap by 64 sample points.
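The sliding-window RMS computation with the example values above (window of 128 sample points, hop_length of 64) might look like the following minimal sketch. Dropping a trailing window shorter than 128 sample points is an assumption, since the text does not say how the edge of the candidate frame is handled.

```python
import math


def frame_rms(samples, frame_size=128, hop_length=64):
    """Root mean square of each window of frame_size samples, advanced by hop_length.

    Frame j covers samples[j * hop_length : j * hop_length + frame_size].
    A trailing window shorter than frame_size is dropped (assumption).
    """
    energies = []
    for start in range(0, len(samples) - frame_size + 1, hop_length):
        frame = samples[start:start + frame_size]
        energies.append(math.sqrt(sum(s * s for s in frame) / frame_size))
    return energies
```

Under this layout, frame j starts at sample j × 64, so a frame index converts back to a sample offset by multiplying by hop_length. For comparison, `librosa.feature.rms` computes the same per-frame quantity but by default pads the signal (`center=True`) so that frames are centered on multiples of the hop length.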

Alternatively, in this embodiment, the energy values of all audio frames in the candidate frame may be calculated first, after which the audio frame before the reference point whose energy value is smaller than the first threshold and the audio frame after the reference point whose energy value is smaller than the second threshold are determined.

210. Taking the audio frame with the minimum energy value among the audio frames located after the reference point in the candidate frame as the ending frame;

When no audio frame after the reference point has an energy value smaller than the second threshold (that is, the energy values of all audio frames after the reference point in the candidate frame are not smaller than the second threshold), the audio frame with the smallest energy value among the audio frames after the reference point may be used as the ending frame.

Fig. 4 is a schematic diagram of two adjacent drum point waveforms. Because adjacent drum point waveforms both have high energy values, there may be no audio frame between them whose energy value is smaller than the first threshold or the second threshold; in that case, the starting frame and the ending frame of the drum point waveform must be determined by the minimum-energy principle.
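The adjacent-drum-point case can be illustrated with a sketch that first computes the energies of all frames in the candidate frame and then searches both sides of the reference point. The thresholds are m% and n% of the reference energy, as described for the third determining unit 505; the energy values, m and n in the example are made-up illustrations, and returning the reference frame when one side is empty is an assumption.

```python
def find_boundaries(energies, ref, m, n):
    """Return (starting_frame, ending_frame) from precomputed frame energies.

    energies[0] is the frame at the candidate frame's target starting point,
    energies[-1] the frame at its target end point; ref is the reference frame.
    """
    first_thr = energies[ref] * m / 100.0   # first threshold: m% of reference energy
    second_thr = energies[ref] * n / 100.0  # second threshold: n% of reference energy

    def pick(indices, thr):
        # Nearest frame (in scan order) whose energy is below the threshold.
        for i in indices:
            if energies[i] < thr:
                return i
        if not indices:
            return ref  # assumption: nothing on this side of the reference point
        # Fallback for adjacent drum points: the minimum-energy frame.
        return min(indices, key=lambda i: energies[i])

    before = list(range(ref - 1, -1, -1))        # scanned toward the target starting point
    after = list(range(ref + 1, len(energies)))  # scanned toward the target end point
    # The two searches are independent, so their order does not matter.
    return pick(before, first_thr), pick(after, second_thr)
```

With energies `[0.7, 0.5, 0.9, 1.0, 0.6, 0.4, 0.8]`, reference frame 3 and m = n = 20, no frame drops below 20% of the reference energy, so the minimum-energy frames 1 and 5 are returned, matching the behavior described for Fig. 4.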

In this embodiment, steps 205 and 208 have no fixed execution order: it may first be judged whether an audio frame with an energy value smaller than the first threshold exists and then whether one with an energy value smaller than the second threshold exists, or the reverse, or both judgments may be performed simultaneously. This embodiment does not limit this.

After the drum point waveform of the target drum point in the target audio has been determined, the audio segment from the starting frame to the ending frame can be clipped from the target audio to obtain the audio segment corresponding to the target drum point waveform. Such waveforms can provide training data for training a drum classification model.
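Clipping the segment requires mapping frame indices back to sample positions. A minimal sketch, assuming frames are laid out as in the sliding-window description above (a window of 128 sample points moved by a hop_length of 64, so frame j starts at sample j × 64), and using a hypothetical `offset` for the sample index of the candidate frame's target starting point when frame indices are counted from the candidate frame:

```python
def clip_drum_waveform(samples, start_frame, end_frame,
                       frame_size=128, hop_length=64, offset=0):
    """Return the samples spanned by frames start_frame..end_frame inclusive.

    offset: sample index where frame 0 begins (hypothetical parameter).
    """
    if end_frame < start_frame:
        raise ValueError("end_frame must not precede start_frame")
    first = offset + start_frame * hop_length
    last = offset + end_frame * hop_length + frame_size  # exclusive bound
    return samples[first:last]  # slicing stops at the end of the audio
```

Including the full window of the ending frame (rather than stopping at its start) is a design choice here, so that the decaying tail measured by that frame is kept in the clipped segment.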

In this embodiment, the first and second candidate starting points and the first and second candidate end points limit the length and selection range of the candidate frame, so that the candidate interval of the drum point waveform is located more accurately; the starting frame and ending frame of the drum point waveform are then selected based on energy values, so that the real interval of the drum point waveform can be determined accurately.

The method for determining a drum point waveform in the embodiments of the present application has been described above. With reference to Fig. 5, a drum point waveform determining apparatus in the embodiments of the present application is described below. An embodiment of the drum point waveform determining apparatus includes:

an acquiring unit 501, configured to acquire a target audio, where the target audio is the drum track of a source song and includes audio segments corresponding to a plurality of drum point waveforms in the source song;

a drum point detection unit 502, configured to perform drum point detection on the target audio according to a drum point detection algorithm to obtain a drum point sequence of the target audio, where the drum point sequence includes a plurality of drum points;

a first determining unit 503, configured to determine a candidate frame of a target drum point waveform of a target drum point according to the target drum point, where the target drum point is any one of a plurality of drum points;

a second determining unit 504, configured to determine, as a reference point, an audio frame with a highest energy value in k audio frames, from the k audio frames centered around the audio frame corresponding to the target drum point in the candidate frame, where k is a positive integer;

a third determining unit 505, configured to determine that there are audio frames with an energy value smaller than a first threshold in the audio frames before the reference point in the candidate frame, and audio frames with an energy value smaller than a second threshold in the audio frames after the reference point in the candidate frame, where the first threshold is m% of the energy value of the reference point, the second threshold is n% of the energy value of the reference point, 0 < m < 100, and 0 < n < 100;

a fourth determining unit 506, configured to determine the audio frame whose energy value is smaller than the first threshold as the starting frame of the target drum point waveform, and the audio frame whose energy value is smaller than the second threshold as the ending frame of the target drum point waveform.

In a preferred implementation of this embodiment, the first determining unit 503 is specifically configured to: determine a first candidate starting point of the candidate frame, where, when the target drum point is not the first drum point in the drum point sequence, the first candidate starting point is the time point of the previous drum point plus a first correction value, and when the target drum point is the first drum point, the first candidate starting point is zero; determine a first candidate end point of the candidate frame, where, when the target drum point is not the last drum point in the drum point sequence, the first candidate end point is the time point of the next drum point minus a second correction value, and when the target drum point is the last drum point, the first candidate end point is the last time point of the target audio; and, if the interval between the first candidate starting point and the first candidate end point is within a preset duration range, determine the first candidate starting point as the target starting point of the candidate frame and the first candidate end point as the target end point of the candidate frame.

In a preferred implementation of this embodiment, if the interval between the first candidate starting point and the first candidate end point is not within the preset duration range, the first determining unit 503 is further configured to take the target time point minus a first limit value as a second candidate starting point of the candidate frame and the target time point plus a second limit value as a second candidate end point of the candidate frame, and to determine the larger of the first and second candidate starting points as the target starting point and the smaller of the first and second candidate end points as the target end point.
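The candidate-frame rules described for the first determining unit 503 (and in claims 2 and 3) can be sketched as follows. All times are in seconds; the correction values, limit values and preset duration range are hypothetical parameters because the text gives no concrete values, the range is assumed to be inclusive, and the target time point is assumed to be the time point of the target drum point.

```python
def candidate_frame(drum_times, idx, audio_end, *, corr1, corr2,
                    min_len, max_len, limit1, limit2):
    """Return (target_start, target_end) of the candidate frame for drum point idx."""
    t = drum_times[idx]  # target time point (assumed: time of the target drum point)
    # First candidate starting point: previous drum point + first correction value,
    # or zero for the first drum point.
    first_start = drum_times[idx - 1] + corr1 if idx > 0 else 0.0
    # First candidate end point: next drum point - second correction value,
    # or the last time point of the audio for the last drum point.
    first_end = drum_times[idx + 1] - corr2 if idx < len(drum_times) - 1 else audio_end
    if min_len <= first_end - first_start <= max_len:
        return first_start, first_end
    # Otherwise bound the interval by [t - limit1, t + limit2].
    second_start, second_end = t - limit1, t + limit2
    return max(first_start, second_start), min(first_end, second_end)
```

For example, with drum points at 1.0 s, 1.5 s and 4.0 s, correction values of 0.05 s, a preset range of 0.1 s to 1.0 s and limit values of 0.1 s and 0.8 s (all illustrative), the second drum point's first candidate interval (1.05 s to 3.95 s) is too long, so its candidate frame is narrowed to 1.4 s to 2.3 s.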

In a preferred embodiment of this embodiment, the third determining unit 505 is specifically configured to sequentially calculate an energy value of each audio frame toward a target starting point of the candidate frame with the reference point as a starting point, and determine an audio frame with an energy value smaller than the first threshold as a starting frame; and taking the reference point as a starting point, sequentially calculating the energy value of each audio frame towards the target end point of the candidate frame, and determining the audio frame with the energy value smaller than the second threshold value as an ending frame.

In a preferred embodiment of this embodiment, when sequentially calculating the energy value of each audio frame, the third determining unit 505 is specifically configured to sequentially move the sliding window by a preset step length, calculate a root mean square error of a preset number of sample points corresponding to the sliding window after each movement, represent the energy value of the audio frame by the root mean square error, where the preset number is a size of the sliding window, and the preset number of sample points constitutes the audio frame.

In a preferred embodiment of this embodiment, if there is no audio frame with an energy value smaller than the first threshold, the fourth determining unit 506 is further configured to use the audio frame with the smallest energy value in the audio frames before the reference point in the candidate frame as a starting frame;

if there is no audio frame with an energy value smaller than the second threshold, the fourth determining unit 506 is further configured to use the audio frame with the smallest energy value in the audio frames after the reference point in the candidate frame as the termination frame.

The drum point waveform determining apparatus further includes:

a clipping unit 507 for clipping a drum point waveform from the start frame to the end frame from the target audio.

In this embodiment, the operations performed by the units of the drum point waveform determining apparatus are similar to those described in the embodiments shown in Fig. 1 to Fig. 2, and are not repeated here.

In this embodiment, the first determining unit 503 determines a candidate frame of the target drum point waveform; the second determining unit 504 determines the audio frame with the highest energy value among the k audio frames as the reference point; and, taking the reference point as a reference, the fourth determining unit 506 determines an audio frame whose energy value is smaller than the first threshold as the starting frame of the drum point waveform and an audio frame whose energy value is smaller than the second threshold as the ending frame, thereby determining the drum point waveform. Because the drum point waveform is determined from the energy values of the audio frames in the candidate frame, this embodiment adds a determination criterion and method compared with selecting drum point waveforms at a fixed length, making the selection result more accurate.

Referring to Fig. 6, a drum point waveform determining apparatus according to an embodiment of the present application is described below. An embodiment of this apparatus includes:

The drum point waveform determining apparatus 600 may include one or more central processing units (CPUs) 601 and a memory 605, where the memory 605 stores one or more applications or data.

The memory 605 may be volatile or persistent storage. The program stored in the memory 605 may include one or more modules, each of which may include a series of instruction operations for the drum point waveform determining apparatus. Further, the central processing unit 601 may be configured to communicate with the memory 605 and execute, on the drum point waveform determining apparatus 600, the series of instruction operations in the memory 605.

The drum point waveform determining apparatus 600 may also include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input/output interfaces 604, and/or one or more operating systems, such as Windows Server, Mac OS X™, Unix™, Linux™, or FreeBSD™.

The central processing unit 601 may perform the operations performed by the drum point waveform determining apparatus in the embodiments shown in Fig. 1 to Fig. 2; details are not repeated here.

An embodiment of the present application further provides a computer storage medium storing instructions that, when executed on a computer, cause the computer to perform the operations performed by the drum point waveform determining apparatus in the embodiments shown in Fig. 1 to Fig. 2.

Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application essentially, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.

Claims

1. A drum point waveform determination method, comprising:

2. The method according to claim 1, wherein the determining a candidate frame of the target drum point waveform of the target drum point according to the target drum point comprises:

determining a first candidate starting point of the candidate frame, wherein when the target drum point is not the first drum point in the drum point sequence, the first candidate starting point is a time point obtained by adding a first correction value to the time point of the previous drum point of the target drum point, and when the target drum point is the first drum point, the first candidate starting point is zero;

determining a first candidate end point of the candidate frame, wherein when the target drum point is not the last drum point in the drum point sequence, the first candidate end point is a time point obtained by subtracting a second correction value from a time point of a next drum point of the target drum point, and when the target drum point is the last drum point, the first candidate end point is the last time point of the target audio;

and if the interval duration of the first candidate starting point and the first candidate ending point is within a preset duration range, determining that the first candidate starting point is a target starting point of the candidate frame and the first candidate ending point is a target ending point of the candidate frame.

3. The method for determining a drum point waveform according to claim 2, wherein if a time interval between the first candidate start point and the first candidate end point is not within the preset time range, the method further comprises:

taking a time point obtained by subtracting a first limit value from the target time point as a second candidate starting point of the candidate frame, and taking a time point obtained by adding a second limit value to the target time point as a second candidate ending point of the candidate frame;

determining that the maximum value of the first candidate starting point and the second candidate starting point is the target starting point, and the minimum value of the first candidate end point and the second candidate end point is the target end point.

4. The method of claim 1, wherein the determining that there is an audio frame with an energy value smaller than a first threshold among the audio frames before the reference point in the candidate frame and an audio frame with an energy value smaller than a second threshold among the audio frames after the reference point in the candidate frame comprises:

sequentially calculating the energy value of each audio frame, starting from the reference point and moving toward the target starting point of the candidate frame, and determining the audio frame with an energy value smaller than the first threshold as the starting frame;

and sequentially calculating the energy value of each audio frame, starting from the reference point and moving toward the target end point of the candidate frame, and determining the audio frame with an energy value smaller than the second threshold as the termination frame.

5. The method of claim 4, wherein said sequentially calculating an energy value for each audio frame comprises:

moving the sliding window in sequence by a preset step length, calculating the root mean square error of a preset number of sample points corresponding to the sliding window after each movement, representing the energy value of the audio frame by the root mean square error, wherein the preset number is the size of the sliding window, and the preset number of sample points form the audio frame.

6. The method of drum point waveform determination according to claim 1, wherein if there is no audio frame for which the energy value is less than a first threshold, the method further comprises:

taking the audio frame with the minimum energy value in the audio frames positioned before the reference point in the candidate frame as the starting frame;

if there are no audio frames with the energy value less than a second threshold, the method further comprises:

and taking the audio frame with the minimum energy value in the audio frames positioned after the reference point in the candidate frame as the termination frame.

7. The drum point waveform determination method according to claim 1, further comprising:

and intercepting a drum point waveform from the starting frame to the ending frame from the target audio.

8. A drum point waveform determination apparatus, comprising:

9. A drum point waveform determination apparatus, comprising:

the processor is connected with the memory and the input and output equipment;

10. A computer storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 7.