CN111312285B - Beginning popping detection method and device - Google Patents


Info

Publication number
CN111312285B
CN111312285B (application CN202010044525.0A)
Authority
CN
China
Prior art keywords
encoder
generator
detection model
audio
audio file
Prior art date
Legal status
Active
Application number
CN202010044525.0A
Other languages
Chinese (zh)
Other versions
CN111312285A
Inventor
张斌 (Zhang Bin)
Current Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Priority to CN202010044525.0A
Publication of CN111312285A
Application granted
Publication of CN111312285B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: specially adapted for particular use
    • G10L25/51: specially adapted for particular use for comparison or discrimination
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/12: the extracted parameters being prediction coefficients
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/24: the extracted parameters being the cepstrum
    • G10L25/27: characterised by the analysis technique
    • G10L25/30: the analysis technique using neural networks
    • G10L25/78: detection of presence or absence of voice signals
    • G10L2025/783: detection of presence or absence of voice signals based on threshold decision

Abstract

The embodiment of the invention discloses a method and a device for detecting a beginning pop. The method comprises the following steps: extracting audio frequency domain features from a first audio file; obtaining a pre-trained target detection model; and inputting the audio frequency domain features of the first audio file into the target detection model to detect whether the first audio file has a beginning pop. By implementing the embodiment of the invention, the accuracy of beginning pop detection can be improved.

Description

Beginning popping detection method and device
Technical Field
The embodiment of the invention relates to the technical field of audio processing, in particular to a beginning pop detection method and device.
Background
At present, in the generation, transmission, storage and other handling of audio files, sound quality is easily damaged, which in turn degrades the user's listening experience. For example, when an audio file is encoded with the official LAME 3.16 encoder, the encoded audio file may produce a pop at the beginning when played. A beginning pop means that, within roughly the first 20 seconds of playback, the sound wave shows a distinct pulse in frequency, which the human ear perceives as transient noise.
At present, a beginning pop can be detected by digital signal processing (DSP). Specifically, the DSP compares the frequency of each frame of the signal in the audio file against a preset beginning-pop threshold. When any frame in the audio file exceeds the threshold, the DSP determines that the audio file contains a beginning pop.
However, the appropriate beginning-pop threshold is not necessarily the same for different audio files. Judging different audio files against the same fixed threshold easily produces misjudgments, which lowers the accuracy of beginning pop detection.
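The threshold-based DSP approach described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the per-frame peak frequencies and the threshold are made-up values.

```python
import numpy as np

def dsp_pop_detect(frame_peak_freqs, threshold):
    """Flag a beginning pop when any frame's peak frequency exceeds a fixed threshold."""
    return bool(np.any(np.asarray(frame_peak_freqs) > threshold))

# Hypothetical per-frame peak frequencies (Hz) for the opening seconds of two files.
clean_file = [800, 950, 1200, 1100]
pop_file   = [800, 6200, 1200, 1100]   # one frame carries an impulse-like spike

print(dsp_pop_detect(clean_file, threshold=5000))  # False
print(dsp_pop_detect(pop_file, threshold=5000))    # True
```

The fixed threshold is exactly the weakness the background section points out: a 6200 Hz spike may be a pop in one file but a legitimate high-frequency tone in another.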
Disclosure of Invention
The embodiment of the invention discloses a method and a device for detecting a beginning pop, which can effectively improve the accuracy of detecting whether an audio file has a beginning pop.
In a first aspect, an embodiment of the present invention provides a method for detecting pop at the beginning, including:
extracting, by the computing device, audio frequency domain features for the first audio file;
the computing device obtains a pre-trained target detection model; the target detection model is trained on audio frequency domain features of a plurality of second audio files;
the computing device inputs audio frequency domain features of the first audio file to the target detection model to detect whether a beginning plosive exists in the first audio file.
Wherein the target detection model comprises a classifier, an encoder, and a generator; the output of the encoder is the input of the generator, and the output of the generator is the input of the classifier.
The computing device inputs the audio frequency domain features of the first audio file into the encoder in the target detection model, and obtains a first feature using the encoder and the generator; the first feature is a mask of the audio frequency domain features of the first audio file;
the computing device inputs the product of the first feature and the audio frequency domain features of the first audio file to the classifier to obtain a detection result; the detection result indicates whether the first audio file has a beginning pop.
In a second aspect, an embodiment of the present invention discloses a method for training a target detection model, including:
the computing device trains the target detection model using a discriminator.
the computing device inputs the audio frequency domain features of the plurality of second audio files to the encoder of the target detection model, and obtains second features using the encoder and the generator; the second features are masks of the beginning pops of the plurality of second audio files;
the computing device obtaining a third feature and a first tag; the third feature is an actual mask of a beginning plosive of the plurality of second audio files, the first tag indicating whether there is a beginning plosive in the plurality of second audio files;
the computing device inputting the second feature and the third feature to the discriminator and the first label to the classifier;
the computing device training the encoder and the generator with an output of the discriminator and an output of the classifier;
the computing device trains the classifier using the first label and an output of the classifier.
When the computing device detects that the network formed by the encoder and the generator, on the one hand, and the discriminator, on the other, have reached Nash equilibrium, the computing device stops training the target detection model and obtains the trained target detection model.
In a third aspect, an embodiment of the present invention discloses a beginning pop detection device, including:
the audio frequency domain feature extraction module is used for extracting audio frequency domain features from the first audio file;
the target detection module is used for obtaining a pre-trained target detection model; the target detection model is trained by audio frequency domain features of a plurality of second audio files;
the target detection module is further configured to input the audio frequency domain characteristics of the first audio file to the target detection model, so as to detect whether there is a beginning pop in the first audio file.
In a fourth aspect, an embodiment of the present invention further discloses a computer-readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, cause the processor to execute the methods of the first and second aspects.
In an embodiment of the invention, a computing device builds a target detection model using a neural network. Having obtained a pre-trained target detection model, the computing device may input an audio file to the trained model and generate the first feature using the generator therein. The first feature can reflect fairly faithfully whether the audio file has a beginning pop, so the computing device may input it to the classifier in the target detection model to obtain the detection result, and judge from that result whether a beginning pop exists. Because the computing device uses a large number of audio files when training the target detection model, the varied conditions under which beginning pops occur in audio files are taken into account, which improves the accuracy of beginning pop detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a training block diagram of an initial pop detection model according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for training a beginning pop detection model according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for training a beginning pop detection model according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an initial pop detection model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a computing device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of another computing device according to an embodiment of the present invention.
Detailed Description
The embodiment of the invention discloses a beginning popping detection method and device. The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is to be understood that the terminology used in the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Since the embodiments of the present invention relate to the application of a neural network, for convenience of understanding, related terms and related concepts such as the neural network related to the embodiments of the present invention will be described first.
(1) Extracting audio frequency domain features
To detect whether an audio file has a beginning pop, the computing device may first extract the audio frequency domain features of the audio file. Specifically, the computing device may extract these features by computing a log-mel spectrum (log-mel), mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction coefficients (PLP), and the like. Different extraction methods may yield features of different dimensions; for example, log-mel spectral features may be 128-dimensional while mel-frequency cepstral coefficients may be 64-dimensional.
The computing device extracts the audio frequency domain features of the audio file to obtain a first feature matrix. The number of rows of the first feature matrix is determined by the extraction method; for example, when the computing device computes log-mel, the first feature matrix may have 128 rows. The number of columns is determined by the length of audio over which the computing device extracts features. For example, the computing device may extract audio frequency domain features from the first 20 seconds of the audio file as played. When extracting the features, the computing device frames the speech signal in those 20 seconds; if the framing divides the 20-second speech signal into 120 frames, each frame corresponds to one spectrum and the first feature matrix has 120 columns. If the computing device computes log-mel over the first 20 seconds of playback and the framing yields 120 frames, a first feature matrix of size 128 × 120 is obtained. The value of an element in the matrix represents the spectral value of a frame of the speech signal at a certain frequency.
The method for extracting the audio frequency domain features is not limited in the embodiment of the present invention; it may be calculating log-mel, any of the other methods mentioned above, or other methods.
The length of audio over which features are extracted is likewise not limited in the embodiment of the present invention; it may be the first 20 seconds of the audio file during playback, or the first 10 seconds, the first 15 seconds, and so on. A beginning pop generally occurs within the first 20 seconds of playback, so restricting feature extraction to those first 20 seconds greatly reduces the influence that high-frequency tones in the audio file would otherwise have on the computing device's detection of the beginning pop. For example, high-frequency tones present in some audio files also appear as distinct pulses on the spectrogram. A distinct pulse here may mean that the difference between the frequency of one frame of the speech signal and the frequencies of nearby frames exceeds a threshold, where the threshold may be 4 kHz, 5 kHz, or the like; the embodiment of the present invention does not limit this.
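The framing and per-frame spectrum computation described above can be sketched with plain NumPy. This is an illustrative stand-in for a real log-mel computation: the frame length, hop size, and sampling rate below are made-up values, not ones specified by the patent.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

def log_spectrogram(x, frame_len=256, hop=128):
    """Log-magnitude spectrum per frame; columns are frames, as in the first feature matrix."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))   # shape (n_frames, frame_len // 2 + 1)
    return np.log1p(mag).T                      # rows: frequency bins, columns: frames

# A toy 1-second signal at 8 kHz with an impulse ("pop") near the start.
sr = 8000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
x[100:105] += 5.0

F = log_spectrogram(x)
print(F.shape)   # (129, 61): frequency bins by frames
```

A real system would map the FFT bins onto a mel filter bank to get, e.g., 128 mel bins; the matrix layout (frequency rows by time-frame columns) is the same.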
(2) Neural network
A neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes x_s and an intercept of 1 as input, and whose output may be:

h_{W,b}(x) = f(W^T x) = f\left( \sum_{s=1}^{n} W_s x_s + b \right)

where s = 1, 2, …, n; n is a natural number greater than 1; W_s is the weight of x_s; and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field; a local receptive field may be a region composed of several neural units.
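The single neural unit above can be computed directly; this is a generic sketch of the formula, with made-up weights and inputs.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation function f."""
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    """Output f(sum_s W_s * x_s + b), matching the formula above."""
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs x_s
W = np.array([0.1, 0.4, 0.3])    # weights W_s
b = 0.2                          # bias

print(neural_unit(x, W, b))      # sigmoid(0.45), roughly 0.61
```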
(3) Variational autoencoder-generated countermeasure network
A variational autoencoder generative adversarial network (VAE-GAN) is a deep learning model that combines a VAE with a GAN.
The VAE comprises at least two parts: an Encoder and a Decoder. Both the encoder and the decoder may be implemented by a neural network, in particular a deep neural network or a convolutional neural network. The basic principle of the VAE is described here using picture generation as an example: the Encoder (E) receives a picture x1 and converts it into a vector; the Decoder (D) receives the vector and converts it into a picture x2, where x2 is a reconstruction of x1.
The GAN also comprises at least two parts: a Generator and a Discriminator. Through these two parts learning in a game against each other, the GAN can produce better output. The generator and the discriminator can likewise be implemented by neural networks. The basic principle of the GAN is again described with picture generation as an example. The Generator (G) is a network that can generate pictures: it receives a random noise z and generates a picture from it, denoted G(z). The Discriminator (D) is a discrimination network used to judge whether a picture is a real picture or one generated by G. D receives an input x representing a picture and outputs D(x), the probability that x is a real picture. If D(x) is 1, x is certainly a real picture; if D(x) is 0, x cannot be a real picture.
In the process of training the GAN, G tries as much as possible to generate realistic pictures to deceive D, while D tries as much as possible to distinguish generated pictures from real ones. G and D thus constitute a dynamic "game". In the ideal end state of this game, G can generate pictures G(z) realistic enough to pass as genuine, and D can hardly determine whether a generated picture is real, so the game reaches Nash equilibrium, that is, D(G(z)) = 0.5. The result is an excellent generator G that can be used to generate pictures.
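The D(G(z)) = 0.5 condition can be made concrete with the standard closed form for the optimal discriminator, D*(x) = p_data(x) / (p_data(x) + p_g(x)): when the generator's distribution matches the data distribution, D* outputs 0.5 everywhere. A tiny numeric sketch (the density values are made up):

```python
def optimal_discriminator(p_data, p_g):
    """Theoretically optimal discriminator output for densities p_data and p_g at a point."""
    return p_data / (p_data + p_g)

# Early in training: generated samples are easy to tell apart from real ones.
print(optimal_discriminator(p_data=0.9, p_g=0.1))   # ~0.9

# At Nash equilibrium the generator matches the data distribution, so D outputs 0.5.
print(optimal_discriminator(p_data=0.4, p_g=0.4))   # 0.5
```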
The VAE may generate a picture containing certain image features. For example, in a face-image scenario where the desired feature is "smiling", the image input to the VAE contains a "non-smiling" face and the face image output by the VAE contains a "smiling" one. However, a picture generated by a VAE has lower resolution than the picture input to it, while a GAN-generated picture has higher resolution than a VAE-generated one. Since the input of the GAN contains random noise, the image output by the GAN contains random image features. Combining the VAE with the GAN can therefore generate high-resolution pictures that contain specific image features. The VAE-GAN comprises at least three parts: an encoder, a generator, and a discriminator, each of which can be implemented by a neural network. The encoder and the generator constitute the VAE, and the generator and the discriminator constitute the GAN.
In the picture generation scenario, the training process of the VAE-GAN is in effect a process of reducing the difference between the pictures generated by the generator and the pictures input to the encoder. The result is an excellent generator: one can hardly determine whether a given picture was produced by the generator or was input to the encoder.
The computing device may use the VAE-GAN to construct a detection model for initial pop detection.
In the invention, the detection model used by the computing device to detect whether an audio file has a beginning pop is the target detection model.
The input received by the computing device when training the detection model may include the first feature matrix and the second feature matrix. The second feature matrix may be obtained by performing a first mask operation on the first feature matrix, and the first mask operation may be performed manually. For example, the computing device may obtain a mel spectrogram from the first feature matrix; for music with a beginning pop, the corresponding mel spectrogram shows distinct pulses, so the first mask operation can be carried out manually on the first feature matrix: the element values in the first feature matrix corresponding to the beginning pop actually present in the audio file are marked as 1, and the element values at other positions are marked as 0. The second feature matrix therefore reflects the actual situation of whether the audio file has a beginning pop, and its size is the same as that of the first feature matrix.
The computing device can obtain a third feature matrix through the operation of the encoder and the generator in the detection model. The third feature matrix may be a feature matrix obtained by performing a second mask operation on the first feature matrix; unlike the first mask operation, the second mask operation is performed by the computing device.
In one possible implementation, the computing device trains the encoder and the generator so that the element values in the first feature matrix corresponding to a possible beginning pop in the audio file are marked as 1 and the element values at other positions are marked as 0. The third feature matrix therefore reflects where a beginning pop may exist in the audio file, and its size is the same as that of the first or second feature matrix.
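The mask construction described above can be sketched on a toy feature matrix. The matrix values and the threshold used as the labelling rule here are made up; in the patent the first mask operation is done manually.

```python
import numpy as np

# A toy 4 x 6 "first feature matrix": rows are frequency bins, columns are frames.
first = np.array([
    [0.1, 0.2, 0.1, 0.1, 0.2, 0.1],
    [0.2, 6.0, 0.2, 0.1, 0.1, 0.2],   # frame 1 carries an impulse-like spike
    [0.1, 5.5, 0.1, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.1, 0.2, 0.1],
])

# Mask operation: mark elements belonging to the pop as 1, all others as 0.
second = (first > 1.0).astype(np.float32)

print(second.shape == first.shape)   # the mask has the same size as the feature matrix
print(second[1, 1], second[0, 0])    # 1.0 at the spike, 0.0 elsewhere
```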
In the embodiment of the invention, the computing device utilizes a 'game' process of an encoder, a generator and a discriminator to reduce the difference between the third characteristic matrix and the second characteristic matrix, so that the third characteristic matrix can reflect whether the audio file has the beginning pop more truly.
The following describes functions of detecting various parts of the VAE-GAN in the beginning popping scene according to the embodiment of the invention.
a. Encoder for encoding a video signal
The input to the Encoder (Encoder) may be a first feature matrix and the output may be a vector. The vector has a mapping relation with the first feature matrix. Wherein the mapping relationship is determined by a neural network implementing the encoder.
b. Generator
The input to the Generator may be the vector output by the encoder, and the output may be the third feature matrix. Through training and learning, the computing device adjusts the parameters of the neural networks in the encoder and the generator, thereby reducing the difference between the third feature matrix and the second feature matrix; that is, the third feature matrix comes to reflect more faithfully whether there is a beginning pop in the audio file.
In this way, a detection model for beginning pop detection of an audio file can be obtained through training and learning. The computing device trains the detection model using a large number of audio files. Therefore, the detection model obtained by training considers the difference of the beginning plosive in a large number of audio files, and is more accurate in beginning plosive detection.
c. Distinguishing device
A Discriminator may be used to train the encoder and the generator. The input of the discriminator may be the second feature matrix or the third feature matrix, and the output may be a first probability. The first probability represents the probability that the input to the discriminator is the second feature matrix: a value of 1 indicates the input is certainly the second feature matrix, and a value of 0 indicates it certainly is not. When the input received by the discriminator is the third feature matrix and the first probability approaches 0, the discriminator can still distinguish the second feature matrix from the third, which means the third feature matrix output by the generator does not yet faithfully reflect whether the audio file has a beginning pop. The first probability may be used to calculate a loss function for the encoder and the generator, and parameters in the encoder and the generator may be adjusted during training according to this loss function.
In the embodiment of the present invention, in order to implement beginning pop detection, a Classifier is added to the existing structure of the VAE-GAN model. The classifier may also be implemented by a neural network, specifically a deep neural network or a convolutional neural network; the embodiment of the present invention does not limit this.
The classifier can distinguish whether the audio file has the beginning plosive according to the first characteristic matrix and the third characteristic matrix.
The input received by the computing device in training the detection model may include an initial plosive label in addition to the first feature matrix and the second feature matrix. The beginning pop label can be used for marking whether beginning pops exist in the audio file. For example, when there is a beginning pop in an audio file, the computing device may value the beginning pop label as 1. When the audio file does not have a beginning pop, the computing device may value the beginning pop label as 0.
The beginning pop label may be used as an input to the classifier. The other input to the classifier is the element-wise product of the first feature matrix and the third feature matrix.
Since the third feature matrix is the first feature matrix after the second mask operation, in which the element values corresponding to a possible beginning pop are marked as 1 and the element values at other positions as 0, the element-wise product reduces the interference that part of the elements of the first feature matrix would cause the classifier. For example, some portions of the audio correspond to regions of small element values in the first feature matrix, and such regions may interfere with the classifier when it distinguishes whether a beginning pop exists.
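The element-wise (Hadamard) product and its masking effect can be shown on a toy example; the values below are made up.

```python
import numpy as np

features = np.array([[0.05, 4.0],
                     [0.02, 3.5]])   # small values: background audio; large values: pop region
mask     = np.array([[0.0, 1.0],
                     [0.0, 1.0]])    # generator's mask of the suspected pop

masked = features * mask             # element-wise product fed to the classifier
print(masked)
```

Regions outside the suspected pop are zeroed out, so they no longer interfere with the classifier's decision.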
The output of the classifier may be a second probability characterizing the probability that the audio file has a beginning pop: a value of 1 indicates the audio file certainly has a beginning pop, and a value of 0 indicates it certainly does not. The larger the difference between the second probability and the beginning pop label, the poorer the classifier's ability to distinguish whether a beginning pop exists in the audio file. From the second probability and the beginning pop label, the computing device can calculate the loss function of the classifier and adjust the parameters in the classifier to reduce this difference, thereby improving the classifier's ability to distinguish whether a beginning pop exists.
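The patent does not specify the classifier's loss function; a common choice for comparing a predicted probability with a 0/1 label, shown here as an assumption, is binary cross-entropy:

```python
import math

def bce_loss(p, label, eps=1e-12):
    """Binary cross-entropy between the second probability p and the beginning pop label."""
    p = min(max(p, eps), 1 - eps)          # clip to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# The closer the probability is to the label, the smaller the loss.
print(bce_loss(0.9, 1))   # confident and correct: small loss
print(bce_loss(0.1, 1))   # confident and wrong: large loss
```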
The encoder, the generator, the discriminator, and the classifier can each be implemented by a neural network. The computing device may train the encoder and the generator as a whole, that is, use a loss function together with a back propagation algorithm to adjust the parameters within their neural networks. The computing device may adjust the parameters in the encoder so that its output vector better characterizes the first feature matrix, and adjust the parameters of the neural network in the generator so that the third feature matrix output by the generator more truly reflects whether a beginning pop exists in the audio file.
The computing device may train the discriminator and the classifier separately, that is, use a back propagation algorithm to train each according to the loss function of its own neural network. In this way, the discriminator can more accurately judge whether its input is the second feature matrix, and the classifier can more accurately distinguish whether the audio file has a beginning pop.
(4) Loss function
Training a neural network is the process of learning its weight matrices; the final purpose of training is to obtain the weight matrices of all layers of the trained neural network, i.e., the matrices formed by the weight vector W of each layer.
When training the neural network, because the output of the neural network is expected to be as close as possible to the value that is truly desired, the computing device can compare the current network's predicted value with the truly desired target value and adjust the weight vector of each layer according to the difference between them. Of course, before the first adjustment there is usually an initialization process in which parameters are pre-configured for each layer of the neural network. Illustratively, if the network's predicted value is too high, the computing device adjusts the weight vectors to lower it, and keeps adjusting until the neural network can predict the truly desired target value or a value very close to it.
Therefore, the computing device needs to define in advance how to compare the difference between the predicted value and the target value; this is the role of the loss function or the objective function, important equations for measuring that difference. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, and training the neural network is then the process of minimizing this loss.
(5) Cross entropy loss function
The cross-entropy loss function is one way to measure the difference between the predicted value and the target value of a neural network. Cross entropy is an important concept in Shannon's information theory, mainly used to measure the difference between two probability distributions. Since a smaller cross entropy means a smaller difference between the two distributions, cross entropy can be used as a loss function for training the neural network.
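As a concrete illustration, the cross entropy between two discrete distributions can be computed directly. The distributions below are hypothetical, chosen only to show that a prediction closer to the target yields a smaller cross entropy:

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross entropy H(p, q) = -sum_i p_i * log(q_i) between two discrete
    probability distributions p (target) and q (prediction)."""
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

target = [1.0, 0.0]   # one-hot label: "beginning pop present"
close  = [0.9, 0.1]   # prediction close to the target
far    = [0.4, 0.6]   # prediction far from the target
```

Here `cross_entropy(target, close)` is smaller than `cross_entropy(target, far)`, matching the property that a smaller cross entropy indicates a smaller difference between the distributions.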
In an embodiment of the present invention, the discriminator in the VAE-GAN may be used to discriminate the difference between the second feature matrix and the third feature matrix, so as to achieve the purpose of training the encoder, the generator and the discriminator, i.e. to reduce the difference between the third feature matrix and the second feature matrix. In this way, the third feature matrix output by the generator can reflect more truly whether the audio file has the beginning pop or not, and the discriminator has difficulty in judging whether the input is the second feature matrix or the third feature matrix.
The computing device may use a cross-entropy loss function to train a neural network for each portion in the VAE-GAN. Wherein, for the encoder and generator parts, the computing device may preset the expression of the cross entropy loss function as:
L_G = E[log(D(G(z)))]    (1-2)
where the operator E denotes the mathematical expectation, z is the output of the encoder (i.e., the input of the generator), G(z) denotes the output of the generator (i.e., the third feature matrix), and D(G(z)) denotes the first probability output by the discriminator when the input it receives is the third feature matrix.
For the part of the discriminator, the computing device may preset the expression of the cross entropy loss function as:
L_D = E[log(D(x_r))] + E[log(1 - D(G(z)))]    (1-3)
where x_r is the second feature matrix, which reflects the actual situation of whether the audio file has a beginning pop, and D(x_r) is the first probability output by the discriminator when the input it receives is the second feature matrix.
For the classifier portion, the computing device may preset the expression of the cross entropy loss function as:
L_C = E[log(C(x * x_g))]    (1-4)
where C(x * x_g) denotes the second probability output by the classifier when the input it receives is the element-wise product of the first feature matrix x and the third feature matrix x_g.
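The expressions (1-2) to (1-4) can be evaluated numerically by approximating each expectation E[·] as a mean over a batch. In this sketch the per-sample discriminator and classifier outputs are made-up illustrative numbers, not outputs of trained networks:

```python
import math

def expect_log(values, eps=1e-12):
    # Approximate E[log(v)] as the mean of log over a batch; eps guards log(0)
    return sum(math.log(v + eps) for v in values) / len(values)

# Hypothetical per-sample outputs over one batch (illustrative values only)
d_fake = [0.40, 0.50, 0.45]   # D(G(z)): discriminator output on x_g
d_real = [0.60, 0.55, 0.70]   # D(x_r):  discriminator output on x_r
c_out  = [0.80, 0.90, 0.70]   # C(x * x_g): classifier output

L_G = expect_log(d_fake)                                        # equation (1-2)
L_D = expect_log(d_real) + expect_log([1 - d for d in d_fake])  # equation (1-3)
L_C = expect_log(c_out)                                         # equation (1-4)
```

All three values are log-expectations of probabilities in (0, 1), so each term is negative; training adjusts the networks to move these quantities in the desired directions.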
(6) Back propagation algorithm
During training, the neural network may use the back propagation (BP) algorithm to revise the initial weight matrices so that the difference between the predicted value and the target value becomes smaller and smaller. Specifically, the input signal is propagated forward until an output is produced, the error loss of that output is computed, and the error-loss information is then propagated backward to adjust the initial weight matrices so that the error loss converges. The back propagation algorithm is thus a backward pass dominated by the error loss, aimed at obtaining an optimal solution, e.g., the weight matrices.
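A minimal one-parameter illustration of this loop (not the patent's model): the "network" predicts y = w * x, the squared error is the loss, back propagation gives the gradient dL/dw by the chain rule, and gradient descent adjusts w until the prediction approaches the target:

```python
# One-parameter "network": predict y = w * x, loss L = (w*x - t)^2
w, x, t, lr = 0.0, 1.0, 2.0, 0.1   # initial weight, input, target, learning rate
for _ in range(100):
    pred = w * x
    grad = 2 * (pred - t) * x      # dL/dw via the chain rule (back propagation)
    w -= lr * grad                 # adjust the weight against the gradient
# w converges toward the target value 2.0
```

Each iteration shrinks the error by a constant factor, which is the "smaller and smaller difference" behavior described above.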
In the beginning pop detection method provided by the embodiment of the present invention, the computing device can train the detection model used for beginning pop detection with a back propagation algorithm to obtain an updated detection model, and then use the updated detection model to detect whether a beginning pop exists in music. When detecting beginning pops with the updated detection model, the detection result can be obtained simply by taking the first feature matrix, obtained after extracting the audio frequency domain features, as the input of the detection model; compared with a DSP-based approach, this detection process is more convenient.
In the embodiment of the present invention, the computing device constructs the detection model with neural networks and trains it to generate a feature matrix. Because this feature matrix truly reflects whether the audio file has a beginning pop, the computing device can use it to perform beginning pop detection. Moreover, because a large number of audio files are used when training the detection model, the varied forms that beginning pops take across audio files are taken into account, which improves the accuracy of beginning pop detection.
The following describes a detection model for detecting initial popping in an embodiment of the present invention. Referring to fig. 1, fig. 1 is a training block diagram of an initial pop detection model according to an embodiment of the present invention. As shown in fig. 1, the detection model may comprise an encoder 110, a generator 120, and a classifier 140. The discriminator 130 is used to assist in training the encoder 110 and the generator 120. Wherein:
the encoder 110 may be implemented by a neural network, and in particular, may be implemented by a convolutional neural network. As shown in fig. 1, the encoder 110 may be configured to receive x, and, via operation, may output a vector z.
In an embodiment of the present invention, x received by the encoder 110 may be one or more first feature matrices in the training set. The training set may include a plurality of first feature matrices for training the detection model. One of the first feature matrices may be a feature matrix obtained by extracting audio frequency domain features of a piece of music by the computing device.
In one possible implementation, the computing device may preset a batch processing amount (batch), and the encoder 110 receives a corresponding amount of the first feature matrix for processing according to the size of the batch value. For example, there are 2048 first feature matrices in the training set, that is, the computing device performs an operation of extracting audio frequency domain features on 2048 pieces of music. The computing device may preset the size of the batch value to be 32, and the encoder 110 receives 32 first feature matrices at a time. Since the 2048 first feature matrices are included in the training set, the computing device may train the encoder 64 times to traverse all of the first feature matrices. The above-mentioned one training may be that the computing device calculates the loss function and adjusts parameters in the encoder according to all the first feature matrices in the input encoder. The value of batch is less than or equal to the total number of the first feature matrices in the training set, and the value of batch is not particularly limited in the embodiment of the present invention.
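The batch traversal described above can be sketched directly; with the example figures of 2048 first feature matrices and a batch value of 32, traversing the whole training set takes exactly 64 training steps:

```python
total_matrices = 2048   # first feature matrices in the training set
batch = 32              # matrices the encoder receives per training step

steps = 0
for start in range(0, total_matrices, batch):
    # each step would feed matrices [start, start + batch) to the encoder,
    # compute the loss over the batch, and adjust the encoder's parameters
    steps += 1
```

After the loop, `steps` is 64, the number of trainings needed to traverse all first feature matrices once.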
The vector z output by the encoder 110 has a mapping relationship with the first feature matrix x, and the specific mapping manner is determined by the neural network implementing the encoder. The vector z may serve as an intermediary connecting the encoder 110 and the generator 120.
The generator 120 may be implemented by a neural network, and in particular by a convolutional neural network. As shown in fig. 1, the generator 120 may be configured to receive the vector z output by the encoder 110 and perform a second masking operation on x according to z to output x_g. Here x_g includes one or more third feature matrices; the number of third feature matrices in x_g equals the number of first feature matrices in x, both being determined by the batch value preset by the computing device.
The generator 120 receives the vector z and, after computation, outputs x_g. Since the vector z has a mapping relation with the first feature matrix x, x_g is also associated with x. x_g may be the feature matrix obtained after the second masking operation on the first feature matrix; that is, x_g can reflect where the first feature matrices in x may have the features of a beginning pop.
Through the operations of the encoder and the generator, the computing device converts the first feature matrix into a third feature matrix that reflects where the audio file may have a beginning pop. By training the encoder and the generator, the computing device can make the third feature matrix reflect more truly whether the audio file has a beginning pop.
The discriminator 130 may be implemented by a neural network, and in particular by a convolutional neural network. As shown in fig. 1, the discriminator 130 may be configured to receive x_g or x_r, judge whether its input is x_r, and output a first probability Y.
The computing device may input only x_g or only x_r to the discriminator 130 at a time. Specifically, the computing device may preset a batch value; when the discriminator 130 is to receive x_r, a number of second feature matrices corresponding to the first feature matrices in x, determined by the size of the batch value, are input to the discriminator 130.
The first probability Y indicates the probability with which the discriminator judges its input to be x_r. For example, the closer Y is to 1, the more confident the discriminator is that its input is x_r; conversely, the closer Y is to 0, the more likely the discriminator considers its input to be x_g.
Since x_r reflects the actual situation of whether the audio file has a beginning pop, while x_g reflects where a beginning pop may exist in the audio file, the computing device trains the encoder, the generator, and the discriminator to reduce the difference between x_g and x_r, making it difficult for the discriminator to determine whether its input is x_g or x_r. Through this process, a more ideal encoder and generator can be obtained; the x_g output by the ideal generator reflects more truly whether the audio file has a beginning pop.
The classifier 140 may be implemented by a neural network, and in particular by a convolutional neural network. As shown in fig. 1, the classifier 140 may be configured to receive x and x_g, compute their element-wise product, take that result together with the beginning pop label L corresponding to the first feature matrices in x, calculate the probability that the first feature matrices in x have the features of a beginning pop, and output a second probability C.
After the computing device performs the element-wise multiplication, the result retains the element values at the positions in x where the features of a beginning pop may exist and removes the element values at other positions. Using this result, the computing device can improve the accuracy with which the classifier distinguishes whether the audio file has a beginning pop.
The second probability C represents the probability that the first feature matrices in x have the features of a beginning pop. The closer a value in C is to 1, the higher the probability that the corresponding first feature matrix in x has the features of a beginning pop; the closer a value in C is to 0, the lower that probability.
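The element-wise masking the classifier relies on can be illustrated on a tiny matrix. The values of x and the 0/1 mask (standing in for x_g) below are hypothetical:

```python
# First feature matrix x (illustrative values) and a 0/1 mask m
# standing in for the third feature matrix x_g
x = [[0.2, 0.9],
     [0.7, 0.1]]
m = [[0, 1],
     [1, 0]]

masked = [[xe * me for xe, me in zip(x_row, m_row)]
          for x_row, m_row in zip(x, m)]
# positions where the mask is 1 keep their element value; others become 0
```

The result keeps only the elements at positions marked 1 in the mask, which is exactly the "retain possible-pop positions, remove the rest" behavior described above.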
It should be noted that the three inputs received by the computing device in training the detection model all correspond to audio files of the same batch: the one or more first feature matrices contained in the input x of the encoder 110, the one or more second feature matrices x_r input to the discriminator 130, and the one or more beginning pop labels contained in the input L of the classifier 140 all correspond to the same audio files.
The detection model is not limited to being composed of the encoder 110, the generator 120, and the classifier 140; it may include more or fewer parts, which is not limited by the embodiment of the present invention.
Based on the training block diagram of the detection model shown in fig. 1, the embodiment of the invention provides a method for training an initial pop detection model. Referring to fig. 2, fig. 2 is a flowchart of a method for training a beginning pop detection model according to an embodiment of the present invention. As shown in fig. 2, the method includes steps S101 to S103.
S101, extracting audio frequency domain characteristics of an audio file by computing equipment to obtain a first characteristic matrix.
How the computing device obtains the first feature matrix may refer to the concept of extracting the audio frequency domain features, and will not be described herein again.
S102, training the detection model by the computing equipment according to the first characteristic matrix, the second characteristic matrix and the beginning plosive label, and adjusting parameters in the detection model.
The process of training the detection model by the computing device may specifically be as follows. The first feature matrix serves as the input of the encoder, and through the operations in the encoder the computing device obtains the vector mapped from the first feature matrix. That vector serves as the input of the generator, and through the operations in the generator the computing device obtains a third feature matrix. Taking the third feature matrix and the second feature matrix as inputs to the discriminator, which judges whether its input is the second feature matrix, the computing device obtains the first probability, i.e., the probability with which the discriminator judges its input to be the second feature matrix. Taking the element-wise product of the first feature matrix and the third feature matrix, together with the beginning pop label, as the input of the classifier, which distinguishes whether the audio file has a beginning pop, the computing device obtains the second probability, i.e., the probability with which the classifier judges that a beginning pop exists in the audio file.
The computing device may construct a cross-entropy loss function using the first probability and the second probability, and adjust parameters in the encoder, the generator, the discriminator, and the classifier using a back propagation algorithm.
For the loss functions used for training the discriminators and the classifiers, the concept introduction in the cross entropy loss function may be referred to, and details are not repeated here.
Since the classifier is added to the structure of the standard VAE-GAN, the expression for the loss function used to train the whole of the encoder and generator can be as follows:
L_G = E[log(D(G(z)))] + E[log(C(x * x_g))]    (1-5)
where z represents the vector output by the encoder, G(z) represents the output of the generator, i.e., the third feature matrix, and D(G(z)) represents the output of the discriminator when its input is the third feature matrix. x denotes the first feature matrix, x_g denotes the third feature matrix obtained after the second masking operation on the first feature matrix, and C(x * x_g) denotes the output of the classifier when its input is the element-wise product of x and x_g.
The structure of the standard VAE-GAN described above refers to a model containing an encoder, a generator, and a discriminator. Compared with the loss function (1-2), which contains only the error term calculated from the discriminator, the loss function (1-5) adds an error term calculated from the classifier, which helps improve the accuracy with which the generator's output reflects whether the audio file has a beginning pop.
S103, when detecting that the encoder, the generator and the discriminator in the detection model reach Nash equilibrium, the computing equipment can stop training the detection model and store the parameters in the detection model.
The training of the detection model by the computing device includes the training of the encoder, the generator, and the discriminator, and this training can in fact be regarded as a process of "gaming" among the encoder, the generator, and the discriminator.
In one aspect, the computing device takes the first feature matrix as the input of the encoder and, after the generator, obtains the third feature matrix x_g as output. The computing device may use the loss function in equation (1-5) to adjust the parameters of the neural networks in the encoder and the generator, so that x_g more accurately reflects whether the audio file has a beginning pop. When the computing device then uses x_g as an input to the discriminator, the discriminator finds it difficult to determine whether the input is x_g or x_r.
On the other hand, the computing device uses the second feature matrix x_r or the third feature matrix x_g as the input of the discriminator, and the discriminator outputs the first probability Y. The computing device may use the loss function in equation (1-3) to adjust the parameters of the neural network in the discriminator, so that the discriminator more accurately discriminates whether its input is x_g or x_r.
Reaching Nash equilibrium may be the condition for stopping the training of the detection model. Nash equilibrium indicates that the "gaming" process among the encoder, the generator, and the discriminator has achieved the best result. Specifically, Nash equilibrium can be represented by the following formula:
D(G(z)) = 0.5    (1-6)
where G(z) is the output of the generator, i.e., the third feature matrix x_g. Equation (1-6) indicates that when the computing device inputs the third feature matrix x_g into the discriminator, the first probability Y output by the discriminator is 0.5. This means the difference between x_g and x_r is so small that the discriminator cannot make a judgment, i.e., x_g reflects quite truly whether the first feature matrices have the features of a beginning pop.
In the embodiment of the present invention, reaching Nash equilibrium does not mean that the first probability Y output by the discriminator equals 0.5 at a single moment; rather, it means that, as training progresses, Y gradually stabilizes at 0.5. For example, the computing device may set a threshold range, such as 0.45-0.55 or 0.4-0.5, and stop training the detection model when the values of the first probability Y in m trainings all fall within the set range. The value of m may be a positive integer such as 10, 15, or 20; the threshold range and the value of m are not further limited in the embodiment of the present invention.
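The stopping criterion just described can be sketched as a small check over the history of first-probability values; the band [0.45, 0.55], m = 10, and the history values below are the example figures from the text, not prescribed by the patent:

```python
def should_stop(history, low=0.45, high=0.55, m=10):
    """Stop when the last m first-probability values Y all fall inside the
    [low, high] band around 0.5, i.e. Y has stabilized near 0.5."""
    if len(history) < m:
        return False
    return all(low <= y <= high for y in history[-m:])

# Y drifts toward 0.5 as training progresses (illustrative values)
ys = [0.9, 0.8, 0.6] + [0.5, 0.52, 0.48, 0.51, 0.49, 0.5, 0.47, 0.53, 0.5, 0.5]
```

With this history, `should_stop(ys)` returns True: the last ten values all sit inside the band, so training may stop.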
Because it is necessary to detect whether there is a beginning pop in an audio file, the computing device also trains the classifier while training the detection model. The classifier is mainly used to distinguish whether the features of a beginning pop exist in the first feature matrices x input to the encoder. Specifically, the computing device may adjust the parameters in the classifier using the loss function in equation (1-4). Likewise, the computing device can stop training the classifier when the "gaming" process among the encoder, the generator, and the discriminator reaches Nash equilibrium.
In the present invention, when the computing device trains the detection model with the discriminator, it can extract audio frequency domain features from a plurality of second audio files to obtain the first feature matrices. The second feature obtained by the computing device using the encoder and the generator may be the third feature matrix, which is obtained by performing the second masking operation on the first feature matrix and can reflect where the plurality of second audio files may have beginning pops.
The third feature acquired by the computing device may be a second feature matrix. The second feature matrix is obtained by performing the first masking operation on the first feature matrix, and can reflect the actual situation of whether the plurality of second audio files have beginning pops.
The first tag acquired by the computing device may be an initial pop tag. The beginning pop label may be used to mark whether beginning pops exist in the plurality of second audio files.
The computing device can obtain a trained detection model according to the steps S101 to S103. The computing device may detect whether a first audio file has an initial plosive using a trained detection model. The computing device extracts audio frequency domain features for the first audio file and inputs the audio frequency domain features of the first audio file to an encoder of the detection model.
The computing device may obtain the output of the generator in the detection model. The output of the generator is the first feature, a feature matrix obtained by performing the second masking operation on the audio frequency domain features of the first audio file, which reflects where the first audio file may have a beginning pop.
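The detection flow just described, features into the encoder, the generator's output as a mask, the classifier's output as the second probability, can be sketched with stand-in functions. Everything below is hypothetical scaffolding shaped like the pipeline, not the trained neural networks of the embodiment:

```python
# Stand-in "networks": plain functions, not trained models
def encoder(x):                    # audio frequency-domain feature -> vector z
    return [min(1.0, abs(v)) for v in x]

def generator(z):                  # z -> mask-like first feature x_g (0/1)
    return [1.0 if v >= 0.5 else 0.0 for v in z]

def classifier(masked):            # masked feature -> second probability C
    return max(masked) if masked else 0.0

def detect(x):
    z = encoder(x)
    xg = generator(z)
    masked = [a * b for a, b in zip(x, xg)]   # element-wise product x * x_g
    return classifier(masked)
```

For a feature vector with one prominent value, `detect([0.95, 0.1, 0.05])` yields a high second probability (0.95); a uniformly quiet vector like `[0.1, 0.2, 0.1]` yields 0.0.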
Referring to fig. 3, fig. 3 is a flowchart of another method for training a pop-start detection model according to an embodiment of the present invention. As shown in fig. 3, the method for training the initial pop detection model includes steps S1021 to S1024.
And S1021, the computing equipment obtains a vector mapped with the first feature matrix through operation in the encoder according to the first feature matrix.
And S1022, the calculation device obtains a third feature matrix through operation in the generator according to the vector output by the encoder.
S1023, the computing equipment judges whether the input of the computing equipment is the second feature matrix or not through a discriminator according to the second feature matrix and the third feature matrix to obtain a first probability; and the computing equipment distinguishes whether the audio file has beginning popping or not through the classifier according to the quantity product of the first characteristic matrix and the third characteristic matrix and the beginning popping label to obtain a second probability.
And S1024, calculating the value of a preset loss function by the calculating equipment according to the first probability and the second probability, and adjusting parameters in the encoder, the generator, the discriminator and the classifier by using a back propagation algorithm.
The specific expression of the loss function may refer to the concept in the cross entropy loss function and the introduction in step S102, which is not described herein again.
The computing device may save the parameters in the detection model after stopping training the detection model. Specifically, since the final purpose of the detection model is to detect whether the beginning pop exists in the audio file, the computing device may save the parameters in the encoder, the generator and the classifier, and use the structure formed by the three parts of the encoder, the generator and the classifier as the detection model for detecting whether the beginning pop exists in the music.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a beginning pop detection model according to an embodiment of the present invention. As shown in fig. 4, the detection model may comprise three parts: an encoder 110, a generator 120, and a classifier 140. The functions of the encoder, the generator and the classifier may refer to the functions of the encoder, the generator and the classifier of the detection model in fig. 1, and are not described herein again.
The detection model shown in fig. 4 is a trained model; that is, the third feature matrix x_g output by the generator in the detection model reflects quite truly whether the first feature matrix x has the features of a beginning pop, and the classifier in the detection model can accurately distinguish whether the audio file has a beginning pop.
In the detection model shown in fig. 4, after the computing device extracts audio frequency domain features from an audio file, the resulting first feature matrix x may be used as the input of the detection model. Through the operations in the encoder, the generator, and the classifier, the computing device obtains the second probability C, based on which it can distinguish whether there is a beginning pop in the audio file.
In one possible implementation, the computing device may set a threshold greater than 0 and less than 1, e.g., 0.85 or 0.9. When it detects that the second probability C is greater than or equal to the threshold, the computing device gives the detection result that the audio file has a beginning pop. The value of the threshold is not limited in the embodiment of the present invention.
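The thresholding decision is a one-line rule; the default of 0.85 below is one of the example values from the text, not a prescribed constant:

```python
def has_beginning_pop(second_probability, threshold=0.85):
    # detection result: "beginning pop present" when the classifier's
    # second probability C reaches the preset threshold
    return second_probability >= threshold
```

So `has_beginning_pop(0.9)` reports a beginning pop, while `has_beginning_pop(0.84)` does not; raising the threshold trades recall for precision.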
When the computing device detects whether the first audio file has the beginning pop by using the detection model as shown in fig. 4, a detection result can be obtained. The detection result is the second probability output by the classifier in the detection model. The second probability may characterize a probability that the first audio file has a beginning pop.
The embodiment of the present invention does not limit the audio files processed by the computing device; the audio files may include different types such as song files, opera files, and sketch files. For example, when the computing device checks song files, it can effectively find the songs with beginning pops in a song library through beginning pop detection, and the defective songs can then be replaced, corrected, or removed. Further, for the scenario where a song is to be stored into the song library, the computing device may detect whether the song to be stored has a beginning pop and thereby decide whether to store it. In this way, the quality of songs in the song library can be greatly improved.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a computing device according to an embodiment of the present invention, where the computing device is configured to execute the method for detecting pop at the beginning according to the embodiment of the present invention.
As shown in fig. 5, the structure of the computing device comprises an audio frequency domain feature extraction module 210 and an object detection module 220, wherein:
the audio frequency domain feature extraction module 210 may be configured to extract an audio frequency domain feature for the first audio file.
The target detection module 220 may be configured to obtain a pre-trained target detection model.
The object detection model includes a classifier, an encoder, and a generator. Wherein, the output of the coder is used as the input of the generator, and the output of the generator is used as the input of the classifier.
The target detection module 220 may further be configured to input the audio frequency domain feature of the first audio file into the target detection model to obtain a detection result. The detection result may indicate whether there is a beginning pop in the first audio file.
Referring to fig. 6, fig. 6 is a schematic structural diagram of another computing device according to an embodiment of the present disclosure. Specifically, the computing device includes an external input interface 310, a processor 320, a memory 330, and an output interface 340 connected by a bus. The memory 330 may store contents such as an operating system 331, a multimedia file 332, and an application file 333.
In the embodiment of the present invention, the beginning pop detection method is executed on the basis of a computer program. The application program file 333 of the computer program is stored in the memory 330 of the computing device, compiled into machine code at runtime, and then transferred to the processor 320 for execution, so that the audio frequency domain feature extraction module 210 and the target detection module 220 are logically formed in the computing device. During the operation of the beginning pop detection method, the external input interface 310 receives the first feature matrix, the second feature matrix, and the beginning pop label; the received data is transferred to the memory 330 for buffering and then input to the processor 320 for processing, and the processing result is either buffered in the memory 330 for subsequent processing or transmitted to the output interface 340 for output.
Wherein the processor 320 is configured to invoke the computer program to cause the computing device to perform the following operations:
extracting audio frequency domain features from the first audio file; obtaining a pre-trained target detection model that has been trained with the audio frequency domain features of a plurality of second audio files; and inputting the audio frequency domain features of the first audio file into the target detection model to detect whether the first audio file has a beginning pop.
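The patent does not specify how the "audio frequency domain features" are computed. A common choice, shown here purely as an illustrative assumption, is a magnitude spectrogram of the opening segment of the file obtained with a windowed short-time Fourier transform; the frame and hop sizes below are not from the patent.

```python
# Hypothetical sketch of the feature-extraction step: a magnitude
# spectrogram of the file's opening segment. Frame/hop sizes are
# illustrative assumptions, not values given by the patent.
import numpy as np

def extract_frequency_features(samples: np.ndarray,
                               frame_len: int = 512,
                               hop: int = 256) -> np.ndarray:
    """Return a (num_frames, frame_len // 2 + 1) magnitude spectrogram."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack([samples[i * hop:i * hop + frame_len] * window
                       for i in range(num_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# e.g. analyse the first second of a 16 kHz file
audio = np.random.default_rng(0).standard_normal(16000)
features = extract_frequency_features(audio)
print(features.shape)  # (61, 257)
```

A real implementation would read the decoded PCM samples of the first audio file instead of random noise.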
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only intended to illustrate preferred embodiments of the present invention and is not intended to limit the scope of the invention, which is defined by the appended claims.

Claims (6)

1. A beginning pop detection method, characterized by comprising the following steps:
extracting, by a computing device, audio frequency domain features from a first audio file;
the computing device obtaining a pre-trained target detection model; the target detection model is trained with audio frequency domain features of a plurality of second audio files; the target detection model includes: a classifier, an encoder, and a generator; wherein the output of the encoder is the input of the generator and the output of the generator is the input of the classifier;
the computing device inputting the audio frequency domain features of the first audio file into the encoder of the target detection model, and obtaining a first feature using the encoder and the generator; the first feature is a mask of the audio frequency domain features of the first audio file;
the computing device inputting the product of the first feature and the audio frequency domain features of the first audio file into the classifier to obtain a detection result; the detection result indicates whether the first audio file has a beginning pop.
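The inference path of claim 1 can be sketched as follows. The encoder, generator, and classifier here are stand-ins with fixed random weights and illustrative dimensions; a real system would use trained neural networks, and all names are assumptions.

```python
# Minimal numpy sketch of the claim-1 inference path:
# encoder -> generator (mask) -> element-wise product -> classifier.
# Weights are random placeholders, NOT a trained model.
import numpy as np

rng = np.random.default_rng(42)
FEAT_DIM, LATENT_DIM = 257, 32

W_enc = rng.standard_normal((FEAT_DIM, LATENT_DIM)) * 0.05   # "encoder"
W_gen = rng.standard_normal((LATENT_DIM, FEAT_DIM)) * 0.05   # "generator"
w_cls = rng.standard_normal(FEAT_DIM) * 0.05                 # "classifier"

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def detect_beginning_pop(features: np.ndarray) -> bool:
    latent = np.tanh(features @ W_enc)            # encoder output feeds the generator
    mask = sigmoid(latent @ W_gen)                # generator output: a mask in [0, 1]
    masked = mask * features                      # product of first feature and features
    score = sigmoid(masked.mean(axis=0) @ w_cls)  # classifier on the masked features
    return bool(score > 0.5)                      # True => beginning pop detected

frames = np.abs(rng.standard_normal((61, FEAT_DIM)))  # stand-in spectrogram
print(detect_beginning_pop(frames))
```

The key structural point matches the claim: the mask produced by the encoder-generator pair is multiplied element-wise with the input features before classification.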
2. The method of claim 1, wherein the computing device trains the target detection model using a discriminator, and wherein the training of the target detection model comprises:
the computing device inputting the audio frequency domain features of the plurality of second audio files into the encoder of the target detection model, and obtaining a second feature using the encoder and the generator; the second feature is a mask of the beginning pops of the plurality of second audio files;
the computing device obtaining a third feature and a first label; the third feature indicates the exact positions of the beginning pops of the plurality of second audio files, and the first label indicates whether the beginning pops exist in the plurality of second audio files;
the computing device inputting the second feature and the third feature into the discriminator, and inputting the first label into the classifier;
the computing device training the encoder and the generator with the output of the discriminator and the output of the classifier;
the computing device training the classifier with the first label and the output of the classifier.
3. The method of claim 2, wherein the training process of the target detection model comprises:
the computing device detecting whether the network formed by the encoder, the generator, and the discriminator reaches Nash equilibrium;
if so, the computing device stopping training the target detection model to obtain the trained target detection model.
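The patent does not say how Nash equilibrium is detected. One common proxy in GAN training, offered here only as an assumption, is to stop when the discriminator scores roughly 0.5 on both real and generated inputs (it can no longer tell them apart); the tolerance below is arbitrary.

```python
# Illustrative stopping check for claim 3. The 0.5-score criterion and
# the tolerance are assumptions, not taken from the patent.
import numpy as np

def reached_equilibrium(d_real: np.ndarray, d_fake: np.ndarray,
                        tol: float = 0.05) -> bool:
    """True when the discriminator scores ~0.5 on real and generated masks."""
    return (abs(float(d_real.mean()) - 0.5) < tol and
            abs(float(d_fake.mean()) - 0.5) < tol)

print(reached_equilibrium(np.array([0.51, 0.49]), np.array([0.48, 0.52])))  # True
print(reached_equilibrium(np.array([0.9, 0.8]),  np.array([0.2, 0.1])))    # False
```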
4. A beginning pop detection apparatus, characterized by comprising:
an audio frequency domain feature extraction module, configured to extract audio frequency domain features from a first audio file;
a target detection module, configured to obtain a pre-trained target detection model; the target detection model is trained with audio frequency domain features of a plurality of second audio files; the target detection model includes: a classifier, an encoder, and a generator; wherein the output of the encoder is the input of the generator and the output of the generator is the input of the classifier;
the target detection module being further configured to input the audio frequency domain features of the first audio file into the encoder of the target detection model, and obtain a first feature using the encoder and the generator; the first feature is a mask of the audio frequency domain features of the first audio file;
the target detection module inputting the product of the first feature and the audio frequency domain features of the first audio file into the classifier to obtain a detection result; the detection result indicates whether the first audio file has a beginning pop.
5. A beginning pop detection apparatus, characterized by comprising:
an external input interface, a processor, a memory for storing a computer program, and an output interface, the external input interface, the memory, and the output interface being coupled to the processor by a bus;
the external input interface being configured to receive a first audio file;
the output interface being configured to output a result of detecting whether the first audio file has a beginning pop;
the processor being configured to invoke the computer program to cause the apparatus to:
extract audio frequency domain features from the first audio file; obtain a pre-trained target detection model; the target detection model is trained with audio frequency domain features of a plurality of second audio files; the target detection model includes: a classifier, an encoder, and a generator; wherein the output of the encoder is the input of the generator and the output of the generator is the input of the classifier; input the audio frequency domain features of the first audio file into the encoder of the target detection model, and obtain a first feature using the encoder and the generator; the first feature is a mask of the audio frequency domain features of the first audio file; input the product of the first feature and the audio frequency domain features of the first audio file into the classifier to obtain a detection result; the detection result indicates whether the first audio file has a beginning pop.
6. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of claims 1 to 3.
CN202010044525.0A 2020-01-14 2020-01-14 Beginning popping detection method and device Active CN111312285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010044525.0A CN111312285B (en) 2020-01-14 2020-01-14 Beginning popping detection method and device


Publications (2)

Publication Number Publication Date
CN111312285A CN111312285A (en) 2020-06-19
CN111312285B true CN111312285B (en) 2023-02-14

Family

ID=71146761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010044525.0A Active CN111312285B (en) 2020-01-14 2020-01-14 Beginning popping detection method and device

Country Status (1)

Country Link
CN (1) CN111312285B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113488069A (en) * 2021-07-06 2021-10-08 浙江工业大学 Method and device for quickly extracting high-dimensional voice features based on generative countermeasure network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109346102A (en) * 2018-09-18 2019-02-15 腾讯音乐娱乐科技(深圳)有限公司 Detection method, device and the storage medium of audio beginning sonic boom
CN109952580A (en) * 2016-11-04 2019-06-28 易享信息技术有限公司 Coder-decoder model based on quasi- Recognition with Recurrent Neural Network
CN110265064A (en) * 2019-06-12 2019-09-20 腾讯音乐娱乐科技(深圳)有限公司 Audio sonic boom detection method, device and storage medium


Also Published As

Publication number Publication date
CN111312285A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN106486131B (en) A kind of method and device of speech de-noising
JP2019522810A (en) Neural network based voiceprint information extraction method and apparatus
Patel et al. Speech recognition and verification using MFCC & VQ
US5774836A (en) System and method for performing pitch estimation and error checking on low estimated pitch values in a correlation based pitch estimator
US8655656B2 (en) Method and system for assessing intelligibility of speech represented by a speech signal
CN110706692A (en) Training method and system of child voice recognition model
CN111445900A (en) Front-end processing method and device for voice recognition and terminal equipment
US10522160B2 (en) Methods and apparatus to identify a source of speech captured at a wearable electronic device
Tsenov et al. Speech recognition using neural networks
CN112735454A (en) Audio processing method and device, electronic equipment and readable storage medium
Zheng et al. When automatic voice disguise meets automatic speaker verification
Patel et al. Significance of source–filter interaction for classification of natural vs. spoofed speech
US5696873A (en) Vocoder system and method for performing pitch estimation using an adaptive correlation sample window
CN114333865A (en) Model training and tone conversion method, device, equipment and medium
EP0882287A1 (en) System and method for error correction in a correlation-based pitch estimator
CN111312285B (en) Beginning popping detection method and device
CN105283916B (en) Electronic watermark embedded device, electronic watermark embedding method and computer readable recording medium
CN109741761B (en) Sound processing method and device
Silsbee Sensory integration in audiovisual automatic speech recognition
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
CN113241054B (en) Speech smoothing model generation method, speech smoothing method and device
CN114299918A (en) Acoustic model training and speech synthesis method, device and system and storage medium
CN115910091A (en) Method and device for separating generated voice by introducing fundamental frequency clues
Qiu et al. A voice cloning method based on the improved hifi-gan model
CN114512133A (en) Sound object recognition method, sound object recognition device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant