CN115273826A - Singing voice recognition model training method, singing voice recognition method and related device - Google Patents

Singing voice recognition model training method, singing voice recognition method and related device

Info

Publication number
CN115273826A
Authority
CN
China
Prior art keywords
audio
training
singing voice
voice recognition
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210720102.5A
Other languages
Chinese (zh)
Inventor
龚韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202210720102.5A priority Critical patent/CN115273826A/en
Publication of CN115273826A publication Critical patent/CN115273826A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech

Abstract

The application discloses a singing voice recognition model training method, a singing voice recognition method and a related device. The training method includes: acquiring training audio and a corresponding audio label; extracting audio features of the training audio to obtain training features; inputting the training features into an initial model to obtain a training recognition result, wherein the initial model includes a first convolution layer and a second convolution layer, each provided with a rectangular convolution kernel, the long side of the first rectangular convolution kernel being arranged along the frequency axis and the long side of the second rectangular convolution kernel being arranged along the time axis; generating a loss value from the training recognition result and the audio label, and adjusting the parameters of the initial model using the loss value; and, if a preset completion condition is met, determining the parameter-adjusted initial model as the singing voice recognition model. The singing voice recognition model obtained by the method has strong resistance to noise interference.

Description

Singing voice recognition model training method, singing voice recognition method and related device
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a singing voice recognition model training method, a singing voice recognition method, and a related apparatus.
Background
With the development of the multimedia industry and the rise of short video, music is no longer consumed only by listening to songs; it is consumed by the public in richer and more varied forms. Music appears as soundtrack or background music in scenes such as live broadcasts, short videos and outdoor recordings, and in such scenes the singing voice in the audio waveform is mixed with steady-state noise, environmental noise, transient noise, human-voice noise and other sounds. Moreover, different recording devices differ in sound-pickup characteristics such as loudness and sound field, so the spectral distribution of the audio also differs. Singing Voice Detection (SVD) is a task in the field of MIR (Music Information Retrieval), but in real environments its robustness is low, its anti-interference capability is weak, and its performance degrades rapidly in complex noise scenarios.
Disclosure of Invention
In view of the above, an object of the present application is to provide a singing voice recognition model training method, a singing voice recognition method and a related device, so that the singing voice recognition model has strong anti-noise interference capability and can accurately distinguish the singing voice.
In order to solve the above technical problem, in a first aspect, the present application provides a method for training a singing voice recognition model, including:
acquiring training audio and a corresponding audio label; wherein the training audio comprises noise-containing audio interfered by noise, and the audio tag is used for indicating that the training audio is singing audio or non-singing audio;
extracting the audio features of the training audio to obtain training features;
inputting the training characteristics into an initial model to obtain a training recognition result; the initial model comprises a first convolution layer and a second convolution layer, wherein the first convolution layer and the second convolution layer are provided with rectangular convolution kernels, the long sides of the first rectangular convolution kernels are arranged along the direction of a frequency axis, and the long sides of the second rectangular convolution kernels are arranged along the direction of a time axis;
generating a loss value by using the training recognition result and the audio label, and performing parameter adjustment processing on the initial model by using the loss value;
and if the preset completion condition is detected to be met, determining the initial model after the parameters are adjusted as the singing voice recognition model.
Optionally, the obtaining training audio includes:
acquiring initial training audio;
and carrying out dynamic range control processing on the initial training audio to obtain the training audio.
Optionally, the obtaining training audio includes:
acquiring initial training audio;
and determining a preset audio length, and performing fragmentation processing and/or zero padding processing on the initial training audio based on the preset audio length to obtain the training audio.
Optionally, the generating process of the audio tag includes:
determining an audio category corresponding to the training audio;
generating the audio tag based on the audio category.
Optionally, the extracting the audio feature of the training audio to obtain a training feature includes:
and carrying out Mel frequency spectrum extraction processing and/or Mel cepstrum coefficient extraction processing on the training audio by taking an audio frame as granularity to obtain the training characteristics.
Optionally, after the extracting the audio feature of the training audio to obtain the training feature, the method further includes:
dividing the training features and the corresponding audio tags into a training set and a verification set;
correspondingly, the inputting the training features into the initial model to obtain a training recognition result includes:
inputting training characteristics contained in the training set into an initial model to obtain a training recognition result;
if the preset completion condition is met, determining the initial model after parameter adjustment as the singing voice recognition model comprises the following steps:
if the initial model meets the preset training condition, carrying out identification accuracy verification on the initial model after parameter adjustment by using the verification set;
if the recognition accuracy of the initial model after the parameter adjustment does not meet the preset accuracy condition, returning to the step of inputting the training features contained in the training set into the initial model to obtain a training recognition result; and determining the initial model after the parameter adjustment as a singing voice recognition model until the recognition accuracy of the initial model after the parameter adjustment meets the preset accuracy condition.
In a second aspect, the present application also provides a singing voice recognition method, including:
acquiring audio to be tested;
extracting the audio features of the audio to be detected to obtain the features to be detected;
inputting the characteristics to be detected into a singing voice recognition model to obtain a singing voice recognition result; the singing voice recognition model is obtained based on the singing voice recognition model training method.
Optionally, the extracting the audio features of the audio to be detected to obtain the features to be detected includes:
carrying out feature extraction processing on the audio to be detected to obtain initial audio features;
and determining a preset audio length, and performing fragmentation processing and/or zero filling processing on the initial audio features based on the preset audio length to obtain the features to be detected.
Optionally, if there are a plurality of features to be detected, the inputting the features to be detected into the singing voice recognition model to obtain the singing voice recognition result includes:
inputting each feature to be detected into the singing voice recognition model respectively to obtain a segmentation recognition result;
and carrying out fusion processing on the segmentation recognition results to obtain the singing voice recognition result.
Optionally, the method further comprises:
determining the starting and stopping time range of each feature to be tested relative to the audio frequency to be tested;
sequencing each fragment identification result by utilizing the starting and stopping time range to obtain a first sequence;
carrying out same classification boundary fusion processing on the first sequence to obtain a second sequence;
and determining the singing voice endpoint corresponding to the audio to be tested based on the second sequence.
In a third aspect, the present application further provides an electronic device, comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the singing voice recognition model training method and/or the singing voice recognition method.
In a fourth aspect, the present application further provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the above-mentioned singing voice recognition model training method, and/or the above-mentioned singing voice recognition method.
Therefore, the present application uses special training data and a special initial model to achieve singing voice recognition on noisy audio, with stronger robustness and anti-interference capability. Specifically, noisy audio is included in the training audio, so that the model can use the training audio to learn how to distinguish singing audio under noise interference. The initial model includes a first convolutional layer and a second convolutional layer whose convolution kernels are rectangular: the kernel whose long side is arranged along the frequency axis can capture frequency-domain information such as pitch and vocal range over a larger frequency range, and the kernel whose long side is arranged along the time axis can capture time-domain information such as rhythm and melody over a larger time range. With the first and second convolutional layers, the initial model can obtain more information, which helps it resist noise interference and classify accurately. The singing voice recognition model obtained after training therefore has strong resistance to noise interference and can accurately identify singing voice.
In addition, the application also provides a singing voice recognition method and a related device, which have the same beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic diagram of a hardware composition framework to which a singing voice recognition model training method and/or a singing voice recognition method provided in an embodiment of the present application are/is applied;
fig. 2 is a schematic diagram of a hardware composition framework to which another singing voice recognition model training method and/or a singing voice recognition method provided in the embodiment of the present application are/is applied;
fig. 3 is a schematic flowchart of a singing voice recognition model training method according to an embodiment of the present application;
FIG. 4 is a diagram illustrating the effect of dynamic range control provided by an embodiment of the present application;
FIG. 5 is a diagram illustrating a first convolution kernel and a second convolution kernel according to an embodiment of the present application;
fig. 6 is a diagram illustrating an effect of positioning a singing voice starting point according to an embodiment of the present application;
fig. 7 is a flow chart of singing voice recognition provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For convenience of understanding, a hardware composition framework used in the singing voice recognition model training method and/or the singing voice recognition method provided in the embodiment of the present application is introduced. Referring to fig. 1, fig. 1 is a schematic diagram of a hardware composition framework for a singing voice recognition model training method and/or a singing voice recognition method provided in an embodiment of the present application. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
Wherein, the processor 101 is configured to control the overall operation of the electronic device 100 to complete all or part of the steps of the singing voice recognition model training method and/or the singing voice recognition method; the memory 102 is used to store various types of data to support operation at the electronic device 100, and such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The memory 102 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. In the present embodiment, the memory 102 stores therein at least programs and/or data for realizing the following functions:
acquiring a training audio and a corresponding audio label; wherein the training audio comprises noisy audio interfered by noise, and the audio label is used for indicating that the training audio is singing audio or non-singing audio;
extracting the audio features of the training audio to obtain training features;
inputting the training characteristics into an initial model to obtain a training recognition result; the initial model comprises a first convolution layer and a second convolution layer, wherein the first convolution layer and the second convolution layer are provided with rectangular convolution kernels, the long sides of the first rectangular convolution kernels are arranged along the direction of a frequency axis, and the long sides of the second rectangular convolution kernels are arranged along the direction of a time axis;
generating a loss value by using the training recognition result and the audio label, and performing parameter adjustment processing on the initial model by using the loss value;
and if the initial model after parameter adjustment is detected to meet the preset completion condition, determining the initial model after parameter adjustment as the singing voice recognition model.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component further comprises at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, mouse, or buttons; these buttons may be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. Wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them, so the corresponding communication component 105 may include a Wi-Fi part, a Bluetooth part, and an NFC part.
The electronic device 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for executing the singing voice recognition model training method and/or the singing voice recognition method.
Of course, the structure of the electronic device 100 shown in fig. 1 does not constitute a limitation to the electronic device in the embodiment of the present application, and in practical applications, the electronic device 100 may include more or less components than those shown in fig. 1, or some components may be combined.
It is to be understood that, in the embodiment of the present application, the number of the electronic devices is not limited, and it may be that a plurality of electronic devices cooperate to complete the singing voice recognition model training method and/or the singing voice recognition method. In a possible implementation manner, please refer to fig. 2, and fig. 2 is a schematic diagram of a hardware composition framework to which another singing voice recognition model training method and/or the singing voice recognition method provided in the embodiments of the present application are applicable. As can be seen from fig. 2, the hardware composition framework may include: the first electronic device 11 and the second electronic device 12 are connected to each other through a network 13.
In the embodiment of the present application, the hardware structures of the first electronic device 11 and the second electronic device 12 may refer to the electronic device 100 in fig. 1. That is, it can be understood that there are two electronic devices 100 in the present embodiment, and the two devices perform data interaction. Further, in the embodiment of the present application, the form of the network 13 is not limited, that is, the network 13 may be a wireless network (such as WIFI, bluetooth, etc.), or may also be a wired network.
The first electronic device 11 and the second electronic device 12 may be the same kind of electronic device, for example, both servers; or they may be different kinds of electronic devices, for example, the first electronic device 11 may be a smartphone or other smart terminal and the second electronic device 12 may be a server. In one possible embodiment, a server with high computing power may be used as the second electronic device 12 to improve data processing efficiency and reliability, and thus the processing efficiency of model training and/or singing voice recognition, while a low-cost, widely available smartphone is used as the first electronic device 11 to realize interaction between the second electronic device 12 and the user. The interaction process may be as follows: the smartphone acquires training audio from the server and plays it, collects an audio label, and sends the audio label to the server, which performs the subsequent model training steps using the received audio label; after the server generates the singing voice recognition model, it obtains the audio to be detected sent by the smartphone and performs singing voice recognition on that audio.
Specifically, please refer to fig. 3, fig. 3 is a schematic flow chart of a singing voice recognition model training method according to an embodiment of the present disclosure. The method in this embodiment comprises:
s101: training audio and corresponding audio tags are obtained.
The training audio includes noisy audio interfered with by noise, and may also include clean audio not interfered with by noise. Noise interference means that, in addition to the singing voice, the training audio contains steady-state noise, environmental noise, transient noise, human-voice noise and other sounds, so that the noisy audio is more heavily disturbed, which makes it harder to identify whether it is singing-voice audio. The audio content of the training audio is not limited; considering the diversity of real noise scenarios, audio data can be collected in the corresponding scenarios for the different kinds of audio labels. For example, singing-voice audio may include singing with accompaniment and unaccompanied singing: audio data for singing with accompaniment may come from various vocal music, singing accompanied by various instruments, concerts, KTV singing and the like, while audio data for unaccompanied singing may come from dry vocal stems separated from accompaniment, singing practice and teaching, and the like. Non-singing audio is even more varied and may include, for example, speech with background music, spoken audio, pure-music audio, pure-noise audio, and so on. Audio data for speech with background music may come from audiobooks, movies, dramas, short videos, variety shows and the like; audio data for spoken audio may come from recitation, news, conferences, chatting, comedy sketches, crosstalk and the like; audio data for pure-music audio may come from instrumental solos, symphonies, accompaniments, advertising soundtracks and the like; and audio data for pure-noise audio may come from white noise (underwater sound, rain sound, etc.), laughter, applause, eating sounds and the like. The data format of the training audio is also not limited: it may be a ts code stream (Transport Stream) acquired from streaming media, or an audio file acquired as the training audio, in formats such as mp3, m4a, or wav.
The audio label is used to indicate whether the training audio is singing-voice audio or non-singing-voice audio. It should be noted that the audio label is not used to indicate whether the training audio is noisy audio; whether the training audio contains noise is unrelated to whether it is singing-voice audio, so noisy audio may be either singing-voice audio or non-singing-voice audio.
The present application does not limit the specific way the training audio is acquired. In one embodiment, the training audio and the audio labels may be generated in advance, stored at a designated location, and read from that location when needed. In another embodiment, the training audio may be generated when training the singing voice recognition model. Specifically, since the length of the model input is generally fixed, the length of the training audio may be set: when acquiring the training audio, an initial training audio is acquired first, a preset audio length (the length each training audio should have) is determined, and slicing and/or zero padding is performed on the initial training audio based on the preset audio length to obtain the training audio. Slicing means dividing the initial training audio into several segments when its length is greater than the preset audio length, with each segment having the preset audio length; the segments may or may not overlap, i.e., the frame-shift step used during slicing may be less than, equal to, or greater than the preset audio length. Zero padding means appending zero-valued data when the length of the initial training audio, or of a segment obtained after slicing (usually the last segment), is less than the preset audio length, so that its length reaches the preset audio length. In one specific embodiment, the preset audio length may be denoted dur' and may be given in milliseconds (ms).
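As a minimal illustration of the slicing and zero-padding just described (the function name, parameter names and defaults are assumptions, not taken from the patent), a waveform can be cut into fixed-length segments like this:

```python
import numpy as np

def slice_audio(signal, sr, dur_ms, hop_ms=None):
    """Split `signal` into segments of dur_ms milliseconds; the step between
    segment starts (hop_ms) may be smaller than, equal to, or larger than dur_ms."""
    seg_len = int(sr * dur_ms / 1000)
    step = seg_len if hop_ms is None else int(sr * hop_ms / 1000)
    segments = []
    for start in range(0, len(signal), step):
        seg = signal[start:start + seg_len]
        if len(seg) < seg_len:                      # zero-pad the last (short) segment
            seg = np.pad(seg, (0, seg_len - len(seg)))
        segments.append(seg)
        if start + seg_len >= len(signal):          # stop once the end of the audio is covered
            break
    return np.stack(segments)
```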
In another embodiment, the different sound recording devices may affect the loudness of the audio, sound field, and other sound receiving effects, thereby causing different spectral distributions. For example, when the recording device is far away from the sound source, the collected audio signal has low loudness and far sound field, and some sound details are not easy to capture; on the contrary, when the recording device is too close to the sound source, problems of too large loudness, glitches, distortion and the like can occur. To solve this problem, the present application may process the audio spectral distribution using an audio Dynamic Range Control (DRC) technique. Specifically, an initial training audio is obtained first, in this embodiment, the initial training audio refers to an audio without DRC processing, and the dynamic range control processing is performed on the initial training audio to obtain a training audio.
While the specific DRC procedure is not limited, in one embodiment a digitally sampled signal x (i.e., the initial training audio) is first obtained with an audio tool (e.g., the librosa audio tool), and the linear signal x is converted to a decibel (dB) signal x_db = 20 × log10(x). The dB signal is passed through the static characteristic equation (i.e., the DRC static curve) to obtain x_sc, and the difference gives the gain curve g_c = x_sc - x_db. The inflection points (knees) of the gain curve are then smoothed to obtain a curve g_s; gain compensation is applied to g_s to obtain the DRC gain control curve g_m; finally, the curve is converted from dB back to the linear domain, g_lin = 10^(g_m / 20). The computed gain control signal g_lin is applied to the original audio signal x to obtain the dynamically adjusted audio signal y = g_lin × x. Through DRC processing, different degrees of gain adjustment can be applied to signal amplitudes at different stages (the noise-floor stage, the medium-amplitude stage, the large-amplitude stage, etc.), so that the sound is steadier and softer. Referring to fig. 4, fig. 4 is a diagram illustrating the effect of dynamic range control according to an embodiment of the present application: the signal x is on the left of the multiplication sign, the gain signal g_lin on the right, and the processed signal y after the equals sign; y is the training audio in this embodiment. It can be seen that with DRC processing, the larger-amplitude parts of the x signal are attenuated while the smaller-amplitude parts are enhanced.
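A rough sketch of the DRC pipeline just described might look as follows; the hard-knee static curve, the moving-average smoothing and all numeric defaults are assumptions standing in for the patent's unspecified static characteristic:

```python
import numpy as np
import librosa

def drc(path, threshold_db=-20.0, ratio=4.0, smooth_len=1024, makeup_db=0.0):
    x, sr = librosa.load(path, sr=None)
    eps = 1e-10
    x_db = 20.0 * np.log10(np.abs(x) + eps)             # linear signal -> dB
    # static characteristic: levels above the threshold are compressed by `ratio`
    x_sc = np.where(x_db > threshold_db,
                    threshold_db + (x_db - threshold_db) / ratio,
                    x_db)
    g_c = x_sc - x_db                                    # gain curve (dB)
    kernel = np.ones(smooth_len) / smooth_len
    g_s = np.convolve(g_c, kernel, mode="same")          # smooth the knee transitions
    g_m = g_s + makeup_db                                # gain compensation
    g_lin = 10.0 ** (g_m / 20.0)                         # back to linear gain
    return g_lin * x, sr                                 # y = g_lin * x
```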
The specific way the audio label is generated is not limited in this application. In one embodiment, each training audio can be played and labeled manually. In another embodiment, the audio labels may be generated from the category of the training audio; the categories may include the six categories of singing with music, singing, talking with music, spoken, pure music and pure noise, and after determining the category, singing with music and singing may be mapped to the singing-voice audio label and the other categories to the non-singing-voice audio label, as described above. The category can be determined from the class label of the training audio, that is, training data with pre-existing class labels can be used and those class labels mapped to the audio labels of this application to obtain the training labels. The specific form of the audio label is not limited. In a first embodiment, the category labels such as singing with music and singing may be reconfigured into the same label, e.g., label 1, representing singing-voice audio. Alternatively, in a second embodiment, they may remain distinct labels, but the labels corresponding to the two categories both belong to the same audio label, i.e., singing-voice audio; for example, the content of the category labels is kept unchanged and a correspondence between the two singing-related category labels and the singing-voice audio label is created. It can be understood that, if the second embodiment is adopted, the subsequent singing voice recognition model can classify audio at the granularity of the category labels, achieving finer-grained detection, and the audio category can be mapped to a singing/non-singing label during post-processing.
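For illustration only, the mapping from the six category labels to the binary singing/non-singing audio label could be kept as a simple lookup table; the English category names below are assumed renderings:

```python
# Hypothetical category-to-audio-label mapping (1 = singing-voice audio, 0 = non-singing).
CATEGORY_TO_AUDIO_LABEL = {
    "singing_with_music": 1,
    "singing": 1,              # unaccompanied singing
    "talking_with_music": 0,
    "spoken": 0,
    "pure_music": 0,
    "pure_noise": 0,
}

def audio_label(category: str) -> int:
    return CATEGORY_TO_AUDIO_LABEL[category]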
S102: and extracting the audio features of the training audio to obtain the training features.
After the training audio is obtained, the corresponding audio features, i.e., the training features, can be obtained through feature extraction. In one embodiment, Mel spectrum extraction and/or Mel cepstral coefficient extraction is performed on the training audio at audio-frame granularity to obtain the training features. The Mel spectrum is a spectrogram obtained by pre-emphasizing, framing and windowing the original audio signal, applying a short-time Fourier transform to each frame, and then passing the result through a Mel filter bank; its abscissa represents time and its ordinate represents frequency. Mel Frequency Cepstral Coefficients (MFCC) are another spectrogram-like feature, namely the cepstral coefficients on the Mel frequency scale, and are particularly useful for characterizing timbre.
Specifically, the Mel features and the MFCC features may be extracted from the training audio at audio-frame granularity using the librosa tool. The Mel features are close to the nonlinear perception of human hearing, which helps the neural network analyze the audio from the perspective of auditory frequency, while the spectral envelope contained in the MFCC features carries timbre-related information, so the neural network can learn to distinguish different sound components such as human voice and noise from the timbre dimension. This embodiment does not limit the specific form of the two features; for example, the Mel feature may be a 128 × N matrix F_mel and the MFCC feature a 20 × N matrix F_mfcc, where
N = floor(dur × sr / (1000 × hop))
is the number of audio frames (rounded down), dur is the duration of the audio in ms (dur = dur' for training audio), sr is the sampling rate of the audio in Hz, and hop is the frame shift, i.e., the number of samples between successive frames. Concatenating the Mel features and the MFCC features yields a 148 × N feature F_input, which is the training feature.
The training features may be saved in a file format, for example as ".npy" files.
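A sketch of the Mel + MFCC extraction and concatenation described above, using librosa; n_mels=128 and n_mfcc=20 follow the 128 × N and 20 × N dimensions given in the text, while the hop length and the dB conversion are assumptions:

```python
import numpy as np
import librosa

def extract_features(y: np.ndarray, sr: int, hop: int = 512) -> np.ndarray:
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=hop)
    mel_db = librosa.power_to_db(mel)                                   # 128 x N
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)  # 20 x N
    f_input = np.concatenate([mel_db, mfcc], axis=0)                    # 148 x N training feature
    return f_input

# The result can then be saved for training, e.g. np.save("features.npy", f_input).
```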
s103: and inputting the training characteristics into the initial model to obtain a training recognition result.
The initial model is the singing voice recognition model before parameter adjustment is finished, and may specifically be a model built on a convolutional neural network. The initial model includes a first convolutional layer and a second convolutional layer, each provided with a rectangular convolution kernel; the long side of the first rectangular convolution kernel is arranged along the frequency axis and the long side of the second rectangular convolution kernel is arranged along the time axis. That is, the initial model has two mutually orthogonal two-dimensional convolutional layers of specific scales, which may be, for example, a 32 × 1 time-domain convolutional layer and a 7 × 64 frequency-domain convolutional layer; on the time axis and the frequency axis of the audio feature, they can learn, to different degrees, time-domain information (e.g., rhythm, melody) and frequency-domain information (e.g., pitch, vocal range) respectively. In addition to the first and second convolutional layers, further convolutional layers in different combinations are stacked in sequence to form the initial model, so that the context and correlations of the audio content are better understood; finally, C neural network nodes form the output, where C is the number of audio label categories: C is 2 when there are two audio label categories (singing-voice audio and non-singing-voice audio) and 6 when there are six (singing with music, singing, talking with music, spoken, pure music, pure noise).
Referring to fig. 5, fig. 5 is a schematic diagram of a first convolution kernel and a second convolution kernel provided in an embodiment of the present application, in which a rectangle labeled with "frequency domain convolution layer" represents a convolution kernel of the frequency domain convolution layer, i.e., a first rectangular convolution kernel, and a rectangle labeled with "time domain convolution layer" represents a convolution kernel of the time domain convolution layer, i.e., a second rectangular convolution kernel.
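One possible rendering of the initial model in PyTorch (the framework itself is an assumption; the patent does not name one). The kernel shapes are taken from the example sizes above, interpreted here as (frequency, time); the parallel arrangement, channel counts and trunk are placeholders:

```python
import torch
import torch.nn as nn

class SVDNet(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # first convolutional layer: long side of the kernel along the frequency axis
        self.freq_conv = nn.Conv2d(1, 16, kernel_size=(64, 7), padding=(32, 3))
        # second convolutional layer: long side of the kernel along the time axis
        self.time_conv = nn.Conv2d(1, 16, kernel_size=(1, 32), padding=(0, 16))
        self.trunk = nn.Sequential(
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, num_classes)      # C output nodes (2 or 6 classes)

    def forward(self, x):                           # x: (batch, 1, 148, N)
        f = self.freq_conv(x)
        t = self.time_conv(x)
        # crop to a common shape before concatenation, since the paddings differ slightly
        h = min(f.shape[2], t.shape[2]); w = min(f.shape[3], t.shape[3])
        z = torch.cat([f[:, :, :h, :w], t[:, :, :h, :w]], dim=1)
        z = self.trunk(z).flatten(1)
        return self.head(z)                         # logits Z; softmax is applied in the loss
```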
In the forward propagation process, the audio features F_input are input into the initial model individually or in batches and undergo matrix operations with the parameters of the initial model to produce an output vector Z = [z_1, z_2, …, z_C], where z_i denotes the output value of the i-th node. The softmax activation function converts the output values into a probability distribution P = [p_1, p_2, …, p_C] whose values lie in the range [0, 1] and sum to 1:
p_i = exp(z_i) / Σ_{c=1}^{C} exp(z_c)
where C in the activation function is the number of output nodes, i.e., the number of class labels.
S104: and generating a loss value by using the training recognition result and the audio label, and performing parameter adjustment processing on the initial model by using the loss value.
The application does not limit the specific way the loss value is computed; it may be computed with a cross-entropy loss function. Specifically, the cross-entropy loss Loss_softmax is computed from the probability p_i of each class and the label category information l_i (i.e., the audio label; specifically, the six category labels above, or the two audio label categories), where the label category information can be written as L = [l_1, l_2, …, l_C]:
Loss_softmax = - Σ_{i=1}^{C} l_i × log(p_i)
The back-propagation process performs chain-rule differentiation from the model output loss value Loss_softmax back towards the input F_input and updates the parameters of the initial model. After M forward- and back-propagation passes (the specific size of M is not limited), the initial model gradually learns, from the audio features, the commonalities and differences between singing-voice and non-singing-voice audio labels.
S105: and if the preset completion condition is detected to be met, determining the initial model after the parameters are adjusted as the singing voice recognition model.
The preset completion condition is used to indicate that the initial model is sufficiently trained, and the number and the specific content of the preset completion condition are not limited, for example, the preset completion condition may be a condition for limiting the recognition accuracy of the initial model, or may be a condition for limiting the training duration of the initial model, or may be a condition for limiting the training round of the initial model. When one, a specified number or all of the preset completion conditions are satisfied, the initial model after parameter adjustment can be determined as the singing voice recognition model, indicating that the model training process is completed.
In the embodiment of the application, in order to ensure the identification accuracy of the initial model after parameter adjustment in practical application, the training of the model can be divided into two parts of model parameter adjustment and model identification accuracy verification. Thus, after the training audio and the corresponding audio tags are obtained, the audio tagged training audio may be divided into a training set and a validation set. The training features with audio labels may also be divided into training and validation sets after they are obtained.
For example, a data set C_total can be constructed from all labeled data and randomly divided, at a preset ratio (e.g., 9:1), into a training set of C_train samples and a validation set (i.e., validation data) of C_val samples. In the process of dividing the sets, the numbers of audios for the 6 category labels (i.e., singing with music, singing, talking with music, spoken, pure music and pure noise) in the validation set can be kept as equal as possible.
The data with the label may be training audio with a label, at this time, after the training audio and the corresponding audio label are obtained, the training audio with the label is divided into a training set and a verification set, and then according to the operation of S102, the audio features of the training audio included in the training set and the audio features of the training audio included in the verification set are respectively extracted.
The data with the label can also be training features with the label, and at this time, after the audio features of the training audio are extracted to obtain the training features, the training features with the label are divided into a training set and a verification set.
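One possible way to perform the labeled-data split described above (a 9:1 ratio while keeping the six categories balanced) is a stratified random split, e.g. with scikit-learn; the use of scikit-learn here is an assumption:

```python
from sklearn.model_selection import train_test_split

def split_dataset(features, labels, val_ratio=0.1, seed=42):
    x_train, x_val, y_train, y_val = train_test_split(
        features, labels,
        test_size=val_ratio,
        stratify=labels,        # keep per-category counts as balanced as possible
        random_state=seed,
    )
    return (x_train, y_train), (x_val, y_val)
```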
Based on the division of the training set and the verification set, the preset completion condition may include a preset training condition and a preset accuracy condition.
In the model training stage, training characteristics contained in the training set can be input into the initial model to obtain a training recognition result; and generating a loss value by using the training recognition result and the audio label contained in the training set, and performing parameter adjustment processing on the initial model by using the loss value.
If the initial model meets the preset training condition, the initial model after parameter adjustment can be verified by using a verification set.
The preset training condition may include a condition for limiting a training time of the initial model, or may be a condition for limiting a training turn of the initial model, which is not limited herein.
If the recognition accuracy of the initial model after the parameter adjustment does not meet the preset accuracy condition, returning to the step of inputting the training characteristics contained in the training set into the initial model to obtain a training recognition result; and determining the initial model after the parameter adjustment as the singing voice recognition model until the recognition accuracy of the initial model after the parameter adjustment meets the preset accuracy condition.
The preset accuracy condition may be a condition that constrains the recognition accuracy of the initial model, for example, that the accuracy of the audio labels predicted by the initial model reaches 90% or more.
The recognition accuracy of the singing voice recognition model can be effectively ensured by dividing the training set and the verification set.
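A sketch of checking the preset accuracy condition on the validation set; the 90% threshold echoes the example above, and the data-loader interface is an assumption:

```python
import torch

@torch.no_grad()
def validation_accuracy(model, val_loader):
    correct, total = 0, 0
    for features, labels in val_loader:
        preds = model(features).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / max(total, 1)

# Training continues until the condition is met, e.g.:
# while validation_accuracy(model, val_loader) < 0.90:
#     ... keep running train_step on the training set ...
```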
After the singing voice recognition model is trained based on the training method, the singing voice recognition model can be used for processing audio. Specifically, the audio to be detected is obtained, and the length and the specific content of the audio to be detected are not limited, and the audio to be detected may be an audio with a singing voice, or may be an audio without a singing voice, or of course, may be an audio with noise, or may also be an audio without noise. The audio to be tested may be the audio obtained directly without processing, or may be the audio processed by DRC in dynamic range control. Extracting the audio features of the audio to be detected to obtain the features to be detected, it can be understood that the extracting manner of the features to be detected should be the same as the extracting manner of the training features. The features to be detected are input into the singing voice recognition model, and a singing voice recognition result can be obtained.
Specifically, the audio to be detected may be long, in which case it needs to be sliced; if it is too short, zero padding is needed. For feature extraction, feature extraction processing is first performed on the audio to be detected to obtain initial audio features. It will be appreciated that the feature extraction is performed in the same way as during training. A preset audio length is determined, and slicing and/or zero padding is performed on the initial audio features based on the preset audio length to obtain the features to be detected. There may be one or more features to be detected, and the specific slicing of the initial audio features may follow the slicing used when generating the training audio. In one embodiment, a single acquired audio signal with duration dur serves as the audio to be detected, and after audio feature extraction the initial audio feature F_input is obtained, whose dimension may be 148 × N. The initial audio feature F_input is sliced along the time axis according to the audio frame length
N' = floor(dur' × sr / (1000 × hop))
to meet the input requirement of the minimum duration dur' of the singing voice recognition model. If the width of the last slice is smaller than the audio frame length N', a zero-padding operation is performed so that the feature dimensions of all audio slices are consistent, finally yielding an audio feature matrix F_input' of dimension W × 148 × N', which contains W features to be detected. F_input' is input into the trained singing voice recognition model, which outputs W results, each corresponding to the feature to be detected obtained from one slice. The content of the output result is related to the labels used during training; for example, when the above six category labels are used as audio labels, one output result is the probability distribution vector of the corresponding audio feature over the six categories (singing with music, singing, talking with music, spoken, pure music, pure noise), p_w = [p_0, p_1, …, p_5], 0 ≤ p_i ≤ 1, 0 ≤ i ≤ 5, where i is the category label index.
The specific form and number of the singing voice recognition results are not limited. When there are multiple audio features, in one embodiment the model output corresponding to each audio feature may serve as a singing voice recognition result indicating whether the audio to be detected contains singing voice in the time period corresponding to that audio feature. In another embodiment, certain post-processing steps may be performed to obtain a singing voice recognition result that characterizes whether the whole audio to be detected is singing voice. Specifically, if there are multiple features to be detected, each feature to be detected may be input into the singing voice recognition model to obtain segmentation recognition results, and the segmentation recognition results are fused to obtain the singing voice recognition result. The segmentation recognition result is the model output p_w. The specific fusion method is not limited; for example, in one approach the average singing-voice probability of the whole audio can be computed and the singing voice recognition result obtained from it. Continuing from the generation of F_input' above, the W probability vectors are averaged column-wise to obtain an average probability vector p_mean = [p_0', p_1', …, p_5'], where p_i' = (Σ_w p_wi) / W and p_wi denotes the probability value of the i-th category of the w-th vector. Since both singing and singing with music correspond to the singing-voice audio label, p_0' and p_1' in the average probability vector both represent singing voice, and the average singing-voice probability can be defined as p_vocal = p_0' + p_1'; the larger the proportion of the audio to be detected that actually contains singing voice, the higher the value of p_vocal. A preset threshold may be set: when p_vocal is greater than the preset threshold, the singing voice recognition result is determined to be singing voice; otherwise it is determined to be non-singing voice.
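The averaging-and-thresholding fusion just described can be sketched as follows; the 0.5 threshold is an assumption:

```python
import numpy as np

def fuse_segments(probs: np.ndarray, threshold: float = 0.5) -> bool:
    """probs: (W, 6) array of per-slice softmax outputs, with columns 0 and 1 being
    'singing with music' and 'singing'. Returns True if the audio contains singing voice."""
    p_mean = probs.mean(axis=0)            # average probability vector p_mean
    p_vocal = p_mean[0] + p_mean[1]        # p_vocal = p_0' + p_1'
    return p_vocal > threshold
```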
In addition, the start and end points of the singing voice within the audio to be detected can be located. Specifically, the start-stop time range of each feature to be detected relative to the audio to be detected is first determined, i.e., the time range that the feature to be detected corresponds to within the audio to be detected; the segmentation recognition results are then sorted by their start-stop time ranges to obtain a first sequence. Adjacent segmentation recognition results in the first sequence may be of the same or different classes (where the classes may refer to singing voice and non-singing voice, or to the six categories above). Same-class boundary fusion is performed on the first sequence: the boundary between two temporally adjacent segmentation recognition results of the same class is removed, so that small time segments are fused into larger time segments, while the boundary between two temporally adjacent segmentation recognition results of different classes is kept, yielding a second sequence. The two sides of each boundary in the second sequence therefore correspond to different classes, so each boundary in the second sequence is a starting point or an ending point of singing voice, which may collectively be called endpoints. The singing voice endpoints corresponding to the audio to be detected can be determined based on the second sequence.
Specifically, for the W probability vectors p_w, the start and end time points of the w-th probability vector are computed as t_start_w = (w - 1) × dur' and t_end_w = w × dur'; each pair of start and end time points defines a minimum unit interval, and together they form the start-stop time range corresponding to the w-th slice of the audio to be detected. The index of the maximum value of each probability vector, index_w, represents the maximum-probability category label of that vector, and minimum unit intervals whose index values index_w are adjacent and identical are fused into a larger time interval, yielding category-label intervals of different durations; the start and end time points of each category-label interval are the endpoints of that interval. With this method, frame-level singing voice endpoints can be located and identified, and singing-voice regions and non-singing-voice regions can be divided. Alternatively, in another embodiment, the regions may be further subdivided into singing-with-music regions, singing regions, talking-with-music regions, spoken regions, pure-music regions and pure-noise regions.
It will be appreciated that the smaller dur', the more accurate the end point location and the smaller the granularity. Referring to fig. 6, fig. 6 is a diagram illustrating an effect of locating a singing voice starting point according to an embodiment of the present application. The upper half part is a first sequence arranged according to a time sequence, the length between every two adjacent dotted lines is dur', and a second sequence of the lower half part is obtained after the same classification boundary fusion processing.
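The same-class boundary fusion used for endpoint location can be sketched as follows: each slice covers [(w-1)·dur', w·dur'] ms, adjacent slices with the same argmax label are merged, and the boundaries of the resulting intervals are the singing-voice endpoints (function and variable names are assumptions):

```python
import numpy as np

def merge_intervals(probs: np.ndarray, dur_ms: float):
    """probs: (W, C) per-slice probabilities; returns a list of (label, start_ms, end_ms)."""
    labels = probs.argmax(axis=1)
    intervals = []
    for w, label in enumerate(labels):
        start, end = w * dur_ms, (w + 1) * dur_ms
        if intervals and intervals[-1][0] == label:
            intervals[-1] = (label, intervals[-1][1], end)   # fuse with the previous same-class interval
        else:
            intervals.append((label, start, end))
    return intervals
```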
Referring to fig. 7, fig. 7 is a flow chart of singing voice recognition according to an embodiment of the present application. After the audio signal (namely, the audio to be detected) is acquired, DRC processing can be carried out on the audio signal, so that the subsequent singing voice recognition is more convenient. Inputting the audio signal processed by DRC into a feature extraction module to obtain the feature to be detected, inputting the feature to be detected into a singing voice detection model (namely a singing voice identification model), and further inputting the output result obtained by the singing voice identification model into a post-processing module for post-processing. The details of the post-processing refer to the previous description.
By applying the singing voice recognition method provided by the embodiments of this application, singing voice recognition on noisy audio is achieved by using special training data and a special initial model, with stronger robustness and anti-interference capability. Specifically, noisy audio is included in the training audio, so that the model can use the training audio to learn how to distinguish singing audio under noise interference. The initial model includes a first convolutional layer and a second convolutional layer whose convolution kernels are rectangular: the kernel whose long side is arranged along the frequency axis can capture frequency-domain information such as pitch and vocal range, and the kernel whose long side is arranged along the time axis can capture time-domain information such as rhythm and melody. With the first and second convolutional layers, the initial model can obtain more information, which helps it resist noise interference and classify accurately. The singing voice recognition model obtained after training therefore has strong resistance to noise interference and can accurately identify singing voice.
In the following, a computer-readable storage medium provided by an embodiment of the present application is introduced, and the computer-readable storage medium described below and the singing voice recognition model training method and/or the singing voice recognition method described above may be referred to in correspondence with each other.
The present application further provides a computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the above-mentioned singing voice recognition model training method and/or singing voice recognition method.
The computer-readable storage medium may include: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the term "include" or any other variation thereof is intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principle and implementation of the present application have been explained herein with specific examples; the above description of the embodiments is intended only to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (12)

1. A singing voice recognition model training method is characterized by comprising the following steps:
acquiring training audio and a corresponding audio label; wherein the training audio comprises noise-containing audio interfered by noise, and the audio tag is used for indicating that the training audio is singing audio or non-singing audio;
extracting the audio features of the training audio to obtain training features;
inputting the training features into an initial model to obtain a training recognition result; wherein the initial model comprises a first convolution layer having a first rectangular convolution kernel and a second convolution layer having a second rectangular convolution kernel, a long side of the first rectangular convolution kernel being arranged along a frequency axis direction, and a long side of the second rectangular convolution kernel being arranged along a time axis direction;
generating a loss value by using the training recognition result and the audio label, and performing parameter adjustment processing on the initial model by using the loss value;
and if the preset completion condition is detected to be met, determining the initial model after the parameters are adjusted as the singing voice recognition model.
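As a rough illustration of the training flow recited in claim 1 (forward pass, loss from the audio labels, parameter adjustment, completion check), the following Python sketch uses a generic PyTorch model and data loader; the optimizer, loss function, and epoch-based completion condition are assumptions, not requirements of the claim.

```python
import torch
import torch.nn as nn

def train_singing_voice_model(model, loader, epochs: int = 10, lr: float = 1e-3):
    """Sketch of claim 1: forward pass, loss from labels, parameter adjustment."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):                    # "preset completion condition" assumed to be an epoch budget
        for features, labels in loader:        # training features and audio labels
            logits = model(features)           # training recognition result
            loss = criterion(logits, labels)   # loss value from result and label
            optimizer.zero_grad()
            loss.backward()                    # back-propagation
            optimizer.step()                   # parameter adjustment
    return model                               # trained singing voice recognition model
```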
2. The singing voice recognition model training method according to claim 1, wherein the acquiring of the training audio comprises:
acquiring initial training audio;
and carrying out dynamic range control processing on the initial training audio to obtain the training audio.
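Claim 2 does not fix a particular dynamic range control algorithm; the following is a toy Python sketch of one possible form, a simple amplitude compressor with an assumed threshold and ratio.

```python
import numpy as np

def dynamic_range_control(audio: np.ndarray, threshold: float = 0.5,
                          ratio: float = 4.0) -> np.ndarray:
    """Toy compressor: attenuate samples above a threshold to narrow the dynamic range."""
    sign = np.sign(audio)
    mag = np.abs(audio).astype(np.float64)
    over = mag > threshold
    mag[over] = threshold + (mag[over] - threshold) / ratio
    return sign * mag
```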
3. The singing voice recognition model training method according to claim 1, wherein the acquiring of the training audio comprises:
acquiring initial training audio;
and determining a preset audio length, and performing fragmentation processing and/or zero padding processing on the initial training audio based on the preset audio length to obtain the training audio.
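A minimal Python sketch of the fragmentation and zero-padding step in claim 3, assuming the audio is a 1-D NumPy array and the preset audio length is given in samples.

```python
import numpy as np

def slice_and_pad(audio: np.ndarray, segment_len: int) -> list[np.ndarray]:
    """Cut audio into fixed-length segments; zero-pad the final short segment."""
    segments = []
    for start in range(0, len(audio), segment_len):
        seg = audio[start:start + segment_len]
        if len(seg) < segment_len:
            seg = np.pad(seg, (0, segment_len - len(seg)))  # zero padding
        segments.append(seg)
    return segments
```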
4. The singing voice recognition model training method according to claim 1, wherein the generation process of the audio tag comprises:
determining an audio category corresponding to the training audio;
generating the audio tag based on the audio category.
5. The singing voice recognition model training method according to claim 1, wherein the extracting the audio features of the training audio to obtain the training features comprises:
and performing Mel frequency spectrum extraction processing and/or Mel cepstrum coefficient extraction processing on the training audio at the granularity of audio frames to obtain the training features.
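The frame-level mel spectrum and mel cepstrum coefficient extraction in claim 5 could, for example, be realized with librosa as sketched below; the sample rate, FFT size, hop length, and feature dimensions are illustrative assumptions.

```python
import librosa
import numpy as np

def extract_training_features(audio: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Frame-level mel spectrogram and MFCC features, stacked along the feature axis."""
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)
    log_mel = librosa.power_to_db(mel)                          # (80, n_frames)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20,
                                n_fft=1024, hop_length=256)     # (20, n_frames)
    return np.vstack([log_mel, mfcc])                           # (100, n_frames)
```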
6. The singing voice recognition model training method according to claim 1, further comprising, after extracting the audio features of the training audio to obtain the training features:
dividing the training features and the corresponding audio tags into a training set and a verification set;
correspondingly, the inputting the training features into the initial model to obtain a training recognition result includes:
inputting the training features contained in the training set into the initial model to obtain a training recognition result;
and the determining the initial model after parameter adjustment as the singing voice recognition model if the preset completion condition is detected to be met comprises:
if the initial model meets a preset training condition, verifying the recognition accuracy of the parameter-adjusted initial model by using the verification set;
if the recognition accuracy of the initial model after the parameter adjustment does not meet the preset accuracy condition, returning to the step of inputting the training features contained in the training set into the initial model to obtain a training recognition result; and determining the initial model after the parameter adjustment as a singing voice recognition model until the recognition accuracy of the initial model after the parameter adjustment meets the preset accuracy condition.
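A schematic Python sketch of the accuracy-gated loop in claim 6: train on the training set, check recognition accuracy on the verification set, and repeat until a preset accuracy condition is met. The target accuracy and round limit are assumed values.

```python
import torch

def train_until_accurate(model, train_fn, val_loader, target_acc: float = 0.95,
                         max_rounds: int = 50):
    """Sketch of claim 6: keep training until validation accuracy meets the preset condition."""
    for _ in range(max_rounds):
        model = train_fn(model)                   # one round of training on the training set
        correct, total = 0, 0
        model.eval()
        with torch.no_grad():
            for features, labels in val_loader:   # verification set
                preds = model(features).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        model.train()
        if correct / total >= target_acc:         # preset accuracy condition satisfied
            break
    return model
```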
7. A singing voice recognition method, comprising:
acquiring audio to be tested;
extracting the audio features of the audio to be tested to obtain features to be tested;
inputting the features to be tested into a singing voice recognition model to obtain a singing voice recognition result; wherein the singing voice recognition model is obtained based on the singing voice recognition model training method of any one of claims 1 to 6.
8. The singing voice recognition method of claim 7, wherein the extracting the audio features of the audio to be tested to obtain the features to be tested comprises:
performing feature extraction processing on the audio to be tested to obtain initial audio features;
and determining a preset audio length, and performing fragmentation processing and/or zero padding processing on the initial audio features based on the preset audio length to obtain the features to be tested.
9. The singing voice recognition method of claim 7, wherein, if there are a plurality of features to be tested, the inputting the features to be tested into the singing voice recognition model to obtain the singing voice recognition result comprises:
inputting each feature to be tested into the singing voice recognition model respectively to obtain segment recognition results;
and performing fusion processing on the segment recognition results to obtain the singing voice recognition result.
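One possible fusion of segment recognition results, as in claim 9, is to average per-segment class probabilities; the sketch below assumes each result is a logits vector and that class 1 denotes singing audio.

```python
import torch

def fuse_segment_results(segment_logits: list[torch.Tensor]) -> int:
    """Fuse per-segment recognition results by averaging class probabilities."""
    probs = torch.stack([torch.softmax(l, dim=-1) for l in segment_logits])
    return int(probs.mean(dim=0).argmax())   # 1 = singing, 0 = non-singing (assumed labels)
```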
10. The singing voice recognition method according to claim 9, further comprising:
determining a start-stop time range of each feature to be tested relative to the audio to be tested;
sorting the segment recognition results by the start-stop time ranges to obtain a first sequence;
performing same-class boundary fusion processing on the first sequence to obtain a second sequence;
and determining the singing voice end point corresponding to the audio to be tested based on the second sequence.
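A simple Python sketch consistent with claim 10: sort the segment results by their start-stop time ranges, fuse adjacent same-class boundaries, and read off the singing voice endpoints. Segment results are assumed to be (start, end, label) tuples with label 1 for singing.

```python
def singing_endpoints(segments: list[tuple[float, float, int]]) -> list[tuple[float, float]]:
    """Sort segment results by start time, merge adjacent same-class spans,
    and return the (start, end) ranges labelled as singing (class 1)."""
    ordered = sorted(segments, key=lambda s: s[0])        # first sequence
    merged: list[tuple[float, float, int]] = []
    for start, end, label in ordered:
        if merged and merged[-1][2] == label:
            merged[-1] = (merged[-1][0], end, label)      # fuse same-class boundary
        else:
            merged.append((start, end, label))            # second sequence
    return [(s, e) for s, e, lab in merged if lab == 1]   # singing voice endpoints
```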
11. An electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the singing voice recognition model training method according to any one of claims 1 to 6, and/or the singing voice recognition method according to any one of claims 7 to 10.
12. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a singing voice recognition model training method according to any one of claims 1 to 6, and/or a singing voice recognition method according to any one of claims 7 to 10.
CN202210720102.5A 2022-06-23 2022-06-23 Singing voice recognition model training method, singing voice recognition method and related device Pending CN115273826A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210720102.5A CN115273826A (en) 2022-06-23 2022-06-23 Singing voice recognition model training method, singing voice recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210720102.5A CN115273826A (en) 2022-06-23 2022-06-23 Singing voice recognition model training method, singing voice recognition method and related device

Publications (1)

Publication Number Publication Date
CN115273826A true CN115273826A (en) 2022-11-01

Family

ID=83760417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210720102.5A Pending CN115273826A (en) 2022-06-23 2022-06-23 Singing voice recognition model training method, singing voice recognition method and related device

Country Status (1)

Country Link
CN (1) CN115273826A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116824275A (en) * 2023-08-29 2023-09-29 青岛美迪康数字工程有限公司 Method, device and computer equipment for realizing intelligent model optimization
CN116824275B (en) * 2023-08-29 2023-11-17 青岛美迪康数字工程有限公司 Method, device and computer equipment for realizing intelligent model optimization

Similar Documents

Publication Publication Date Title
US20240039499A1 (en) Volume leveler controller and controlling method
US10803879B2 (en) Apparatuses and methods for audio classifying and processing
US10044337B2 (en) Equalizer controller and controlling method
CN115273826A (en) Singing voice recognition model training method, singing voice recognition method and related device
Valero et al. Narrow-band autocorrelation function features for the automatic recognition of acoustic environments
CN111243618B (en) Method, device and electronic equipment for determining specific voice fragments in audio
CN107025902B (en) Data processing method and device
CN114302301B (en) Frequency response correction method and related product
CN113536029B (en) Method and device for aligning audio and text, electronic equipment and storage medium
CN113781989A (en) Audio animation playing and rhythm stuck point identification method and related device
CN113744721B (en) Model training method, audio processing method, device and readable storage medium
CN114822456A (en) Music score audio detection method, device, equipment and computer medium based on music score
CN113744721A (en) Model training method, audio processing method, device and readable storage medium
CN114741046A (en) Audio playback method and device, electronic equipment and computer readable medium
CN113744708A (en) Model training method, audio evaluation method, device and readable storage medium
CN115440188A (en) Audio data splicing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination