WO2020248485A1 - 音频单音色分离方法、装置、计算机设备及存储介质 - Google Patents
音频单音色分离方法、装置、计算机设备及存储介质 Download PDFInfo
- Publication number
- WO2020248485A1 WO2020248485A1 PCT/CN2019/117096 CN2019117096W WO2020248485A1 WO 2020248485 A1 WO2020248485 A1 WO 2020248485A1 CN 2019117096 W CN2019117096 W CN 2019117096W WO 2020248485 A1 WO2020248485 A1 WO 2020248485A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- audio
- sample
- target
- neural network
- lstm neural
- Prior art date
Links
- 238000000926 separation method Methods 0.000 title claims abstract description 72
- 238000013528 artificial neural network Methods 0.000 claims abstract description 171
- 230000009466 transformation Effects 0.000 claims abstract description 28
- 238000012545 processing Methods 0.000 claims abstract description 17
- 239000013598 vector Substances 0.000 claims description 124
- 238000001228 spectrum Methods 0.000 claims description 39
- 238000004364 calculation method Methods 0.000 claims description 35
- 238000012549 training Methods 0.000 claims description 31
- 230000006870 function Effects 0.000 claims description 30
- 230000003595 spectral effect Effects 0.000 claims description 26
- 230000015654 memory Effects 0.000 claims description 21
- 230000009467 reduction Effects 0.000 claims description 6
- 230000015572 biosynthetic process Effects 0.000 claims description 3
- 238000003786 synthesis reaction Methods 0.000 claims description 3
- 230000001131 transforming effect Effects 0.000 claims description 2
- 238000000034 method Methods 0.000 abstract description 24
- 230000008569 process Effects 0.000 description 12
- 238000010586 diagram Methods 0.000 description 9
- 238000012360 testing method Methods 0.000 description 5
- 238000011156 evaluation Methods 0.000 description 4
- 239000012634 fragment Substances 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000011161 development Methods 0.000 description 2
- 210000000056 organ Anatomy 0.000 description 2
- 230000002441 reversible effect Effects 0.000 description 2
- 230000001360 synchronised effect Effects 0.000 description 2
- 238000012795 verification Methods 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 230000009977 dual effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 239000000463 material Substances 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 238000013518 transcription Methods 0.000 description 1
- 230000035897 transcription Effects 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Definitions
- This application relates to the field of audio processing technology, and in particular to audio single tone color separation methods, devices, computer equipment, and storage media.
- the embodiments of the present application provide an audio single tone color separation method, device, computer equipment, and storage medium to solve the problem that the prior art cannot achieve single tone color separation.
- An audio single tone color separation method including:
- One LSTM neural network corresponding to each timbre type is selected from the pre-trained LSTM neural networks as the target LSTM neural network.
- Each LSTM neural network uses the audio samples corresponding to different timbre type combinations in advance.
- each tone type combination is composed of more than two tone types;
- the respective target spectrograms are respectively subjected to time domain transformation to obtain the target monotone audio corresponding to each of the target spectrograms as the audio separation result of the target audio.
- An audio single tone color separation device including:
- Audio acquisition module for acquiring target audio to be audio separated
- a timbre type determining module for determining the timbre types to be separated for the target audio
- the neural network selection module is used to select an LSTM neural network corresponding to each tone type from the pre-trained LSTM neural networks as the target LSTM neural network, and each LSTM neural network adopts different tone types.
- the audio samples corresponding to the combination are pre-trained, and each timbre type combination is composed of more than two timbre types;
- a target audio input module configured to input the target audio as an input to the target LSTM neural network to obtain various target spectrograms output by the target LSTM neural network;
- the time domain transformation module is configured to perform time domain transformation on each of the target spectrograms respectively to obtain the target monotone audio corresponding to each of the target spectrograms as the audio separation result of the target audio.
- a computer device including a memory, a processor, and computer readable instructions stored in the memory and capable of running on the processor, and the processor implements the above-mentioned audio single tone color separation when the processor executes the computer readable instructions Method steps.
- One or more readable storage media storing computer readable instructions, and the computer readable storage medium storing computer readable instructions so that the one or more processors execute the steps of the above-mentioned audio single tone color separation method.
- FIG. 1 is a schematic diagram of an application environment of an audio single tone color separation method in an embodiment of the present application
- FIG. 2 is a flowchart of an audio single tone color separation method in an embodiment of the present application
- FIG. 3 is a schematic diagram of the principle of the LSTM neural network performing timbre separation on a mixed audio in this application to obtain each separated audio;
- FIG. 4 is a schematic flowchart of step 102 of an audio single tone color separation method in an application scenario in an embodiment of the present application
- FIG. 5 is a schematic flowchart of pre-training an LSTM neural network in an application scenario of the audio single tone separation method in an embodiment of the present application
- Fig. 6 is a schematic flow chart of synthesizing mixed audio samples in an application scenario of the audio single tone separation method in an embodiment of the present application
- FIG. 7 is a schematic flowchart of step 304 of an audio single tone color separation method in an application scenario in an embodiment of the present application
- FIG. 8 is a schematic structural diagram of an audio single tone color separation device in an application scenario in an embodiment of the present application.
- FIG. 9 is a schematic structural diagram of a tone color type determining module in an embodiment of the present application.
- FIG. 10 is a schematic structural diagram of an audio single tone color separation device in another application scenario in an embodiment of the present application.
- Fig. 11 is a schematic diagram of a computer device in an embodiment of the present application.
- the audio single tone color separation method provided in this application can be applied in the application environment as shown in Fig. 1, where the client communicates with the server through the network.
- the client can be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
- the server can be implemented as an independent server or a server cluster composed of multiple servers.
- an audio single tone separation method is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
- the server first needs to obtain the target audio to be audio separated. It is understandable that the server can obtain the target audio in a variety of ways. For example, it can be uploaded to the server by the staff responsible for audio separation; or the server can extract the audio files from the specified database according to the timed task, and extract these The received audio file is determined as the target audio to be audio separated; etc.
- the LSTM neural network used in this embodiment needs to be pre-trained.
- the LSTM Neural networks can generally only be used to separate monophonic audio of these several timbre types. For example, suppose that when training a certain LSTM neural network, three types of timbre of piano, violin, and drums are learned and trained. After the LSTM neural network is trained, when the LSTM neural network is used to separate a certain target audio, a piano will be obtained Single tone audio in three tone types:, violin and drum.
- multiple LSTM neural networks are trained for different timbre type combinations, so that these pre-trained LSTM neural networks can cover as many application scenarios as possible, for example, for piano, violin, and drum sounds.
- Train an LSTM neural network for the combination of types denoted as network No. 1; train an LSTM neural network for the combination of three tone types of piano, violin and harmonica, denoted as network No. 2; for the three tone types of erhu, organ and guitar Combine and train an LSTM neural network, denoted as network 3; etc.
- the server in order to accurately separate the single-timbre audio data in the target audio, the server also needs to determine the various timbre types that need to be separated for the target audio.
- the staff responsible for timbre separation can manually input instructions to inform the server which timbre types need to be classified for the target audio, so that the server can determine the respective timbre types;
- the server can also automatically determine the combination of timbre types roughly contained in the target audio according to the source occasion of the target audio, so as to determine the timbre types that the target audio needs to separate.
- step 102 may include:
- the target audio can also record the source occasion of the target audio when it is collected, so that when the server obtains the target audio, the source occasion of the target audio can be obtained at the same time.
- a digital label "001" can be placed on the target audio, the digital label indicates that the source occasion of the target audio is a school meeting room, and the server reads the digital label from the target audio to obtain the source occasion.
- the server can preset the occasion timbre correspondence relationship.
- the occasion timbre correspondence records the correspondence relationship between the occasion and the timbre type combination.
- the timbre correspondence determines the timbre type combination corresponding to the source occasion of the target audio. For example, if the server obtains the source occasion of the target audio as "001", that is, the school conference room, in this occasion, the timbre correspondence relationship records the timbre type combination of "001" and "human voice, noise", and the server can determine The timbre type combination of the "human voice, noise” is shown.
- the server can directly obtain the determined timbre types in the timbre type combination, and determine these timbre types as the timbre types that need to be separated for the target audio.
- each LSTM neural network is pre-trained in this embodiment.
- These LSTM neural networks are obtained by pre-training the audio samples corresponding to different timbre type combinations, that is, corresponding to different timbre type combinations.
- the timbre type combination consists of more than two timbre types.
- the server After the server determines the various timbre types that need to be separated for the target audio, it can select an LSTM neural network corresponding to each timbre type from the pre-trained LSTM neural networks, as used to perform timbre for the target audio Separated target LSTM neural network.
- the training process of the LSTM neural network corresponding to each timbre type combination will be described in detail below. As shown in Figure 5, further, the LSTM neural network corresponding to each timbre type combination is pre-trained through the following steps:
- each mixed audio sample is synthesized according to the monophonic audio samples corresponding to the respective sample timbre types, and each mixed audio sample is synthesized from one monophonic audio sample corresponding to each sample timbre type;
- each mixed audio sample For each mixed audio sample, input each mixed audio sample as an input to the LSTM neural network to obtain a spectrogram of each sample output by the LSTM neural network;
- each sample spectrogram calculates the error between each sample spectrogram and each single tone color spectrogram by using a preset cost function, where each single tone color spectrogram refers to the input of each single tone color audio sample corresponding to each mixed audio sample.
- Spectrogram obtained by frequency domain transformation
- step 301 it can be understood that when preparing to train the LSTM neural network corresponding to a certain tone type combination, the tone type included in this tone type combination must first be obtained and recorded as each sample tone type.
- the server first obtains the timbre type of each sample of the LSTM neural network as "erhu, guitar and harmonica".
- the data used in the training of the LSTM neural network are all accurate audio data. Therefore, it is necessary to collect respective monophonic audio samples for each sample tone type.
- the single-timbre audio sample mentioned here means that the audio sample only contains audio data of one timbre type.
- the single-timbre audio sample corresponding to the timbre type "erhu” can collect the sound of an erhu instrument in a noise-free environment Audio, the audio data collected in this way can be regarded as the single tone audio data of "Erhu", so it can be used as a single tone audio sample.
- the server may synthesize the respective monophonic audio samples corresponding to the respective sample tone types to obtain each mixed audio sample, wherein each mixed audio sample is composed of the respective sample tone types A corresponding single-timbre audio sample is synthesized.
- each mixed audio sample in step 303 is synthesized through the following steps:
- the server can take a sample from the single tone audio sample of each sample tone type, that is, the sample to be mixed, and then perform the mixing process on each sample to be mixed to obtain a mixture Audio samples.
- a sample tone type that is, the sample to be mixed
- three sample timbre types are obtained, which are recorded as type 1, type 2, and type 3.
- 10 monotone audio samples are collected for each sample timbre type.
- One single-timbre audio sample is taken as the sample to be mixed from the single-timbre audio samples, and one single-timbre audio sample is taken from the 10 single-timbre audio samples of category 2 as the sample to be mixed.
- One single-tone audio sample is taken as the sample to be mixed from the timbre audio samples, and then the three taken out samples to be mixed are subjected to mixing processing to obtain 1 mixed audio sample. This is the synthesis process of a mixed audio sample. Repeating the above steps 401 and 402, multiple mixed audio samples can be obtained.
- step 303 it should be noted that in order to improve the effectiveness of sample training, in step 303, it should be noted that the combination of synthesized monotone audio samples used by any two mixed audio samples is different. It should be known that if two monotone audio samples are If the combination of is the same, the mixed audio samples synthesized by the two combinations are also the same. The two same samples are used for subsequent training, which generally only increases the computational burden required for training, and the training completion of the LSTM neural network no help. For this reason, when the above steps 401 and 402 are repeatedly executed, the combination of the samples to be mixed can be selected in a combination manner, so that each mixed audio sample obtained after mixing of each combination is different.
- step 304 after the server synthesizes each mixed audio sample, it can put these mixed audio samples into the LSTM neural network to train the LSTM neural network. For each mixed audio sample, the server inputs it as an input to the LSTM neural network, and obtains the spectrum diagram of each sample output by the LSTM neural network.
- step 304 may include:
- the preset number is equal to the number of each sample tone color type
- each audio feature vector corresponding to each multi-layer perceptron perform transposed convolution calculation on the audio feature vector and the preset convolution kernel to obtain the dimension-upgraded corresponding to each multi-layer perceptron.
- FIG. 3 shows a schematic diagram of the principle of the LSTM neural network performing timbre separation on a mixed audio to obtain each separated audio.
- the server may perform frequency domain transformation on each mixed audio sample to obtain a mixed spectrogram of each mixed audio sample.
- the server may window the hybrid spectrogram. Specifically, a Hamming window can be added to the hybrid spectrogram, and one frame of data on the hybrid spectrogram can be obtained through the Hamming window, and the server then performs short-time Fourier transform on each frame of data to obtain each spectrum feature vector. It should be noted that when reading data from the hybrid spectrogram through the Hamming window, a certain overlap rate can be set, for example, an overlap rate in the interval of 50%-80% can be set, and the time length of each frame of data can be Set to about 20 milliseconds.
- each spectrum feature vector can be overlapped and grouped, that is, multiple spectrum feature vectors are bundled into a set of spectrum feature segments.
- each group of spectrum feature segments can be specifically divided into a preset number of spectrum feature vectors, and the preset number can be set according to actual usage conditions, which is not limited in this embodiment.
- the spectrum feature vectors from 0 to a are divided into the first group, that is, the first group of spectrum feature segments; the spectrum feature vectors from a/2 to 3*a/2 are divided into the second group, that is, the second group Group of spectral feature fragments; the spectrum feature vectors from a to 2*a are divided into the third group, that is, the third group of spectral feature fragments; and so on, until all the spectral feature vectors are grouped (where a can be each group The number of frames, and ensure that a is an even number).
- the server may perform convolution calculations on each group of spectral feature segments with a preset convolution kernel to obtain the reduction
- a preset convolution kernel can be set to be smaller in the time dimension and larger in the frequency dimension, so that after convolving the spectral feature segment, the obtained segment vector will be flattened in the frequency domain.
- the reduced-dimensional segment vector can be a one-dimensional vector.
- the activation function can be used after each layer of convolutional layer to activate, so as to add nonlinear factors to the LSTM neural network to solve the lack of expression ability of the linear model The problem.
- step 505 refer to FIG. 3.
- the fragment vectors obtained by the server can be straightened one-dimensional vectors.
- the server inputs these fragment vectors into a long short-term memory network (LSTM), Perform a sequence-to-sequence (seq2seq) learning, and the vector output by each network unit in the LSTM is used as the audio information extracted at a certain time node, that is, each audio information vector output by the LSTM is obtained.
- LSTM long short-term memory network
- the LSTM neural network in this embodiment is also provided with a preset number of multi-layer perceptrons (MLP, Multi-Layer Perceptron), and the number of multi-layer perceptrons is equal to the number of each sample tone type, such as There are 3 sample timbre types in the timbre type combination corresponding to the LSTM neural network of this training, and there are 3 multilayer perceptrons in the LSTM neural network.
- MLP Multi-Layer Perceptron
- the function of the multilayer perceptron in this embodiment is similar to the information filter, which is used to filter out the characteristic information of a certain timbre and realize the separation of audio information.
- the server inputs each audio information vector obtained in step 505 to the multi-layer perceptron to obtain the separated feature vector output by the multi-layer perceptron. This is equivalent to the server inputting each audio information vector into each multilayer perceptron once.
- the server executes step 505 to obtain N audio information vectors, the server can input N audio information vectors into perceptron a to obtain the separated feature vectors output by perceptron a; and the server inputs N audio information vectors into perceptron b to obtain the perception The separated feature vector output by device b; in addition, the server also inputs N audio information vectors into perceptron c to obtain the separated feature vector output by perceptron c. In this way, the server can obtain the results respectively output by the three multilayer perceptrons.
- the server has basically achieved the separation of the monotone audio information in the mixed audio sample through the processing of the above steps 501-506, but the audio information obtained by these separations is only used by the LSTM neural network.
- the recognized data form exists, that is, the aforementioned separation feature vector. In order to enable these separated audio information to be recognized and used, it is also necessary to process these separated feature vectors through a dual reverse process to realize the restoration of audio information.
- steps 502-504 are the processing of audio in the data form, so that the mixed audio data is converted into a data form that is easier to understand and recognize by the neural network. Therefore, step 507 -509 is the reverse process of 502-504, which can restore the separated feature vector to the same data form as the mixed spectrogram.
- the server inputs the separated feature vector into the same LSTM for seq2seq learning, and the vector output by each network unit is used as the separated audio feature information restored at a certain time node , So as to obtain the audio feature vector corresponding to each multilayer perceptron.
- the server may perform transposed convolution calculations on the audio feature vector corresponding to each multilayer perceptron and the preset convolution kernel to obtain Each audio feature segment corresponding to each multilayer perceptron after the upgrade.
- step 509 after the server obtains the upgraded audio feature segments corresponding to each multilayer perceptron, it can combine the audio feature segments corresponding to each multilayer perceptron to obtain the corresponding audio feature segments of each multilayer perceptron.
- the single tone color spectrogram is used as the spectrogram of each sample output by the LSTM neural network.
- a cost function can be preset to calculate the error between each sample spectrogram and each single tone color spectrogram, and then adjust
- the error obtained by the calculation of the cost function is taken as the target, and each network parameter in the LSTM neural network is continuously adjusted until the calculation result of the cost function converges, and then it is determined that the LSTM neural network has been trained.
- stochastic gradient descent SGD can be used to promote the LSTM neural network to converge quickly.
- the aforementioned cost function may be Mean Squared Error (MSE, Mean Squared Error).
- the server may divide the collected training data samples into a training data set and a test data set in advance, wherein the training data set accounts for 80% of the number of samples, and the test data set accounts for 20% of the number of samples.
- the server can use each sample in the test data set to test and evaluate the LSTM neural network.
- the staff in charge of the test can audition the monotone audio output by the LSTM neural network , And use the audition effect as the evaluation of the LSTM neural network.
- it is determined that the LSTM neural network has indeed completed the training and can be put into use; otherwise, if the evaluation fails, you can consider retraining the LSTM neural network.
- the spectrogram of each sample output in step 304 can be compared with the spectrogram of each monotone audio sample corresponding to the mixed audio sample for verification. If the verification is consistent, the neural network training can be determined carry out.
- the server after the server obtains the target audio and determines the target LSTM neural network, it can use the target audio as input to the target LSTM neural network to obtain various target spectrograms output by the target LSTM neural network .
- the target spectrogram may be subjected to time domain transformation. And obtain the target single-timbre audio corresponding to each of the target spectrograms as the audio separation result of the target audio. It can be considered that the final target single-timbre audio is the audio separated from the target audio by the individual single-timbre audio data contained in the target audio, and it is also the division of the target audio in each timbre type that needs to be separated. The result of audio separation under.
- the target audio to be separated from the audio is obtained; then, the various timbre types that need to be separated for the target audio are determined; then, the pre-trained LSTM neural networks are selected from the various timbre
- One LSTM neural network corresponding to the category is used as the target LSTM neural network, and each LSTM neural network is obtained by pre-training the audio samples corresponding to different timbre type combinations, and each timbre type combination is composed of more than two timbre types;
- the target audio is input to the target LSTM neural network to obtain each target spectrogram output by the target LSTM neural network; finally, each target spectrogram is subjected to time domain transformation to obtain the The target single-timbre audio corresponding to each target spectrogram is used as the audio separation result of the target audio.
- this application can separate the target audio into each target single tone audio through the pre-trained LSTM neural network, and the corresponding LSTM neural network can be selected according to the type of timbre to be separated to determine the final separated target single tone.
- the type of audio tone not only realizes the single tone separation of audio, but also makes the result of single tone separation controllable to a certain extent, providing more support and assistance for audio content analysis in certain application scenarios.
- an audio single tone color separation device is provided, and the audio single tone color separation device corresponds one-to-one with the audio single tone color separation method in the foregoing embodiment.
- the audio single tone color separation device includes an audio acquisition module 601, a tone type determination module 602, a neural network selection module 603, a target audio input module 604, and a time domain transformation module 605.
- the detailed description of each functional module is as follows:
- the audio acquisition module 601 is used to acquire the target audio to be audio separated
- the timbre type determining module 602 is used to determine the timbre types that need to be separated for the target audio
- the neural network selection module 603 is used to select an LSTM neural network corresponding to each tone type from each pre-trained LSTM neural network as a target LSTM neural network, and each LSTM neural network adopts a different tone color.
- the audio samples corresponding to the category combinations are pre-trained, and each timbre category combination is composed of more than two timbre categories;
- the target audio input module 604 is configured to input the target audio as an input to the target LSTM neural network to obtain various target spectrograms output by the target LSTM neural network;
- the time domain transformation module 605 is configured to perform time domain transformation on each target spectrogram to obtain the target monotone audio corresponding to each target spectrogram as an audio separation result of the target audio.
- the tone color type determining module 602 may include:
- the source occasion obtaining unit 6021 is configured to obtain the source occasion of the target audio
- the category combination determining unit 6022 is configured to determine the tone color category combination corresponding to the source occasion of the target audio according to the preset occasion tone color correspondence relationship, and the occasion tone color correspondence relationship records the corresponding relationship between the occasion and the tone color category combination;
- the tone color type determining unit 6023 is configured to determine each tone color type in the determined tone color type combination as each tone color type that needs to be separated for the target audio.
- the LSTM neural network corresponding to each timbre type combination can be pre-trained through the following modules:
- the sample type acquisition module 606 is configured to acquire each tone type included in the tone type combination corresponding to the LSTM neural network to be trained as each sample tone type;
- the audio sample collection module 607 is configured to separately collect the single tone audio samples corresponding to the respective sample tone types
- the mixed audio sample synthesis module 608 is configured to synthesize each mixed audio sample according to the corresponding single tone audio sample of each sample timbre type, and each mixed audio sample is composed of a single tone audio sample corresponding to each sample timbre type. Synthesized
- the sample input module 609 is configured to input each mixed audio sample as an input to the LSTM neural network for each mixed audio sample, to obtain a spectrogram of each sample output by the LSTM neural network;
- the error calculation module 610 is configured to use a preset cost function to calculate the error between each sample spectrogram and each single tone color spectrogram, and each single tone color spectrogram refers to each corresponding to each mixed audio sample.
- Spectrogram obtained by transforming single tone audio samples into frequency domain;
- the network parameter adjustment module 611 is configured to adjust the network parameters of the LSTM neural network with the calculation result of the cost function as a target, until the calculation result of the cost function converges, and then determine that the LSTM neural network has been trained.
- the sample input module may include:
- a frequency domain transformation unit configured to perform frequency domain transformation on each mixed audio sample to obtain a mixed spectrogram of each mixed audio sample
- a windowing unit configured to window the hybrid spectrogram, and perform short-time Fourier transform on each frame of data obtained by the windowing to obtain each spectral feature vector;
- the overlapping grouping unit is configured to perform overlapping grouping of the respective spectrum feature vectors to obtain each group of spectrum feature segments, and there are overlapping spectrum feature vectors between any two adjacent groups of spectrum feature segments;
- the convolution calculation unit is configured to perform convolution calculation on each group of spectral feature segments with a preset convolution kernel to obtain each segment vector after dimensionality reduction, and each segment vector corresponds to a set of spectral feature segments;
- a segment vector input unit configured to input each of the segment vectors into the LSTM to obtain each audio information vector output by the LSTM;
- the multi-layer perceptron processing unit is configured to input the respective audio information vectors to each multi-layer perceptron for each multi-layer perceptron in the preset number of multi-layer perceptrons to obtain each multi-layer perceptron.
- the separation feature vector output by the perceptron, the preset number is equal to the number of the timbre types of each sample;
- the feature restoration unit is configured to input the separated feature vector to the LSTM for feature restoration for the separated feature vector output by each multilayer perceptron, and obtain the audio feature vector corresponding to each multilayer perceptron;
- the transposed convolution calculation unit is configured to perform transposed convolution calculations on the audio feature vector corresponding to each multi-layer perceptron with the preset convolution kernel to obtain the upgraded, Each audio feature segment corresponding to each multilayer perceptron;
- the feature segment splicing unit is used to splice each audio feature segment corresponding to each multilayer perceptron to obtain a single tone color spectrogram corresponding to each multilayer perceptron, as each sample spectrogram output by the LSTM neural network.
- each mixed audio sample is synthesized through the following modules:
- the sample selection module to be mixed is configured to select a single-timbre audio sample from the single-timbre audio samples for a single-timbre audio sample corresponding to each of the sample timbre types as the sample to be mixed;
- the mixing processing module is used to perform mixing processing on the to-be-mixed samples corresponding to the respective sample timbre types to obtain a mixed audio sample.
- Each module in the above audio single tone color separation device can be implemented in whole or in part by software, hardware and a combination thereof.
- the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
- a computer device is provided.
- the computer device may be a server, and its internal structure diagram may be as shown in FIG. 11.
- the computer equipment includes a processor, a memory, a network interface and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
- the memory of the computer device includes a readable storage medium and an internal memory.
- the readable storage medium stores an operating system, computer readable instructions, and a database.
- the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
- the database of the computer equipment is used to store the data involved in the audio single tone color separation method.
- the network interface of the computer device is used to communicate with an external terminal through a network connection.
- the computer-readable instruction is executed by the processor to realize a method of separating audio single tone color.
- the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
- a computer device including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
- the processor executes the computer-readable instructions, the audio frequency in the foregoing embodiment is implemented.
- the steps of the single tone color separation method for example, step 101 to step 105 shown in FIG. 2.
- the processor implements the functions of the modules/units of the audio single tone color separation device in the foregoing embodiment when executing the computer-readable instructions, for example, the functions of the modules 601 to 605 shown in FIG. 8. To avoid repetition, I won’t repeat them here.
- a computer-readable storage medium In one embodiment, a computer-readable storage medium is provided.
- the one or more computer-readable storage media store computer-readable instructions.
- the steps of the audio single tone color separation method in the above method embodiments are implemented, or the one or more readable storage media storing computer readable instructions are executed by one or
- one or more processors execute computer-readable instructions to implement the functions of each module/unit in the audio single tone color separation device in the foregoing device embodiment. To avoid repetition, I won’t repeat them here.
- the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
- the computer-readable instructions can be stored in a computer-readable storage. In the medium, when the computer-readable instructions are executed, they may include the processes of the foregoing method embodiments.
- any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory.
- the memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory may include random access memory (RAM) or external cache memory.
- RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Channel (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.
- SRAM static RAM
- DRAM dynamic RAM
- SDRAM synchronous DRAM
- DDRSDRAM double data rate SDRAM
- ESDRAM enhanced SDRAM
- SLDRAM synchronous chain Channel
- memory bus Radbus direct RAM
- RDRAM direct memory bus dynamic RAM
- RDRAM memory bus dynamic RAM
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Quality & Reliability (AREA)
- Auxiliary Devices For Music (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
一种音频单音色分离方法、装置、计算机设备及存储介质,应用于音频处理技术领域,用于解决现有技术无法实现单音色分离的问题。方法包括:获取待音频分离的目标音频(101);确定针对目标音频所需分离的各个音色种类(102);从预先训练好的各个LSTM神经网络中选取出与各个音色种类对应的一个LSTM神经网络,作为目标LSTM神经网络(103),各个LSTM神经网络分别采用不同的音色种类组合所对应的音频样本预先训练得到,每个音色种类组合由两个以上音色种类组成;将目标音频作为输入投入至目标LSTM神经网络,得到输出的各个目标频谱图(104);将各个目标频谱图分别进行时域变换,得到各个目标频谱图各自对应的目标单音色音频,作为目标音频的音频分离结果(105)。
Description
本申请以2019年06月13日提交的申请号为201910511337.1,名称为“音频单音色分离方法、装置、计算机设备及存储介质”的中国发明专利申请为基础,并要求其优先权。
本申请涉及音频处理技术领域,尤其涉及音频单音色分离方法、装置、计算机设备及存储介质。
在音乐库的开发中,音乐的内容分析是尤为重要的。一般自然采集得到的音频中常常是混合多种乐器和人声,若能实现对音频中各种乐器、人声的单音色分离,则可以对单一乐器、人声的内容分析提供强力的素材支持,并且促进音乐的音高识别技术和自动转录技术快速发展。可见,对音频实现单音色分离具有巨大的意义和价值。
因此,寻找一种能够实现音频单音色分离的方法一直是本领域技术人员亟需解决的问题。
发明内容
本申请实施例提供一种音频单音色分离方法、装置、计算机设备及存储介质,以解决现有技术无法实现单音色分离的问题。
一种音频单音色分离方法,包括:
获取待音频分离的目标音频;
确定针对所述目标音频所需分离的各个音色种类;
从预先训练好的各个LSTM神经网络中选取出与所述各个音色种类对应的一个LSTM神经网络,作为目标LSTM神经网络,所述各个LSTM神经网络分别采用不同的音色种类组合所对应的音频样本预先训练得到,每个音色种类组合由两个以上音色种类组成;
将所述目标音频作为输入投入至所述目标LSTM神经网络,得到所述目标LSTM神经网络输出的各个目标频谱图;
将所述各个目标频谱图分别进行时域变换,得到所述各个目标频谱图各自对应的目标单音色音频,作为所述目标音频的音频分离结果。
一种音频单音色分离装置,包括:
音频获取模块,用于获取待音频分离的目标音频;
音色种类确定模块,用于确定针对所述目标音频所需分离的各个音色种类;
神经网络选取模块,用于从预先训练好的各个LSTM神经网络中选取出与所述各个音色种类对应的一个LSTM神经网络,作为目标LSTM神经网络,所述各个LSTM神经网络分别采用不同的音色种类组合所对应的音频样本预先训练得到,每个音色种类组合由两个以上音色种类组成;
目标音频输入模块,用于将所述目标音频作为输入投入至所述目标LSTM神经网络,得到所述目标LSTM神经网络输出的各个目标频谱图;
时域变换模块,用于将所述各个目标频谱图分别进行时域变换,得到所述各个目标频谱图各自对应的目标单音色音频,作为所述目标音频的音频分离结果。
一种计算机设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现上述音频单音色 分离方法的步骤。
一个或多个存储有计算机可读指令的可读存储介质,所述计算机可读存储介质存储有计算机可读指令,使得所述一个或多个处理器执行上述音频单音色分离方法的步骤。
本申请的一个或多个实施例的细节在下面的附图和描述中提出,本申请的其他特征和优点将从说明书、附图以及权利要求变得明显。
为了更清楚地说明本申请实施例的技术方案,下面将对本申请实施例的描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1是本申请一实施例中音频单音色分离方法的一应用环境示意图;
图2是本申请一实施例中音频单音色分离方法的一流程图;
图3是本申请中LSTM神经网络对一个混合音频进行音色分离得到各个分离音频的原理示意图;
图4是本申请一实施例中音频单音色分离方法步骤102在一个应用场景下的流程示意图;
图5是本申请一实施例中音频单音色分离方法在一个应用场景下预先训练LSTM神经网络的流程示意图;
图6是本申请一实施例中音频单音色分离方法在一个应用场景下合成混合音频样本的流程示意图;
图7是本申请一实施例中音频单音色分离方法步骤304在一个应用场景下的流程示意图;
图8是本申请一实施例中音频单音色分离装置在一个应用场景下的结构示意图;
图9是本申请一实施例中音色种类确定模块的结构示意图;
图10是本申请一实施例中音频单音色分离装置在另一个应用场景下的结构示意图;
图11是本申请一实施例中计算机设备的一示意图。
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请提供的音频单音色分离方法,可应用在如图1的应用环境中,其中,客户端通过网络与服务器进行通信。其中,该客户端可以但不限于各种个人计算机、笔记本电脑、智能手机、平板电脑和便携式可穿戴设备。服务器可以用独立的服务器或者是多个服务器组成的服务器集群来实现。
在一实施例中,如图2所示,提供一种音频单音色分离方法,以该方法应用在图1中的服务器为例进行说明,包括如下步骤:
101、获取待音频分离的目标音频;
本实施例中,服务器首先需要获取到待音频分离的目标音频。可以理解的是,服务器可以通过多种方式获取到目标音频,比如,可以由负责音频分离的工作人员上传至服务器;也可以由服务器根据定时任务从指定数据库中提取得到音频文件,并将这些提取到的音频文件确定为待音频分离的目标音频;等等。
102、确定针对所述目标音频所需分离的各个音色种类;
需要说明的是,本实施例中使用的LSTM神经网络需要预先训练得到,一个LSTM神经网络在训练时以哪几种音色种类为目标进行音频分离的,在该LSTM神经网络训练完成后,该LSTM神经网络一般仅能用于分离得到这几种音色种类的单音色音频。例如,假设在训练某LSTM神经网络时,针对钢琴、小提琴、鼓三种音色种类进行学习训练,该LSTM神经网络训练好之后,使用该LSTM神经网络对某目标音频进行音色分离时,会得到钢琴、小提琴、鼓三种音色种类下的单音色音频。因此,在本实施例中针对不同的音色种类组合分别训练了多个LSTM神经网络,以便这些预先训练好的LSTM神经网络能够覆盖尽可能多的应用场景,例如针对钢琴、小提琴和鼓三种音色种类的组合训练一个LSTM神经网络,记为1号网络;针对钢琴、小提琴和口琴三种音色种类的组合训练一个LSTM神经网络,记为2号网络;针对二胡、风琴和吉他三种音色种类的组合训练一个LSTM神经网络,记为3号网络;等等。
在上述内容基础上,为了准确分离该目标音频中的单音色音频数据,服务器还需要确定针对所述目标音频所需分离的各个音色种类。具体地,一种方式是,负责音色分离的工作人员可以手动输入指令,告知服务器对该目标音频所需分类的音色种类是哪几种,这样服务器即可确定所述各个音色种类;另一种方式是,服务器也可以根据该目标音频的来源场合自动判断该目标音频中大概包含的音色种类组合,从而确定该目标音频所需分离的各个音色种类。
可以理解的是,对目标音频的采集一般根据来源场合的不同会具有某些共同的音色种类,部分来源场合甚至仅有一种音色种类组合。比如,歌舞厅场合中采集到的音频往往包含人声、鼓、电子琴等几种音色种类,学校会议室场合中采集到的音频则往往仅包含人声和噪声。为便于理解,如图4所示,进一步地,步骤102可以包括:
201、获取所述目标音频的来源场合;
202、根据预设的场合音色对应关系确定与所述目标音频的来源场合对应的音色种类组合,所述场合音色对应关系记录了场合与音色种类组合之间的对应关系;
203、将确定出的所述音色种类组合中的各个音色种类确定为针对所述目标音频所需分离的各个音色种类。
对于上述步骤201,可以理解的是,目标音频在被采集时可以一并记录该目标音频的来源场合,从而服务器获取到该目标音频时,可以同时获取到该目标音频的来源场合。例如,可以在目标音频上打上数字标签“001”,该数字标签标示该目标音频的来源场合为学校会议室,服务器从目标音频上读取到该数字标签即可获取到该来源场合。
对于步骤202,服务器上可以预先设置场合音色对应关系,所述场合音色对应关系记录了场合与音色种类组合之间的对应关系,服务器在获取到目标音频的来源场合之后,可以根据预设的场合音色对应关系确定与所述目标音频的来源场合对应的音色种类组合。例如,服务器获取到该目标音频的来源场合为“001”,即学校会议室,在该场合音色对应关系中记录了“001”与“人声、噪声”的音色种类组合对应,则服务器可以确定出该“人声、噪声”的音色种类组合。
对于步骤203,容易理解的是,音色种类组合由一个以上的音色种类组成,且每个音色种类组合由哪些音色种类组成均预先设置好。因此,服务器可以直接获取到确定出的所述音色种类组合中的各个音色种类,并将这些音色种类确定为针对所述目标音频所需分离的各个音色种类。
103、从预先训练好的各个LSTM神经网络中选取出与所述各个音色种类对应的一个LSTM神经网络,作为目标LSTM神经网络,所述各个LSTM神经网络分别采用不同的音色种类组合所对应的音频样本预先训练得到,每个音色种类组合由两个以上音色种类组成;
由上述内容可知,本实施例中预先训练好各个LSTM神经网络,这些LSTM神经网 络分别采用不同的音色种类组合所对应的音频样本预先训练得到,也即与不同音色种类组合相对应,其中每个音色种类组合由两个以上音色种类组成。
服务器在确定出目标音频所需分离的各个音色种类之后,可以从预先训练好的各个LSTM神经网络中选取出与所述各个音色种类对应的一个LSTM神经网络,作为用于为该目标音频进行音色分离的目标LSTM神经网络。
为便于理解,下面将对每个音色种类组合对应的LSTM神经网络的训练过程进行详细描述。如图5所示,进一步地,每个音色种类组合对应的LSTM神经网络通过以下步骤预先训练得到:
301、获取待训练的LSTM神经网络对应的音色种类组合包含的各个音色种类,作为各个样本音色种类;
302、分别采集所述各个样本音色种类各自对应的单音色音频样本;
303、根据所述各个样本音色种类各自对应的单音色音频样本合成得到各个混合音频样本,每个混合音频样本由所述各个样本音色种类各自对应的一个单音色音频样本合成得到;
304、针对每个混合音频样本,将所述每个混合音频样本作为输入投入至所述LSTM神经网络,得到所述LSTM神经网络输出的各个样本频谱图;
305、使用预设的代价函数计算所述各个样本频谱图与各个单音色频谱图之间的误差,所述各个单音色频谱图是指所述每个混合音频样本对应的各个单音色音频样本进过频域变换得到的频谱图;
306、以所述代价函数的计算结果为目标,调整所述LSTM神经网络的网络参数,直到所述代价函数的计算结果收敛,然后确定所述LSTM神经网络已训练完成。
对于步骤301,可以理解的是,当准备训练某个音色种类组合对应的LSTM神经网络时,首先要获取这个音色种类组合包含的音色种类,记为各个样本音色种类。例如,准备训练二胡、吉他和口琴这一组合对应的LSTM神经网络,服务器首先获取该LSTM神经网络的各个样本音色种类为“二胡、吉他和口琴”。
特别地,在某些应用场景下,有时候仅需从音频中分离出特定的几种音色种类的音频即可,为了满足这种情况的需求,也可以将所有与特定音色种类不同的音频数据划分为“其它”音色种类,在训练LSTM神经网络时考虑上“其它”音色种类。例如,若某个LSTM神经网络需要被训练成用于分离二胡、吉他和口琴三种音色种类的音频数据,可以将二胡、吉他、口琴和“其它”四种音色种类设定为一个音色种类组合,作为该LSTM神经网络的各个样本音色种类。
对于步骤302,训练时,为了保证样本的纯净度,使用于LSTM神经网络训练的数据均为准确的音频数据,因此,需要分别针对各个样本音色种类采集各自对应的单音色音频样本。这里所说的单音色音频样本是指该音频样本中仅包含有一种音色种类的音频数据,比如“二胡”这一音色种类对应的单音色音频样本可以在无噪音环境下采集二胡乐器奏响的音频,这样采集得到的音频数据可以认为是“二胡”的单音色音频数据,因此可以用作单音色音频样本。
对于步骤303,在采集得到单音色音频样本之后,服务器可以根据所述各个样本音色种类各自对应的单音色音频样本合成得到各个混合音频样本,其中,每个混合音频样本由所述各个样本音色种类各自对应的一个单音色音频样本合成得到。
为便于理解,如图6所示,更进一步地,步骤303中每个混合音频样本通过以下步骤合成得到:
401、针对每个所述样本音色种类对应的单音色音频样本,从所述单音色音频样本中选取出一个单音色音频样本,作为待混音样本;
402、将所述各个样本音色种类各自对应的待混音样本进行混音处理,得到一个混合 音频样本。
对于步骤401和步骤402,服务器可以从每个样本音色种类的单音色音频样本中取一个样本,即所述待混音样本,再将取出的各个待混音样本进行混音处理,得到一个混合音频样本。例如,步骤301获取到3个样本音色种类,分别记为种类1、种类2和种类3,步骤302为每个样本音色种类分别采集了10个单音色音频样本,则服务器可以从种类1的10个单音色音频样本中取出1个单音色音频样本作为待混音样本,从种类2的10个单音色音频样本中取出1个单音色音频样本作为待混音样本,从种类3的10个单音色音频样本中取出1个单音色音频样本作为待混音样本,然后把取出的3个待混音样本进行混音处理,得到1个混音音频样本。这就是一个混音音频样本的合成过程,重复上述步骤401和步骤402,可以得到多个混音音频样本。
需要说明的是,为了提高样本训练的有效性,在步骤303中应当注意任意两个混音音频样本所用的合成的单音色音频样本的组合是不同的,应当知道,若两个单音色音频样本的组合相同,则这两个组合各自合成得到的混音音频样本也相同,两个相同的样本投入到后续的训练,一般仅增加了训练所需的运算负担,对LSTM神经网络的训练完成度没有帮助。为此,在重复执行上述步骤401和步骤402时,可以采用组合的方式选取待混音样本的组合,这样,由各个组合混音后得到的各个混合音频样本各不相同。
对于步骤304,服务器在合成得到各个混合音频样本之后,可以将这些混音音频样本投入到LSTM神经网络中,对该LSTM神经网络进行训练。针对每个混音音频样本,服务器将其作为输入投入至所述LSTM神经网络,得到所述LSTM神经网络输出的各个样本频谱图。
为便于理解,下面将对每个混音音频样本输入LSTM神经网络后,在LSTM神经网络中的处理过程进行详细描述。更进一步地,如图7所示,步骤304可以包括:
501、将所述每个混合音频样本进行频域变换,得到所述每个混合音频样本的混合频谱图;
502、对所述混合频谱图加窗,且对加窗得到的每帧数据进行短时傅里叶变换,得到各个频谱特征向量;
503、对所述各个频谱特征向量进行重叠分组,得到各组频谱特征片段,任意相邻两组频谱特征片段之间存在重叠的频谱特征向量;
504、将每组频谱特征片段分别与预设卷积核进行卷积计算,得到降维后的各个片段向量,每个片段向量对应一组频谱特征片段;
505、将所述各个片段向量输入至LSTM,得到所述LSTM输出的各个音频信息向量;
506、针对预设数量个多层感知器中的每个多层感知器,将所述各个音频信息向量输入至每个多层感知器,得到所述每个多层感知器输出的分离特征向量,所述预设数量等于所述各个样本音色种类的数量;
507、针对每个多层感知器输出的分离特征向量,将所述分离特征向量输入至所述LSTM进行特征还原,得到每个多层感知器对应的音频特征向量;
508、针对每个多层感知器对应的音频特征向量,将所述音频特征向量分别与所述预设卷积核进行转置卷积计算,得到升维后的、每个多层感知器对应的各个音频特征片段;
509、分别拼合各个多层感知器各自对应的各个音频特征片段,得到各个多层感知器各自对应的单音色频谱图,作为所述LSTM神经网络输出的各个样本频谱图。
对于上述步骤501-509,可以结合图3进行理解,图3示出了LSTM神经网络对一个混合音频进行音色分离得到各个分离音频的原理示意图。
对于步骤501,首先,服务器可以将所述每个混合音频样本进行频域变换,得到所 述每个混合音频样本的混合频谱图。
对于步骤502,在得到混合频谱图之后,服务器可以对该混合频谱图加窗。具体地,可以对该混合频谱图加汉明窗,通过汉明窗可以取得该混合频谱图上的一帧帧的数据,服务器再对每帧数据进行短时傅里叶变换,得到各个频谱特征向量。需要说明的是,通过汉明窗从该混合频谱图上读取数据时可以设定一定的重叠率,比如可以设定50%-80%区间中的一个重叠率,每帧数据的时间长度可以设定为20毫秒左右。
对于步骤503,在得到各个频谱特征向量之后,可以对这些频谱特征向量进行重叠分组,也即将多个频谱特征向量捆绑成一组频谱特征片段。分组时,每组频谱特征片段具体可以划分有预设数量个频谱特征向量,该预设数量可以根据实际使用情况设定,本实施例对此不作限定。
需要说明的是,分组与分组之间存在一定的数据重叠,也即任意相邻两组频谱特征片段之间存在重叠的频谱特征向量。这样分组的意义在于抽取瞬态音频的变化,分组之间重叠的时间长度要求概括较为简单的音频变化。由于还需要把瞬时变化的部分考虑进去,在分组的过程中,可以采用50%重叠率的重叠分组方式。为了便于理解,重叠分组时,首先把各个频谱特征向量进行编号,从0到n。然后可以定义从0到a号的频谱特征向量划分为第一分组,即第一组频谱特征片段;从a/2到3*a/2号的频谱特征向量划分为第二分组,即第二组频谱特征片段;从a到2*a号的频谱特征向量划分为第三分组,即第三组频谱特征片段;以此类推,直至对所有频谱特征向量完成分组位置(其中a可以为每组的帧数,且保证a为一个偶数)。
对于步骤504,在得到各组频谱特征片段之后,为了将这些频谱特征片段降维,便于后续进行序列学习,服务器可以将每组频谱特征片段分别与预设卷积核进行卷积计算,得到降维后的各个片段向量,其中,每个片段向量对应一组频谱特征片段。具体地,该预设卷积核可以设定为在时间维度上较小,在频率维度上较大,从而对频谱特征片段进行卷积之后,得到的片段向量将在频域方向上展平,一般来说,可以使得降维后的片段向量为一个一维向量。
需要说明的是,为了提高LSTM神经网络对模型的表达能力,可以在每层卷积层后面使用激活函数进行激活,从而在该LSTM神经网络中加入非线性因素,解决线性模型在表达能力上不足的问题。
对于步骤505,参阅图3,在进行步骤504处理后,服务器得到的各个片段向量可以为被拉直的一个个一维向量,然后,服务器将这些片段向量输入一个长短期记忆网络(LSTM),进行一次从序列到序列(seq2seq)的学习,并且LSTM中每个网络单元输出的向量作为某时间节点被抽取的音频信息,即得到LSTM输出的各个音频信息向量。
对于步骤506,本实施例中的LSTM神经网络上还设置有预设数量个多层感知器(MLP,Multi-Layer Perceptron),多层感知器的数量等于所述各个样本音色种类的数量,比如,本次训练的LSTM神经网络对应的音色种类组合存在3个样本音色种类,则该LSTM神经网络中预设有3个多层感知器。本实施例中多层感知器的功能类似于信息过滤器,用来过滤出某一音色的特征信息,实现了音频信息的分离。
服务器针对每个多层感知器,将步骤505得到的各个音频信息向量输入至该多层感知器,得到该多层感知器输出的分离特征向量。这相当于,服务器将各个音频信息向量分别输入每个多层感知器一次,比如,假设有3个多层感知器,分别为感知器a、感知器b和感知器c,服务器执行步骤505得到N个音频信息向量,则服务器可以将N个音频信息向量输入至感知器a中,得到感知器a输出的分离特征向量;并且,服务器将N个音频信息向量输入至感知器b中,得到感知器b输出的分离特征向量;另外,服务器还将N个音频信息向量输入至感知器c中,得到感知器c输出的分离特征向量。这样,服务器可以得到3个多层感知器分别输出的结果。
对于步骤507,可以理解的是,服务器经过上述步骤501-506的处理,基本已经实现了对混合音频样本中单音色音频信息的分离,但这些分离得到的音频信息以一种仅被LSTM神经网络识别的数据形式存在,即上述的分离特征向量。为了使得这些分离后的音频信息能够被识别和使用,还需要将这些分离特征向量经过对偶的逆向过程处理,实现音频信息的还原。
参阅图3,并结合上述步骤502-504可知,步骤502-504是对音频在数据形式上的处理过程,使得混合音频数据转换为更容易被神经网络理解和识别的数据形式,因此,步骤507-509为与502-504对偶的逆向处理过程,能够将分离特征向量在数据形式上还原为与混合频谱图相同。
具体地,服务器针对每个多层感知器输出的分离特征向量,将所述分离特征向量输入同一个LSTM进行seq2seq学习,每个网络单元输出的向量作为某时间节点被还原的分离的音频特征信息,从而得到每个多层感知器对应的音频特征向量。
对于步骤508,服务器在得到各个音频特征向量之后,可以针对每个多层感知器对应的音频特征向量,将所述音频特征向量分别与所述预设卷积核进行转置卷积计算,得到升维后的、每个多层感知器对应的各个音频特征片段。
对于步骤509,服务器在得到升维后的、每个多层感知器对应的各个音频特征片段,可以分别拼合各个多层感知器各自对应的各个音频特征片段,得到各个多层感知器各自对应的单音色频谱图,作为所述LSTM神经网络输出的各个样本频谱图。
对于步骤305和步骤306,本实施例中,为了评估该LSTM神经网络训练的完成度,可以预先设置代价函数来计算所述各个样本频谱图与各个单音色频谱图之间的误差,然后在调整网络参数时,以所述代价函数的计算得到的误差为目标,不断调整LSTM神经网络中的各个网络参数,直到所述代价函数的计算结果收敛,然后确定所述LSTM神经网络已训练完成。具体地,在训练时,可以使用随机梯度下降法(SGD)促使该LSTM神经网络快速收敛。
特别地,上述的代价函数可以为均方误差(MSE,Mean Squared Error)。
优选地,服务器可以预先将采集得到的训练数据样本划分为训练数据集和测试数据集,其中训练数据集占样本数量的80%,测试数据集占样本数量的20%。在该LSTM神经网络训练完成之后,服务器可以使用测试数据集中的各个样本对该LSTM神经网络进行测试和评估,评估时,可以由负责测试的工作人员对该LSTM神经网络输出的单音色音频进行试听,并以试听的效果作为对该LSTM神经网络的评判,评判通过,再确定该LSTM神经网络确已完成训练,可以投入使用;反之,若评判不通过,可以考虑对该LSTM神经网络重新训练。
优选地,本实施例中,还可以将步骤304输出的各个样本频谱图与混合音频样本对应的各个单音色音频样本的频谱图进行对比校验,若均校验一致,则可以确定神经网络训练完成。
104、将所述目标音频作为输入投入至所述目标LSTM神经网络,得到所述目标LSTM神经网络输出的各个目标频谱图;
本实施例中,服务器在获取到目标音频,确定出目标LSTM神经网络之后,可以将所述目标音频作为输入投入至所述目标LSTM神经网络,得到所述目标LSTM神经网络输出的各个目标频谱图。
105、将所述各个目标频谱图分别进行时域变换,得到所述各个目标频谱图各自对应的目标单音色音频,作为所述目标音频的音频分离结果。
服务器在得到所述目标LSTM神经网络输出的各个目标频谱图之后,为了便于音频数据的管理和存储,也为了方便后续对单音色音频的使用,可以将所述各个目标频谱图分别进行时域变换,得到所述各个目标频谱图各自对应的目标单音色音频,作为所述目标音 频的音频分离结果。可以认为,最后得到的各个目标单音色音频,就是该目标音频中所包含的各个单音色音频数据各自从目标音频中分离出来的音频,同时也是该目标音频在所需分离的各个音色种类的划分下的音频分离结果。
本实施例中,首先,获取待音频分离的目标音频;然后,确定针对所述目标音频所需分离的各个音色种类;接着,从预先训练好的各个LSTM神经网络中选取出与所述各个音色种类对应的一个LSTM神经网络,作为目标LSTM神经网络,所述各个LSTM神经网络分别采用不同的音色种类组合所对应的音频样本预先训练得到,每个音色种类组合由两个以上音色种类组成;再之,将所述目标音频作为输入投入至所述目标LSTM神经网络,得到所述目标LSTM神经网络输出的各个目标频谱图;最后,将所述各个目标频谱图分别进行时域变换,得到所述各个目标频谱图各自对应的目标单音色音频,作为所述目标音频的音频分离结果。可见,本申请通过预先训练好的LSTM神经网络能够将目标音频分离成各个目标单音色音频,且可以根据所需分离得到的音色种类选取出对应的LSTM神经网络来决定最终分离得到的目标单音色音频的音色种类,不仅实现音频的单音色分离,还使得单音色分离的结果在一定程度上可控,在某些应用场景下为音频的内容分析提供更多的支持和帮助。
应理解,上述实施例中各步骤的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
在一实施例中,提供一种音频单音色分离装置,该音频单音色分离装置与上述实施例中音频单音色分离方法一一对应。如图8所示,该音频单音色分离装置包括音频获取模块601、音色种类确定模块602、神经网络选取模块603、目标音频输入模块604和时域变换模块605。各功能模块详细说明如下:
音频获取模块601,用于获取待音频分离的目标音频;
音色种类确定模块602,用于确定针对所述目标音频所需分离的各个音色种类;
神经网络选取模块603,用于从预先训练好的各个LSTM神经网络中选取出与所述各个音色种类对应的一个LSTM神经网络,作为目标LSTM神经网络,所述各个LSTM神经网络分别采用不同的音色种类组合所对应的音频样本预先训练得到,每个音色种类组合由两个以上音色种类组成;
目标音频输入模块604,用于将所述目标音频作为输入投入至所述目标LSTM神经网络,得到所述目标LSTM神经网络输出的各个目标频谱图;
时域变换模块605,用于将所述各个目标频谱图分别进行时域变换,得到所述各个目标频谱图各自对应的目标单音色音频,作为所述目标音频的音频分离结果。
如图9所示,优选地,所述音色种类确定模块602可以包括:
来源场合获取单元6021,用于获取所述目标音频的来源场合;
种类组合确定单元6022,用于根据预设的场合音色对应关系确定与所述目标音频的来源场合对应的音色种类组合,所述场合音色对应关系记录了场合与音色种类组合之间的对应关系;
音色种类确定单元6023,用于将确定出的所述音色种类组合中的各个音色种类确定为针对所述目标音频所需分离的各个音色种类。
如图10所示,优选地,每个音色种类组合对应的LSTM神经网络可以通过以下模块预先训练得到:
样本种类获取模块606,用于获取待训练的LSTM神经网络对应的音色种类组合包含的各个音色种类,作为各个样本音色种类;
音频样本采集模块607,用于分别采集所述各个样本音色种类各自对应的单音色音频样本;
混合音频样本合成模块608,用于根据所述各个样本音色种类各自对应的单音色音频样本合成得到各个混合音频样本,每个混合音频样本由所述各个样本音色种类各自对应的一个单音色音频样本合成得到;
样本输入模块609,用于针对每个混合音频样本,将所述每个混合音频样本作为输入投入至所述LSTM神经网络,得到所述LSTM神经网络输出的各个样本频谱图;
误差计算模块610,用于使用预设的代价函数计算所述各个样本频谱图与各个单音色频谱图之间的误差,所述各个单音色频谱图是指所述每个混合音频样本对应的各个单音色音频样本进过频域变换得到的频谱图;
网络参数调整模块611,用于以所述代价函数的计算结果为目标,调整所述LSTM神经网络的网络参数,直到所述代价函数的计算结果收敛,然后确定所述LSTM神经网络已训练完成。
优选地,所述样本输入模块可以包括:
频域变换单元,用于将所述每个混合音频样本进行频域变换,得到所述每个混合音频样本的混合频谱图;
加窗单元,用于对所述混合频谱图加窗,且对加窗得到的每帧数据进行短时傅里叶变换,得到各个频谱特征向量;
重叠分组单元,用于对所述各个频谱特征向量进行重叠分组,得到各组频谱特征片段,任意相邻两组频谱特征片段之间存在重叠的频谱特征向量;
卷积计算单元,用于将每组频谱特征片段分别与预设卷积核进行卷积计算,得到降维后的各个片段向量,每个片段向量对应一组频谱特征片段;
片段向量输入单元,用于将所述各个片段向量输入至LSTM,得到所述LSTM输出的各个音频信息向量;
多层感知器处理单元,用于针对预设数量个多层感知器中的每个多层感知器,将所述各个音频信息向量输入至每个多层感知器,得到所述每个多层感知器输出的分离特征向量,所述预设数量等于所述各个样本音色种类的数量;
特征还原单元,用于针对每个多层感知器输出的分离特征向量,将所述分离特征向量输入至所述LSTM进行特征还原,得到每个多层感知器对应的音频特征向量;
转置卷积计算单元,用于针对每个多层感知器对应的音频特征向量,将所述音频特征向量分别与所述预设卷积核进行转置卷积计算,得到升维后的、每个多层感知器对应的各个音频特征片段;
特征片段拼合单元,用于分别拼合各个多层感知器各自对应的各个音频特征片段,得到各个多层感知器各自对应的单音色频谱图,作为所述LSTM神经网络输出的各个样本频谱图。
优选地,每个混合音频样本通过以下模块合成得到:
待混音样本选取模块,用于针对每个所述样本音色种类对应的单音色音频样本,从所述单音色音频样本中选取出一个单音色音频样本,作为待混音样本;
混音处理模块,用于将所述各个样本音色种类各自对应的待混音样本进行混音处理,得到一个混合音频样本。
关于音频单音色分离装置的具体限定可以参见上文中对于音频单音色分离方法的限定,在此不再赘述。上述音频单音色分离装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。
在一个实施例中,提供了一种计算机设备,该计算机设备可以是服务器,其内部结构 图可以如图11所示。该计算机设备包括通过系统总线连接的处理器、存储器、网络接口和数据库。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括可读存储介质、内存储器。该可读存储介质存储有操作系统、计算机可读指令和数据库。该内存储器为可读存储介质中的操作系统和计算机可读指令的运行提供环境。该计算机设备的数据库用于存储音频单音色分离方法中涉及到的数据。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机可读指令被处理器执行时以实现一种音频单音色分离方法。本实施例所提供的可读存储介质包括非易失性可读存储介质和易失性可读存储介质。
在一个实施例中,提供了一种计算机设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机可读指令,处理器执行计算机可读指令时实现上述实施例中音频单音色分离方法的步骤,例如图2所示的步骤101至步骤105。或者,处理器执行计算机可读指令时实现上述实施例中音频单音色分离装置的各模块/单元的功能,例如图8所示模块601至模块605的功能。为避免重复,这里不再赘述。
在一个实施例中,提供了一种计算机可读存储介质,该一个或多个存储有计算机可读指令的可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行计算机可读指令时实现上述方法实施例中音频单音色分离方法的步骤,或者,该一个或多个存储有计算机可读指令的可读存储介质,计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行计算机可读指令时实现上述装置实施例中音频单音色分离装置中各模块/单元的功能。为避免重复,这里不再赘述。本实施例所提供的可读存储介质包括非易失性可读存储介质和易失性可读存储介质。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机可读指令来指令相关的硬件来完成,所述的计算机可读指令可存储于一计算机可读取存储介质中,该计算机可读指令在执行时,可包括如上述各方法的实施例的流程。其中,本申请所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。
所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。
Claims (20)
- An audio single-timbre separation method, comprising: acquiring target audio to be separated; determining the timbre types to be separated from the target audio; selecting, from pre-trained LSTM neural networks, one LSTM neural network corresponding to the timbre types as a target LSTM neural network, wherein the LSTM neural networks are respectively pre-trained using audio samples corresponding to different timbre type combinations, and each timbre type combination consists of two or more timbre types; feeding the target audio as input into the target LSTM neural network to obtain the target spectrograms output by the target LSTM neural network; and performing a time-domain transformation on each of the target spectrograms to obtain the target single-timbre audio corresponding to each target spectrogram, as the audio separation result of the target audio.
- The audio single-timbre separation method according to claim 1, wherein the determining the timbre types to be separated from the target audio comprises: acquiring the source occasion of the target audio; determining, according to a preset occasion-timbre correspondence, the timbre type combination corresponding to the source occasion of the target audio, the occasion-timbre correspondence recording the correspondence between occasions and timbre type combinations; and determining each timbre type in the determined timbre type combination as a timbre type to be separated from the target audio.
- The audio single-timbre separation method according to claim 1 or 2, wherein the LSTM neural network corresponding to each timbre type combination is pre-trained by the following steps: acquiring the timbre types included in the timbre type combination corresponding to the LSTM neural network to be trained, as sample timbre types; separately collecting the single-timbre audio samples corresponding to each of the sample timbre types; synthesizing mixed audio samples from the single-timbre audio samples corresponding to the sample timbre types, each mixed audio sample being synthesized from one single-timbre audio sample of each sample timbre type; feeding, for each mixed audio sample, the mixed audio sample as input into the LSTM neural network to obtain the sample spectrograms output by the LSTM neural network; calculating, using a preset cost function, the error between the sample spectrograms and the single-timbre spectrograms, the single-timbre spectrograms being the spectrograms obtained by frequency-domain transformation of the single-timbre audio samples corresponding to the mixed audio sample; and adjusting the network parameters of the LSTM neural network with the calculation result of the cost function as the optimization target, until the calculation result of the cost function converges, and then determining that training of the LSTM neural network is complete.
- The audio single-timbre separation method according to claim 3, wherein the feeding, for each mixed audio sample, the mixed audio sample as input into the LSTM neural network to obtain the sample spectrograms output by the LSTM neural network comprises: performing a frequency-domain transformation on the mixed audio sample to obtain the mixed spectrogram of the mixed audio sample; applying windowing to the mixed spectrogram and performing a short-time Fourier transform on each windowed frame of data to obtain spectral feature vectors; performing overlapping grouping on the spectral feature vectors to obtain groups of spectral feature fragments, any two adjacent groups of spectral feature fragments sharing overlapping spectral feature vectors; convolving each group of spectral feature fragments with a preset convolution kernel to obtain dimension-reduced fragment vectors, each fragment vector corresponding to one group of spectral feature fragments; inputting the fragment vectors into the LSTM to obtain the audio information vectors output by the LSTM; inputting, for each of a preset number of multilayer perceptrons, the audio information vectors into the multilayer perceptron to obtain the separation feature vector output by that multilayer perceptron, the preset number being equal to the number of sample timbre types; inputting, for the separation feature vector output by each multilayer perceptron, the separation feature vector into the LSTM for feature restoration to obtain the audio feature vector corresponding to that multilayer perceptron; performing, for the audio feature vector corresponding to each multilayer perceptron, a transposed convolution calculation between the audio feature vector and the preset convolution kernel to obtain the dimension-raised audio feature fragments corresponding to that multilayer perceptron; and stitching together the audio feature fragments corresponding to each multilayer perceptron to obtain the single-timbre spectrogram corresponding to each multilayer perceptron, as the sample spectrograms output by the LSTM neural network.
- The audio single-timbre separation method according to claim 3, wherein each mixed audio sample is synthesized by the following steps: selecting, from the single-timbre audio samples corresponding to each sample timbre type, one single-timbre audio sample as a to-be-mixed sample; and mixing the to-be-mixed samples corresponding to the sample timbre types to obtain one mixed audio sample.
- An audio single-timbre separation apparatus, comprising: an audio acquisition module, configured to acquire target audio to be separated; a timbre type determination module, configured to determine the timbre types to be separated from the target audio; a neural network selection module, configured to select, from pre-trained LSTM neural networks, one LSTM neural network corresponding to the timbre types as a target LSTM neural network, wherein the LSTM neural networks are respectively pre-trained using audio samples corresponding to different timbre type combinations, and each timbre type combination consists of two or more timbre types; a target audio input module, configured to feed the target audio as input into the target LSTM neural network to obtain the target spectrograms output by the target LSTM neural network; and a time-domain transformation module, configured to perform a time-domain transformation on each of the target spectrograms to obtain the target single-timbre audio corresponding to each target spectrogram, as the audio separation result of the target audio.
- The audio single-timbre separation apparatus according to claim 6, wherein the timbre type determination module comprises: a source occasion acquisition unit, configured to acquire the source occasion of the target audio; a type combination determination unit, configured to determine, according to a preset occasion-timbre correspondence, the timbre type combination corresponding to the source occasion of the target audio, the occasion-timbre correspondence recording the correspondence between occasions and timbre type combinations; and a timbre type determination unit, configured to determine each timbre type in the determined timbre type combination as a timbre type to be separated from the target audio.
- The audio single-timbre separation apparatus according to claim 6 or 7, wherein the LSTM neural network corresponding to each timbre type combination is pre-trained by the following modules: a sample type acquisition module, configured to acquire the timbre types included in the timbre type combination corresponding to the LSTM neural network to be trained, as sample timbre types; an audio sample collection module, configured to separately collect the single-timbre audio samples corresponding to each of the sample timbre types; a mixed audio sample synthesis module, configured to synthesize mixed audio samples from the single-timbre audio samples corresponding to the sample timbre types, each mixed audio sample being synthesized from one single-timbre audio sample of each sample timbre type; a sample input module, configured to feed, for each mixed audio sample, the mixed audio sample as input into the LSTM neural network to obtain the sample spectrograms output by the LSTM neural network; an error calculation module, configured to calculate, using a preset cost function, the error between the sample spectrograms and the single-timbre spectrograms, the single-timbre spectrograms being the spectrograms obtained by frequency-domain transformation of the single-timbre audio samples corresponding to the mixed audio sample; and a network parameter adjustment module, configured to adjust the network parameters of the LSTM neural network with the calculation result of the cost function as the optimization target, until the calculation result of the cost function converges, and then determine that training of the LSTM neural network is complete.
- The audio single-timbre separation apparatus according to claim 8, wherein the sample input module comprises: a frequency-domain transformation unit, configured to perform a frequency-domain transformation on the mixed audio sample to obtain the mixed spectrogram of the mixed audio sample; a windowing unit, configured to apply windowing to the mixed spectrogram and perform a short-time Fourier transform on each windowed frame of data to obtain spectral feature vectors; an overlapping grouping unit, configured to perform overlapping grouping on the spectral feature vectors to obtain groups of spectral feature fragments, any two adjacent groups of spectral feature fragments sharing overlapping spectral feature vectors; a convolution calculation unit, configured to convolve each group of spectral feature fragments with a preset convolution kernel to obtain dimension-reduced fragment vectors, each fragment vector corresponding to one group of spectral feature fragments; a fragment vector input unit, configured to input the fragment vectors into the LSTM to obtain the audio information vectors output by the LSTM; a multilayer perceptron processing unit, configured to input, for each of a preset number of multilayer perceptrons, the audio information vectors into the multilayer perceptron to obtain the separation feature vector output by that multilayer perceptron, the preset number being equal to the number of sample timbre types; a feature restoration unit, configured to input, for the separation feature vector output by each multilayer perceptron, the separation feature vector into the LSTM for feature restoration to obtain the audio feature vector corresponding to that multilayer perceptron; a transposed convolution calculation unit, configured to perform, for the audio feature vector corresponding to each multilayer perceptron, a transposed convolution calculation between the audio feature vector and the preset convolution kernel to obtain the dimension-raised audio feature fragments corresponding to that multilayer perceptron; and a feature fragment stitching unit, configured to stitch together the audio feature fragments corresponding to each multilayer perceptron to obtain the single-timbre spectrogram corresponding to each multilayer perceptron, as the sample spectrograms output by the LSTM neural network.
- The audio single-timbre separation apparatus according to claim 8, wherein each mixed audio sample is synthesized by the following modules: a to-be-mixed sample selection module, configured to select, from the single-timbre audio samples corresponding to each sample timbre type, one single-timbre audio sample as a to-be-mixed sample; and a mixing processing module, configured to mix the to-be-mixed samples corresponding to the sample timbre types to obtain one mixed audio sample.
- A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein when the processor executes the computer-readable instructions, the following steps are implemented: acquiring target audio to be separated; determining the timbre types to be separated from the target audio; selecting, from pre-trained LSTM neural networks, one LSTM neural network corresponding to the timbre types as a target LSTM neural network, wherein the LSTM neural networks are respectively pre-trained using audio samples corresponding to different timbre type combinations, and each timbre type combination consists of two or more timbre types; feeding the target audio as input into the target LSTM neural network to obtain the target spectrograms output by the target LSTM neural network; and performing a time-domain transformation on each of the target spectrograms to obtain the target single-timbre audio corresponding to each target spectrogram, as the audio separation result of the target audio.
- The computer device according to claim 11, wherein the determining the timbre types to be separated from the target audio comprises: acquiring the source occasion of the target audio; determining, according to a preset occasion-timbre correspondence, the timbre type combination corresponding to the source occasion of the target audio, the occasion-timbre correspondence recording the correspondence between occasions and timbre type combinations; and determining each timbre type in the determined timbre type combination as a timbre type to be separated from the target audio.
- The computer device according to claim 11 or 12, wherein the LSTM neural network corresponding to each timbre type combination is pre-trained by the following steps: acquiring the timbre types included in the timbre type combination corresponding to the LSTM neural network to be trained, as sample timbre types; separately collecting the single-timbre audio samples corresponding to each of the sample timbre types; synthesizing mixed audio samples from the single-timbre audio samples corresponding to the sample timbre types, each mixed audio sample being synthesized from one single-timbre audio sample of each sample timbre type; feeding, for each mixed audio sample, the mixed audio sample as input into the LSTM neural network to obtain the sample spectrograms output by the LSTM neural network; calculating, using a preset cost function, the error between the sample spectrograms and the single-timbre spectrograms, the single-timbre spectrograms being the spectrograms obtained by frequency-domain transformation of the single-timbre audio samples corresponding to the mixed audio sample; and adjusting the network parameters of the LSTM neural network with the calculation result of the cost function as the optimization target, until the calculation result of the cost function converges, and then determining that training of the LSTM neural network is complete.
- The computer device according to claim 13, wherein the feeding, for each mixed audio sample, the mixed audio sample as input into the LSTM neural network to obtain the sample spectrograms output by the LSTM neural network comprises: performing a frequency-domain transformation on the mixed audio sample to obtain the mixed spectrogram of the mixed audio sample; applying windowing to the mixed spectrogram and performing a short-time Fourier transform on each windowed frame of data to obtain spectral feature vectors; performing overlapping grouping on the spectral feature vectors to obtain groups of spectral feature fragments, any two adjacent groups of spectral feature fragments sharing overlapping spectral feature vectors; convolving each group of spectral feature fragments with a preset convolution kernel to obtain dimension-reduced fragment vectors, each fragment vector corresponding to one group of spectral feature fragments; inputting the fragment vectors into the LSTM to obtain the audio information vectors output by the LSTM; inputting, for each of a preset number of multilayer perceptrons, the audio information vectors into the multilayer perceptron to obtain the separation feature vector output by that multilayer perceptron, the preset number being equal to the number of sample timbre types; inputting, for the separation feature vector output by each multilayer perceptron, the separation feature vector into the LSTM for feature restoration to obtain the audio feature vector corresponding to that multilayer perceptron; performing, for the audio feature vector corresponding to each multilayer perceptron, a transposed convolution calculation between the audio feature vector and the preset convolution kernel to obtain the dimension-raised audio feature fragments corresponding to that multilayer perceptron; and stitching together the audio feature fragments corresponding to each multilayer perceptron to obtain the single-timbre spectrogram corresponding to each multilayer perceptron, as the sample spectrograms output by the LSTM neural network.
- The computer device according to claim 13, wherein each mixed audio sample is synthesized by the following steps: selecting, from the single-timbre audio samples corresponding to each sample timbre type, one single-timbre audio sample as a to-be-mixed sample; and mixing the to-be-mixed samples corresponding to the sample timbre types to obtain one mixed audio sample.
- One or more readable storage media storing computer-readable instructions, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors are caused to perform the following steps: acquiring target audio to be separated; determining the timbre types to be separated from the target audio; selecting, from pre-trained LSTM neural networks, one LSTM neural network corresponding to the timbre types as a target LSTM neural network, wherein the LSTM neural networks are respectively pre-trained using audio samples corresponding to different timbre type combinations, and each timbre type combination consists of two or more timbre types; feeding the target audio as input into the target LSTM neural network to obtain the target spectrograms output by the target LSTM neural network; and performing a time-domain transformation on each of the target spectrograms to obtain the target single-timbre audio corresponding to each target spectrogram, as the audio separation result of the target audio.
- The readable storage media according to claim 16, wherein the determining the timbre types to be separated from the target audio comprises: acquiring the source occasion of the target audio; determining, according to a preset occasion-timbre correspondence, the timbre type combination corresponding to the source occasion of the target audio, the occasion-timbre correspondence recording the correspondence between occasions and timbre type combinations; and determining each timbre type in the determined timbre type combination as a timbre type to be separated from the target audio.
- The readable storage media according to claim 16 or 17, wherein the LSTM neural network corresponding to each timbre type combination is pre-trained by the following steps: acquiring the timbre types included in the timbre type combination corresponding to the LSTM neural network to be trained, as sample timbre types; separately collecting the single-timbre audio samples corresponding to each of the sample timbre types; synthesizing mixed audio samples from the single-timbre audio samples corresponding to the sample timbre types, each mixed audio sample being synthesized from one single-timbre audio sample of each sample timbre type; feeding, for each mixed audio sample, the mixed audio sample as input into the LSTM neural network to obtain the sample spectrograms output by the LSTM neural network; calculating, using a preset cost function, the error between the sample spectrograms and the single-timbre spectrograms, the single-timbre spectrograms being the spectrograms obtained by frequency-domain transformation of the single-timbre audio samples corresponding to the mixed audio sample; and adjusting the network parameters of the LSTM neural network with the calculation result of the cost function as the optimization target, until the calculation result of the cost function converges, and then determining that training of the LSTM neural network is complete.
- The readable storage media according to claim 18, wherein the feeding, for each mixed audio sample, the mixed audio sample as input into the LSTM neural network to obtain the sample spectrograms output by the LSTM neural network comprises: performing a frequency-domain transformation on the mixed audio sample to obtain the mixed spectrogram of the mixed audio sample; applying windowing to the mixed spectrogram and performing a short-time Fourier transform on each windowed frame of data to obtain spectral feature vectors; performing overlapping grouping on the spectral feature vectors to obtain groups of spectral feature fragments, any two adjacent groups of spectral feature fragments sharing overlapping spectral feature vectors; convolving each group of spectral feature fragments with a preset convolution kernel to obtain dimension-reduced fragment vectors, each fragment vector corresponding to one group of spectral feature fragments; inputting the fragment vectors into the LSTM to obtain the audio information vectors output by the LSTM; inputting, for each of a preset number of multilayer perceptrons, the audio information vectors into the multilayer perceptron to obtain the separation feature vector output by that multilayer perceptron, the preset number being equal to the number of sample timbre types; inputting, for the separation feature vector output by each multilayer perceptron, the separation feature vector into the LSTM for feature restoration to obtain the audio feature vector corresponding to that multilayer perceptron; performing, for the audio feature vector corresponding to each multilayer perceptron, a transposed convolution calculation between the audio feature vector and the preset convolution kernel to obtain the dimension-raised audio feature fragments corresponding to that multilayer perceptron; and stitching together the audio feature fragments corresponding to each multilayer perceptron to obtain the single-timbre spectrogram corresponding to each multilayer perceptron, as the sample spectrograms output by the LSTM neural network.
- The readable storage media according to claim 18, wherein each mixed audio sample is synthesized by the following steps: selecting, from the single-timbre audio samples corresponding to each sample timbre type, one single-timbre audio sample as a to-be-mixed sample; and mixing the to-be-mixed samples corresponding to the sample timbre types to obtain one mixed audio sample.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910511337.1A CN110335622B (zh) | 2019-06-13 | 2019-06-13 | Audio single-timbre separation method and apparatus, computer device, and storage medium |
CN201910511337.1 | 2019-06-13 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020248485A1 (zh) | 2020-12-17 |
Family
ID=68141106
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2019/117096 WO2020248485A1 (zh) | 2019-06-13 | 2019-11-11 | Audio single-timbre separation method and apparatus, computer device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110335622B (zh) |
WO (1) | WO2020248485A1 (zh) |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110335622B (zh) * | 2019-06-13 | 2024-03-01 | 平安科技(深圳)有限公司 | Audio single-timbre separation method and apparatus, computer device, and storage medium |
CN110827850B (zh) * | 2019-11-11 | 2022-06-21 | 广州国音智能科技有限公司 | Audio separation method, apparatus and device, and computer-readable storage medium |
CN110853666B (zh) * | 2019-12-17 | 2022-10-04 | 科大讯飞股份有限公司 | Speaker separation method, apparatus, device, and storage medium |
CN111048111B (zh) * | 2019-12-25 | 2023-07-04 | 广州酷狗计算机科技有限公司 | Method, apparatus and device for detecting rhythm points of audio, and readable storage medium |
CN111370031B (zh) * | 2020-02-20 | 2023-05-05 | 厦门快商通科技股份有限公司 | Voice separation method and system, mobile terminal, and storage medium |
CN111370019B (zh) * | 2020-03-02 | 2023-08-29 | 字节跳动有限公司 | Sound source separation method and apparatus, and neural network model training method and apparatus |
US20230335091A1 (en) * | 2020-03-06 | 2023-10-19 | Algoriddim Gmbh | Method and device for decomposing, recombining and playing audio data |
CN111724807B (zh) * | 2020-08-05 | 2023-08-11 | 字节跳动有限公司 | Audio separation method and apparatus, electronic device, and computer-readable storage medium |
CN113113040B (zh) * | 2021-03-22 | 2023-05-09 | 北京小米移动软件有限公司 | Audio processing method and apparatus, terminal, and storage medium |
CN113282509B (zh) * | 2021-06-15 | 2023-11-10 | 广州虎牙科技有限公司 | Timbre recognition and live-streaming room classification method, apparatus, computer device, and medium |
CN114036341B (zh) * | 2022-01-10 | 2022-03-29 | 腾讯科技(深圳)有限公司 | Music tag prediction method and related device |
CN117975933B (zh) * | 2023-12-29 | 2024-08-27 | 北京稀宇极智科技有限公司 | Timbre mixing method and apparatus, audio processing method and apparatus, electronic device, and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108962279A (zh) * | 2018-07-05 | 2018-12-07 | 平安科技(深圳)有限公司 | Musical instrument recognition method and apparatus for audio data, electronic device, and storage medium |
CN108986843B (zh) * | 2018-08-10 | 2020-12-11 | 杭州网易云音乐科技有限公司 | Audio data processing method and apparatus, medium, and computing device |
CN109841226B (zh) * | 2018-08-31 | 2020-10-16 | 大象声科(深圳)科技有限公司 | Single-channel real-time noise reduction method based on a convolutional recurrent neural network |
CN109119063B (zh) * | 2018-08-31 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Video dubbing generation method, apparatus, device, and storage medium |
2019
- 2019-06-13 CN CN201910511337.1A patent/CN110335622B/zh active Active
- 2019-11-11 WO PCT/CN2019/117096 patent/WO2020248485A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180158456A1 (en) * | 2016-12-01 | 2018-06-07 | Postech Academy-Industry Foundation | Speech recognition device and method thereof |
CN109378010A (zh) * | 2018-10-29 | 2019-02-22 | 珠海格力电器股份有限公司 | Neural network model training method, and speech denoising method and apparatus |
CN109801644A (zh) * | 2018-12-20 | 2019-05-24 | 北京达佳互联信息技术有限公司 | Method and apparatus for separating mixed sound signals, electronic device, and readable medium |
CN109584903A (zh) * | 2018-12-29 | 2019-04-05 | 中国科学院声学研究所 | Multi-speaker speech separation method based on deep learning |
CN110335622A (zh) * | 2019-06-13 | 2019-10-15 | 平安科技(深圳)有限公司 | Audio single-timbre separation method and apparatus, computer device, and storage medium |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117153178A (zh) * | 2023-10-26 | 2023-12-01 | 腾讯科技(深圳)有限公司 | Audio signal processing method and apparatus, electronic device, and storage medium |
CN117153178B (zh) * | 2023-10-26 | 2024-01-30 | 腾讯科技(深圳)有限公司 | Audio signal processing method and apparatus, electronic device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110335622B (zh) | 2024-03-01 |
CN110335622A (zh) | 2019-10-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 19932497; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 19932497; Country of ref document: EP; Kind code of ref document: A1 |