CN117012223A - Audio separation method, training method, device, equipment, storage medium and product - Google Patents

Audio separation method, training method, device, equipment, storage medium and product

Info

Publication number
CN117012223A
Authority
CN
China
Prior art keywords
audio
track
separation
feature
spectrum
Prior art date
Legal status
Pending
Application number
CN202210472271.1A
Other languages
Chinese (zh)
Inventor
刘雪松
Current Assignee
Zeku Technology Shanghai Corp Ltd
Original Assignee
Zeku Technology Shanghai Corp Ltd
Priority date
Filing date
Publication date
Application filed by Zeku Technology Shanghai Corp Ltd filed Critical Zeku Technology Shanghai Corp Ltd
Priority to CN202210472271.1A
Priority to PCT/CN2022/143311 (WO2023207193A1)
Publication of CN117012223A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272: Voice signal separating
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/81: Detection of presence or absence of voice signals for discriminating voice from music
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Quality & Reliability (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiment of the application discloses an audio separation method, a training method, a device, equipment, a storage medium and a product, and relates to the field of artificial intelligence. The method comprises the following steps: acquiring the (i-1)-th audio feature produced during audio separation of the (i-1)-th audio segment; inputting the i-th audio spectrum of the i-th audio segment and the (i-1)-th audio feature into a separation network for audio separation to obtain a track mask of each track in the i-th audio segment, where the (i-1)-th audio segment is the segment preceding the i-th audio segment in the target audio; and performing spectrum extraction on the i-th audio spectrum by using the track mask of each track to obtain the track spectrum of each track. By introducing the audio feature obtained during separation of the previous segment, the method provided by the embodiment of the application reduces separation delay and improves the accuracy with which the separation network separates audio segments fed in through a short input window.

Description

Audio separation method, training method, device, equipment, storage medium and product
Technical Field
The embodiment of the application relates to the field of artificial intelligence, in particular to an audio separation method, a training method, a device, equipment, a storage medium and a product.
Background
Audio separation technology refers to technology that extracts and separates the original audio tracks, such as vocals and instrument sounds, from a piece of audio.
In the related art, an artificial intelligence (AI) based audio separation algorithm may be used to perform audio separation: audio data covering a period of time is separated by a separation network to obtain the audio spectrum corresponding to each track in the audio.
However, when the related-art scheme is adopted, audio data must be accumulated over a long period of time before it can be separated, so the separation delay is high and real-time audio separation cannot be achieved.
Disclosure of Invention
The embodiment of the application provides an audio separation method, a training device, equipment, a storage medium and a product. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides an audio separation method, including:
acquiring the (i-1)-th audio feature of the (i-1)-th audio segment in the audio separation process;
inputting the (i-1)-th audio feature and the i-th audio spectrum of the i-th audio segment into a separation network for audio separation to obtain a track mask of each track in the i-th audio segment, where the (i-1)-th audio segment is the segment preceding the i-th audio segment in the target audio; and
performing spectrum extraction on the i-th audio spectrum by using the track mask of each track to obtain the track spectrum of each track.
In another aspect, an embodiment of the present application provides a training method for a separation network, where the method includes:
obtaining sample track-divided audio data, and mixing the sample track-divided audio data to obtain mixed audio data;
inputting the mixed audio spectrum corresponding to the mixed audio data into a separation network for audio separation to obtain a predicted track mask of each track;
performing spectrum extraction on the mixed audio spectrum by using each predicted track mask to obtain a predicted track spectrum of each track; and
updating and training the separation network based on the predicted track spectra and the sample track spectra corresponding to the sample track-divided audio data.
In another aspect, an embodiment of the present application provides an audio separation apparatus, including:
an acquisition module, configured to acquire the (i-1)-th audio feature of the (i-1)-th audio segment in the audio separation process;
an audio separation module, configured to input the (i-1)-th audio feature and the i-th audio spectrum of the i-th audio segment into a separation network for audio separation to obtain a track mask of each track in the i-th audio segment, where the (i-1)-th audio segment is the segment preceding the i-th audio segment in the target audio; and
a spectrum extraction module, configured to perform spectrum extraction on the i-th audio spectrum by using the track mask of each track to obtain the track spectrum of each track.
In another aspect, an embodiment of the present application provides a training apparatus for separating a network, where the apparatus includes:
an acquisition module, configured to acquire sample track-divided audio data and mix the sample track-divided audio data to obtain mixed audio data;
an audio separation module, configured to input the mixed audio spectrum corresponding to the mixed audio data into a separation network for audio separation to obtain a predicted track mask of each track;
a spectrum extraction module, configured to perform spectrum extraction on the mixed audio spectrum by using each predicted track mask to obtain a predicted track spectrum of each track; and
a training module, configured to update and train the separation network based on the predicted track spectra and the sample track spectra corresponding to the sample track-divided audio data.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, the memory stores at least one program, and the at least one program is loaded and executed by the processor to implement the audio separation method or the training method of the separation network according to the above aspects.
In another aspect, embodiments of the present application provide a computer-readable storage medium having at least one piece of program code stored therein, the program code being loaded and executed by a processor to implement the audio separation method or the training method of the separation network according to the above aspects.
In another aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the audio separation method or the training method of the separation network provided in various alternative implementations of the above aspects.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
in the embodiment of the application, when audio separation is performed on the i-th audio spectrum of the i-th audio segment, the separation is based on both the i-th audio spectrum and the feature produced by the (i-1)-th audio segment during its audio separation, yielding a track mask for each track so that the audio spectrum can be separated into the track spectrum of each track. Introducing the audio feature obtained while separating the previous audio segment avoids the lack of information caused by a short input window and improves the accuracy with which the separation network separates audio segments fed in through a short window. When audio separation is performed with the separation network according to the method provided by the embodiment of the application, only the audio segment input within a short time window needs to be separated each time, which reduces the separation delay and enables real-time audio separation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a flowchart of an audio separation method according to an exemplary embodiment of the present application;
FIG. 2 shows a schematic diagram of two separate network forms shown in accordance with an exemplary embodiment of the present application;
FIG. 3 shows a schematic diagram of an audio separation process provided by an exemplary embodiment of the present application;
fig. 4 shows a flowchart of an audio separation method according to another exemplary embodiment of the present application;
FIG. 5 illustrates a schematic diagram of a time domain feature filling process provided by an exemplary embodiment of the present application;
FIG. 6 illustrates a schematic diagram of a buffer provided by an exemplary embodiment of the present application;
fig. 7 shows a flowchart of an audio separation method according to another exemplary embodiment of the present application;
FIG. 8 illustrates a network architecture diagram of a split network provided by an exemplary embodiment of the present application;
FIG. 9 illustrates a schematic diagram of a causal convolution process provided by an exemplary embodiment of the present application;
FIG. 10 illustrates a flowchart of a method of training a split network provided in accordance with another exemplary embodiment of the present application;
FIG. 11 is a flow chart illustrating a method of training a split network according to another exemplary embodiment of the present application;
FIG. 12 is a schematic diagram of a split network training process provided by an exemplary embodiment of the present application;
fig. 13 is a block diagram showing the structure of an audio separating apparatus according to an embodiment of the present application;
fig. 14 is a block diagram showing the structure of an audio separating apparatus according to another embodiment of the present application;
fig. 15 is a block diagram showing the structure of a computer device according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
In the related art, to ensure the accuracy of audio separation, audio segments are generally fed into the separation network through a long time window. For example, 12 s of audio data is used as the input window. With this approach, a longer input window causes a longer separation delay: when 12 s of audio data is used as the input window, the system must wait until 12 s of audio data has accumulated, so real-time audio separation cannot be achieved. If audio segments are instead fed into the separation network through a short time window, the separation accuracy drops because of the smaller amount of data.
In the embodiment of the application, to achieve real-time audio separation while maintaining separation accuracy, the separation feature produced while separating the previous audio segment is introduced when each audio segment is separated. This provides more information for audio separation and improves the accuracy of the separation network when it separates input audio within a short time window, thereby enabling real-time audio separation.
The method provided by the embodiment of the application can be applied to any scenario that requires audio separation. For example, during music playback, the played music can be separated into the audio corresponding to different tracks by using the audio separation method, thereby improving the stereo effect of the playback.
The method provided by the embodiment of the application can be applied to a computer device, where the computer device is an electronic device with an audio separation function. The electronic device may be a mobile terminal such as a smartphone, a tablet computer or a laptop computer, or a terminal such as a desktop computer or a projection computer, which is not limited in the embodiment of the present application. When audio separation is needed, the target audio can be input into the computer device, and the computer device performs audio separation on the target audio by using the separation network to obtain the audio data corresponding to each track in the target audio.
Referring to fig. 1, a flowchart of an audio separation method according to an exemplary embodiment of the present application is shown, where the method is applied to a computer device as an example, and the method includes:
step 101, the i-1 audio feature of the i-1 audio fragment in the audio separation process is obtained.
In one possible implementation manner, in the process of performing audio separation on the target audio by using the separation network, the target audio is split into multiple audio fragments, and audio separation is performed on each audio fragment in turn. I.e. after audio separation of the i-1 th audio segment, audio separation of the i-th audio segment is performed.
In order to realize real-time audio separation, the audio fragment time length of each time of inputting the audio fragment into the separation network is required to be shorter, so that larger delay can be avoided, namely, the audio fragment is required to be input in a short time window. However, if only a shorter audio segment is input to the separation network for audio separation at a time, the separation accuracy will be affected due to the shorter audio segment and the smaller amount of data contained therein. Therefore, in one possible implementation manner, in the process of performing audio separation on the ith audio fragment, the computer equipment acquires the i-1 audio feature of the ith-1 audio fragment in the audio separation process, so that the features extracted in the previous audio fragment separation process are fused, and the audio separation accuracy is improved.
Step 102: input the (i-1)-th audio feature and the i-th audio spectrum of the i-th audio segment into the separation network for audio separation to obtain a track mask of each track in the i-th audio segment, where the (i-1)-th audio segment is the segment preceding the i-th audio segment in the target audio.
When audio separation is performed on the i-th audio segment, a time-frequency transform is first applied to the i-th audio segment. The time-frequency transform uses the Short-Time Fourier Transform (STFT) to transform the audio data into the frequency domain, producing the complex spectrum of the i-th audio segment, i.e., the i-th audio spectrum. The time-frequency transform is as follows:
X = STFT(x)
where x is the audio data corresponding to the i-th audio segment and X is the i-th audio spectrum.
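The following Python sketch illustrates this time-frequency transform with torch.stft; the FFT size and hop length are assumptions chosen for illustration and are not specified in the patent text.

import torch

def to_spectrum(x: torch.Tensor, n_fft: int = 2048, hop: int = 1024) -> torch.Tensor:
    # Transform an audio segment x (shape [samples]) into its complex spectrum X.
    window = torch.hann_window(n_fft)
    return torch.stft(x, n_fft=n_fft, hop_length=hop, window=window, return_complex=True)

# Example: a segment of 8 time frames of 1024 samples each (about 0.171 s at 48 kHz)
x_i = torch.randn(8 * 1024)
X_i = to_spectrum(x_i)          # complex tensor of shape [freq_bins, time_frames]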
Optionally, after obtaining the i-th audio spectrum, the computer device inputs the i-th audio spectrum into the separation network together with the (i-1)-th audio feature, and the separation network performs audio separation on the i-th audio spectrum to obtain the track mask of each track.
At the same time, the separation network outputs the i-th audio feature produced while separating the i-th audio spectrum, for use in the audio separation process of the (i+1)-th audio segment.
The processing procedure of the separation network is as follows:
Net(X, H_{i-1}) = ([m_0, m_1, m_2, ..., m_{N-1}], H_i)
where H_{i-1} denotes the (i-1)-th audio feature of the (i-1)-th audio segment in the audio separation process, H_i denotes the i-th audio feature of the i-th audio segment in the audio separation process, and m_{N-1} denotes the track mask corresponding to the (N-1)-th track.
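A minimal sketch of this streaming interface is given below, assuming a PyTorch-style module; the class name, placeholder internals and cache representation are illustrative and not taken from the patent.

from typing import List, Tuple
import torch
from torch import nn

class StreamingSeparator(nn.Module):
    # Illustrates only the call signature Net(X_i, H_{i-1}) -> ([m_0, ..., m_{N-1}], H_i).
    def __init__(self, num_tracks: int):
        super().__init__()
        self.num_tracks = num_tracks

    def forward(self, X_i: torch.Tensor, H_prev: List[torch.Tensor]) -> Tuple[List[torch.Tensor], List[torch.Tensor]]:
        # A real implementation would run causal convolution layers here, reading
        # H_prev for time-domain filling and emitting the updated cache H_next.
        masks = [torch.ones_like(X_i) for _ in range(self.num_tracks)]   # placeholder masks
        H_next = H_prev                                                  # placeholder cache
        return masks, H_next

X_i = torch.randn(1025, 9, dtype=torch.complex64)   # spectrum of the i-th segment
separator = StreamingSeparator(num_tracks=4)
masks, H_i = separator(X_i, H_prev=[])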
In one possible implementation, the separation network may be a single separation network, as shown in fig. 2: the audio spectrum is input into the separation network for audio separation, and the separation network outputs the track masks of the individual tracks. Alternatively, the separation network may be a network composed of a plurality of separation sub-networks, as shown in fig. 2: the audio spectrum is input into the separation network 202, audio separation is performed by each separation sub-network in the separation network 202, and each separation sub-network outputs the track mask corresponding to one track.
When performing audio separation on the target audio, the entire audio data corresponding to the target audio may be time-frequency transformed to obtain the complex spectrum of the target audio, the complex spectrum may be input into the separation network, and the separation network may split it into the complex spectra of multiple audio segments and perform audio separation on the complex spectrum of each segment. Alternatively, in another possible implementation, each audio segment may be time-frequency transformed separately to obtain its complex spectrum, which is then input into the separation network for audio separation; this is not limited in the embodiment of the present application.
Step 103: perform spectrum extraction on the i-th audio spectrum by using the track mask of each track to obtain the track spectrum of each track.
After obtaining the track mask of each track, the computer device may perform spectrum extraction on the i-th audio spectrum by using the track masks. The track masks are also complex-valued. During spectrum extraction, each track mask is multiplied by the i-th audio spectrum to obtain the track spectrum of the corresponding track. The spectrum extraction process is as follows:
S_n = m_n · X
where S_n is the track spectrum corresponding to the n-th track, n = 0, 1, ..., N-1.
Optionally, after the track spectra of the tracks are obtained, an inverse time-frequency transform is applied to each track spectrum. The inverse transform may use the Inverse Short-Time Fourier Transform (ISTFT) to transform the track spectrum back into the time domain to obtain the audio data corresponding to each track. That is:
s_n = ISTFT(S_n)
where s_n is the audio data corresponding to the n-th track.
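A minimal Python sketch of this mask-and-invert step is shown below; the FFT parameters must match the forward transform and are assumptions here.

from typing import List
import torch

def extract_tracks(X_i: torch.Tensor, masks: List[torch.Tensor],
                   n_fft: int = 2048, hop: int = 1024) -> List[torch.Tensor]:
    # Multiply the segment spectrum by each complex track mask, then apply the
    # inverse STFT to obtain the time-domain audio data of each track.
    window = torch.hann_window(n_fft)
    tracks = []
    for m in masks:
        S_n = m * X_i                                                   # track spectrum
        s_n = torch.istft(S_n, n_fft=n_fft, hop_length=hop, window=window)
        tracks.append(s_n)
    return tracks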
In one possible implementation, as shown in fig. 3, the audio separation process for the i-th audio segment is as follows. The computer device first applies the time-frequency transform 301 to the i-th audio segment to obtain the i-th audio spectrum, then inputs the i-th audio spectrum and the (i-1)-th audio feature into the separation network 302 to obtain the track masks corresponding to tracks 0 to N-1, while the separation network 302 also outputs the i-th audio feature for use in the audio separation process of the (i+1)-th audio segment. The computer device then multiplies the i-th audio spectrum by the track mask of each track to obtain the track spectra corresponding to tracks 0 to N-1, and applies the inverse time-frequency transform 303 to each track spectrum to obtain the audio data corresponding to tracks 0 to N-1.
In summary, in the embodiment of the present application, when audio separation is performed on the i-th audio spectrum of the i-th audio segment, the separation is based on both the i-th audio spectrum and the feature produced by the (i-1)-th audio segment during its audio separation, yielding a track mask for each track so that the audio spectrum is separated into the track spectrum of each track. Introducing the audio feature obtained while separating the previous audio segment avoids the lack of information caused by a short input window and improves the accuracy with which the separation network separates audio segments fed in through a short window. When audio separation is performed with the separation network according to the method provided by the embodiment of the present application, only the audio segment input within a short time window needs to be separated each time, which reduces the separation delay and enables real-time audio separation.
In one possible implementation, the separation network includes a plurality of convolution layers, and the computer device performs convolution processing with these convolution layers to obtain the audio features corresponding to the audio spectrum. In the embodiment of the application, the convolution processing in each convolution layer is a causal convolution that introduces the audio feature from the previous segment's separation process, so that audio separation is performed by fusing audio features from the historical separation process. The causal convolution process is described below by way of example.
Referring to fig. 4, a flowchart of an audio separation method according to another exemplary embodiment of the present application is shown, where the embodiment of the present application is described by taking the application of the method to a computer device as an example, the method includes:
step 401, obtain the i-1 audio feature of the i-1 audio clip in the audio separation process.
In this step, reference may be made to step 101, and this embodiment will not be repeated.
Step 402: determine each (i-1)-th layered audio feature, where different (i-1)-th layered audio features correspond to different convolution layers in the separation network.
In one possible implementation, the separation network includes a plurality of convolution layers that convolve the audio features. In the embodiment of the application, to allow the separation network to accept input through a short time window, i.e., a short duration for each separated audio segment, the computer device passes information between the separation processes of adjacent audio segments. This information transfer is a forward transfer: information from the previous audio segment's separation process is passed to the current segment's separation process, so that the current separation can draw on more information.
The transferred information is the convolution state, which is the (i-1)-th audio feature produced during audio separation of the (i-1)-th audio segment. In one possible implementation, the convolution state of each convolution layer during separation of the (i-1)-th audio segment is passed to the corresponding convolution layer during separation of the i-th audio segment. The (i-1)-th audio feature therefore contains the convolution states of the different convolution layers, and the convolution states of different convolution layers are the different (i-1)-th layered audio features. After acquiring the (i-1)-th audio feature, the computer device determines the (i-1)-th layered audio feature corresponding to each convolution layer and inputs it into the corresponding convolution layer of the separation network.
Step 403: input each (i-1)-th layered audio feature into the corresponding convolution layer for causal convolution with the audio features of the i-th audio spectrum to obtain the track mask of each track.
To keep the information transfer continuous, the embodiment of the application performs the convolution in a causal manner, so that the (i-1)-th audio feature is carried into the audio separation process of the i-th audio segment. Optionally, this step may include the following sub-steps, and a code sketch of the padded causal convolution is given after step 403c below.
Step 403a: perform feature filling on the k-th audio input feature of the k-th convolution layer based on the (i-1)-th layered audio feature corresponding to the k-th convolution layer to obtain the k-th audio filling feature.
During convolution, the input feature map of a convolution layer is usually padded so that the output feature map has the same size as the input. Taking a 3×3 convolution kernel as an example, one column must be filled on each of the left and right sides of the feature map and one row on each of the top and bottom, so that the output size is preserved. In one possible implementation, the computer device performs feature filling on the audio input feature of the convolution layer with the (i-1)-th layered audio feature to obtain the audio filling feature, and the convolution layer then convolves this audio filling feature. The feature filling of the k-th audio input feature of the k-th convolution layer may include the following steps:
step one, performing time domain feature filling on the kth audio input feature by utilizing the ith-1 layered audio feature corresponding to the kth convolution layer.
In the audio separation process, a two-dimensional convolution layer is generally used for feature extraction in two dimensions of the time domain and the frequency domain. Therefore, in the process of feature filling, feature filling needs to be performed in the time domain dimension and the frequency domain dimension respectively, so that the output size is ensured. If the size of each input channel is n×t (where N represents N frequency points, T represents T time frames), and the convolution kernel is 3×3, the input size of the final convolution layer after the feature filling is completed is (n+2) ×t+2.
In one possible implementation, the causal convolution changes the time-domain filling: the computer device fills the k-th audio input feature in the time domain with the (i-1)-th layered audio feature corresponding to the k-th convolution layer.
In the related art, time-domain filling is applied to both the historical and the future time frames with a filling value of 0. Taking the 3×3 convolution kernel as an example, one column is filled on each of the left and right sides of the feature map, i.e., one historical time frame and one future time frame are filled. This only preserves the output size of the convolution layer, and the filled information is meaningless. In this embodiment, a causal convolution is used instead: during the causal time-domain filling, the filling value is no longer 0 but the forward-transferred convolution state, and the filling is applied only to historical time frames. This allows the convolution over a long input feature map to be decomposed into several convolutions over shorter feature maps, so that the separation network can perform audio separation with a short time window as input.
Optionally, within the historical time frames, the time-domain feature filling is performed on the k-th audio input feature by using the (i-1)-th layered audio feature corresponding to the k-th convolution layer.
The feature size of the (i-1)-th layered audio feature equals the difference between the time-domain sizes of the input feature and the output feature of the convolution layer. In one possible implementation, when storing the (i-1)-th layered audio features of the (i-1)-th audio segment, the computer device determines the features to be stored based on this time-domain size difference for each convolution layer in the separation network. Taking the 3×3 convolution kernel as an example, the time-domain filling must cover two historical frames, so the (i-1)-th layered audio feature contains the input audio features of the last two frames of the convolution layer. As shown in fig. 5, the k-th audio input feature contains the features of time frames 0 to T-1, and the (i-1)-th layered audio feature fills the two historical time frames 501, so the time-domain feature size becomes T+2. For the current k-th audio input feature, the i-th layered audio feature stored in the last two time frames 502 is passed on to the audio separation process of the (i+1)-th audio segment.
Step two: perform frequency-domain feature filling on the time-domain-filled k-th audio input feature to obtain the k-th audio filling feature.
In one possible implementation, the frequency-domain filling is unchanged: the time-domain-filled k-th audio input feature is filled at the high-frequency end and at the low-frequency end.
Optionally, the filling value is 0, supplementing the information of a higher frequency point and a lower frequency point. As shown in fig. 5, when the convolution kernel is 3×3, 0 is filled at the high frequency point 503 and the low frequency point 504, so the frequency-domain feature size becomes N+2.
In the embodiment above, the time-domain feature filling is performed first and the frequency-domain feature filling second; in the actual separation process, the frequency-domain filling may equally be performed before the time-domain filling.
Step 403b: perform causal convolution on the k-th audio filling feature to obtain the k-th audio output feature.
After the k-th audio input feature has been filled to obtain the k-th audio filling feature, the computer device performs a two-dimensional causal convolution on the k-th audio filling feature in the k-th convolution layer to obtain the k-th audio output feature of that layer.
Step 403c: determine the track mask of each track based on the n-th audio output feature of the n-th convolution layer, where n is greater than or equal to k.
After causal convolution through multiple convolution layers, the track mask of each track can be determined from the n-th audio output feature of the n-th convolution layer, where the n-th convolution layer may be the last convolution layer contained in the final output layer of the separation network.
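The sketch below illustrates one such padded causal-convolution step for a 3×3 kernel, as referenced after step 403: the cached layered feature is prepended along the time axis, zeros are padded along the frequency axis, and the last two input frames become the new cache. Shapes, channel counts and names are assumptions.

import torch
import torch.nn.functional as F

def causal_conv_step(conv: torch.nn.Conv2d, x: torch.Tensor, state: torch.Tensor):
    # x:     [batch, channels, freq, time] input feature map of the current segment
    # state: cached last two input frames from the previous segment (zeros initially)
    x_padded = torch.cat([state, x], dim=-1)     # time-domain filling with history
    x_padded = F.pad(x_padded, (0, 0, 1, 1))     # frequency filling: one bin above and below
    y = conv(x_padded)                           # 3x3 convolution without built-in padding
    new_state = x[..., -2:]                      # keep the last two frames for the next segment
    return y, new_state

conv = torch.nn.Conv2d(16, 16, kernel_size=3, padding=0)
x = torch.randn(1, 16, 128, 8)                   # 8 time frames, 128 frequency bins
state = torch.zeros(1, 16, 128, 2)
y, state = causal_conv_step(conv, x, state)      # y has shape [1, 16, 128, 8]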
Step 404: store the i-th audio feature of the i-th audio segment in the audio separation process in a buffer.
After the audio separation of the i-th audio segment, the computer device may store the i-th audio feature produced during the separation. The i-th audio feature likewise contains the i-th layered audio feature corresponding to each convolution layer. Schematically, as shown in fig. 5, the audio features in the last two time frames are the i-th layered audio feature corresponding to the k-th convolution layer.
In one possible implementation, the i-th layered audio features may be merged and stored together. When they are passed on to the next audio separation process, the i-th audio feature is split into the individual i-th layered audio features, which are then input into the corresponding convolution layers.
In yet another possible implementation, a neural network accelerator is provided in the computer device to accelerate the audio separation performed by the separation network. To accelerate the process further, a buffer may be provided inside the neural network accelerator to store the audio features to be passed to the next audio separation process. A corresponding buffer partition is provided for each convolution layer, and the audio features of different convolution layers can be stored in different buffer partitions. The separation network can read the convolution state generated in the previous audio separation process, i.e., the audio feature, from the buffer partition of each convolution layer and use it for the time-domain filling of the current layer's input feature map. In this way, the stored audio features do not need to be merged and split again, which reduces the extra overhead and power consumption and further improves the audio separation efficiency of the separation network.
Illustratively, as shown in fig. 6, a buffer area 602 is provided inside the neural network accelerator 601, and the buffer area includes buffer partitions corresponding to the convolution layers 1 to M.
Optionally, when the i-th audio feature is stored, the i-th layered audio feature corresponding to each convolution layer in the i-th audio feature is stored in its buffer partition, with different buffer partitions corresponding to different convolution layers.
When each i-th layered audio feature is stored in its buffer partition, the (i-1)-th layered audio feature originally stored there must be cleared, and the i-th layered audio feature is written into the cleared buffer, i.e., the (i-1)-th layered audio feature in the buffer partition is replaced by the i-th layered audio feature.
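A minimal sketch of such a per-layer state buffer follows; the class, layer names and shapes are illustrative and not defined in the patent.

import torch

class ConvStateBuffer:
    # One partition per convolution layer; each separation step the (i-1)-th layered
    # feature is read out and then overwritten by the i-th layered feature.
    def __init__(self, layer_shapes: dict):
        self.partitions = {name: torch.zeros(shape) for name, shape in layer_shapes.items()}

    def read(self, layer: str) -> torch.Tensor:
        return self.partitions[layer]

    def write(self, layer: str, new_state: torch.Tensor) -> None:
        self.partitions[layer] = new_state       # replace the previous segment's feature

buffer = ConvStateBuffer({"enc1_conv": (1, 16, 128, 2)})
prev_state = buffer.read("enc1_conv")            # used for time-domain filling
buffer.write("enc1_conv", torch.randn(1, 16, 128, 2))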
In this embodiment, causal convolution is used to transfer information forward through the audio separation process: the features of the previous separation are used to fill the features of the current separation, so the input features of a long window can be split into the input features of short windows. When the separation network separates audio segments input through a short window, this avoids the low separation accuracy that too little information would cause, improves separation accuracy, reduces separation delay and enables real-time audio separation.
In this embodiment, a dedicated buffer is provided for the audio features that each separation pass must hand over to the next one, which avoids merging and splitting the layered audio features of all convolution layers, improves the separation efficiency of the separation network and helps reduce power consumption.
In one possible implementation, the separation network is a U-Net network that includes an input layer, encoding blocks, decoding blocks and an output layer, each of which contains convolution layers. In the embodiment of the application, the input of the separation network includes the audio features from the previous audio separation process, which are input into the convolution layers of the input layer, the encoding blocks, the decoding blocks and the output layer respectively. This is described below by way of example.
Referring to fig. 7, a flowchart of an audio separation method according to another exemplary embodiment of the present application is shown, where the embodiment of the present application is described by taking the application of the method to a computer device as an example, the method includes:
step 701, obtaining the i-1 audio feature of the i-1 audio fragment in the audio separation process.
In this step, reference may be made to step 101, and this embodiment will not be repeated.
In step 702, each of the i-1 th layered audio features is determined, wherein different i-1 th layered audio features correspond to different convolutional layers in the separation network.
In this step, reference may be made to step 402, and this embodiment will not be repeated.
Step 703: input the i-th audio spectrum and the (i-1)-th layered audio feature corresponding to the input layer into the input layer for feature extraction to obtain the initial audio feature.
In one possible implementation, the input layer of the separation network consists of one or more two-dimensional convolution layers. The input layer first extracts features from the input i-th audio spectrum to obtain the initial audio feature, i.e., the initial input feature map. Because the input layer contains convolution layers, its input includes the (i-1)-th layered audio feature corresponding to the input layer from the separation of the (i-1)-th audio segment. When the input layer contains multiple convolution layers, the (i-1)-th layered audio feature corresponding to the input layer comprises the (i-1)-th layered audio features of those convolution layers, and the computer device inputs each of them into the corresponding convolution layer of the input layer.
Step 704: input the (i-1)-th layered audio feature corresponding to the encoding blocks and the initial audio feature into the encoding blocks for feature encoding to obtain the audio encoding feature.
After obtaining the initial audio feature, the computer device inputs it into N encoding blocks for encoding. Each encoding block consists of an encoder and a downsampler, and the encoder is made up of one or more two-dimensional convolution layers. Each convolution layer may be followed by a Rectified Linear Unit (ReLU) activation layer, namely:
y=max(0,x)
the downsampler downsamples the audio feature in two dimensions, i.e., in the time domain, the features of two frames of time frames may be combined into the features of one frame of time frame, and the coding scale is coarsened from finer to coarser. Since the coding scale is changed from thin to thick, the convolution layers in different coding blocks can set different channel numbers. When the coding scale is thicker, more channels can be adopted to learn more characteristic information, and the probability of inaccurate separation caused by information loss in the down-sampling process is reduced.
The encoding process of each encoded block is as follows:
encoding_i = Encoder(x_i)
x_{i+1} = Downsample(encoding_i)
where encoding_i is the encoding information output by the i-th encoding block and x_{i+1} is the input of the next encoding block.
After encoding by each encoding block, the output features are input into the bottleneck layer. The bottleneck layer is an encoder with the coarsest encoding scale and the smallest amount of data; after the bottleneck layer, encoding is complete and the audio encoding feature is obtained.
Because each encoding block contains convolution layers and the bottleneck layer also contains convolution layers, the input of each encoding block and of the bottleneck layer includes the corresponding (i-1)-th layered audio feature. Illustratively, as shown in fig. 8, the i-th audio spectrum is input into the input layer 801 together with the (i-1)-th layered audio feature corresponding to the input layer to obtain the initial audio feature, which is input into the encoding blocks 802. The corresponding (i-1)-th layered audio feature is input into each encoding block 802 at the same time. After encoding by the N encoding blocks, the output features are input into the bottleneck layer 803 for encoding together with its corresponding (i-1)-th layered audio feature, and the bottleneck layer 803 outputs the final audio encoding feature. In one possible implementation, the (i-1)-th layered audio feature corresponding to each convolution layer is obtained by splitting the (i-1)-th audio feature.
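A simplified, non-streaming sketch of one such encoding block is shown below; the channel counts, kernel size and strided-convolution downsampler are assumptions, since the patent only states that each block is an encoder (convolution plus ReLU) followed by a downsampler.

import torch
from torch import nn

class EncoderBlock(nn.Module):
    # Conv2d + ReLU encoder followed by a 2x downsampler in both frequency and time.
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.downsample = nn.Conv2d(out_ch, out_ch, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor):
        encoding = self.encoder(x)               # kept for the skip connection to the decoder
        return self.downsample(encoding), encoding

block = EncoderBlock(16, 32)
down, encoding = block(torch.randn(1, 16, 128, 8))
# down: [1, 32, 64, 4] (input of the next block), encoding: [1, 32, 128, 8]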
Step 705: input the (i-1)-th layered audio feature corresponding to the decoding blocks and the audio encoding feature into the decoding blocks for feature decoding to obtain the audio decoding feature.
After obtaining the audio encoding feature, the computer device inputs it into N decoding blocks for feature decoding. Each decoding block contains an upsampler and a decoder. During decoding, the upsampler upsamples in both the time domain and the frequency domain, so the decoding scale is gradually restored from coarse to fine. Correspondingly, the decoder also consists of one or more two-dimensional convolution layers, and each decoding block upsamples its input features, decodes them and passes the result to the next decoding block. Because the encoding process includes downsampling, some detail information is lost; to recover it, the input features of each decoder are skip-connected to the output features of the encoder of the corresponding size, i.e., the encoder output of the same size is concatenated with the decoder input before decoding. This restores the information lost through downsampling and improves the accuracy of audio separation. The decoding operation of a decoding block is as follows:
y_{i+1} = Decoder(Concatenate(Upsample(y_i), encoding_i))
where y_i denotes the input of the i-th decoding block; y_i is upsampled and concatenated with the encoding information encoding_i from the encoder in the i-th encoding block, and y_{i+1} is the output of the i-th decoding block, i.e., the input of the (i+1)-th decoding block.
Likewise, the input of each decoding block contains the corresponding (i-1)-th layered audio feature.
As shown in fig. 8, after the bottleneck layer 803 outputs the audio encoding feature, it is input into the decoding blocks 804, and the input of each decoding block 804 contains its (i-1)-th layered audio feature.
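A matching sketch of one decoding block with its skip connection is given below; as before, the layer sizes and the transposed-convolution upsampler are assumptions.

import torch
from torch import nn

class DecoderBlock(nn.Module):
    # Upsample, concatenate the matching encoder output (skip connection), then decode.
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.upsample = nn.ConvTranspose2d(in_ch, in_ch, kernel_size=2, stride=2)
        self.decoder = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, y: torch.Tensor, encoding: torch.Tensor) -> torch.Tensor:
        y = self.upsample(y)
        y = torch.cat([y, encoding], dim=1)      # skip connection restores lost detail
        return self.decoder(y)

block = DecoderBlock(in_ch=32, skip_ch=32, out_ch=16)
out = block(torch.randn(1, 32, 64, 4), torch.randn(1, 32, 128, 8))   # -> [1, 16, 128, 8]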
Step 706: input the audio decoding feature and the (i-1)-th layered audio feature corresponding to the output layer into the output layer for feature separation to obtain the track mask of each track.
Finally, the audio decoding feature produced by the N decoding blocks is input into the output layer for feature separation to obtain the track mask of each track. The output layer also consists of one or more two-dimensional convolution layers, and the convolution layer in the output layer is followed by an activation layer that applies a tanh activation to constrain the output range to (-1, 1). The tanh activation is as follows:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
where x is the output feature of the convolution layer.
Correspondingly, the input of the output layer contains the corresponding (i-1)-th layered audio feature. As shown in fig. 8, the (i-1)-th layered audio feature corresponding to the output layer 805 and the audio decoding feature produced by the N decoding blocks 804 are input into the output layer 805 to obtain the track mask of each track.
During the audio separation performed with the input layer, the encoding blocks, the bottleneck layer, the decoding blocks and the output layer, each convolution layer outputs the layered audio feature corresponding to its input, to be used in the next audio separation process.
The feature size of the output layered audio feature may be determined by the size of the convolution layer's kernel. After output, the layered audio features may be merged and stored, or stored directly in their corresponding buffer partitions.
As shown in fig. 8, the i-th layered audio features output by the convolution layers of the input layer 801, the encoding blocks 802, the bottleneck layer 803, the decoding blocks 804 and the output layer 805 are combined to obtain the i-th audio feature, which is used in the audio separation process of the (i+1)-th audio segment.
As described above, the convolutions in the convolution layers contained in the input layer, the encoding blocks, the decoding blocks and the output layer of the U-Net network are causal convolutions. Fig. 9 schematically illustrates the causal convolution process with a network containing 3 encoding blocks and 3 decoding blocks. The convolution kernels of the convolution layers are all 3×3, and fig. 9 only shows the time-domain processing because the frequency-domain processing in a causal convolution is the same as in an ordinary convolution. Each dot represents a time frame, and the lines between dots represent the dependencies of the convolution operations. During the audio separation of the i-th audio segment, the (i-1)-th audio feature 901 is fed into the corresponding input of each convolution layer, and the input feature of each convolution layer is filled in the time domain. The input features of the last two frames of each convolution layer are output as the i-th audio feature 902 for the audio separation process of the (i+1)-th audio segment.
It should be noted that, although the encoder and the decoder are only schematically illustrated as including the convolution layer, in the actual separation process, the down-sampler and the up-sampler may include the convolution layer as well, and perform the causal convolution operation.
For an N-layer U-Net network, the bottleneck layer must retain at least one complete time frame after N downsampling operations, so the shortest input window length is 2^N time frames.
That is, the time length of the i-th audio segment is determined by the number of network layers of the U-Net network. Optionally, the time length of the i-th audio segment is greater than or equal to the duration of 2^N time frames, where N is the number of network layers.
Illustratively, as shown in fig. 9, when the U-Net network has 3 layers, the shortest input window is 8 time frames. If the sampling rate is 48 kHz and each time frame contains 1024 sampling points, the corresponding signal duration is 1024×8/48000 = 0.171 s. The method provided by the embodiment of the application can therefore take 0.171 s of audio data as the input time window each time, which greatly reduces the separation delay and makes the method usable for real-time audio separation.
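The window-length arithmetic from this example can be restated as a small calculation; the values simply reproduce the figures above.

# Shortest input window for an N-layer U-Net separator.
n_layers = 3
frame_size = 1024            # samples per time frame
sample_rate = 48_000         # Hz

min_frames = 2 ** n_layers                       # 8 time frames
min_duration = min_frames * frame_size / sample_rate
print(min_frames, round(min_duration, 3))        # 8 frames, about 0.171 s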
The embodiments above schematically describe the process of audio separation with the separation network. The separation network is a pre-trained network, and its training process is described schematically below.
Referring to fig. 10, a flowchart of a method for training a separation network according to another exemplary embodiment of the present application is shown. The method is described taking its application to a computer device as an example, and includes:
in step 1001, sample track audio data is obtained, and mixing processing is performed on the sample track audio data to obtain mixed audio data.
In one possible implementation, the network parameters of the separation network are updated with the sample tracked audio data. Firstly, the computer equipment carries out mixing processing on the sample track-divided audio data, wherein during the mixing processing, the mixing processing is carried out according to a certain rule, and the mixed audio data are obtained. The mixing process is as follows:
wherein s is i Sample track audio data is represented, where i=0, 1, 2..n-1. Alpha i Representing the mixing gain for the i-th track, which may be preset or randomly generated.
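A small Python sketch of this mixing step follows; the gain range drawn here is an assumption, since the patent only states that the gains may be preset or randomly generated.

from typing import List, Optional
import torch

def mix_tracks(tracks: List[torch.Tensor], gains: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Mix the sample track-divided waveforms into one training input x = sum_i(alpha_i * s_i).
    if gains is None:
        gains = torch.empty(len(tracks)).uniform_(0.5, 1.0)   # assumed gain range
    return sum(g * s for g, s in zip(gains, tracks))

vocals = torch.randn(48_000)     # one second of audio per track, for illustration
drums = torch.randn(48_000)
mixture = mix_tracks([vocals, drums])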
Step 1002: input the mixed audio spectrum corresponding to the mixed audio data into the separation network for audio separation to obtain the predicted track mask of each track.
After obtaining the mixed audio data, the computer device applies a time-frequency transform to it so as to transform it into the frequency domain and obtain the corresponding mixed audio spectrum. Optionally, the mixed audio data is transformed into the frequency domain using the STFT.
The separation network is then used to perform audio separation on the mixed audio spectrum to obtain the predicted track mask of each track, namely:
[m_0, m_1, m_2, ..., m_{N-1}] = Net(X)
where X denotes the mixed audio spectrum and m_{N-1} denotes the predicted track mask of the (N-1)-th track.
In one possible implementation, when the separation network performs audio separation on the mixed audio spectrum, the mixed audio spectrum is likewise split into multiple spectrum segments, which are input into the separation network for audio separation in turn.
Step 1003: perform spectrum extraction on the mixed audio spectrum with each predicted track mask to obtain the predicted track spectrum of each track.
After the predicted track mask of each track is obtained, spectrum extraction can be performed on the mixed audio spectrum with the predicted track masks to obtain the predicted track spectrum of each track. The spectrum extraction multiplies each predicted track mask by the mixed audio spectrum, yielding the predicted track spectrum of the corresponding track.
Step 1004: update and train the separation network based on the predicted track spectra and the sample track spectra corresponding to the sample track-divided audio data.
After obtaining the predicted track spectrum of each track, the computer device may update the separation network with the predicted track spectra and the sample track spectra, where a sample track spectrum is the complex spectrum obtained by applying the time-frequency transform to each piece of sample track-divided audio data.
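A sketch of one such update step is shown below, assuming the separation network returns one complex mask per track; the L1 spectral loss and the optimizer are assumptions, since the patent only states that the network is updated from the predicted and sample track spectra.

import torch

def training_step(separator, optimizer, mix_spec, sample_track_specs):
    # One parameter update of the separation network.
    masks = separator(mix_spec)                      # one predicted mask per track
    loss = torch.tensor(0.0)
    for mask, target_spec in zip(masks, sample_track_specs):
        pred_spec = mask * mix_spec                  # predicted track spectrum
        loss = loss + (pred_spec - target_spec).abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring (assumed): optimizer = torch.optim.Adam(separator.parameters(), lr=1e-3)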
In the embodiment of the application, the separation network is trained with a large amount of sample track-divided audio data, which improves the audio separation accuracy of the separation network. When the separation network then performs short-window audio separation, the separation accuracy is maintained, so real-time audio separation can be achieved.
Referring to fig. 11, a flowchart of a method for training a separation network according to another exemplary embodiment of the present application is shown, where the method is applied to a computer device as an example, and the method includes:
step 1101, obtaining sample track audio data, and performing mixing processing on the sample track audio data to obtain mixed audio data.
The embodiment of this step may refer to step 1001, and this embodiment is not described herein.
In step 1102, in each convolution layer in the separation network, causal convolution processing is performed on the audio features of the mixed audio spectrum, so as to obtain an audio track mask of each audio track.
In one possible implementation, the separation network is a U-Net network. When performing audio separation on the mixed audio spectrum through the separation network, the computer device applies causal convolution processing to the corresponding audio features in each convolution layer of the separation network, and the output layer finally outputs the track mask of each track. The causal convolution process may include the following steps:
In step 1102a, feature filling is performed on the kth audio input features of the kth convolutional layer.
Optionally, the time-domain feature filling differs between the training process and the use (inference) process of the separation network. Because the training process has no delay constraint and does not require real-time separation, the input can be a long time window; with a long input window, the influence of the filled values in the time frames on the separation result is negligible, so zeros can be used directly for filling, i.e., the audio features of the previous audio segment in the audio separation process are not needed. The convolution nonetheless remains causal, and the feature filling of the kth audio input feature of the kth convolution layer may include the following steps:
Step one, time-domain feature filling is performed on the kth audio input feature in the historical time frames.

When filling along the time-domain dimension, the filling is applied in the historical time frames. Taking a 3 × 3 convolution kernel as an example, two columns on the left side of the feature map are filled, i.e., the information of the two historical time frames is filled in. Since a long window of audio features has already been input, the audio features from the previous audio separation pass are not needed; optionally, the fill value may be 0.
Step two, frequency-domain feature filling is performed on the kth audio input feature at the high-frequency point and the low-frequency point respectively.

The frequency-domain feature filling is applied to the kth audio input feature at the high-frequency and low-frequency ends. Taking a 3 × 3 convolution kernel as an example, two rows are filled, one above and one below the feature map, filling in information at a higher frequency point and a lower frequency point.
In step 1102b, causal convolution processing is performed on the kth audio input feature after feature filling.
After the feature is filled, in the convolution layer, the computer device may perform causal convolution processing on the filled kth audio input feature to obtain an audio output feature of the convolution layer.
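A sketch of such a causal convolution with zero filling is given below, assuming a 3 × 3 kernel and a feature map laid out as (batch, channels, frequency, time); during training the two history frames are zero-filled as described in step 1102a, whereas at inference they would come from the cached features of the previous segment. The helper name and sizes are illustrative.

import torch
import torch.nn.functional as F

def causal_conv2d(x, weight, bias=None, time_pad=2, freq_pad=1):
    """Causal 3x3 convolution over a (batch, channels, freq, time) feature map."""
    # F.pad pads the last dimension first: (time_left, time_right, freq_low, freq_high);
    # only past time frames are filled, so the convolution never sees future frames.
    x = F.pad(x, (time_pad, 0, freq_pad, freq_pad))
    return F.conv2d(x, weight, bias)

# Illustrative usage: the (freq, time) size of the feature map is preserved.
x = torch.randn(1, 8, 64, 10)    # 8 channels, 64 frequency bins, 10 time frames
w = torch.randn(16, 8, 3, 3)     # 16 output channels, 3x3 kernel
assert causal_conv2d(x, w).shape == (1, 16, 64, 10)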
Finally, the output layer of the separation network outputs the predicted track mask of each track.
In step 1103, spectrum extraction is performed on the mixed audio spectrum by using each predicted audio track mask, so as to obtain a predicted audio track spectrum of each audio track.
The embodiment of this step may refer to step 1003, and this embodiment is not repeated.
Step 1104 determines a contrast loss based on the predicted track spectrum and the sample track spectrum.
After the predicted track spectrum is obtained, a loss function can be computed between the predicted track spectrum and the sample track spectrum of each track to obtain the contrast loss between the two. The contrast loss is determined as follows:

Loss = Σ_{i=0}^{N-1} MSE(Ŝ_i, S_i)

where Ŝ_i represents the predicted track spectrum and S_i represents the sample track spectrum of the i-th track.
In the above manner, the mean square error (MSE) is used as the loss function to determine the contrast loss. In other possible implementations, an L1 loss, an L2 loss, or another loss function may also be used to determine the contrast loss, which is not limited in this embodiment.
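As a sketch, the MSE variant of the contrast loss could be computed as follows on complex spectra; averaging over tracks and time-frequency bins is an assumption, since the embodiment only states that MSE (or an L1/L2 loss) may be used.

import torch

def contrast_loss(pred_spectra, sample_spectra):
    """Mean squared error between predicted and sample track spectra (complex tensors)."""
    diff = pred_spectra - sample_spectra
    # Squared magnitude of the complex error, averaged over all tracks and bins
    return (diff.real ** 2 + diff.imag ** 2).mean()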
Step 1105, performing reverse update training on the separation network based on the contrast loss.
After determining the contrast loss for each track, the computer device may utilize the contrast loss to reverse update the training of the separation network. Alternatively, a gradient back-propagation algorithm may be used to update the network parameters of the split network until the loss function reaches a convergence condition. The separation network after training is completed can be used for real-time audio separation.
In one possible implementation, as shown in fig. 12, the training process of the separation network proceeds as follows. The sample track-divided audio data undergoes a mixing process 1201 to obtain mixed audio data, and a time-frequency transformation 1202 is applied to the mixed audio data to obtain the mixed audio spectrum. The mixed audio spectrum is then input into the separation network 1203 for audio separation to obtain the predicted track mask of each track, and the mixed audio spectrum is multiplied by each predicted track mask to obtain the predicted track spectrum of the corresponding track. A loss function calculation 1204 is performed on the predicted track spectrum and the sample track spectrum corresponding to the sample track-divided audio data to obtain the contrast loss, and the contrast loss is used to reversely update the separation network 1203, thereby realizing update training of the separation network.
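Tying the steps of fig. 12 together, one training step might look like the following sketch. It reuses the helpers sketched earlier (mix_tracks, to_spectrum, extract_track_spectra, contrast_loss) and assumes a separation_net that maps a batched complex spectrum to one real mask per track; the input representation and optimizer settings are design choices not fixed by the application.

import torch

def train_step(separation_net, optimizer, sample_tracks):
    """One update step following fig. 12: mix -> STFT -> masks -> loss -> back-propagation."""
    mixed, _ = mix_tracks(sample_tracks)                        # mixing process 1201
    mixed_spec = to_spectrum(mixed)                             # time-frequency transformation 1202
    sample_specs = torch.stack([to_spectrum(s) for s in sample_tracks])
    masks = separation_net(mixed_spec.unsqueeze(0)).squeeze(0)  # separation network 1203
    pred_specs = extract_track_spectra(mixed_spec, masks)       # spectrum extraction
    loss = contrast_loss(pred_specs, sample_specs)              # loss function calculation 1204
    optimizer.zero_grad()
    loss.backward()                                             # reverse update of the network
    optimizer.step()
    return loss.item()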
In this embodiment, in the process of performing audio separation on the mixed audio spectrum by using the separation network, each convolution layer adopts a causal convolution manner, so that the separation network is suitable for separating audio segments input by a short window, separation delay is reduced, and real-time separation is realized.
Referring to fig. 13, a block diagram of an audio separation device according to an embodiment of the application is shown. As shown in fig. 13, the apparatus may include:
an acquisition module 1301, configured to acquire an i-1 th audio feature of the i-1 th audio segment in the audio separation process;
an audio separation module 1302, configured to input the i-1 th audio feature and the ith audio spectrum of the ith audio segment into a separation network for audio separation, to obtain a track mask of each track in the ith audio segment, where the i-1 th audio segment is the previous segment of the ith audio segment in the target audio;
the spectrum extraction module 1303 is configured to perform spectrum extraction on the ith audio spectrum by using the track mask of each track, so as to obtain a track spectrum of each track.
Optionally, the audio separation module 1302 is further configured to:
determining each of the i-1 th layered audio features, wherein different i-1 th layered audio features correspond to different convolutional layers in the separation network;
And respectively inputting each i-1 th layered audio characteristic into each convolution layer to carry out causal convolution processing on the audio characteristic of the i audio frequency spectrum, so as to obtain an audio track mask of each audio track.
Optionally, the audio separation module 1302 is further configured to:
based on the ith-1 layered audio characteristics corresponding to the kth convolution layer, performing characteristic filling on the kth audio input characteristics of the kth convolution layer to obtain kth audio filling characteristics;
performing causal convolution processing on the kth audio filling feature to obtain a kth audio output feature;
and determining an audio track mask of each audio track based on the nth audio output characteristic of the nth convolution layer, wherein n is more than or equal to k.
Optionally, the audio separation module 1302 is further configured to:
performing time domain feature filling on the kth audio input feature by utilizing the ith-1 layered audio feature corresponding to the kth convolution layer;
and carrying out frequency domain feature filling on the kth input audio feature after the time domain feature filling to obtain kth audio filling features.
Optionally, the audio separation module 1302 is further configured to:
in a historical time frame, performing time domain feature filling on the kth audio input feature by utilizing the ith-1 layered audio feature corresponding to the kth convolution layer;
The step of performing frequency domain feature filling on the kth input audio feature after the time domain feature filling includes:
and respectively filling the frequency domain features of the kth input audio features filled with the time domain features on the high-frequency points and the low-frequency points.
Optionally, the separation network is a U-Net network, and each convolution layer includes a convolution layer in an input layer, an encoding block, a decoding block and an output layer in the separation network;
the audio separation module 1302 is further configured to:
inputting the i-1 th layered audio feature and the i audio frequency spectrum corresponding to the input layer into the input layer for feature extraction to obtain audio initial features;
inputting the i-1 layered audio characteristics and the audio initial characteristics corresponding to the coding blocks into the coding blocks for characteristic coding to obtain audio coding characteristics;
inputting the i-1 layered audio characteristics and the audio coding characteristics corresponding to the decoding block into the decoding block for characteristic decoding to obtain audio decoding characteristics;
inputting the i-1 layered audio characteristics and the audio decoding characteristics corresponding to the output layer into the output layer for characteristic separation to obtain the track masks of the tracks.
Optionally, the time length of the ith audio segment is determined according to the network layer number of the U-Net network.
Optionally, the apparatus further includes:
and the storage module is used for storing the ith audio frequency characteristic of the ith audio frequency fragment in the audio frequency separation process in a buffer zone.
Optionally, the storage module is further configured to:
and respectively storing the ith layered audio features corresponding to all the convolution layers in each buffer partition, wherein different buffer partitions correspond to different convolution layers.
In summary, in the embodiment of the present application, when audio separation is performed on the i-th audio spectrum of the i-th audio segment, the separation is based on both the i-1-th audio features from the audio separation process of the i-1-th audio segment and the i-th audio spectrum, so as to obtain the track mask of each track and thereby separate the audio spectrum into the track spectrum corresponding to each track. By introducing the audio features obtained while separating the previous audio segment, the lack of information caused by short-window input is avoided, which improves the accuracy with which the separation network separates audio segments input through a short window. When audio separation is performed with the separation network by means of the method provided in the embodiment of the present application, only the audio segment input in a short time window needs to be separated each time, which reduces the separation delay and enables real-time audio separation.
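To illustrate how the cached i-1-th layered audio features could replace the zero filling at inference time, the following sketch keeps, per convolution layer, the last two time frames of its input as a buffer partition and prepends them to the next segment's features. The class name, buffer layout, and kernel size are assumptions for illustration, not details fixed by the application.

import torch
import torch.nn.functional as F

class CachedCausalConv:
    """Causal 3x3 convolution whose time-axis padding comes from the features
    cached during the previous audio segment (zeros for the first segment)."""

    def __init__(self, weight, bias=None, history=2, freq_pad=1):
        self.weight, self.bias = weight, bias
        self.history, self.freq_pad = history, freq_pad
        self.cache = None  # buffer partition holding this layer's layered audio feature

    def __call__(self, x):                                   # x: (batch, channels, freq, time)
        if self.cache is None:
            self.cache = x.new_zeros(x.shape[:3] + (self.history,))
        x_padded = torch.cat([self.cache, x], dim=-1)        # time-domain feature filling
        self.cache = x[..., -self.history:].detach()         # store features for the next segment
        x_padded = F.pad(x_padded, (0, 0, self.freq_pad, self.freq_pad))  # frequency-domain filling
        return F.conv2d(x_padded, self.weight, self.bias)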
Referring to fig. 14, a block diagram of a training device for separating a network according to another embodiment of the present application is shown. As shown in fig. 14, the apparatus may include:
an obtaining module 1401, configured to obtain sample track-divided audio data, and perform mixing processing on the sample track-divided audio data to obtain mixed audio data;
an audio separation module 1402, configured to input a mixed audio spectrum corresponding to the mixed audio data into a separation network to perform audio separation, so as to obtain a predicted audio track mask of each audio track;
a spectrum extraction module 1403, configured to perform spectrum extraction on the mixed audio spectrum by using each of the predicted audio track masks, so as to obtain a predicted audio track spectrum of each audio track;
a training module 1404, configured to update and train the separation network based on the sample track spectrum corresponding to the sample split audio data and the predicted track spectrum.
Optionally, the audio separation module 1402 is further configured to:
and in each convolution layer in the separation network, performing causal convolution processing on the audio characteristics of the mixed audio frequency spectrum to obtain an audio track mask of each audio track.
Optionally, the audio separation module 1402 is further configured to:
Performing feature filling on the kth audio input features of the kth convolution layer;
and carrying out causal convolution processing on the k audio input features after feature filling.
Optionally, the audio separation module 1402 is further configured to:
performing time domain feature filling on the kth audio input feature in a historical time frame;
and respectively filling the frequency domain features of the kth audio input features on the high-frequency point and the low-frequency point.
Optionally, the training module 1404 is further configured to:
determining a contrast loss based on the predicted track spectrum and the sample track spectrum;
and based on the contrast loss, performing reverse updating training on the separation network.
In the embodiment of the present application, the separation network is trained with a large amount of sample track-divided audio data, which improves the audio separation accuracy of the separation network. When the separation network is then used for short-window audio separation, the separation accuracy can still be guaranteed, thereby enabling real-time audio separation.
It should be noted that: in the device provided in the above embodiment, when implementing the functions thereof, only the division of the above functional modules is used as an example, in practical application, the above functional allocation may be implemented by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Referring to FIG. 15, a block diagram of a computer device 1500 is shown, according to an exemplary embodiment of the present application. The computer device 1500 of the present application may include one or more of the following: memory 1520, processor 1510.
The processor 1510 may include one or more processing cores. The processor 1510 connects various parts of the overall computer device 1500 through various interfaces and lines, and performs various functions of the computer device 1500 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 1520 and by invoking data stored in the memory 1520. Optionally, the processor 1510 may be implemented in hardware using at least one of digital signal processing (Digital Signal Processing, DSP), a field programmable gate array (Field-Programmable Gate Array, FPGA), or a programmable logic array (Programmable Logic Array, PLA). The processor 1510 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing the content to be displayed by the display screen 1530; the modem is used to handle wireless communication. It will be appreciated that the modem may also be implemented by a separate communication chip rather than being integrated into the processor 1510.
The memory 1520 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). Optionally, the memory 1520 includes a non-transitory computer-readable storage medium. The memory 1520 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 1520 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, which may be an Android system (including a system developed based on the Android system), an iOS system developed by Apple Inc. (including a system deeply developed based on the iOS system), or another system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The data storage area may also store data created by the computer device 1500 in use (e.g., phonebook, audio and video data, chat log data), and the like.
In addition, those skilled in the art will appreciate that the structure of the computer device 1500 shown in the above-described figures is not limiting of the computer device 1500, and that a computer device may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. For example, the computer device 1500 further includes components such as a radio frequency circuit, a shooting component, a sensor, an audio circuit, a wireless fidelity (Wireless Fidelity, wiFi) component, a power supply, and a bluetooth component, which are not described herein.
The present application also provides a computer readable storage medium having stored therein at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the audio separation method or the training method of the separation network provided in any of the above-described exemplary embodiments.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the audio separation method or the training method of the separation network provided in the above optional implementations.
It should be understood that references herein to "a plurality" are to two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. In addition, the step numbers described herein are merely exemplary of one possible execution sequence among steps, and in some other embodiments, the steps may be executed out of the order of numbers, such as two differently numbered steps being executed simultaneously, or two differently numbered steps being executed in an order opposite to that shown, which is not limiting.
The foregoing description is only of preferred embodiments of the present application and is not intended to limit the application; the scope of protection of the application is defined by the appended claims.

Claims (19)

1. A method of audio separation, the method comprising:
acquiring an i-1 audio feature of an i-1 audio fragment in an audio separation process;
inputting the i-1 audio feature and the i audio frequency spectrum of the i audio fragment into a separation network for audio separation to obtain an audio track mask of each audio track in the i audio fragment, wherein the i-1 audio fragment is the last fragment of the i audio fragment in the target audio;
and carrying out spectrum extraction on the ith audio frequency spectrum by utilizing the track mask of each track to obtain the track frequency spectrum of each track.
2. The method of claim 1, wherein audio separating the i-1 th audio feature from the i-th audio spectrum input separation network of the i-th audio clip to obtain an audio track mask for each audio track in the i-th audio clip, comprises:
determining each of the i-1 th layered audio features, wherein different i-1 th layered audio features correspond to different convolutional layers in the separation network;
And respectively inputting each i-1 th layered audio characteristic into each convolution layer to carry out causal convolution processing on the audio characteristic of the i audio frequency spectrum, so as to obtain an audio track mask of each audio track.
3. The method according to claim 2, wherein said inputting each of said i-1 th hierarchical audio features into each of said convolution layers for causal convolution processing with audio features of said i-th audio spectrum to obtain an audio track mask for each of said audio tracks, comprises:
based on the ith-1 layered audio characteristics corresponding to the kth convolution layer, performing characteristic filling on the kth audio input characteristics of the kth convolution layer to obtain kth audio filling characteristics;
performing causal convolution processing on the kth audio filling feature to obtain a kth audio output feature;
and determining an audio track mask of each audio track based on the nth audio output characteristic of the nth convolution layer, wherein n is more than or equal to k.
4. The method of claim 3, wherein the feature filling the kth audio input feature of the kth convolutional layer based on the corresponding i-1 th layered audio feature of the kth convolutional layer to obtain a kth audio filling feature, comprises:
performing time domain feature filling on the kth audio input feature by utilizing the ith-1 layered audio feature corresponding to the kth convolution layer;
And carrying out frequency domain feature filling on the kth input audio feature after the time domain feature filling to obtain kth audio filling features.
5. The method of claim 4, wherein the time domain feature filling the kth audio input feature with the ith-1 layered audio feature corresponding to the kth convolutional layer comprises:
in a historical time frame, performing time domain feature filling on the kth audio input feature by utilizing the ith-1 layered audio feature corresponding to the kth convolution layer;
the step of performing frequency domain feature filling on the kth input audio feature after the time domain feature filling includes:
and respectively filling the frequency domain features of the kth input audio features filled with the time domain features on the high-frequency points and the low-frequency points.
6. The method according to any one of claims 2 to 5, wherein the separate network is a U-Net network, each of the convolutional layers comprising one of an input layer, an encoding block, a decoding block, and an output layer in the separate network;
inputting each i-1 th layered audio feature into each convolution layer to perform causal convolution processing with the audio feature of the i audio spectrum, to obtain an audio track mask of each audio track, including:
Inputting the i-1 th layered audio feature and the i audio frequency spectrum corresponding to the input layer into the input layer for feature extraction to obtain audio initial features;
inputting the i-1 layered audio characteristics and the audio initial characteristics corresponding to the coding blocks into the coding blocks for characteristic coding to obtain audio coding characteristics;
inputting the i-1 layered audio characteristics and the audio coding characteristics corresponding to the decoding block into the decoding block for characteristic decoding to obtain audio decoding characteristics;
inputting the i-1 layered audio characteristics and the audio decoding characteristics corresponding to the output layer into the output layer for characteristic separation to obtain the track masks of the tracks.
7. The method of claim 6, wherein the length of time of the ith audio clip is determined based on the number of network layers of the U-Net network.
8. The method according to any one of claims 1 to 5, wherein after audio-separating the i-1 th audio feature from the i-th audio spectrum input separation network of the i-th audio clip to obtain the track mask of each track in the i-th audio clip, the method further comprises:
And storing the ith audio characteristic of the ith audio fragment in a buffer area in the audio separation process.
9. The method of claim 8, wherein storing the i audio feature of the i audio segment during audio separation in a buffer comprises:
and respectively storing the ith layered audio features corresponding to all the convolution layers in each buffer partition, wherein different buffer partitions correspond to different convolution layers.
10. A method of training a split network, the method comprising:
obtaining sample track-dividing audio data, and carrying out mixing treatment on the sample track-dividing audio data to obtain mixed audio data;
inputting the mixed audio frequency spectrum corresponding to the mixed audio data into a separation network for audio separation to obtain a predicted audio track mask of each audio track;
performing spectrum extraction on the mixed audio frequency spectrum by utilizing each predicted audio track mask to obtain predicted audio track spectrums of each audio track;
and updating and training the separation network based on the predicted track spectrum and the sample track spectrum corresponding to the sample track audio data.
11. The method according to claim 10, wherein the audio-separating the mixed audio spectrum corresponding to the mixed audio data by the audio-separating network to obtain a predicted audio track mask of each audio track includes:
And in each convolution layer in the separation network, performing causal convolution processing on the audio characteristics of the mixed audio frequency spectrum to obtain an audio track mask of each audio track.
12. The method of claim 11, wherein the causal convolution processing of the audio features of the mixed audio spectrum in the respective convolution layers within the separation network comprises:
performing feature filling on the kth audio input features of the kth convolution layer;
and carrying out causal convolution processing on the k audio input features after feature filling.
13. The method of claim 12, wherein the feature filling of the kth audio input feature of the kth convolutional layer comprises:
performing time domain feature filling on the kth audio input feature in a historical time frame;
and respectively filling the frequency domain features of the kth audio input features on the high-frequency point and the low-frequency point.
14. The method according to any one of claims 10 to 13, wherein the updating training the separate network based on the sample track spectrum of which the predicted track spectrum corresponds to the sample split audio data comprises:
Determining a contrast loss based on the predicted track spectrum and the sample track spectrum;
and based on the contrast loss, performing reverse updating training on the separation network.
15. An audio separation device, the device comprising:
the acquisition module is used for acquiring the i-1 audio characteristics of the i-1 audio fragment in the audio separation process;
the audio separation module is used for inputting the i-1 audio feature and an i audio frequency spectrum of an i audio fragment into a separation network for audio separation to obtain an audio track mask of each audio track in the i audio fragment, wherein the i-1 audio fragment is the last fragment of the i audio fragment in the target audio;
and the frequency spectrum extraction module is used for carrying out frequency spectrum extraction on the ith audio frequency spectrum by utilizing the track mask of each track to obtain the track frequency spectrum of each track.
16. A training apparatus for a split network, the apparatus comprising:
the acquisition module is used for acquiring sample track-divided audio data and carrying out mixing processing on the sample track-divided audio data to obtain mixed audio data;
the audio separation module is used for inputting the mixed audio frequency spectrum corresponding to the mixed audio data into a separation network to carry out audio separation to obtain a predicted audio track mask of each audio track;
The frequency spectrum extraction module is used for carrying out frequency spectrum extraction on the mixed audio frequency spectrum by utilizing each predicted audio track mask to obtain predicted audio track frequency spectrums of each audio track;
and the training module is used for updating and training the separation network based on the sample track frequency spectrum corresponding to the predicted track frequency spectrum and the sample track-dividing audio data.
17. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the audio separation method of any one of claims 1 to 9 or the training method of the separation network of any one of claims 10 to 14.
18. A computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement the audio separation method of any one of claims 1 to 9 or the training method of the separation network of any one of claims 10 to 14.
19. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium, from which a processor reads and executes them to implement the audio separation method according to any of claims 1 to 9 or to implement the training method of the separation network according to any of claims 10 to 14.
CN202210472271.1A 2022-04-29 2022-04-29 Audio separation method, training method, device, equipment, storage medium and product Pending CN117012223A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210472271.1A CN117012223A (en) 2022-04-29 2022-04-29 Audio separation method, training method, device, equipment, storage medium and product
PCT/CN2022/143311 WO2023207193A1 (en) 2022-04-29 2022-12-29 Audio separation method and apparatus, training method and apparatus, and device, storage medium and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210472271.1A CN117012223A (en) 2022-04-29 2022-04-29 Audio separation method, training method, device, equipment, storage medium and product

Publications (1)

Publication Number Publication Date
CN117012223A true CN117012223A (en) 2023-11-07

Family

ID=88517173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210472271.1A Pending CN117012223A (en) 2022-04-29 2022-04-29 Audio separation method, training method, device, equipment, storage medium and product

Country Status (2)

Country Link
CN (1) CN117012223A (en)
WO (1) WO2023207193A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711423A (en) * 2024-02-05 2024-03-15 西北工业大学 Mixed underwater sound signal separation method combining auditory scene analysis and deep learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696572B (en) * 2019-03-13 2023-07-18 富士通株式会社 Voice separation device, method and medium
TWI718750B (en) * 2019-11-07 2021-02-11 國立中央大學 Source separation method, apparatus, and non-transitory computer readable medium
CN111627458B (en) * 2020-05-27 2023-11-17 北京声智科技有限公司 Sound source separation method and equipment
CN113380262B (en) * 2021-05-13 2022-10-18 重庆邮电大学 Sound separation method based on attention mechanism and disturbance perception
CN112989107B (en) * 2021-05-18 2021-07-30 北京世纪好未来教育科技有限公司 Audio classification and separation method and device, electronic equipment and storage medium
CN113314140A (en) * 2021-05-31 2021-08-27 哈尔滨理工大学 Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711423A (en) * 2024-02-05 2024-03-15 西北工业大学 Mixed underwater sound signal separation method combining auditory scene analysis and deep learning
CN117711423B (en) * 2024-02-05 2024-05-10 西北工业大学 Mixed underwater sound signal separation method and system combining auditory scene analysis and deep learning

Also Published As

Publication number Publication date
WO2023207193A1 (en) 2023-11-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination