CN113488063A - Audio separation method based on mixed features and coding and decoding - Google Patents
- Publication number
- CN113488063A (application CN202110753553.4A)
- Authority
- CN
- China
- Prior art keywords
- audio
- data
- features
- coding
- decoding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/18—the extracted parameters being spectral information of each sub-band
- G10L25/24—the extracted parameters being the cepstrum
- G10L25/30—characterised by the analysis technique using neural networks
- G10L25/87—Detection of discrete points within a voice signal
Abstract
The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and coding and decoding, comprising the following steps: data collection, in which audio data are collected through multiple terminal devices; preliminary training, in which part of the data is randomly extracted and the endpoints of the voiced and unvoiced portions are labeled; audio feature extraction, in which features are extracted in the frequency, time, and cepstrum domains of the audio signal and the multiple features are weighted through an attention mechanism; audio blank removal, in which low-energy regions of the audio are removed through a voice endpoint detection algorithm and a clustering classifier, high-energy regions containing distinct sounds are retained, and the segmented segments are spliced; and audio separation, in which, using the blank-removed audio waveform data, the speech waveform of the mixed audio is analyzed with a coding-and-decoding network structure, the sound sources are separated by predicting individual time-frequency masks, and audio containing only the human voice is output.
Description
Technical Field
The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and coding and decoding.
Background
With the popularization of multi-terminal devices and advances in high-speed network transmission technology, acquiring massive data is no longer the hard problem for enterprises. However, the large amount of acquired data often contains much low-quality data, which seriously affects the operation of an enterprise's downstream business. Voice data is a large category of internet data. In scenarios such as online teaching and audio/video conferencing, voice data is an important component of activities such as classroom monitoring, record keeping, and human-computer interaction. The voice data streams processed by back-end service servers contain a large number of speech blanks and unrecognizable environmental noise, which increases the processing burden on the service servers.
One way to alleviate this problem is to expand the number of distributed servers and use raw computing power to process the invalid data, but simply expanding computing power does not fundamentally solve the problem. Since the concept of the data middle platform was proposed, more and more enterprises are willing to preprocess data centrally with algorithms and then distribute the effective data, through data extraction, to each service server for use by downstream tasks.
A conventional speech-blank removal algorithm is the Voice Activity Detection (VAD) algorithm. Its main task is to accurately locate the start and end points of speech within noisy speech and to separate silence from actual speech; it is a key technique in speech signal processing. Traditional speech separation techniques include methods based on Non-negative Matrix Factorization (NMF) and on F0 estimation; these are traditional supervised methods for the speech separation task, and their separation quality is poor.
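As a rough illustration of the VAD idea (not the patent's specific algorithm), a minimal short-time-energy detector might look like this; the frame length and threshold are illustrative assumptions:

```python
import numpy as np

def energy_vad(signal, frame_len=256, threshold=0.01):
    """Mark each frame as speech (True) or silence (False) by short-time energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold

# Silence followed by a louder voiced-like burst.
sig = np.concatenate([np.zeros(1024), 0.5 * np.sin(np.linspace(0, 100, 1024))])
flags = energy_vad(sig)  # leading frames silent, trailing frames voiced
```

A real VAD would additionally smooth these frame decisions and hunt for precise start and end points; the method described below further corrects those endpoints with a trained classifier.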
In view of the above problems, the present designers, drawing on years of rich practical experience and professional knowledge in the engineering application of such products and on applied theory, have actively pursued research and innovation to create an audio separation method based on mixed features and coding and decoding that is more practical.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention provides an audio separation method based on mixed features and coding and decoding, thereby effectively solving the problems described in the background art.
To achieve this purpose, the invention adopts the following technical scheme: an audio separation method based on mixed features and coding and decoding, comprising the following steps:
Step 1: data collection; collect audio data through multiple terminal devices, and cache the audio stream data on a local server in the order of the data sequence;
Step 2: preliminary training; randomly extract part of the data for labeling, and label the endpoints of the voiced and unvoiced portions for reference training;
Step 3: audio feature extraction; extract features in the frequency, time, and cepstrum domains of the audio signal, and weight the multiple features through an attention mechanism;
Step 4: audio blank removal; using the final audio feature representation obtained in step 3, remove low-energy regions of the audio through a voice endpoint detection algorithm and a clustering classifier, retain high-energy regions containing distinct sounds, and splice the segmented effective audio segments;
Step 5: audio separation; taking the environmental sound and the human voice as two sound sources and using the blank-removed audio waveform data, analyze the speech waveform of the mixed audio with a coding-and-decoding network structure, and separate the sound sources by predicting individual time-frequency masks.
Further, in step 1, the order of the data sequence is the numbering order provided by each device before the data stream is transmitted.
Further, in step 4, when the segmented effective audio segments are spliced, transition signals are added to the two ends of the audio waveforms to smooth the ends of the effective audio segments.
Further, the environmental sounds are noise that affects recognition of the speaker's speech and the sounds of other, non-dominant speakers.
Further, in step 2, the labeled data are divided into a training set, a validation set, and a test set; the training set is used to train the classifier, the validation set to validate the model training process, and the test set to evaluate the effect of the final model.
Further, the proportion of the training set is 60% -90%.
Further, the classifier is a decision tree based audio classifier, a KNN based audio classifier, or a neural network based audio classifier.
Further, the trained classifier judges the categories of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
Further, in step 5, the design of the coding-and-decoding model structure includes:
S1: set the network to L levels, comprising L downsampling blocks and L upsampling blocks;
S2: process the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
S3: process the output of S2 with a one-dimensional convolution and then through the L upsampling blocks, computing local high-resolution features;
S4: combine the features from S2 and S3 to obtain multi-scale features for prediction;
S5: perform prediction on the multi-scale features, separating the audio data into K source audios.
Further, when combining the output features of the downsampling and upsampling blocks, the downsampling blocks are ordered forward along the processing sequence and the upsampling blocks in reverse, and the output features of each ordered downsampling block are combined with those of the upsampling block at the corresponding position.
Further, among the L levels, each successive level has half the time resolution of the previous level.
Further, each of the downsampling block and the upsampling block includes a convolutional layer and a residual network, and the residual network is located behind the convolutional layer to reduce overfitting.
Further, for the K source audios to be predicted, the model returns predictions in the interval (-1, 1), one for each source audio sample.
Further, the one-dimensional convolution comprises a basic structure with zero padding and an activation function, the activation function being a LeakyReLU or tanh activation function.
Further, the audio data processed by the upsampling blocks passes through one final one-dimensional convolution that uses a tanh activation function, while the other one-dimensional convolutions use LeakyReLU activation functions.
Further, during downsampling, features are discarded by decimating every other time step, halving the time resolution.
Further, during upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation.
The invention has the following beneficial effects: through the steps of data collection, preliminary training, audio feature extraction, audio blank removal, and audio separation, and in view of the large number of blanks and the environmental noise present in audio data acquired by multiple terminals, features are extracted in the frequency, time, and cepstrum domains and weighted through an attention mechanism, so that the resulting final audio feature data describes the characteristics of the sound better than any single feature. Combining the voice endpoint algorithm with a classifier makes the detected speech endpoints more accurate than those of traditional methods that use an endpoint detection algorithm alone, and avoids downstream speech recognition errors caused by removing valid speech. Using a deep learning model based on the coding-and-decoding framework improves the audio separation quality; the environmental sound is finally removed, and audio containing only the human voice is output.
Drawings
To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the following drawings cover only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of audio blank removal;
Fig. 3 is a flow chart of audio separation.
Detailed Description
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention.
In the description of the present invention, it should be noted that the orientations or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like are based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; either directly or indirectly through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
As shown in figs. 1 and 2, an audio separation method based on mixed features and coding and decoding comprises the following steps:
Step 1: data collection; collect audio data through multiple terminal devices, and cache the audio stream data on a local server in the order of the data sequence;
Step 2: preliminary training; randomly extract part of the data for labeling, and label the endpoints of the voiced and unvoiced portions for reference training;
Step 3: audio feature extraction; extract features in the frequency, time, and cepstrum domains of the audio signal, and weight the multiple features through an attention mechanism;
Step 4: audio blank removal; using the final audio feature representation obtained in step 3, remove low-energy regions of the audio through a voice endpoint detection algorithm and a clustering classifier, retain high-energy regions containing distinct sounds, and splice the segmented effective audio segments;
Step 5: audio separation; taking the environmental sound and the human voice as two sound sources and using the blank-removed audio waveform data, analyze the speech waveform of the mixed audio with a coding-and-decoding network structure, and separate the sound sources by predicting individual time-frequency masks.
Through the steps of data collection, preliminary training, audio feature extraction, audio blank removal, and audio separation, and in view of the large number of blanks and the environmental noise present in audio data acquired by multiple terminals, features are extracted in the frequency, time, and cepstrum domains and weighted through an attention mechanism, so that the resulting final audio feature data describes the characteristics of the sound better than any single feature. Combining the voice endpoint algorithm with a classifier makes the detected speech endpoints more accurate than those of traditional methods that use an endpoint detection algorithm alone, and avoids downstream speech recognition errors caused by removing valid speech. Using a deep learning model based on the coding-and-decoding framework improves the audio separation quality; the environmental sound is finally removed, and audio containing only the human voice is output.
The frequency-domain features include frequency-domain energy, sub-band energy ratio, frame-based spectral centroid, and frame-based spectral bandwidth; the time-domain features include short-time energy, short-time zero-crossing rate, short-time average magnitude difference, and short-time average magnitude; and the cepstrum-domain features include Mel-frequency cepstral coefficients and linear prediction cepstral coefficients. These can also be combined with other features, including loudness, loudness range, reflection coefficients, mute-frame ratio, wavelet coefficients, and spectral entropy. The method computes these various sound features and obtains a weighted-average audio feature through an attention mechanism; the weighting coefficients can be trained by back-propagation.
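A simplified numpy sketch of a few of the listed per-frame features (short-time energy, zero-crossing rate, spectral centroid) and of attention-style weighting; the softmax over `scores` stands in for the trainable weighting coefficients mentioned above, and all parameter values are illustrative:

```python
import numpy as np

def frame_features(frame, sr=16000):
    """Small mixed-feature vector for one audio frame."""
    energy = np.mean(frame ** 2)                              # time domain
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2        # time domain
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)  # frequency domain
    return np.array([energy, zcr, centroid])

def attention_weight(features, scores):
    """Weight the features by a softmax over attention scores."""
    w = np.exp(scores) / np.sum(np.exp(scores))
    return w * features

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # 440 Hz tone
feats = frame_features(frame)
weighted = attention_weight(feats, scores=np.zeros(3))    # uniform weights here
```

In the patent's framing the scores would be learned by back-propagation rather than fixed, and the feature set would also cover the cepstrum-domain features listed above.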
In this embodiment, in step 1, the order of the data sequence is the numbering order provided by each device before the data stream is transmitted.
Because the order of the data sequence is the numbering order provided by each device before transmission, the spatial positions and other attributes of the multiple devices can be restored, which improves the subsequent processing and presentation of the audio data.
Preferably, in step 4, when the segmented effective audio segments are spliced, transition signals are added to the two ends of the audio waveforms to smooth the ends of the effective audio segments.
After the effective audio segments are cut out and spliced, the transition between two segments is unnatural and degrades the listening experience; adding transition signals at the two ends of the audio waveforms smooths the ends of the effective audio segments, so that the transitions between audio segments are more natural and the output audio sounds better.
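The described smoothing can be realized, for example, as a short linear crossfade at the joint between two retained segments; the fade length of 64 samples is an illustrative assumption, not a value from the patent:

```python
import numpy as np

def splice_with_crossfade(a, b, fade=64):
    """Concatenate segments a and b, linearly crossfading `fade` samples at the joint."""
    ramp = np.linspace(0.0, 1.0, fade)
    blended = a[-fade:] * (1 - ramp) + b[:fade] * ramp
    return np.concatenate([a[:-fade], blended, b[fade:]])

a = np.ones(256)
b = -np.ones(256)
out = splice_with_crossfade(a, b)
# The result is `fade` samples shorter than a plain concatenation.
```

In practice the fade length would be kept short (a few milliseconds) so that speech onsets at segment boundaries are not swallowed.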
In this embodiment, the environmental sounds are noise that affects recognition of the speaker's speech and the sounds of other, non-dominant speakers.
In scenarios such as meeting recording, the speech of the current main speaker needs to be recorded; the speech of other, non-dominant speakers and noise that interferes with recognizing the speaker's speech are both detrimental, so both are judged to be environmental sound. This ensures the speech recognition of the main speaker and improves the effect of the audio processing.
Preferably, in step 2, the labeled data are divided into a training set, a validation set, and a test set; the training set is used to train the classifier, the validation set to validate the model training process, and the test set to evaluate the effect of the final model.
Part of the data is randomly extracted and the endpoints of the voiced and unvoiced portions are labeled to provide reference training. Dividing the labeled data into training, validation, and test sets in this way improves the recognition performance of the classifier.
In this embodiment, the training set accounts for 60%-90% of the labeled data; most of the labeled data serves as the training set, and the rest is split between the validation and test sets, which may be of equal size. Training the classifier on the majority of the labeled data and then validating and testing it on the smaller remainder makes the classification more accurate.
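The split might be sketched as follows, assuming an 80/10/10 ratio (within the stated 60%-90% training range); the seed and ratios are illustrative choices:

```python
import numpy as np

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle and split labeled items into train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(items))
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    train_set = [items[i] for i in idx[:n_train]]
    val_set = [items[i] for i in idx[n_train:n_train + n_val]]
    test_set = [items[i] for i in idx[n_train + n_val:]]
    return train_set, val_set, test_set

tr, va, te = split_dataset(list(range(100)))
```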
The classifier can be an audio classifier based on a decision tree, on the K-Nearest Neighbor (KNN) algorithm, or on a neural network.
Preferably, in the above embodiment, the trained classifier judges the classes of the n frames before and after each endpoint obtained by the voice endpoint detection algorithm and corrects the endpoint.
When the blank and effective parts of the audio are segmented by the subsequent voice endpoint detection algorithm, the trained classifier judges whether the detected endpoints are accurate by classifying the n frames before and after each endpoint as valid or invalid data, and corrects the endpoint accordingly; this reduces erroneous screening of speech frames and improves recognition.
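One way to sketch the correction step: classify the n frames around a VAD endpoint and move the endpoint to the first frame judged voiced. The `classify` callable is a stand-in for the trained decision-tree/KNN/neural-network classifier, and the toy frames are illustrative:

```python
def correct_endpoint(frames, endpoint, n, classify):
    """Shift a VAD endpoint to the first frame the classifier marks as voiced.

    `classify(frame)` returns True for voiced/valid frames; it stands in for
    the trained classifier described in the text.
    """
    lo = max(0, endpoint - n)
    hi = min(len(frames), endpoint + n)
    for i in range(lo, hi):
        if classify(frames[i]):
            return i  # first voiced frame near the detected endpoint
    return endpoint

# Toy frames: amplitude stands in for frame content; voiced if > 0.
frames = [0.0, 0.0, 0.0, 0.4, 0.5, 0.6]
fixed = correct_endpoint(frames, endpoint=2, n=2, classify=lambda f: f > 0)
# VAD placed the start at frame 2; the window shows voicing begins at frame 3.
```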
As shown in fig. 3, in this embodiment, the structural design of the coding-and-decoding model includes:
S1: set the model network to L levels, comprising L downsampling blocks and L upsampling blocks;
S2: process the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
S3: process the output of S2 with a one-dimensional convolution and then through the L upsampling blocks, computing local high-resolution features;
S4: combine the features from S2 and S3 to obtain multi-scale features for prediction;
S5: perform prediction on the multi-scale features, separating the audio data into K source audios.
The network is set to L levels: the audio data are processed through the downsampling blocks to obtain high-level features and through the upsampling blocks to obtain local high-resolution features, and the two are combined into multi-scale features for prediction. Prediction on the multi-scale features separates the audio data into K source audios, thereby separating the audio. Here only the human voice and the environmental sound need to be separated, so K = 2. Because the outputs of the downsampling and upsampling blocks are combined during audio separation, the accuracy of the separation is increased.
In this embodiment, when combining the output features of the downsampling and upsampling blocks, the downsampling blocks are ordered forward along the processing sequence and the upsampling blocks in reverse, and the output features of each ordered downsampling block are combined with the output features of the upsampling block at the corresponding position.
Ordering the downsampling blocks forward and the upsampling blocks in reverse, and combining the output of each downsampling block with that of the upsampling block at the corresponding position, pairs the current high-level features with the corresponding local high-resolution features and improves recognition accuracy.
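As a rough structural sketch (assuming a Wave-U-Net-style layout, which this description resembles but does not name): downsampling halves the time resolution by decimation, upsampling doubles it by linear interpolation, and the output of each downsampling block is paired with the upsampling block at the mirrored position. The raw signals here stand in for learned convolution features:

```python
import numpy as np

def downsample(x):
    """Halve time resolution by discarding every other time step."""
    return x[::2]

def upsample(x):
    """Double time resolution in the time direction by linear interpolation."""
    t_new = np.linspace(0, len(x) - 1, 2 * len(x))
    return np.interp(t_new, np.arange(len(x)), x)

L = 3
x = np.arange(64, dtype=float)
skips = []                       # outputs of the downsampling blocks, in order
for _ in range(L):               # encoder: L downsampling blocks
    skips.append(x)
    x = downsample(x)
for i in range(L):               # decoder: L upsampling blocks
    x = upsample(x)
    skip = skips[L - 1 - i]      # pair with the mirrored downsampling output
    x = x[:len(skip)] + skip     # stand-in for concatenating feature channels
```

The reversed indexing `skips[L - 1 - i]` is the forward/reverse ordering described above: the deepest downsampling output meets the first upsampling block, and the shallowest meets the last.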
Among the L levels, each successive level has half the time resolution of the previous one: after processing at one level the features are output and the subsequent amount of computation is reduced, so each successive level is set to half the time resolution of the previous level.
Preferably, each of the downsampling block and the upsampling block includes a convolutional layer and a residual network, and the residual network is located behind the convolutional layer to reduce the overfitting.
A residual network is arranged behind one convolution layer to reduce overfitting, thereby improving audio separation quality.
In this embodiment, for the K source audios to be predicted, the model returns predictions in the interval (-1, 1), one for each source audio sample, thereby improving the prediction and separation effect.
In this embodiment, Conv1D(x, y) denotes a one-dimensional convolution with x filters of size y; the basic structure of the one-dimensional convolution comprises zero padding and an activation function. In an extension of this basic structure, the Conv1D does not use zero padding; the input is instead center-cropped first, so that it has the same number of time steps as the current layer.
The activation function is a LeakyReLU or a tanh activation function. The LeakyReLU activation function in theory retains all the advantages of ReLU while avoiding the dead-ReLU problem, because its negative slope is a small fixed parameter rather than zero. The tanh activation function has a value range of -1 to 1, mapping negative inputs to negative values and inputs near 0 to values near 0.
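The two activations can be compared directly; the negative slope of 0.01 for LeakyReLU is an illustrative choice, not a value from the patent:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, small fixed slope for negative ones."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 2.0])
lr = leaky_relu(x)   # negative inputs keep a small gradient (no dead units)
th = np.tanh(x)      # all outputs bounded in (-1, 1)
```

The bounded tanh range matches the model's output requirement that predictions lie in the interval (-1, 1), which is why tanh is reserved for the final convolution.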
As a preferred case of the above embodiment, the audio data processed by the upsampling blocks passes through one final one-dimensional convolution that uses the tanh activation function, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Because the model returns predictions in the interval (-1, 1), the one-dimensional convolution applied to the audio data after the upsampling blocks uses the tanh activation function, and the activation functions of the other one-dimensional convolutions are set to LeakyReLU.
In an embodiment, during downsampling, features are discarded by decimating every other time step, halving the time resolution. During upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation, which is simpler and more convenient than other interpolation methods.
It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are presented in the specification and drawings only to illustrate the principles of the invention. Various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (17)
1. An audio separation method based on mixed features and coding and decoding is characterized by comprising the following steps:
the method comprises the following steps: collecting data; collecting audio data through a plurality of terminal devices, and caching the audio stream data in a local server according to the sequence of the data sequence;
step two: training in an early stage; randomly extracting partial data, and labeling the end points of the voiced part and the unvoiced part for reference training;
step three: extracting audio features; extracting characteristics in a frequency domain, a time domain and a cepstrum domain of the audio signal, and weighting a plurality of characteristics through an attention mechanism;
step four: removing audio blanks; based on the final audio features obtained in step three, low-energy regions of the audio are removed by a voice endpoint detection algorithm and a clustering classifier, high-energy regions containing distinct sounds are retained, and the segmented valid audio segments are spliced together;
step five: audio separation; the environmental sound and the human voice are taken as two sound sources, the audio waveform data after blank removal is used, the voice waveform of the mixed audio is analyzed by utilizing a coding and decoding network structure, and the sound sources are separated by predicting a single time-frequency mask.
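The blank-removal of step four can be sketched as a simple frame-energy gate; the frame length and threshold below are illustrative stand-ins for the endpoint detector and clustering classifier the claim actually describes:

```python
import numpy as np

def remove_blanks(audio, frame_len=256, energy_thresh=1e-3):
    """Keep only high-energy frames of a waveform and concatenate them,
    dropping low-energy (blank) regions."""
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)     # per-frame mean energy
    voiced = frames[energy > energy_thresh]   # retain audible frames only
    return voiced.reshape(-1)
```

Applied to a signal whose first half is silence, only the audible second half survives.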
2. The audio separation method based on mixed features and coding/decoding of claim 1, wherein in the first step, the sequence order of the data is the numbering order assigned by the device before the data stream is transmitted.
3. The audio separation method based on mixing feature and coding/decoding of claim 1, wherein in the fourth step, when splicing the divided valid audio segments, the two ends of the valid audio segments are smoothed by adding transition signals to the two ends of the audio waveform.
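The transition signals of claim 3 can be realized as a short linear crossfade at each splice point; the fade length here is an assumption, not specified by the claim:

```python
import numpy as np

def splice_with_crossfade(seg_a, seg_b, fade=64):
    """Join two audio segments, linearly fading seg_a out and seg_b in
    over `fade` samples to smooth the boundary between them."""
    ramp = np.linspace(0.0, 1.0, fade)
    overlap = seg_a[-fade:] * (1.0 - ramp) + seg_b[:fade] * ramp
    return np.concatenate([seg_a[:-fade], overlap, seg_b[fade:]])
```

Because the fade-out and fade-in gains sum to one, splicing two constant segments leaves no discontinuity at the joint.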
4. The method of claim 1, wherein the environmental sounds are noise that affects speech recognition of the target speaker, together with the voices of other, non-dominant speakers.
5. The audio separation method based on mixed features and coding/decoding of claim 1, wherein in the second step, the labeled data is divided into a training set, a validation set and a test set, the training set is used for training the classifier, the validation set is used for validating the training process of the model, and the test set is used for detecting the effect of the final model.
6. The method of claim 5, wherein the training set accounts for 60% to 90% of the labeled data.
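A minimal sketch of the split described in claims 5 and 6, assuming an 80/10/10 partition (which keeps the training share inside the claimed 60%-90% range); the function name and seed are illustrative:

```python
import numpy as np

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle labeled samples and split them into training,
    validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test
```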
7. The method of claim 5, wherein the classifier is a decision tree-based audio classifier, a KNN-based audio classifier, or a neural network-based audio classifier.
8. The audio separation method based on mixed features and coding/decoding of claim 5, wherein the trained classifier determines the classes of n frames before and after the endpoint obtained by the speech endpoint detection algorithm and corrects the endpoint.
9. The audio separation method based on mixed features and coding-decoding as claimed in claim 1, wherein in the step five, the design of coding-decoding model structure comprises:
s1: setting the model network to L levels, wherein the L levels comprise L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
s3: the data processed in S2 is passed through a one-dimensional convolution and then through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features in S2 and S3 to obtain multi-scale features for prediction;
s5: and performing prediction through multi-scale features, and separating the audio data into K source audios.
10. The audio separation method based on mixed features and coding/decoding of claim 9, wherein when the downsampling-block output features are combined with the upsampling-block output features, the downsampling blocks are arranged forward along the processing order, the upsampling blocks are arranged in reverse processing order, and each downsampling-block output feature after arrangement is combined with the upsampling-block output feature at the corresponding position.
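The pairing of claim 10 amounts to matching downsampling-block outputs with upsampling-block outputs in mirrored (U-Net-style) order; a sketch with placeholder feature maps:

```python
def pair_skip_connections(down_outputs, up_outputs):
    """Pair downsampling-block outputs (in processing order) with
    upsampling-block outputs traversed in reverse processing order,
    as in a U-Net-style skip connection."""
    return list(zip(down_outputs, reversed(up_outputs)))
```

For three levels, the first downsampling block's features are combined with the last upsampling block's features, and so on inward.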
11. The method of claim 9, wherein the time resolution of each successive level of the L levels is half that of the previous level.
12. The method of claim 9, wherein each of the downsampled block and the upsampled block comprises a convolutional layer and a residual network, and the residual network is located after the convolutional layer to reduce overfitting.
13. The method of claim 9, wherein for K source audios to be predicted, the model returns predictions in the interval (-1, 1), one for each source audio sample.
14. The method of claim 9, wherein the one-dimensional convolution includes a zero-padded basic structure and an activation function, and the activation function is a LeakyReLU activation function or a tanh activation function.
15. The method of claim 14, wherein the upsampled block processed audio data is subjected to a one-dimensional convolution, the one-dimensional convolution is a tanh activation function, and the remaining one-dimensional convolutions are a LeakyReLU activation function.
16. The method of claim 9, wherein during downsampling, features are decimated by discarding every other time step and the temporal resolution is halved.
17. The audio separation method based on mixture features and codec of claim 9, wherein the upsampling is performed by a multiple of two in the time direction using linear interpolation at the time of the upsampling process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110753553.4A CN113488063B (en) | 2021-07-02 | 2021-07-02 | Audio separation method based on mixed features and encoding and decoding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113488063A true CN113488063A (en) | 2021-10-08 |
CN113488063B CN113488063B (en) | 2023-12-19 |
Family
ID=77940711
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110753553.4A Active CN113488063B (en) | 2021-07-02 | 2021-07-02 | Audio separation method based on mixed features and encoding and decoding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113488063B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107182279B (en) * | 2010-10-19 | 2014-03-19 | 西南电子电信技术研究所 | Speech detection method under complicated strong noise environment |
CN107610715A (en) * | 2017-10-10 | 2018-01-19 | 昆明理工大学 | A kind of similarity calculating method based on muli-sounds feature |
CN108597525A (en) * | 2018-04-25 | 2018-09-28 | 四川远鉴科技有限公司 | Voice vocal print modeling method and device |
CN109935243A (en) * | 2019-02-25 | 2019-06-25 | 重庆大学 | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model |
CN110400575A (en) * | 2019-07-24 | 2019-11-01 | 腾讯科技(深圳)有限公司 | Interchannel feature extracting method, audio separation method and device calculate equipment |
CN110503976A (en) * | 2019-08-15 | 2019-11-26 | 广州华多网络科技有限公司 | Audio separation method, device, electronic equipment and storage medium |
CN110992978A (en) * | 2019-12-18 | 2020-04-10 | 苏州思必驰信息科技有限公司 | Training method and system for audio and video separation model |
CN111243619A (en) * | 2020-01-06 | 2020-06-05 | 平安科技(深圳)有限公司 | Training method and device for voice signal segmentation model and computer equipment |
CN111540375A (en) * | 2020-04-29 | 2020-08-14 | 全球能源互联网研究院有限公司 | Training method of audio separation model, and audio signal separation method and device |
CN111724807A (en) * | 2020-08-05 | 2020-09-29 | 字节跳动有限公司 | Audio separation method and device, electronic equipment and computer readable storage medium |
CN112071329A (en) * | 2020-09-16 | 2020-12-11 | 腾讯科技(深圳)有限公司 | Multi-person voice separation method and device, electronic equipment and storage medium |
CN112613657A (en) * | 2020-12-22 | 2021-04-06 | 国网江苏省电力有限公司苏州供电分公司 | Short-term wind speed prediction method for wind power plant |
CN112820300A (en) * | 2021-02-25 | 2021-05-18 | 北京小米松果电子有限公司 | Audio processing method and device, terminal and storage medium |
CN112992172A (en) * | 2021-01-28 | 2021-06-18 | 广州大学 | Single-channel time domain bird song separating method based on attention mechanism |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035912A (en) * | 2022-06-08 | 2022-09-09 | 哈尔滨工程大学 | Automatic underwater acoustic signal sample labeling method based on MOC model |
CN115035912B (en) * | 2022-06-08 | 2024-04-26 | 哈尔滨工程大学 | Automatic underwater sound signal sample labeling method based on MOC model |
WO2023241254A1 (en) * | 2022-06-15 | 2023-12-21 | 腾讯科技(深圳)有限公司 | Audio encoding and decoding method and apparatus, electronic device, computer readable storage medium, and computer program product |
CN117475360A (en) * | 2023-12-27 | 2024-01-30 | 南京纳实医学科技有限公司 | Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN |
CN117475360B (en) * | 2023-12-27 | 2024-03-26 | 南京纳实医学科技有限公司 | Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN |
CN117854514A (en) * | 2024-03-06 | 2024-04-09 | 深圳市增长点科技有限公司 | Wireless earphone communication decoding optimization method and system for sound quality fidelity |
CN117854514B (en) * | 2024-03-06 | 2024-05-31 | 深圳市增长点科技有限公司 | Wireless earphone communication decoding optimization method and system for sound quality fidelity |
Also Published As
Publication number | Publication date |
---|---|
CN113488063B (en) | 2023-12-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113488063B (en) | Audio separation method based on mixed features and encoding and decoding | |
US10878823B2 (en) | Voiceprint recognition method, device, terminal apparatus and storage medium | |
Hu et al. | Pitch‐based gender identification with two‐stage classification | |
CN104318927A (en) | Anti-noise low-bitrate speech coding method and decoding method | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
JP2006079079A (en) | Distributed speech recognition system and its method | |
CN101366078A (en) | Neural network classifier for separating audio sources from a monophonic audio signal | |
Jiang et al. | An Improved Speech Segmentation and Clustering Algorithm Based on SOM and K‐Means | |
CN111951824A (en) | Detection method for distinguishing depression based on sound | |
CN109036470B (en) | Voice distinguishing method, device, computer equipment and storage medium | |
CN111986699B (en) | Sound event detection method based on full convolution network | |
CN112270933B (en) | Audio identification method and device | |
CN109300470B (en) | Mixing separation method and mixing separation device | |
WO2019232867A1 (en) | Voice discrimination method and apparatus, and computer device, and storage medium | |
CN111540364A (en) | Audio recognition method and device, electronic equipment and computer readable medium | |
CN115602165A (en) | Digital staff intelligent system based on financial system | |
CN113823293A (en) | Speaker recognition method and system based on voice enhancement | |
CN115881156A (en) | Multi-scale-based multi-modal time domain voice separation method | |
CN113782051B (en) | Broadcast effect classification method and system, electronic equipment and storage medium | |
Park et al. | Analysis of confidence and control through voice of Kim Jung-un | |
Hong | Speaker gender recognition system | |
CN114302301A (en) | Frequency response correction method and related product | |
Feng et al. | Noise Classification Speech Enhancement Generative Adversarial Network | |
Camarena-Ibarrola et al. | Speaker identification using entropygrams and convolutional neural networks | |
Smietanka et al. | Augmented transformer for speech detection in adverse acoustical conditions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||