CN113488063B - Audio separation method based on mixed features and encoding and decoding - Google Patents

Audio separation method based on mixed features and encoding and decoding

Info

Publication number
CN113488063B
CN113488063B (application CN202110753553.4A)
Authority
CN
China
Prior art keywords
audio
data
features
method based
separation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110753553.4A
Other languages
Chinese (zh)
Other versions
CN113488063A (en)
Inventor
吴俊�
颜庆国
董勤伟
查显光
崔林
赵新冬
戴威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN202110753553.4A priority Critical patent/CN113488063B/en
Publication of CN113488063A publication Critical patent/CN113488063A/en
Application granted granted Critical
Publication of CN113488063B publication Critical patent/CN113488063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and encoding and decoding, which comprises the following steps: data collection: audio data are collected through a plurality of terminal devices; early training: part of the data is randomly extracted and the endpoints of the voiced and silent parts are labeled; audio feature extraction: features are extracted in the frequency, time and cepstrum domains of the audio signal, and the multiple features are weighted through an attention mechanism; audio blank removal: low-energy regions of the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented fragments are spliced; audio separation: the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio, the sound sources are separated by predicting the corresponding individual time-frequency masks, and audio containing only human voice is output.

Description

Audio separation method based on mixed features and encoding and decoding
Technical Field
The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and encoding and decoding.
Background
With the popularization of multi-terminal devices and the progress of high-speed network transmission technology, acquiring massive data is no longer a difficult problem for enterprises. However, the large amount of data obtained often contains much low-quality data, which seriously affects the operation of the enterprise's downstream business. Voice data is a major class of internet data. In scenarios such as online teaching and audio-video conferencing, voice data is an important component of activities such as classroom behavior monitoring, record keeping and human-computer interaction. The voice data stream processed by the back-end service server contains a large amount of voice blank and unrecognizable environmental noise, which increases the processing load of the service server.
One way to alleviate this problem is to expand the number of distributed servers and use additional computing power to process the invalid data, but simply scaling up computing power does not fundamentally solve the problem. Since the concept of the data center was put forward, more and more enterprises have been willing to adopt a centralized mode: the data are first preprocessed with algorithms, and the effective data are then distributed, through data extraction, to each service server for downstream tasks.
The traditional algorithm for removing speech blanks is the voice endpoint detection algorithm (Voice Activity Detection, VAD). Its main task is to accurately locate the start and end points of speech within noisy speech and to separate silence from actual speech; it is a key technique in speech signal processing. As for traditional voice separation technology, methods based on non-negative matrix factorization (Non-negative Matrix Factorization, NMF) and methods based on F0 estimation are the traditional supervised approaches to the voice separation task, and their separation effect is poor.
In view of the above problems, the present inventors, drawing on years of practical experience and professional knowledge in the engineering application of such products, and combining this with the application of theory, have actively studied and innovated in order to create an audio separation method based on mixed features and encoding and decoding that is more practical.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention provides an audio separation method based on mixed features and encoding and decoding, thereby effectively solving the problems described in the background art.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: an audio separation method based on mixed features and encoding and decoding, comprising the following steps:
step one: data collection; audio data are collected through a plurality of terminal devices, and the audio stream data are buffered in a local server according to the order of the data sequences;
step two: early training; part of the data is randomly extracted and labeled, marking the endpoints of the voiced part and the silent part for reference training;
step three: audio feature extraction; features are extracted in the frequency domain, time domain and cepstrum domain of the audio signal, and the multiple features are weighted through an attention mechanism;
step four: audio blank removal; low-energy regions in the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented effective audio fragments are spliced;
step five: audio separation; the ambient sound and the human voice are regarded as two sound sources, the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio with an encoding and decoding network structure, and the sound sources are separated by predicting the corresponding individual time-frequency masks.
Further, in the first step, the order of the data sequences is the numbering order provided by each device before its data stream is transmitted.
Further, in the fourth step, when the segmented effective audio fragments are spliced, transition signals are added to the two ends of each audio waveform so as to smooth the two ends of the effective audio fragments.
Further, the ambient sound is noise that affects recognition of the speaker's speech, together with the voices of other, non-dominant speakers.
In the second step, the labeled data are divided into a training set, a verification set and a test set, wherein the training set is used for training the classifier, the verification set is used for verifying the model training process, and the test set is used for detecting the effect of the final model.
Further, the training set accounts for 60% -90%.
Further, the classifier is a decision tree-based audio classifier, a KNN-based audio classifier, or a neural network-based audio classifier.
Further, the trained classifier judges the classes of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
Further, in the fifth step, the design of the encoding and decoding model structure includes:
s1: setting the network to L levels, the network comprising L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on increasingly coarse time scales;
s3: applying a one-dimensional convolution to the data from S2 and then processing it through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features from S2 and S3 to obtain multi-scale features for prediction;
s5: separating the audio data into K source audios by prediction from the multi-scale features.
Further, when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered backward along the processing sequence, and the output feature of each downsampling block is combined with the output feature of the upsampling block at the corresponding position in the ordering.
Further, among the L levels, each successive level operates at half the time resolution of the previous level.
Further, each of the downsampled block and the upsampled block includes a convolutional layer and a residual network, the residual network being located after the convolutional layer to mitigate overfitting.
Further, the model returns predictions within the interval (-1, 1), one for each source audio sample, for the K source audio to be predicted.
Further, the one-dimensional convolution comprises a zero-padded basic architecture and an activation function, the activation function being a LeakyReLU activation function or a tanh activation function.
Further, the audio data output by the upsampling blocks is subjected to one final one-dimensional convolution whose activation function is tanh, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Further, in the downsampling process, the features are decimated by keeping every other time step and discarding the rest, so that the time resolution is halved.
Further, in the upsampling process, the features are upsampled by a factor of two in the time direction using linear interpolation.
The beneficial effects of the invention are as follows: through the steps of data collection, early training, audio feature extraction, audio blank removal and audio separation, the invention addresses the large amount of blank and environmental noise in audio data acquired from multiple terminals. Features are extracted in the frequency, time and cepstrum domains and weighted through an attention mechanism, so that the resulting audio feature data describe the characteristics of the sound better than any single feature. Combining the voice endpoint detection algorithm with the integrated classifier makes the detected voice endpoints more accurate than the traditional approach that uses the endpoint detection algorithm alone, avoiding downstream speech recognition errors caused by removing valid sound. A deep learning model based on the encoding and decoding framework improves the quality of the audio separation; finally, the environmental sound is removed and audio containing only human voice is output.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of audio blank removal;
Fig. 3 is a flow chart of audio separation.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings and are used merely for convenience of describing the present invention and simplifying the description; they do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; the connection may be direct or indirect via an intermediate medium, or may be internal communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
As shown in fig. 1 to 2: an audio separation method based on mixed characteristics and coding and decoding comprises the following steps:
step one: data collection; audio data are collected through a plurality of terminal devices, and the audio stream data are buffered in a local server according to the order of the data sequences;
step two: early training; part of the data is randomly extracted and labeled, marking the endpoints of the voiced part and the silent part for reference training;
step three: audio feature extraction; features are extracted in the frequency domain, time domain and cepstrum domain of the audio signal, and the multiple features are weighted through an attention mechanism;
step four: audio blank removal; low-energy regions in the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented effective audio fragments are spliced;
step five: audio separation; the ambient sound and the human voice are regarded as two sound sources, the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio with an encoding and decoding network structure, and the sound sources are separated by predicting the corresponding individual time-frequency masks.
Through the steps of data collection, early training, audio feature extraction, audio blank removal and audio separation, the method addresses the large amount of blank and environmental noise in audio data acquired from multiple terminals. Features are extracted in the frequency, time and cepstrum domains and weighted through an attention mechanism, so that the resulting audio feature data describe the characteristics of the sound better than any single feature. Combining the voice endpoint detection algorithm with the integrated classifier makes the detected voice endpoints more accurate than the traditional approach that uses the endpoint detection algorithm alone, avoiding downstream speech recognition errors caused by removing valid sound. A deep learning model based on the encoding and decoding framework improves the quality of the audio separation; finally, the environmental sound is removed and audio containing only human voice is output.
The frequency domain features comprise frequency domain energy, subband energy ratio, frame-based spectral centroid, frame-based spectral bandwidth and the like; the time domain features comprise short-time energy, short-time zero-crossing rate, short-time average amplitude difference, short-time average amplitude and the like; the cepstrum domain features comprise mel cepstrum coefficients and linear prediction cepstrum coefficients. In addition, other features such as loudness, loudness range, reflection coefficients, the proportion of silent frames, wavelet coefficients and spectral entropy can also be combined. The various sound features are computed as above, and a weighted average audio feature is obtained through the attention mechanism; the weighting coefficients can be trained by back propagation.
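Purely by way of illustration of this step, the sketch below computes a few of the named features with librosa and fuses the feature groups through a small attention module whose weighting coefficients can be learned by back propagation. The feature selection, dimensions, file name and the AttentionFusion layer are assumptions made for the example, not the patent's reference implementation.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def frame_features(y, sr, n_fft=1024, hop=512):
    """Return per-frame feature groups (time, frequency, cepstrum domains), each shaped (T, d_i)."""
    energy    = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)                 # time domain
    zcr       = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)    # time domain
    centroid  = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)   # frequency domain
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=n_fft, hop_length=hop)  # frequency domain
    mfcc      = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)     # cepstrum domain
    groups = [np.vstack([energy, zcr]), np.vstack([centroid, bandwidth]), mfcc]
    return [torch.tensor(g.T, dtype=torch.float32) for g in groups]

class AttentionFusion(nn.Module):
    """Project each feature group to a common size and fuse the groups with softmax
    attention weights; the weighting coefficients are learned by back propagation."""
    def __init__(self, group_dims, d_model=32):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in group_dims])
        self.score = nn.Linear(d_model, 1)

    def forward(self, groups):                                                     # list of (T, d_i) tensors
        stacked = torch.stack([p(g) for p, g in zip(self.proj, groups)], dim=1)    # (T, G, d_model)
        weights = torch.softmax(self.score(stacked), dim=1)                        # (T, G, 1)
        return (weights * stacked).sum(dim=1)                                      # (T, d_model)

y, sr = librosa.load("mixture.wav", sr=16000)    # hypothetical input file
groups = frame_features(y, sr)
fusion = AttentionFusion([g.shape[1] for g in groups])
mixed_feature = fusion(groups)                   # attention-weighted multi-domain feature per frame
```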
In this embodiment, in the first step, the sequence of data is the number sequence provided by the device before the data stream is transmitted.
Because the order of the data sequences is the numbering order provided by each device before its data stream is transmitted, the spatial positions of the multiple devices and similar information can be restored, which improves the processing and presentation of the subsequent audio data.
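As one possible illustration of this buffering scheme, the sketch below reassembles audio chunks on the local server in the sequence order assigned by each device before transmission. The packet fields and the AudioStreamBuffer class are hypothetical names introduced only for the example.

```python
import heapq
from collections import defaultdict

class AudioStreamBuffer:
    """Buffer incoming audio chunks per terminal device, ordered by the
    device-provided sequence number assigned before transmission."""
    def __init__(self):
        self._buffers = defaultdict(list)   # one priority queue per device

    def push(self, device_id: str, seq_no: int, chunk: bytes) -> None:
        heapq.heappush(self._buffers[device_id], (seq_no, chunk))

    def drain(self, device_id: str):
        """Yield the buffered chunks of one device in sequence order."""
        buf = self._buffers.pop(device_id, [])
        while buf:
            _, chunk = heapq.heappop(buf)
            yield chunk

buffer = AudioStreamBuffer()
buffer.push("terminal-01", 2, b"\x01\x02")   # chunks may arrive out of order
buffer.push("terminal-01", 1, b"\x03\x04")
ordered = b"".join(buffer.drain("terminal-01"))  # reassembled in sequence order
```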
In the fourth step, when the segmented effective audio fragments are spliced, transition signals are added to the two ends of each audio waveform so as to smooth the two ends of the effective audio fragments.
If the segmented effective audio fragments were simply concatenated, the transition between two fragments would be unnatural and would degrade the subsequent listening experience. By adding transition signals at the two ends of each audio waveform, the ends of the effective audio fragments are smoothed, the transition between fragments becomes more natural, and the output audio sounds better.
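A minimal NumPy sketch of this splicing idea is shown below: each retained high-energy fragment gets a short linear fade-in and fade-out (the "transition signals") before concatenation. The 10 ms fade length is an illustrative choice, not a value fixed by the patent.

```python
import numpy as np

def smooth_and_splice(segments, sr, fade_ms=10.0):
    """segments: list of 1-D arrays of retained audio; returns one spliced waveform."""
    n_fade = max(1, int(sr * fade_ms / 1000.0))
    out = []
    for seg in segments:
        seg = seg.astype(np.float64).copy()
        n = min(n_fade, len(seg) // 2)
        if n > 0:
            ramp = np.linspace(0.0, 1.0, n)
            seg[:n] *= ramp            # transition signal: fade-in at the start
            seg[-n:] *= ramp[::-1]     # transition signal: fade-out at the end
        out.append(seg)
    return np.concatenate(out) if out else np.zeros(0)

sr = 16000
spliced = smooth_and_splice([np.random.randn(sr), np.random.randn(sr // 2)], sr)
```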
In this embodiment, the ambient sound is noise affecting speaker speech recognition and other non-dominant speaker sounds.
In scenarios such as meeting recording, only the voice of the current main speaker needs to be recorded; the voices of other, non-dominant speakers interfere with the main speaker's voice and are therefore treated as environmental sound. This ensures that the main speaker's speech is recognized and improves the effect of the audio processing.
In the second step, the labeled data is divided into a training set, a verification set and a test set, wherein the training set is used for training the classifier, the verification set is used for verifying the model training process, and the test set is used for detecting the effect of the final model.
Part of the data is randomly extracted and the endpoints of the voiced part and the unvoiced part are labeled to provide reference training. The labeled data is divided into a training set, a verification set and a test set: the training set is used to train the classifier, the verification set is used to verify the model training process, and the test set is used to evaluate the final model, which improves the recognition performance of the classifier.
In this embodiment, the training set accounts for 60%-90% of the labeled data; most of the labeled data is used as the training set, and the remainder is used as the verification set and the test set, whose proportions may be equal. Training the classifier on the training set that contains the majority of the labeled data, and then verifying and testing it on the smaller verification and test sets, makes the classification more accurate.
The classifier can be an audio classifier based on a decision tree, an audio classifier based on a K-Nearest Neighbor (KNN) or an audio classifier based on a neural network.
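As a minimal sketch of this training step, the example below fits a KNN audio classifier (one of the options named above) on frame-level features with a train/verification/test split. The 80/10/10 split (within the 60%-90% guideline), the scikit-learn usage and the placeholder data are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X: per-frame mixed features (e.g. from the attention-fusion step); y: 1 = voiced, 0 = silent
X = np.random.randn(1000, 32)               # placeholder features
y = (np.random.rand(1000) > 0.5).astype(int)  # placeholder labels

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.8, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # train on the majority split
print("verification accuracy:", clf.score(X_val, y_val))          # monitor the training process
print("test accuracy:", clf.score(X_test, y_test))                # evaluate the final model
```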
As a preferable mode of the above embodiment, the trained classifier judges the classes of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
When the voice endpoint detection algorithm later divides the audio into blank parts and effective parts, the trained classifier is used to judge whether the detected endpoints are accurate: it classifies the n frames before and after each endpoint obtained by the voice endpoint detection algorithm as valid or invalid data and corrects the endpoint accordingly, which reduces erroneous screening of speech frames and improves the recognition effect.
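One illustrative reading of this correction step is sketched below: for each endpoint returned by the VAD algorithm, the trained classifier labels the n frames on either side, and the endpoint is shifted to the boundary between invalid and valid frames. The boundary-search rule and function names are assumptions, not the patent's exact procedure.

```python
import numpy as np

def correct_endpoint(frame_features, endpoint, classifier, n=5):
    """frame_features: (T, d) array; endpoint: frame index reported by the VAD algorithm.
    Returns a corrected frame index based on the classifier's frame labels."""
    lo = max(0, endpoint - n)
    hi = min(len(frame_features), endpoint + n + 1)
    labels = classifier.predict(frame_features[lo:hi])   # 1 = valid sound, 0 = invalid
    valid = np.flatnonzero(labels == 1)
    # For a speech-start endpoint, snap to the first valid frame in the window;
    # an end endpoint would symmetrically use the last valid frame instead.
    return lo + int(valid[0]) if valid.size else endpoint

# Example usage with the classifier fitted earlier (clf) and hypothetical VAD endpoints:
# corrected = [correct_endpoint(frame_matrix, e, clf, n=5) for e in vad_endpoints]
```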
As shown in fig. 3, in the present embodiment, the codec model structure design includes:
s1: setting the model network to L levels, the model network comprising L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on increasingly coarse time scales;
s3: applying a one-dimensional convolution to the data from S2 and then processing it through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features from S2 and S3 to obtain multi-scale features for prediction;
s5: separating the audio data into K source audios by prediction from the multi-scale features.
The network is set to L levels; the audio data are processed by the downsampling blocks to obtain high-level features and by the upsampling blocks to obtain local high-resolution features, the two are combined into multi-scale features for prediction, and the audio data are separated into K source audios based on these multi-scale features, thereby separating the audio. Since only human voice and environmental sound need to be separated here, K=2. Because the outputs of the downsampling blocks and the upsampling blocks are combined during audio separation, the accuracy of the separation is increased.
In this embodiment, when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered backward along the processing sequence, and the output feature of each downsampling block is combined with the output feature of the upsampling block at the corresponding position in the ordering.
Ordering the blocks in this way means that each current high-level feature is combined with the corresponding local high-resolution feature, which improves the recognition accuracy.
Among the L levels, each successive level operates at half the time resolution of the previous level: each level outputs a feature map and reduces the amount of subsequent processing, so the time resolution of every successive level is set to half that of the previous level.
Preferably, each downsampling block and upsampling block comprises a convolution layer and a residual network, the residual network being located after the convolution layer to mitigate over-fitting.
A residual network is arranged after each convolution layer to mitigate overfitting, thereby improving the quality of the audio separation.
In this embodiment, for the K source audios to be predicted, the model returns predictions within the interval (-1, 1), one for each sample of each source audio, which improves the separation effect.
In this embodiment, Conv1D(x, y) denotes a one-dimensional convolution with x filters of size y. The basic one-dimensional convolution framework consists of zero padding and an activation function; in an extension of this basic framework, the Conv1D does not use zero padding, and its input is first center-cropped so that it has the same number of time steps as the current layer.
The activation function is a LeakyReLU activation function or a tanh activation function. LeakyReLU theoretically retains all the advantages of ReLU without suffering from the dead-ReLU problem, and the slope of its negative part is a fixed parameter; the range of the tanh activation function is -1 to 1, its merit being that negative inputs are mapped to negative values and inputs of 0 are mapped to values near 0.
As a preferred form of the foregoing embodiment, the audio data output by the upsampling blocks is subjected to one final one-dimensional convolution whose activation function is tanh, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Because the model returns predictions within the interval (-1, 1), the activation function of this final one-dimensional convolution applied after the upsampling blocks is set to tanh, and the activation functions of the remaining one-dimensional convolutions are set to LeakyReLU.
In an embodiment, during downsampling the features are decimated by keeping every other time step and discarding the rest, so the time resolution is halved. During upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation, which, compared with other interpolation methods, is simple and convenient.
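The embodiment described above reads like a Wave-U-Net-style time-domain separator. A compact PyTorch sketch of that reading is given below, with illustrative layer counts, channel widths and kernel sizes; it is not the patent's reference implementation. It keeps the stated ingredients: L downsampling blocks (a convolution layer followed by a residual network, then decimation by keeping every other time step), a bottleneck one-dimensional convolution, L upsampling blocks that upsample by a factor of two with linear interpolation and are combined with the matching downsampling features, LeakyReLU activations inside the network, and a final tanh so the K source predictions fall within (-1, 1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRes(nn.Module):
    """One block: a 1-D convolution followed by a small residual network (to mitigate overfitting)."""
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)
        self.res = nn.Conv1d(out_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        x = F.leaky_relu(self.conv(x))
        return x + F.leaky_relu(self.res(x))     # residual connection placed after the conv layer

class CodecSeparator(nn.Module):
    """Encoder-decoder separator with L levels and K output sources, in the spirit of step five."""
    def __init__(self, levels=4, base_ch=24, k_sources=2, kernel=15):
        super().__init__()
        chs = [base_ch * (i + 1) for i in range(levels)]                       # e.g. [24, 48, 72, 96]
        self.down = nn.ModuleList(
            [ConvRes(1 if i == 0 else chs[i - 1], chs[i], kernel) for i in range(levels)])
        self.bottleneck = nn.Conv1d(chs[-1], chs[-1], kernel, padding=kernel // 2)
        ext = chs + [chs[-1]]                                                  # channels entering each decoder level
        self.up = nn.ModuleList(
            [ConvRes(ext[i + 1] + chs[i], chs[i], kernel) for i in reversed(range(levels))])
        self.out = nn.Conv1d(chs[0], k_sources, 1)

    def forward(self, x):                                        # x: (batch, 1, time)
        skips = []
        for block in self.down:
            x = block(x)
            skips.append(x)                                      # increasingly high-level features
            x = x[:, :, ::2]                                     # decimate: keep every other time step
        x = F.leaky_relu(self.bottleneck(x))
        for block, skip in zip(self.up, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-1], mode="linear", align_corners=False)
            x = block(torch.cat([x, skip], dim=1))               # combine with the matching encoder feature
        return torch.tanh(self.out(x))                           # K sources, sample predictions in (-1, 1)

model = CodecSeparator()                          # K = 2: human voice and environmental sound
mix = torch.randn(1, 1, 16384)                    # one batch of blank-removed waveform (length divisible by 2**levels)
voice, ambience = model(mix).unbind(dim=1)        # predicted source waveforms
```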
It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (12)

1. An audio separation method based on mixed characteristics and coding and decoding is characterized by comprising the following steps:
step one: data collection; audio data are collected through a plurality of terminal devices, and the audio stream data are buffered in a local server according to the order of the data sequences;
step two: early training; part of the data is randomly extracted, and the endpoints of the voiced part and the unvoiced part are marked to provide reference training;
step three: audio feature extraction; features are extracted in the frequency domain, time domain and cepstrum domain of the audio signal, and the multiple features are weighted through an attention mechanism;
step four: audio blank removal; low-energy regions in the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented effective audio fragments are spliced;
step five: audio separation; the ambient sound and the human voice are regarded as two sound sources, the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio with the encoding and decoding network structure, and the sound sources are separated by predicting the corresponding individual time-frequency masks;
in the fifth step, the structural design of the encoding and decoding model includes:
s1: setting the model network to L levels, wherein the model network comprises L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on increasingly coarse time scales;
s3: applying a one-dimensional convolution to the data from S2 and then processing it through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features in S2 and S3 to obtain a multi-scale feature for prediction;
s5: predicting through the multi-scale features, and separating the audio data into K source audios;
when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered backward along the processing sequence, and the output feature of each downsampling block is combined with the output feature of the upsampling block at the corresponding position in the ordering;
each successive level of the L levels operates at half the time resolution of the previous level;
each downsampling block and each upsampling block comprises a convolution layer and a residual network, the residual network being located after the convolution layer so as to mitigate overfitting;
the model returns predictions within the interval (-1, 1), one for each source audio sample, for the K source audio to be predicted.
2. The audio separation method based on mixed features and codec as claimed in claim 1, wherein in the first step, the sequence of data is the number sequence provided by the device before transmission.
3. The audio separation method based on mixed features and coding and decoding according to claim 1, wherein in the fourth step, when the sliced effective audio segments are spliced, the two ends of the effective audio segments are smoothed by adding transition signals to the two ends of the audio waveform.
4. The method of claim 1, wherein the ambient sound is noise affecting speaker speech recognition and other non-dominant speaker sounds.
5. The audio separation method based on mixed features and coding and decoding according to claim 1, wherein in the second step, the labeled data is divided into a training set, a verification set and a test set, the training set is used for training the classifier, the verification set is used for verifying the model training process, and the test set is used for detecting the effect of the final model.
6. The audio separation method based on mixed features and encoding and decoding according to claim 5, wherein the training set accounts for 60% -90%.
7. The audio separation method based on mixed features and coding and decoding according to claim 5, wherein the classifier is a decision tree-based audio classifier, a KNN-based audio classifier or a neural network-based audio classifier.
8. The audio separation method based on mixed features and codec according to claim 5, wherein the trained classifier judges the class of n frames before and after the endpoint obtained by the speech endpoint detection algorithm and corrects the endpoint.
9. The audio separation method based on hybrid features and codec of claim 1, wherein the one-dimensional convolution comprises a zero-padded basic structure and an activation function, the activation function being a LeakyReLU activation function or a tanh activation function.
10. The audio separation method based on mixed features and codec of claim 9, wherein the audio data processed by the up-sampling block is subjected to a one-dimensional convolution, the one-dimensional convolution is a tanh activation function, and the remaining one-dimensional convolutions are a LeakyReLU activation function.
11. The audio separation method based on mixed features and codec as claimed in claim 1, wherein, in the downsampling process, the features are decimated by keeping every other time step and discarding the rest, so that the time resolution is halved.
12. The audio separation method based on hybrid feature and codec of claim 1, wherein, in the upsampling process, upsampling is performed by a factor of two in the time direction using linear interpolation.
CN202110753553.4A 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding Active CN113488063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110753553.4A CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110753553.4A CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Publications (2)

Publication Number Publication Date
CN113488063A CN113488063A (en) 2021-10-08
CN113488063B (en) 2023-12-19

Family

ID=77940711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110753553.4A Active CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Country Status (1)

Country Link
CN (1) CN113488063B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035912B (en) * 2022-06-08 2024-04-26 哈尔滨工程大学 Automatic underwater sound signal sample labeling method based on MOC model
CN115116451A (en) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 Audio decoding method, audio encoding method, audio decoding device, audio encoding device, electronic equipment and storage medium
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107182279B (en) * 2010-10-19 2014-03-19 西南电子电信技术研究所 Speech detection method under complicated strong noise environment
CN107610715A (en) * 2017-10-10 2018-01-19 昆明理工大学 A kind of similarity calculating method based on muli-sounds feature
CN108597525A (en) * 2018-04-25 2018-09-28 四川远鉴科技有限公司 Voice vocal print modeling method and device
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110400575A (en) * 2019-07-24 2019-11-01 腾讯科技(深圳)有限公司 Interchannel feature extracting method, audio separation method and device calculate equipment
CN110503976A (en) * 2019-08-15 2019-11-26 广州华多网络科技有限公司 Audio separation method, device, electronic equipment and storage medium
CN110992978A (en) * 2019-12-18 2020-04-10 苏州思必驰信息科技有限公司 Training method and system for audio and video separation model
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
CN111540375A (en) * 2020-04-29 2020-08-14 全球能源互联网研究院有限公司 Training method of audio separation model, and audio signal separation method and device
CN111724807A (en) * 2020-08-05 2020-09-29 字节跳动有限公司 Audio separation method and device, electronic equipment and computer readable storage medium
CN112071329A (en) * 2020-09-16 2020-12-11 腾讯科技(深圳)有限公司 Multi-person voice separation method and device, electronic equipment and storage medium
CN112613657A (en) * 2020-12-22 2021-04-06 国网江苏省电力有限公司苏州供电分公司 Short-term wind speed prediction method for wind power plant
CN112992172A (en) * 2021-01-28 2021-06-18 广州大学 Single-channel time domain bird song separating method based on attention mechanism
CN112820300A (en) * 2021-02-25 2021-05-18 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium

Also Published As

Publication number Publication date
CN113488063A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN113488063B (en) Audio separation method based on mixed features and encoding and decoding
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
CN110709924B (en) Audio-visual speech separation
CN111508498B (en) Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
Hu et al. Pitch‐based gender identification with two‐stage classification
Xiang et al. A nested u-net with self-attention and dense connectivity for monaural speech enhancement
EP2363852A1 (en) Computer-based method and system of assessing intelligibility of speech represented by a speech signal
CN104318927A (en) Anti-noise low-bitrate speech coding method and decoding method
CN112270933B (en) Audio identification method and device
Lu et al. Self-supervised audio spatialization with correspondence classifier
Wu et al. Increasing compactness of deep learning based speech enhancement models with parameter pruning and quantization techniques
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN115602165A (en) Digital staff intelligent system based on financial system
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
Kaur et al. Maximum likelihood based estimation with quasi oppositional chemical reaction optimization algorithm for speech signal enhancement
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
Richter et al. Audio-visual speech enhancement with score-based generative models
Smietanka et al. Augmented Transformer for Speech Detection in Adverse Acoustical Conditions
Aung et al. M-Diarization: A Myanmar Speaker Diarization using Multi-scale dynamic weights
Ali et al. Enhancing Embeddings for Speech Classification in Noisy Conditions.
CN117095674B (en) Interactive control method and system for intelligent doors and windows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant