CN113488063A - Audio separation method based on mixed features and coding and decoding - Google Patents
- Publication number
- CN113488063A (application CN202110753553.4A)
- Authority
- CN
- China
- Prior art keywords
- audio
- data
- features
- coding
- decoding
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
- G10L25/18—the extracted parameters being spectral information of each sub-band
- G10L25/24—the extracted parameters being the cepstrum
- G10L25/30—characterised by the analysis technique using neural networks
- G10L25/87—Detection of discrete points within a voice signal
Abstract
The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and coding and decoding, comprising the following steps: data collection, in which audio data are collected through multiple terminal devices; preliminary training, in which part of the data is randomly extracted and the endpoints of the voiced and unvoiced portions are labeled; audio feature extraction, in which features are extracted in the frequency, time, and cepstrum domains of the audio signal and the multiple features are weighted through an attention mechanism; audio blank removal, in which low-energy regions of the audio are removed through a voice endpoint detection algorithm and a clustering classifier, high-energy regions containing distinct sounds are retained, and the segmented segments are spliced; and audio separation, in which, using the blank-removed audio waveform data, the speech waveform of the mixed audio is analyzed with a coding-and-decoding network structure, the sound sources are separated by predicting individual time-frequency masks, and audio containing only the human voice is output.
Description
Technical Field
The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and coding and decoding.
Background
With the popularization of multi-terminal devices and advances in high-speed network transmission technology, acquiring massive data is no longer the hard problem for enterprises. However, the large amount of acquired data often contains much low-quality data, which seriously affects the operation of an enterprise's downstream business. Voice data is a large category of internet data. In scenarios such as online teaching and audio/video conferencing, voice data is an important component of activities such as classroom monitoring, record keeping, and human-computer interaction. The voice data streams processed by back-end service servers contain a large number of speech blanks and unrecognizable environmental noise, which increases the processing burden on the service servers.
One way to alleviate this problem is to expand the number of distributed servers and use raw computing power to process the invalid data, but simply expanding computing power does not fundamentally solve the problem. Since the concept of the data middle platform was proposed, more and more enterprises are willing to preprocess data centrally with algorithms and then distribute the effective data, through data extraction, to each service server for use by downstream tasks.
A conventional speech-blank removal algorithm is the Voice Activity Detection (VAD) algorithm. Its main task is to accurately locate the start and end points of speech within noisy speech and to separate silence from actual speech; it is a key technique in speech signal processing. Traditional speech separation techniques include methods based on Non-negative Matrix Factorization (NMF) and on F0 estimation; these are traditional supervised methods for the speech separation task, and their separation quality is poor.
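As a rough illustration of the VAD idea (not the patent's specific algorithm), a minimal short-time-energy detector might look like this; the frame length and threshold are illustrative assumptions:

```python
import numpy as np

def energy_vad(signal, frame_len=256, threshold=0.01):
    """Mark each frame as speech (True) or silence (False) by short-time energy."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return energy > threshold

# Silence followed by a louder voiced-like burst.
sig = np.concatenate([np.zeros(1024), 0.5 * np.sin(np.linspace(0, 100, 1024))])
flags = energy_vad(sig)  # leading frames silent, trailing frames voiced
```

A real VAD would additionally smooth these frame decisions and hunt for precise start and end points; the method described below further corrects those endpoints with a trained classifier.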
In view of the above problems, the present designers, drawing on years of rich practical experience and professional knowledge in the engineering application of such products and on applied theory, have actively pursued research and innovation to create an audio separation method based on mixed features and coding and decoding that is more practical.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention provides an audio separation method based on mixed features and coding and decoding, thereby effectively solving the problems described in the background art.
To achieve this purpose, the invention adopts the following technical scheme: an audio separation method based on mixed features and coding and decoding, comprising the following steps:
Step 1: data collection; collect audio data through multiple terminal devices, and cache the audio stream data on a local server in the order of the data sequence;
Step 2: preliminary training; randomly extract part of the data for labeling, and label the endpoints of the voiced and unvoiced portions for reference training;
Step 3: audio feature extraction; extract features in the frequency, time, and cepstrum domains of the audio signal, and weight the multiple features through an attention mechanism;
Step 4: audio blank removal; using the final audio feature representation obtained in step 3, remove low-energy regions of the audio through a voice endpoint detection algorithm and a clustering classifier, retain high-energy regions containing distinct sounds, and splice the segmented effective audio segments;
Step 5: audio separation; taking the environmental sound and the human voice as two sound sources and using the blank-removed audio waveform data, analyze the speech waveform of the mixed audio with a coding-and-decoding network structure, and separate the sound sources by predicting individual time-frequency masks.
Further, in step 1, the order of the data sequence is the numbering order provided by each device before the data stream is transmitted.
Further, in step 4, when the segmented effective audio segments are spliced, transition signals are added to the two ends of the audio waveforms to smooth the ends of the effective audio segments.
Further, the environmental sounds are noise that affects recognition of the speaker's speech and the sounds of other, non-dominant speakers.
Further, in step 2, the labeled data are divided into a training set, a validation set, and a test set; the training set is used to train the classifier, the validation set to validate the model training process, and the test set to evaluate the effect of the final model.
Further, the proportion of the training set is 60% -90%.
Further, the classifier is a decision tree based audio classifier, a KNN based audio classifier, or a neural network based audio classifier.
Further, the trained classifier judges the categories of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
Further, in step 5, the design of the coding-and-decoding model structure includes:
S1: set the network to L levels, comprising L downsampling blocks and L upsampling blocks;
S2: process the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
S3: process the output of S2 with a one-dimensional convolution and then through the L upsampling blocks, computing local high-resolution features;
S4: combine the features from S2 and S3 to obtain multi-scale features for prediction;
S5: perform prediction on the multi-scale features, separating the audio data into K source audios.
Further, when combining the output features of the downsampling and upsampling blocks, the downsampling blocks are ordered forward along the processing sequence and the upsampling blocks in reverse, and the output features of each ordered downsampling block are combined with those of the upsampling block at the corresponding position.
Further, among the L levels, each successive level has half the time resolution of the previous level.
Further, each of the downsampling block and the upsampling block includes a convolutional layer and a residual network, and the residual network is located behind the convolutional layer to reduce overfitting.
Further, for the K source audios to be predicted, the model returns predictions in the interval (-1, 1), one for each source audio sample.
Further, the one-dimensional convolution comprises a basic structure with zero padding and an activation function, the activation function being a LeakyReLU or tanh activation function.
Further, the audio data processed by the upsampling blocks passes through one final one-dimensional convolution that uses a tanh activation function, while the other one-dimensional convolutions use LeakyReLU activation functions.
Further, during downsampling, features are discarded by decimating every other time step, halving the time resolution.
Further, during upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation.
The invention has the following beneficial effects: through the steps of data collection, preliminary training, audio feature extraction, audio blank removal, and audio separation, and in view of the large number of blanks and the environmental noise present in audio data acquired by multiple terminals, features are extracted in the frequency, time, and cepstrum domains and weighted through an attention mechanism, so that the resulting final audio feature data describes the characteristics of the sound better than any single feature. Combining the voice endpoint algorithm with a classifier makes the detected speech endpoints more accurate than those of traditional methods that use an endpoint detection algorithm alone, and avoids downstream speech recognition errors caused by removing valid speech. Using a deep learning model based on the coding-and-decoding framework improves the audio separation quality; the environmental sound is finally removed, and audio containing only the human voice is output.
Drawings
To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the following drawings cover only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of audio blank removal;
Fig. 3 is a flow chart of audio separation.
Detailed Description
The technical solutions in the embodiments of the invention are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention.
In the description of the present invention, it should be noted that the orientations or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like are based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; either directly or indirectly through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
As shown in figs. 1 and 2, an audio separation method based on mixed features and coding and decoding comprises the following steps:
Step 1: data collection; collect audio data through multiple terminal devices, and cache the audio stream data on a local server in the order of the data sequence;
Step 2: preliminary training; randomly extract part of the data for labeling, and label the endpoints of the voiced and unvoiced portions for reference training;
Step 3: audio feature extraction; extract features in the frequency, time, and cepstrum domains of the audio signal, and weight the multiple features through an attention mechanism;
Step 4: audio blank removal; using the final audio feature representation obtained in step 3, remove low-energy regions of the audio through a voice endpoint detection algorithm and a clustering classifier, retain high-energy regions containing distinct sounds, and splice the segmented effective audio segments;
Step 5: audio separation; taking the environmental sound and the human voice as two sound sources and using the blank-removed audio waveform data, analyze the speech waveform of the mixed audio with a coding-and-decoding network structure, and separate the sound sources by predicting individual time-frequency masks.
Through the steps of data collection, preliminary training, audio feature extraction, audio blank removal, and audio separation, and in view of the large number of blanks and the environmental noise present in audio data acquired by multiple terminals, features are extracted in the frequency, time, and cepstrum domains and weighted through an attention mechanism, so that the resulting final audio feature data describes the characteristics of the sound better than any single feature. Combining the voice endpoint algorithm with a classifier makes the detected speech endpoints more accurate than those of traditional methods that use an endpoint detection algorithm alone, and avoids downstream speech recognition errors caused by removing valid speech. Using a deep learning model based on the coding-and-decoding framework improves the audio separation quality; the environmental sound is finally removed, and audio containing only the human voice is output.
The frequency-domain features include frequency-domain energy, sub-band energy ratio, frame-based spectral centroid, and frame-based spectral bandwidth; the time-domain features include short-time energy, short-time zero-crossing rate, short-time average magnitude difference, and short-time average magnitude; and the cepstrum-domain features include Mel-frequency cepstral coefficients and linear prediction cepstral coefficients. These can also be combined with other features, including loudness, loudness range, reflection coefficients, mute-frame ratio, wavelet coefficients, and spectral entropy. The method computes these various sound features and obtains a weighted-average audio feature through an attention mechanism; the weighting coefficients can be trained by back-propagation.
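A simplified numpy sketch of a few of the listed per-frame features (short-time energy, zero-crossing rate, spectral centroid) and of attention-style weighting; the softmax over `scores` stands in for the trainable weighting coefficients mentioned above, and all parameter values are illustrative:

```python
import numpy as np

def frame_features(frame, sr=16000):
    """Small mixed-feature vector for one audio frame."""
    energy = np.mean(frame ** 2)                              # time domain
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2        # time domain
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * spec) / (np.sum(spec) + 1e-12)  # frequency domain
    return np.array([energy, zcr, centroid])

def attention_weight(features, scores):
    """Weight the features by a softmax over attention scores."""
    w = np.exp(scores) / np.sum(np.exp(scores))
    return w * features

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # 440 Hz tone
feats = frame_features(frame)
weighted = attention_weight(feats, scores=np.zeros(3))    # uniform weights here
```

In the patent's framing the scores would be learned by back-propagation rather than fixed, and the feature set would also cover the cepstrum-domain features listed above.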
In this embodiment, in step 1, the order of the data sequence is the numbering order provided by each device before the data stream is transmitted.
Because the order of the data sequence is the numbering order provided by each device before transmission, the spatial positions and other attributes of the multiple devices can be restored, which improves the subsequent processing and presentation of the audio data.
Preferably, in step 4, when the segmented effective audio segments are spliced, transition signals are added to the two ends of the audio waveforms to smooth the ends of the effective audio segments.
After the effective audio segments are cut out and spliced, the transition between two segments is unnatural and degrades the listening experience; adding transition signals at the two ends of the audio waveforms smooths the ends of the effective audio segments, so that the transitions between audio segments are more natural and the output audio sounds better.
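The described smoothing can be realized, for example, as a short linear crossfade at the joint between two retained segments; the fade length of 64 samples is an illustrative assumption, not a value from the patent:

```python
import numpy as np

def splice_with_crossfade(a, b, fade=64):
    """Concatenate segments a and b, linearly crossfading `fade` samples at the joint."""
    ramp = np.linspace(0.0, 1.0, fade)
    blended = a[-fade:] * (1 - ramp) + b[:fade] * ramp
    return np.concatenate([a[:-fade], blended, b[fade:]])

a = np.ones(256)
b = -np.ones(256)
out = splice_with_crossfade(a, b)
# The result is `fade` samples shorter than a plain concatenation.
```

In practice the fade length would be kept short (a few milliseconds) so that speech onsets at segment boundaries are not swallowed.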
In this embodiment, the environmental sounds are noise that affects recognition of the speaker's speech and the sounds of other, non-dominant speakers.
In scenarios such as meeting recording, the speech of the current main speaker needs to be recorded; the speech of other, non-dominant speakers and noise that interferes with recognizing the speaker's speech are both detrimental, so both are judged to be environmental sound. This ensures the speech recognition of the main speaker and improves the effect of the audio processing.
Preferably, in step 2, the labeled data are divided into a training set, a validation set, and a test set; the training set is used to train the classifier, the validation set to validate the model training process, and the test set to evaluate the effect of the final model.
Part of the data is randomly extracted and the endpoints of the voiced and unvoiced portions are labeled to provide reference training. Dividing the labeled data into training, validation, and test sets in this way improves the recognition performance of the classifier.
In this embodiment, the training set accounts for 60%-90% of the labeled data; most of the labeled data serves as the training set, and the rest is split between the validation and test sets, which may be of equal size. Training the classifier on the majority of the labeled data and then validating and testing it on the smaller remainder makes the classification more accurate.
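The split might be sketched as follows, assuming an 80/10/10 ratio (within the stated 60%-90% training range); the seed and ratios are illustrative choices:

```python
import numpy as np

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle and split labeled items into train/validation/test subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(items))
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    train_set = [items[i] for i in idx[:n_train]]
    val_set = [items[i] for i in idx[n_train:n_train + n_val]]
    test_set = [items[i] for i in idx[n_train + n_val:]]
    return train_set, val_set, test_set

tr, va, te = split_dataset(list(range(100)))
```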
The classifier can be an audio classifier based on a decision tree, on the K-Nearest Neighbor (KNN) algorithm, or on a neural network.
Preferably, in the above embodiment, the trained classifier judges the classes of the n frames before and after each endpoint obtained by the voice endpoint detection algorithm and corrects the endpoint.
When the blank and effective parts of the audio are segmented by the subsequent voice endpoint detection algorithm, the trained classifier judges whether the detected endpoints are accurate by classifying the n frames before and after each endpoint as valid or invalid data, and corrects the endpoint accordingly; this reduces erroneous screening of speech frames and improves recognition.
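One way to sketch the correction step: classify the n frames around a VAD endpoint and move the endpoint to the first frame judged voiced. The `classify` callable is a stand-in for the trained decision-tree/KNN/neural-network classifier, and the toy frames are illustrative:

```python
def correct_endpoint(frames, endpoint, n, classify):
    """Shift a VAD endpoint to the first frame the classifier marks as voiced.

    `classify(frame)` returns True for voiced/valid frames; it stands in for
    the trained classifier described in the text.
    """
    lo = max(0, endpoint - n)
    hi = min(len(frames), endpoint + n)
    for i in range(lo, hi):
        if classify(frames[i]):
            return i  # first voiced frame near the detected endpoint
    return endpoint

# Toy frames: amplitude stands in for frame content; voiced if > 0.
frames = [0.0, 0.0, 0.0, 0.4, 0.5, 0.6]
fixed = correct_endpoint(frames, endpoint=2, n=2, classify=lambda f: f > 0)
# VAD placed the start at frame 2; the window shows voicing begins at frame 3.
```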
As shown in fig. 3, in this embodiment, the structural design of the coding-and-decoding model includes:
S1: set the model network to L levels, comprising L downsampling blocks and L upsampling blocks;
S2: process the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
S3: process the output of S2 with a one-dimensional convolution and then through the L upsampling blocks, computing local high-resolution features;
S4: combine the features from S2 and S3 to obtain multi-scale features for prediction;
S5: perform prediction on the multi-scale features, separating the audio data into K source audios.
The network is set to L levels: the audio data are processed through the downsampling blocks to obtain high-level features and through the upsampling blocks to obtain local high-resolution features, and the two are combined into multi-scale features for prediction. Prediction on the multi-scale features separates the audio data into K source audios, thereby separating the audio. Here only the human voice and the environmental sound need to be separated, so K = 2. Because the outputs of the downsampling and upsampling blocks are combined during audio separation, the accuracy of the separation is increased.
In this embodiment, when combining the output features of the downsampling and upsampling blocks, the downsampling blocks are ordered forward along the processing sequence and the upsampling blocks in reverse, and the output features of each ordered downsampling block are combined with the output features of the upsampling block at the corresponding position.
Ordering the downsampling blocks forward and the upsampling blocks in reverse, and combining the output of each downsampling block with that of the upsampling block at the corresponding position, pairs the current high-level features with the corresponding local high-resolution features and improves recognition accuracy.
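As a rough structural sketch (assuming a Wave-U-Net-style layout, which this description resembles but does not name): downsampling halves the time resolution by decimation, upsampling doubles it by linear interpolation, and the output of each downsampling block is paired with the upsampling block at the mirrored position. The raw signals here stand in for learned convolution features:

```python
import numpy as np

def downsample(x):
    """Halve time resolution by discarding every other time step."""
    return x[::2]

def upsample(x):
    """Double time resolution in the time direction by linear interpolation."""
    t_new = np.linspace(0, len(x) - 1, 2 * len(x))
    return np.interp(t_new, np.arange(len(x)), x)

L = 3
x = np.arange(64, dtype=float)
skips = []                       # outputs of the downsampling blocks, in order
for _ in range(L):               # encoder: L downsampling blocks
    skips.append(x)
    x = downsample(x)
for i in range(L):               # decoder: L upsampling blocks
    x = upsample(x)
    skip = skips[L - 1 - i]      # pair with the mirrored downsampling output
    x = x[:len(skip)] + skip     # stand-in for concatenating feature channels
```

The reversed indexing `skips[L - 1 - i]` is the forward/reverse ordering described above: the deepest downsampling output meets the first upsampling block, and the shallowest meets the last.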
Among the L levels, each successive level has half the time resolution of the previous one: after processing at one level the features are output and the subsequent amount of computation is reduced, so each successive level is set to half the time resolution of the previous level.
Preferably, each of the downsampling block and the upsampling block includes a convolutional layer and a residual network, and the residual network is located behind the convolutional layer to reduce the overfitting.
A residual network is arranged behind one convolution layer to reduce overfitting, thereby improving audio separation quality.
In this embodiment, for the K source audios to be predicted, the model returns predictions in the interval (-1, 1), one for each source audio sample, thereby improving the prediction and separation effect.
In this embodiment, Conv1D(x, y) denotes a one-dimensional convolution with x filters of size y; the basic structure of the one-dimensional convolution comprises zero padding and an activation function. In an extension of this basic structure, the Conv1D does not use zero padding; the input is instead center-cropped first, so that it has the same number of time steps as the current layer.
The activation function is a LeakyReLU or a tanh activation function. The LeakyReLU activation function in theory retains all the advantages of ReLU while avoiding the dead-ReLU problem, because its negative slope is a small fixed parameter rather than zero. The tanh activation function has a value range of -1 to 1, mapping negative inputs to negative values and inputs near 0 to values near 0.
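The two activations can be compared directly; the negative slope of 0.01 for LeakyReLU is an illustrative choice, not a value from the patent:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, small fixed slope for negative ones."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.0, 2.0])
lr = leaky_relu(x)   # negative inputs keep a small gradient (no dead units)
th = np.tanh(x)      # all outputs bounded in (-1, 1)
```

The bounded tanh range matches the model's output requirement that predictions lie in the interval (-1, 1), which is why tanh is reserved for the final convolution.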
As a preferred case of the above embodiment, the audio data processed by the upsampling blocks passes through one final one-dimensional convolution that uses the tanh activation function, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Because the model returns predictions in the interval (-1, 1), the one-dimensional convolution applied to the audio data after the upsampling blocks uses the tanh activation function, and the activation functions of the other one-dimensional convolutions are set to LeakyReLU.
In an embodiment, during downsampling, features are discarded by decimating every other time step, halving the time resolution. During upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation, which is simpler and more convenient than other interpolation methods.
It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are presented in the specification and drawings only to illustrate the principles of the invention. Various changes and modifications may be made without departing from the spirit and scope of the invention, and all such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (17)
1. An audio separation method based on mixed features and coding and decoding is characterized by comprising the following steps:
the method comprises the following steps: collecting data; collecting audio data through a plurality of terminal devices, and caching the audio stream data in a local server according to the sequence of the data sequence;
step two: training in an early stage; randomly extracting partial data, and labeling the end points of the voiced part and the unvoiced part for reference training;
step three: extracting audio features; extracting characteristics in a frequency domain, a time domain and a cepstrum domain of the audio signal, and weighting a plurality of characteristics through an attention mechanism;
step four: removing audio blanks; based on the final audio features obtained in step three, low-energy regions of the audio are removed by a voice endpoint detection algorithm and a clustering classifier, high-energy regions containing distinct sounds are retained, and the segmented valid audio segments are spliced together;
step five: audio separation; the environmental sound and the human voice are taken as two sound sources, the audio waveform data after blank removal is used, the voice waveform of the mixed audio is analyzed by utilizing a coding and decoding network structure, and the sound sources are separated by predicting a single time-frequency mask.
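The blank-removal of step four can be sketched as a simple frame-energy gate; the frame length and threshold below are illustrative stand-ins for the endpoint detector and clustering classifier the claim actually describes:

```python
import numpy as np

def remove_blanks(audio, frame_len=256, energy_thresh=1e-3):
    """Keep only high-energy frames of a waveform and concatenate them,
    dropping low-energy (blank) regions."""
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)     # per-frame mean energy
    voiced = frames[energy > energy_thresh]   # retain audible frames only
    return voiced.reshape(-1)
```

Applied to a signal whose first half is silence, only the audible second half survives.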
2. The audio separation method based on mixed features and coding/decoding of claim 1, wherein in the first step, the sequence order of the data is the numbering order assigned by the device before the data stream is transmitted.
3. The audio separation method based on mixing feature and coding/decoding of claim 1, wherein in the fourth step, when splicing the divided valid audio segments, the two ends of the valid audio segments are smoothed by adding transition signals to the two ends of the audio waveform.
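The transition signals of claim 3 can be realized as a short linear crossfade at each splice point; the fade length here is an assumption, not specified by the claim:

```python
import numpy as np

def splice_with_crossfade(seg_a, seg_b, fade=64):
    """Join two audio segments, linearly fading seg_a out and seg_b in
    over `fade` samples to smooth the boundary between them."""
    ramp = np.linspace(0.0, 1.0, fade)
    overlap = seg_a[-fade:] * (1.0 - ramp) + seg_b[:fade] * ramp
    return np.concatenate([seg_a[:-fade], overlap, seg_b[fade:]])
```

Because the fade-out and fade-in gains sum to one, splicing two constant segments leaves no discontinuity at the joint.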
4. The method of claim 1, wherein the environmental sounds are noise that affects speech recognition of the target speaker, together with the voices of other, non-dominant speakers.
5. The audio separation method based on mixed features and coding/decoding of claim 1, wherein in the second step, the labeled data is divided into a training set, a validation set and a test set, the training set is used for training the classifier, the validation set is used for validating the training process of the model, and the test set is used for detecting the effect of the final model.
6. The method of claim 5, wherein the training set accounts for 60% to 90% of the labeled data.
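A minimal sketch of the split described in claims 5 and 6, assuming an 80/10/10 partition (which keeps the training share inside the claimed 60%-90% range); the function name and seed are illustrative:

```python
import numpy as np

def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle labeled samples and split them into training,
    validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_train = int(len(samples) * train_frac)
    n_val = int(len(samples) * val_frac)
    train = [samples[i] for i in idx[:n_train]]
    val = [samples[i] for i in idx[n_train:n_train + n_val]]
    test = [samples[i] for i in idx[n_train + n_val:]]
    return train, val, test
```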
7. The method of claim 5, wherein the classifier is a decision tree-based audio classifier, a KNN-based audio classifier, or a neural network-based audio classifier.
8. The audio separation method based on mixed features and coding/decoding of claim 5, wherein the trained classifier determines the classes of n frames before and after the endpoint obtained by the speech endpoint detection algorithm and corrects the endpoint.
9. The audio separation method based on mixed features and coding-decoding as claimed in claim 1, wherein in the step five, the design of coding-decoding model structure comprises:
s1: setting the model network to L levels, wherein the L levels comprise L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
s3: the data processed in S2 is passed through a one-dimensional convolution and then through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features in S2 and S3 to obtain multi-scale features for prediction;
s5: and performing prediction through multi-scale features, and separating the audio data into K source audios.
10. The audio separation method based on mixed features and coding/decoding of claim 9, wherein when the downsampling-block output features are combined with the upsampling-block output features, the downsampling blocks are arranged forward along the processing order, the upsampling blocks are arranged in reverse processing order, and each downsampling-block output feature after arrangement is combined with the upsampling-block output feature at the corresponding position.
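The pairing of claim 10 amounts to matching downsampling-block outputs with upsampling-block outputs in mirrored (U-Net-style) order; a sketch with placeholder feature maps:

```python
def pair_skip_connections(down_outputs, up_outputs):
    """Pair downsampling-block outputs (in processing order) with
    upsampling-block outputs traversed in reverse processing order,
    as in a U-Net-style skip connection."""
    return list(zip(down_outputs, reversed(up_outputs)))
```

For three levels, the first downsampling block's features are combined with the last upsampling block's features, and so on inward.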
11. The method of claim 9, wherein the time resolution of each successive level of the L levels is half that of the previous level.
12. The method of claim 9, wherein each of the downsampled block and the upsampled block comprises a convolutional layer and a residual network, and the residual network is located after the convolutional layer to reduce overfitting.
13. The method of claim 9, wherein for K source audios to be predicted, the model returns predictions in the interval (-1, 1), one for each source audio sample.
14. The method of claim 9, wherein the one-dimensional convolution includes a zero-padded basic structure and an activation function, and the activation function is a LeakyReLU activation function or a tanh activation function.
15. The method of claim 14, wherein the upsampled block processed audio data is subjected to a one-dimensional convolution, the one-dimensional convolution is a tanh activation function, and the remaining one-dimensional convolutions are a LeakyReLU activation function.
16. The method of claim 9, wherein during downsampling, features are decimated by discarding every other time step and the temporal resolution is halved.
17. The audio separation method based on mixture features and codec of claim 9, wherein the upsampling is performed by a multiple of two in the time direction using linear interpolation at the time of the upsampling process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110753553.4A CN113488063B (en) | 2021-07-02 | 2021-07-02 | Audio separation method based on mixed features and encoding and decoding |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113488063A true CN113488063A (en) | 2021-10-08 |
CN113488063B CN113488063B (en) | 2023-12-19 |
Family
ID=77940711
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110753553.4A Active CN113488063B (en) | 2021-07-02 | 2021-07-02 | Audio separation method based on mixed features and encoding and decoding |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113488063B (en) |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107182279B (en) * | 2010-10-19 | 2014-03-19 | 西南电子电信技术研究所 | Speech detection method under complicated strong noise environment |
CN107610715A (en) * | 2017-10-10 | 2018-01-19 | 昆明理工大学 | A kind of similarity calculating method based on muli-sounds feature |
CN108597525A (en) * | 2018-04-25 | 2018-09-28 | 四川远鉴科技有限公司 | Voice vocal print modeling method and device |
CN109935243A (en) * | 2019-02-25 | 2019-06-25 | 重庆大学 | Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model |
CN110400575A (en) * | 2019-07-24 | 2019-11-01 | 腾讯科技(深圳)有限公司 | Interchannel feature extracting method, audio separation method and device calculate equipment |
CN110503976A (en) * | 2019-08-15 | 2019-11-26 | 广州华多网络科技有限公司 | Audio separation method, device, electronic equipment and storage medium |
CN110992978A (en) * | 2019-12-18 | 2020-04-10 | 苏州思必驰信息科技有限公司 | Training method and system for audio and video separation model |
CN111243619A (en) * | 2020-01-06 | 2020-06-05 | 平安科技(深圳)有限公司 | Training method and device for voice signal segmentation model and computer equipment |
CN111540375A (en) * | 2020-04-29 | 2020-08-14 | 全球能源互联网研究院有限公司 | Training method of audio separation model, and audio signal separation method and device |
CN111724807A (en) * | 2020-08-05 | 2020-09-29 | 字节跳动有限公司 | Audio separation method and device, electronic equipment and computer readable storage medium |
CN112071329A (en) * | 2020-09-16 | 2020-12-11 | 腾讯科技(深圳)有限公司 | Multi-person voice separation method and device, electronic equipment and storage medium |
CN112613657A (en) * | 2020-12-22 | 2021-04-06 | 国网江苏省电力有限公司苏州供电分公司 | Short-term wind speed prediction method for wind power plant |
CN112820300A (en) * | 2021-02-25 | 2021-05-18 | 北京小米松果电子有限公司 | Audio processing method and device, terminal and storage medium |
CN112992172A (en) * | 2021-01-28 | 2021-06-18 | 广州大学 | Single-channel time domain bird song separating method based on attention mechanism |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115035912A (en) * | 2022-06-08 | 2022-09-09 | 哈尔滨工程大学 | Automatic underwater acoustic signal sample labeling method based on MOC model |
CN115035912B (en) * | 2022-06-08 | 2024-04-26 | 哈尔滨工程大学 | Automatic underwater sound signal sample labeling method based on MOC model |
WO2023241254A1 (en) * | 2022-06-15 | 2023-12-21 | 腾讯科技(深圳)有限公司 | Audio encoding and decoding method and apparatus, electronic device, computer readable storage medium, and computer program product |
CN117475360A (en) * | 2023-12-27 | 2024-01-30 | 南京纳实医学科技有限公司 | Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN |
CN117475360B (en) * | 2023-12-27 | 2024-03-26 | 南京纳实医学科技有限公司 | Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN |
CN117854514A (en) * | 2024-03-06 | 2024-04-09 | 深圳市增长点科技有限公司 | Wireless earphone communication decoding optimization method and system for sound quality fidelity |
CN117854514B (en) * | 2024-03-06 | 2024-05-31 | 深圳市增长点科技有限公司 | Wireless earphone communication decoding optimization method and system for sound quality fidelity |
Also Published As
Publication number | Publication date |
---|---|
CN113488063B (en) | 2023-12-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113488063B (en) | Audio separation method based on mixed features and encoding and decoding | |
US10878823B2 (en) | Voiceprint recognition method, device, terminal apparatus and storage medium | |
Hu et al. | Pitch‐based gender identification with two‐stage classification | |
CN104318927A (en) | Anti-noise low-bitrate speech coding method and decoding method | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
JP2006079079A (en) | Distributed speech recognition system and its method | |
CN101366078A (en) | Neural network classifier for separating audio sources from a monophonic audio signal | |
Jiang et al. | An Improved Speech Segmentation and Clustering Algorithm Based on SOM and K‐Means | |
CN111951824A (en) | Detection method for distinguishing depression based on sound | |
CN109036470B (en) | Voice distinguishing method, device, computer equipment and storage medium | |
CN111986699B (en) | Sound event detection method based on full convolution network | |
CN112270933B (en) | Audio identification method and device | |
CN109300470B (en) | Mixing separation method and mixing separation device | |
WO2019232867A1 (en) | Voice discrimination method and apparatus, and computer device, and storage medium | |
CN111540364A (en) | Audio recognition method and device, electronic equipment and computer readable medium | |
CN115602165A (en) | Digital staff intelligent system based on financial system | |
CN113823293A (en) | Speaker recognition method and system based on voice enhancement | |
CN115881156A (en) | Multi-scale-based multi-modal time domain voice separation method | |
CN113782051B (en) | Broadcast effect classification method and system, electronic equipment and storage medium | |
Park et al. | Analysis of confidence and control through voice of Kim Jung-un | |
Hong | Speaker gender recognition system | |
CN114302301A (en) | Frequency response correction method and related product | |
Feng et al. | Noise Classification Speech Enhancement Generative Adversarial Network | |
Camarena-Ibarrola et al. | Speaker identification using entropygrams and convolutional neural networks | |
Smietanka et al. | Augmented transformer for speech detection in adverse acoustical conditions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||