CN113488063A - Audio separation method based on mixed features and coding and decoding - Google Patents

Audio separation method based on mixed features and coding and decoding

Info

Publication number
CN113488063A
Authority
CN
China
Prior art keywords
audio
data
features
coding
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110753553.4A
Other languages
Chinese (zh)
Other versions
CN113488063B (en)
Inventor
吴俊
颜庆国
董勤伟
查显光
崔林
赵新冬
戴威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN202110753553.4A priority Critical patent/CN113488063B/en
Publication of CN113488063A publication Critical patent/CN113488063A/en
Application granted granted Critical
Publication of CN113488063B publication Critical patent/CN113488063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and coding and decoding, which comprises the following steps: data collection, in which audio data are collected from multiple terminal devices; preliminary training, in which part of the data is randomly sampled and the endpoints of the voiced and unvoiced portions are labeled; audio feature extraction, in which features are extracted from the frequency, time and cepstrum domains of the audio signal and the multiple features are weighted by an attention mechanism; audio blank removal, in which low-energy regions of the audio are removed by a voice endpoint detection algorithm and a clustering classifier, high-energy regions containing distinct sound are retained, and the segmented segments are spliced; and audio separation, in which the blank-removed audio waveform data are analyzed with a coding and decoding network structure, the sound sources are separated by predicting individual time-frequency masks, and audio containing only the human voice is output.

Description

Audio separation method based on mixed features and coding and decoding
Technical Field
The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and coding and decoding.
Background
With the popularization of multi-terminal devices and the progress of high-speed network transmission technology, acquiring massive amounts of data is no longer the problem that enterprises find hard to solve. The large volume of acquired data, however, often contains a great deal of low-quality data, which seriously affects the operation of the enterprises' downstream services. Voice data is a major category of Internet data. In scenarios such as online teaching and audio/video conferencing, voice data is an important component of activities such as classroom activity monitoring, record keeping and human-computer interaction. The voice data stream handled by the back-end service servers contains a large number of speech blanks and unrecognizable ambient noise, which increases their processing burden.
One way to alleviate this problem is to expand the number of distributed servers and rely on raw computing power to process the invalid data, but simply expanding computing power does not fundamentally solve the problem. Since the concept of the data middle platform was proposed, more and more enterprises have been willing to preprocess data centrally with algorithms and then, through data extraction, distribute the effective data to each service server for downstream tasks.
A conventional speech blank removal algorithm is the Voice Activity Detection (VAD) algorithm. Its main task is to accurately locate the start and end points of speech within noisy audio and to separate silence from actual speech, and it is a key technique in speech signal processing. Traditional speech separation techniques include methods based on Non-negative Matrix Factorization (NMF) and on F0 estimation, which are conventional supervised approaches to the speech separation task and give poor separation results.
In view of the above problems, the present inventors, drawing on years of practical experience and professional knowledge in the engineering application of such products and on relevant theory, have actively pursued research and innovation in order to create an audio separation method based on mixed features and coding and decoding that is more practical.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention provides an audio separation method based on mixed features and coding and decoding, thereby effectively solving the problems described in the background art.
In order to achieve this purpose, the invention adopts the following technical solution: an audio separation method based on mixed features and coding and decoding, comprising the following steps:
Step one: data collection; audio data are collected from a plurality of terminal devices, and the audio stream data are cached on a local server in the order of the data sequence;
Step two: preliminary training; part of the data is randomly sampled and labeled, marking the endpoints of the voiced and unvoiced portions for reference training;
Step three: audio feature extraction; features are extracted from the frequency, time and cepstrum domains of the audio signal, and the multiple features are weighted by an attention mechanism;
Step four: audio blank removal; using the final audio feature representation obtained in step three, low-energy regions of the audio are removed by a voice endpoint detection algorithm and a clustering classifier, high-energy regions containing distinct sound are retained, and the segmented effective audio segments are spliced;
Step five: audio separation; the environmental sound and the human voice are treated as two sound sources, the blank-removed audio waveform data are used, the voice waveform of the mixed audio is analyzed with a coding and decoding network structure, and the sound sources are separated by predicting individual time-frequency masks.
Further, in step one, the order of the data sequence is the numbering order provided by each device before its data stream is transmitted.
Further, in step four, when the segmented effective audio segments are spliced, transition signals are added at both ends of each audio waveform to smooth the two ends of the effective audio segments.
Further, the environmental sounds are noise that interferes with recognizing the speaker's speech and the voices of other, non-dominant speakers.
Further, in the second step, the labeled data is divided into a training set, a verification set and a test set, the training set is used for training the classifier, the verification set is used for verifying the model training process, and the test set is used for detecting the effect of the final model.
Further, the proportion of the training set is 60%-90%.
Further, the classifier is a decision tree based audio classifier, a KNN based audio classifier, or a neural network based audio classifier.
Further, the trained classifier judges the categories of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
Further, in the fifth step, the design of the coding and decoding model structure includes:
s1: setting the network to L levels, including L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
s3: passing the data from S2 through a one-dimensional convolution and then through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features in S2 and S3 to obtain multi-scale features for prediction;
s5: and performing prediction through multi-scale features, and separating the audio data into K source audios.
Further, when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered in reverse along the processing sequence, and the output features of each downsampling block are combined with the output features of the upsampling block at the corresponding position in that ordering.
Further, among the L levels, each successive level operates at half the time resolution of the previous level.
Further, each of the downsampling block and the upsampling block includes a convolutional layer and a residual network, and the residual network is located behind the convolutional layer to reduce overfitting.
Further, the model returns a prediction in the interval (-1, 1), one for each source audio sample, for K source audios to be predicted.
Further, the one-dimensional convolution consists of a zero-padded basic structure and an activation function, the activation function being a LeakyReLU activation function or a tanh activation function.
Further, the audio data processed by the upsampling blocks passes through a one-dimensional convolution whose activation function is tanh, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Further, during downsampling, features are decimated by discarding every other time step, and the time resolution is halved.
Further, during upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation.
The invention has the following beneficial effects: through the steps of data collection, preliminary training, audio feature extraction, audio blank removal and audio separation, and in view of the large number of blanks and the environmental noise in audio data acquired from multiple terminals, features are extracted in the frequency, time and cepstrum domains and weighted by an attention mechanism, so that the resulting final audio feature data describe the characteristics of sound better than any single feature. Combining the voice endpoint algorithm with a classifier makes the detected speech endpoints more accurate than in the traditional approach of using an endpoint detection algorithm alone, avoiding downstream speech recognition errors caused by removing valid speech. Using a deep learning model based on the coding and decoding framework improves the audio separation quality, so that the environmental sound is finally removed and audio containing only the human voice is output.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of audio blank removal;
Fig. 3 is a flow chart of audio separation.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
In the description of the present invention, it should be noted that the orientations or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", and the like are based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplification of description, but do not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; either directly or indirectly through intervening media, or may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
As shown in FIG. 1 and FIG. 2, an audio separation method based on mixed features and coding and decoding comprises the following steps:
Step one: data collection; audio data are collected from a plurality of terminal devices, and the audio stream data are cached on a local server in the order of the data sequence;
Step two: preliminary training; part of the data is randomly sampled and labeled, marking the endpoints of the voiced and unvoiced portions for reference training;
Step three: audio feature extraction; features are extracted from the frequency, time and cepstrum domains of the audio signal, and the multiple features are weighted by an attention mechanism;
Step four: audio blank removal; using the final audio feature representation obtained in step three, low-energy regions of the audio are removed by a voice endpoint detection algorithm and a clustering classifier, high-energy regions containing distinct sound are retained, and the segmented effective audio segments are spliced;
Step five: audio separation; the environmental sound and the human voice are treated as two sound sources, the blank-removed audio waveform data are used, the voice waveform of the mixed audio is analyzed with a coding and decoding network structure, and the sound sources are separated by predicting individual time-frequency masks.
Through the steps of data collection, preliminary training, audio feature extraction, audio blank removal and audio separation, and in view of the large number of blanks and the environmental noise in audio data acquired from multiple terminals, features are extracted in the frequency, time and cepstrum domains and weighted by an attention mechanism, so that the resulting final audio feature data describe the characteristics of sound better than any single feature. Combining the voice endpoint algorithm with a classifier makes the detected speech endpoints more accurate than in the traditional approach of using an endpoint detection algorithm alone, avoiding downstream speech recognition errors caused by removing valid speech. Using a deep learning model based on the coding and decoding framework improves the audio separation quality, so that the environmental sound is finally removed and audio containing only the human voice is output.
The frequency-domain features include frequency-domain energy, sub-band energy ratio, frame-based spectral centroid and frame-based spectral bandwidth; the time-domain features include short-time energy, short-time zero-crossing rate, short-time average magnitude difference, short-time average magnitude and the like; the cepstrum-domain features include Mel-frequency cepstral coefficients and linear prediction cepstral coefficients. These can also be combined with other features, including loudness, loudness range, reflection coefficients, silent-frame ratio, wavelet coefficients, spectral entropy and the like. The method computes these various sound features and obtains a weighted-average audio feature through an attention mechanism, and the weighting coefficients can be trained by back-propagation.
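As a minimal, hedged illustration of this feature fusion (not the patented implementation), the sketch below computes three of the frame-level features named above with NumPy and fuses them with softmax attention weights; the frame length, hop size and the softmax form of the weighting are assumptions made for the example.

    import numpy as np

    def frame_signal(x, frame_len=400, hop=160):
        # Split a 1-D signal into overlapping frames: (n_frames, frame_len).
        n = 1 + max(0, (len(x) - frame_len) // hop)
        idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
        return x[idx]

    def frame_features(x, sr=16000):
        frames = frame_signal(x)
        energy = (frames ** 2).sum(axis=1)                                  # short-time energy
        zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # short-time zero-crossing rate
        spec = np.abs(np.fft.rfft(frames, axis=1))
        freqs = np.fft.rfftfreq(frames.shape[1], d=1.0 / sr)
        centroid = (spec * freqs).sum(axis=1) / (spec.sum(axis=1) + 1e-8)   # frame-based spectral centroid
        feats = np.stack([energy, zcr, centroid], axis=1)
        # Normalise each feature so the attention weights act on comparable scales.
        return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

    def attention_fuse(feats, scores):
        # Weighted average over the feature dimension; `scores` would be trained by back-propagation.
        w = np.exp(scores) / np.exp(scores).sum()
        return feats @ w

    if __name__ == "__main__":
        x = np.random.randn(16000).astype(np.float32)           # one second of dummy audio
        feats = frame_features(x)                                # (n_frames, 3)
        fused = attention_fuse(feats, scores=np.zeros(3))        # (n_frames,)
        print(feats.shape, fused.shape)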
In this embodiment, in the first step, the sequence of the data sequence is a numbering sequence provided by the device before the data stream is transmitted.
The sequence of the data sequence is the numbering sequence provided by the device before the data stream is transmitted, so that the spatial positions and the like of the multiple devices are restored, and the processing and expression effects of subsequent audio data are improved.
Preferably, in step four, when the segmented effective audio segments are spliced, transition signals are added at both ends of each audio waveform to smooth the two ends of the effective audio segments.
If the effective audio segments were simply segmented and concatenated, the transition between two segments would be unnatural and would degrade the subsequent listening experience. Adding transition signals at both ends of the audio waveforms smooths the ends of the effective segments, so the transitions between segments sound more natural and the output audio is better.
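The patent does not fix the form of these transition signals; one common, assumed realization is a short linear cross-fade at each joint, sketched below.

    import numpy as np

    def splice_with_crossfade(segments, fade_len=160):
        # Concatenate effective segments, cross-fading over `fade_len` samples at each joint
        # so the splice has no audible click. Segments must be longer than `fade_len`.
        fade_in = np.linspace(0.0, 1.0, fade_len)
        fade_out = fade_in[::-1]
        out = segments[0].astype(np.float64).copy()
        for seg in segments[1:]:
            seg = seg.astype(np.float64)
            out[-fade_len:] = out[-fade_len:] * fade_out + seg[:fade_len] * fade_in
            out = np.concatenate([out, seg[fade_len:]])
        return out

    if __name__ == "__main__":
        a, b = np.ones(1000), -np.ones(1000)
        print(splice_with_crossfade([a, b]).shape)   # (1840,) = 1000 + 1000 - 160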
In this embodiment, the environmental sounds are noise affecting speech recognition of the speaker and sounds of other non-dominant speakers.
In scenarios such as meeting recording, the voice of the current main speaker needs to be recorded; the voices of other, non-dominant speakers and the noise that interferes with recognizing the speaker's voice are both disruptive, and both are treated as environmental sound. This protects the speech recognition of the main speaker and improves the effect of the audio processing.
Preferably, in step two, the labeled data are divided into a training set, a verification set and a test set: the training set is used to train the classifier, the verification set is used to check the model training process, and the test set is used to evaluate the final model.
Part of the data is randomly sampled and the endpoints of the voiced and unvoiced portions are labeled to provide reference training data. Dividing the labeled data into a training set, a verification set and a test set, with the training set used to train the classifier, the verification set used to check the training process and the test set used to evaluate the final model, improves the recognition performance of the classifier.
In this embodiment, the proportion of the training set is 60%-90%; most of the labeled data serves as the training set, and the remainder is split between the verification set and the test set, which may be of equal size. Training the classifier on the bulk of the labeled data and then verifying and testing it on the smaller sets makes the classification more accurate.
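A brief, hedged sketch of such a split follows; the 70%/15%/15% ratio is one choice within the 60%-90% training share stated above.

    import numpy as np

    def split_dataset(items, train_frac=0.70, val_frac=0.15, seed=0):
        # Shuffle the labeled items and cut them into training / verification / test sets.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(items))
        n_train = int(train_frac * len(items))
        n_val = int(val_frac * len(items))
        train = [items[i] for i in idx[:n_train]]
        val = [items[i] for i in idx[n_train:n_train + n_val]]
        test = [items[i] for i in idx[n_train + n_val:]]
        return train, val, test

    if __name__ == "__main__":
        sets = split_dataset(list(range(1000)))
        print([len(s) for s in sets])   # [700, 150, 150]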
The classifier can be an audio classifier based on a decision tree, an audio classifier based on the K-Nearest Neighbor (KNN) algorithm, or an audio classifier based on a neural network.
Preferably, in the above embodiment, the trained classifier determines the classes of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
When the blank and effective parts of the audio are subsequently divided by the voice endpoint detection algorithm, the trained classifier is used to check whether the detected endpoints are accurate: it classifies the n frames before and after each endpoint obtained by the voice endpoint detection algorithm as effective or ineffective data and corrects the endpoint accordingly, which reduces the mis-screening of speech frames and improves the recognition effect.
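The sketch below illustrates one assumed form of this correction: a simple energy-threshold VAD proposes voiced runs, and a trained frame classifier (scikit-learn's KNeighborsClassifier is used here purely as an example) re-labels the n frames on each side of every endpoint and moves the boundary accordingly. The 1 = effective / 0 = ineffective label convention is also an assumption.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def voiced_runs(active):
        # `active` is a boolean per-frame mask; returns (start, end_exclusive) pairs of voiced runs.
        padded = np.concatenate([[False], active, [False]])
        change = np.flatnonzero(np.diff(padded.astype(int)))
        return list(zip(change[::2], change[1::2]))

    def correct_endpoint(endpoint, frame_feats, clf, n=5):
        # Re-classify the n frames on each side of the endpoint and move it to the first
        # frame the classifier marks as effective; leave it unchanged if none is found.
        lo, hi = max(0, endpoint - n), min(len(frame_feats), endpoint + n + 1)
        labels = clf.predict(frame_feats[lo:hi])
        voiced = np.flatnonzero(labels == 1)
        return lo + int(voiced[0]) if voiced.size else endpoint

    if __name__ == "__main__":
        feats = np.random.randn(200, 3)                          # per-frame features from step three
        labels = (feats[:, 0] > 0.4).astype(int)                 # stand-in "effective" labels for training
        clf = KNeighborsClassifier(n_neighbors=3).fit(feats, labels)
        runs = voiced_runs(feats[:, 0] > 0.5)                    # crude energy-threshold VAD
        corrected = [(correct_endpoint(s, feats, clf), correct_endpoint(e, feats, clf)) for s, e in runs]
        print(corrected[:3])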
As shown in FIG. 3, in the present embodiment, the design of the coding and decoding model structure includes:
s1: setting the model network to L levels, wherein the L levels comprise L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on coarser time scales;
s3: passing the data from S2 through a one-dimensional convolution and then through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features in S2 and S3 to obtain multi-scale features for prediction;
s5: and performing prediction through multi-scale features, and separating the audio data into K source audios.
The network is set to L levels. The audio data are processed by the downsampling blocks to obtain high-level features and by the upsampling blocks to obtain local, high-resolution features; the two are combined into multi-scale features used for prediction, and on this basis the audio data are separated into K source audios, achieving the audio separation. Here only the human voice and the environmental sound need to be separated, so K = 2. Because the outputs of the downsampling blocks and the upsampling blocks are combined during separation, the accuracy of the separation is increased.
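The PyTorch sketch below illustrates the S1-S5 structure under stated assumptions: the channel counts, kernel sizes, decimation by keeping every other sample, linear-interpolation upsampling and the reverse-order skip pairing are illustrative choices in the spirit of Wave-U-Net, not a reproduction of the patented model; the residual networks inside each block, described further below, are omitted here for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvBlock(nn.Module):
        # One 1-D convolution with a LeakyReLU activation (residual body omitted in this sketch).
        def __init__(self, in_ch, out_ch, kernel=15):
            super().__init__()
            self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2)

        def forward(self, x):
            return F.leaky_relu(self.conv(x))

    class CodecSeparator(nn.Module):
        # L downsampling blocks, a bottleneck, L upsampling blocks with reverse-order skip
        # connections, and a final tanh 1-D convolution producing K sources in (-1, 1).
        def __init__(self, levels=4, base_ch=24, sources=2):
            super().__init__()
            chs = [1] + [base_ch * (i + 1) for i in range(levels)]          # e.g. [1, 24, 48, 72, 96]
            self.down = nn.ModuleList(ConvBlock(chs[i], chs[i + 1]) for i in range(levels))
            self.bottleneck = ConvBlock(chs[-1], chs[-1])
            self.up = nn.ModuleList(
                ConvBlock(2 * chs[i + 1], chs[i] if i > 0 else base_ch)
                for i in reversed(range(levels)))
            self.out = nn.Conv1d(base_ch, sources, kernel_size=1)

        def forward(self, x):                                               # x: (batch, 1, time)
            skips = []
            for block in self.down:
                x = block(x)
                skips.append(x)                                             # increasingly high-level features (S2)
                x = x[:, :, ::2]                                            # discard every other step: resolution halved
            x = self.bottleneck(x)
            for block, skip in zip(self.up, reversed(skips)):               # upsampling blocks paired in reverse order
                x = F.interpolate(x, size=skip.shape[-1], mode="linear", align_corners=False)
                x = block(torch.cat([x, skip], dim=1))                      # combine into multi-scale features (S4)
            return torch.tanh(self.out(x))                                  # K source estimates in (-1, 1) (S5)

    if __name__ == "__main__":
        mix = torch.randn(1, 1, 16384)                                      # length divisible by 2**levels
        print(CodecSeparator()(mix).shape)                                  # torch.Size([1, 2, 16384])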
In this embodiment, when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered in reverse along the processing sequence, and the output features of each downsampling block are combined with the output features of the upsampling block at the corresponding position in that ordering.
Pairing the blocks in this way combines the current high-level features with the corresponding local, high-resolution features, improving the accuracy of recognition.
Among the L levels, each successive level operates at half the time resolution of the previous level: after a level has been processed, its features are output and the amount of subsequent processing is reduced, so each successive level is set to half the time resolution of the previous level.
Preferably, each downsampling block and each upsampling block includes a convolutional layer and a residual network, with the residual network placed after the convolutional layer to reduce overfitting.
Placing a residual network after a convolutional layer reduces overfitting and thereby improves the audio separation quality.
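A hedged sketch of a block with this layout follows; the two-layer residual body and the LeakyReLU activations are assumed, illustrative choices rather than the patented design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        # Small residual network: two 1-D convolutions with an identity shortcut.
        def __init__(self, ch, kernel=15):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv1d(ch, ch, kernel, padding=kernel // 2),
                nn.LeakyReLU(),
                nn.Conv1d(ch, ch, kernel, padding=kernel // 2),
            )

        def forward(self, x):
            return F.leaky_relu(x + self.body(x))       # shortcut connection, placed after the conv layer

    class DownBlock(nn.Module):
        # Convolutional layer followed by a residual network, as described in the text above.
        def __init__(self, in_ch, out_ch, kernel=15):
            super().__init__()
            self.conv = nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2)
            self.res = ResidualBlock(out_ch, kernel)

        def forward(self, x):
            return self.res(F.leaky_relu(self.conv(x)))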
In this embodiment, for the K source audios to be predicted, the model returns predictions in the interval (-1, 1), one for each sample of each source audio, which improves the separation performance of the prediction.
In this embodiment, Conv1D(x, y) denotes a one-dimensional convolution with x filters of size y; the one-dimensional convolution consists of a zero-padded basic structure and an activation function. In an extension of this basic structure, the Conv1D uses no zero padding, and its input is first cropped centrally so that it has the same number of time steps as the current layer.
The activation function is a LeakyReLU activation function or a tanh activation function. The LeakyReLU activation function theoretically retains all the advantages of ReLU while avoiding the dead-ReLU problem; its negative half is divided by a fixed parameter taken from the interval (1, +infinity) rather than clamped to zero. The value range of the tanh activation function is -1 to 1: negative inputs are mapped to negative values, and an input of 0 is mapped to a value near 0.
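For reference, the two activation functions mentioned above can be written as follows (standard definitions; a is the fixed parameter from the interval (1, +infinity)):

    LeakyReLU(x) = max(x / a, x),                    with a > 1 fixed
    tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)),       with values in (-1, 1)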
As a preferred example of the above embodiment, the audio data processed by the upsampling blocks passes through a one-dimensional convolution whose activation function is the tanh activation function, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Because the model returns predictions in the interval (-1, 1), the one-dimensional convolution applied to the audio data processed by the upsampling blocks uses the tanh activation function, and the activation functions of the other one-dimensional convolutions are set to the LeakyReLU activation function.
In an embodiment, during downsampling, features are decimated by discarding every other time step, and the temporal resolution is halved. During upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation, which is simpler and more convenient than other interpolation methods.
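A short sketch of these two resolution changes, using the same assumed decimation and linear-interpolation operations as in the network sketch above:

    import torch
    import torch.nn.functional as F

    def downsample(x):                      # x: (batch, channels, time)
        return x[:, :, ::2]                 # discard every other time step: temporal resolution halved

    def upsample(x):
        # Double the temporal resolution by linear interpolation along the time axis.
        return F.interpolate(x, scale_factor=2, mode="linear", align_corners=False)

    if __name__ == "__main__":
        x = torch.randn(1, 24, 1024)
        print(downsample(x).shape, upsample(downsample(x)).shape)
        # torch.Size([1, 24, 512]) torch.Size([1, 24, 1024])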
It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are set out in the specification and the drawings only to illustrate the principle of the present invention. Various changes and modifications may be made without departing from the spirit and scope of the present invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims (17)

1. An audio separation method based on mixed features and coding and decoding is characterized by comprising the following steps:
step one: data collection; collecting audio data through a plurality of terminal devices, and caching the audio stream data on a local server in the order of the data sequence;
step two: preliminary training; randomly extracting part of the data, and labeling the endpoints of the voiced and unvoiced portions for reference training;
step three: audio feature extraction; extracting features in the frequency domain, the time domain and the cepstrum domain of the audio signal, and weighting the multiple features through an attention mechanism;
step four: audio blank removal; using the final audio feature representation obtained in step three, removing low-energy regions of the audio through a voice endpoint detection algorithm and a clustering classifier, retaining high-energy regions containing distinct sound, and splicing the segmented effective audio segments;
step five: audio separation; taking the environmental sound and the human voice as two sound sources, using the blank-removed audio waveform data, analyzing the voice waveform of the mixed audio with a coding and decoding network structure, and separating the sound sources by predicting individual time-frequency masks.
2. The audio separation method based on mixing characteristics and coding/decoding of claim 1, wherein in the first step, the data sequence has a sequence order that is a numbering order provided by the device before the data stream is transmitted.
3. The audio separation method based on mixing feature and coding/decoding of claim 1, wherein in the fourth step, when splicing the divided valid audio segments, the two ends of the valid audio segments are smoothed by adding transition signals to the two ends of the audio waveform.
4. The method of claim 1, wherein the environmental sounds are noise affecting speech recognition of the speaker and sounds of other non-dominant speakers.
5. The audio separation method based on mixed features and coding/decoding of claim 1, wherein in the second step, the labeled data is divided into a training set, a validation set and a test set, the training set is used for training the classifier, the validation set is used for validating the training process of the model, and the test set is used for detecting the effect of the final model.
6. The method of claim 5, wherein the proportion of the training set is 60%-90%.
7. The method of claim 5, wherein the classifier is a decision tree-based audio classifier, a KNN-based audio classifier, or a neural network-based audio classifier.
8. The audio separation method based on mixed features and coding/decoding of claim 5, wherein the trained classifier determines the classes of n frames before and after the endpoint obtained by the speech endpoint detection algorithm and corrects the endpoint.
9. The audio separation method based on mixed features and coding-decoding as claimed in claim 1, wherein in the step five, the design of coding-decoding model structure comprises:
s1: setting the model network to L levels, wherein the L levels comprise L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, and computing increasingly high-level features on coarser time scales;
s3: passing the data from S2 through a one-dimensional convolution and then through the L upsampling blocks, and computing local, high-resolution features;
s4: combining the features in S2 and S3 to obtain multi-scale features for prediction;
s5: and performing prediction through multi-scale features, and separating the audio data into K source audios.
10. The audio separation method based on mixture characterization and codec of claim 9, wherein when combining the downsampled blocks with the upsampled block output characteristics, the downsampled blocks are arranged in a forward direction along the processing order, the upsampled blocks are arranged in a reverse direction along the processing order, and each of the downsampled block output characteristics after arrangement is combined with the upsampled block output characteristics of the corresponding arrangement order.
11. The method of claim 9, wherein the run-time resolution of each successive level of the L levels is half of the previous level.
12. The method of claim 9, wherein each of the downsampled block and the upsampled block comprises a convolutional layer and a residual network, and the residual network is located after the convolutional layer to reduce overfitting.
13. The method of claim 9, wherein the model returns the prediction in the interval (-1, 1) for K source audio samples to be predicted, one for each source audio sample.
14. The method of claim 9, wherein the one-dimensional convolution includes a zero-padded basic structure and an activation function, and the activation function is a LeakyReLU activation function or a tanh activation function.
15. The method of claim 14, wherein the upsampled block processed audio data is subjected to a one-dimensional convolution, the one-dimensional convolution is a tanh activation function, and the remaining one-dimensional convolutions are a LeakyReLU activation function.
16. The method of claim 9, wherein, during downsampling, features are decimated by discarding every other time step and the temporal resolution is halved.
17. The audio separation method based on mixture features and codec of claim 9, wherein the upsampling is performed by a multiple of two in the time direction using linear interpolation at the time of the upsampling process.
CN202110753553.4A 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding Active CN113488063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110753553.4A CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110753553.4A CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Publications (2)

Publication Number Publication Date
CN113488063A 2021-10-08
CN113488063B (en) 2023-12-19

Family

ID=77940711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110753553.4A Active CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Country Status (1)

Country Link
CN (1) CN113488063B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035912A (en) * 2022-06-08 2022-09-09 哈尔滨工程大学 Automatic underwater acoustic signal sample labeling method based on MOC model
WO2023241254A1 (en) * 2022-06-15 2023-12-21 腾讯科技(深圳)有限公司 Audio encoding and decoding method and apparatus, electronic device, computer readable storage medium, and computer program product
CN117475360A (en) * 2023-12-27 2024-01-30 南京纳实医学科技有限公司 Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN117854514A (en) * 2024-03-06 2024-04-09 深圳市增长点科技有限公司 Wireless earphone communication decoding optimization method and system for sound quality fidelity


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107182279B (en) * 2010-10-19 2014-03-19 西南电子电信技术研究所 Speech detection method under complicated strong noise environment
CN107610715A (en) * 2017-10-10 2018-01-19 昆明理工大学 A kind of similarity calculating method based on muli-sounds feature
CN108597525A (en) * 2018-04-25 2018-09-28 四川远鉴科技有限公司 Voice vocal print modeling method and device
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110400575A (en) * 2019-07-24 2019-11-01 腾讯科技(深圳)有限公司 Interchannel feature extracting method, audio separation method and device calculate equipment
CN110503976A (en) * 2019-08-15 2019-11-26 广州华多网络科技有限公司 Audio separation method, device, electronic equipment and storage medium
CN110992978A (en) * 2019-12-18 2020-04-10 苏州思必驰信息科技有限公司 Training method and system for audio and video separation model
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
CN111540375A (en) * 2020-04-29 2020-08-14 全球能源互联网研究院有限公司 Training method of audio separation model, and audio signal separation method and device
CN111724807A (en) * 2020-08-05 2020-09-29 字节跳动有限公司 Audio separation method and device, electronic equipment and computer readable storage medium
CN112071329A (en) * 2020-09-16 2020-12-11 腾讯科技(深圳)有限公司 Multi-person voice separation method and device, electronic equipment and storage medium
CN112613657A (en) * 2020-12-22 2021-04-06 国网江苏省电力有限公司苏州供电分公司 Short-term wind speed prediction method for wind power plant
CN112992172A (en) * 2021-01-28 2021-06-18 广州大学 Single-channel time domain bird song separating method based on attention mechanism
CN112820300A (en) * 2021-02-25 2021-05-18 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035912A (en) * 2022-06-08 2022-09-09 哈尔滨工程大学 Automatic underwater acoustic signal sample labeling method based on MOC model
CN115035912B (en) * 2022-06-08 2024-04-26 哈尔滨工程大学 Automatic underwater sound signal sample labeling method based on MOC model
WO2023241254A1 (en) * 2022-06-15 2023-12-21 腾讯科技(深圳)有限公司 Audio encoding and decoding method and apparatus, electronic device, computer readable storage medium, and computer program product
CN117475360A (en) * 2023-12-27 2024-01-30 南京纳实医学科技有限公司 Biological sign extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN
CN117854514A (en) * 2024-03-06 2024-04-09 深圳市增长点科技有限公司 Wireless earphone communication decoding optimization method and system for sound quality fidelity
CN117854514B (en) * 2024-03-06 2024-05-31 深圳市增长点科技有限公司 Wireless earphone communication decoding optimization method and system for sound quality fidelity

Also Published As

Publication number Publication date
CN113488063B (en) 2023-12-19

Similar Documents

Publication Publication Date Title
CN113488063B (en) Audio separation method based on mixed features and encoding and decoding
US10878823B2 (en) Voiceprint recognition method, device, terminal apparatus and storage medium
Hu et al. Pitch‐based gender identification with two‐stage classification
CN104318927A (en) Anti-noise low-bitrate speech coding method and decoding method
CN111292762A (en) Single-channel voice separation method based on deep learning
JP2006079079A (en) Distributed speech recognition system and its method
CN101366078A (en) Neural network classifier for separating audio sources from a monophonic audio signal
Jiang et al. An Improved Speech Segmentation and Clustering Algorithm Based on SOM and K‐Means
CN111951824A (en) Detection method for distinguishing depression based on sound
CN109036470B (en) Voice distinguishing method, device, computer equipment and storage medium
CN111986699B (en) Sound event detection method based on full convolution network
CN112270933B (en) Audio identification method and device
CN109300470B (en) Mixing separation method and mixing separation device
WO2019232867A1 (en) Voice discrimination method and apparatus, and computer device, and storage medium
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN115602165A (en) Digital staff intelligent system based on financial system
CN113823293A (en) Speaker recognition method and system based on voice enhancement
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
CN113782051B (en) Broadcast effect classification method and system, electronic equipment and storage medium
Park et al. Analysis of confidence and control through voice of Kim Jung-un
Hong Speaker gender recognition system
CN114302301A (en) Frequency response correction method and related product
Feng et al. Noise Classification Speech Enhancement Generative Adversarial Network
Camarena-Ibarrola et al. Speaker identification using entropygrams and convolutional neural networks
Smietanka et al. Augmented transformer for speech detection in adverse acoustical conditions

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant