CN113488063B - Audio separation method based on mixed features and encoding and decoding - Google Patents

Audio separation method based on mixed features and encoding and decoding

Info

Publication number
CN113488063B
CN113488063B (application CN202110753553.4A)
Authority
CN
China
Prior art keywords
audio
data
features
method based
separation method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110753553.4A
Other languages
Chinese (zh)
Other versions
CN113488063A (en)
Inventor
吴俊�
颜庆国
董勤伟
查显光
崔林
赵新冬
戴威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Jiangsu Electric Power Co Ltd
Priority to CN202110753553.4A priority Critical patent/CN113488063B/en
Publication of CN113488063A publication Critical patent/CN113488063A/en
Application granted granted Critical
Publication of CN113488063B publication Critical patent/CN113488063B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 - Voice signal separating
    • G10L21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and encoding and decoding, which comprises the following steps: data collection: audio data are collected through a plurality of terminal devices; early training: part of the data is randomly extracted and the endpoints of the voiced and silent parts are labeled; audio feature extraction: features are extracted in the frequency, time and cepstrum domains of the audio signal, and the multiple features are weighted through an attention mechanism; audio blank removal: low-energy regions of the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented fragments are spliced; audio separation: the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio, the sound sources are separated by predicting the corresponding individual time-frequency masks, and audio containing only human voice is output.

Description

Audio separation method based on mixed features and encoding and decoding
Technical Field
The invention relates to the technical field of audio data processing, and in particular to an audio separation method based on mixed features and encoding and decoding.
Background
With the popularization of multi-terminal devices and the progress of high-speed network transmission technology, acquiring massive data is no longer a difficult problem for enterprises. However, the large amount of data obtained often contains much low-quality data, which seriously affects the operation of the enterprise's downstream business. Voice data is a major class of internet data. In scenarios such as online teaching and audio-video conferencing, voice data is an important component of activities such as classroom behavior monitoring, record keeping and human-computer interaction. The voice data stream processed by the back-end service server contains a large amount of voice blank and unrecognizable environmental noise, which increases the processing load of the service server.
One way to alleviate this problem is to expand the number of distributed servers and use additional computing power to process the invalid data, but simply scaling up computing power does not fundamentally solve the problem. Since the concept of the data center was put forward, more and more enterprises have been willing to adopt a centralized mode: the data are first preprocessed with algorithms, and the effective data are then distributed, through data extraction, to each service server for downstream tasks.
The traditional algorithm for removing speech blanks is the voice endpoint detection algorithm (Voice Activity Detection, VAD). Its main task is to accurately locate the start and end points of speech within noisy speech and to separate silence from actual speech; it is a key technique in speech signal processing. As for traditional voice separation technology, methods based on non-negative matrix factorization (Non-negative Matrix Factorization, NMF) and methods based on F0 estimation are the traditional supervised approaches to the voice separation task, and their separation effect is poor.
In view of the above problems, the present inventors, drawing on years of practical experience and professional knowledge in the engineering application of such products, and combining this with the application of theory, have actively studied and innovated in order to create an audio separation method based on mixed features and encoding and decoding that is more practical.
The information disclosed in this background section is only for enhancement of understanding of the general background of the invention and should not be taken as an acknowledgement or any form of suggestion that this information forms the prior art already known to a person skilled in the art.
Disclosure of Invention
The invention provides an audio separation method based on mixed features and encoding and decoding, thereby effectively solving the problems described in the background art.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows: an audio separation method based on mixed features and encoding and decoding, comprising the following steps:
step one: data collection; audio data are collected through a plurality of terminal devices, and the audio stream data are buffered in a local server according to the order of the data sequences;
step two: early training; part of the data is randomly extracted and labeled, marking the endpoints of the voiced part and the silent part for reference training;
step three: audio feature extraction; features are extracted in the frequency domain, time domain and cepstrum domain of the audio signal, and the multiple features are weighted through an attention mechanism;
step four: audio blank removal; low-energy regions in the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented effective audio fragments are spliced;
step five: audio separation; the ambient sound and the human voice are regarded as two sound sources, the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio with an encoding and decoding network structure, and the sound sources are separated by predicting the corresponding individual time-frequency masks.
Further, in the first step, the order of the data sequences is the numbering order provided by each device before its data stream is transmitted.
Further, in the fourth step, when the segmented effective audio fragments are spliced, transition signals are added to the two ends of each audio waveform so as to smooth the two ends of the effective audio fragments.
Further, the ambient sound is noise that affects recognition of the speaker's speech, together with the voices of other, non-dominant speakers.
In the second step, the labeled data are divided into a training set, a verification set and a test set, wherein the training set is used for training the classifier, the verification set is used for verifying the model training process, and the test set is used for detecting the effect of the final model.
Further, the training set accounts for 60% -90%.
Further, the classifier is a decision tree-based audio classifier, a KNN-based audio classifier, or a neural network-based audio classifier.
Further, the trained classifier judges the classes of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
Further, in the fifth step, the design of the encoding and decoding model structure includes:
s1: setting the network to L levels, the network comprising L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on increasingly coarse time scales;
s3: applying a one-dimensional convolution to the data from S2 and then processing it through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features from S2 and S3 to obtain multi-scale features for prediction;
s5: separating the audio data into K source audios by prediction from the multi-scale features.
Further, when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered backward along the processing sequence, and the output feature of each downsampling block is combined with the output feature of the upsampling block at the corresponding position in the ordering.
Further, among the L levels, each successive level operates at half the time resolution of the previous level.
Further, each of the downsampled block and the upsampled block includes a convolutional layer and a residual network, the residual network being located after the convolutional layer to mitigate overfitting.
Further, the model returns predictions within the interval (-1, 1), one for each source audio sample, for the K source audio to be predicted.
Further, the one-dimensional convolution comprises a zero-padded basic architecture and an activation function, the activation function being a LeakyReLU activation function or a tanh activation function.
Further, the audio data output by the upsampling blocks is subjected to one final one-dimensional convolution whose activation function is tanh, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Further, in the downsampling process, the features are decimated by keeping every other time step and discarding the rest, so that the time resolution is halved.
Further, in the upsampling process, the features are upsampled by a factor of two in the time direction using linear interpolation.
The beneficial effects of the invention are as follows: through the steps of data collection, early training, audio feature extraction, audio blank removal and audio separation, the invention addresses the large amount of blank and environmental noise in audio data acquired from multiple terminals. Features are extracted in the frequency, time and cepstrum domains and weighted through an attention mechanism, so that the resulting audio feature data describe the characteristics of the sound better than any single feature. Combining the voice endpoint detection algorithm with the integrated classifier makes the detected voice endpoints more accurate than the traditional approach that uses the endpoint detection algorithm alone, avoiding downstream speech recognition errors caused by removing valid sound. A deep learning model based on the encoding and decoding framework improves the quality of the audio separation; finally, the environmental sound is removed and audio containing only human voice is output.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a flow chart of audio blank removal;
Fig. 3 is a flow chart of audio separation.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments.
In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner" and "outer" are based on the orientations or positional relationships shown in the drawings and are used merely for convenience of describing the present invention and simplifying the description; they do not indicate or imply that the devices or elements referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; the connection may be direct or indirect via an intermediate medium, or may be internal communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
As shown in fig. 1 to 2: an audio separation method based on mixed characteristics and coding and decoding comprises the following steps:
step one: data collection; audio data are collected through a plurality of terminal devices, and the audio stream data are buffered in a local server according to the order of the data sequences;
step two: early training; part of the data is randomly extracted and labeled, marking the endpoints of the voiced part and the silent part for reference training;
step three: audio feature extraction; features are extracted in the frequency domain, time domain and cepstrum domain of the audio signal, and the multiple features are weighted through an attention mechanism;
step four: audio blank removal; low-energy regions in the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented effective audio fragments are spliced;
step five: audio separation; the ambient sound and the human voice are regarded as two sound sources, the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio with an encoding and decoding network structure, and the sound sources are separated by predicting the corresponding individual time-frequency masks.
Through the steps of data collection, early training, audio feature extraction, audio blank removal and audio separation, the method addresses the large amount of blank and environmental noise in audio data acquired from multiple terminals. Features are extracted in the frequency, time and cepstrum domains and weighted through an attention mechanism, so that the resulting audio feature data describe the characteristics of the sound better than any single feature. Combining the voice endpoint detection algorithm with the integrated classifier makes the detected voice endpoints more accurate than the traditional approach that uses the endpoint detection algorithm alone, avoiding downstream speech recognition errors caused by removing valid sound. A deep learning model based on the encoding and decoding framework improves the quality of the audio separation; finally, the environmental sound is removed and audio containing only human voice is output.
The frequency domain features comprise frequency domain energy, subband energy ratio, frame-based spectral centroid, frame-based spectral bandwidth and the like; the time domain features comprise short-time energy, short-time zero-crossing rate, short-time average amplitude difference, short-time average amplitude and the like; the cepstrum domain features comprise mel cepstrum coefficients and linear prediction cepstrum coefficients. In addition, other features such as loudness, loudness range, reflection coefficients, the proportion of silent frames, wavelet coefficients and spectral entropy can also be combined. The various sound features are computed as above, and a weighted average audio feature is obtained through the attention mechanism; the weighting coefficients can be trained by back propagation.
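Purely by way of illustration of this step, the sketch below computes a few of the named features with librosa and fuses the feature groups through a small attention module whose weighting coefficients can be learned by back propagation. The feature selection, dimensions, file name and the AttentionFusion layer are assumptions made for the example, not the patent's reference implementation.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def frame_features(y, sr, n_fft=1024, hop=512):
    """Return per-frame feature groups (time, frequency, cepstrum domains), each shaped (T, d_i)."""
    energy    = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)                 # time domain
    zcr       = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)    # time domain
    centroid  = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)   # frequency domain
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=n_fft, hop_length=hop)  # frequency domain
    mfcc      = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=n_fft, hop_length=hop)     # cepstrum domain
    groups = [np.vstack([energy, zcr]), np.vstack([centroid, bandwidth]), mfcc]
    return [torch.tensor(g.T, dtype=torch.float32) for g in groups]

class AttentionFusion(nn.Module):
    """Project each feature group to a common size and fuse the groups with softmax
    attention weights; the weighting coefficients are learned by back propagation."""
    def __init__(self, group_dims, d_model=32):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in group_dims])
        self.score = nn.Linear(d_model, 1)

    def forward(self, groups):                                                     # list of (T, d_i) tensors
        stacked = torch.stack([p(g) for p, g in zip(self.proj, groups)], dim=1)    # (T, G, d_model)
        weights = torch.softmax(self.score(stacked), dim=1)                        # (T, G, 1)
        return (weights * stacked).sum(dim=1)                                      # (T, d_model)

y, sr = librosa.load("mixture.wav", sr=16000)    # hypothetical input file
groups = frame_features(y, sr)
fusion = AttentionFusion([g.shape[1] for g in groups])
mixed_feature = fusion(groups)                   # attention-weighted multi-domain feature per frame
```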
In this embodiment, in the first step, the sequence of data is the number sequence provided by the device before the data stream is transmitted.
Because the order of the data sequences is the numbering order provided by each device before its data stream is transmitted, the spatial positions of the multiple devices and similar information can be restored, which improves the processing and presentation of the subsequent audio data.
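As one possible illustration of this buffering scheme, the sketch below reassembles audio chunks on the local server in the sequence order assigned by each device before transmission. The packet fields and the AudioStreamBuffer class are hypothetical names introduced only for the example.

```python
import heapq
from collections import defaultdict

class AudioStreamBuffer:
    """Buffer incoming audio chunks per terminal device, ordered by the
    device-provided sequence number assigned before transmission."""
    def __init__(self):
        self._buffers = defaultdict(list)   # one priority queue per device

    def push(self, device_id: str, seq_no: int, chunk: bytes) -> None:
        heapq.heappush(self._buffers[device_id], (seq_no, chunk))

    def drain(self, device_id: str):
        """Yield the buffered chunks of one device in sequence order."""
        buf = self._buffers.pop(device_id, [])
        while buf:
            _, chunk = heapq.heappop(buf)
            yield chunk

buffer = AudioStreamBuffer()
buffer.push("terminal-01", 2, b"\x01\x02")   # chunks may arrive out of order
buffer.push("terminal-01", 1, b"\x03\x04")
ordered = b"".join(buffer.drain("terminal-01"))  # reassembled in sequence order
```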
In the fourth step, when the segmented effective audio fragments are spliced, transition signals are added to the two ends of each audio waveform so as to smooth the two ends of the effective audio fragments.
If the segmented effective audio fragments were simply concatenated, the transition between two fragments would be unnatural and would degrade the subsequent listening experience. By adding transition signals at the two ends of each audio waveform, the ends of the effective audio fragments are smoothed, the transition between fragments becomes more natural, and the output audio sounds better.
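A minimal NumPy sketch of this splicing idea is shown below: each retained high-energy fragment gets a short linear fade-in and fade-out (the "transition signals") before concatenation. The 10 ms fade length is an illustrative choice, not a value fixed by the patent.

```python
import numpy as np

def smooth_and_splice(segments, sr, fade_ms=10.0):
    """segments: list of 1-D arrays of retained audio; returns one spliced waveform."""
    n_fade = max(1, int(sr * fade_ms / 1000.0))
    out = []
    for seg in segments:
        seg = seg.astype(np.float64).copy()
        n = min(n_fade, len(seg) // 2)
        if n > 0:
            ramp = np.linspace(0.0, 1.0, n)
            seg[:n] *= ramp            # transition signal: fade-in at the start
            seg[-n:] *= ramp[::-1]     # transition signal: fade-out at the end
        out.append(seg)
    return np.concatenate(out) if out else np.zeros(0)

sr = 16000
spliced = smooth_and_splice([np.random.randn(sr), np.random.randn(sr // 2)], sr)
```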
In this embodiment, the ambient sound is noise affecting speaker speech recognition and other non-dominant speaker sounds.
In scenarios such as meeting recording, only the voice of the current main speaker needs to be recorded; the voices of other, non-dominant speakers interfere with the main speaker's voice and are therefore treated as environmental sound. This ensures that the main speaker's speech is recognized and improves the effect of the audio processing.
In the second step, the labeled data is divided into a training set, a verification set and a test set, wherein the training set is used for training the classifier, the verification set is used for verifying the model training process, and the test set is used for detecting the effect of the final model.
Part of the data is randomly extracted and the endpoints of the voiced part and the unvoiced part are labeled to provide reference training. The labeled data is divided into a training set, a verification set and a test set: the training set is used to train the classifier, the verification set is used to verify the model training process, and the test set is used to evaluate the final model, which improves the recognition performance of the classifier.
In this embodiment, the training set accounts for 60%-90% of the labeled data; most of the labeled data is used as the training set, and the remainder is used as the verification set and the test set, whose proportions may be equal. Training the classifier on the training set that contains the majority of the labeled data, and then verifying and testing it on the smaller verification and test sets, makes the classification more accurate.
The classifier can be an audio classifier based on a decision tree, an audio classifier based on a K-Nearest Neighbor (KNN) or an audio classifier based on a neural network.
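As a minimal sketch of this training step, the example below fits a KNN audio classifier (one of the options named above) on frame-level features with a train/verification/test split. The 80/10/10 split (within the 60%-90% guideline), the scikit-learn usage and the placeholder data are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# X: per-frame mixed features (e.g. from the attention-fusion step); y: 1 = voiced, 0 = silent
X = np.random.randn(1000, 32)               # placeholder features
y = (np.random.rand(1000) > 0.5).astype(int)  # placeholder labels

X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.8, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)   # train on the majority split
print("verification accuracy:", clf.score(X_val, y_val))          # monitor the training process
print("test accuracy:", clf.score(X_test, y_test))                # evaluate the final model
```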
As a preferable mode of the above embodiment, the trained classifier judges the classes of n frames before and after the endpoint obtained by the voice endpoint detection algorithm, and corrects the endpoint.
When the voice endpoint detection algorithm later divides the audio into blank parts and effective parts, the trained classifier is used to judge whether the detected endpoints are accurate: it classifies the n frames before and after each endpoint obtained by the voice endpoint detection algorithm as valid or invalid data and corrects the endpoint accordingly, which reduces erroneous screening of speech frames and improves the recognition effect.
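One illustrative reading of this correction step is sketched below: for each endpoint returned by the VAD algorithm, the trained classifier labels the n frames on either side, and the endpoint is shifted to the boundary between invalid and valid frames. The boundary-search rule and function names are assumptions, not the patent's exact procedure.

```python
import numpy as np

def correct_endpoint(frame_features, endpoint, classifier, n=5):
    """frame_features: (T, d) array; endpoint: frame index reported by the VAD algorithm.
    Returns a corrected frame index based on the classifier's frame labels."""
    lo = max(0, endpoint - n)
    hi = min(len(frame_features), endpoint + n + 1)
    labels = classifier.predict(frame_features[lo:hi])   # 1 = valid sound, 0 = invalid
    valid = np.flatnonzero(labels == 1)
    # For a speech-start endpoint, snap to the first valid frame in the window;
    # an end endpoint would symmetrically use the last valid frame instead.
    return lo + int(valid[0]) if valid.size else endpoint

# Example usage with the classifier fitted earlier (clf) and hypothetical VAD endpoints:
# corrected = [correct_endpoint(frame_matrix, e, clf, n=5) for e in vad_endpoints]
```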
As shown in fig. 3, in the present embodiment, the codec model structure design includes:
s1: setting the model network to L levels, the model network comprising L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on increasingly coarse time scales;
s3: applying a one-dimensional convolution to the data from S2 and then processing it through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features from S2 and S3 to obtain multi-scale features for prediction;
s5: separating the audio data into K source audios by prediction from the multi-scale features.
The network is set to L levels; the audio data are processed by the downsampling blocks to obtain high-level features and by the upsampling blocks to obtain local high-resolution features, the two are combined into multi-scale features for prediction, and the audio data are separated into K source audios based on these multi-scale features, thereby separating the audio. Since only human voice and environmental sound need to be separated here, K=2. Because the outputs of the downsampling blocks and the upsampling blocks are combined during audio separation, the accuracy of the separation is increased.
In this embodiment, when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered backward along the processing sequence, and the output feature of each downsampling block is combined with the output feature of the upsampling block at the corresponding position in the ordering.
Ordering the blocks in this way means that each current high-level feature is combined with the corresponding local high-resolution feature, which improves the recognition accuracy.
Among the L levels, each successive level operates at half the time resolution of the previous level: each level outputs a feature map and reduces the amount of subsequent processing, so the time resolution of every successive level is set to half that of the previous level.
Preferably, each downsampling block and upsampling block comprises a convolution layer and a residual network, the residual network being located after the convolution layer to mitigate over-fitting.
A residual network is arranged after each convolution layer to mitigate overfitting, thereby improving the quality of the audio separation.
In this embodiment, for the K source audios to be predicted, the model returns predictions within the interval (-1, 1), one for each sample of each source audio, which improves the separation effect.
In this embodiment, Conv1D(x, y) denotes a one-dimensional convolution with x filters of size y. The basic one-dimensional convolution framework consists of zero padding and an activation function; in an extension of this basic framework, the Conv1D does not use zero padding, and its input is first center-cropped so that it has the same number of time steps as the current layer.
The activation function is a LeakyReLU activation function or a tanh activation function. LeakyReLU theoretically retains all the advantages of ReLU without suffering from the dead-ReLU problem, and the slope of its negative part is a fixed parameter; the range of the tanh activation function is -1 to 1, its merit being that negative inputs are mapped to negative values and inputs of 0 are mapped to values near 0.
As a preferred form of the foregoing embodiment, the audio data output by the upsampling blocks is subjected to one final one-dimensional convolution whose activation function is tanh, while the remaining one-dimensional convolutions use the LeakyReLU activation function.
Because the model returns predictions within the interval (-1, 1), the activation function of this final one-dimensional convolution applied after the upsampling blocks is set to tanh, and the activation functions of the remaining one-dimensional convolutions are set to LeakyReLU.
In an embodiment, during downsampling the features are decimated by keeping every other time step and discarding the rest, so the time resolution is halved. During upsampling, the features are upsampled by a factor of two in the time direction using linear interpolation, which, compared with other interpolation methods, is simple and convenient.
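The embodiment described above reads like a Wave-U-Net-style time-domain separator. A compact PyTorch sketch of that reading is given below, with illustrative layer counts, channel widths and kernel sizes; it is not the patent's reference implementation. It keeps the stated ingredients: L downsampling blocks (a convolution layer followed by a residual network, then decimation by keeping every other time step), a bottleneck one-dimensional convolution, L upsampling blocks that upsample by a factor of two with linear interpolation and are combined with the matching downsampling features, LeakyReLU activations inside the network, and a final tanh so the K source predictions fall within (-1, 1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvRes(nn.Module):
    """One block: a 1-D convolution followed by a small residual network (to mitigate overfitting)."""
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, k, padding=k // 2)
        self.res = nn.Conv1d(out_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        x = F.leaky_relu(self.conv(x))
        return x + F.leaky_relu(self.res(x))     # residual connection placed after the conv layer

class CodecSeparator(nn.Module):
    """Encoder-decoder separator with L levels and K output sources, in the spirit of step five."""
    def __init__(self, levels=4, base_ch=24, k_sources=2, kernel=15):
        super().__init__()
        chs = [base_ch * (i + 1) for i in range(levels)]                       # e.g. [24, 48, 72, 96]
        self.down = nn.ModuleList(
            [ConvRes(1 if i == 0 else chs[i - 1], chs[i], kernel) for i in range(levels)])
        self.bottleneck = nn.Conv1d(chs[-1], chs[-1], kernel, padding=kernel // 2)
        ext = chs + [chs[-1]]                                                  # channels entering each decoder level
        self.up = nn.ModuleList(
            [ConvRes(ext[i + 1] + chs[i], chs[i], kernel) for i in reversed(range(levels))])
        self.out = nn.Conv1d(chs[0], k_sources, 1)

    def forward(self, x):                                        # x: (batch, 1, time)
        skips = []
        for block in self.down:
            x = block(x)
            skips.append(x)                                      # increasingly high-level features
            x = x[:, :, ::2]                                     # decimate: keep every other time step
        x = F.leaky_relu(self.bottleneck(x))
        for block, skip in zip(self.up, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-1], mode="linear", align_corners=False)
            x = block(torch.cat([x, skip], dim=1))               # combine with the matching encoder feature
        return torch.tanh(self.out(x))                           # K sources, sample predictions in (-1, 1)

model = CodecSeparator()                          # K = 2: human voice and environmental sound
mix = torch.randn(1, 1, 16384)                    # one batch of blank-removed waveform (length divisible by 2**levels)
voice, ambience = model(mix).unbind(dim=1)        # predicted source waveforms
```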
It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (12)

1. An audio separation method based on mixed characteristics and coding and decoding is characterized by comprising the following steps:
step one: data collection; audio data are collected through a plurality of terminal devices, and the audio stream data are buffered in a local server according to the order of the data sequences;
step two: early training; part of the data is randomly extracted, and the endpoints of the voiced part and the unvoiced part are marked to provide reference training;
step three: audio feature extraction; features are extracted in the frequency domain, time domain and cepstrum domain of the audio signal, and the multiple features are weighted through an attention mechanism;
step four: audio blank removal; low-energy regions in the audio are removed through a voice endpoint detection algorithm and an integrated classifier, high-energy regions containing clear sound are retained, and the segmented effective audio fragments are spliced;
step five: audio separation; the ambient sound and the human voice are regarded as two sound sources, the blank-removed audio waveform data are used to analyze the speech waveform of the mixed audio with the encoding and decoding network structure, and the sound sources are separated by predicting the corresponding individual time-frequency masks;
in the fifth step, the structural design of the encoding and decoding model includes:
s1: setting the model network to L levels, wherein the model network comprises L downsampling blocks and L upsampling blocks;
s2: processing the audio data through the L downsampling blocks, computing increasingly high-level features on increasingly coarse time scales;
s3: applying a one-dimensional convolution to the data from S2 and then processing it through the L upsampling blocks, computing local, high-resolution features;
s4: combining the features in S2 and S3 to obtain a multi-scale feature for prediction;
s5: predicting through the multi-scale features, and separating the audio data into K source audios;
when the output features of the downsampling blocks and the upsampling blocks are combined, the downsampling blocks are ordered forward along the processing sequence, the upsampling blocks are ordered backward along the processing sequence, and the output feature of each downsampling block is combined with the output feature of the upsampling block at the corresponding position in the ordering;
each successive level of the L levels operates at half the time resolution of the previous level;
each downsampling block and each upsampling block comprises a convolution layer and a residual network, the residual network being located after the convolution layer so as to mitigate overfitting;
the model returns predictions within the interval (-1, 1), one for each source audio sample, for the K source audio to be predicted.
2. The audio separation method based on mixed features and codec as claimed in claim 1, wherein in the first step, the sequence of data is the number sequence provided by the device before transmission.
3. The audio separation method based on mixed features and coding and decoding according to claim 1, wherein in the fourth step, when the sliced effective audio segments are spliced, the two ends of the effective audio segments are smoothed by adding transition signals to the two ends of the audio waveform.
4. The method of claim 1, wherein the ambient sound is noise affecting speaker speech recognition and other non-dominant speaker sounds.
5. The audio separation method based on mixed features and coding and decoding according to claim 1, wherein in the second step, the labeled data is divided into a training set, a verification set and a test set, the training set is used for training the classifier, the verification set is used for verifying the model training process, and the test set is used for detecting the effect of the final model.
6. The audio separation method based on mixed features and encoding and decoding according to claim 5, wherein the training set accounts for 60% -90%.
7. The audio separation method based on mixed features and coding and decoding according to claim 5, wherein the classifier is a decision tree-based audio classifier, a KNN-based audio classifier or a neural network-based audio classifier.
8. The audio separation method based on mixed features and codec according to claim 5, wherein the trained classifier judges the class of n frames before and after the endpoint obtained by the speech endpoint detection algorithm and corrects the endpoint.
9. The audio separation method based on hybrid features and codec of claim 1, wherein the one-dimensional convolution comprises a zero-padded basic structure and an activation function, the activation function being a LeakyReLU activation function or a tanh activation function.
10. The audio separation method based on mixed features and codec of claim 9, wherein the audio data processed by the up-sampling block is subjected to a one-dimensional convolution, the one-dimensional convolution is a tanh activation function, and the remaining one-dimensional convolutions are a LeakyReLU activation function.
11. The audio separation method based on mixed features and codec as claimed in claim 1, wherein, in the downsampling process, the features are decimated by keeping every other time step and discarding the rest, so that the time resolution is halved.
12. The audio separation method based on hybrid feature and codec of claim 1, wherein, in the upsampling process, upsampling is performed by a factor of two in the time direction using linear interpolation.
CN202110753553.4A 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding Active CN113488063B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110753553.4A CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110753553.4A CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Publications (2)

Publication Number Publication Date
CN113488063A CN113488063A (en) 2021-10-08
CN113488063B (en) 2023-12-19

Family

ID=77940711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110753553.4A Active CN113488063B (en) 2021-07-02 2021-07-02 Audio separation method based on mixed features and encoding and decoding

Country Status (1)

Country Link
CN (1) CN113488063B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035912B (en) * 2022-06-08 2024-04-26 哈尔滨工程大学 Automatic underwater sound signal sample labeling method based on MOC model
CN115116451A (en) * 2022-06-15 2022-09-27 腾讯科技(深圳)有限公司 Audio decoding method, audio encoding method, audio decoding device, audio encoding device, electronic equipment and storage medium
CN117475360B (en) * 2023-12-27 2024-03-26 南京纳实医学科技有限公司 Biological feature extraction and analysis method based on audio and video characteristics of improved MLSTM-FCN

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107182279B (en) * 2010-10-19 2014-03-19 西南电子电信技术研究所 Speech detection method under complicated strong noise environment
CN107610715A (en) * 2017-10-10 2018-01-19 昆明理工大学 A kind of similarity calculating method based on muli-sounds feature
CN108597525A (en) * 2018-04-25 2018-09-28 四川远鉴科技有限公司 Voice vocal print modeling method and device
CN109935243A (en) * 2019-02-25 2019-06-25 重庆大学 Speech-emotion recognition method based on the enhancing of VTLP data and multiple dimensioned time-frequency domain cavity convolution model
CN110400575A (en) * 2019-07-24 2019-11-01 腾讯科技(深圳)有限公司 Interchannel feature extracting method, audio separation method and device calculate equipment
CN110503976A (en) * 2019-08-15 2019-11-26 广州华多网络科技有限公司 Audio separation method, device, electronic equipment and storage medium
CN110992978A (en) * 2019-12-18 2020-04-10 苏州思必驰信息科技有限公司 Training method and system for audio and video separation model
CN111243619A (en) * 2020-01-06 2020-06-05 平安科技(深圳)有限公司 Training method and device for voice signal segmentation model and computer equipment
CN111540375A (en) * 2020-04-29 2020-08-14 全球能源互联网研究院有限公司 Training method of audio separation model, and audio signal separation method and device
CN111724807A (en) * 2020-08-05 2020-09-29 字节跳动有限公司 Audio separation method and device, electronic equipment and computer readable storage medium
CN112071329A (en) * 2020-09-16 2020-12-11 腾讯科技(深圳)有限公司 Multi-person voice separation method and device, electronic equipment and storage medium
CN112613657A (en) * 2020-12-22 2021-04-06 国网江苏省电力有限公司苏州供电分公司 Short-term wind speed prediction method for wind power plant
CN112992172A (en) * 2021-01-28 2021-06-18 广州大学 Single-channel time domain bird song separating method based on attention mechanism
CN112820300A (en) * 2021-02-25 2021-05-18 北京小米松果电子有限公司 Audio processing method and device, terminal and storage medium

Also Published As

Publication number Publication date
CN113488063A (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN113488063B (en) Audio separation method based on mixed features and encoding and decoding
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
CN110709924B (en) Audio-visual speech separation
CN111508498B (en) Conversational speech recognition method, conversational speech recognition system, electronic device, and storage medium
Hu et al. Pitch‐based gender identification with two‐stage classification
Xiang et al. A nested u-net with self-attention and dense connectivity for monaural speech enhancement
EP2363852A1 (en) Computer-based method and system of assessing intelligibility of speech represented by a speech signal
CN104318927A (en) Anti-noise low-bitrate speech coding method and decoding method
CN112270933B (en) Audio identification method and device
Lu et al. Self-supervised audio spatialization with correspondence classifier
Wu et al. Increasing compactness of deep learning based speech enhancement models with parameter pruning and quantization techniques
CN111540364A (en) Audio recognition method and device, electronic equipment and computer readable medium
CN115602165A (en) Digital staff intelligent system based on financial system
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
Soni et al. State-of-the-art analysis of deep learning-based monaural speech source separation techniques
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
Kaur et al. Maximum likelihood based estimation with quasi oppositional chemical reaction optimization algorithm for speech signal enhancement
Tzudir et al. Low-resource dialect identification in Ao using noise robust mean Hilbert envelope coefficients
Richter et al. Audio-visual speech enhancement with score-based generative models
Smietanka et al. Augmented Transformer for Speech Detection in Adverse Acoustical Conditions
Aung et al. M-Diarization: A Myanmar Speaker Diarization using Multi-scale dynamic weights
Ali et al. Enhancing Embeddings for Speech Classification in Noisy Conditions.
CN117095674B (en) Interactive control method and system for intelligent doors and windows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant