CN116778970B - Voice detection model training method in strong noise environment - Google Patents

Voice detection model training method in strong noise environment

Info

Publication number
CN116778970B
CN116778970B (application CN202311076367.7A)
Authority
CN
China
Prior art keywords
voice
model
noise
data
speech
Prior art date
Legal status
Active
Application number
CN202311076367.7A
Other languages
Chinese (zh)
Other versions
CN116778970A (en)
Inventor
李春霞 (Li Chunxia)
Current Assignee
Changchun Mingxi Technology Co., Ltd.
Original Assignee
Changchun Mingxi Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Changchun Mingxi Technology Co., Ltd.
Priority to CN202311076367.7A
Publication of CN116778970A publication Critical patent/CN116778970A/en
Application granted granted Critical
Publication of CN116778970B publication Critical patent/CN116778970B/en
Legal status: Active


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/84 - Detection of presence or absence of voice signals for discriminating voice from noise
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Noise Elimination (AREA)

Abstract

The application provides a method for training a voice detection model in a strong-noise environment, comprising the following steps: acquiring voice data in a strong-noise environment and preprocessing it; performing sliding-window segmentation and converting the original voice signal into a spectral representation via the Fourier transform; feeding the spectra into a convolutional neural network (CNN) to extract meaningful voice feature data; introducing a bidirectional long short-term memory (BLSTM) progressive learning model to estimate corpus-level progressive ratio masks (PRMs), then incorporating the estimated masks into a minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with an improved optimization algorithm; and continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance. The method adaptively adjusts the trade-off between noise reduction and speech distortion, achieving adaptive optimization across a variety of noise environments; by exploiting the information provided by the PRMs, the model estimates noise more accurately, further improving voice detection.

Description

Voice detection model training method in strong noise environment
Technical Field
The application relates to the technical field of voice detection, and in particular to a method for training a voice detection model in a strong-noise environment.
Background
Training of voice detection models is one of the key technologies and core components required for voice reconnaissance, and detection accuracy directly determines how much voice information can be gathered. Voice application scenarios are steadily growing richer, and different scenarios are typically accompanied by noise; because of this noise, recognition accuracy remains a persistent problem. Although the prior art has improved markedly, existing training methods still achieve insufficient recognition accuracy in specific noise environments, which motivates the present method for training a voice detection model in a strong-noise environment.
Disclosure of Invention
To solve the above problems, the present application proposes a method for training a speech detection model in a strong-noise environment, addressing the insufficient recognition accuracy that existing speech detection models built for a specific noise environment exhibit in actual detection.
The application is realized by the following technical scheme:
the application provides a training method of a voice detection model in a strong noise environment, which comprises the following steps:
s1: acquiring voice data in an in-situ recording in a strong noise environment, and preprocessing the voice data;
s2: sliding window segmentation is carried out on the preprocessed voice data, and original voice signals are converted into frequency spectrum representation through Fourier transformation on each segment;
s3: inputting the frequency spectrum into a convolutional neural network CNN, and automatically extracting meaningful voice characteristic data from the input data;
s4: after a two-way long-short-term memory progressive learning model is introduced according to voice characteristic data to estimate a progressive ratio mask of a corpus level, the estimated progressive ratio mask is incorporated into a minimum control recursive average method program to construct a voice detection model, and parameter optimization is carried out on the model by improving the calculation loss of an optimization algorithm;
s5: and continuously optimizing and fine-tuning the voice detection model according to the user feedback and the model performance.
Further, the step of acquiring voice data from field recordings in a strong-noise environment and preprocessing the voice data comprises:
acquiring, via the voice acquisition module, voice data in different noise environments; removing silent segments based on the audio signal strength of the acquired data; assigning a label to each audio sample; and standardizing the voice data with the Z-Score method.
Further, the step of performing sliding-window segmentation on the preprocessed voice data and converting each segment of the original voice signal into a spectral representation via the Fourier transform comprises:
setting the frame length and frame shift of the window according to the actual task, so that consecutive audio segments overlap; applying a window function to each frame to suppress spectral leakage; and computing the intensity of each frequency component of every frame via the Fourier transform, thereby obtaining the spectrum.
Further, the step of feeding the spectrum into the convolutional neural network CNN and automatically extracting meaningful voice feature data from the input comprises:
after the spectrum is fed into the CNN, the convolution kernels slide over the input data and perform their computations, and the CNN automatically identifies important frequency patterns, harmonic structures, and timbre characteristics.
Further, the step of estimating corpus-level progressive ratio masks with the bidirectional long short-term memory (BLSTM) progressive learning model introduced according to the voice feature data, incorporating the estimated masks into the minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with the improved optimization algorithm comprises:
predicting progressive ratio masks (PRMs) with a BLSTM used as a regression model, wherein the PRMs are generated by the intermediate layers and serve as learning targets corresponding to the ratio between clean speech and noise, i.e. the "mask"; taking log-power-spectrum (LPS) features as the input of the voice detection model and the ideal ratio mask (IRM) as the output to obtain a series of PRMs that help trade off noise reduction against speech distortion; adaptively controlling that trade-off; estimating the noise accurately from the information provided by the PRMs; and computing the loss with the improved optimization algorithm, based on a weighted MMSE criterion over the m target layers, to optimize the parameters.
Further, the step of obtaining a series of progressive ratio masks (PRMs) that help trade off noise reduction against speech distortion, with the log-power-spectrum (LPS) features as the input of the voice detection model and the ideal ratio mask (IRM) as the output, comprises:
The PRMs trade off noise reduction against speech distortion and are defined as

$$\mathrm{PRM}_m(t, f) = \frac{|S(t, f)| + |N_m(t, f)|}{|Y(t, f)|}$$

where $t$ is the time frame, $f$ is the frequency bin, $S(t, f)$ is the short-time Fourier transform of the speech signal at time frame $t$ and frequency bin $f$, $N_m(t, f)$ is the short-time Fourier transform of the noise associated with the $m$-th progressive-ratio-mask target at that T-F unit, and $Y(t, f)$ is the noisy short-time Fourier transform of the input signal at that T-F unit.
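A numpy sketch of this definition; the clipping to [0, 1] and the epsilon guard are assumptions added for numerical safety, and the formula itself is reconstructed from the variable descriptions above:

```python
import numpy as np

def progressive_ratio_mask(S, N_m, Y, eps=1e-8):
    """PRM_m(t, f) = (|S(t, f)| + |N_m(t, f)|) / |Y(t, f)| per T-F unit.
    S, N_m, Y: complex STFT matrices (frames x bins) of the clean speech,
    the noise of the m-th target, and the noisy input, respectively."""
    mask = (np.abs(S) + np.abs(N_m)) / (np.abs(Y) + eps)
    return np.clip(mask, 0.0, 1.0)  # masks are ratios, kept in [0, 1]
```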
Further, the improved optimization algorithm is

$$E = \sum_{m=1}^{M} \alpha_m \left\| \hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b}) - \mathbf{x}^{m} \right\|_2^2$$

where $\alpha_m$ is the weighting factor of the $m$-th target layer, $(\mathbf{W}, \mathbf{b})$ is the set of weight matrices and bias vectors, $\hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b})$ is the neural network output at the $m$-th target layer, and $\mathbf{x}^{m}$ is the corresponding reference target.
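In code, this weighted multi-target MMSE loss reduces to a few lines; the list-of-tensors interface and the example weights are assumptions:

```python
import torch

def weighted_mmse_loss(outputs, targets, alphas):
    """E = sum_m alpha_m * ||x_hat_m - x_m||^2 (mean over elements).
    outputs/targets: lists of (batch, frames, bins) tensors, one pair per
    target layer; alphas: per-layer weighting factors."""
    return sum(a * torch.mean((o - t) ** 2)
               for a, o, t in zip(alphas, outputs, targets))

# e.g. loss = weighted_mmse_loss(masks, prm_targets, alphas=[0.2, 0.3, 0.5])
```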
Further, the step of continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance comprises:
if the model performs poorly in some situations, continuously optimizing the voice detection model by adding noise of various types and levels from a public database to the clean voice data, thereby generating more training samples.
The beneficial effects of the application are as follows: feature data are extracted after the voice data are preprocessed; a bidirectional long short-term memory progressive learning model estimates corpus-level progressive ratio masks from the voice feature data, the estimated masks are incorporated into a minima-controlled recursive averaging procedure to construct the voice detection model, and the model's parameters are optimized by computing the loss with an improved optimization algorithm; the method adaptively adjusts the trade-off between noise reduction and speech distortion, thereby achieving adaptive optimization across a variety of noise environments; moreover, the information provided by the PRMs allows a more accurate detection model to be constructed.
Drawings
Fig. 1 is a flow chart of the method for training a voice detection model in a strong-noise environment according to the present application.
The realization, functional characteristics and advantages of the present application are further described with reference to the accompanying drawings in combination with the embodiments.
Detailed Description
In order to more clearly and completely describe the technical scheme of the application, the application is further described below with reference to the accompanying drawings.
Referring to Fig. 1, the present application proposes a method for training a voice detection model in a strong-noise environment, comprising the following steps:
S1: acquiring voice data from field recordings in a strong-noise environment, and preprocessing the voice data;
S2: performing sliding-window segmentation on the preprocessed voice data, and converting each segment of the original voice signal into a spectral representation via the Fourier transform;
S3: feeding the spectra into a convolutional neural network (CNN), which automatically extracts meaningful voice feature data from the input;
S4: introducing a bidirectional long short-term memory (BLSTM) progressive learning model to estimate corpus-level progressive ratio masks from the voice feature data, incorporating the estimated masks into a minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with an improved optimization algorithm;
S5: continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance.
In a specific implementation, voice data recorded in the field in a strong-noise environment are obtained and preprocessed. Voice is recorded in different noise environments through professional audio equipment such as microphones, and additional recordings are made with devices such as mobile phones; that is, input voice is collected from a variety of device sources. The data should cover as many voice and noise types as possible, including different speakers, accents, speaking rates, and a variety of noise environments. Although the goal is to detect speech in noisy environments, some noise, such as a light breeze or rain, does not help model training; Fourier-transform or spectral-subtraction methods can be used to reduce such extraneous noise. Before the voice data are input into the model, they must be normalized to the same volume level; otherwise the model would be biased toward larger signals and ignore smaller ones.

Sliding-window segmentation is then performed on the preprocessed voice data. Because the audio signal changes over time, the whole signal cannot be processed at once and must first be segmented. Specifically, a window length is chosen, for example 20 ms, and the window is then slid forward at a regular interval, for example 10 ms, taking the signal inside the window as one segment; every two adjacent segments thus overlap by 50%, which preserves the continuity of the signal. Each segment of the original speech signal is converted into a spectral representation by the Fourier transform. The Fourier transform converts a time-domain signal into a frequency-domain signal; in audio processing one is usually interested in the frequency components of the signal, that is, the intensity of each frequency component, rather than the raw waveform. Each segmented signal is therefore converted into a spectral representation: for each segment, the intensities of its frequency components are computed, yielding a spectrum.

The spectrum is input into a convolutional neural network (CNN), which automatically extracts meaningful voice feature data from the input. A CNN is a deep learning model that extracts meaningful features automatically through convolution kernels and pooling layers. Specifically, the convolution kernels slide over the input data and perform their computations, producing new feature maps; the pooling layers then reduce the dimensionality of the feature maps, cutting the amount of computation and strengthening the model's generalization. In audio processing, these strengths let the CNN automatically extract meaningful speech features from the spectrum, identifying important frequency patterns, harmonic structures, timbres, and similar characteristics.

A bidirectional long short-term memory (BLSTM) progressive learning model is then introduced to estimate corpus-level progressive ratio masks from the voice feature data. A BLSTM is a neural network that can gather information from both the forward and backward directions of an input sequence, so it understands context better than a unidirectional LSTM; this is very useful in speech recognition, since speech is continuous and both the preceding and the following context affect the current interpretation. In this framework, the BLSTM is used as a regression model to predict progressive ratio masks (PRMs). The estimated masks are incorporated into a minima-controlled recursive averaging procedure to construct the voice detection model, and the model's parameters are optimized by computing the loss with an improved optimization algorithm. The PRMs are generated by the intermediate layers and serve as learning targets; they correspond to the ratio between clean speech and noise, i.e. the "mask". Log-power-spectrum (LPS) features are used as the model input and the ideal ratio mask (IRM) as the output, yielding a series of PRMs that help balance noise reduction against speech distortion. Concretely, focusing only on noise reduction may suppress noise excessively and distort the speech, while focusing only on preserving voice quality may fail to remove the noise; the PRMs find the balance point between the two, reducing noise while preserving voice quality. Finally, the voice detection model is continuously optimized and fine-tuned according to user feedback and model performance: if the model performs poorly in some situations, it is optimized further by adding noise of various types and levels from a public database to clean voice data, generating more training samples.
In this embodiment, step S1 comprises: acquiring, via the voice acquisition module, voice data in different noise environments, including indoor, outdoor, quiet, and traffic-noise scenes, among others; in each environment a large number of voice samples must be collected to ensure the model performs well under all conditions. Silent segments are removed based on the audio signal strength of the acquired data, because silent segments carry no voice information and only add noise and complexity to the data; removing them improves the model's accuracy and efficiency. A label is assigned to each audio sample so that the model can be trained: labels may be binary, such as whether the sample contains speech, or multi-class, such as the category or emotion of the speech; labels are annotated by professionals or generated by automated tools. The voice data are standardized by the Z-Score method: during preprocessing, data usually must be standardized to eliminate differences in scale. Z-Score standardization, a common method, subtracts the mean and divides by the standard deviation; the resulting data have zero mean and unit standard deviation, which makes the model easier to train and prevents features with large value ranges from dominating the learning process.
In this embodiment, step S2 comprises: setting the frame length and frame shift of the window for the preprocessed voice data according to the actual task, so that consecutive audio segments overlap. When framing the speech signal, a suitable window length and frame shift must be chosen: the window length determines how much data each computation uses, while the frame shift determines the step size of each move; setting these two parameters guarantees a certain overlap between consecutive audio segments, capturing the continuity information in the speech signal. A window function is applied to each frame to suppress spectral leakage, and the intensity of each frequency component is then computed by the Fourier transform, yielding the spectrum. A window function reshapes the input signal, mainly to suppress spectral leakage, a phenomenon that arises when a non-periodic signal is Fourier-transformed and that distorts the result of spectral analysis. After the window function is applied, each frame of the audio signal can be Fourier-transformed; the Fourier transform converts a time-domain signal into a frequency-domain signal and helps analyze the intensity of each frequency component of the speech signal. These frequency components and their corresponding intensities make up the spectrum, from which information such as pitch and timbre can be read more intuitively.
In this embodiment, step S3 comprises: a convolutional neural network (CNN) is a deep learning algorithm; the preceding stage converts the audio signal into spectral form, which is then input into the CNN as a 2-D image. After the spectrum is input, the convolution kernels (filters) slide over the input data and perform their computations, helping the network automatically identify important frequency patterns, harmonic structures, timbre characteristics, and the like. A CNN has two main layer types, convolution layers and pooling layers: in a convolution layer, each kernel convolves the input data and produces a feature map reflecting a specific feature of the input; a pooling layer downsamples the feature maps output by the convolution layer, reducing the dimensionality of the data while retaining the important information. CNNs extract spectral features effectively when processing audio data because they automatically learn and recognize complex patterns and structures, making them an ideal choice for tasks such as audio signal classification and recognition.
In this embodiment, step S4 comprises: predicting progressive ratio masks (PRMs) with a BLSTM used as a regression model. A bidirectional long short-term memory network (BLSTM) is a recurrent neural network that captures not only past but also future information; when processing audio data, it can serve as a regression model to predict the progressive ratio masks. The PRMs are generated by the intermediate layers and serve as learning targets; they correspond to the ratio between clean speech and noise, i.e. the "mask". Log-power-spectrum (LPS) features, one of the important representations of an audio signal, serve as the model input; the ideal ratio mask (IRM), an effective masking strategy that defines the ideal energy proportion each frequency component should have in order to extract clean speech from a noisy signal, serves as the output. This yields a series of PRMs that help balance noise reduction against speech distortion; by adaptively controlling that trade-off, the noise can be estimated accurately from the information the PRMs provide. The loss is then computed with the improved optimization algorithm according to a weighted minimum-mean-square-error (MMSE) criterion over the m target layers, so as to optimize the parameters; because the criterion considers m target layers, the model becomes finer-grained and better captures the complex structures and patterns of the audio signal. Through continuous optimization and adjustment, the model's performance keeps improving, handling audio signals in high-noise environments ever better. The PRMs trade off noise reduction against speech distortion and are defined as

$$\mathrm{PRM}_m(t, f) = \frac{|S(t, f)| + |N_m(t, f)|}{|Y(t, f)|}$$

where $t$ is the time frame, $f$ is the frequency bin, $S(t, f)$ is the short-time Fourier transform of the speech signal at time frame $t$ and frequency bin $f$, $N_m(t, f)$ is the short-time Fourier transform of the noise associated with the $m$-th progressive-ratio-mask target at that T-F unit, and $Y(t, f)$ is the noisy short-time Fourier transform of the input signal at that T-F unit.

The improved optimization algorithm is

$$E = \sum_{m=1}^{M} \alpha_m \left\| \hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b}) - \mathbf{x}^{m} \right\|_2^2$$

where $\alpha_m$ is the weighting factor of the $m$-th target layer, $(\mathbf{W}, \mathbf{b})$ is the set of weight matrices and bias vectors, $\hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b})$ is the neural network output at the $m$-th target layer, and $\mathbf{x}^{m}$ is the corresponding reference target. By combining the advantage of estimating PRMs within a progressive learning framework with the conventional IMCRA, satisfactory ASR results are achieved without retraining the acoustic model: corpus-level PRMs are estimated by the BLSTM progressive learning model; the estimated PRMs are incorporated into the improved minima-controlled recursive averaging (IMCRA) procedure; and the PRMs are then combined with the IMCRA gain function, providing a new gain function for recovering the clean speech signal frame by frame, as sketched below. The method adaptively adjusts the trade-off between noise reduction and speech distortion, thereby achieving adaptive optimization across a variety of noise environments; moreover, the information provided by the PRMs supports a more accurate detection model that estimates the noise more precisely.
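The following heavily simplified sketch illustrates the idea; the full IMCRA minima tracking and bias compensation are omitted, and using the estimated PRM directly as a speech-presence proxy, as well as the Wiener-style gain and its floor, are assumptions for illustration only:

```python
import numpy as np

def prm_guided_noise_update(noise_psd, Y, prm, alpha=0.85):
    """Recursive averaging of the noise PSD: T-F units the PRM marks as
    speech-dominated (prm near 1) keep the old estimate, while
    noise-dominated units (prm near 0) update quickly toward |Y|^2."""
    alpha_tf = alpha + (1.0 - alpha) * prm      # per-T-F smoothing factor
    return alpha_tf * noise_psd + (1.0 - alpha_tf) * np.abs(Y) ** 2

def spectral_gain(Y, noise_psd, floor=0.05, eps=1e-12):
    """Wiener-style gain from the updated noise estimate, with a gain floor
    to limit speech distortion; applied frame by frame to recover speech."""
    snr_post = np.abs(Y) ** 2 / (noise_psd + eps)
    gain = np.maximum(1.0 - 1.0 / np.maximum(snr_post, 1.0), floor)
    return gain * Y                             # enhanced STFT frame
```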
Of course, the present application can be implemented in various other embodiments; based on this embodiment, those skilled in the art can derive other embodiments without any inventive effort, all of which fall within the scope of the present application.

Claims (4)

1. A method for training a voice detection model in a strong-noise environment, characterized by comprising the following steps:
S1: acquiring voice data from field recordings in a strong-noise environment, and preprocessing the voice data;
S2: performing sliding-window segmentation on the preprocessed voice data, and converting each segment of the original voice signal into a spectral representation via the Fourier transform;
S3: feeding the spectra into a convolutional neural network CNN, which automatically extracts meaningful voice feature data from the input;
S4: introducing a bidirectional long short-term memory progressive learning model to estimate corpus-level progressive ratio masks from the voice feature data, incorporating the estimated masks into a minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with an improved optimization algorithm;
S5: continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance;
wherein the step of feeding the spectrum into the convolutional neural network CNN and automatically extracting meaningful voice feature data from the input comprises:
after the spectrum is fed into the CNN, the convolution kernels slide over the input data and perform their computations, and the CNN automatically identifies important frequency patterns, harmonic structures, and timbre characteristics;
wherein the step of estimating corpus-level progressive ratio masks with the bidirectional long short-term memory progressive learning model introduced according to the voice feature data, incorporating the estimated masks into the minima-controlled recursive averaging procedure to construct the voice detection model, and optimizing the model's parameters by computing the loss with the improved optimization algorithm comprises:
predicting progressive ratio masks PRMs with a BLSTM used as a regression model, wherein the PRMs are generated by the intermediate layers and serve as learning targets corresponding to the ratio between clean speech and noise, i.e. the "mask"; taking log-power-spectrum LPS features as the input of the voice detection model and the ideal ratio mask IRM as the output to obtain a series of progressive ratio masks PRMs that help trade off noise reduction against speech distortion; adaptively controlling that trade-off; estimating the noise accurately from the information provided by the PRMs; and computing the loss with the improved optimization algorithm, based on a weighted MMSE criterion over the m target layers, to optimize the parameters; wherein the step of obtaining the series of progressive ratio masks PRMs that help trade off noise reduction against speech distortion, with the log-power-spectrum LPS features as the input of the voice detection model and the ideal ratio mask IRM as the output, comprises:
the PRMs trade off noise reduction against speech distortion and are defined as

$$\mathrm{PRM}_m(t, f) = \frac{|S(t, f)| + |N_m(t, f)|}{|Y(t, f)|}$$

where $t$ is the time frame, $f$ is the frequency bin, $S(t, f)$ is the short-time Fourier transform of the speech signal at time frame $t$ and frequency bin $f$, $N_m(t, f)$ is the short-time Fourier transform of the noise associated with the $m$-th progressive-ratio-mask target at that T-F unit, and $Y(t, f)$ is the noisy short-time Fourier transform of the input signal at that T-F unit; and wherein the improved optimization algorithm is

$$E = \sum_{m=1}^{M} \alpha_m \left\| \hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b}) - \mathbf{x}^{m} \right\|_2^2$$

where $\alpha_m$ is the weighting factor of the $m$-th target layer, $(\mathbf{W}, \mathbf{b})$ is the set of weight matrices and bias vectors, $\hat{\mathbf{x}}^{m}(\mathbf{W}, \mathbf{b})$ is the neural network output at the $m$-th target layer, and $\mathbf{x}^{m}$ is the corresponding reference target.
2. The method for training a voice detection model in a strong-noise environment according to claim 1, wherein the step of acquiring voice data from field recordings in a strong-noise environment and preprocessing the voice data comprises:
acquiring, via the voice acquisition module, voice data in different noise environments; removing silent segments based on the audio signal strength of the acquired data; assigning a label to each audio sample; and standardizing the voice data with the Z-Score method.
3. The method according to claim 1, wherein the step of performing sliding-window segmentation on the preprocessed voice data and converting each segment of the original voice signal into a spectral representation via the Fourier transform comprises:
setting the frame length and frame shift of the window according to the actual task, so that consecutive audio segments overlap; applying a window function to each frame to suppress spectral leakage; and computing the intensity of each frequency component of every frame via the Fourier transform, thereby obtaining the spectrum.
4. The method according to claim 1, wherein the step of continuously optimizing and fine-tuning the voice detection model according to user feedback and model performance comprises:
if the model performs poorly in some situations, continuously optimizing the voice detection model by adding noise of various types and levels from a public database to the clean voice data, thereby generating more training samples.
CN202311076367.7A 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment Active CN116778970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311076367.7A CN116778970B (en) 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311076367.7A CN116778970B (en) 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment

Publications (2)

Publication Number Publication Date
CN116778970A CN116778970A (en) 2023-09-19
CN116778970B true CN116778970B (en) 2023-11-24

Family

ID=87993485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311076367.7A Active CN116778970B (en) 2023-08-25 2023-08-25 Voice detection model training method in strong noise environment

Country Status (1)

Country Link
CN (1) CN116778970B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10672414B2 (en) * 2018-04-13 2020-06-02 Microsoft Technology Licensing, Llc Systems, methods, and computer-readable media for improved real-time audio processing

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107452389A (en) * 2017-07-20 2017-12-08 大象声科(深圳)科技有限公司 A kind of general monophonic real-time noise-reducing method
WO2020042706A1 (en) * 2018-08-31 2020-03-05 大象声科(深圳)科技有限公司 Deep learning-based acoustic echo cancellation method
CN110473564A (en) * 2019-07-10 2019-11-19 西北工业大学深圳研究院 A kind of multi-channel speech enhancement method based on depth Wave beam forming
CN110867181A (en) * 2019-09-29 2020-03-06 北京工业大学 Multi-target speech enhancement method based on SCNN and TCNN joint estimation
CN113077812A (en) * 2021-03-19 2021-07-06 北京声智科技有限公司 Speech signal generation model training method, echo cancellation method, device and equipment
CN113936681A (en) * 2021-10-13 2022-01-14 东南大学 Voice enhancement method based on mask mapping and mixed hole convolution network

Also Published As

Publication number Publication date
CN116778970A (en) 2023-09-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant