CN111613211A - Method and device for processing specific word voice - Google Patents
- Publication number
- CN111613211A (application number CN202010307655.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- trained
- tested
- net model
- masking value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a method and a device for processing specific word voice. The method comprises the following steps: acquiring a noisy voice to be trained; extracting a first characteristic of the voice to be trained; inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model; acquiring a voice to be tested and extracting a second characteristic of the voice to be tested; and inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a noise-reduced voice of the voice to be tested. The technical scheme of the invention can fully and effectively improve both the noise-reduction quality and the keyword-detection efficiency for noisy voice.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for processing a specific word speech.
Background
At present, a large number of smart-home, mobile automation and other voice-interaction devices have appeared on the market, such as smart speakers, Amazon Alexa and Apple Siri. These devices need a specific word detection system to wake them up before voice interaction. However, such systems generally detect well only in relatively quiet scenes and perform poorly under noise; that is, prior-art specific word detection methods work well only on voice recorded in a relatively quiet environment, and their performance drops off a cliff in noisy scenes, so keyword detection in noisy voice is inaccurate.
Disclosure of Invention
The embodiment of the invention provides a method and a device for processing specific word voice. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a method for processing a specific word speech, including:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
and inputting the second characteristics into the target U-NET model to judge whether the voice to be tested has specific word voice or not and obtain noise reduction voice of the voice to be tested.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, calculating a model loss function according to the first estimated masking value, the estimation result, the true masking value, and the true judgment result includes:
calculating the model loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θ_pure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θ_mixture represents the phase of the voice to be trained in the frequency domain space.
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
According to a second aspect of the embodiments of the present invention, there is provided a processing apparatus for a specific word speech, including:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
and the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voice exists in the voice to be tested and obtain noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, the training submodule is further configured to:
calculating the model loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θ_pure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θ_mixture represents the phase of the voice to be trained in the frequency domain space.
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method comprises the steps of inputting first characteristics of voice to be trained into a U-NET model to be trained to obtain a target U-NET model with higher maturity and accuracy after training, then extracting second characteristics of the voice to be tested after the voice to be tested is obtained, inputting the second characteristics into the target U-NET model with higher accuracy to obtain noise-reducing voice of the voice to be tested, namely pure voice except noise in the voice to be tested, and judging whether specific word voice exists in the voice to be tested, so that noise-reducing quality and detection efficiency of keywords in the voice with noise are fully and effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1A is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
FIG. 1B is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
Fig. 2 is a block diagram illustrating an apparatus for processing a specific word speech according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In order to solve the above technical problem, an embodiment of the present invention provides a method for processing a specific word speech, where the method is applicable to a specific word speech processing program, system or device, and an execution subject corresponding to the method may be a terminal or a server, as shown in fig. 1A, and the method includes steps S101 to S105:
in step S101, a speech to be trained with noise is acquired;
the speech to be trained is mixed and obtained in a simulation mode, and the obtained method is to add different types of noise to clean speech at different signal-to-noise ratios.
In step S102, a first characteristic of the voice to be trained is extracted; the first characteristic is the amplitude of the voice to be trained in the frequency domain space, i.e. the modulus of its complex frequency-domain representation. The first characteristic and the second characteristic are the same single feature, the amplitude of the voice: the training stage trains on this characteristic, and the testing stage likewise feeds the amplitude characteristic into the trained model (i.e. the target U-NET model) to obtain the noise-reduced result.
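The amplitude feature extraction can be sketched in NumPy (a minimal illustration; frame length, hop size and the function name are assumed values, not specified by the patent):

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Magnitude (modulus) of the complex STFT -- the amplitude feature fed to the model."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)  # complex frequency-domain values
    return np.abs(spectrum)                 # modulus of the complex value

x = np.sin(2 * np.pi * 1000 * np.arange(4096) / 16000)  # 1 kHz tone at 16 kHz
feat = stft_magnitude(x)
print(feat.shape)  # (15, 257) = (n_frames, frame_len // 2 + 1)
```

For a 1 kHz tone sampled at 16 kHz with a 512-point frame, the energy concentrates around frequency bin 32, which is a quick sanity check on the feature.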
In step S103, the first characteristic is input into a U-NET model to be trained (a deep-learning model) to obtain a target U-NET model; the U-NET model is a U-shaped network structure that can be used both for noise reduction or enhancement of noisy voice and for detection of keywords in the voice.
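The U-shaped data flow can be illustrated with a toy, weight-free sketch (an assumption for illustration only: `down`, `up` and `tiny_unet` are hypothetical names, averaging replaces learned convolutions, and the patent's actual U-NET is a trained deep network):

```python
import numpy as np

def down(x):
    # Encoder step: halve the resolution (stand-in for conv + pooling).
    return x.reshape(-1, 2).mean(axis=1)

def up(x):
    # Decoder step: double the resolution (stand-in for upsampling).
    return np.repeat(x, 2)

def tiny_unet(feat):
    """Toy U-shaped pass: encode, decode, and fuse skip connections."""
    e1 = down(feat)        # length/2
    e2 = down(e1)          # length/4 (bottleneck)
    d1 = up(e2)            # length/2
    d1 = (d1 + e1) / 2     # skip connection from encoder level 1
    d0 = up(d1)            # full length
    d0 = (d0 + feat) / 2   # skip connection from the input level
    return 1 / (1 + np.exp(-d0))  # sigmoid squashes output to a mask in (0, 1)

mask = tiny_unet(np.zeros(16))
```

The essential point the sketch shows is the symmetric encoder/decoder with skip connections, producing a per-bin mask of the same length as the input feature.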
In step S104, acquiring a voice to be tested, and extracting a second feature of the voice to be tested;
the speech to be tested is recorded without mixing.
In step S105, the second feature is input to the target U-NET model to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
By inputting the first characteristic of the voice to be trained into the U-NET model to be trained, a trained target U-NET model of higher maturity and accuracy can be obtained. After the voice to be tested is acquired, its second characteristic is extracted and input into this more accurate target U-NET model, which judges whether specific word voice exists in the voice to be tested and outputs the noise-reduced voice, i.e. the pure voice with the noise removed (for example, if only the specific word voice is needed, everything except the specific word voice can be filtered out). The method and the device can therefore fully and effectively improve the noise-reduction quality and the detection efficiency of keywords or specific words in noisy voice, and in turn improve the voice wake-up accuracy and timeliness of voice-interaction devices. The specific word voice may be the voice of a particular word, such as a wake-up word.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
By inputting the first characteristic into the U-NET model to be trained, a first estimated masking value PSM (Phase Sensitive Mask) corresponding to the voice to be trained can be obtained, together with an estimate of whether the voice to be trained includes the preset voice, i.e. whether it contains a certain specified keyword. The U-NET model to be trained is then trained further according to the first estimated masking value and the estimation result, yielding an optimized and upgraded target U-NET model. This makes it convenient to use the target U-NET model to denoise noisy voice accurately, and improves the detection efficiency and accuracy of keywords in noisy voice.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice; the preset speech may also be a speech of a specific word or keyword, and may be the same as or different from the specific word sound.
Calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
When the U-NET model is optimized, an accurate model loss function can be calculated from the first estimated masking value, the estimation result, the real masking value and the real judgment result. The U-NET model to be trained is then adjusted using this loss function, and the adjustment process can be repeated cyclically to obtain an optimized and upgraded target U-NET model, so that noisy voice can be denoised accurately and the detection efficiency and accuracy of keywords in noisy voice improved.
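The adjust-and-repeat loop can be sketched with a toy stand-in (an illustrative assumption throughout: a single scalar weight replaces the U-NET, synthetic arrays replace real features and masks, and only the MAE term over the masking value is shown; the subgradient of the MAE drives the update):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(200)                    # stand-in amplitude features
psm_true = np.clip(0.7 * x, 0.0, 1.0)  # stand-in "real" masking values

def mae(a, b):
    return np.abs(a - b).mean()

w = 0.0                                # toy model: psm_est = w * x
losses = []
for _ in range(300):
    psm_est = w * x
    losses.append(mae(psm_est, psm_true))
    # Subgradient of MAE with respect to w; sign() handles the |.| kink.
    grad = np.mean(np.sign(psm_est - psm_true) * x)
    w -= 0.05 * grad                   # adjust the model, then repeat
```

The loss decreases toward convergence, mirroring the cyclic adjustment described above; in the real method the update would instead backpropagate through the U-NET's weights.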
In one embodiment, calculating a model loss function according to the first estimated masking value, the estimation result, the true masking value, and the true judgment result includes:
calculating the model loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

LABEL takes the value 1 when the voice to be trained includes the preset voice and 0 when it does not; correspondingly, at the test stage the label is 1 when the voice to be tested includes the specific word voice and 0 when it does not.

The real masking value PSM is obtained by calculation using a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θ_pure represents the phase (the argument of the complex frequency-domain representation) of the pure voice corresponding to the voice to be trained, and θ_mixture represents the phase of the voice to be trained in the frequency domain space.
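The per-bin phase-sensitive mask computation can be sketched in NumPy (a minimal illustration; the function name and the small guard against division by zero are assumptions):

```python
import numpy as np

def phase_sensitive_mask(pure_spec, mixture_spec):
    """PSM = (|pure| / |mixture|) * cos(theta_pure - theta_mixture),
    computed bin-by-bin on complex STFT spectra."""
    ratio = np.abs(pure_spec) / np.maximum(np.abs(mixture_spec), 1e-12)
    phase_diff = np.angle(pure_spec) - np.angle(mixture_spec)
    return ratio * np.cos(phase_diff)

# Sanity check: with no noise, mixture == pure and the mask is exactly 1.
spec = np.array([1 + 1j, 0.5 - 2j, 3 + 0j])
print(phase_sensitive_mask(spec, spec))  # [1. 1. 1.]
```

When the mixture carries extra energy or a phase shift relative to the clean voice, the mask drops below 1, which is what lets it attenuate noisy bins.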
When the model loss function Loss is calculated with the above formula, the mean absolute error (MAE) serves as the convergence criterion: training stops once the loss function converges, yielding the best-optimized target U-NET model and hence the best voice-detection and noise-reduction effect.
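The two-part loss, an MAE term on the masking value plus an MAE term on the utterance label, can be sketched as follows (a minimal illustration; the function name and the example values are assumptions):

```python
import numpy as np

def model_loss(psm_est, label_est, psm_true, label_true):
    """Loss = MAE over the mask estimate plus MAE over the utterance label."""
    mae = lambda a, b: np.mean(np.abs(a - b))
    return mae(psm_est, psm_true) + mae(label_est, label_true)

# Toy values: three time-frequency bins and one utterance-level label.
psm_true = np.array([0.9, 0.1, 0.5])
loss = model_loss(np.array([0.8, 0.2, 0.5]), 0.9, psm_true, 1.0)
```

Summing the two MAE terms trains the masking (noise-reduction) branch and the detection branch jointly, which is what couples denoising and keyword detection in one model.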
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space, i.e. the modulus of its complex frequency-domain representation;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
After the target U-NET model is obtained, the second characteristic of the voice to be tested can be input into it to judge whether specific word voice really exists in the voice to be tested, so that the presence of a given keyword can be accurately identified, and to obtain a second estimated masking value PSM. A short-time Fourier transform (STFT) is then applied to the voice to be tested to obtain its frequency spectrum; multiplying the spectrum by the second estimated masking value and applying the inverse transform (ISTFT) yields a good noise-reduction effect.
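The mask-multiply-and-invert step can be sketched with SciPy's STFT/ISTFT pair (an illustrative assumption: an all-ones mask stands in for the model's second estimated masking value, so the signal passes through unchanged; a real PSM would attenuate noisy bins):

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
tested = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(fs)  # noisy voice

# Spectrum of the voice to be tested.
freqs, times, spec = stft(tested, fs=fs, nperseg=512)

# Stand-in for the second estimated masking value produced by the model.
psm_est = np.ones_like(spec, dtype=float)

# Multiply mask and spectrum, then apply the inverse short-time Fourier transform.
_, denoised = istft(psm_est * spec, fs=fs, nperseg=512)
```

With a unit mask the roundtrip reconstructs the input, which verifies the multiply-then-invert plumbing; substituting a model-estimated PSM performs the actual noise reduction.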
The technical solution of the present invention will be further described in detail with reference to fig. 1B:
step 1: generating data. Mix original specific word data with various types of noise at different signal-to-noise ratios (-5 to 15 dB), and likewise mix non-specific word data with noise at different signal-to-noise ratios; the mixed voice serves as the training data. A verification set is generated in the same way, with noise types, signal-to-noise ratios and speakers different from those of the training set. The training set is used to train the model; the verification set only supervises the model and does not participate in error backpropagation;
step 2: extracting characteristics. Compute the short-time Fourier transform of each training utterance, then normalize its amplitude to serve as the input of the model;
and 3, step 3: and calculating a training target, wherein the training target consists of two parts. A phase sensitive mask (true PSM) is computed, in part, for the trained mixed speech (mix) and its corresponding clean speech (pure), as follows:
Where | represents amplitude, θ represents phase; the other part is a LABEL (LABEL) of the whole voice, the specific word phonetic symbol is marked as 1, and the non-specific word phonetic symbol is marked as 0;
and 4, step 4: inputting the extracted features into a U-NET network model for training, using an average absolute error MAE (mean absolute error) as a convergence criterion, stopping training until a loss function converges, and storing the model, wherein the loss function is defined as follows:
wherein the content of the first and second substances,andrespectively, the model estimated PSM and LABEL.
In the testing stage, the characteristics of the tested voice are passed through the trained model to obtain both a judgment of whether the tested voice contains a specific word and an estimated PSM; the PSM is multiplied by the frequency spectrum of the tested voice (obtained by STFT) and an inverse Fourier transform is applied to obtain the noise-reduced voice.
Finally, it should be noted that the above embodiments can be freely combined by those skilled in the art according to actual needs.
Corresponding to the method for processing the specific word speech provided in the embodiment of the present invention, an embodiment of the present invention further provides a device for processing the specific word speech, as shown in fig. 2, where the device includes:
an obtaining module 201, configured to obtain a voice to be trained with noise;
an extracting module 202, configured to extract a first feature of the speech to be trained;
the input module 203 is used for inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
the first processing module 204 is configured to obtain a voice to be tested, and extract a second feature of the voice to be tested;
the second processing module 205 is configured to input the second feature to the target U-NET model, so as to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
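A sketch of this training criterion, assuming (as the definitions below suggest) that the first preset formula sums the mean absolute errors of the two model outputs against their ground truths; `model_loss` is a hypothetical name:

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between an estimate and its ground truth."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))

def model_loss(psm_est, label_est, psm_true, label_true):
    """Assumed form of the first preset formula:
    Loss = MAE(estimated PSM, real PSM) + MAE(estimated label, real label)."""
    return mae(psm_est, psm_true) + mae(label_est, label_true)

# Perfect estimates drive the loss to zero.
psm_true = np.array([0.2, 0.8, 0.4])
label_true = np.array([1.0])
loss = model_loss(psm_true, label_true, psm_true, label_true)
```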
In one embodiment, the training submodule is further configured to:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM′, PSM) + MAE(LABEL′, LABEL)

wherein PSM′ and LABEL′ respectively represent the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;
the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θpure − θmixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θpure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θmixture represents the phase of the voice to be trained in the frequency domain space.
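The second preset formula is the standard phase-sensitive mask; a bin-wise sketch on complex STFT spectra (hypothetical function name, with a small epsilon added to avoid division by zero):

```python
import numpy as np

def phase_sensitive_mask(pure_spec, mixture_spec, eps=1e-8):
    """PSM = (|pure| / |mixture|) * cos(theta_pure - theta_mixture),
    computed element-wise on complex spectra."""
    ratio = np.abs(pure_spec) / (np.abs(mixture_spec) + eps)
    return ratio * np.cos(np.angle(pure_spec) - np.angle(mixture_spec))

# First bin: half the mixture amplitude and in phase -> mask 0.5.
# Second bin: equal amplitude but 90 degrees out of phase -> mask ~0.
s = np.array([1.0 + 0.0j, 0.0 + 1.0j])       # "pure" spectrum
m_spec = np.array([2.0 + 0.0j, 1.0 + 0.0j])  # "mixture" spectrum
m = phase_sensitive_mask(s, m_spec)
```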
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (10)
1. A method for processing a specific word speech, comprising:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
and inputting the second characteristics into the target U-NET model to judge whether the voice to be tested has specific word voice or not and obtain noise reduction voice of the voice to be tested.
2. The method of claim 1,
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model, wherein the method comprises the following steps:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
3. The method of claim 2,
the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model comprises:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
4. The method of claim 3,
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result, including:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM′, PSM) + MAE(LABEL′, LABEL)

wherein PSM′ and LABEL′ respectively represent the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;
the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θpure − θmixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θpure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θmixture represents the phase of the voice to be trained in the frequency domain space.
5. The method according to any one of claims 1 to 4,
the inputting the second characteristic into the target U-NET model to judge whether the voice to be tested has a specific word voice and obtain the noise reduction voice of the voice to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
6. An apparatus for processing a specific word speech, comprising:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
and the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voice exists in the voice to be tested and obtain noise reduction voice of the voice to be tested.
7. The apparatus of claim 6,
the input module includes:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimated masking value and the estimation result so as to obtain the target U-NET model.
8. The apparatus of claim 7,
the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
9. The apparatus of claim 8,
the training submodule is further specifically configured to:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM′, PSM) + MAE(LABEL′, LABEL)

wherein PSM′ and LABEL′ respectively represent the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;
the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θpure − θmixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θpure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θmixture represents the phase of the voice to be trained in the frequency domain space.
10. The apparatus according to any one of claims 6 to 9,
the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010307655.9A CN111613211B (en) | 2020-04-17 | 2020-04-17 | Method and device for processing specific word voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111613211A true CN111613211A (en) | 2020-09-01 |
CN111613211B CN111613211B (en) | 2023-04-07 |
Family
ID=72203952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010307655.9A Active CN111613211B (en) | 2020-04-17 | 2020-04-17 | Method and device for processing specific word voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111613211B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798455A (en) * | 2023-02-07 | 2023-03-14 | 深圳元象信息科技有限公司 | Speech synthesis method, system, electronic device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN108986835A (en) * | 2018-08-28 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Based on speech de-noising method, apparatus, equipment and the medium for improving GAN network |
CN109461456A (en) * | 2018-12-03 | 2019-03-12 | 北京云知声信息技术有限公司 | A method of it promoting voice and wakes up success rate |
CN110060704A (en) * | 2019-03-26 | 2019-07-26 | 天津大学 | A kind of sound enhancement method of improved multiple target criterion study |
WO2019232851A1 (en) * | 2018-06-04 | 2019-12-12 | 平安科技(深圳)有限公司 | Method and apparatus for training speech differentiation model, and computer device and storage medium |
CN110600017A (en) * | 2019-09-12 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Training method of voice processing model, voice recognition method, system and device |
Non-Patent Citations (1)
Title |
---|
JINKYU LEE ET AL: "Phase-sensitive Joint Learning Algorithms for Deep Learning-based Speech Enhancement" * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||