CN111613211A - Method and device for processing specific word voice - Google Patents
- Publication number
- CN111613211A (application number CN202010307655.9A)
- Authority
- CN
- China
- Prior art keywords
- voice
- trained
- tested
- net model
- masking value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention relates to a method and a device for processing specific word voice. The method comprises the following steps: acquiring a noisy voice to be trained; extracting a first characteristic of the voice to be trained; inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model; acquiring a voice to be tested and extracting a second characteristic of the voice to be tested; and inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a noise-reduced voice of the voice to be tested. The technical scheme of the invention can fully and effectively improve both the noise-reduction quality and the keyword-detection efficiency for noisy voice.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for processing a specific word speech.
Background
At present, a large number of smart-home, mobile automation and other voice-interaction devices have appeared on the market, such as smart speakers, Amazon Alexa and Apple Siri. These devices need a specific word detection system to wake them up before voice interaction. However, such systems generally detect well only in relatively quiet scenes and perform poorly under noise; that is, prior-art specific word detection methods work well only on voice recorded in a relatively quiet environment, and their performance drops off a cliff in noisy scenes, so keyword detection in noisy voice is inaccurate.
Disclosure of Invention
The embodiment of the invention provides a method and a device for processing specific word voice. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a method for processing a specific word speech, including:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
and inputting the second characteristics into the target U-NET model to judge whether the voice to be tested has specific word voice or not and obtain noise reduction voice of the voice to be tested.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, calculating a model loss function according to the first estimated masking value, the estimation result, the true masking value, and the true judgment result includes:
calculating the model loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θ_pure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θ_mixture represents the phase of the voice to be trained in the frequency domain space.
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
According to a second aspect of the embodiments of the present invention, there is provided a processing apparatus for a specific word speech, including:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
and the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voice exists in the voice to be tested and obtain noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
In one embodiment, the training submodule is further configured to:
calculating the model loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

the real masking value PSM is obtained by calculation using a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θ_pure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θ_mixture represents the phase of the voice to be trained in the frequency domain space.
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
The technical scheme provided by the embodiment of the invention can have the following beneficial effects:
the method comprises the steps of inputting first characteristics of voice to be trained into a U-NET model to be trained to obtain a target U-NET model with higher maturity and accuracy after training, then extracting second characteristics of the voice to be tested after the voice to be tested is obtained, inputting the second characteristics into the target U-NET model with higher accuracy to obtain noise-reducing voice of the voice to be tested, namely pure voice except noise in the voice to be tested, and judging whether specific word voice exists in the voice to be tested, so that noise-reducing quality and detection efficiency of keywords in the voice with noise are fully and effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1A is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
FIG. 1B is a flow diagram illustrating a method of processing a particular word speech according to an example embodiment.
Fig. 2 is a block diagram illustrating an apparatus for processing a specific word speech according to an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In order to solve the above technical problem, an embodiment of the present invention provides a method for processing a specific word speech, where the method is applicable to a specific word speech processing program, system or device, and an execution subject corresponding to the method may be a terminal or a server, as shown in fig. 1A, and the method includes steps S101 to S105:
in step S101, a speech to be trained with noise is acquired;
the speech to be trained is mixed and obtained in a simulation mode, and the obtained method is to add different types of noise to clean speech at different signal-to-noise ratios.
In step S102, a first characteristic of the voice to be trained is extracted; the first characteristic is the amplitude of the voice to be trained in the frequency domain space, i.e. the modulus of its complex frequency-domain representation. The first characteristic and the second characteristic are the same single feature, the amplitude of the voice: the training stage trains on this characteristic, and the testing stage likewise feeds the amplitude characteristic into the trained model (i.e. the target U-NET model) to obtain the noise-reduced result.
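The amplitude feature extraction can be sketched in NumPy (a minimal illustration; frame length, hop size and the function name are assumed values, not specified by the patent):

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=256):
    """Magnitude (modulus) of the complex STFT -- the amplitude feature fed to the model."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)  # complex frequency-domain values
    return np.abs(spectrum)                 # modulus of the complex value

x = np.sin(2 * np.pi * 1000 * np.arange(4096) / 16000)  # 1 kHz tone at 16 kHz
feat = stft_magnitude(x)
print(feat.shape)  # (15, 257) = (n_frames, frame_len // 2 + 1)
```

For a 1 kHz tone sampled at 16 kHz with a 512-point frame, the energy concentrates around frequency bin 32, which is a quick sanity check on the feature.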
In step S103, the first characteristic is input into a U-NET model to be trained (a deep-learning model) to obtain a target U-NET model; the U-NET model is a U-shaped network structure that can be used both for noise reduction or enhancement of noisy voice and for detection of keywords in the voice.
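The U-shaped data flow can be illustrated with a toy, weight-free sketch (an assumption for illustration only: `down`, `up` and `tiny_unet` are hypothetical names, averaging replaces learned convolutions, and the patent's actual U-NET is a trained deep network):

```python
import numpy as np

def down(x):
    # Encoder step: halve the resolution (stand-in for conv + pooling).
    return x.reshape(-1, 2).mean(axis=1)

def up(x):
    # Decoder step: double the resolution (stand-in for upsampling).
    return np.repeat(x, 2)

def tiny_unet(feat):
    """Toy U-shaped pass: encode, decode, and fuse skip connections."""
    e1 = down(feat)        # length/2
    e2 = down(e1)          # length/4 (bottleneck)
    d1 = up(e2)            # length/2
    d1 = (d1 + e1) / 2     # skip connection from encoder level 1
    d0 = up(d1)            # full length
    d0 = (d0 + feat) / 2   # skip connection from the input level
    return 1 / (1 + np.exp(-d0))  # sigmoid squashes output to a mask in (0, 1)

mask = tiny_unet(np.zeros(16))
```

The essential point the sketch shows is the symmetric encoder/decoder with skip connections, producing a per-bin mask of the same length as the input feature.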
In step S104, acquiring a voice to be tested, and extracting a second feature of the voice to be tested;
the speech to be tested is recorded without mixing.
In step S105, the second feature is input to the target U-NET model to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
By inputting the first characteristic of the voice to be trained into the U-NET model to be trained, a trained target U-NET model of higher maturity and accuracy can be obtained. After the voice to be tested is acquired, its second characteristic is extracted and input into this more accurate target U-NET model, which judges whether specific word voice exists in the voice to be tested and outputs the noise-reduced voice, i.e. the pure voice with the noise removed (for example, if only the specific word voice is needed, everything except the specific word voice can be filtered out). The method and the device can therefore fully and effectively improve the noise-reduction quality and the detection efficiency of keywords or specific words in noisy voice, and in turn improve the voice wake-up accuracy and timeliness of voice-interaction devices. The specific word voice may be the voice of a particular word, such as a wake-up word.
In one embodiment, the inputting the first feature into a U-NET model to be trained to obtain a target U-NET model includes:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
By inputting the first characteristic into the U-NET model to be trained, a first estimated masking value PSM (Phase Sensitive Mask) corresponding to the voice to be trained can be obtained, together with an estimate of whether the voice to be trained includes the preset voice, i.e. whether it contains a certain specified keyword. The U-NET model to be trained is then trained further according to the first estimated masking value and the estimation result, yielding an optimized and upgraded target U-NET model. This makes it convenient to use the target U-NET model to denoise noisy voice accurately, and improves the detection efficiency and accuracy of keywords in noisy voice.
In an embodiment, the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model includes:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice; the preset speech may also be a speech of a specific word or keyword, and may be the same as or different from the specific word sound.
Calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
When the U-NET model is optimized, an accurate model loss function can be calculated from the first estimated masking value, the estimation result, the real masking value and the real judgment result. The U-NET model to be trained is then adjusted using this loss function, and the adjustment process can be repeated cyclically to obtain an optimized and upgraded target U-NET model, so that noisy voice can be denoised accurately and the detection efficiency and accuracy of keywords in noisy voice improved.
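The adjust-and-repeat loop can be sketched with a toy stand-in (an illustrative assumption throughout: a single scalar weight replaces the U-NET, synthetic arrays replace real features and masks, and only the MAE term over the masking value is shown; the subgradient of the MAE drives the update):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(200)                    # stand-in amplitude features
psm_true = np.clip(0.7 * x, 0.0, 1.0)  # stand-in "real" masking values

def mae(a, b):
    return np.abs(a - b).mean()

w = 0.0                                # toy model: psm_est = w * x
losses = []
for _ in range(300):
    psm_est = w * x
    losses.append(mae(psm_est, psm_true))
    # Subgradient of MAE with respect to w; sign() handles the |.| kink.
    grad = np.mean(np.sign(psm_est - psm_true) * x)
    w -= 0.05 * grad                   # adjust the model, then repeat
```

The loss decreases toward convergence, mirroring the cyclic adjustment described above; in the real method the update would instead backpropagate through the U-NET's weights.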
In one embodiment, calculating a model loss function according to the first estimated masking value, the estimation result, the true masking value, and the true judgment result includes:
calculating the model loss function Loss by a first preset formula:

Loss = MAE(PSM_est, PSM) + MAE(LABEL_est, LABEL)

wherein PSM_est and LABEL_est respectively denote the first estimated masking value and the estimation result, PSM and LABEL respectively denote the real masking value and the real judgment result, and MAE denotes the mean absolute error;

LABEL takes the value 1 when the voice to be trained includes the preset voice and 0 when it does not; correspondingly, at the test stage the label is 1 when the voice to be tested includes the specific word voice and 0 when it does not.

The real masking value PSM is obtained by calculation using a second preset formula:

PSM = (|pure| / |mixture|) · cos(θ_pure − θ_mixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θ_pure represents the phase (the argument of the complex frequency-domain representation) of the pure voice corresponding to the voice to be trained, and θ_mixture represents the phase of the voice to be trained in the frequency domain space.
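The per-bin phase-sensitive mask computation can be sketched in NumPy (a minimal illustration; the function name and the small guard against division by zero are assumptions):

```python
import numpy as np

def phase_sensitive_mask(pure_spec, mixture_spec):
    """PSM = (|pure| / |mixture|) * cos(theta_pure - theta_mixture),
    computed bin-by-bin on complex STFT spectra."""
    ratio = np.abs(pure_spec) / np.maximum(np.abs(mixture_spec), 1e-12)
    phase_diff = np.angle(pure_spec) - np.angle(mixture_spec)
    return ratio * np.cos(phase_diff)

# Sanity check: with no noise, mixture == pure and the mask is exactly 1.
spec = np.array([1 + 1j, 0.5 - 2j, 3 + 0j])
print(phase_sensitive_mask(spec, spec))  # [1. 1. 1.]
```

When the mixture carries extra energy or a phase shift relative to the clean voice, the mask drops below 1, which is what lets it attenuate noisy bins.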
When the model loss function Loss is calculated with the above formula, the mean absolute error (MAE) serves as the convergence criterion: training stops once the loss function converges, yielding the best-optimized target U-NET model and hence the best voice-detection and noise-reduction effect.
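The two-part loss, an MAE term on the masking value plus an MAE term on the utterance label, can be sketched as follows (a minimal illustration; the function name and the example values are assumptions):

```python
import numpy as np

def model_loss(psm_est, label_est, psm_true, label_true):
    """Loss = MAE over the mask estimate plus MAE over the utterance label."""
    mae = lambda a, b: np.mean(np.abs(a - b))
    return mae(psm_est, psm_true) + mae(label_est, label_true)

# Toy values: three time-frequency bins and one utterance-level label.
psm_true = np.array([0.9, 0.1, 0.5])
loss = model_loss(np.array([0.8, 0.2, 0.5]), 0.9, psm_true, 1.0)
```

Summing the two MAE terms trains the masking (noise-reduction) branch and the detection branch jointly, which is what couples denoising and keyword detection in one model.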
In one embodiment, the inputting the second feature into the target U-NET model to determine whether a specific word speech exists in the speech to be tested and obtain a noise-reduced speech of the speech to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space, i.e. the modulus of its complex frequency-domain representation;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
After the target U-NET model is obtained, the second characteristic of the voice to be tested can be input into it to judge whether specific word voice really exists in the voice to be tested, so that the presence of a given keyword can be accurately identified, and to obtain a second estimated masking value PSM. A short-time Fourier transform (STFT) is then applied to the voice to be tested to obtain its frequency spectrum; multiplying the spectrum by the second estimated masking value and applying the inverse transform (ISTFT) yields a good noise-reduction effect.
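The mask-multiply-and-invert step can be sketched with SciPy's STFT/ISTFT pair (an illustrative assumption: an all-ones mask stands in for the model's second estimated masking value, so the signal passes through unchanged; a real PSM would attenuate noisy bins):

```python
import numpy as np
from scipy.signal import stft, istft

rng = np.random.default_rng(0)
fs = 16000
t = np.arange(fs) / fs
tested = np.sin(2 * np.pi * 440 * t) + 0.3 * rng.standard_normal(fs)  # noisy voice

# Spectrum of the voice to be tested.
freqs, times, spec = stft(tested, fs=fs, nperseg=512)

# Stand-in for the second estimated masking value produced by the model.
psm_est = np.ones_like(spec, dtype=float)

# Multiply mask and spectrum, then apply the inverse short-time Fourier transform.
_, denoised = istft(psm_est * spec, fs=fs, nperseg=512)
```

With a unit mask the roundtrip reconstructs the input, which verifies the multiply-then-invert plumbing; substituting a model-estimated PSM performs the actual noise reduction.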
The technical solution of the present invention will be further described in detail with reference to fig. 1B:
step 1: generating data. Mix original specific word data with various types of noise at different signal-to-noise ratios (-5 to 15 dB), and likewise mix non-specific word data with noise at different signal-to-noise ratios; the mixed voice serves as the training data. A verification set is generated in the same way, with noise types, signal-to-noise ratios and speakers different from those of the training set. The training set is used to train the model; the verification set only supervises the model and does not participate in error backpropagation;
step 2: extracting characteristics. Compute the short-time Fourier transform of each training utterance, then normalize its amplitude to serve as the input of the model;
and 3, step 3: and calculating a training target, wherein the training target consists of two parts. A phase sensitive mask (true PSM) is computed, in part, for the trained mixed speech (mix) and its corresponding clean speech (pure), as follows:
Where | represents amplitude, θ represents phase; the other part is a LABEL (LABEL) of the whole voice, the specific word phonetic symbol is marked as 1, and the non-specific word phonetic symbol is marked as 0;
and 4, step 4: inputting the extracted features into a U-NET network model for training, using an average absolute error MAE (mean absolute error) as a convergence criterion, stopping training until a loss function converges, and storing the model, wherein the loss function is defined as follows:
wherein the content of the first and second substances,andrespectively, the model estimated PSM and LABEL.
In the testing stage, the characteristics of the tested voice are passed through the trained model to obtain both a judgment of whether the tested voice contains a specific word and an estimated PSM; the PSM is multiplied by the frequency spectrum of the tested voice (obtained by STFT) and an inverse Fourier transform is applied to obtain the noise-reduced voice.
Finally, it should be noted that the above embodiments can be freely combined by those skilled in the art according to actual needs.
Corresponding to the method for processing the specific word speech provided in the embodiment of the present invention, an embodiment of the present invention further provides a device for processing the specific word speech, as shown in fig. 2, where the device includes:
an obtaining module 201, configured to obtain a voice to be trained with noise;
an extracting module 202, configured to extract a first feature of the speech to be trained;
the input module 203 is used for inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
the first processing module 204 is configured to obtain a voice to be tested, and extract a second feature of the voice to be tested;
the second processing module 205 is configured to input the second feature to the target U-NET model, so as to determine whether a specific word voice exists in the voice to be tested, and obtain a noise reduction voice of the voice to be tested.
In one embodiment, the input module comprises:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimation masking value and the estimation result so as to obtain the target U-NET model.
In one embodiment, the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
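A sketch of this training criterion, assuming (as the definitions below suggest) that the first preset formula sums the mean absolute errors of the two model outputs against their ground truths; `model_loss` is a hypothetical name:

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between an estimate and its ground truth."""
    return float(np.mean(np.abs(np.asarray(a) - np.asarray(b))))

def model_loss(psm_est, label_est, psm_true, label_true):
    """Assumed form of the first preset formula:
    Loss = MAE(estimated PSM, real PSM) + MAE(estimated label, real label)."""
    return mae(psm_est, psm_true) + mae(label_est, label_true)

# Perfect estimates drive the loss to zero.
psm_true = np.array([0.2, 0.8, 0.4])
label_true = np.array([1.0])
loss = model_loss(psm_true, label_true, psm_true, label_true)
```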
In one embodiment, the training submodule is further configured to:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM′, PSM) + MAE(LABEL′, LABEL)

wherein PSM′ and LABEL′ respectively represent the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;
the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θpure − θmixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θpure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θmixture represents the phase of the voice to be trained in the frequency domain space.
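The second preset formula is the standard phase-sensitive mask; a bin-wise sketch on complex STFT spectra (hypothetical function name, with a small epsilon added to avoid division by zero):

```python
import numpy as np

def phase_sensitive_mask(pure_spec, mixture_spec, eps=1e-8):
    """PSM = (|pure| / |mixture|) * cos(theta_pure - theta_mixture),
    computed element-wise on complex spectra."""
    ratio = np.abs(pure_spec) / (np.abs(mixture_spec) + eps)
    return ratio * np.cos(np.angle(pure_spec) - np.angle(mixture_spec))

# First bin: half the mixture amplitude and in phase -> mask 0.5.
# Second bin: equal amplitude but 90 degrees out of phase -> mask ~0.
s = np.array([1.0 + 0.0j, 0.0 + 1.0j])       # "pure" spectrum
m_spec = np.array([2.0 + 0.0j, 1.0 + 0.0j])  # "mixture" spectrum
m = phase_sensitive_mask(s, m_spec)
```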
In one embodiment, the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
Claims (10)
1. A method for processing a specific word speech, comprising:
acquiring a voice to be trained with noise;
extracting a first feature of the voice to be trained;
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model;
acquiring a voice to be tested, and extracting a second characteristic of the voice to be tested;
and inputting the second characteristics into the target U-NET model to judge whether the voice to be tested has specific word voice or not and obtain noise reduction voice of the voice to be tested.
2. The method of claim 1,
inputting the first characteristic into a U-NET model to be trained to obtain a target U-NET model, wherein the method comprises the following steps:
inputting the first characteristic into the U-NET model to be trained to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model.
3. The method of claim 2,
the training the U-NET model to be trained according to the first estimated masking value and the estimation result to obtain the target U-NET model comprises:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
4. The method of claim 3,
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result, including:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM′, PSM) + MAE(LABEL′, LABEL)

wherein PSM′ and LABEL′ respectively represent the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;
the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θpure − θmixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θpure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θmixture represents the phase of the voice to be trained in the frequency domain space.
5. The method according to any one of claims 1 to 4,
the inputting the second characteristic into the target U-NET model to judge whether the voice to be tested has a specific word voice and obtain the noise reduction voice of the voice to be tested includes:
inputting the second characteristic into the target U-NET model to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
carrying out short-time Fourier transform on the voice to be tested to obtain a frequency spectrum of the voice to be tested;
and multiplying the second estimated masking value and the frequency spectrum, and then performing inverse Fourier transform to obtain the noise-reduced voice.
6. An apparatus for processing a specific word speech, comprising:
the acquisition module is used for acquiring a voice to be trained with noise;
the extraction module is used for extracting a first feature of the voice to be trained;
the input module is used for inputting the first characteristic into a U-NET model to be trained so as to obtain a target U-NET model;
the first processing module is used for acquiring a voice to be tested and extracting a second characteristic of the voice to be tested;
and the second processing module is used for inputting the second characteristics to the target U-NET model so as to judge whether specific word voice exists in the voice to be tested and obtain noise reduction voice of the voice to be tested.
7. The apparatus of claim 6,
the input module includes:
the input submodule is used for inputting the first characteristic into the U-NET model to be trained so as to obtain a first estimated masking value corresponding to the voice to be trained and an estimation result of whether the voice to be trained comprises preset voice; the first characteristic is the amplitude value of the voice to be trained in a frequency domain space;
and the training submodule is used for training the U-NET model to be trained according to the first estimated masking value and the estimation result so as to obtain the target U-NET model.
8. The apparatus of claim 7,
the training submodule is specifically configured to:
acquiring a real masking value corresponding to the voice to be trained and a real judgment result of whether the voice to be trained comprises a preset voice;
calculating a model loss function according to the first estimated masking value, the estimation result, the real masking value and the real judgment result;
and adjusting the U-NET model to be trained according to the model loss function to obtain the target U-NET model.
9. The apparatus of claim 8,
the training submodule is further specifically configured to:
calculating the model Loss function Loss by a first preset formula:

Loss = MAE(PSM′, PSM) + MAE(LABEL′, LABEL)

wherein PSM′ and LABEL′ respectively represent the first estimated masking value and the estimation result, PSM and LABEL respectively represent the real masking value and the real judgment result, and MAE represents the mean absolute error;
the real masking value PSM is obtained by calculation using a second preset formula, where the second preset formula is:

PSM = (|pure| / |mixture|) · cos(θpure − θmixture)

wherein |pure| represents the amplitude of the pure voice corresponding to the voice to be trained in the frequency domain space, |mixture| represents the amplitude of the voice to be trained in the frequency domain space, θpure represents the phase of the pure voice corresponding to the voice to be trained in the frequency domain space, and θmixture represents the phase of the voice to be trained in the frequency domain space.
10. The apparatus according to any one of claims 6 to 9,
the second processing module comprises:
the input submodule is used for inputting the second characteristics into the target U-NET model so as to judge whether a specific word voice exists in the voice to be tested and to obtain a second estimated masking value corresponding to the voice to be tested; the second characteristic is the amplitude of the voice to be tested in a frequency domain space;
the conversion submodule is used for carrying out short-time Fourier transform on the voice to be tested so as to obtain the frequency spectrum of the voice to be tested;
and the processing submodule is used for multiplying the second estimated masking value and the frequency spectrum and then performing inverse Fourier transform to obtain the noise-reduced voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010307655.9A CN111613211B (en) | 2020-04-17 | 2020-04-17 | Method and device for processing specific word voice |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111613211A true CN111613211A (en) | 2020-09-01 |
CN111613211B CN111613211B (en) | 2023-04-07 |
Family
ID=72203952
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010307655.9A Active CN111613211B (en) | 2020-04-17 | 2020-04-17 | Method and device for processing specific word voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111613211B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115798455A (en) * | 2023-02-07 | 2023-03-14 | 深圳元象信息科技有限公司 | Speech synthesis method, system, electronic device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN108986835A (en) * | 2018-08-28 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Based on speech de-noising method, apparatus, equipment and the medium for improving GAN network |
CN109461456A (en) * | 2018-12-03 | 2019-03-12 | 北京云知声信息技术有限公司 | A method of it promoting voice and wakes up success rate |
CN110060704A (en) * | 2019-03-26 | 2019-07-26 | 天津大学 | A kind of sound enhancement method of improved multiple target criterion study |
WO2019232851A1 (en) * | 2018-06-04 | 2019-12-12 | 平安科技(深圳)有限公司 | Method and apparatus for training speech differentiation model, and computer device and storage medium |
CN110600017A (en) * | 2019-09-12 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Training method of voice processing model, voice recognition method, system and device |
Non-Patent Citations (1)
Title |
---|
JINKYU LEE ET AL: "Phase-sensitive Joint Learning Algorithms for Deep Learning-based Speech Enhancement" * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||