CN111755010A - Signal processing method and device combining voice enhancement and keyword recognition - Google Patents


Info

Publication number
CN111755010A
CN111755010A
Authority
CN
China
Prior art keywords: signal, frequency domain, voice, domain signal, speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010648540.6A
Other languages
Chinese (zh)
Inventor
付聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Chumen Wenwen Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN202010648540.6A
Publication of CN111755010A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 Speech to text systems
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0232 Processing in the frequency domain
    • G10L2021/02082 Noise filtering, the noise being echo or reverberation of the speech


Abstract

The present disclosure provides a signal processing method combining speech enhancement and keyword recognition. When speech enhancement and keyword recognition are used together, the processing method provided by the disclosure reduces the computational complexity of the overall pipeline and the distortion introduced by processing. The method includes: inputting a speech signal; converting the speech signal into a first frequency domain signal; performing speech enhancement processing on the first frequency domain signal to obtain a second frequency domain signal; extracting speech features of the second frequency domain signal; and performing keyword recognition according to the speech features of the second frequency domain signal.

Description

Signal processing method and device combining voice enhancement and keyword recognition
Technical Field
The present disclosure relates to the field of speech signal processing technologies, and in particular, to a signal processing method and apparatus, a readable storage medium, and a computing device that combine speech enhancement and keyword recognition.
Background
Currently, keyword recognition is widely applied in Internet of Things (IoT) devices. These devices are highly power-constrained, so the computational complexity of the algorithms should be as low as possible. Keyword recognition is typically used together with speech enhancement to improve recognition performance in noisy environments. In existing methods, speech enhancement and keyword recognition are two isolated algorithms: after a noisy speech signal is input, speech enhancement improves its signal-to-noise ratio and outputs a clean speech signal; the keyword recognition algorithm then takes the clean speech signal as input and performs recognition, as shown in fig. 1.
Each conversion between the time domain and the frequency domain consumes computational resources. Moreover, each conversion introduces some signal distortion, such as spectral leakage and possible clipping distortion when converting from the frequency domain back to the time domain, and spectral leakage when converting from the time domain to the frequency domain, which degrades keyword recognition quality.
Disclosure of Invention
To this end, the present disclosure provides a signal processing method, apparatus, readable storage medium, computing device and smart headset device that combine speech enhancement and keyword recognition in an effort to solve or at least mitigate at least one of the problems identified above.
According to an aspect of the embodiments of the present disclosure, there is provided a signal processing method combining speech enhancement and keyword recognition, including:
inputting a voice signal;
converting the voice signal into a first frequency domain signal;
performing voice enhancement processing on the first frequency domain signal to obtain a second frequency domain signal;
extracting the voice feature of the second frequency domain signal;
and performing keyword recognition according to the voice characteristics of the second frequency domain signal.
Optionally, before inputting the speech signal, the method further comprises:
determining a first frame length suitable for speech enhancement processing;
determining a second frame length suitable for keyword recognition;
the frame length of the speech signal is determined based on the first frame length suitable for the speech enhancement processing and the second frame length suitable for the keyword recognition.
Optionally, the frame length of the voice signal is 10-40 ms.
Optionally, before converting the speech signal into the first frequency domain signal, the method further includes:
determining a first window suitable for speech enhancement processing;
determining a second window suitable for keyword recognition;
the window used for converting the speech signal into the first frequency domain signal is determined based on a first window suitable for speech enhancement processing and a second window suitable for keyword recognition.
Optionally, the window used for converting the speech signal into the first frequency domain signal includes:
hanning window, or hamming window, or blackman window.
Optionally, performing speech enhancement processing on the first frequency domain signal includes:
and performing echo cancellation, beam forming, dereverberation and single-channel noise reduction processing on the first frequency domain signal.
According to still another aspect of the embodiments of the present disclosure, there is provided a signal processing apparatus combining speech enhancement and keyword recognition, including:
an input unit for inputting a voice signal;
the time-frequency conversion unit is used for converting the voice signal into a first frequency domain signal;
the voice enhancement unit is used for carrying out voice enhancement processing on the first frequency domain signal to obtain a second frequency domain signal;
a feature extraction unit, configured to extract a speech feature of the second frequency domain signal;
and the keyword identification unit is used for carrying out keyword identification according to the voice characteristics of the second frequency domain signal.
Optionally, the apparatus further comprises:
a frame length determination unit for determining a first frame length suitable for speech enhancement processing;
determining a second frame length suitable for keyword recognition;
the frame length of the speech signal is determined based on the first frame length suitable for the speech enhancement processing and the second frame length suitable for the keyword recognition.
Optionally, the apparatus further comprises:
a window determination unit for determining a first window suitable for speech enhancement processing;
determining a second window suitable for keyword recognition;
the window used for converting the speech signal into the first frequency domain signal is determined based on a first window suitable for speech enhancement processing and a second window suitable for keyword recognition.
Optionally, the speech enhancement unit is specifically configured to:
and performing echo cancellation, beam forming, dereverberation and single-channel noise reduction on the first frequency domain signal.
According to yet another aspect of the present disclosure, there is provided a readable storage medium having executable instructions thereon that, when executed, cause a computer to perform the signal processing method in conjunction with speech enhancement and keyword recognition described above.
According to yet another aspect of the present disclosure, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the signal processing method described above in connection with speech enhancement and keyword recognition.
Optionally, the computing device is a smart headset device.
According to the embodiments of the present disclosure, a speech signal is input, converted into a first frequency domain signal, and subjected to speech enhancement processing to obtain a second frequency domain signal; speech features of the second frequency domain signal are extracted, and keyword recognition is performed according to those features. When speech enhancement and keyword recognition are used together, this processing approach reduces the computational complexity of the overall pipeline and the distortion introduced by processing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a flow diagram of keyword recognition in the prior art;
FIG. 2 is a block diagram of an exemplary computing device;
FIG. 3 is a flow diagram of a method of signal processing incorporating speech enhancement and keyword recognition in accordance with an embodiment of the present disclosure;
FIG. 4 is yet another flow diagram of a method of signal processing incorporating speech enhancement and keyword recognition in accordance with an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a signal processing apparatus combining speech enhancement and keyword recognition according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 2 is a block diagram of an example computing device 100 arranged to implement a signal processing method incorporating speech enhancement and keyword recognition in accordance with the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processing, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. The example memory controller 118 may be used with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the programs 122 may be configured to execute, by the one or more processors 104, instructions on the operating system using the program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules, and may include any information delivery media, such as carrier waves or other transport mechanisms, in a modulated data signal. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or dedicated wired connection, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-sized portable (or mobile) electronic device, such as a Personal Digital Assistant (PDA), a wireless web-browsing device, a personal headset device (e.g., a smart headset device), an application-specific device, or a hybrid device including any of the above functions.
Among other things, one or more programs 122 of computing device 100 include instructions for performing a method of signal processing incorporating speech enhancement and keyword recognition in accordance with the present disclosure.
Fig. 3 illustrates a flow diagram of a signal processing method 200 incorporating speech enhancement and keyword recognition according to one embodiment of the present disclosure, the method 200 starting at step S210.
And step S210, inputting a voice signal.
The input speech signal is a preprocessed speech signal that has been divided into speech frames, so that the frame length of the signal meets the requirements of both the speech enhancement and keyword recognition processing.
Specifically, the step of determining the length of the voice frame suitable for the voice enhancement and keyword recognition processing comprises the following steps:
determining a first frame length suitable for speech enhancement processing;
determining a second frame length suitable for keyword recognition;
the frame length of the speech signal is determined based on the first frame length suitable for the speech enhancement processing and the second frame length suitable for the keyword recognition.
The first frame length may be a frame length suitable for the processing in step S230, and the second frame length may be a frame length suitable for the processing in steps S240 and S250.
According to the embodiment of the present disclosure, the frame length of a speech frame is 10-40 ms. Because both the speech enhancement algorithm and the keyword recognition algorithm rely on the short-time stationarity of speech, the length of a single processed speech frame may be 10-40 ms.
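As a concrete illustration of the framing step (not taken from the patent itself), the sketch below splits a 16 kHz signal into 32 ms frames with a 16 ms hop; the sample rate, frame length, and hop are assumed values chosen within the 10-40 ms range:

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames of frame_len samples,
    advancing by hop_len samples per frame."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return x[idx]

# At 16 kHz, a 32 ms frame is 512 samples and a 16 ms hop is 256 samples.
sr = 16000
frame_len, hop_len = int(0.032 * sr), int(0.016 * sr)
x = np.random.randn(sr)  # one second of audio as a stand-in
frames = frame_signal(x, frame_len, hop_len)
```

With a 50% hop, the second half of each frame reappears as the first half of the next, which is the usual setup for overlap-add style processing.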
Step S220, converting the voice signal into a first frequency domain signal.
According to an embodiment of the present disclosure, the speech signal is converted into a frequency domain signal using the Fast Fourier Transform (FFT). The FFT has the clear advantage of low computational cost; it enables real-time signal processing and speeds up keyword recognition.
Further, before the FFT is computed, the speech frame obtained in step S210 must be windowed, and the chosen window must meet the requirements of both the speech enhancement and keyword recognition processing.
Specifically, the step of determining the type of window suitable for speech enhancement and keyword recognition processing includes:
determining a first window suitable for speech enhancement processing;
determining a second window suitable for keyword recognition;
the window used for converting the speech signal into the first frequency domain signal is determined based on a first window suitable for speech enhancement processing and a second window suitable for keyword recognition.
The first window may be a window suitable for the processing in step S230, and the second window may be a window suitable for the processing in steps S240 and S250.
The window is selected primarily to balance main-lobe width against side-lobe height. According to embodiments of the present disclosure, candidate windows include the Hamming, Hanning, and Blackman windows, as well as other specially designed windows meeting the same criteria. The window parameters are determined by the algorithms used in the speech enhancement and keyword recognition processing.
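A minimal sketch of windowing followed by an FFT, using a Hanning window as the text suggests; the 512-sample frame length (32 ms at 16 kHz) is an assumed value:

```python
import numpy as np

frame_len = 512                      # 32 ms at 16 kHz, inside the 10-40 ms range
window = np.hanning(frame_len)       # Hanning window, one of the candidates named above

frame = np.random.randn(frame_len)             # one speech frame (random stand-in)
spectrum = np.fft.rfft(frame * window)         # the "first frequency domain signal"
power = np.abs(spectrum) ** 2                  # power at each of the frame_len//2 + 1 bins
```

For real input, `rfft` returns only the non-negative-frequency half of the spectrum, which is all that the subsequent enhancement and feature extraction need.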
Step S230, performing speech enhancement processing on the first frequency domain signal to obtain a second frequency domain signal.
Speech enhancement processing includes, but is not limited to: echo cancellation, beamforming, dereverberation, and single-channel noise reduction processing.
According to an embodiment of the present disclosure, echo cancellation uses an adaptive method: an algorithm estimates an expected signal, i.e., a simulated echo, that approximates the actual echo path, and the simulated echo is then subtracted from the mixed signal captured by the microphone, achieving echo cancellation.
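The adaptive principle can be sketched with a normalized LMS filter in the time domain (the patent applies enhancement in the frequency domain; this time-domain version only illustrates how a simulated echo is adapted and subtracted, and the 3-tap echo path, filter length, and step size are all hypothetical):

```python
import numpy as np

def nlms_echo_cancel(far, mic, taps=64, mu=0.5, eps=1e-8):
    """Normalized LMS: adapt an FIR estimate of the echo path from the
    far-end (loudspeaker) signal and subtract the simulated echo from
    the microphone signal."""
    w = np.zeros(taps)               # current echo-path estimate
    buf = np.zeros(taps)             # most recent far-end samples
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far[n]
        echo_hat = w @ buf           # simulated echo
        e = mic[n] - echo_hat        # residual after subtraction
        w += mu * e * buf / (buf @ buf + eps)
        out[n] = e
    return out

rng = np.random.default_rng(0)
far = rng.standard_normal(4000)
true_path = np.array([0.5, 0.3, -0.2])        # hypothetical 3-tap echo path
mic = np.convolve(far, true_path)[:4000]      # pure echo, no near-end speech
residual = nlms_echo_cancel(far, mic)
```

Once the filter converges, the residual energy drops far below the raw microphone energy, which is the cancellation effect the text describes.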
According to an embodiment of the present disclosure, the beamforming algorithm forms a Finite Impulse Response (FIR) filter, computed offline by an optimization algorithm or adapted iteratively from the input signal, and filters the input so as to enhance the signal from the target direction and suppress non-target signals.
According to an embodiment of the present disclosure, reverberation is removed using a multi-channel linear prediction algorithm (Weighted Prediction Error, WPE, or its generalized variant GWPE) or spectral subtraction. Reflected waves arriving with a delay of roughly 50 ms or more are referred to as echo; the effect produced by the remaining reflected waves is referred to as reverberation.
According to an embodiment of the present disclosure, echo cancellation is performed using the Least Mean Squares (LMS), Affine Projection Algorithm (APA), or Recursive Least Squares (RLS) method.
In accordance with an embodiment of the present disclosure, beamforming uses fixed beamforming, a Differential Microphone Array (DMA), or the Generalized Sidelobe Canceller (GSC).
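As one illustration of frequency-domain beamforming (a simple delay-and-sum, not necessarily the fixed-beam, DMA, or GSC design named above), each channel can be phase-aligned toward the target direction and then averaged; the two-channel setup and the 2-sample delay are assumptions:

```python
import numpy as np

def delay_and_sum(frames_freq, sensor_delays, freqs):
    """Frequency-domain delay-and-sum beamformer: phase-align each
    channel toward the assumed target direction, then average.
    frames_freq: (n_channels, n_bins); sensor_delays: seconds per channel."""
    steering = np.exp(2j * np.pi * freqs[None, :] * np.asarray(sensor_delays)[:, None])
    return np.mean(frames_freq * steering, axis=0)

sr, n_fft = 16000, 512
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
rng = np.random.default_rng(2)
base = np.fft.rfft(rng.standard_normal(n_fft))    # channel 0 spectrum
delay = 2 / sr                                    # channel 1 lags by 2 samples
ch1 = base * np.exp(-2j * np.pi * freqs * delay)  # same signal, delayed in phase
aligned = delay_and_sum(np.stack([base, ch1]), [0.0, delay], freqs)
```

When the assumed delays match the true ones, the channels add coherently (here the output equals the undelayed spectrum), while signals from other directions add with mismatched phases and are attenuated.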
According to an embodiment of the present disclosure, single-channel noise reduction uses a Wiener filter or the Optimally Modified Log-Spectral Amplitude (OM-LSA) estimator.
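A Wiener-style per-bin suppression gain can be sketched as follows; the noise power estimate, gain floor, and example bin values are assumptions, and a production system would track the noise estimate recursively:

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, floor=0.1):
    """Per-bin Wiener-style gain snr / (1 + snr), with the a-priori SNR
    estimated by power subtraction and the gain floored to limit
    musical noise (a simplified sketch, not the OM-LSA estimator)."""
    snr = np.maximum(noisy_power / (noise_power + 1e-12) - 1.0, 0.0)
    gain = snr / (1.0 + snr)
    return np.maximum(gain, floor)

noisy_power = np.array([10.0, 2.0, 1.0, 0.5])   # hypothetical per-bin powers
noise_power = np.ones(4)                        # assumed flat noise estimate
enhanced_power = wiener_gain(noisy_power, noise_power) ** 2 * noisy_power
```

High-SNR bins pass almost unchanged (gain near 1) while bins at or below the noise floor are attenuated to the floor value.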
And step S240, extracting the voice characteristics of the second frequency domain signal.
Further, step S240 includes:
calculating a power spectrum of the second frequency domain signal;
and calculating the voice characteristics of the second frequency domain signal according to the power spectrum of the second frequency domain signal.
Common speech features include, but are not limited to, filter banks (FBANKs) and Mel-Frequency Cepstral Coefficients (MFCCs), all of which are computed from the power spectrum of the speech signal. The power spectrum is the set of power values of the signal at each frequency bin, and reflects the energy distribution of the signal in the frequency domain.
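A minimal FBANK feature computation from the power spectrum might look like the following; the 40-filter mel layout and the 16 kHz/512-point parameters are assumed, not taken from the patent:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank matrix of shape (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):                     # rising slope
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                     # falling slope
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

sr, n_fft = 16000, 512
fb = mel_filterbank(40, n_fft, sr)
power_spectrum = np.abs(np.fft.rfft(np.random.randn(n_fft))) ** 2
log_fbank = np.log(fb @ power_spectrum + 1e-10)   # 40-dimensional FBANK feature
```

Note that the feature is computed directly from the enhanced power spectrum, so no inverse transform back to the time domain is needed.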
And step S250, performing keyword recognition according to the voice characteristics of the second frequency domain signal.
Keyword recognition uses a pre-trained speech model and can recognize one or more keywords.
Furthermore, to improve the effectiveness of keyword recognition, multiple models may be used, and whether the speech signal contains a keyword may be determined jointly from the recognition results and confidence scores of the multiple models.
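One hypothetical way to combine the decisions and confidences of several models (the patent does not specify a fusion rule) is confidence-weighted voting:

```python
def fuse_keyword_decisions(results, threshold=0.5):
    """Combine (detected, confidence) pairs from several models by
    confidence-weighted voting. This is an illustrative fusion rule,
    not one prescribed by the patent."""
    score = sum(conf if detected else -conf for detected, conf in results)
    total = sum(conf for _, conf in results)
    # Map threshold in [0, 1] onto the signed-vote scale [-1, 1].
    return total > 0 and score / total > (2 * threshold - 1)

# Two confident detections outweigh one low-confidence rejection.
decision = fuse_keyword_decisions([(True, 0.9), (True, 0.8), (False, 0.3)])
```

More elaborate schemes (e.g., requiring agreement from a small model and a larger verification model) follow the same pattern of weighing each model's result by its confidence.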
Specific examples of the present invention are given below.
As shown in fig. 4, the specific process includes:
s1, selecting a signal frame length suitable for both speech enhancement and keyword recognition, for example: 10-40 ms, selecting a time domain and frequency domain conversion window which is suitable for speech enhancement and key word identification, such as a Hanning window; multiplying the signal by a window, and then performing fast Fourier transform to complete the conversion from a time domain to a frequency domain;
s2, processing a series of voice signals, such as echo cancellation, beam forming, reverberation removal, noise reduction and the like, and finally obtaining a clean frequency domain signal;
s3, calculating a signal power spectrum, and then calculating the voice characteristics of the signal;
and S4, completing keyword recognition according to the voice characteristics.
According to the technical solution provided by this specific embodiment, speech enhancement and keyword recognition are combined: after speech enhancement, no frequency-to-time-domain conversion is performed, and the frequency domain signal is used directly for keyword recognition feature extraction. This avoids two time-frequency conversions, reducing both computational complexity and the distortion introduced by conversion.
According to the technical solution provided by the disclosure, on one hand, the computational complexity of the overall processing pipeline is reduced, which saves device power, extends battery life, and improves user experience; on the other hand, the distortion of the overall processing pipeline is reduced, which improves the performance and robustness of keyword recognition.
Experiments show that when the technical solution provided by the present disclosure is used for keyword recognition on a True Wireless Stereo (TWS) headset, computational complexity is reduced by about 20%-30%, and clipping distortion is avoided when the input signal is excessively large.
Referring to fig. 5, the present disclosure provides a signal processing apparatus combining speech enhancement and keyword recognition, including:
an input unit 310 for inputting a voice signal;
a time-frequency converting unit 320, configured to convert the voice signal into a first frequency-domain signal;
a speech enhancement unit 330, configured to perform speech enhancement processing on the first frequency domain signal to obtain a second frequency domain signal;
a feature extraction unit 340, configured to extract a speech feature of the second frequency domain signal;
a keyword recognition unit 350, configured to perform keyword recognition according to the speech feature of the second frequency domain signal.
Optionally, the apparatus further comprises:
a frame length determination unit for determining a first frame length suitable for speech enhancement processing;
determining a second frame length suitable for keyword recognition;
and determining the frame length of the voice signal according to the first frame length suitable for voice enhancement processing and the second frame length suitable for keyword recognition.
Optionally, the apparatus further comprises:
a window determination unit for determining a first window suitable for speech enhancement processing;
determining a second window suitable for keyword recognition;
and determining a window used for converting the voice signal into a first frequency domain signal according to the first window suitable for the voice enhancement processing and the second window suitable for the keyword recognition.
Optionally, the speech enhancement unit 330 is specifically configured to:
and performing echo cancellation, beam forming, dereverberation and single-channel noise reduction processing on the first frequency domain signal.
For specific limitations of the signal processing apparatus combining speech enhancement and keyword recognition, reference may be made to the above limitations of the signal processing method combining speech enhancement and keyword recognition, which are not described herein again.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer-readable media includes both computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding understanding of one or more of the disclosed aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, the disclosed aspects lie in less than all features of a single foregoing embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or may be adaptively changed and located in one or more devices different from those in the examples. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract, and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the disclosure and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or as a combination of elements of a method, that can be implemented by a processor of a computer system or by other means of carrying out the described functions. A processor having the necessary instructions for carrying out such a method or method element thus forms a means for carrying out the method or method element. Further, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by that element for the purposes of this disclosure.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present disclosure is intended to be illustrative of, but not to limit, the scope of the disclosure, which is set forth in the following claims.

Claims (10)

1. A signal processing method combining speech enhancement and keyword recognition, comprising:
inputting a speech signal;
converting the speech signal into a first frequency domain signal;
performing speech enhancement processing on the first frequency domain signal to obtain a second frequency domain signal;
extracting speech features of the second frequency domain signal; and
performing keyword recognition according to the speech features of the second frequency domain signal.
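The pipeline of claim 1 can be sketched end to end. This is an illustrative Python sketch, not the patented implementation: the STFT parameters, the spectral-subtraction enhancer, the log-magnitude features, and the cosine-similarity keyword matcher are all stand-in assumptions.

```python
import numpy as np

def stft(signal, frame_len=512, hop=256):
    """Convert the time-domain speech signal into a first frequency domain signal."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

def enhance(spec, floor=0.1):
    """Enhancement stand-in: spectral subtraction yields the second frequency domain signal."""
    mag, phase = np.abs(spec), np.angle(spec)
    noise = mag.mean(axis=0, keepdims=True)       # crude stationary-noise estimate
    clean = np.maximum(mag - noise, floor * mag)  # keep a spectral floor
    return clean * np.exp(1j * phase)

def extract_features(spec):
    """Feature stand-in: log-magnitude features of the enhanced signal."""
    return np.log1p(np.abs(spec))

def keyword_score(feats, template):
    """Recognition stand-in: cosine similarity against a stored keyword template."""
    a, b = feats.ravel(), template.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

In a real device the enhancer would be the multi-stage chain of claim 6 and the matcher a trained keyword-spotting model; the point of the claim is that enhancement and recognition reuse a single time-frequency conversion rather than each performing its own.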
2. The method of claim 1, wherein, before the inputting of the speech signal, the method further comprises:
determining a first frame length suitable for speech enhancement processing;
determining a second frame length suitable for keyword recognition; and
determining the frame length of the speech signal according to the first frame length suitable for speech enhancement processing and the second frame length suitable for keyword recognition.
3. The method of claim 2, wherein the frame length of the speech signal is 10-40 ms.
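The 10-40 ms range of claim 3 becomes a sample count once a sampling rate is fixed. A minimal sketch, assuming a 16 kHz rate and taking the larger of the two stage-preferred lengths as the compromise (the patent does not specify the reconciliation rule, so that choice is an assumption):

```python
SAMPLE_RATE = 16000  # Hz; a common rate for keyword spotting (assumption)

def frame_samples(frame_ms, sample_rate=SAMPLE_RATE):
    """Convert a frame length in milliseconds to a sample count."""
    return int(sample_rate * frame_ms / 1000)

def choose_frame_ms(enhancement_ms, keyword_ms, lo=10, hi=40):
    """Pick one frame length serving both stages, clipped to the claimed 10-40 ms."""
    candidate = max(enhancement_ms, keyword_ms)  # one plausible compromise (assumption)
    return min(max(candidate, lo), hi)
```

For example, a 25 ms frame at 16 kHz is 400 samples, which conveniently pads to a 512-point FFT.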
4. The method of claim 1, further comprising, before converting the speech signal into the first frequency domain signal:
determining a first window suitable for speech enhancement processing;
determining a second window suitable for keyword recognition; and
determining the window used to convert the speech signal into the first frequency domain signal according to the first window suitable for speech enhancement processing and the second window suitable for keyword recognition.
5. The method of claim 4, wherein the window used to convert the speech signal into the first frequency domain signal comprises:
a Hanning window, a Hamming window, or a Blackman window.
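The three windows named in claim 5 are standard tapers; a sketch of how they might be generated (the 400-sample length, 25 ms at 16 kHz, is an illustrative choice, not from the patent):

```python
import numpy as np

N = 400  # 25 ms at 16 kHz (illustrative choice)
windows = {
    "hanning":  np.hanning(N),   # endpoints exactly 0; moderate sidelobe suppression
    "hamming":  np.hamming(N),   # endpoints 0.08; lower first sidelobe than Hanning
    "blackman": np.blackman(N),  # near-zero endpoints; lowest sidelobes, widest mainlobe
}
```

The trade-off behind the choice: the Hanning window's zero endpoints suit overlap-add reconstruction after enhancement, while the Blackman window buys sidelobe suppression at the cost of frequency resolution.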
6. The method of any of claims 1-5, wherein performing speech enhancement processing on the first frequency domain signal comprises:
performing echo cancellation, beamforming, dereverberation, and single-channel noise reduction on the first frequency domain signal.
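The four stages of claim 6 can be chained on the first frequency domain signal. Each function below is a deliberately crude stand-in for the named technique, not the patented processing: real systems would use an adaptive filter for echo cancellation, an MVDR-style beamformer, and WPE-style dereverberation.

```python
import numpy as np

def echo_cancel(spec, ref_spec, leak=0.2):
    """Toy echo cancellation: subtract a scaled far-end reference spectrum."""
    return spec - leak * ref_spec

def beamform(multi_spec):
    """Delay-and-sum stand-in: average already-aligned channel spectra."""
    return multi_spec.mean(axis=0)

def dereverb(spec, alpha=0.9):
    """Crude dereverberation: subtract a recursively smoothed late-reverb estimate."""
    late = np.zeros_like(spec)
    for t in range(1, spec.shape[0]):
        late[t] = alpha * late[t - 1] + (1 - alpha) * spec[t - 1]
    mag = np.maximum(np.abs(spec) - 0.3 * np.abs(late), 0.05 * np.abs(spec))
    return mag * np.exp(1j * np.angle(spec))

def denoise(spec):
    """Single-channel noise reduction by magnitude spectral subtraction."""
    mag = np.abs(spec)
    noise = mag.mean(axis=0, keepdims=True)
    return np.maximum(mag - noise, 0.1 * mag) * np.exp(1j * np.angle(spec))
```

Because every stage operates on the same frequency-domain frames, the enhanced output can feed feature extraction directly, with no extra inverse/forward transform between enhancement and recognition.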
7. A signal processing apparatus combining speech enhancement and keyword recognition, comprising:
an input unit configured to input a speech signal;
a time-frequency conversion unit configured to convert the speech signal into a first frequency domain signal;
a speech enhancement unit configured to perform speech enhancement processing on the first frequency domain signal to obtain a second frequency domain signal;
a feature extraction unit configured to extract speech features of the second frequency domain signal; and
a keyword recognition unit configured to perform keyword recognition according to the speech features of the second frequency domain signal.
8. A readable storage medium having executable instructions thereon that, when executed, cause a computer to perform the operations included in any of claims 1-7.
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors to perform operations included in any of claims 1-7.
10. The computing device of claim 9, wherein the computing device is a smart headset device.
CN202010648540.6A 2020-07-07 2020-07-07 Signal processing method and device combining voice enhancement and keyword recognition Pending CN111755010A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010648540.6A CN111755010A (en) 2020-07-07 2020-07-07 Signal processing method and device combining voice enhancement and keyword recognition

Publications (1)

Publication Number Publication Date
CN111755010A true CN111755010A (en) 2020-10-09

Family

ID=72680088

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010648540.6A Pending CN111755010A (en) 2020-07-07 2020-07-07 Signal processing method and device combining voice enhancement and keyword recognition

Country Status (1)

Country Link
CN (1) CN111755010A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030018471A1 (en) * 1999-10-26 2003-01-23 Yan Ming Cheng Mel-frequency domain based audible noise filter and method
CN111243617A (en) * 2020-01-13 2020-06-05 中国科学院声学研究所 Speech enhancement method for reducing MFCC feature distortion based on deep learning
CN111276125A (en) * 2020-02-11 2020-06-12 华南师范大学 Lightweight speech keyword recognition method facing edge calculation

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530453A (en) * 2020-11-27 2021-03-19 五邑大学 Voice recognition method and device suitable for noise environment
CN112530453B (en) * 2020-11-27 2022-04-05 五邑大学 Voice recognition method and device suitable for noise environment
CN116614742A (en) * 2023-07-20 2023-08-18 江西红声技术有限公司 Clear voice transmitting and receiving noise reduction earphone

Similar Documents

Publication Publication Date Title
US10573301B2 (en) Neural network based time-frequency mask estimation and beamforming for speech pre-processing
CN109841220B (en) Speech signal processing model training method and device, electronic equipment and storage medium
CN102938254B (en) Voice signal enhancement system and method
US9570087B2 (en) Single channel suppression of interfering sources
US8983844B1 (en) Transmission of noise parameters for improving automatic speech recognition
CN108122563A (en) Improve voice wake-up rate and the method for correcting DOA
JP6153142B2 (en) Method for processing an acoustic signal
CN107924684B (en) Acoustic keystroke transient canceller for communication terminals using semi-blind adaptive filter models
CN111445919B (en) Speech enhancement method, system, electronic device, and medium incorporating AI model
CN110164465B (en) Deep-circulation neural network-based voice enhancement method and device
JP6225245B2 (en) Signal processing apparatus, method and program
CN110556125B (en) Feature extraction method and device based on voice signal and computer storage medium
CN111755010A (en) Signal processing method and device combining voice enhancement and keyword recognition
CN110875049B (en) Voice signal processing method and device
US10839820B2 (en) Voice processing method, apparatus, device and storage medium
CN107360497B (en) Calculation method and device for estimating reverberation component
CN113782044B (en) Voice enhancement method and device
EP2774147B1 (en) Audio signal noise attenuation
CN112289337B (en) Method and device for filtering residual noise after machine learning voice enhancement
CN113035216B (en) Microphone array voice enhancement method and related equipment
Kim et al. High‐performance DSP platform for digital hearing aid SoC with flexible noise estimation
WO2020078210A1 (en) Adaptive estimation method and device for post-reverberation power spectrum in reverberation speech signal
Jheng et al. FPGA implementation of high sampling rate in-car non-stationary noise cancellation based on adaptive Wiener filter
CN111048096B (en) Voice signal processing method and device and terminal
CN114220451A (en) Audio denoising method, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201009