CN110310655B - Microphone signal processing method, device, equipment and storage medium

Info

Publication number: CN110310655B
Application number: CN201910324799.2A
Authority: CN
Inventors: 刘荣
Original assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Current assignee: Guangzhou Shiyuan Electronics Thecnology Co Ltd
Priority date: 2019-04-22
Filing date: 2019-04-22
Publication date: 2021-10-22
Anticipated expiration: 2039-04-22
Also published as: CN110310655A

Abstract

The invention provides a microphone signal processing method, device, equipment and storage medium. After linear echo cancellation and beamforming, the signal is split into three paths. The first path undergoes first noise reduction, then first nonlinear echo suppression, and then voice presence detection, yielding a voice presence detection result X. The second path undergoes second noise reduction and then first automatic gain control, yielding a voice recognition signal Y for voice recognition. X and Y are combined into two sound channels for use by the speech recognition APP. The third path undergoes third noise reduction, then second nonlinear echo suppression to further suppress residual echo, and then second automatic gain control, yielding a voice application signal Z for recording or communication APPs. Because the signal is branched into three paths according to the different requirements of the speech recognition APP and other voice APPs, the structure is flexible, and the parameters and algorithms of each path can be adjusted independently without affecting the others.

Description

Microphone signal processing method, device, equipment and storage medium

Technical Field

The present invention relates to the field of speech signal processing, and more particularly, to a method, an apparatus, a device, and a storage medium for processing a microphone signal.

Background

In speech recognition applications, the microphone signal requires some pre-processing, such as beamforming, acoustic echo cancellation (AEC), noise reduction (NR), automatic gain control (AGC), dereverberation (DR) and voice presence detection (VAD). In an operating system, speech recognition software is usually a general-purpose APP that acquires the voice signal directly from a sound card device and performs recognition. Beamforming, echo cancellation, dereverberation and the like, however, are closely tied to the hardware design and are poorly suited to being placed independently in application software: each application would have to implement them separately and repeat the same calculations, some of the required information is not even available to the application, and universality is poor. Some prior-art solutions therefore implement the processing in the firmware of the microphone module, which has the disadvantages of a large computational load and a high module cost. Others implement it in the driver, which has the disadvantage that resources are limited, for example floating-point operations, locks, task scheduling and sleeping.

Disclosure of Invention

The invention aims to solve the above problems in the prior art and provides a microphone signal processing method, device, equipment and storage medium.

In a first aspect, an embodiment of the present invention provides a microphone signal processing method, including the following steps:

S1: performing linear echo cancellation (AEC) on the multi-path microphone signals together with the reference signal, to cancel the loudspeaker sound picked up by the microphones;

S2: performing beamforming on the multi-path microphone signals after the linear echo cancellation, and dividing the beamformed signal into three paths, wherein:

the first path of the signal undergoes first noise reduction processing and then first nonlinear echo suppression processing to further suppress residual echo, after which voice presence detection (VAD) is performed to obtain a voice presence detection result X;

the second path of the signal undergoes second noise reduction processing and then first automatic gain control (AGC) processing to obtain a voice recognition signal Y for voice recognition;

the voice presence detection result X and the voice recognition signal Y are combined into two sound channels and provided to the voice recognition APP;

Two different noise reduction algorithms (the first and the second) are used here because the signal used for speech recognition suffers a severe drop in recognition rate if the noise is reduced too aggressively or handled poorly, whereas the noise reduction for VAD needs to be strong, otherwise the VAD cannot operate normally. Nonlinear echo suppression is applied only on the VAD path because it degrades the speech recognition rate but is very helpful for VAD detection. Separating the two paths ensures both the speech recognition performance and the VAD performance, makes debugging and optimization more convenient, and keeps the parameters of the two paths from being coupled to each other.

the third path of the signal undergoes third noise reduction processing and then second nonlinear echo suppression processing to further suppress residual echo, after which second automatic gain control processing is performed to obtain a voice application signal Z for recording or communication APPs.
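The three-way split of step S2 can be sketched structurally as follows. This is only an illustration of the branch topology: the helper names (`scale`, `energy_vad`, `split_three_ways`) are hypothetical, and the `scale` stages are placeholders standing in for real noise reduction, nonlinear echo suppression and AGC blocks, which the source does not specify.

```python
from typing import Callable, List, Tuple

Frame = List[float]
Stage = Callable[[Frame], Frame]


def scale(gain: float) -> Stage:
    """Placeholder stage: stands in for a real NR, echo-suppression or AGC block."""
    return lambda frame: [gain * s for s in frame]


def energy_vad(threshold: float) -> Callable[[Frame], int]:
    """Placeholder VAD: returns 1 if the mean frame energy exceeds the threshold, else 0."""
    def detect(frame: Frame) -> int:
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        return 1 if energy > threshold else 0
    return detect


def split_three_ways(
    beamformed: Frame,
    nr1: Stage, nles1: Stage, vad: Callable[[Frame], int],
    nr2: Stage, agc1: Stage,
    nr3: Stage, nles2: Stage, agc2: Stage,
) -> Tuple[int, Frame, Frame]:
    """Feed the same beamformed frame to three independently configured branches."""
    x = vad(nles1(nr1(beamformed)))      # path 1: NR -> nonlinear echo suppression -> VAD
    y = agc1(nr2(beamformed))            # path 2: NR -> AGC (no nonlinear echo suppression)
    z = agc2(nles2(nr3(beamformed)))     # path 3: NR -> nonlinear echo suppression -> AGC
    return x, y, z
```

Because every branch receives its own stage objects, the parameters of one path can be changed without touching the other two, which is the decoupling the text describes.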

Preferably, in step S1, the reference signal is obtained from a speaker or from sound card driver/voice playing software.

Preferably, in step S1, an adaptive filter is used to perform linear echo cancellation processing on each microphone signal together with the reference signal.
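The source does not say which adaptive filter is used. As one common possibility, the sketch below uses a normalized LMS (NLMS) filter for a single microphone channel; the function name `nlms_echo_cancel` and the parameters `taps`, `mu` and `eps` are assumptions made for illustration. In a multi-microphone system the same routine would be run once per microphone against the shared reference signal.

```python
from typing import List, Sequence


def nlms_echo_cancel(
    mic: Sequence[float],
    ref: Sequence[float],
    taps: int = 8,
    mu: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Subtract an adaptively estimated echo of `ref` from `mic` (NLMS sketch)."""
    weights = [0.0] * taps
    history = [0.0] * taps            # most recent reference sample first
    residual: List[float] = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate     # echo-cancelled output sample
        norm = sum(h * h for h in history) + eps
        step = mu * error / norm      # step size normalized by reference power
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(error)
    return residual
```

A linear filter of this kind can only model the linear part of the loudspeaker-to-microphone path, which is consistent with the text's later use of nonlinear echo suppression on the residual.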

Preferably, in step S2, beamforming of the multi-path microphone signals requires the direction of arrival (DOA, i.e. the arrival angle) to be known; the DOA is calculated according to a preset DOA estimation method.
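The source names neither the beamformer nor the DOA estimator. To show why the beamformer needs the DOA, the sketch below assumes a delay-and-sum beamformer on a uniform linear array with a far-field source; the function names, the array geometry and the integer-sample rounding are illustrative assumptions, and DOA estimation itself is not shown.

```python
import math
from typing import List, Sequence


def steering_delays(
    num_mics: int,
    spacing_m: float,
    doa_deg: float,
    sample_rate: int,
    speed_of_sound: float = 343.0,
) -> List[int]:
    """Integer sample delay of a plane wave at each microphone, relative to microphone 0."""
    tau = spacing_m * math.cos(math.radians(doa_deg)) / speed_of_sound
    return [round(m * tau * sample_rate) for m in range(num_mics)]


def delay_and_sum(channels: Sequence[Sequence[float]], delays: Sequence[int]) -> List[float]:
    """Advance each channel by its delay so the wavefronts align, then average."""
    base = min(delays)
    shifts = [d - base for d in delays]          # non-negative read offsets
    length = min(len(ch) - s for ch, s in zip(channels, shifts))
    return [
        sum(ch[n + s] for ch, s in zip(channels, shifts)) / len(channels)
        for n in range(length)
    ]
```

Signals arriving from the steered direction add coherently after alignment, while sound from other directions is averaged with mismatched delays and is attenuated.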

Preferably, in step S2, the voice presence detection result X and the voice recognition signal Y are combined into two sound channels as follows: the voice presence detection result X is placed alone on one channel and the voice recognition signal Y is placed alone on the other channel. For example, the left channel stores the voice signal and the right channel stores the VAD information, where 0 indicates no voice and a non-zero value indicates voice.
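A minimal sketch of this two-channel layout, assuming interleaved integer PCM with the left sample first. The function names and the `vad_value` parameter are hypothetical; the source only requires that 0 mean no voice and any non-zero value mean voice.

```python
from typing import List, Sequence, Tuple


def pack_two_channels(
    speech: Sequence[int],
    vad_flags: Sequence[int],
    vad_value: int = 1,
) -> List[int]:
    """Interleave left = voice recognition signal Y, right = VAD result X."""
    interleaved: List[int] = []
    for sample, flag in zip(speech, vad_flags):
        interleaved.append(sample)                      # left channel: speech
        interleaved.append(vad_value if flag else 0)    # right channel: VAD
    return interleaved


def unpack_two_channels(interleaved: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Recover the speech samples and a 0/1 VAD flag per sample."""
    speech = list(interleaved[0::2])
    vad = [1 if v != 0 else 0 for v in interleaved[1::2]]
    return speech, vad
```

The speech recognition APP can then open an ordinary stereo input and read both the audio and the VAD decision from the same stream.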

Preferably, in step S2, the voice presence detection result X and the voice recognition signal Y are combined into two sound channels as follows: a certain bit of the voice recognition signal Y is used to store the voice presence detection result X. For example, X is stored in the lowest bit of Y: a lowest bit of 0 indicates no voice, and a lowest bit of 1 indicates voice. A normal voice signal is 16 bit or 24 bit, so replacing the lowest bit with 0 or 1 produces a change that is submerged by noise and has almost no effect on the original recognition rate.
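The lowest-bit variant can be sketched as below for signed integer PCM samples. The function names are hypothetical. Python integers behave like two's-complement values under bitwise operators, so the same expressions apply to 16-bit or 24-bit samples.

```python
from typing import List, Sequence


def embed_vad_lsb(samples: Sequence[int], vad_flags: Sequence[int]) -> List[int]:
    """Overwrite the lowest bit of each sample with the VAD flag (0 or 1)."""
    return [(s & ~1) | (1 if flag else 0) for s, flag in zip(samples, vad_flags)]


def extract_vad_lsb(samples: Sequence[int]) -> List[int]:
    """Read the VAD flag back from the lowest bit of each sample."""
    return [s & 1 for s in samples]
```

Each sample changes by at most one quantization step, which for 16-bit audio is far below typical microphone noise; this is why the text expects almost no effect on recognition.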

Preferably, the multi-path microphone signals are acquired from the multiple microphone hardware through the sound card driver and are sent to the signal processing service program, the signal processing service program processes according to the method, the processed signals are stored in the virtual sound card driver, and the virtual sound card driver simulates multiple audio input ports for providing the processed microphone signals for the voice recognition APP and other APPs respectively. For example, an audio stream formed by combining the speech presence detection result X and the speech recognition signal Y is provided for the speech recognition APP, and an audio stream of the speech application signal Z is provided for other APPs such as the recording APP and the communication APP.

The structure of a signal processing service program plus a virtual sound card driver is adopted for the following reasons:

1. strong universality: the upper-layer interface is uniform, each APP does not need to perform the processing separately, and repeated calculation is avoided;

2. strong independence: the whole processing method is executed in the signal processing service program, so there are fewer development constraints, and the algorithm and code of the signal processing service program can be debugged, updated and deployed independently;

3. the signal processing service program is an application-level service program, so development difficulty is low, resource limitations are few, and debugging is convenient;

4. placing the VAD together with the signal processing makes more information available, such as the reference signal and various intermediate data of the signal processing, and using this information improves the VAD result.

In a second aspect, an embodiment of the present invention provides a microphone signal processing apparatus, including:

a linear echo cancellation module: configured to perform linear echo cancellation processing on the multi-path microphone signals together with the reference signal, to cancel the loudspeaker sound picked up by the microphones;

a beamforming module: configured to perform beamforming on the multi-path microphone signals output by the linear echo cancellation module;

a first noise reduction module: configured to perform noise reduction processing on the first path of the beamformed signal;

a first nonlinear echo suppression module: configured to perform nonlinear echo suppression processing on the signal output by the first noise reduction module;

a voice presence detection module: configured to perform voice presence detection on the signal output by the first nonlinear echo suppression module to obtain a voice presence detection result X;

a second noise reduction module: configured to perform noise reduction processing on the second path of the beamformed signal;

a first automatic gain control module: configured to perform automatic gain control on the signal output by the second noise reduction module to obtain a voice recognition signal Y for voice recognition;

a signal merging module: configured to combine the voice presence detection result X and the voice recognition signal Y into two sound channels (left and right) provided to the voice recognition APP;

a third noise reduction module: configured to perform noise reduction processing on the third path of the beamformed signal;

a second nonlinear echo suppression module: configured to perform nonlinear echo suppression processing on the signal output by the third noise reduction module;

a second automatic gain control module: configured to perform second automatic gain control processing on the signal output by the second nonlinear echo suppression module to obtain a voice application signal Z for recording or communication APPs.

In a third aspect, an embodiment of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any one of the methods described above.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which program, when executed by a processor, performs the steps of any one of the methods described above.

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

1. according to the different requirements of the voice recognition APP and other voice APPs, the signal is branched into three paths: one path is used for voice presence detection, and the other paths undergo voice signal processing such as noise reduction, nonlinear echo suppression and automatic gain control; the parameters and algorithms of the three signal processing parts can be adjusted independently without affecting each other;

2. the information of the voice existence detection result X is directly mixed into the voice recognition signal Y, an additional channel is not needed to be added to provide VAD information, the implementation is convenient, and the implementation framework and the structure of the original system are not needed to be changed.

Drawings

Fig. 1 is a flowchart of a microphone signal processing method according to embodiment 1 of the present invention.

Fig. 2 is a schematic diagram of a left channel storing a voice signal and a right channel storing VAD information according to embodiment 1 of the present invention.

Fig. 3 is a schematic diagram of a microphone signal processing apparatus according to embodiment 2 of the present invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

As shown in fig. 1, an embodiment of the present invention provides a microphone signal processing method, including the following steps:

After first noise reduction processing, the first path of the signal undergoes first nonlinear echo suppression processing to further suppress residual echo, and voice presence detection (VAD) is then performed to obtain a voice presence detection result X. The reference signal of step S1 is also needed for nonlinear echo suppression. The linear echo cancellation part usually cannot completely cancel the loudspeaker sound picked up by the microphones, so the residual echo is suppressed further in order that voice presence detection (VAD) can be performed more reliably.

When the noise reduction algorithm is executed, a noise estimation value needs to be known, and the noise estimation value is obtained through calculation according to a preset noise estimation method. Here, a conventional noise estimation method may be used.
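The source leaves the noise estimation method open ("a conventional noise estimation method may be used"). As one simple stand-in, the sketch below tracks a noise floor over per-frame energies by following drops immediately and letting the estimate rise only slowly, then derives a spectral-subtraction style gain. The function names and the `rise` and `floor` parameters are assumptions chosen for illustration, not values from the source.

```python
from typing import List, Sequence


def track_noise_floor(frame_energies: Sequence[float], rise: float = 1.05) -> List[float]:
    """Per-frame noise energy estimate: fast fall, slow rise."""
    estimates: List[float] = []
    if not frame_energies:
        return estimates
    noise = frame_energies[0]
    for energy in frame_energies:
        if energy < noise:
            noise = energy                                   # follow drops immediately
        else:
            noise = min(max(noise, 1e-12) * rise, energy)    # creep upward slowly
        estimates.append(noise)
    return estimates


def suppression_gain(energy: float, noise: float, floor: float) -> float:
    """Gain in [floor, 1]; a lower floor means stronger noise reduction."""
    if energy <= 0.0:
        return floor
    return max(1.0 - noise / energy, floor)
```

Under these assumptions, the VAD path could use a low `floor` for aggressive suppression while the recognition path uses a higher `floor` to limit speech distortion, in line with the text's reason for using different noise reduction on each path.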

In step S1, the reference signal is obtained from a speaker, or from sound card driver/voice playing software.

In step S1, an adaptive filter is used to perform linear echo cancellation processing on each microphone signal together with the reference signal.

In step S2, beamforming of the multi-path microphone signals requires the direction of arrival (DOA, i.e. the arrival angle) to be known; the DOA is calculated according to a preset DOA estimation method.

In step S2, the speech presence detection result X and the speech recognition signal Y are combined into two sound channels, and the specific method is as follows: the speech presence detection result X is placed solely on one of the channels and the speech recognition signal Y is placed solely on the other channel. As shown in fig. 2, the left channel stores a voice signal, the right channel stores VAD information, 0 indicates no voice, and non-0 indicates voice.

In step S2, the voice presence detection result X and the voice recognition signal Y may also be combined into two sound channels as follows: a certain bit of the voice recognition signal Y is used to store the voice presence detection result X. For example, X is stored in the lowest bit of Y: a lowest bit of 0 indicates no voice, and a lowest bit of 1 indicates voice. A normal voice signal is 16 bit or 24 bit, so replacing the lowest bit with 0 or 1 produces a change that is submerged by noise and has almost no effect on the original recognition rate.

The structure of a signal processing service program plus a virtual sound card is adopted for the following reasons:

2. the independence is strong, and the algorithm and the code of the signal processing service program can be debugged, updated and deployed independently;

The scheme of the embodiment can be used in a video conference machine/a preschool education machine. In consideration of recording, remote education, voice control and other functions, the whole machine needs to have a microphone input and a loudspeaker to output sound. The effect of recording and speech recognition can be seriously affected by the requirement of longer pickup distance and the interference of loudspeaker signals. Therefore, a pre-processing module of the microphone signal is needed to remove the loudspeaker echo signal and the noise signal in the environment contained in the microphone signal, and adjust the signal amplitude to a proper amplitude to send to the recording software or the voice recognition module for recognition. Meanwhile, in order to ensure that the microphone signal is not sent to the voice recognition module when no voice exists, VAD is needed to detect whether the voice signal exists at present, and only when the voice signal exists, the microphone data is sent to the voice recognition module for recognition. The speech recognition module and recording software can work independently at the user application level without concern for portions of speech signal processing. This arrangement allows the use of a very low cost (because there is no signal processing) microphone module, with the signal processing part being located on the main CPU of the system.
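The gating described above, where microphone data reaches the recognizer only while voice is present, can be sketched as follows. The function name `gate_frames` is hypothetical, and the `hangover` parameter (a few extra frames kept after voice ends so that word endings are not clipped) is an assumption added for illustration; the source does not mention it.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def gate_frames(frames: Sequence[T], vad_flags: Sequence[int], hangover: int = 2) -> List[T]:
    """Keep frames flagged as voice, plus up to `hangover` frames after each voiced frame."""
    kept: List[T] = []
    remaining = 0
    for frame, flag in zip(frames, vad_flags):
        if flag:
            remaining = hangover      # restart the hangover window on every voiced frame
            kept.append(frame)
        elif remaining > 0:
            remaining -= 1
            kept.append(frame)
    return kept
```

With `hangover=0` the function reduces to the strict behavior in the text: only frames whose VAD flag is non-zero are forwarded.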

Example 2

As shown in fig. 3, embodiment 2 of the present invention provides a microphone signal processing apparatus, including:

Example 3

Embodiment 3 of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, it implements the steps of any one of the methods described above. In this embodiment, the processor is the control center of the computer system, and may be a processor of a physical machine or a processor of a virtual machine.

Example 4

Embodiment 4 of the present invention provides a computer-readable storage medium on which a computer program is stored, the program being executed by a processor to perform the steps of any one of the methods described above. The computer-readable storage medium may include, but is not limited to, any type of disk including floppy disks, optical disks, DVD, CD-ROMs, microdrive, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.

It is clear to a person skilled in the art that the solution according to the embodiments of the invention can be implemented by means of software and/or hardware. The "unit" or "module" in the present specification means software and/or hardware capable of performing a specific function by itself or in cooperation with other components.

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.

Claims

1. A microphone signal processing method, characterized by comprising the steps of:

S1: performing linear echo cancellation processing on a plurality of paths of microphone signals together with a reference signal, to cancel the loudspeaker sound picked up by the microphones;

S2: performing beamforming on the multi-path microphone signals after the linear echo cancellation, and dividing the beamformed signal into three paths, wherein:

after first noise reduction processing, performing first nonlinear echo suppression processing on one path of signal to further suppress residual echo, and then performing voice existence detection to obtain a voice existence detection result X;

the second path of signal is subjected to second noise reduction processing and then is subjected to first automatic gain control processing to obtain a voice recognition signal Y for voice recognition;

2. The microphone signal processing method according to claim 1, wherein in step S1, the reference signal is obtained from a speaker or from sound card driver/voice playing software.

3. The microphone signal processing method according to claim 1, wherein in step S1, the adaptive filter is used to perform linear echo cancellation processing on each microphone signal and the reference signal together.

4. The method as claimed in claim 1, wherein in step S2, the arrival angle is required to be known when the multi-path microphone signal is processed by beamforming, and the arrival angle is calculated according to a predetermined arrival angle estimation method.

5. The microphone signal processing method of claim 1, wherein in step S2, the voice presence detection result X and the voice recognition signal Y are combined into two channels, specifically: the speech presence detection result X is placed solely on one of the channels and the speech recognition signal Y is placed solely on the other channel.

6. The microphone signal processing method of claim 1, wherein in step S2, the voice presence detection result X and the voice recognition signal Y are combined into two channels, specifically: a certain bit of the speech recognition signal Y is used to store the presence detection result X.

7. The microphone signal processing method according to any one of claims 1 to 6, wherein the multiple microphone signals are acquired from multiple microphone hardware by a sound card driver and sent to a signal processing service program, the signal processing service program performs processing according to the method, the processed signals are stored in a virtual sound card driver, and the virtual sound card driver simulates multiple audio input ports for providing the processed microphone signals for the speech recognition APP and other APPs, respectively.

8. A microphone signal processing apparatus, comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1-7 are implemented when the program is executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.