CN109602333B

CN109602333B - Voice denoising method and chip based on cleaning robot

Info

Publication number: CN109602333B
Application number: CN201811512538.5A
Authority: CN
Inventors: 许登科
Original assignee: Zhuhai Amicro Semiconductor Co Ltd
Current assignee: Zhuhai Amicro Semiconductor Co Ltd
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2020-11-03
Anticipated expiration: 2038-12-11
Also published as: CN109602333A

Abstract

The invention discloses a voice denoising method and a chip based on a cleaning robot. The method comprises the following steps: step 1, determining a target voice signal from the voice signals acquired by a microphone array and acquiring the corresponding target confidence value; step 2, judging whether the noise database contains preset noise data whose confidence value differs from the target confidence value by an absolute value smaller than a preset noise threshold, and if so, entering step 3; step 3, performing inverse processing on the noise data to obtain an inverse noise signal, and then mixing and superposing the inverse noise signal with the target voice signal to obtain a pre-denoising processing result; step 4, according to the relation between the pre-denoising processing result and a preset threshold, marking the sound frame corresponding to the pre-denoising processing result as a denoised sound frame in the target voice signal; wherein the target voice signal comprises voiced frames associated with control commands. Because the target voice signal undergoes pre-denoising processing, the denoising precision of the voice signal is improved.

Description

Voice denoising method and chip based on cleaning robot

Technical Field

The invention belongs to the technical field of robots, and particularly relates to a voice denoising method and a chip based on a cleaning robot.

Background

Voice pickup devices currently on the market can capture the speech uttered by a user, but they generally also capture the noise the robot produces while it is working. The picked-up speech signal is therefore mixed with a large amount of noise and speech recognition accuracy is low. This seriously affects the robot's recognition of external speech (the effective signal) and any logic decision it makes based on its interpretation of that speech (for example, executing the relevant path planning).

In the prior art, a cleaning robot with a voice recognition function does not preprocess the collected voice signal during front-end denoising, which reduces the accuracy of voice recognition.

Disclosure of Invention

In order to overcome the technical defects, the invention provides the following technical scheme:

A voice denoising method based on a cleaning robot is applied to a mobile robot whose base is provided with a microphone array with a fixed orientation, and comprises the following steps: step 1: determining a target voice signal from the voice signals acquired by the microphone array, and acquiring the corresponding target confidence value; step 2: judging whether the noise database contains preset noise data whose confidence value differs from the target confidence value by an absolute value smaller than a preset noise threshold, and if so, entering step 3; step 3: performing inverse processing on the preset noise data to obtain an inverse noise signal, and then mixing and superposing the inverse noise signal with the target voice signal to obtain a pre-denoising processing result; step 4: according to the relation between the pre-denoising processing result and a preset threshold, marking the sound frame corresponding to the pre-denoising processing result as a denoised sound frame in the target voice signal; wherein the target voice signal comprises voiced frames associated with control commands. Through the pre-denoising processing, the method selectively performs coarse denoising on the voiced frames of the target voice signal, and then flexibly adjusts the confidence value according to the real-time matching degree between the noise signal and the noise database to perform fine denoising, thereby improving the denoising precision.

Further, step 1 specifically includes: recognizing, through a voice engine, the voiced frames of the voice signals acquired from the microphone array; when the signal-to-noise ratio of a voiced frame is greater than a preset signal-to-noise ratio threshold, determining the voice signal corresponding to that voiced frame as the target voice signal; and then extracting the target confidence value corresponding to the target voice signal from the voiced frame, wherein each voiced frame carries a confidence value and a signal-to-noise ratio derived from voice recognition. Screening the target voice signal with a preset signal-to-noise ratio threshold allows specific voice signals to be recognized and processed in a targeted manner, which improves the accuracy of voice recognition in a noisy environment.

Further, step 3 specifically includes: step 301, judging whether the pre-denoising processing result is greater than the preset threshold; if so, entering step 302, otherwise entering step 303; step 302, marking the sound frame corresponding to the pre-denoising processing result as a denoised sound frame in the target voice signal; step 303, judging whether the absolute value of the difference between the confidence value of the pre-denoising processing result and the target confidence value is smaller than a confidence threshold; if so, marking the sound frame corresponding to the pre-denoising processing result as a denoised sound frame in the target voice signal; otherwise, adjusting the target confidence value and returning to step 2. Because the pre-denoising processing result is judged twice, every voiced frame in the target voice signal is processed, which makes the denoising more complete and improves its accuracy.

Further, the target confidence value is adjusted as follows: the current target confidence value is increased or decreased according to the difference between the confidence value of the unmarked voiced frames in the target voice signal and the current target confidence value. This supports the subsequent judgment and screening of the unmarked voiced frames and improves the accuracy of the iterative process.

A chip is used for storing program code corresponding to the voice denoising method. Because the chip adds a pre-denoising processing function, the denoising precision of the voice signal is improved.

Compared with the prior art, after the target voice signal is obtained, the technical scheme of the invention selectively denoises the voiced frames of the target voice signal during pre-denoising. It sets thresholds and marks the currently denoised voiced frames according to the real-time matching degree between the confidence value of the noise signal and the confidence values in the noise database, and flexibly adjusts the confidence value to improve denoising efficiency, so that denoising is more thorough.

Drawings

Fig. 1 is a flowchart of a voice denoising method based on a cleaning robot according to an embodiment of the present invention.

Fig. 2 is a flowchart of a voice denoising method based on a cleaning robot according to another embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described in detail below with reference to the accompanying drawings in the embodiments of the present invention.

Referring to fig. 1, an embodiment of the present invention provides a voice denoising method based on a cleaning robot, specifically including:

Step S101: a voice signal arriving from a specific direction is acquired from the microphone array on the base of the cleaning robot, and a target voice signal is determined by information-domain analysis against a database pre-stored in the voice engine. This realizes directional voice pickup and reduces external noise interference. The target voice signal comprises a control command spoken by the user or voice data input by a machine, and a target confidence value is obtained from the target voice signal. In this embodiment, the target confidence value expresses how credible the mobile robot considers a specific voice signal to be, i.e., it is a numerical measure of the reliability of the preliminary recognition result. To reduce misjudgment, the correctness of the recognition result is judged against a confidence threshold before the result is presented. For example, if the target voice signal spoken by the user is "return to charge", the target confidence value returned during recognition of the voice data includes the sentence confidence N. The process then proceeds to step S102.

Optionally, the voice engine may identify the voiced frames of the voice signals acquired by the microphone array, for example by applying a relevant speech-feature detection algorithm. Because the target voice signal comprises voiced frames associated with control instructions, it can be divided into a plurality of speech frames associated with the user's utterance; these speech frames include voiced frames and unvoiced frames, which can be classified by various known techniques. When the signal-to-noise ratio of a voiced frame is greater than the preset signal-to-noise ratio threshold, the voice signal corresponding to that voiced frame is determined as the target voice signal, and the target confidence value corresponding to the target voice signal is then extracted from the voiced frame, wherein each voiced frame carries a confidence value and a signal-to-noise ratio derived from voice recognition.
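As an illustrative sketch only, the frame screening described above can be expressed in Python. The `VoicedFrame` record, the function names, and the choice of averaging the confidences of the retained frames into a single target confidence value are assumptions made for this example; the description does not state how the target confidence value is aggregated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VoicedFrame:
    # Hypothetical per-frame record: the description states only that a
    # voiced frame carries a recognition confidence value and an SNR value.
    index: int
    confidence: float
    snr_db: float


def select_target_frames(frames: List[VoicedFrame],
                         snr_threshold_db: float) -> List[VoicedFrame]:
    """Keep only voiced frames whose SNR exceeds the preset SNR threshold."""
    return [f for f in frames if f.snr_db > snr_threshold_db]


def target_confidence(frames: List[VoicedFrame]) -> Optional[float]:
    """Assumed aggregation: mean confidence of the selected frames."""
    if not frames:
        return None  # no frame passed the SNR screen
    return sum(f.confidence for f in frames) / len(frames)
```

With a 6 dB threshold, for example, a frame at 3 dB SNR is discarded while frames at 9 dB and 12 dB are kept and contribute to the target confidence value.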

It should be noted that the noise energy contained in a voiced frame can be measured by its signal-to-noise ratio, which is the ratio of the power of the voice data to the power of the noise data and is usually expressed in decibels; in general, a higher signal-to-noise ratio indicates lower noise power, and vice versa. The noise energy level reflects how much noise energy is present in the user's voice data. Together, the signal-to-noise ratio and the noise energy level indicate the noise level.
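The signal-to-noise ratio mentioned above follows the usual definition SNR(dB) = 10 log10(P_voice / P_noise). A minimal Python sketch, assuming the voice and noise components are available as separate sample sequences (which the description does not state) and taking power as the mean of the squared samples; the function names are hypothetical:

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power of a discrete signal: the mean of the squared samples."""
    if not samples:
        raise ValueError("empty signal")
    return sum(x * x for x in samples) / len(samples)


def snr_db(voice: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in decibels: 10 * log10(P_voice / P_noise)."""
    p_noise = mean_power(noise)
    p_voice = mean_power(voice)
    if p_noise == 0.0:
        return math.inf   # no noise power at all
    if p_voice == 0.0:
        return -math.inf  # no voice power at all
    return 10.0 * math.log10(p_voice / p_noise)
```

A voice amplitude ten times the noise amplitude gives a power ratio of 100, i.e. an SNR of 20 dB.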

Step S102: according to the noise data corresponding to the voiced frames contained in the target voice signal, preset noise data are searched for in the noise database, and it is judged whether the absolute value of the difference between the target confidence value and the confidence value of the preset noise data is smaller than a preset noise threshold. If so, the preset noise data are determined to be the noise data matching the target confidence value, and the process enters step S103.
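The matching test of step S102 can be sketched as follows. The `NoiseEntry` record and the `match_noise` function are hypothetical, and choosing the closest entry when several entries fall within the preset noise threshold is an assumption; the description only requires the absolute difference to be smaller than the threshold.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class NoiseEntry:
    # Hypothetical database record: a stored noise waveform plus the
    # confidence value associated with it.
    name: str
    confidence: float
    samples: Sequence[float]


def match_noise(db: List[NoiseEntry], target_conf: float,
                noise_threshold: float) -> Optional[NoiseEntry]:
    """Return stored noise whose |confidence - target| < threshold, or None."""
    candidates = [e for e in db
                  if abs(e.confidence - target_conf) < noise_threshold]
    if not candidates:
        return None  # step 3 is not entered
    # Assumption: prefer the entry closest to the target confidence value.
    return min(candidates, key=lambda e: abs(e.confidence - target_conf))
```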

In the embodiment of the present invention, the noise generated in the robot's working environment is relatively stable. The target voice signal collected by the microphone array also includes sound transmitted from inside the cleaning robot to the outside; specifically, the noise generated when an executing component inside the robot (such as a motor) operates, and the noise generated by internal mechanical friction or vibration while the robot moves, are transmitted out of the robot body.

Preferably, the target voice signal may be compared with all noise data in the noise database to obtain a set of voice similarity values, and the preset noise threshold may then be determined from a weighted average of these similarity values. In addition, multiple noise databases may be used, and the result with the highest recognition rate among them may be selected as the final matching result, which improves the recognition rate of the robot's working noise.
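The description does not name the similarity measure used to compare the target voice signal with the stored noise data. The sketch below assumes cosine similarity between equal-length, time-aligned sample sequences and caller-supplied weights (uniform by default); both choices, and the function names, are illustrative assumptions rather than the patented procedure.

```python
import math
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed similarity measure; expects equal-length, time-aligned signals."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a silent signal is treated as dissimilar
    return dot / (norm_a * norm_b)


def noise_threshold_from_similarities(
        target: Sequence[float],
        noise_signals: List[Sequence[float]],
        weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of the similarities between target and each stored noise."""
    sims = [cosine_similarity(target, n) for n in noise_signals]
    if weights is None:
        weights = [1.0] * len(sims)  # uniform weights by default
    total = sum(weights)
    if total == 0.0:
        raise ValueError("weights must not sum to zero")
    return sum(w * s for w, s in zip(weights, sims)) / total
```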

Step S103: the matched noise data and the unmarked voiced frames in the target voice signal undergo pre-denoising processing to obtain a pre-denoising processing result corresponding to the noise data. Specifically, the pre-denoising processing includes: first, performing inverse processing on the preset noise data to obtain an inverse noise signal; then mixing and superposing the inverse noise signal with the target voice signal to obtain a pre-denoising processing result corresponding to the preset noise data. This cancels the noise signal in the target voice signal and yields the pre-denoised voice information.

Step S104: according to the relation between the pre-denoising processing result and a preset threshold, the sound frame corresponding to the pre-denoising processing result is marked as a denoised sound frame in the target voice signal, wherein the target voice signal comprises voiced frames associated with control commands.
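Inverse processing followed by mixed superposition amounts to adding the phase-inverted noise to the target signal, i.e. subtracting the noise sample by sample. A minimal sketch, assuming the stored noise data are already time-aligned and gain-matched with the target voice signal (alignment is not discussed in the description); the function names are hypothetical:

```python
from typing import List, Sequence


def invert(noise: Sequence[float]) -> List[float]:
    """Phase inversion: multiply every sample by -1."""
    return [-x for x in noise]


def pre_denoise(target: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Superpose the inverted noise on the target signal sample by sample."""
    if len(target) != len(noise):
        raise ValueError("target and noise must have the same length")
    inverted = invert(noise)
    return [t + n for t, n in zip(target, inverted)]
```

When the stored noise exactly matches the noise actually present, the original voice samples are recovered; any mismatch in timing or level would leave residual noise in the result.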

As another embodiment, as shown in Fig. 2, step S104 may include: step S1041, judging whether the pre-denoising processing result is greater than the preset threshold; if so, entering step S1042, otherwise entering step S1043. The preset threshold is stored in advance and measures the distortion of the voice signal. The remaining unmarked voiced frames of the target voice signal may already have been denoised by the pre-denoising processing without their result exceeding the preset threshold, so they require further judgment and screening to reduce misjudgment.

Step S1042: the voiced frame corresponding to the pre-denoising processing result is marked as a denoised voiced frame in the target voice signal. A pre-denoising processing result greater than the preset threshold indicates that the unwanted noise has been removed from that voiced frame of the target voice signal, i.e., the noise no longer affects recognition of that part of the voice signal.

If the pre-denoising processing result is not greater than the preset threshold, further denoising adjustment is needed so that every voiced frame in the received target voice signal is processed. This makes the denoising more thorough, improves the completeness of denoising of the target voice signal, and improves the accuracy with which it is recognized.

Step S1043: it is judged whether the absolute value of the difference between the confidence value of the pre-denoising processing result and the target confidence value is smaller than a confidence threshold; if so, the process enters step S1044, otherwise step S1045. Given that the pre-denoising processing result is not greater than the preset threshold, the confidence value of the pre-denoising processing result expresses the credibility of the recognition result for the pre-denoised target voice signal, and the confidence threshold serves as an evaluation index of the correct recognition rate of the interfered target voice signal. This judgment allows the noise in the remaining unmarked voiced frames of the target voice signal to be processed further, improving the comprehensiveness and accuracy of denoising.

Step S1044: the sound frame corresponding to the pre-denoising processing result is marked as a denoised sound frame in the target voice signal. This denoises the voiced frames that were left unmarked by the screening in step S1042, improving the precision of voice recognition. If the target voice signal still contains unmarked voiced frames at this point, the noise data matched with the current target confidence value have only a limited pre-denoising effect on them.
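The two-stage judgment of steps S1041 to S1045 can be summarized as a small decision function. The description does not say how the pre-denoising processing result is reduced to a number, so `result_score` is an assumed scalar (for example, a quality measure of the pre-denoised frame); all names here are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    MARK = "mark"      # steps S1042 / S1044: mark the frame as denoised
    ADJUST = "adjust"  # step S1045: adjust the target confidence, back to S102


def classify_frame(result_score: float, result_conf: float,
                   target_conf: float, preset_threshold: float,
                   conf_threshold: float) -> Decision:
    # Step S1041: result above the preset threshold -> step S1042
    if result_score > preset_threshold:
        return Decision.MARK
    # Step S1043: confidence close enough to the target -> step S1044
    if abs(result_conf - target_conf) < conf_threshold:
        return Decision.MARK
    # Otherwise -> step S1045
    return Decision.ADJUST
```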

Step S1045: the current target confidence value is increased or decreased according to the difference between the confidence value of the unmarked voiced frames in the target voice signal and the current target confidence value. In this embodiment, when the confidence value of the unmarked voiced frames is greater than the current target confidence value, the current target confidence value is increased accordingly; otherwise it is decreased. The process then returns to step S102, and the noise data matching the adjusted target confidence value are selected for further denoising. This is a parameter-correction process based on the current target confidence value: the voiced frames are judged again with the corrected parameter, and the cycle repeats over multiple iterations until all voiced frames in the target voice signal have been denoised. Adjusting the confidence value flexibly according to the real-time matching degree between the noise signal and the noise database improves denoising efficiency. The denoised sound frames in the target voice signal are then converted into a voice control instruction to control the mobile robot. Because the target voice signal contains periodic components, the iterations of the method follow a periodic pattern, which prevents the target confidence value from being corrected at random, speeds up the judgment of the target voice signal, and improves denoising efficiency.
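The adjustment rule of step S1045 and the iteration back to step S102 can be sketched as below. The fixed `step` size, the use of the mean confidence of the unmarked frames, the `try_frame` callback (standing in for matching, pre-denoising, and the S1041/S1043 checks), and the `max_iter` safeguard are all assumptions; the description only states that iteration continues until every voiced frame is denoised.

```python
from typing import Callable, Dict, List, Sequence


def adjust_target_confidence(target_conf: float,
                             unmarked_confs: Sequence[float],
                             step: float) -> float:
    """Step S1045: raise or lower the target confidence by a fixed step.

    Assumption: the direction is set by the mean confidence of the
    unmarked voiced frames; 'step' is a hypothetical tuning parameter.
    """
    if not unmarked_confs:
        raise ValueError("no unmarked frames to compare against")
    mean_conf = sum(unmarked_confs) / len(unmarked_confs)
    if mean_conf > target_conf:
        return target_conf + step
    if mean_conf < target_conf:
        return target_conf - step
    return target_conf


def denoise_loop(frame_confs: Dict[int, float], target_conf: float,
                 try_frame: Callable[[int, float], bool],
                 step: float = 0.05, max_iter: int = 20) -> List[int]:
    """Repeat steps S102 to S1045 until every frame is marked.

    'try_frame(frame_id, target_conf)' is a hypothetical callback that
    performs noise matching, pre-denoising and the S1041/S1043 checks and
    returns True when the frame is marked. 'max_iter' is an added guard.
    """
    marked: List[int] = []
    for _ in range(max_iter):
        for frame_id in [i for i in frame_confs if i not in marked]:
            if try_frame(frame_id, target_conf):
                marked.append(frame_id)
        remaining = [c for i, c in frame_confs.items() if i not in marked]
        if not remaining:
            break  # all voiced frames are denoised
        target_conf = adjust_target_confidence(target_conf, remaining, step)
    return marked
```

The iteration cap is added because, without it, a frame that never satisfies either judgment would keep the loop running indefinitely.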

According to the above technical scheme, in the noisy scenario of the working robot, the target voice signal uttered by the user is acquired, and, based on the target voice signal and the empirical data pre-stored in the noise database, that empirical noise data is inverted to suppress the noise in the target voice signal. Meanwhile, the relevant confidence value is adjusted flexibly according to the real-time matching degree between the noise signal and the noise database, and the denoised sound frames are screened out by judgment, which greatly improves the thoroughness of denoising and raises the speech recognition rate in a noisy environment.

A chip is used for storing program code corresponding to the voice denoising method. The chip is a dedicated integrated control chip that can parse internal or external control instructions and output corresponding control signals to drive the executing components of the robot to perform the corresponding actions.

Compared with the prior art, the voice denoising method uses the noise database to select matched noise data for inverse-phase superposition pre-denoising, which improves denoising precision. The pre-denoising may perform the signal subtraction with a subtraction circuit, or with an inverter combined with an addition circuit; these circuits may be integrated with a processor into a dedicated processing chip and configured according to design requirements. After the internal noise interference is filtered out, the processor analyzes the filtered signal to extract the external voice signal and converts it into a matching control instruction to control the robot. How the robot analyzes the external voice signal belongs to existing, realizable technology and is not described further here.

Finally, it should be noted that the above examples are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art will understand that the specific embodiments may be modified, or some technical features may be replaced by equivalents, without departing from the spirit of the present invention; all such modifications and substitutions are intended to fall within the scope of the invention as defined by the appended claims.

Claims

1. A voice denoising method based on a cleaning robot, characterized by comprising the following steps:

step 1: determining a target voice signal from the voice signals acquired by the microphone array, and correspondingly acquiring a target confidence value;

step 2: judging whether the noise database contains preset noise data whose confidence value differs from the target confidence value by an absolute value smaller than a preset noise threshold, and if so, entering step 3;

step 3: performing inverse processing on the preset noise data to obtain an inverse noise signal, and then mixing and superposing the inverse noise signal with the target voice signal to obtain a pre-denoising processing result;

step 4: according to the relation between the pre-denoising processing result and a preset threshold, marking the sound frame corresponding to the pre-denoising processing result as a denoised sound frame in the target voice signal;

wherein the target voice signal comprises voiced frames associated with control commands.

2. The voice denoising method according to claim 1, wherein step 1 specifically comprises:

recognizing, through a voice engine, the voiced frames of the voice signals acquired from the microphone array; when the signal-to-noise ratio of a voiced frame is greater than a preset signal-to-noise ratio threshold, determining the voice signal corresponding to that voiced frame as the target voice signal; and then extracting the target confidence value corresponding to the target voice signal from the voiced frame, wherein the voiced frame comprises a confidence value and a signal-to-noise ratio derived from voice recognition.

3. The voice denoising method according to claim 1, wherein step 3 specifically comprises:

step 301, judging whether the pre-denoising processing result is greater than the preset threshold; if so, entering step 302, otherwise entering step 303;

step 302, marking the sound frame corresponding to the pre-denoising processing result as a denoised sound frame in the target voice signal;

step 303, judging whether the absolute value of the difference between the confidence value of the pre-denoising processing result and the target confidence value is smaller than a confidence threshold; if so, marking the sound frame corresponding to the pre-denoising processing result as a denoised sound frame in the target voice signal; otherwise, adjusting the target confidence value and returning to step 2.

4. The voice denoising method according to claim 3, wherein the target confidence value is adjusted by: increasing or decreasing the current target confidence value according to the difference between the confidence value of the unmarked voiced frames in the target voice signal and the current target confidence value.

5. A chip, characterized in that the chip is used for storing program code corresponding to the voice denoising method of any one of claims 1 to 4.