CN109346099B

CN109346099B - Iterative denoising method and chip based on voice recognition

Info

Publication number: CN109346099B
Application number: CN201811512492.7A
Authority: CN
Inventors: 许登科
Original assignee: Zhuhai Amicro Semiconductor Co Ltd
Current assignee: Zhuhai Amicro Semiconductor Co Ltd
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2022-02-08
Anticipated expiration: 2038-12-11
Also published as: CN109346099A

Abstract

The invention discloses an iterative denoising method and a chip based on voice recognition. The method comprises the following steps. Step 1: determining a target voice signal and its target confidence value. Step 2: selecting noise data matched with the target confidence value from a noise database, and applying pre-denoising processing to the unmarked voiced frames in the target voice signal using the noise data. Step 3: judging whether the pre-denoising processing result is larger than a preset threshold value; if so, entering step 4, otherwise entering step 5. Step 4: marking the voiced frame corresponding to the pre-denoising processing result as a denoised voiced frame in the target voice signal. Step 5: judging whether the absolute value of the difference between the confidence value of the pre-denoising processing result and the target confidence value is smaller than a confidence threshold value; if so, marking the voiced frame corresponding to the pre-denoising processing result as a denoised voiced frame in the target voice signal; otherwise, adjusting the target confidence value and returning to step 2.

Description

Iterative denoising method and chip based on voice recognition

Technical Field

The invention belongs to the technical field of robots, and particularly relates to an iterative denoising method and chip based on voice recognition.

Background

Although the voice pickup equipment currently on the market can pick up voice signals uttered by a user, it generally also picks up the noise generated by the robot while it is working. A large amount of extraneous noise is therefore mixed into the picked-up voice signal, and the resulting voice recognition accuracy is low. This seriously impairs the robot's ability to recognize external speech (the effective signal) and to make logic judgments (for example, executing the relevant path planning) based on its interpretation of that speech.

In the prior art, front-end denoising of a voice signal generally selects a suitable voice signal according to the result of voice signal classification and suppresses the voice signals that do not meet the requirements. However, the classification method is complex, so denoising is incomplete, voice recognition efficiency is low, and some residual voice frames are always left unprocessed, which degrades the voice recognition result.

Disclosure of Invention

In order to overcome the technical defects, the invention provides the following technical scheme:

An iterative denoising method based on speech recognition, comprising:

step 1: determining a target voice signal from the voice signals acquired by the microphone array, and correspondingly acquiring a target confidence value;

step 2: selecting noise data matched with the target confidence value from a noise database, and applying pre-denoising processing to an unmarked voiced frame in the target voice signal using the noise data, so as to obtain a pre-denoising processing result corresponding to the noise data;

step 3: judging whether the pre-denoising processing result is larger than a preset threshold value; if so, entering step 4, otherwise entering step 5;

step 4: marking the voiced frame corresponding to the pre-denoising processing result as a denoised voiced frame in the target voice signal;

step 5: judging whether the absolute value of the difference between the confidence value of the pre-denoising processing result and the target confidence value is smaller than a confidence threshold value; if so, marking the voiced frame corresponding to the pre-denoising processing result as a denoised voiced frame in the target voice signal; otherwise, adjusting the target confidence value and returning to step 2;

wherein the target voice signal comprises voiced frames associated with control commands.

By judging the pre-denoising processing result twice, the iterative denoising method processes every voiced frame in the target voice signal, which makes denoising more thorough and improves voice denoising accuracy.

Further, step 1 specifically includes: recognizing, through a voice engine, a voiced frame of a voice signal acquired from the microphone array; when the signal-to-noise ratio of the voiced frame is greater than a preset signal-to-noise ratio threshold, determining the voice signal corresponding to the voiced frame as the target voice signal; and then extracting the target confidence value corresponding to the target voice signal from the voiced frame, wherein the voiced frame carries the confidence value and the signal-to-noise ratio derived from the voice recognition signal. Screening out the target voice signal with a preset signal-to-noise ratio threshold allows a specific voice signal to be recognized and processed in a targeted manner, which improves the accuracy of voice recognition in a noisy environment.

Further, in step 2, selecting noise data matched with the target confidence value from the noise database specifically includes: judging whether the noise database contains preset noise data whose confidence value differs from the target confidence value by an absolute value smaller than a preset noise threshold; if so, determining that preset noise data to be the noise data matched with the target confidence value. Selecting the matched preset noise data according to the real-time degree of matching between the confidence value of the noise signal and the noise database improves the accuracy of the denoising operation.

Further, the pre-denoising processing specifically includes: first, performing phase inversion on the noise data to obtain a phase-inverted noise signal; then mixing and superposing the phase-inverted noise signal with the target voice signal to obtain the pre-denoising processing result corresponding to the noise data. This pre-denoising processing is simple and efficient.

Further, the method for adjusting the target confidence value includes: increasing or decreasing the current target confidence value according to the difference between the confidence value of an unmarked voiced frame in the target voice signal and the current target confidence value. This facilitates subsequent judgment and screening based on the unmarked voiced frames in the target voice signal, and improves the accuracy of the iterative process.

A chip is used for storing program code corresponding to the iterative denoising method. The advantage is that denoising is applied selectively to the voiced frames of the target voice signal, thresholds are set intelligently, and the voiced frames that have already been denoised are marked, which suppresses the influence of noise on voice recognition and makes the denoising more thorough. After the target voice signal is obtained, denoising is carried out first to improve recognition accuracy. The confidence value is then adjusted flexibly according to the real-time degree of matching between the noise signal and the noise database, which improves denoising efficiency and, in turn, voice recognition efficiency.

Compared with the prior art, the technical scheme of the invention has the following advantage: after the target voice signal is obtained, during pre-denoising, the voiced frames of the target voice signal are selectively denoised according to the real-time degree of matching between the confidence value of the noise signal and the confidence values in the noise database. The confidence value is adjusted flexibly and combined with the matched data from the noise database, so that the denoising rate of the target voice signal is improved.

Drawings

Fig. 1 is a flowchart of an iterative denoising method based on speech recognition according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described in detail below with reference to the accompanying drawings in the embodiments of the present invention.

Referring to Fig. 1, an embodiment of the present invention provides an iterative denoising method based on speech recognition. In one implementation, the iterative denoising method includes:

Step S101, a voice signal arriving from a specific direction is acquired from the microphone array, and a target voice signal is determined through information domain analysis against a database pre-stored by the voice engine, so that directional voice pickup is achieved and external noise interference is reduced. The target voice signal includes a control command spoken by a user or voice data input by a machine, and a target confidence value is obtained from it accordingly. In this embodiment, the target confidence value expresses how far the mobile robot trusts a specific voice signal as authentic, and serves as a numerical measure of the credibility of the preliminary voice recognition result. To reduce misjudgment, the correctness of the recognition result is judged against a confidence threshold before the result is presented. For example, if the target voice signal spoken by the user is "call back charging", the target confidence value returned during voice data recognition includes the sentence confidence N. The process then proceeds to step S102.

Optionally, the voiced frames of the voice signals acquired from the microphone array may be identified by the voice engine, with the microphone array signals passed through a relevant speech characteristic detection algorithm. Because the target voice signal comprises voiced frames associated with control instructions, it can be converted into a plurality of speech frames associated with the user's utterance; these speech frames may comprise voiced frames and unvoiced frames, and the classification may be performed by various known techniques. When the signal-to-noise ratio of a voiced frame is greater than a preset signal-to-noise ratio threshold, the voice signal corresponding to that voiced frame is determined as the target voice signal, and the target confidence value corresponding to the target voice signal is then extracted from the voiced frame, wherein the voiced frame carries the confidence value and the signal-to-noise ratio derived from the voice recognition signal.
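The screening described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `VoicedFrame` record, its field names, and the `pick_target` function are hypothetical, and the sketch assumes that each voiced frame already carries a signal-to-noise ratio and a confidence value supplied by the voice engine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class VoicedFrame:
    """Hypothetical record for one voiced frame reported by the voice engine."""
    samples: List[float]  # audio samples of the frame
    snr_db: float         # signal-to-noise ratio of the frame, in decibels
    confidence: float     # recognition confidence attached to the frame


def pick_target(frames: List[VoicedFrame],
                snr_threshold_db: float) -> Optional[Tuple[VoicedFrame, float]]:
    """Return the first frame whose SNR exceeds the preset threshold,
    together with the target confidence value extracted from it.
    Returns None when no frame qualifies."""
    for frame in frames:
        if frame.snr_db > snr_threshold_db:
            return frame, frame.confidence
    return None
```

Taking the first qualifying frame is an assumption made for brevity; the text only requires that the signal-to-noise ratio exceed the preset threshold.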

It should be noted that the noise energy level contained in a voiced frame can be measured by its signal-to-noise ratio, which is the ratio of the power of the voice data to the power of the noise data and is usually expressed in decibels. In general, a higher signal-to-noise ratio indicates lower noise power, and vice versa. The noise energy level reflects the amount of noise energy in the user's voice data. Together, the signal-to-noise ratio and the noise energy level indicate the noise level.
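As a short worked example of this definition, the signal-to-noise ratio in decibels is ten times the base-10 logarithm of the power ratio. The helper names `mean_power` and `snr_db` are illustrative, and the sketch assumes that separate speech and noise sample sequences are available, which the text does not specify.

```python
import math
from typing import Sequence


def mean_power(samples: Sequence[float]) -> float:
    """Average power of a sample sequence (mean of squared amplitudes)."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """Signal-to-noise ratio in decibels: 10 * log10(P_speech / P_noise)."""
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise power is zero, so the SNR is unbounded")
    return 10.0 * math.log10(mean_power(speech) / noise_power)
```

With speech amplitudes ten times larger than the noise amplitudes, the power ratio is 100 and the result is about 20 dB.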

Step S102, noise data matched with the target confidence value is selected from a pre-configured noise database, and the process then proceeds to step S103. Specifically, according to the noise data corresponding to the voiced frames contained in the target voice signal, preset noise data is looked up in the pre-configured noise database, and it is judged whether the absolute value of the difference between the target confidence value and the confidence value of the preset noise data is smaller than a preset noise threshold; if so, that preset noise data is determined to be the noise data matched with the target confidence value. In this embodiment, because the noise generated in the robot's working area is relatively stable, the comparison is made against the pre-configured noise database without updating the noise data in real time. Compared with the prior art, this reduces the software load of voice recognition and leaves software resources available for the subsequent denoising processing.
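The matching rule of step S102 can be sketched as a lookup over the database. The `NoiseEntry` record and the `match_noise` function are hypothetical names. Returning the closest entry when several satisfy the condition is an assumption; the text only requires that the absolute difference be smaller than the preset noise threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NoiseEntry:
    """Hypothetical record for one preset noise sample in the database."""
    name: str
    confidence: float
    samples: List[float]


def match_noise(database: List[NoiseEntry],
                target_confidence: float,
                noise_threshold: float) -> Optional[NoiseEntry]:
    """Return an entry with |confidence - target_confidence| < noise_threshold.
    Among several qualifying entries the closest one is chosen (an assumption);
    None is returned when no entry qualifies."""
    candidates = [entry for entry in database
                  if abs(entry.confidence - target_confidence) < noise_threshold]
    if not candidates:
        return None
    return min(candidates,
               key=lambda entry: abs(entry.confidence - target_confidence))
```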

Preferably, the target voice signal may be compared with all noise data in the pre-configured noise database to obtain a set of speech similarity values, and the preset noise threshold is then determined from a weighted average of those values. In addition, multiple noise databases may be used, with the result having the highest recognition rate selected as the final matching result, thereby improving the recognition rate of the robot's working noise.
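A weighted average of similarity values might be computed as below. The text does not say how the similarity values or the weights are obtained, so both are treated here as given inputs, and the function name `weighted_noise_threshold` is illustrative only.

```python
from typing import Sequence


def weighted_noise_threshold(similarities: Sequence[float],
                             weights: Sequence[float]) -> float:
    """Weighted average of speech similarity values, usable as the
    preset noise threshold. Weights are assumed to be non-negative."""
    if not similarities or len(similarities) != len(weights):
        raise ValueError("similarities and weights must be non-empty and equal in length")
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("the weights must sum to a positive value")
    weighted_sum = sum(s * w for s, w in zip(similarities, weights))
    return weighted_sum / total_weight
```

With equal weights this reduces to the ordinary mean of the similarity values.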

Step S103, pre-denoising processing is applied to the unmarked voiced frames in the target voice signal using the noise data, so as to obtain a pre-denoising processing result corresponding to the noise data. Specifically, the noise data is first phase-inverted to obtain a phase-inverted noise signal; the phase-inverted noise signal is then mixed and superposed with the target voice signal to obtain the pre-denoising processing result corresponding to the noise data, thereby cancelling the noise signal in the target voice signal and yielding the pre-denoised voice information.
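In the time domain, phase inversion followed by superposition amounts to negating the noise samples and adding them to the frame sample by sample. The sketch below assumes that the stored noise data is already time-aligned with the frame and has the same length, which the text does not address; `invert_phase` and `pre_denoise` are illustrative names.

```python
from typing import List, Sequence


def invert_phase(noise: Sequence[float]) -> List[float]:
    """Phase inversion: flip the sign of every noise sample."""
    return [-n for n in noise]


def pre_denoise(frame: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Superpose the phase-inverted noise on the frame, sample by sample."""
    if len(frame) != len(noise):
        raise ValueError("frame and noise must have the same length")
    inverted = invert_phase(noise)
    return [s + n for s, n in zip(frame, inverted)]
```

If the stored noise exactly equals the noise contained in the frame, the superposition returns the clean speech samples; any mismatch in amplitude or alignment leaves a residual.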

Step S104, it is judged whether the pre-denoising processing result is larger than a preset threshold; if so, the process proceeds to step S105, otherwise to step S106. The preset threshold is pre-stored and is used to measure the distortion of the voice signal. A pre-denoising processing result larger than the preset threshold indicates that the undesired noise has been removed from the voiced frame of the target voice signal, i.e., the influence of the noise on the recognition result has been eliminated. A result that does not exceed the preset threshold requires further, adjusted denoising to ensure that every voiced frame in the received target voice signal is processed, so that denoising is more thorough and the integrity and recognition accuracy of the target voice signal are improved.

Step S105, given that the pre-denoising processing result is larger than the preset threshold, the voiced frame corresponding to that result is marked as a denoised voiced frame in the target voice signal. The remaining unmarked voiced frames of the target voice signal may not satisfy this condition after pre-denoising and must wait for the denoising processing of the subsequent steps; afterwards, the frames are uniformly converted into a voice control instruction for controlling the mobile robot.

Step S106, it is judged whether the absolute value of the difference between the confidence value of the pre-denoising processing result and the target confidence value is smaller than a confidence threshold; if so, the process proceeds to step S107, otherwise to step S108. The confidence value of the pre-denoising processing result expresses the credibility of the recognition result for the pre-denoised target voice signal in the case where the pre-denoising processing result is smaller than the preset threshold, and the confidence threshold can serve as an evaluation index of the correct recognition rate of the interfered target voice signal. This judgment allows the noise in the remaining unmarked voiced frames of the target voice signal to be processed further, improving the comprehensiveness and accuracy of denoising the target voice signal.
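The two judgments of steps S104 and S106 can be combined into one small decision function. This is only an illustrative sketch: the text does not say how the pre-denoising processing result is reduced to a number comparable with the preset threshold, so the `result_score` parameter, the function name `judge_frame`, and the returned strings are all assumptions.

```python
def judge_frame(result_score: float,
                preset_threshold: float,
                result_confidence: float,
                target_confidence: float,
                confidence_threshold: float) -> str:
    """Return "mark" when the frame should be marked as denoised,
    or "adjust" when the target confidence value should be adjusted."""
    # First judgment (step S104): result larger than the preset threshold.
    if result_score > preset_threshold:
        return "mark"
    # Second judgment (step S106): confidence of the result close to the target.
    if abs(result_confidence - target_confidence) < confidence_threshold:
        return "mark"
    # Neither condition holds: adjust the target confidence and try again.
    return "adjust"
```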

Step S107, the voiced frame corresponding to the pre-denoising processing result is marked as a denoised voiced frame in the target voice signal, so that the voiced frames left unmarked after step S105 are processed and voice recognition precision is improved. If the target voice signal still contains unmarked voiced frames, the noise data matched with the current target confidence value has had no obvious pre-denoising effect on those frames, and the error is large.

Step S108, the current target confidence value is increased or decreased according to the difference between the confidence value of an unmarked voiced frame in the target voice signal and the current target confidence value. In this embodiment, when the confidence value of the unmarked voiced frame is greater than the current target confidence value, the current target confidence value is raised accordingly; otherwise it is lowered. The process then returns to step S102, where noise data matched with the adjusted target confidence value is selected for further denoising. This is in effect a parameter correction process based on the current target confidence value: the voiced frames are judged again using the corrected parameter, and the process loops through multiple iterations until all voiced frames in the target voice signal are denoised. Adjusting the confidence value flexibly according to the real-time degree of matching between the noise signal and the noise database improves denoising efficiency. The denoised voiced frames in the target voice signal are then converted into a voice control instruction to control the mobile robot. Because the target voice signal contains periodic components, the denoising method follows a periodic iteration rule, which prevents the target confidence value from being modified at random, speeds up the judgment of the target voice signal, and improves denoising efficiency.
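The adjustment rule of step S108 can be sketched as a single function. The text specifies only the direction of the adjustment, so the fixed step size, the clamping to the range 0 to 1, and the name `adjust_target_confidence` are assumptions made for illustration.

```python
def adjust_target_confidence(target_confidence: float,
                             frame_confidence: float,
                             step: float = 0.05) -> float:
    """Raise the target confidence when the unmarked frame's confidence is
    higher than it, otherwise lower it. The step size and the clamping to
    [0, 1] are assumptions, not taken from the text."""
    if frame_confidence > target_confidence:
        return min(1.0, target_confidence + step)
    return max(0.0, target_confidence - step)
```

In a full loop the adjusted value would be fed back into the noise selection of step S102. A practical implementation would probably also cap the number of iterations, although the text does not mention such a limit.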

According to this technical scheme, in the noisy scene in which the robot works, a target voice signal uttered by a user is obtained, and the pre-stored empirical data of the noise database is phase-inverted and used to suppress the noise in the target voice signal. Meanwhile, the relevant confidence value is adjusted flexibly according to the real-time degree of matching between the noise signal and the noise database, and the denoised voiced frames are screened out through the judgments described above, so that denoising is greatly improved in thoroughness and the speech recognition rate in a noisy environment is improved.

A chip is used for storing program code corresponding to the iterative denoising method. The chip is a dedicated integrated control chip that can parse internal or external control instructions and output corresponding control signals, so as to control the execution components of the robot to perform corresponding actions. The chip is built into a cleaning robot and controls it to execute the iterative denoising method, convert the processed target voice signal into a matching control instruction, and execute the corresponding operation according to that instruction. The pre-denoising processing may perform signal subtraction with a subtraction circuit, or with a combination of an inverter and an addition circuit; these circuits may be integrated with a processor into a dedicated processing chip and configured according to design requirements. After the internal noise interference is filtered out, the processor analyzes the filtered signal to extract the external voice signal, which is converted into a matching control instruction to control the robot. How the robot analyzes the external voice signal belongs to the prior art and is not described here.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art will understand that modifications may be made to the specific embodiments of the invention, or equivalent substitutions made for some of the technical features, without departing from the spirit of the present invention, and that all such changes are intended to fall within the scope of the invention as defined by the appended claims.

Claims

1. An iterative denoising method based on speech recognition is characterized by comprising the following steps:

step 1: determining a target voice signal from the voice signals acquired by the microphone array, and correspondingly acquiring a target confidence value;

step 2: selecting noise data matched with the target confidence value from a noise database, and controlling the noise data and an unmarked voiced frame in the target voice signal to participate in pre-denoising processing so as to obtain a pre-denoising processing result corresponding to the noise data;

step 3: judging whether the pre-denoising processing result is larger than a preset threshold value; if so, entering step 4, otherwise entering step 5; wherein the preset threshold value is pre-stored and is used for measuring the distortion degree of the voice signal;

step 4: marking the voiced frame corresponding to the pre-denoising processing result as a denoised voiced frame in the target voice signal;

step 5: judging whether the absolute value of the difference between the confidence value of the pre-denoising processing result and the target confidence value is smaller than a confidence threshold value; if so, marking the voiced frame corresponding to the pre-denoising processing result as a denoised voiced frame in the target voice signal; otherwise, adjusting the target confidence value and returning to step 2; wherein the target voice signal comprises voiced frames associated with control commands.

2. The iterative denoising method of claim 1, wherein the step 1 specifically comprises:

recognizing a voiced frame of a voice signal acquired from the microphone array through a voice engine, determining the voice signal corresponding to the voiced frame as the target voice signal when the signal-to-noise ratio value of the voiced frame is greater than a preset signal-to-noise ratio threshold value, and then extracting a target confidence value corresponding to the target voice signal from the voiced frame, wherein the voiced frame comprises the confidence value and the signal-to-noise ratio value based on the voice recognition signal.

3. The iterative denoising method of claim 1, wherein in the step 2, the selecting the noise data matching the target confidence value from the noise database specifically comprises:

judging whether the noise database contains preset noise data whose confidence value differs from the target confidence value by an absolute value smaller than a preset noise threshold value; if so, determining that the preset noise data is the noise data matched with the target confidence value.

4. The iterative denoising method of claim 1, wherein the pre-denoising method specifically comprises:

first, controlling the noise data to undergo phase inversion processing to obtain a phase-inverted noise signal;

then controlling the phase-inverted noise signal and the target voice signal to be mixed and superposed so as to obtain the pre-denoising processing result corresponding to the noise data.

5. The iterative denoising method of claim 1, wherein the method of adjusting the target confidence value comprises: increasing or decreasing the current target confidence value according to the difference between the confidence value of the unmarked voiced frame in the target voice signal and the current target confidence value.

6. A chip for storing program code corresponding to the iterative denoising method according to any one of claims 1 to 5.