CN110931007A - Voice recognition method and system

Info

Publication number: CN110931007A
Application number: CN201911225468.XA
Authority: CN
Inventors: 周晨
Original assignee: AI Speech Ltd
Current assignee: AI Speech Ltd
Priority date: 2019-12-04
Filing date: 2019-12-04
Publication date: 2020-03-27
Anticipated expiration: 2039-12-04
Also published as: CN110931007B

Abstract

The embodiment of the invention provides a voice recognition method. The method comprises: a voice recognition device collects a first noisy speech audio in real time and synchronously receives a first noise audio sent by at least one noise collection microphone; echo cancellation is performed on the first noisy speech audio and the first noise audio, and a second noisy speech audio and a second noise audio are determined after echo cancellation; the noise power spectral density of the second noisy speech audio is estimated in real time, and peripheral noise reduction is performed on the second noisy speech audio according to the noise power spectral density and the second noise audio to generate noise-reduced clean speech; and voice recognition is performed on the clean speech. The embodiment of the invention also provides a voice recognition system. The embodiment provides the most effective noise source for noise reduction in the intelligent voice device and cancels the self-noise of the noise device in the signal collected by the voice microphone. The method does not require a large amount of computation, has low latency, applies to a wider range of devices, and helps ensure the accuracy of voice recognition and the wake-up success rate.

Description

Voice recognition method and system

Technical Field

The invention relates to the field of intelligent voice, in particular to a voice recognition method and a voice recognition system.

Background

With the development of intelligent voice technology, intelligent voice devices are gradually entering users' homes. At home, a user can have such a device perform an operation at any time simply by speaking a command. For example, with a smart television, the user can jump to the corresponding video just by naming the desired program or channel. Similarly, the user may name a song to be played or ask about the day's weather, and a smart speaker performs the corresponding operation after voice recognition.

In a home environment, other devices constantly emit noise. For example, the sound emitted by a smart television acts as noise for an intelligent voice device and degrades voice recognition. To handle this situation, the smart television obtains the self-noise emitted by its loudspeaker through a hardware or software loopback and uses this self-noise to actively reduce the noise in the sound it receives, thereby avoiding the influence of this noise on voice recognition.

In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:

However, other devices in the home environment, such as washing machines, refrigerators, ovens, and range hoods, also emit noise that can degrade voice recognition. The self-noise of these devices is difficult to obtain, so active noise reduction cannot be performed, and the wake-up success rate and recognition accuracy of the intelligent voice device suffer.

Disclosure of Invention

The invention aims to solve at least the prior-art problem that noise sources in the home environment interfere with the intelligent voice device and reduce its wake-up success rate and recognition accuracy.

In a first aspect, an embodiment of the present invention provides a noise self-acquisition method for a noise device, which is applied to a noise acquisition microphone disposed at a noise source of the noise device, and the method includes:

the noise acquisition microphone receives analog gain configuration information and configures a signal acquisition mode according to the analog gain configuration information;

and carrying out multi-channel signal acquisition through the signal acquisition mode, and sending the acquired noise audio to voice recognition equipment.

In a second aspect, an embodiment of the present invention provides a speech recognition method, which is applied to a speech recognition device that establishes a connection with the noise collection microphone, and the method includes:

the voice recognition equipment collects a first voice audio with noise in real time and synchronously receives a first noise audio sent by at least one noise acquisition microphone;

respectively carrying out echo cancellation on the first voice audio with noise and the first noise audio, and determining a second voice audio with noise and a second noise audio after echo cancellation;

estimating the noise power spectral density of the second voice audio with noise in real time, and performing peripheral noise reduction on the second voice audio with noise according to the noise power spectral density and the second noise audio to generate noise-reduced clean voice;

and performing voice recognition on the clean voice, and determining information corresponding to the clean voice.

In a third aspect, an embodiment of the present invention provides a noise self-acquisition system for a noise device, applied to a noise acquisition microphone disposed at a noise source of the noise device, where the system includes:

the analog gain configuration program module is used for receiving analog gain configuration information by the noise acquisition microphone and configuring a signal acquisition mode according to the analog gain configuration information;

and the noise acquisition program module is used for carrying out multi-channel signal acquisition through the signal acquisition mode and sending the acquired noise audio to the voice recognition equipment.

In a fourth aspect, an embodiment of the present invention provides a speech recognition system, which is applied to a speech recognition device connected to the noise collection microphone, and the system includes:

the audio acquisition program module is used for acquiring a first voice audio with noise in real time by the voice recognition equipment and synchronously receiving the first noise audio sent by at least one noise acquisition microphone;

the echo cancellation program module is used for performing echo cancellation on the first voice audio with noise and the first noise audio respectively, and determining a second voice audio with noise and a second noise audio after echo cancellation;

the noise reduction program module is used for estimating the noise power spectral density of the second voice audio with noise in real time, and performing peripheral noise reduction on the second voice audio with noise according to the noise power spectral density and the second noise audio to generate noise-reduced clean voice;

and the recognition program module is used for carrying out voice recognition on the clean voice and determining the information corresponding to the clean voice.

In a fifth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the noise self-acquisition method for a noise device and the voice recognition method of any embodiment of the invention.

In a sixth aspect, an embodiment of the present invention provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the noise self-acquisition method and the speech recognition method for a noise device according to any embodiment of the present invention.

The embodiment of the invention has the following beneficial effects: a noise collection microphone dedicated to collecting the device's self-noise is placed at the self-noise source and establishes a connection with the intelligent voice device, so that the noise is transmitted effectively to the intelligent voice device, providing the most effective noise source for its noise reduction. The noisy speech and the noise audio are acquired synchronously and input into an echo cancellation module, which cancels the device self-noise in the signal collected by the voice microphone. Noise reduction is performed at the signal level, so it does not require a large amount of computation, places modest demands on the intelligent voice device, and is more widely applicable. Latency is low, which improves the user experience while also ensuring the accuracy of voice recognition and the wake-up success rate.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

Fig. 1 is a flowchart of a noise self-acquisition method for a noise device according to an embodiment of the present invention;

Fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a noise self-acquisition system for a noise device according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a speech recognition system according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a flowchart of a noise self-acquisition method for a noise device according to an embodiment of the present invention, which includes the following steps:

S11: the noise acquisition microphone receives analog gain configuration information and configures a signal acquisition mode according to the analog gain configuration information;

S12: carrying out multi-channel signal acquisition through the signal acquisition mode, and sending the acquired noise audio to voice recognition equipment.

In this embodiment, some devices produce very strong self-noise when working, for example a floor sweeper, a vacuum cleaner, a range hood, a washing machine, a juicer, or an oven. This strong self-noise is picked up by the microphone of the intelligent voice device and seriously degrades voice interaction between the device and the user, for example the wake-up success rate and the recognition accuracy.

It is first necessary to determine which components of the device are the sources of its self-noise. Taking a floor sweeper as an example, the self-noise sources include the main brush motor, the auxiliary brush motor, the blower motor, the laser displacement sensor motor, the friction of the brushes against the floor, and so on; noise microphones are then arranged near these self-noise sources. An analog microphone with a sensitivity of about -38 dBV/Pa is typically chosen as the noise microphone. A noise microphone can be placed in the middle of several self-noise sources so that it collects them simultaneously, which reduces the number of noise microphones and in turn the hardware cost and the computational load of the echo cancellation module. In general, although a floor sweeper has many self-noise sources, its internal structure is compact, and 2 noise microphones achieve a good result; a vacuum cleaner or range hood has a single self-noise source and needs only 1 noise microphone. Manufacturers of these devices select the positions and reserve space during production so that the noise microphones are integrated inside the devices. The user can use the device directly after purchase without installing anything.

For step S11, each noise microphone receives analog gain configuration information. The analog gain mainly adjusts the strength of the signal fed to the linear amplifier, and within a certain range its magnitude directly determines the output audio power: a larger input helps improve the output signal-to-noise ratio and increases the output power proportionally. However, when the input is too large, the output power increases only slowly while distortion increases sharply. The optimum setting keeps the peak output voltage within the linear range of the amplifier. The signal acquisition mode is configured according to the analog gain configuration information. In one embodiment, the analog gain configuration information is 0 dB, that is, no analog gain is applied. This prevents the noise microphone from picking up the speaker's voice, which would otherwise cause voice self-cancellation in the echo cancellation module: the speaker's voice would be mistaken for self-noise and cancelled. Because there is no analog gain, the noise microphone must not be too far from the self-noise source, or it cannot capture the self-noise signal with a high signal-to-noise ratio; the distance is preferably within 20 cm, and the closer the better.
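The relationship between microphone sensitivity, analog gain, and the amplifier's linear range can be illustrated with a small calculation. The following is a minimal sketch, not part of the described embodiment: the function names, the 1.0 V linear-range limit, and the sine-wave peak factor are illustrative assumptions; only the -38 dBV/Pa sensitivity and the 0 dB gain come from the text.

```python
import math

PA_REF_SPL_DB = 94.0  # 1 Pa RMS corresponds to roughly 94 dB SPL


def mic_output_vrms(spl_db: float, sensitivity_dbv_per_pa: float = -38.0) -> float:
    """RMS output voltage of an analog microphone at a given sound pressure level."""
    pressure_pa = 10 ** ((spl_db - PA_REF_SPL_DB) / 20)
    volts_per_pa = 10 ** (sensitivity_dbv_per_pa / 20)
    return pressure_pa * volts_per_pa


def apply_gain(vrms: float, gain_db: float) -> float:
    """Scale a voltage by an analog gain given in dB (0 dB leaves it unchanged)."""
    return vrms * 10 ** (gain_db / 20)


def stays_linear(vrms: float, max_peak_v: float = 1.0) -> bool:
    """True if a sine with this RMS value peaks inside the assumed linear range.

    max_peak_v is a hypothetical amplifier limit chosen for illustration.
    """
    return vrms * math.sqrt(2) <= max_peak_v
```

Under these assumptions, a -38 dBV/Pa microphone outputs about 12.6 mV RMS at 94 dB SPL. A loud source at 110 dB SPL stays within the assumed range at 0 dB gain, while 30 dB of gain would push it past the limit.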

For step S12, multi-channel signal acquisition is performed in the signal acquisition mode determined in step S11, and the acquired noise audio is sent to the voice recognition device. The voice recognition device establishes a connection with the noise collection microphones of the noise devices in advance so that the noise can be transmitted.
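The text does not specify how the multi-channel noise audio is packaged for transmission. As one possible illustration, the sketch below packs a frame of 16-bit samples from several noise microphones into a single packet with a sequence number. The packet layout (`HEADER`), the function names, and the choice of interleaved little-endian PCM are all assumptions made for this example.

```python
import struct

# Hypothetical header: sequence number, channel count, samples per channel.
HEADER = struct.Struct("<IHH")


def pack_frame(seq: int, channels: list[list[int]]) -> bytes:
    """Interleave equal-length 16-bit channels and prepend the header."""
    if not channels:
        raise ValueError("at least one channel is required")
    n = len(channels[0])
    if any(len(ch) != n for ch in channels):
        raise ValueError("all channels must have the same number of samples")
    # Interleaved order: ch0[0], ch1[0], ..., ch0[1], ch1[1], ...
    interleaved = [ch[i] for i in range(n) for ch in channels]
    payload = struct.pack(f"<{len(interleaved)}h", *interleaved)
    return HEADER.pack(seq, len(channels), n) + payload


def unpack_frame(packet: bytes) -> tuple[int, list[list[int]]]:
    """Recover the sequence number and the per-channel sample lists."""
    seq, n_ch, n = HEADER.unpack_from(packet)
    flat = struct.unpack_from(f"<{n_ch * n}h", packet, HEADER.size)
    return seq, [list(flat[c::n_ch]) for c in range(n_ch)]
```

Carrying a sequence number in each packet is one simple way for the receiving side to match noise frames with the speech frames captured over the same interval.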

In this implementation, a noise collection microphone dedicated to collecting the device's self-noise is placed at the self-noise source and establishes a connection with the intelligent voice device, so that the noise is transmitted effectively to the intelligent voice device. This provides the most effective noise source for noise reduction in the intelligent voice device and thus better reduces the influence of noise on its recognition.

Fig. 2 is a flowchart of a speech recognition method according to an embodiment of the present invention, which includes the following steps:

S21: the voice recognition equipment collects a first voice audio with noise in real time and synchronously receives a first noise audio sent by at least one noise acquisition microphone;

S22: respectively carrying out echo cancellation on the first voice audio with noise and the first noise audio, and determining a second voice audio with noise and a second noise audio after echo cancellation;

S23: estimating the noise power spectral density of the second voice audio with noise in real time, and performing peripheral noise reduction on the second voice audio with noise according to the noise power spectral density and the second noise audio to generate noise-reduced clean voice;

S24: performing voice recognition on the clean voice, and determining information corresponding to the clean voice.

In this embodiment, the intelligent voice device establishes a connection with the noise collection microphone in advance, for example over a wireless network. The intelligent voice device can therefore receive the noise collected by the noise collection microphone in real time.

For step S21, in use, the voice recognition device and the noise collection microphone may be integrated or separate; they are connected within the same microphone network and acquire multi-channel signals synchronously. The noise collection microphone collects noise audio and sends it to the voice recognition device (that is, the intelligent voice device) in real time. The voice recognition device receives the first noise audio collected by the noise collection microphone in real time while also collecting the first noisy speech audio itself.
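Synchronous reception means that each speech frame must be paired with the noise frame covering the same time interval before echo cancellation. The following is a minimal sketch of one way to do this, assuming (hypothetically) that every frame carries a shared sequence number; the `FrameAligner` class and its parameters are illustrative and not taken from the described embodiment. Sequence-number wraparound is ignored for brevity.

```python
class FrameAligner:
    """Pairs speech and noise frames that carry the same sequence number."""

    def __init__(self, max_pending: int = 8):
        self.max_pending = max_pending  # unmatched frames kept per stream
        self._speech = {}
        self._noise = {}

    def push_speech(self, seq, frame):
        """Return (speech, noise) if the partner has arrived, else None."""
        if seq in self._noise:
            return frame, self._noise.pop(seq)
        self._store(self._speech, seq, frame)
        return None

    def push_noise(self, seq, frame):
        """Return (speech, noise) if the partner has arrived, else None."""
        if seq in self._speech:
            return self._speech.pop(seq), frame
        self._store(self._noise, seq, frame)
        return None

    def _store(self, pending, seq, frame):
        pending[seq] = frame
        # Drop the oldest unmatched frames so a lost packet cannot stall the queue.
        while len(pending) > self.max_pending:
            del pending[min(pending)]
```

Bounding the number of pending frames keeps latency and memory use predictable when a frame from one stream is lost.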

For step S22, the first noisy speech audio and the first noise audio are then input to an echo cancellation module, which outputs a speech signal with most of the device self-noise removed. The reference input of the echo cancellation algorithm is the source of the echo to be cancelled: for a device with a loudspeaker, it is the audio being played; for the devices addressed by this method, it is the noise source signal, that is, the signal collected by the noise microphone. The microphone input of the echo cancellation algorithm is the signal containing both echo and speech, that is, the signal collected by the voice microphone. Echo cancellation exploits the correlation between the reference input and the microphone input and is implemented with techniques such as a linear adaptive filter and residual echo suppression. The second noisy speech audio and the second noise audio are thereby determined.
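The linear adaptive filter mentioned above can be illustrated with a normalized least-mean-squares (NLMS) filter, a common choice for echo cancellation. This is a minimal sketch under stated assumptions: the text does not name NLMS specifically, and the function name, tap count, and step size are illustrative. The noise-microphone signal serves as the reference input, and the residual is the speech-microphone signal with the estimated self-noise removed.

```python
def nlms_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Cancel the component of `mic` that is linearly predictable from `ref`.

    mic: samples from the voice microphone (speech plus device self-noise)
    ref: samples from the noise microphone (reference input)
    Returns (residual, weights); residual is the cleaned signal.
    """
    w = [0.0] * taps      # adaptive filter coefficients
    buf = [0.0] * taps    # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        buf = [x] + buf[:-1]
        y = sum(wi * xi for wi, xi in zip(w, buf))   # estimated self-noise
        e = d - y                                    # error = cleaned sample
        norm = sum(xi * xi for xi in buf) + eps      # reference energy
        step = mu * e / norm
        w = [wi + step * xi for wi, xi in zip(w, buf)]
        residual.append(e)
    return residual, w
```

A production echo canceller would typically add double-talk handling and residual echo suppression, which this sketch omits.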

For step S23, if the power spectral density of the second noise audio changes little over time, that is, the noise is stationary, a post-filtering module may be connected after the echo cancellation module. The post-filtering algorithm suppresses noise by estimating the noise power spectral density in real time and then removing the estimated noise from the noisy signal, while introducing little or no speech distortion. For example, devices with a fan, such as a floor sweeper, a vacuum cleaner, or a range hood, generate stationary wind noise during operation, and this noise can be reduced by the post-filtering module. The noise-reduced clean speech is then generated.
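The post-filtering idea (estimate the noise power spectral density, then remove it from the noisy signal) can be sketched as follows. The text does not specify the estimator, so this example assumes a simple recursive average that freezes in bins that look like speech, combined with a power spectral subtraction gain. All function names and parameter values are illustrative, and the input is assumed to be a sequence of per-frame power spectra whose first few frames contain noise only.

```python
def update_noise_psd(noise_psd, frame_psd, alpha=0.9, speech_ratio=4.0):
    """Recursively average bins that look noise-only; freeze the others."""
    updated = []
    for n, p in zip(noise_psd, frame_psd):
        if p <= speech_ratio * n:
            updated.append(alpha * n + (1.0 - alpha) * p)
        else:
            updated.append(n)  # likely speech: keep the previous estimate
    return updated


def suppression_gain(noise_psd, frame_psd, floor=0.1):
    """Power-domain spectral subtraction gain with a floor to limit distortion."""
    gains = []
    for n, p in zip(noise_psd, frame_psd):
        g = 1.0 - n / p if p > 0.0 else 0.0
        gains.append(max(g, floor))
    return gains


def post_filter(frames, init_frames=5):
    """Apply the gain to each frame of power spectra; returns (cleaned, noise)."""
    head = frames[:init_frames]
    noise = [sum(col) / len(head) for col in zip(*head)]
    cleaned = []
    for frame in frames:
        noise = update_noise_psd(noise, frame)
        gains = suppression_gain(noise, frame)
        cleaned.append([g * p for g, p in zip(gains, frame)])
    return cleaned, noise
```

The gains here scale power; to apply them to complex spectral coefficients, one would multiply by the square root of each gain instead.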

For step S24, after the clean speech is determined, voice recognition is performed to determine the information corresponding to the clean speech, and a wake-up operation or a voice interaction operation is carried out. The method does not rely on approaches such as pattern matching or neural networks to handle the noise of known product types. Pattern matching and neural network approaches need a large amount of supporting data (for example, recordings of noise in various scenes) and cannot achieve good noise reduction for noise types that were not recorded. They can only be improved by adding data; that is, they are sensitive to training data and do not generalize. The present method can be adapted to various products by adjusting specific algorithms and parameters. In addition, pattern matching and neural networks require far more computation than the signal-level noise reduction of this method and place high demands on the computing power and memory of the intelligent voice device. The present method needs neither a neural network nor pattern matching, occupies few resources, and can be used in intelligent devices with little memory or computing power. Furthermore, because of their heavy computation and the characteristics of neural network models, such approaches often have a noticeable response delay, which users experience as slow wake-up and similar sluggishness. The present method supports real-time processing and avoids this delay problem.

According to the implementation method, the voice with noise and the noise audio are synchronously acquired and input to the echo cancellation module, so that the self-noise of the equipment in the voice microphone acquisition signal is eliminated. And noise reduction processing is performed on the signal level, a large amount of calculation is not needed, the requirement on intelligent voice equipment is not high, and the method is more widely applicable. The time delay is low, the user experience is improved, and meanwhile, the accuracy rate of voice recognition and the awakening success rate are also ensured.

As an implementation manner, in this embodiment, after performing echo cancellation on the first noisy speech audio and the first noise audio, respectively, and determining a second noisy speech audio and a second noise audio after echo cancellation, the method further includes:

performing beamforming using the phase differences among the microphones of a microphone array in the voice recognition equipment, enhancing the voice signal in the voice sound source direction of the microphone array, suppressing noise signals in at least one non-voice sound source direction, and determining a third noisy speech audio and a third noise audio;

and estimating the noise power spectral density of the third noisy speech audio in real time, and performing peripheral noise reduction on the third noisy speech audio according to the noise power spectral density and the third noise audio to generate noise-reduced clean speech.

The microphone array comprises at least: a double-microphone array, a linear four-microphone array, an annular four-microphone array, and an annular six-microphone array.

In the present embodiment, if the voice microphone is a microphone array, such as a double-microphone array, a linear four-microphone array, or an annular six-microphone array, a beamforming module may be connected after the echo cancellation module. The beamforming algorithm uses the phase-difference and amplitude-difference information between each pair of microphones in the array to enhance speech from the desired direction and suppress noise from undesired directions, that is, noise from non-speech directions, thereby achieving a good noise reduction effect.
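A delay-and-sum beamformer is the simplest way to exploit the inter-microphone time (phase) differences described above. The text does not say which beamforming algorithm is used, so the following is only a sketch under assumptions: a linear array, a far-field plane wave, integer-sample delays, and illustrative function names and a default speed of sound of 343 m/s.

```python
import math


def steering_delays(mic_positions_m, angle_deg, fs, c=343.0):
    """Integer sample delays that align a plane wave from `angle_deg`.

    mic_positions_m: microphone coordinates along the array axis, in meters
    angle_deg: source direction measured from the array axis
    """
    cos_a = math.cos(math.radians(angle_deg))
    # Relative arrival time at each microphone, in samples.
    arrival = [-x * cos_a / c * fs for x in mic_positions_m]
    latest = max(arrival)
    # Delay early channels so that every channel matches the latest arrival.
    return [round(latest - a) for a in arrival]


def delay_and_sum(channels, delays):
    """Delay each channel by its integer delay and average the results."""
    n = len(channels[0])
    out = [0.0] * n
    for ch, d in zip(channels, delays):
        for i in range(d, n):
            out[i] += ch[i - d]
    return [v / len(channels) for v in out]
```

Signals arriving from the steered direction add coherently, while signals from other directions are misaligned and partially cancel, which is the enhancement and suppression behavior the embodiment describes.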

It can be seen from this embodiment that the voice in the desired direction (for example, the direction in which the user is located) is enhanced through the beamforming, and the noise in the undesired direction (i.e., the noise in the non-voice direction, that is, the direction in which the user is not located) is suppressed, so that the noise reduction effect can be further improved.

Fig. 3 is a schematic structural diagram of a noise self-acquisition system for a noise device according to an embodiment of the present invention, which can execute the noise self-acquisition method for a noise device according to any of the above embodiments and is configured in a terminal.

The noise self-acquisition system for the noise equipment provided by the embodiment comprises: an analog gain configuration program module 11 and a noise acquisition program module 12.

The analog gain configuration program module 11 is configured to receive analog gain configuration information by the noise acquisition microphone, and configure a signal acquisition mode according to the analog gain configuration information; the noise acquisition program module 12 is configured to perform multi-channel signal acquisition in the signal acquisition mode, and send the acquired noise audio to the speech recognition device.

Further, the analog gain configuration information is 0 decibel, and is used for preventing the noise collection microphone from collecting speaking voice.

As one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the noise self-acquisition method for a noise device in any of the method embodiments described above.

Fig. 4 is a schematic structural diagram of a speech recognition system according to an embodiment of the present invention, which can execute the speech recognition method according to any of the above embodiments and is configured in a terminal.

The speech recognition system provided by the embodiment comprises: an audio acquisition program module 21, an echo cancellation program module 22, a noise reduction program module 23 and an identification program module 24.

The audio acquisition program module 21 is configured to collect, by the voice recognition device, a first noisy speech audio in real time and to synchronously receive a first noise audio sent by at least one noise acquisition microphone; the echo cancellation program module 22 is configured to perform echo cancellation on the first noisy speech audio and the first noise audio, respectively, and determine a second noisy speech audio and a second noise audio after echo cancellation; the noise reduction program module 23 is configured to estimate a noise power spectral density of the second noisy speech audio in real time, and perform peripheral noise reduction on the second noisy speech audio according to the noise power spectral density and the second noise audio to generate a noise-reduced clean speech; the recognition program module 24 is configured to perform speech recognition on the clean speech and determine information corresponding to the clean speech.

Further, after the echo cancellation program module, the system further comprises:

a beam forming program module, configured to perform beam forming processing using phase differences between microphones in a microphone array in the speech recognition device, enhance speech signals in a speech sound source direction of the microphone array, suppress noise signals in at least one non-speech sound source direction, and determine a third noisy speech and a third noise audio;

and the noise reduction program module is used for estimating the noise power spectral density of the third noisy speech audio in real time, and performing peripheral noise reduction on the third noisy speech audio according to the noise power spectral density and the third noise audio to generate noise-reduced clean speech.

Further, the microphone array comprises at least: a double-microphone array, a linear four-microphone array, an annular four-microphone array, and an annular six-microphone array.

The non-volatile computer-readable storage medium may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the methods in embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the speech recognition method in any of the method embodiments described above.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

An embodiment of the present invention further provides an electronic device, which includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the steps of the noise self-acquisition method for a noise device and the voice recognition method of any embodiment of the invention.

The client of the embodiment of the present application exists in various forms, including but not limited to:

(1) Mobile communication devices, which are characterized by mobile communication capabilities and are primarily aimed at providing voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, and low-end phones, among others.

(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.

(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio and video players, handheld game consoles, e-book readers, intelligent toys, and portable vehicle-mounted navigation devices.

(4) Other electronic devices with speech processing capabilities.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A noise self-acquisition method for a noisy device, applied to a noise-acquisition microphone disposed at a noise source of the noisy device, the method comprising:

2. The method of claim 1, wherein the analog gain configuration information is 0 decibels, for preventing the noise collection microphone from collecting spoken speech.

3. A speech recognition method applied to a speech recognition device that establishes a connection with the noise collection microphone, the method comprising:

4. The method of claim 3, wherein after said separately echo canceling said first noisy speech audio and first noise audio, determining echo-canceled second noisy speech audio and second noise audio, the method further comprises:

5. The method of claim 4, wherein the microphone array comprises at least: a double-microphone array, a linear four-microphone array, an annular four-microphone array, and an annular six-microphone array.

6. A noise self-acquisition system for a noisy device, applied to a noise-acquisition microphone arranged at a noise source of said noisy device, said system comprising:

7. The system of claim 6, wherein the analog gain configuration information is 0 decibels, for preventing the noise collection microphone from collecting spoken speech.

8. A speech recognition system for use with a speech recognition device that establishes a connection with the noise-capturing microphone, the system comprising:

9. The system of claim 8, wherein after the echo cancellation program module, the system further comprises:

10. The system of claim 9, wherein the microphone array comprises at least: a double-microphone array, a linear four-microphone array, an annular four-microphone array, and an annular six-microphone array.