CN110517682B - Voice recognition method, device, equipment and storage medium - Google Patents


Info

Publication number
CN110517682B
Authority
CN
China
Prior art keywords
voice
voice information
directions
frequency spectrum
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910822237.0A
Other languages
Chinese (zh)
Other versions
CN110517682A (en)
Inventor
朱振岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910822237.0A
Publication of CN110517682A
Application granted
Publication of CN110517682B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application provides a voice recognition method, apparatus, device, and storage medium. The method includes: performing adaptive beamforming (ADBF) processing on collected first voice information to obtain frequency spectra in at least two directions; determining, among the spectra in the at least two directions, the direction corresponding to a spectrum whose spectral characteristics meet a preset condition as a target direction; and acquiring second voice information collected in the target direction and performing voice recognition on it. In this way, second voice information from the accurate direction can be acquired, improving the accuracy of voice recognition.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of electronic device technology, and in particular, though not exclusively, to a speech recognition method, apparatus, device, and storage medium.
Background
Currently, in an electronic device with a voice recognition function, the signal-processing front end usually performs echo cancellation and single-channel noise reduction on the signal acquired by the microphone, wakes up the device according to the processed signal, and performs voice recognition after the device is woken up.
However, in such related-art methods, the signal that remains after acoustic echo cancellation (AEC) and single-channel noise suppression (NS) still contains directional interference noise from other directions, and sound-source localization is easily inaccurate under strong interference or high reverberation, which can reduce the accuracy of subsequent speech recognition.
Disclosure of Invention
The embodiments of the application provide a voice recognition method, apparatus, device, and storage medium that can accurately locate the direction of a sound source and thereby improve the accuracy of voice recognition.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a voice recognition method, which comprises the following steps:
carrying out ADBF processing on the collected first voice information to obtain frequency spectrums in at least two directions;
determining the direction corresponding to the frequency spectrum with the frequency spectrum characteristics meeting the preset conditions as a target direction in the frequency spectrums in the at least two directions;
and acquiring second voice information acquired in the target direction, and performing voice recognition on the second voice information.
An embodiment of the present application provides a speech recognition apparatus, including:
the first processing module is used for carrying out ADBF processing on the collected first voice information to obtain frequency spectrums in at least two directions;
the determining module is used for determining the direction corresponding to the frequency spectrum with the frequency spectrum characteristics meeting the preset conditions in the frequency spectrums in the at least two directions as a target direction;
and the second processing module is used for acquiring second voice information acquired in the target direction and carrying out voice recognition on the second voice information.
An embodiment of the present application provides a speech recognition device, including:
a memory for storing executable instructions;
and the processor is used for realizing the method when executing the executable instructions stored in the memory.
The embodiment of the application provides a storage medium storing executable instructions that, when executed, cause a processor to implement the above method.
The embodiment of the application has the following beneficial effects:
ADBF processing is performed on the collected first voice information to obtain frequency spectra in at least two directions, and the direction corresponding to the spectrum whose spectral characteristics meet the preset condition is determined as the target direction. The direction of the sound source can thus be located accurately, so that in the subsequent voice recognition process the second voice information can be acquired from the accurate direction, improving the accuracy of voice recognition.
Drawings
FIG. 1 is a schematic flow chart of a speech recognition method in the related art;
FIG. 2A is a schematic diagram of an alternative architecture of a speech recognition system according to an embodiment of the present application;
fig. 2B is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 3A is a schematic flow chart of an alternative speech recognition method according to an embodiment of the present application;
fig. 3B is a schematic diagram of an alternative scenario of a speech recognition method according to an embodiment of the present application;
fig. 3C is a schematic view of an alternative scenario of a speech recognition method according to an embodiment of the present application;
fig. 3D is a schematic diagram of an alternative scenario of a speech recognition method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of an alternative speech recognition method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of an alternative speech recognition method according to an embodiment of the present application;
fig. 6A is a schematic flow chart of an alternative speech recognition method according to an embodiment of the present application;
fig. 6B is a schematic diagram illustrating an implementation of determining a user direction according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of an alternative speech recognition method according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of an alternative speech recognition method according to an embodiment of the present application;
FIG. 9A is a beam pattern diagram in the 0° direction according to an embodiment of the present application;
FIG. 9B is a beam pattern diagram in the 90° direction according to an embodiment of the present application;
FIG. 9C is a beam pattern diagram in the 180° direction according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a wake-up word spectrum sent to a wake-up unit of an electronic device.
Detailed Description
In order to make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the attached drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within its protection scope.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further explanation of the speech recognition method according to the embodiment of the present application, a speech recognition method in the related art will be explained first.
Fig. 1 is a schematic flowchart of a speech recognition method in the related art. As shown in fig. 1, the method includes the following steps:
step S101, a voice signal is collected by a microphone.
Step S102, performing AEC processing on the voice signal to obtain an AEC-processed voice signal.
Step S103, performing NS processing on the AEC-processed voice signal to obtain an NS-processed voice signal.
Step S104, sending the NS-processed voice signal to a wake-up module of the electronic device to wake up the electronic device.
Step S105, determining whether the electronic device has been woken up.
If yes, step S106 is executed; if no, the flow ends.
Step S106, once the electronic device is woken up, it starts its voice recognition function to perform voice recognition on the collected voice information.
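The related-art front end (steps S101 to S106) can be sketched as a toy pipeline. The `aec`, `ns`, and `wake` functions below are hypothetical stand-ins for real DSP components, not the patent's implementation; the sample values are likewise illustrative.

```python
# Toy sketch of the related-art front end: AEC, then NS, then wake-up check.

def aec(frame, echo_ref):
    # Echo cancellation stand-in: subtract the known playback reference.
    return [s - e for s, e in zip(frame, echo_ref)]

def ns(frame, noise_floor=0.1):
    # Single-channel noise suppression stand-in: gate small residuals.
    return [s if abs(s) > noise_floor else 0.0 for s in frame]

def wake(frame, threshold=1.0):
    # Wake-module stand-in: fires when enough energy survives the front end.
    return sum(s * s for s in frame) >= threshold

frame = [0.05, 1.2, -1.1, 0.04]      # microphone samples (speech plus echo)
echo_ref = [0.05, 0.0, 0.0, 0.04]    # loudspeaker playback reference
processed = ns(aec(frame, echo_ref))
print(wake(processed))  # True: speech energy remains after AEC and NS
```

Note that nothing in this pipeline is direction-aware, which is exactly the weakness the patent's ADBF front end addresses.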
However, the above-described method in the related art has at least the following problems:
1) Before the electronic device is woken up, the sound source is unknown, and the related art performs only AEC and NS processing on the collected voice signal. The NS-processed signal therefore still contains directional interference noise from other directions, so a strong interference component may remain in the signal sent to the wake-up module, preventing the device from being woken up effectively and lowering the wake-up rate.
2) The angle of the sound source must be estimated at the moment of wake-up. For a dual-microphone device under strong interference or high reverberation, sound-source localization is therefore easily inaccurate or even wrong, so later processing cannot enhance the speech and may even damage it, seriously degrading the speech recognition effect and reducing its accuracy.
In order to accurately perform voice recognition, embodiments of the present application provide a voice recognition method, apparatus, device, and storage medium, which can accurately locate a direction of a sound source, thereby improving accuracy of voice recognition.
An exemplary application of the speech recognition device provided in the embodiments of the present application is described below, taking the case in which the device is implemented as a server.
Referring to fig. 2A, fig. 2A is an alternative architecture diagram of the speech recognition system 10 provided in the embodiment of the present application, in order to implement speech recognition on speech information of a user, the terminal 100 (exemplary shown are the terminal 100-1 and the terminal 100-2) is connected to the server 300 through the network 200, and the network 200 may be a wide area network or a local area network, or a combination of the two.
The terminal 100 displays an interface of an application (APP) on a current interface 110 (the current interface 110-1 and the current interface 110-2 are exemplarily shown); for example, the APP may be one with a voice input function, and the terminal 100-1 and the terminal 100-2 may display, on the current interface, text information corresponding to the recognized voice information. In the embodiment of the present application, the server 300 acquires the first voice information collected by the terminal 100-1 or the terminal 100-2 through the network 200 and performs ADBF processing on it to obtain frequency spectra in at least two directions; determines, among the spectra in the at least two directions, the direction corresponding to the spectrum whose spectral characteristics meet the preset condition as the target direction; and then acquires the second voice information collected by the terminal, performs voice recognition on it, and feeds the recognition result back to the terminal. It should be noted that the voice information shown in fig. 2A includes both the first voice information and the second voice information.
Fig. 2B is a schematic structural diagram of a server 300 according to an embodiment of the present application, and as shown in fig. 2B, the server 300 includes: at least one processor 210, memory 250, at least one network interface 220, and a user interface 230. The various components in server 300 are coupled together by bus system 240. It is understood that the bus system 240 is used to enable communications among the components. The bus system 240 includes a power bus, a control bus, and a status signal bus in addition to a data bus. But for clarity of illustration the various buses are labeled as bus system 240 in figure 2B.
The processor 210 may be an integrated circuit chip with signal-processing capability, for example a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components; the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 230 includes one or more output devices 231, including one or more speakers and/or one or more visual display screens, that enable the presentation of media content. The user interface 230 also includes one or more input devices 232, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display screen, camera, other input buttons and controls.
The memory 250 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 250 optionally includes one or more storage devices physically located remotely from processor 210. The memory 250 includes volatile memory or nonvolatile memory, and can also include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a Random Access Memory (RAM). The memory 250 described in embodiments herein is intended to comprise any suitable type of memory. In some embodiments, memory 250 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.
An operating system 251 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 252 for communicating with other computing devices via one or more (wired or wireless) network interfaces 220; exemplary network interfaces 220 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
an input processing module 253 for detecting one or more user inputs or interactions from one of the one or more input devices 232 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided by the embodiments of the present application can be implemented in software. Fig. 2B shows a speech recognition apparatus 254 stored in the memory 250; the apparatus 254 can be software in the form of programs, plug-ins, and the like, and includes the following software modules: a first processing module 2541, a determining module 2542, and a second processing module 2543. These modules are logical, and thus may be arbitrarily combined or further divided depending on the functions implemented. The functions of the respective modules are explained below.
In other embodiments, the apparatus provided in the embodiments of the present application may be implemented in hardware. For example, it may be a processor in the form of a hardware decoding processor programmed to execute the speech recognition method provided in the embodiments of the present application; such a processor may be implemented by one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
The speech recognition method provided by the embodiment of the present application will be described below in conjunction with an exemplary application and implementation of the speech recognition device provided by the embodiment of the present application.
Referring to fig. 3A, fig. 3A is an alternative flowchart of a speech recognition method according to an embodiment of the present application, which will be described with reference to the steps shown in fig. 3A.
Step S301, the server performs Adaptive Beamforming (ADBF) processing on the acquired first voice information to obtain frequency spectrums in at least two directions.
Here, the collecting of the first voice information is performed by a voice collecting unit of the electronic device, which may be a microphone located on the electronic device. In the embodiment of the present application, the electronic device may be an electronic device with multiple microphones, for example, two microphones may be provided, and accurate and effective acquisition of speech is achieved through the two microphones.
ADBF processing refers to adapting weighting coefficients from acquired prior data according to an adaptive algorithm and criterion, so as to retain the desired signal and filter out interference. In the embodiment of the application, the first voice information is collected in the form of sound waves, so the sound waves can be ADBF-processed to retain the desired signal components and filter out interfering noise.
It should be noted that, in the embodiment of the present application, the first voice information is ADBF-processed in at least two directions. A direction may be defined by its angle relative to the voice collecting unit of the electronic device: for example, the angle may be 0°, 90°, or 180°, and the corresponding directions are denoted the 0° direction, the 90° direction, the 180° direction, and so on.
During ADBF processing, the first voice information is processed in a plurality of directions simultaneously, and one frequency spectrum is obtained per direction, so that spectra and directions are in one-to-one correspondence. Each spectrum carries the wake-up word used to wake the electronic device, and the device is woken by that wake-up word and put into a working state.
For example, if a user says "please turn on the electronic device" from the 90° direction, the voice collecting unit of the electronic device collects the user's voice information and ADBF-processes it in three directions, 0°, 90°, and 180°, to obtain spectra in those three directions, each of which carries the wake-up word requesting that the electronic device be turned on.
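The per-direction processing described above can be sketched with a delay-and-sum beamformer, a simpler fixed-weight cousin of the adaptive beamforming the patent uses. The two-microphone array, the integer-sample delays, and the direction-to-delay mapping below are all illustrative assumptions, not the patent's implementation; the point is only that the direction whose steering delay matches the source adds coherently and dominates in energy.

```python
# Toy two-microphone beamformer steered toward candidate directions.

def delayed(sig, d):
    # Shift a signal right by d samples, zero-padding the front (d >= 0).
    if d <= 0:
        return list(sig)
    return [0.0] * d + sig[:len(sig) - d]

def steer(mic1, mic2, d):
    # The steering delay d compensates the inter-mic lag of the looked-at
    # direction: positive d delays mic1, negative d delays mic2.
    a = delayed(mic1, max(d, 0))
    b = delayed(mic2, max(-d, 0))
    return [x + y for x, y in zip(a, b)]

def energy(beam):
    return sum(s * s for s in beam)

# Simulated source at 90° (broadside): both mics receive the same waveform.
src = [0.0, 1.0, -1.0, 1.0, -1.0, 0.0]
mic1, mic2 = src, src

delays = {0: 1, 90: 0, 180: -1}  # assumed direction-to-delay mapping
beams = {ang: steer(mic1, mic2, d) for ang, d in delays.items()}
best = max(beams, key=lambda ang: energy(beams[ang]))
print(best)  # prints 90: the steered direction matching the source wins
```

In the patent's method an adaptive algorithm replaces these fixed delays, but the output is analogous: one beam (and spectrum) per candidate direction, from which the target direction is selected.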
Step S302, in the frequency spectrums in the at least two directions, a direction corresponding to a frequency spectrum whose frequency spectrum characteristics satisfy a preset condition is determined as a target direction.
Here, the target direction is a direction closest to an actual direction of the user. In this embodiment of the application, a direction corresponding to a spectrum whose spectrum characteristics satisfy a preset condition may be determined as the target direction.
The spectrum features are attribute information corresponding to the spectrum, and the attribute information can reflect the quality of the spectrum and parameters corresponding to the spectrum.
Step S303, acquiring second voice information acquired in the target direction, and performing voice recognition on the second voice information.
Here, after the target direction is determined, the user is known to be in that direction. Therefore, when speech recognition is to be performed on second speech information following the user's first speech information, the second speech information can be collected only in the target direction. During collection, the user's second speech information is the desired normal speech, and any other sound can be regarded as noise; collecting only in the target direction thus captures the normal speech, obtains the user's speech information accurately, and avoids picking up noise from other directions.
In the embodiment of the present application, performing voice recognition on the second voice information may use any voice recognition method to convert the vocabulary content of the second voice information into machine-readable input, such as keys, binary codes, or character sequences, or to convert the recognized voice information into readable text for display and output.
It should be noted that the sound source of the second voice message may be the same as or different from the sound source of the first voice message. For example, the embodiments of the present application can be applied to the following scenarios:
In a first scenario, as shown in fig. 3B, the user 31 sends first voice information in the 90° direction of the electronic device 30. The server performs ADBF processing on the first voice information to obtain frequency spectra in at least two directions and determines the direction whose spectral characteristics meet the preset condition, the 90° direction closest to the user 31, as the target direction. The electronic device 30 then continues to acquire second voice information in the 90° direction; this second voice information is also sent by the user 31, whose speech is to be recognized, so voice recognition is performed on the second voice information.
In a second scenario, as shown in fig. 3C, the user 31 sends the first voice information in the 90° direction of the electronic device 30, and the 90° direction is determined as the target direction in the same way. The second voice information subsequently acquired in that direction is sent by the user 32, who is at the same position as the user 31; the device is woken up by the first voice information of the user 31, and it is the voice information of the user 32 that needs to be recognized, so voice recognition is performed on the second voice information.
In a third scenario, referring again to fig. 3C, the user 31 sends the first voice information in the 90° direction of the electronic device 30, the server performs ADBF processing to obtain spectra in at least two directions, and the 90° direction closest to the user 31, whose spectral characteristics meet the preset condition, is determined as the target direction. The electronic device then continues to acquire second voice information in the 90° direction; this second voice information is sent by the user 32, who is at the same position as the user 31, while the device was woken up by the first voice information of the user 31. Since voice recognition of the user 31's speech is required here, the method may further include a judgment step: comparing the timbre of the second voice information with that of the first. If they match, the user 32 and the user 31 are the same person, and the collected second voice information is recognized directly; if not, voice information is re-collected until second voice information of the user 31 is obtained. Alternatively, in other embodiments, when the collected voice information belongs to the user 32 and need not be recognized, it may simply be saved.
In a fourth scenario, referring again to fig. 3C, the setup is the same: the device is woken up by the first voice information of the user 31, and the second voice information acquired in the 90° target direction is sent by the user 32, who is at the same position. Here no specific speaker's voice must be recognized, so the device may either wait to receive the voice information of the user 31 for recognition, or treat the second voice information of the user 32 as new first voice information and perform a secondary wake-up, meeting the user 32's current need to use the device.
In a fifth scenario, as shown in fig. 3D, the user 31 sends the first voice information in the 90° direction of the electronic device 30; the server performs ADBF processing to obtain spectra in at least two directions, determines the 90° direction, whose spectral characteristics meet the preset condition, as the target direction, and continues to acquire second voice information in the 90° direction through the electronic device 30. If the electronic device 30 does not acquire second voice information in the 90° direction within a preset time, it may collect voice information from other directions, for example the voice information of the user 32 in the 60° direction. The direction of the user 32 can then be determined as the new target direction from that voice information, and the second voice information of the user 32 acquired and recognized.
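The timeout-and-fallback behavior of this fifth scenario can be sketched as follows. The `capture` function, the frame-count timeout, and the candidate direction set are hypothetical stand-ins for the device's directional acquisition, not the patent's implementation.

```python
# Toy sketch of the fifth scenario: poll the target direction for a preset
# time, then fall back to whichever other direction yields speech.

def acquire_second_voice(capture, target, timeout_frames=3):
    # Try the target direction first; on timeout, accept the first other
    # direction that yields speech and treat it as the new target.
    for _ in range(timeout_frames):
        speech = capture(target)
        if speech:
            return target, speech
    for direction in (d for d in (0, 60, 90, 180) if d != target):
        speech = capture(direction)
        if speech:
            return direction, speech
    return None, None

# Simulated environment: the speaker is now at 60°, not the old 90° target.
def capture(direction):
    return "play music" if direction == 60 else None

print(acquire_second_voice(capture, target=90))  # prints (60, 'play music')
```

The returned direction becomes the new target, mirroring how the method re-estimates the user's direction when the original target direction goes silent.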
According to the voice recognition method provided by the embodiment of the application, ADBF processing is carried out on the collected first voice information, and frequency spectrums in at least two directions are obtained; and determining the direction corresponding to the frequency spectrum with the frequency spectrum characteristics meeting the preset conditions as the target direction in the frequency spectrums in the at least two directions, so that the direction of the sound source can be accurately positioned, and therefore, in the subsequent voice recognition process, second voice information in the accurate direction can be acquired, and the accuracy of voice recognition is improved.
Fig. 4 is an alternative flowchart of a speech recognition method according to an embodiment of the present application, and as shown in fig. 4, the method includes the following steps:
step S401, the server determines a keyword included in the first voice message.
Here, determining the keyword included in the first voice information may be done by analyzing the first voice information: for example, text recognition is performed on the first voice information to obtain text information, word segmentation processing is performed on the text information to obtain at least one word, and then a word meeting a preset part-of-speech condition is determined as the keyword according to the part of speech of each word. Alternatively, the first voice information may be analyzed by using an artificial intelligence technique to determine the keyword included in it.
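The text-based path above (text recognition, word segmentation, part-of-speech filtering) can be sketched as follows. This is an illustrative toy example: the whitespace tokenizer and the `POS_TABLE` lookup are assumptions standing in for a real segmenter and part-of-speech tagger, not part of the embodiment.

```python
# Toy part-of-speech table; a real system would use a POS tagger.
POS_TABLE = {"please": "particle", "play": "verb", "music": "noun"}

def extract_keywords(text, keep_pos=("noun",)):
    # Toy word segmentation: split on whitespace, then keep only the
    # words whose part of speech meets the preset condition.
    words = text.lower().split()
    return [w for w in words if POS_TABLE.get(w) in keep_pos]

keywords = extract_keywords("please play music")  # -> ["music"]
```

With the preset condition "keep nouns", the utterance "please play music" yields the keyword "music", matching the example in the following paragraph.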
For example, when the first voice message is that the user says "please play music" to the electronic device, the keyword may be determined to be "music", and therefore an application related to music on the electronic device needs to be started.
Step S402, the ADBF processing is carried out on the first voice information in at least two directions, and a beam which corresponds to each direction and comprises the keyword is obtained.
Here, the first voice information is a sound wave. When the sound wave is acquired, ADBF processing is performed on it from at least two directions; since the first voice information includes the keyword, each beam obtained after the ADBF processing also includes the keyword.
Step S403, determining the frequency spectrum of the awakening word in the corresponding direction according to the wave beam corresponding to each direction and including the keyword.
Here, when the beam corresponding to each direction is obtained, the distribution of the frequency of the beam may be determined, and a frequency distribution curve of the beam, that is, the frequency spectrum may be obtained. Moreover, since the beam includes the keyword, the keyword may also be carried on the frequency spectrum, and the keyword is used to wake up the electronic device, so that the formed frequency spectrum including the keyword is the frequency spectrum of the wake-up word.
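The "frequency distribution curve" of a beam described above can be obtained from its discrete Fourier transform. A minimal sketch, assuming the beam is available as a sampled signal (the 1 kHz test tone is illustrative only):

```python
import numpy as np

def beam_spectrum(beam, fs):
    # Magnitude spectrum of the beam and the frequency of each bin.
    mag = np.abs(np.fft.rfft(beam))
    freqs = np.fft.rfftfreq(len(beam), d=1.0 / fs)
    return freqs, mag

fs = 16000
t = np.arange(fs) / fs
beam = np.sin(2 * np.pi * 1000 * t)   # synthetic beam: a 1 kHz tone
freqs, mag = beam_spectrum(beam, fs)
peak = freqs[np.argmax(mag)]          # dominant frequency of the beam
```

For the synthetic tone, the spectrum peaks at 1000 Hz; in the embodiment, the spectrum computed this way for each beam is the wake-up word frequency spectrum of the corresponding direction.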
Step S404, in the wake-up word frequency spectrums in the at least two directions, determining a direction corresponding to the wake-up word frequency spectrum whose frequency spectrum characteristics satisfy a preset condition as a target direction.
Here, the target direction is a direction closest to an actual direction of the user. In the embodiment of the application, the direction corresponding to the frequency spectrum of the wakeup word with the frequency spectrum characteristic meeting the preset condition can be determined as the target direction.
The frequency spectrum characteristics are attribute information corresponding to the wake-up word frequency spectrum; this attribute information can reflect the quality of the wake-up word frequency spectrum and the parameters corresponding to it.
Step S405, second voice information collected in the target direction is acquired, and voice recognition is performed on the second voice information.
According to the voice recognition method provided by the embodiment of the application, ADBF processing is carried out on the collected first voice information, and awakening word frequency spectrums in at least two directions are obtained; and in the awakening word frequency spectrums in the at least two directions, the direction corresponding to the awakening word frequency spectrum with the frequency spectrum characteristics meeting the preset conditions is determined as the target direction, so that the direction of the sound source can be accurately positioned, second voice information in the accurate direction can be acquired in the subsequent voice recognition process, and the accuracy of voice recognition is improved.
Fig. 5 is an alternative flowchart of a speech recognition method according to an embodiment of the present application, and as shown in fig. 5, the method includes the following steps:
step S501, the server determines a keyword included in the first voice message.
Step S502, the ADBF processing is performed on the first voice information in at least two directions, and a beam which corresponds to each direction and comprises the keyword is obtained.
It should be noted that steps S501 to S502 are the same as steps S401 to S402, and are not repeated in the embodiments of the present application.
Step S503, the beam corresponding to each direction and including the keyword is processed by voice enhancement, and a voice enhanced beam corresponding to the direction is obtained.
Here, the voice enhancement processing is to perform signal enhancement on the beam so that a keyword of the voice enhanced beam can be accurately recognized.
In the embodiment of the present application, the speech enhancement processing may be performed in the following two ways:
in the first mode, the single-channel speech enhancement processing is performed on the beam corresponding to each direction, so as to obtain a speech enhancement beam in the corresponding direction.
Here, the single-channel speech enhancement processing is used to perform speech enhancement on the beam in the corresponding direction, and the signal strength of the obtained speech enhancement beam is higher than that of the original beam, so that accurate confirmation of the frequency spectrum of the speech enhancement beam can be facilitated.
And in the second mode, the noise in the beam corresponding to each direction is eliminated to obtain the voice enhancement beam corresponding to the direction.
Here, the noise in the beam is removed to reduce the influence of the noise on the effective voice, and since the noise is reduced and accordingly the strength of the effective voice signal is relatively enhanced, the beam corresponding to the first voice information can be enhanced, and the voice enhanced beam can be obtained.
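A common way to realize the second mode (noise elimination) is spectral subtraction: estimate the noise magnitude spectrum, subtract it from each frame's magnitude, and keep the noisy phase. This is a generic illustration of the idea, not the embodiment's specific noise-suppression algorithm; the frame length, floor factor, and oracle noise estimate are all assumptions.

```python
import numpy as np

def spectral_subtract(frame, noise_mag, floor=0.05):
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise magnitude, with a spectral floor to avoid
    # negative magnitudes (and the "musical noise" of hard zeroing).
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=len(frame))

rng = np.random.default_rng(0)
t = np.arange(512) / 16000.0
clean = np.sin(2 * np.pi * 500 * t)            # effective voice
noise = 0.3 * rng.standard_normal(512)         # additive noise
noisy = clean + noise
noise_mag = np.abs(np.fft.rfft(noise))         # oracle noise estimate
enhanced = spectral_subtract(noisy, noise_mag)
```

Because the subtracted magnitude is never larger than the input magnitude, the enhanced beam has lower total energy than the noisy beam while the relative strength of the effective voice is increased, which is the effect described above.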
Step S504, determine the voice enhancement beam in each direction as the frequency spectrum of the wakeup word in the corresponding direction.
Here, the wake word spectrum includes a wake word, where the wake word may be the keyword, and the wake word is used to wake up the electronic device.
Step S505, in the wake-up word frequency spectrums in the at least two directions, determining a direction corresponding to the wake-up word frequency spectrum whose frequency spectrum characteristics satisfy a preset condition as a target direction.
Step S506, second voice information collected in the target direction is acquired, and voice recognition is performed on the second voice information.
It should be noted that steps S505 to S506 are the same as steps S404 to S405, and are not repeated herein.
In some embodiments, the spectral characteristics include a signal-to-noise ratio or a wake-up rate; correspondingly, the step S302 can be implemented in the following two ways:
first, in the frequency spectrums in the at least two directions, the direction corresponding to the frequency spectrum with the highest signal-to-noise ratio is determined as the target direction.
And secondly, determining the direction corresponding to the frequency spectrum with the highest awakening rate as the target direction.
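Both selection rules reduce to an arg-max over the candidate directions. A minimal sketch, where the per-direction statistics are illustrative numbers (chosen to echo the later example in which the 180° beam wins):

```python
def pick_target_direction(stats, key):
    # Return the direction whose spectrum maximizes the given
    # characteristic ("snr" or "wake" here).
    return max(stats, key=lambda d: stats[d][key])

stats = {0:   {"snr": 2.1, "wake": 0.10},
         90:  {"snr": 3.5, "wake": 0.09},
         180: {"snr": 9.8, "wake": 0.90}}

target = pick_target_direction(stats, "snr")   # -> 180
```

Selecting by the highest wake-up score gives the same direction in this example; the two rules can disagree in general, which is why the embodiment treats them as alternative preset conditions.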
Fig. 6A is an alternative flowchart of a speech recognition method according to an embodiment of the present application, and as shown in fig. 6A, the method includes the following steps:
step S601, a voice acquisition unit on the electronic device acquires first voice information, and echo cancellation is carried out on the first voice information.
Here, the electronic device may have two voice collecting units, for example, the electronic device may be a smart speaker having two microphones. First voice information of a user is collected through two microphones on the intelligent sound box.
In the embodiment of the application, after the first voice information is collected, echo cancellation is performed on the first voice information to remove echoes in the first voice information.
In other embodiments, when the microphones collect the first voice information of the user, the electronic device may itself be playing audio, so that the microphones also pick up part of the playback signal, i.e. a portion of the voice signal played by the smart speaker. When the smart speaker picks up this playback signal, voice cancellation needs to be performed on it, so as to eliminate the playback signal and avoid its influence on the first voice information of the user.
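Cancelling the device's own playback from the microphone signal is classically done with an adaptive filter such as NLMS: the playback (reference) signal reaches the microphone through an unknown echo path, and the filter learns that path and subtracts the predicted echo. This is a generic AEC sketch; the tap count, step size, and synthetic echo path are assumed values, not the embodiment's actual parameters.

```python
import numpy as np

def nlms_aec(mic, ref, taps=8, mu=0.5, eps=1e-8):
    w = np.zeros(taps)
    out = np.array(mic, dtype=float)
    for n in range(taps - 1, len(mic)):
        x = ref[n - taps + 1:n + 1][::-1]  # most recent reference samples
        e = mic[n] - w @ x                 # residual after echo estimate
        w += mu * e * x / (x @ x + eps)    # normalized LMS update
        out[n] = e
    return out

rng = np.random.default_rng(1)
playback = rng.standard_normal(4000)                   # device playback
echo = np.convolve(playback, [0.5, 0.3, 0.1])[:4000]   # echo at the mic
residual = nlms_aec(echo, playback)                    # no near-end speech here
```

With no near-end speech, the residual after convergence is far below the echo energy; when the user speaks, that residual is the cleaned first voice information passed on to the ADBF stage.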
Step S602, performing ADBF processing on the echo-cancelled first voice information in the at least two directions to obtain frequency spectrums in the at least two directions.
Here, a method for determining the direction is first provided; the determination of the direction in the embodiment of the present application is described by taking a smart speaker with two microphones as an example.
As shown in fig. 6B, the smart speaker 60 has a first microphone 601 and a second microphone 602, with a midpoint position 603 between them. When the user speaks toward the smart speaker 60, the position of the user 61 is as shown in the figure: the line segment between the position of the user 61 and the midpoint position 603 is determined as a first connection line, and the line segment between the first microphone 601 and the second microphone 602 is determined as a second connection line, so the direction of the user is the included angle 62 between the first connection line and the second connection line. That is, the user 61 is positioned at the angle 62 of the smart speaker 60.
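The geometry of fig. 6B can be sketched numerically: place the two microphones on the x-axis with their midpoint at the origin, so the user's direction is the angle between the user-to-midpoint line (first connection line) and the microphone axis (second connection line). The coordinates below are illustrative.

```python
import numpy as np

def user_angle_deg(user_xy):
    # Angle between the line from the mic-pair midpoint (origin) to the
    # user and the microphone axis (the x-axis), in degrees.
    x, y = user_xy
    return np.degrees(np.arctan2(y, x))

front = user_angle_deg((0.0, 1.0))  # user straight ahead -> 90 degrees
side = user_angle_deg((1.0, 1.0))   # user off to one side -> 45 degrees
```

A user directly in front of the speaker is thus at the 90° direction used throughout the earlier scenarios.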
Based on the method for determining the included angle direction provided in fig. 6B, in this embodiment of the application, after the first voice information of the user is acquired, ADBF processing may be performed on the echo-cancelled first voice information in any at least two included angle directions, so as to obtain a frequency spectrum corresponding to each included angle direction.
Step S603, acquiring a wakeup word included in the frequency spectrum.
Step S604, waking up the electronic device through the wake-up word, so as to implement the voice recognition through the electronic device.
For example, the wake-up word may be identification information pre-stored in the smart speaker, and when the wake-up word corresponding to the identification information of the smart speaker is obtained in the frequency spectrum, the smart speaker is woken up.
Step S605, among the frequency spectrums in the at least two directions, a direction corresponding to a frequency spectrum whose frequency spectrum characteristics satisfy a preset condition is determined as a target direction.
Step S606, second voice information collected in the target direction is acquired, and voice recognition is performed on the second voice information.
It should be noted that steps S605 to S606 are the same as steps S302 to S303, and are not repeated in the embodiments of the present application.
Fig. 7 is an alternative flowchart of a speech recognition method according to an embodiment of the present application, and as shown in fig. 7, the method includes the following steps:
step S701, the server performs ADBF processing on the collected first voice information to obtain frequency spectrums in at least two directions.
Step S702, in the frequency spectrums in the at least two directions, a direction corresponding to a frequency spectrum whose frequency spectrum characteristics satisfy a preset condition is determined as a target direction.
It should be noted that steps S701 to S702 are the same as steps S301 to S302, and are not repeated herein.
And step S703, acquiring a voice acquisition direction prestored in the electronic equipment.
Here, a preset voice capturing direction is stored in the storage unit on the electronic device, and the preset voice capturing direction may be a direction in which voice information for activating the electronic device is acquired within a history period.
Step S704, when the target direction is different from the voice collecting direction, storing the target direction.
Here, the determined target direction is close to the actual direction of the user, that is, close to the direction of the sound source. Therefore, after the target direction is determined, it is compared with the voice acquisition direction pre-stored on the electronic device. If the target direction is the same as the voice acquisition direction, the voice acquisition direction can be directly adopted as the target direction for subsequent processing. If the target direction is different from the voice acquisition direction, the sound source is in a direction other than the pre-stored voice acquisition direction, so the target direction can be stored in the storage unit of the electronic device and used as the direction for collecting voice information in the subsequent voice processing.
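Steps S703 to S704 amount to a compare-and-update on the stored direction. A minimal sketch, where the dict stands in for the device's storage unit and the key name is an assumption for illustration:

```python
def update_capture_direction(storage, target):
    # Keep the pre-stored direction if it already matches the new
    # target; otherwise persist the new target direction.
    if storage.get("capture_direction") != target:
        storage["capture_direction"] = target
    return storage["capture_direction"]

storage = {"capture_direction": 90}
update_capture_direction(storage, 90)   # same direction: unchanged
update_capture_direction(storage, 160)  # different: 160 is stored
```

After the second call the stored acquisition direction is 160°, so subsequent voice collection follows the new sound-source direction, as the paragraph above describes.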
Step S705, second voice information collected in the target direction is acquired, and voice recognition is performed on the second voice information.
According to the voice recognition method provided by the embodiment of the present application, the target direction is stored in the storage unit of the electronic device as the pre-stored voice acquisition direction. In the subsequent voice recognition and voice processing, when the sound source direction is consistent with the stored voice acquisition direction, the pre-stored voice acquisition direction can be directly adopted to collect and recognize the voice information. In this way, the consistency of the subsequent voice acquisition direction with the historical voice acquisition direction can be ensured, and voice information can be collected in the same direction.
Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
The embodiment of the application provides a voice recognition method, which can greatly improve the awakening rate and recognition rate under strong noise and strong reverberation.
Fig. 8 is an alternative flowchart of a speech recognition method according to an embodiment of the present application, and as shown in fig. 8, the method includes the following steps:
step S801, a voice collecting unit collects first voice information of a user.
Step S802, performing echo cancellation on the first voice information through AEC processing.
Step S803, respectively perform ADBF processing on the first voice information in the directions of 0 °, 90 °, and 180 ° to obtain beams in three different directions of 0 °, 90 °, and 180 °.
As shown in fig. 8, step S803a is performed in the 0 ° direction, step S803b is performed in the 90 ° direction, and step S803c is performed in the 180 ° direction, respectively.
Step S804, performing NS processing on the beams in different directions to obtain the voice enhanced beams in the corresponding directions.
Here, as shown in fig. 8, step S804 includes step S804a, step S804b, and step S804c, and NS processing is performed on a beam in the 0 ° direction, a beam in the 90 ° direction, and a beam in the 180 ° direction, respectively.
Step S805, performing keyword recognition on each voice enhancement beam by adopting a KWS technology, and determining keywords.
Here, as shown in fig. 8, the step S805 includes a step S805a, a step S805b, and a step S805c of performing keyword recognition on the voice enhancement beam in the 0 ° direction, the voice enhancement beam in the 90 ° direction, and the voice enhancement beam in the 180 ° direction, respectively.
Step S806, waking up the electronic device according to the keyword, and determining whether the electronic device can be effectively woken up.
Here, if the judgment result is yes, step S807 is executed; if the judgment result is negative, the flow is ended.
And step S807, estimating the direction of arrival of the voice enhanced wave beam by adopting a DOA technology, and estimating the direction of a sound source by utilizing array information.
And step S808, performing adaptive beam forming on the voice enhancement beam by adopting an ADBF technology, and designing a weighting coefficient according to real-time data to form a beam with directivity.
In step S809, ASR speech recognition is performed on the formed beam with directivity.
The voice recognition method can be applied to electronic equipment with two microphones, and the two microphones can simultaneously acquire voice information of a user to obtain two voice signals.
In the embodiment of the present application, adaptive beams are designed in the 0°, 90° and 180° directions for the two voice signals collected by the two microphones, so as to obtain three beams, and single-channel voice enhancement is then performed on the three beams. The directivity patterns of the three beams 901 are shown in figs. 9A to 9C: fig. 9A shows the beam directivity pattern in the 0° direction, fig. 9B shows the pattern in the 90° direction, and fig. 9C shows the pattern in the 180° direction. As shown in figs. 9A to 9C, because the beams of a two-microphone array are wide and therefore somewhat tolerant of angle errors, a speaker who deviates from the main direction of a beam within a certain range will not have the voice signal significantly damaged.
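The width of a two-microphone beam can be seen from the delay-and-sum response of a two-element array. A sketch under assumed values (6 cm spacing, 1 kHz, speed of sound 343 m/s, which are illustrative, not the device's actual geometry):

```python
import numpy as np

def two_mic_response(theta_deg, theta0_deg, d=0.06, f=1000.0, c=343.0):
    # Normalized delay-and-sum gain of a two-mic array steered to
    # theta0, evaluated for a source arriving from theta.
    tau = d * np.cos(np.radians(theta_deg)) / c    # arrival delay
    tau0 = d * np.cos(np.radians(theta0_deg)) / c  # steering delay
    return np.abs(1 + np.exp(2j * np.pi * f * (tau - tau0))) / 2

on_axis = two_mic_response(90, 90)   # full gain in the steered direction
off_axis = two_mic_response(60, 90)  # still high gain 30 degrees away
```

Even 30° off the steered direction the gain remains above 0.9, which is the tolerance to angle errors that the paragraph above attributes to the wide two-microphone beams.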
For example, assume that the speaker direction is 160° and the interference noise comes from the 60° direction. Fig. 10 shows the frequency spectrums of the wake-up word sent to the wake-up unit of the electronic device, where a is the clean speech signal, b is the speech signal with the interference superimposed, c is the speech after single-channel noise reduction, and d, e and f are the output frequency spectrums of the beams in the 0°, 90° and 180° directions, respectively. Since the noise is itself speech interference, the single-channel noise reduction in graph c hardly works; and since the interference noise comes from 60°, most of the signal remaining in graphs d and e is noise. Graph f retains more of the target speech signal and has the highest signal-to-noise ratio, so its wake-up rate is naturally also the highest.
In one experimental result, the wake-up scores of graphs a, b, c, d, e and f are 0.98, 0.32, 0.33, 0.10, 0.09 and 0.9, respectively, showing that the method of the embodiment of the present application achieves a higher wake-up rate.
In another embodiment, continuing to refer to fig. 10, the three processed signals are sent in real time to a wake-up unit of the electronic device to detect whether they contain the wake-up word; if the electronic device is woken up, the DOA unit of the electronic device performs speaker angle estimation by using the cached wake-up word voice. Usually, under high reverberation or strong noise, a two-microphone array suffers from large angle estimation errors or even completely wrong estimates; by dividing the angle range using the wake-up information and performing the angle estimation within the divided range, the accuracy of the speaker angle estimation can be greatly improved.
As shown in fig. 10, since the wake-up score of graph f is the highest, when the DOA technique is used to estimate the speaker angle, the angle estimation range is set close to 180°, for example 120° to 180°, which obviously greatly reduces the angle estimation error.
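The benefit of narrowing the search range can be sketched on a synthetic spatial spectrum: a full-range search can lock onto a spurious noise peak, while a search restricted by the wake-up result recovers the true speaker angle. The spatial-spectrum values below are synthetic, for illustration only.

```python
import numpy as np

def estimate_angle(angles, power, lo, hi):
    # Arg-max over the spatial spectrum, restricted to [lo, hi] degrees.
    mask = (angles >= lo) & (angles <= hi)
    return angles[mask][np.argmax(power[mask])]

angles = np.arange(0, 181, 10)
power = np.exp(-(((angles - 160) / 25.0) ** 2))  # true speaker at 160 degrees
power[angles == 60] += 1.2                       # spurious noise peak at 60 degrees

full_range = angles[np.argmax(power)]                 # fooled by the noise peak
restricted = estimate_angle(angles, power, 120, 180)  # recovers the speaker
```

Here the unrestricted search returns 60° (the interference direction), while restricting the range to 120°–180°, as suggested by the winning 180° beam, returns the true 160°.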
In the embodiment of the present application, after the electronic device is woken up, the voice of the speaker is sent to the ASR unit for recognition after processing such as dereverberation, beamforming and single-channel voice enhancement, so as to complete the voice interaction with the device.
According to the voice recognition method provided by the embodiment of the present application, multi-beam wake-up is performed on the two microphone channels, which can improve the wake-up rate; and the wake-up information is used to improve the accuracy of the angle estimation, reduce the voice damage caused by the beam angle deviating toward the noise, and improve the voice signal-to-noise ratio, thereby further improving the accuracy of voice recognition.
Continuing with the exemplary structure of the speech recognition apparatus 254 implemented as software modules provided by the embodiments of the present application, in some embodiments, as shown in fig. 2B, the software modules of the speech recognition apparatus 254 stored in the memory 250 may include: a first processing module 2541, a determination module 2542, and a second processing module 2543.
A first processing module 2541, configured to perform ADBF processing on the acquired first voice information to obtain frequency spectrums in at least two directions;
a determining module 2542, configured to determine, as a target direction, a direction corresponding to a frequency spectrum whose frequency spectrum characteristics meet a preset condition, in the frequency spectrums in the at least two directions;
the second processing module 2543 is configured to acquire second voice information acquired in the target direction, and perform voice recognition on the second voice information.
In some embodiments, the spectrum is a wake-up word spectrum; correspondingly, the first processing module is further configured to: determining a keyword contained in the first voice message; the ADBF processing is carried out on the first voice information in at least two directions, and a wave beam which corresponds to each direction and comprises the keyword is obtained; and determining the frequency spectrum of the awakening word in the corresponding direction according to the wave beam corresponding to each direction and comprising the keyword.
In some embodiments, the apparatus further comprises:
the voice enhancement processing module is used for carrying out voice enhancement processing on the wave beam corresponding to each direction and comprising the keyword to obtain a voice enhancement wave beam corresponding to the direction;
correspondingly, the first processing module is further configured to: and determining the voice enhancement wave beam in each direction as the frequency spectrum of the awakening word in the corresponding direction.
In some embodiments, the speech enhancement processing module is further configured to:
performing single-channel voice enhancement processing on the wave beam corresponding to each direction to obtain voice enhancement wave beams corresponding to the directions; or, eliminating the noise in the beam corresponding to each direction to obtain the voice enhanced beam corresponding to the direction.
In some embodiments, the spectral characteristics include a signal-to-noise ratio or a wake-up rate;
correspondingly, the determining module is further configured to: and determining the direction corresponding to the frequency spectrum with the highest signal-to-noise ratio as the target direction, or determining the direction corresponding to the frequency spectrum with the highest awakening rate as the target direction.
In some embodiments, the apparatus further comprises:
the echo cancellation module is used for performing echo cancellation on the first voice information before ADBF processing is performed on the collected first voice information;
correspondingly, the first processing module is further configured to: and in the at least two directions, carrying out ADBF processing on the first voice information after echo cancellation to obtain frequency spectrums in the at least two directions.
In some embodiments, the apparatus further comprises:
a first obtaining module, configured to obtain a wake-up word included in the frequency spectrum;
and the awakening module is used for awakening the electronic equipment through the awakening words so as to realize the voice recognition through the electronic equipment.
In some embodiments, the apparatus further comprises:
the first acquisition module is used for acquiring a voice acquisition direction prestored on the electronic equipment;
and the storage module is used for storing the target direction when the target direction is different from the voice acquisition direction.
It should be noted that the description of the apparatus in the embodiment of the present application is similar to the description of the method embodiment, and has similar beneficial effects to the method embodiment, and therefore, the description is not repeated. For technical details not disclosed in the embodiments of the apparatus, reference is made to the description of the embodiments of the method of the present application for understanding.
Embodiments of the present application provide a storage medium having stored therein executable instructions, which when executed by a processor, will cause the processor to perform a method provided by embodiments of the present application, for example, the method as illustrated in fig. 3A.
In some embodiments, the storage medium may be a memory such as a Ferroelectric Random Access Memory (FRAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); or may be various devices including one of, or any combination of, the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (8)

1. A speech recognition method, comprising:
determining keywords contained in the collected first voice information;
performing adaptive beam forming (ADBF) processing on the first voice information in at least two directions to obtain a beam corresponding to each direction and comprising the keyword;
determining the frequency spectrum of the awakening word in the corresponding direction according to the wave beam corresponding to each direction and comprising the keyword;
determining the direction corresponding to the awakening word frequency spectrum with the highest signal to noise ratio as a target direction in the awakening word frequency spectrums in the at least two directions, or determining the direction corresponding to the awakening word frequency spectrum with the highest awakening rate as the target direction;
and acquiring second voice information acquired in the target direction, and performing voice recognition on the second voice information.
2. The method of claim 1, further comprising:
carrying out voice enhancement processing on the wave beam corresponding to each direction and comprising the keyword to obtain a voice enhancement wave beam corresponding to the direction;
correspondingly, the determining the frequency spectrum of the wake-up word in the corresponding direction according to the beam corresponding to each direction and including the keyword includes: and determining the voice enhancement wave beam in each direction as the frequency spectrum of the awakening word in the corresponding direction.
3. The method of claim 2, wherein the performing the voice enhancement processing on the beam corresponding to each direction and including the keyword to obtain a voice enhanced beam corresponding to the direction comprises:
performing single-channel voice enhancement processing on the wave beam corresponding to each direction to obtain voice enhancement wave beams corresponding to the directions; or, eliminating the noise in the beam corresponding to each direction to obtain the voice enhanced beam corresponding to the direction.
4. The method according to any one of claims 1 to 3, further comprising:
before ADBF processing is carried out on the collected first voice information, echo cancellation is carried out on the first voice information;
correspondingly, the ADBF processing is performed on the collected first voice information to obtain frequency spectrums in at least two directions, including:
and in the at least two directions, carrying out ADBF processing on the first voice information after echo cancellation to obtain frequency spectrums in the at least two directions.
5. The method according to any one of claims 1 to 3, further comprising:
acquiring awakening words contained in the frequency spectrum; awakening the electronic equipment through the awakening words so as to realize the voice recognition through the electronic equipment;
alternatively, the method further comprises: acquiring a voice acquisition direction prestored on the electronic equipment;
and when the target direction is different from the voice collecting direction, storing the target direction.
6. A speech recognition apparatus, comprising:
a first processing module, configured to determine a keyword contained in collected first voice information; perform ADBF processing on the first voice information in at least two directions to obtain, for each direction, a corresponding beam comprising the keyword; and determine, according to the beam comprising the keyword corresponding to each direction, the wake-up word frequency spectrum in that direction;
a determining module, configured to determine, among the wake-up word frequency spectra in the at least two directions, the direction corresponding to the wake-up word frequency spectrum with the highest signal-to-noise ratio as a target direction, or the direction corresponding to the wake-up word frequency spectrum with the highest wake-up rate as the target direction;
and a second processing module, configured to acquire second voice information collected in the target direction and perform voice recognition on the second voice information.
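The determining module's first selection rule, picking the direction whose wake-up word spectrum has the highest signal-to-noise ratio, can be sketched as an argmax over per-direction SNRs. This is an illustration under assumed inputs (per-direction power spectra and a matching noise estimate), not the patent's implementation:

```python
import numpy as np

def pick_target_direction(wake_spectra, noise_spectra, directions):
    """Choose the beam direction whose wake-up word spectrum has the
    highest SNR.

    wake_spectra / noise_spectra: per-direction power spectra with
    shape (num_directions, num_bins); directions: e.g. angles in degrees.
    """
    snr_db = 10 * np.log10(wake_spectra.sum(axis=1) / noise_spectra.sum(axis=1))
    return directions[int(np.argmax(snr_db))]

directions = [0, 90, 180, 270]
wake = np.array([[1.0, 1.0], [4.0, 4.0], [2.0, 2.0], [1.0, 0.5]])
noise = np.ones((4, 2))
target = pick_target_direction(wake, noise, directions)  # strongest beam is at 90 degrees
```

The alternative rule in the claim (highest wake-up rate) would replace the SNR score with each beam's wake-word detection confidence, keeping the same argmax structure.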
7. A speech recognition device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 5 when executing executable instructions stored in the memory.
8. A storage medium having stored thereon executable instructions which, when executed, cause a processor to perform the method of any one of claims 1 to 5.
CN201910822237.0A 2019-09-02 2019-09-02 Voice recognition method, device, equipment and storage medium Active CN110517682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910822237.0A CN110517682B (en) 2019-09-02 2019-09-02 Voice recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110517682A CN110517682A (en) 2019-11-29
CN110517682B (en) 2022-08-30

Family

ID=68629170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910822237.0A Active CN110517682B (en) 2019-09-02 2019-09-02 Voice recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110517682B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111402873B (en) * 2020-02-25 2023-10-20 北京声智科技有限公司 Voice signal processing method, device, equipment and storage medium
CN112599126B (en) * 2020-12-03 2022-05-27 海信视像科技股份有限公司 Awakening method of intelligent device, intelligent device and computing device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104103277B (en) * 2013-04-15 2017-04-05 北京大学深圳研究生院 A kind of single acoustics vector sensor target voice Enhancement Method based on time-frequency mask
CN104810021B (en) * 2015-05-11 2017-08-18 百度在线网络技术(北京)有限公司 The pre-treating method and device recognized applied to far field
US10026399B2 (en) * 2015-09-11 2018-07-17 Amazon Technologies, Inc. Arbitration between voice-enabled devices
CN110364166B (en) * 2018-06-28 2022-10-28 腾讯科技(深圳)有限公司 Electronic equipment for realizing speech signal recognition
CN109272989B (en) * 2018-08-29 2021-08-10 北京京东尚科信息技术有限公司 Voice wake-up method, apparatus and computer readable storage medium
CN109599124B (en) * 2018-11-23 2023-01-10 腾讯科技(深圳)有限公司 Audio data processing method and device and storage medium

Similar Documents

Publication Publication Date Title
CN110503969B (en) Audio data processing method and device and storage medium
JP6914236B2 (en) Speech recognition methods, devices, devices, computer-readable storage media and programs
CN110556103B (en) Audio signal processing method, device, system, equipment and storage medium
US11908456B2 (en) Azimuth estimation method, device, and storage medium
US20170365255A1 (en) Far field automatic speech recognition pre-processing
WO2020088153A1 (en) Speech processing method and apparatus, storage medium and electronic device
CN110517682B (en) Voice recognition method, device, equipment and storage medium
WO2020048431A1 (en) Voice processing method, electronic device and display device
JP2016080750A (en) Voice recognition device, voice recognition method, and voice recognition program
CN109087660A (en) Method, apparatus, equipment and computer readable storage medium for echo cancellor
JP2020115206A (en) System and method
CN111667843B (en) Voice wake-up method and system for terminal equipment, electronic equipment and storage medium
CN115775564B (en) Audio processing method, device, storage medium and intelligent glasses
US20230333205A1 (en) Sound source positioning method and apparatus
US20240177726A1 (en) Speech enhancement
CN111883161A (en) Method and device for audio acquisition and position identification
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN104464746A (en) Voice filtering method and device and electron equipment
CN111145752B (en) Intelligent audio device, method, electronic device and computer readable medium
Lee et al. Dialogue enabling speech-to-text user assistive agent system for hearing-impaired person
CN109637555B (en) Japanese speech recognition translation system for business meeting
CN115331672B (en) Device control method, device, electronic device and storage medium
CN111354341A (en) Voice awakening method and device, processor, sound box and television
KR20210054246A (en) Electorinc apparatus and control method thereof
CN115223548B (en) Voice interaction method, voice interaction device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant