CN110010126B

CN110010126B - Speech recognition method, apparatus, device and storage medium

Info

Publication number: CN110010126B
Application number: CN201910180338.2A
Authority: CN
Inventors: 陈建哲; 张腾飞; 向伟
Original assignee: Baidu International Technology Shenzhen Co ltd
Current assignee: Apollo Intelligent Connectivity Beijing Technology Co Ltd
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2021-10-08
Anticipated expiration: 2039-03-11
Also published as: CN113782019A; CN110010126A; CN113990320A

Abstract

Embodiments of the invention provide a speech recognition method, apparatus, device, and storage medium. The speech recognition method may comprise: acquiring multiple wake-up voice signals from a plurality of positions; performing sound source localization on the multiple wake-up voice signals to determine a wake-up voice position; suppressing audio signals at positions other than the wake-up voice position to obtain a signal to be recognized; and performing speech recognition on the signal to be recognized. Because the wake-up voice position is determined first and the audio signals at the other positions are then suppressed, the speech at the wake-up voice position is preserved, and the influence of noise from the other positions on speech recognition, including interference with the wake-up voice position, is reduced.

Description

Speech recognition method, apparatus, device and storage medium

Technical Field

The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and storage medium.

Background

Current in-vehicle speech recognition systems usually accept voice input only from a person at a specific position in a quiet environment. In a vehicle, however, several people often speak at once. For example, one occupant may be on a phone call while another wants to start an operation such as navigation by voice. If the microphone of the in-vehicle head unit picks up the phone conversation, the head unit may produce many false recognitions.

Disclosure of Invention

Embodiments of the present invention provide a speech recognition method, apparatus, device, and storage medium, so as to solve one or more technical problems in the prior art.

In a first aspect, an embodiment of the present invention provides a speech recognition method, including:

acquiring multiple wake-up voice signals from a plurality of positions;

performing sound source localization on the multiple wake-up voice signals to determine a wake-up voice position;

suppressing audio signals at positions other than the wake-up voice position to obtain a signal to be recognized;

and performing speech recognition on the signal to be recognized.

In an embodiment of the present invention, the performing sound source localization on the multiple wake-up voice signals and determining the wake-up voice position includes:

performing sound source localization using the signal energy of the multiple wake-up voice signals, and determining the position corresponding to the wake-up voice signal with the maximum signal energy as the wake-up voice position.

In one embodiment of the invention, the method further comprises:

adjusting the angle of a microphone array by beamforming so that the microphone array faces the wake-up voice position.

In an embodiment of the present invention, suppressing audio signals at other positions than the wake-up voice position to obtain a signal to be recognized includes:

receiving a first voice signal from the microphone at the wake-up voice position;

receiving second voice signals from the microphones at the other positions;

and removing each second voice signal from the first voice signal by using a digital signal processor to obtain the signal to be recognized.

In another embodiment of the present invention, suppressing audio signals at other positions than the wake-up voice position to obtain a signal to be recognized includes:

controlling the microphones at the other positions to stop picking up sound;

and receiving the signal to be recognized from the microphone at the wake-up voice position.

In a second aspect, an embodiment of the present invention provides a speech recognition apparatus, including:

an acquisition unit, configured to acquire multiple wake-up voice signals from a plurality of positions;

a sound source localization unit, configured to perform sound source localization on the multiple wake-up voice signals and determine a wake-up voice position;

a suppression unit, configured to suppress audio signals at positions other than the wake-up voice position to obtain a signal to be recognized;

and a recognition unit, configured to perform speech recognition on the signal to be recognized.

In an embodiment of the present invention, the sound source localization unit is further configured to perform sound source localization using the signal energy of the multiple wake-up voice signals, and to determine the position corresponding to the wake-up voice signal with the largest signal energy as the wake-up voice position.

In one embodiment of the invention, the apparatus further comprises:

a beamforming unit, configured to adjust the angle of the microphone array by beamforming so that the microphone array faces the wake-up voice position.

In one embodiment of the present invention, the suppressing unit includes:

a first receiving subunit, configured to receive a first voice signal from the microphone at the wake-up voice position and to receive second voice signals from the microphones at the other positions;

and a cancellation subunit, configured to remove each second voice signal from the first voice signal by using a digital signal processor to obtain the signal to be recognized.

In one embodiment of the present invention, the suppressing unit includes:

a stop control subunit, configured to control the microphones at the other positions to stop picking up sound;

and a second receiving subunit, configured to receive the signal to be recognized from the microphone at the wake-up voice position.

In a third aspect, an embodiment of the present invention provides a speech recognition device, whose functions may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more units corresponding to the above functions.

In one embodiment, the device includes a processor and a memory. The memory stores a program that enables the device to execute the above speech recognition method, and the processor is configured to execute the program stored in the memory. The device may also include a communication interface for communicating with other devices or a communication network.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer software instructions used by the speech recognition apparatus, including a program for executing the above speech recognition method.

One of the above technical solutions has the following advantages or beneficial effects: because the wake-up voice position is determined first and the audio signals at the other positions are then suppressed, the speech at the wake-up voice position is preserved, and the influence of noise from the other positions on speech recognition, including interference with the wake-up voice position, is reduced. This makes it easier to obtain accurate speech recognition results and improves the user experience.

Another of the above technical solutions has the following advantages or beneficial effects: with the speech recognition method provided by the embodiments of the invention, an interference-resistant recognition scheme can be added to a vehicle. If a person at some position in the vehicle utters the wake-up word, that position is determined to be the wake-up voice position, and the words subsequently spoken by the person at that position are recognized. People speaking at other positions do not interfere with the person at the wake-up voice position, so the user experience is better and the head unit's speech recognition is more intelligent and accurate.

The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent by reference to the drawings and following detailed description.

Drawings

In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.

Fig. 1 schematically shows a flow chart of a speech recognition method according to an embodiment of the invention.

Fig. 2 schematically shows a flow chart of a speech recognition method according to another embodiment of the invention.

Fig. 3 schematically shows a schematic view of an application scenario of a speech recognition method according to yet another embodiment of the present invention.

Fig. 4 schematically shows a flow chart of a speech recognition method according to a further embodiment of the invention.

Fig. 5 schematically shows a schematic view of a speech recognition arrangement according to an embodiment of the invention.

Fig. 6 schematically shows a schematic view of a speech recognition arrangement according to another embodiment of the invention.

Fig. 7 schematically shows a schematic view of a speech recognition device according to an embodiment of the present invention.

Detailed Description

In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

Fig. 1 schematically shows a flow chart of a speech recognition method according to an embodiment of the invention. As shown in fig. 1, the method may include:

Step 101, acquiring multiple wake-up voice signals from a plurality of positions.

Step 102, performing sound source localization on the multiple wake-up voice signals to determine the wake-up voice position.

Step 103, suppressing the audio signals at positions other than the wake-up voice position to obtain a signal to be recognized.

Step 104, performing speech recognition on the signal to be recognized.
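The four steps can be read as a simple chain of stages. The following is a minimal sketch only: the function name `run_pipeline` and the three callables `locate`, `suppress`, and `recognize` are hypothetical placeholders introduced for illustration, and the patent does not prescribe any particular interface.

```python
from typing import Callable, Dict, Sequence

Signal = Sequence[float]


def run_pipeline(
    wake_signals: Dict[str, Signal],
    locate: Callable[[Dict[str, Signal]], str],
    suppress: Callable[[str], Signal],
    recognize: Callable[[Signal], str],
) -> str:
    """Chain steps 101 to 104; every stage is a caller-supplied placeholder."""
    # Step 101: wake_signals already holds one wake-up signal per position.
    # Step 102: sound source localization yields the wake-up voice position.
    wake_position = locate(wake_signals)
    # Step 103: suppress the other positions to get the signal to be recognized.
    to_recognize = suppress(wake_position)
    # Step 104: speech recognition on that signal.
    return recognize(to_recognize)
```

Keeping each stage as a separate callable mirrors the way the description later offers alternative implementations of step 103 without changing the other steps.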

In one embodiment, the microphone array may include a plurality of microphones installed at a plurality of positions, and a designated space may be divided into a plurality of sound zones according to the positions of the microphones. For example, four microphones are installed in a vehicle, near the driver's seat, the front passenger seat, the rear left seat, and the rear right seat respectively. The four microphones divide the space inside the vehicle into four sound zones: a driver sound zone, a front passenger sound zone, a rear left sound zone, and a rear right sound zone.

Each microphone may also be connected to a corresponding wake-up engine. While the voice device has not been woken up, the microphones keep picking up sound. If the voice signal received by a microphone includes the wake-up word, the wake-up engine connected to that microphone can wake up the voice function of that channel. Voice signals that include the wake-up word are referred to simply as wake-up voice signals.

In one embodiment, step 102 comprises: performing sound source localization using the signal energy of the multiple wake-up voice signals, and determining the position corresponding to the wake-up voice signal with the maximum signal energy as the wake-up voice position.

The distance from a given sound source to each microphone may differ, so the energy of the voice signal that each microphone receives from that source may also differ. By comparing the wake-up voice signals received by the microphones, the position of the microphone that received the signal with the maximum energy can be determined as the wake-up voice position.
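The energy comparison can be illustrated with a short sketch. It assumes that energy is measured as the sum of squared samples over the wake-up word segment, which the patent does not specify; the position names and sample values are made up for illustration.

```python
from typing import Dict, Sequence


def signal_energy(samples: Sequence[float]) -> float:
    """Energy taken as the sum of squared samples (an assumed definition)."""
    return sum(s * s for s in samples)


def locate_wake_position(wake_signals: Dict[str, Sequence[float]]) -> str:
    """Return the position whose wake-up signal carries the most energy."""
    if not wake_signals:
        raise ValueError("at least one wake-up voice signal is required")
    return max(wake_signals, key=lambda pos: signal_energy(wake_signals[pos]))


# Illustrative values: the speaker sits closest to the driver's microphone.
signals = {
    "driver": [0.50, -0.60, 0.40],
    "front_passenger": [0.20, -0.10, 0.10],
    "rear_left": [0.05, 0.02, -0.03],
    "rear_right": [0.04, -0.01, 0.02],
}
print(locate_wake_position(signals))  # driver
```

A practical system would likely normalize for differing microphone gains before comparing energies; that calibration is outside what the description states.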

In one embodiment, as shown in fig. 2, the method further comprises:

Step 201, adjusting the angle of the microphone array by beamforming so that the microphone array faces the wake-up voice position. Specifically, after sound source localization, beamforming steers the microphone array so that the energy of the voice signal captured from the direction of the wake-up voice position is maximized, making that signal the most usable, while signals from other directions are attenuated. This provides a preliminary suppression of noise signals.
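The patent does not say which beamforming algorithm is used. As one common possibility, the sketch below shows a delay-and-sum beamformer with integer sample delays; the function name `delay_and_sum` and the idea that the steering delays are already known from the array geometry are assumptions made for illustration.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]], delays: Sequence[int]) -> List[float]:
    """Align each channel by its steering delay (in samples) and average.

    delays[i] is how many samples later the target's sound arrives at
    microphone i than at the earliest microphone, so it must be >= 0.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Only keep the span where every shifted channel still has samples.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        return []
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


# A pulse reaches microphone 1 one sample after microphone 0.
mic0 = [0.0, 1.0, 0.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0, 0.0]
print(delay_and_sum([mic0, mic1], [0, 1]))  # [0.0, 1.0, 0.0, 0.0]
print(delay_and_sum([mic0, mic1], [0, 0]))  # [0.0, 0.5, 0.5, 0.0, 0.0]
```

With the correct steering delays the pulse adds coherently and keeps its full amplitude; with the wrong delays it is smeared and halved, which is the attenuation of other directions that the step describes.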

In one embodiment, step 103 may be implemented in several ways, for example:

Example one: removing the audio signals of the other positions by using a Digital Signal Processor (DSP). Specifically: receiving a first voice signal from the microphone at the wake-up voice position; receiving second voice signals from the microphones at the other positions; and using the DSP to remove each second voice signal from the first voice signal to obtain the signal to be recognized. For example, the DSP may subtract the signals of the microphones at positions other than the wake-up voice position.

In one application scenario, the voice signal received by the driver's microphone contains speech that is also received by the front passenger's microphone. If the voice signal received by the front passenger's microphone is available, it can be cancelled from the voice signal received by the driver's microphone. In this way, the contribution of sound from the other positions to the signal received by the driver's microphone can be removed more effectively.
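The subtraction in example one can be sketched as follows. This is a simplification under stated assumptions, not the DSP algorithm of the patent: each second signal is assumed to leak into the first signal as a time-aligned, scaled copy, and the leakage gain is estimated with a single-tap least-squares fit, one reference at a time. The names `dot` and `cancel_references` are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length sample sequences."""
    return sum(x * y for x, y in zip(a, b))


def cancel_references(first: Sequence[float], seconds: Sequence[Sequence[float]]) -> List[float]:
    """Subtract a scaled copy of each second signal from the first signal.

    For each reference the gain g = <residual, ref> / <ref, ref> is the
    single-tap least-squares estimate of how strongly it leaked in.
    """
    residual = list(first)
    for ref in seconds:
        if len(ref) != len(residual):
            raise ValueError("all signals must have the same length")
        power = dot(ref, ref)
        if power == 0.0:
            continue  # a silent reference contributes nothing to remove
        gain = dot(residual, ref) / power
        residual = [r - gain * x for r, x in zip(residual, ref)]
    return residual


# Illustrative data: the driver's microphone hears the driver plus half of
# the front passenger's speech.
driver_speech = [1.0, 1.0, -1.0, -1.0]
passenger_mic = [1.0, -1.0, 1.0, -1.0]
driver_mic = [d + 0.5 * p for d, p in zip(driver_speech, passenger_mic)]
print(cancel_references(driver_mic, [passenger_mic]))  # [1.0, 1.0, -1.0, -1.0]
```

In a real cabin the leakage is delayed and reverberant, and the driver's own speech also reaches the other microphones, so a practical DSP would more plausibly use an adaptive multi-tap filter; the single-tap form is used here only to keep the idea visible.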

Example two: allowing only the microphone at the wake-up voice position to pick up sound, and disabling sound pickup by the microphones at the other positions. Specifically: controlling the microphones at the other positions to stop picking up sound; and receiving the signal to be recognized from the microphone at the wake-up voice position.

With the method provided by the embodiments of the invention, the wake-up voice position is determined first and the audio signals at the other positions are then suppressed, so that the speech at the wake-up voice position is preserved, and the influence of noise from the other positions on speech recognition, including interference with the wake-up voice position, is reduced. This makes it easier to obtain accurate speech recognition results and improves the user experience.
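Example two amounts to gating the microphones by position. A minimal sketch follows, assuming a software gate that drops frames from disabled positions; the class `MicrophoneGate` and its methods are hypothetical, and real hardware might instead power down or mute the microphones directly.

```python
from typing import Dict, List


class MicrophoneGate:
    """Tracks which positions are allowed to pick up sound (hypothetical helper)."""

    def __init__(self, positions: List[str]) -> None:
        # Before wake-up every microphone keeps listening.
        self.enabled: Dict[str, bool] = {p: True for p in positions}

    def lock_to(self, wake_position: str) -> None:
        """Stop pickup everywhere except the wake-up voice position."""
        if wake_position not in self.enabled:
            raise KeyError(wake_position)
        for position in self.enabled:
            self.enabled[position] = position == wake_position

    def filter_frames(self, frames: Dict[str, List[float]]) -> Dict[str, List[float]]:
        """Drop frames from positions whose pickup is stopped."""
        return {p: f for p, f in frames.items() if self.enabled.get(p, False)}


gate = MicrophoneGate(["driver", "front_passenger", "rear_left", "rear_right"])
gate.lock_to("driver")
frames = {"driver": [0.1, 0.2], "front_passenger": [0.9, 0.8]}
print(gate.filter_frames(frames))  # {'driver': [0.1, 0.2]}
```

Compared with example one, this approach needs no signal processing, but it cannot remove other speakers' voices that still reach the microphone at the wake-up voice position.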

In an application example, the speech recognition system of a vehicle is used for illustration. As shown in fig. 3, the vehicle interior includes four microphones 301, one at each of four positions (e.g., the driver's seat, the front passenger seat, the rear left seat, and the rear right seat). Each microphone 301 is connected to its own wake-up engine 302, for a total of four wake-up engines. The voice signals received by the four microphones are all input to the same DSP for suppression processing, which provides interference resistance. In addition, the system may include a single recognition engine 303 that performs speech recognition on the signal after DSP suppression processing.

As shown in fig. 4, the flow of speech recognition may include:

Step 401, if the speech from a sound source includes the wake-up word, a position is woken up. For example, the microphone nearest the sound source receives a voice signal that includes the wake-up word, and the wake-up word wakes up the sound zone in which that microphone is located. Because the voice signals have not yet undergone DSP suppression, the same speech also reaches the other three positions, so the sound zones at those positions may be woken up as well. In this case, the four microphones of the head unit record four channels: the driver, front passenger, rear left, and rear right positions all record audio.

Step 402, during initialization the DSP can use beamforming to steer the microphone array toward the four corresponding positions. The four wake-up voice signals from the four positions are then input to the DSP for processing. Through sound source localization, the position with the largest signal energy is taken as the wake-up voice position. The DSP algorithmically suppresses the audio signals of the other three positions to obtain a clean voice signal from the wake-up voice position.

Step 403, the sound zone containing the wake-up voice position can be locked, so that only the signal to be recognized from that sound zone is acquired. In addition, during speech recognition the DSP provides a clean signal to be recognized for that sound zone.

Step 404, once the sound zone containing the wake-up voice position has been determined, the speech recognition function responds only to the signal to be recognized from that sound zone, which provides interference resistance if people at other positions are talking. For example, suppose the driver wakes up the system from the driver's sound zone and then issues a navigation command while someone in the front passenger seat is on a phone call. Because only the voice signal from the driver's position is acquired, and that signal has undergone DSP suppression, the driver's navigation command can be recognized correctly.
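The zone-locking behaviour of steps 401 to 404 can be sketched as a small state holder. The class `ZoneSession`, its method names, and the idea of passing per-zone wake-up energies and already-transcribed text are assumptions made for illustration; in the described system the suppression happens in the DSP before recognition rather than on recognized text.

```python
from typing import Dict, Optional


class ZoneSession:
    """Locks onto one sound zone after wake-up and ignores the others."""

    def __init__(self) -> None:
        self.locked_zone: Optional[str] = None

    def on_wake(self, wake_energy: Dict[str, float]) -> str:
        """Steps 401 to 403: several zones may wake; lock the most energetic one."""
        if not wake_energy:
            raise ValueError("no zone reported a wake-up word")
        self.locked_zone = max(wake_energy, key=lambda zone: wake_energy[zone])
        return self.locked_zone

    def on_utterance(self, zone: str, text: str) -> Optional[str]:
        """Step 404: respond only to speech from the locked zone."""
        if zone != self.locked_zone:
            return None  # speech from other zones is treated as interference
        return text


session = ZoneSession()
session.on_wake({"driver": 0.77, "front_passenger": 0.06, "rear_left": 0.01, "rear_right": 0.01})
print(session.on_utterance("front_passenger", "phone call chatter"))  # None
print(session.on_utterance("driver", "navigate home"))  # navigate home
```

The example mirrors the scenario in step 404: the front passenger's phone call is ignored, and only the driver's navigation command is passed on.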

With the speech recognition method provided by the embodiments of the invention, an interference-resistant recognition scheme can be added to a vehicle. If a person at some position in the vehicle utters the wake-up word, that position is determined to be the wake-up voice position, and the words subsequently spoken by the person at that position are recognized. People speaking at other positions do not interfere with the person at the wake-up voice position, so the user experience is better and the head unit's speech recognition is more intelligent and accurate.

Fig. 5 schematically shows a schematic view of a speech recognition arrangement according to an embodiment of the invention. As shown in fig. 5, the apparatus may include:

an obtaining unit 501, configured to obtain multiple wake-up voice signals from multiple locations;

a sound source positioning unit 502, configured to perform sound source positioning on the multiple wake-up voice signals, and determine a wake-up voice position;

a suppressing unit 503, configured to suppress audio signals at other positions than the wake-up voice position to obtain a signal to be recognized;

a recognition unit 504, configured to perform speech recognition on the signal to be recognized.

In one embodiment of the present invention, as shown in fig. 6, the apparatus further comprises:

a beam forming unit 601, configured to adjust an angle of the microphone array by using beam forming, so that the microphone array faces the wake-up voice position.

In one embodiment of the present invention, the suppressing unit 503 includes:

The functions of each unit in each device in the embodiments of the present invention may refer to the corresponding description in the above method, and are not described herein again.

Fig. 7 schematically shows a schematic view of a speech recognition device according to an embodiment of the present invention. As shown in fig. 7, the speech recognition device includes a memory 910 and a processor 920, the memory 910 storing a computer program that can run on the processor 920. The processor 920 implements the speech recognition method of the above embodiments when executing the computer program. There may be one or more memories 910 and one or more processors 920.

The device further comprises:

a communication interface 930, configured to communicate with external devices for data exchange.

Memory 910 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.

If the memory 910, the processor 920 and the communication interface 930 are implemented independently, they may be connected to each other through a bus and communicate with each other over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 7, but this does not mean that there is only one bus or one type of bus.

Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.

An embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and the computer program is used for implementing the method of any one of the above embodiments when being executed by a processor.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.

The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present invention, and these should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A speech recognition method, comprising:

acquiring multiple wake-up voice signals from a plurality of positions;

suppressing the audio signals at positions other than the wake-up voice position, by stopping sound pickup at those positions, to obtain a signal to be recognized;

and performing speech recognition on the signal to be recognized.

2. The method of claim 1, wherein performing sound source localization on the plurality of wake-up voice signals and determining a wake-up voice position comprises:

3. The method of claim 1, further comprising:

4. The method according to any one of claims 1 to 3, wherein the suppressing the audio signals at the positions other than the wake-up voice position to obtain the signal to be recognized by stopping the sound reception of the audio signals at the positions other than the wake-up voice position comprises:

controlling the microphones at the other positions to stop picking up sound;

5. A speech recognition apparatus, comprising:

a suppression unit, configured to suppress the audio signals at positions other than the wake-up voice position, by stopping sound pickup at those positions, to obtain a signal to be recognized;

6. The apparatus according to claim 5, wherein the sound source localization unit is further configured to perform sound source localization by using signal energy of the multiple wake-up voice signals, and determine a position corresponding to one of the wake-up voice signals with the largest signal energy as the wake-up voice position.

7. The apparatus of claim 5, further comprising:

8. The apparatus according to any one of claims 5 to 7, wherein the suppressing unit comprises:

9. A speech recognition device, comprising:

one or more processors;

storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-4.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 4.