WO2020001163A1 - Speech recognition method and apparatus, computer device and electronic device - Google Patents

Speech recognition method and apparatus, computer device and electronic device

Info

Publication number
WO2020001163A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech recognition
signal
audio signal
beam signal
keyword detection
Prior art date
Application number
PCT/CN2019/085625
Other languages
English (en)
French (fr)
Inventor
高毅
郑脊萌
于蒙
罗敏
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to JP2020570624A (JP7109852B2/ja)
Priority to EP19824812.2A (EP3816995A4/en)
Publication of WO2020001163A1 (WO2020001163A1/zh)
Priority to US16/921,537 (US11217229B2/en)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 - Details of transducers, loudspeakers or microphones
    • H04R 1/20 - Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/40 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers
    • H04R 1/406 - Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only by combining a number of identical transducers: microphones
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/20 - Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 - Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 - Circuits for transducers, loudspeakers or microphones
    • H04R 3/04 - Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 5/00 - Stereophonic arrangements
    • H04R 5/027 - Spatial or constructional arrangements of microphones, e.g. in dummy heads
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 2015/088 - Word spotting
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 2021/02082 - Noise filtering the noise being echo, reverberation of the speech
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering
    • G10L 21/0216 - Noise filtering characterised by the method used for estimating noise
    • G10L 2021/02161 - Number of inputs available containing the signal or the noise to be suppressed
    • G10L 2021/02166 - Microphone arrays; Beamforming
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 - Noise filtering

Definitions

  • the present application relates to the field of voice interaction technology, and in particular, to a voice recognition method and apparatus, computer equipment, and electronic equipment.
  • Intelligent voice interaction is a technology that enables human-computer interaction through voice commands.
  • By embedding voice interaction technology into electronic devices, the devices are given artificial intelligence capabilities, and AI-based electronic devices are now more and more popular with users. For example, Amazon's Echo smart speaker has achieved great success in the market.
  • Related technologies generally address this problem by first collecting audio signals through all microphones in the microphone array, then determining the sound source angle based on the collected audio signals, and directionally collecting audio signals according to that angle, thereby reducing the interference of uncorrelated noise. This approach depends on the accuracy of the estimated sound source angle: when the angle is detected incorrectly, the accuracy of speech recognition is reduced.
  • Embodiments of the present application provide a speech recognition method and apparatus, a computer device, and an electronic device, which can address the problem of low speech recognition accuracy in the related art.
  • A speech recognition method includes:
  • receiving an audio signal collected by a microphone array;
  • performing beamforming processing on the audio signal in a plurality of different target directions to obtain a plurality of corresponding beam signals;
  • performing speech recognition on each beam signal to obtain a speech recognition result of each beam signal; and
  • determining the speech recognition result of the audio signal according to the speech recognition results of the beam signals.
  • a voice recognition device includes:
  • An audio signal receiving module for receiving audio signals collected by a microphone array
  • a beamformer configured to separately perform beamforming processing on the audio signal in multiple different target directions to obtain corresponding multiple beam signals
  • a voice recognition module configured to perform voice recognition on each beam signal separately to obtain a voice recognition result of each beam signal
  • a processing module is configured to determine a speech recognition result of the audio signal according to a speech recognition result of each beam signal.
  • a computer device includes a microphone array, a memory, and a processor.
  • the memory stores a computer program that, when executed by the processor, causes the processor to perform the steps of the foregoing method.
  • An electronic device includes:
  • a microphone array for collecting audio signals, the microphone array comprising at least two layers of ring structures;
  • a processor connected to the microphone array and configured to process the audio signal
  • a memory storing a computer program
  • a housing encapsulating the microphone array and the processor
  • When the computer program is executed by the processor, the processor is caused to execute the speech recognition method described above.
  • In the above speech recognition method and apparatus, computer device, and electronic device, beamforming is performed on the audio signals collected by the microphone array in a plurality of different target directions, yielding corresponding multi-channel beam signals; sound enhancement is thus performed separately in each target direction, and the enhanced beam signal in each target direction can be clearly extracted. That is, the method does not need to consider the direction of the sound source.
  • Because beamforming is performed in multiple different target directions, at least one target direction is close to the actual sound direction, so at least one beam signal enhanced in its target direction is clear. Performing speech recognition on each beam signal therefore improves the accuracy of speech recognition.
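The claimed flow (beamform in fixed directions, recognize each beam, fuse the results) can be sketched end to end. Everything below is illustrative only: the function names, the mixing-style stub beamformer, and the energy-threshold stub recognizer are invented stand-ins, not the patent's algorithms.

```python
# Hypothetical sketch of the claimed pipeline: beamform the array signal in
# several fixed target directions, recognize each beam independently, then
# fuse the per-beam results. The stubs stand in for real DSP/ASR components.

def beamform(frames, direction_deg):
    # Stub beamformer: mixes the channels down and tags the result with its
    # target direction so the data flow is visible.
    mixed = [sum(chan) / len(chan) for chan in frames]
    return {"direction": direction_deg, "samples": mixed}

def recognize(beam):
    # Stub recognizer: returns a result with a score per beam signal.
    energy = sum(s * s for s in beam["samples"])
    text = "keyword" if energy > 0.5 else ""
    return {"direction": beam["direction"], "text": text, "score": energy}

def recognize_array(frames, directions=(0, 90, 180, 270)):
    beams = [beamform(frames, d) for d in directions]   # one beam per target direction
    results = [recognize(b) for b in beams]             # per-beam recognition
    return max(results, key=lambda r: r["score"])       # fuse: keep the best result

# frames: list of per-sample tuples across (here) 2 microphones
frames = [(0.9, 1.1), (0.8, 1.2), (-1.0, -1.0)]
best = recognize_array(frames)
```

In a real system the fusion step would be one of the decision rules described later (any-beam keyword detection, probability threshold, or score-based selection).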
  • FIG. 1 is a schematic flowchart of a speech recognition method according to an embodiment
  • FIG. 2 is a schematic diagram of a microphone array in an embodiment
  • FIG. 3 is a schematic diagram of beam signals obtained by performing beamforming processing in four target directions in an embodiment
  • FIG. 4 is a schematic diagram of interaction between a beamformer and a speech recognition model in an embodiment
  • FIG. 5 is a schematic structural diagram of a speech recognition model according to an embodiment
  • FIG. 6 is a signal diagram of the neural network nodes of a speech recognition model when a wake-up word is detected in an embodiment
  • FIG. 7 is an architecture diagram of speech recognition according to an embodiment
  • FIG. 8 is a schematic diagram of a microphone array according to an embodiment
  • FIG. 9 is a schematic diagram of a microphone array according to another embodiment.
  • FIG. 10 is a schematic flowchart of steps in a speech recognition method according to an embodiment
  • FIG. 11 is a structural block diagram of a voice recognition device in an embodiment
  • FIG. 12 is a structural block diagram of a computer device in one embodiment.
  • A speech recognition method is provided. This embodiment is described mainly by taking the application of the method to a speech recognition device as an example.
  • The speech recognition device may be an electronic device embedded with voice interaction technology, and the electronic device may be a smart terminal, a smart home appliance, a robot, or the like capable of realizing human-computer interaction.
  • the speech recognition method includes:
  • S102 Receive an audio signal collected by a microphone array.
  • the microphone array refers to the arrangement of microphones, and is composed of a certain number of microphones. Each microphone collects analog signals of ambient sound, and converts the analog signals into digital audio signals through audio acquisition equipment such as analog-to-digital converters, gain controllers, and codecs.
  • the microphone array may be a one-dimensional microphone array, and the array element centers are located on the same straight line. According to whether the distance between adjacent array elements is the same, it can be divided into uniform linear array (Uniform Linear Array, ULA) and nested linear array. Uniform linear array is the simplest array topology. The distance between elements is the same, and the phase and sensitivity are the same. Nested linear arrays can be viewed as the superposition of several sets of uniform linear arrays, which is a special type of non-uniform array. This linear microphone array cannot distinguish the sound source direction in the entire 360-degree range in the horizontal direction, but can only distinguish the sound source direction in the 180-degree range. Such a linear microphone array can be adapted to an application environment with a range of 180 degrees, for example, a voice recognition device is placed against a wall, or the voice recognition device is in an environment where the sound source is within a 180 degree range.
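For a uniform linear array, the per-element time delays used to steer toward a given angle follow directly from the element spacing. The following is a minimal sketch under assumed values (4 elements, 5 cm spacing, sound speed 343 m/s, angle measured from broadside); it is not taken from the patent:

```python
import math

def ula_steering_delays(num_mics, spacing_m, angle_deg, c=343.0):
    # Delay (seconds) applied to element m of a uniform linear array so
    # that a plane wave arriving from angle_deg (measured from broadside)
    # adds coherently after alignment: tau_m = m * d * sin(theta) / c.
    theta = math.radians(angle_deg)
    return [m * spacing_m * math.sin(theta) / c for m in range(num_mics)]

delays = ula_steering_delays(num_mics=4, spacing_m=0.05, angle_deg=30)
```

At 0 degrees (broadside) all delays are zero; the delays grow linearly along the array as the steering angle increases, which is why a linear array can only resolve directions over a 180-degree range.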
  • the microphone array may be a two-dimensional microphone array, that is, a planar microphone array, and the array element centers are distributed on a plane. According to the geometry of the array, it can be divided into equilateral triangle array, T-shaped array, uniform circular array, uniform square array, coaxial circular array, circular or rectangular area array, and so on.
  • the planar microphone array can obtain the horizontal azimuth and vertical azimuth information of the signal.
  • Such a planar microphone array can be adapted to a 360-degree application environment, for example, a voice recognition device needs to receive sounds with different orientations.
  • the microphone array may be a three-dimensional microphone array, that is, a stereo microphone array, and the array element center thereof is distributed in the stereo space. According to the three-dimensional shape of the array, it can be divided into tetrahedral array, cube array, cuboid array, and spherical array.
  • the stereo microphone array can obtain three kinds of information: the horizontal azimuth, vertical azimuth, and distance between the sound source and the reference point of the microphone array.
  • In this embodiment, a ring microphone array is used as an example for illustration.
  • An embodiment of a ring microphone array is shown in FIG. 2.
  • Six physical microphones are used, arranged sequentially at azimuth angles of 0, 60, 120, 180, 240, and 300 degrees on a circumference of radius R; these six physical microphones form a ring microphone array.
  • Each microphone collects an analog signal of ambient sound, and converts the analog signal into a digital sound signal through audio acquisition equipment such as an analog-to-digital converter, a gain controller, and a codec.
  • the ring microphone array captures sound signals 360 degrees.
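The geometry just described (six microphones at 0/60/.../300 degrees on a circle of radius R) can be generated as coordinates. The radius value below is an assumption for illustration; the patent leaves R unspecified:

```python
import math

def ring_positions(num_mics=6, radius_m=0.04, start_deg=0.0):
    # (x, y) coordinates of microphones evenly spaced on a circle of
    # radius R, matching the 0/60/.../300-degree layout described above.
    positions = []
    for k in range(num_mics):
        a = math.radians(start_deg + k * 360.0 / num_mics)
        positions.append((radius_m * math.cos(a), radius_m * math.sin(a)))
    return positions

mics = ring_positions()
```

These coordinates are what a beamformer design would consume to compute per-microphone delays for any chosen target direction over the full 360 degrees.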
  • Beamforming is to perform delay or phase compensation and amplitude weighting on the audio signals output by each microphone in the microphone array to form a beam pointing in a specific direction.
  • The audio signals collected by the microphone array are beamformed in the 0-, 90-, 180-, and 270-degree directions to form beams pointing in those four directions.
  • a beamformer may be used to separately perform beamforming processing on audio signals in a set direction.
  • the beamformer is an algorithm based on a specific microphone array design. It can enhance audio signals in a specific target direction or multiple target directions, and suppress audio signals in non-target directions.
  • The beamformer can be any type of directional beamformer, including but not limited to a superdirective beamformer, an MVDR (Minimum Variance Distortionless Response) beamformer, or a beamformer based on the MUSIC (Multiple Signal Classification) algorithm.
  • each beamformer performs beamforming processing in different directions.
  • the digital audio signals of multiple microphones constitute a microphone array signal and are sent to multiple beamformers.
  • Each beamformer enhances the audio signal in its set direction and suppresses audio signals in other directions; the more a signal deviates from the set direction, the more it is suppressed, so the audio signal near the set direction is extracted.
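The patent describes beamforming as delay or phase compensation plus weighting and summation; the simplest instance of that is a delay-and-sum beamformer. A minimal sketch with integer-sample delays and invented two-microphone signals (the patent does not mandate this particular algorithm):

```python
def delay_and_sum(channels, delays_samples):
    # Shift each channel by its (integer) steering delay, then average.
    # Samples shifted in from outside the buffer are treated as zero.
    n = len(channels[0])
    out = [0.0] * n
    for chan, d in zip(channels, delays_samples):
        for i in range(n):
            j = i - d
            if 0 <= j < n:
                out[i] += chan[j]
    return [v / len(channels) for v in out]

# Two mics; the wavefront reaches mic 1 one sample later than mic 0.
mic0 = [0.0, 1.0, 0.0, -1.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0, -1.0]
aligned = delay_and_sum([mic0, mic1], delays_samples=[1, 0])
```

After delaying mic 0 by one sample the two channels add coherently, so sound from the steered direction keeps full amplitude while sound from other directions, whose inter-microphone delays do not match, partially cancels.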
  • the arrangement of the microphone array for collecting the audio signals is not limited.
  • The audio signal in the target direction can be enhanced to reduce the interference of audio signals in other directions. Therefore, as an example, the microphone array that collects the audio signals includes microphones in at least two different directions.
  • Take the microphone array shown in FIG. 2 as an example to collect audio signals.
  • The digital audio signals of the multiple microphones are combined into a microphone array signal; the sound in the 0-degree direction remains unchanged (0 dB gain), sounds in the 30-degree and 330-degree directions are suppressed by more than 9 dB (about -9 dB gain), and sounds in the 90-degree and 270-degree directions are suppressed by more than 20 dB.
  • the digital audio signals of multiple microphones are combined into a microphone array signal, and the sound in the 90-degree direction remains unchanged (0dB gain), and the sound in the 30-degree and 150-degree directions has a suppression effect greater than 9dB (about -9dB gain), and it has more than 20dB rejection for 0 and 180 degree sounds.
  • the digital audio signals of multiple microphones are combined into a microphone array signal, and the sound in the 180-degree direction remains unchanged (0dB gain), and the sound in the 120-degree and 240-degree directions has a suppression effect greater than 9dB (about -9dB gain), and it has more than 20dB rejection for 90-degree and 270-degree sounds.
  • the digital audio signals of multiple microphones are combined into a microphone array signal, and the sound in the 270-degree direction remains unchanged (0dB gain), and the sound in the 210-degree and 330-degree directions has a suppression effect greater than 9dB (about -9dB gain), the sound of 180 degrees and 0 degrees has more than 20dB of rejection.
  • more or fewer beamformers may be provided to extract beam signals in other directions.
  • By performing beamforming in a plurality of set target directions, the beam signal output by each beamformer enhances the audio signal in its target direction and reduces interference from audio signals in other directions.
  • Among the beam signals in the multiple target directions, at least one is close to the actual sound direction; that is, at least one beam signal can reflect the actual sound while noise interference from other directions is reduced.
  • For the audio signals collected by the microphone array, there is no need to identify the direction of the sound source; beamforming is simply performed in the plurality of set target directions.
  • The advantage of this is that beam signals in multiple target directions are obtained, of which at least one must be close to the actual sound direction, i.e., at least one beam signal can reflect the actual sound.
  • The audio signals in that direction are enhanced and the audio signals in other directions are suppressed, which enhances the audio signal corresponding to the actual sound direction while reducing the audio signals from other directions.
  • The audio signal in that direction can thus be clearly extracted, and the interference of audio signals (including noise) from other directions is reduced.
  • Each beam signal is enhanced in its set target direction and the audio signals from non-target directions are suppressed, so each beam signal reflects a sound-enhanced version of the audio from a different direction; performing speech recognition on the beam signal of each direction can therefore improve recognition accuracy for the sound-enhanced signals containing human voice.
  • In this way, the accuracy of speech recognition of the audio signal in the corresponding direction is improved.
  • According to the speech recognition results of the beam signals in each direction, speech recognition results of audio signals from multiple directions are obtained; that is, the speech recognition results of each channel of enhanced sound are combined to obtain the speech recognition result of the collected audio signal.
  • The above speech recognition method performs beamforming on the audio signals collected by the microphone array in a plurality of set target directions to obtain corresponding multi-channel beam signals. After sound enhancement in the different target directions, the enhanced beam signal in each target direction can be clearly extracted; that is, the method does not need to consider the direction of the sound source. Because beamforming is performed in multiple different target directions, at least one target direction is close to the actual sound direction and the beam signal enhanced in that direction is clear, so speech recognition based on each beam signal can improve the accuracy of speech recognition.
  • Speech recognition is performed on each beam signal separately to obtain the speech recognition result of each beam signal. This includes: inputting each beam signal into a corresponding speech recognition model, and performing, by each speech recognition model, speech recognition on its corresponding beam signal to obtain the speech recognition result of that beam signal.
  • the speech recognition model is pre-trained using a neural network model.
  • The feature vectors corresponding to each beam signal, such as energy and sub-band features, are calculated layer by layer through pre-trained neural network parameters for speech recognition.
  • a speech recognition model corresponding to the number of beamformers is set, that is, one beamformer corresponds to one speech recognition model.
  • Each beam signal is input to a corresponding speech recognition model, and the speech recognition models perform speech recognition on their corresponding beam signals in parallel to obtain the speech recognition result of each beam signal.
  • A beamformer and a speech recognition model are paired and run on a CPU (Central Processing Unit) or DSP (Digital Signal Processor); that is, multiple pairs of beamformers and speech recognition models are run on multiple CPUs, and the speech recognition results of the models are then integrated to obtain the final speech recognition result.
  • This parallel operation can greatly speed up software execution.
  • processing is performed by different hardware calculation units to share the calculation amount, improve system stability, and improve speech recognition response speed.
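The "one beamformer paired with one recognizer, run in parallel, then integrate" arrangement can be sketched in software with a thread pool. This is only an illustration of the structure: the patent talks about multiple CPUs/DSPs, and the stub beamformer and detector below are invented:

```python
from concurrent.futures import ThreadPoolExecutor

def make_pair(direction):
    # A (beamformer, recognizer) pair dedicated to one target direction.
    def run(array_signal):
        beam = sum(array_signal) / len(array_signal)   # stub beamformer
        detected = beam > 0.5                          # stub keyword detector
        return {"direction": direction, "detected": detected}
    return run

pairs = [make_pair(d) for d in (0, 90, 180, 270)]
array_signal = [0.7, 0.9]  # stand-in for one frame of the mic-array signal

# Run all four pairs concurrently, one worker per pair.
with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
    results = list(pool.map(lambda p: p(array_signal), pairs))

final = any(r["detected"] for r in results)  # integrate per-pair results
```

On dedicated hardware each pair would instead be pinned to its own CPU or DSP core, which is what spreads the computation load and improves response time.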
  • The speech recognition method of the present application can be applied to keyword detection (spoken keyword spotting, also called spoken term detection).
  • Keyword detection is a sub-field in the field of speech recognition, and its purpose is to detect all occurrences of a specified word in an audio signal.
  • the keyword detection method can be applied to the field of wake word detection.
  • the wake-up word refers to a set voice command. When the wake-up word is detected, the voice recognition device in the sleep or lock screen state enters a waiting command state.
  • the speech recognition result includes a keyword detection result.
  • determining the speech recognition result of the collected audio signal includes: determining the keyword detection result of the collected audio signal according to the keyword detection result of each beam signal.
  • Each speech recognition model receives a beam signal output by a corresponding beamformer, detects whether a keyword is included therein, and outputs a detection result. That is, each speech recognition model is used to detect whether a keyword is included in an audio signal from each direction based on the received beam signals in each direction.
  • Taking a keyword containing 4 words as an example, as shown in FIG. 5, the feature vector of the beam signal (such as energy and sub-band features) is calculated layer by layer through the pre-trained network parameters, and the keyword detection result is finally obtained at the output layer.
  • the detection result may be a binary symbol. For example, output 0 indicates that no keyword is detected, and output 1 indicates that a keyword is detected.
  • Determining the keyword detection result of the collected audio signal includes: when the keyword detection result of any beam signal is that a keyword is detected, determining that the keyword detection result of the collected audio signal is that a keyword is detected; that is, when at least one of the multiple speech recognition models detects the keyword, it is determined that the keyword is detected.
  • The keyword detection result may also include a keyword detection probability. Determining the keyword detection result of the collected audio signal according to the keyword detection result of each beam signal then includes: when the keyword detection probability of at least one beam signal is greater than a preset value, determining that the keyword detection result of the collected audio signal is that a keyword is detected.
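Both decision rules just described (binary OR over the beams, and an at-least-one-above-threshold test on per-beam probabilities) fit in a few lines. The 0.8 threshold and the example inputs are assumptions for illustration:

```python
def keyword_detected_binary(beam_results):
    # Rule 1: the keyword counts as detected if ANY beam reported it
    # (per-beam results are 0 = not detected, 1 = detected).
    return any(r == 1 for r in beam_results)

def keyword_detected_prob(beam_probs, threshold=0.8):
    # Rule 2: detected if at least one beam's keyword probability
    # exceeds a preset value (0.8 here is illustrative).
    return any(p > threshold for p in beam_probs)

a = keyword_detected_binary([0, 0, 1, 0])
b = keyword_detected_prob([0.10, 0.35, 0.92, 0.20])
c = keyword_detected_prob([0.10, 0.20, 0.30, 0.40])
```

The OR-style fusion is what lets the system ignore beams pointed away from the talker: only the beam(s) near the true direction need to fire.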
  • The output layer of the neural network has 5 nodes, which respectively represent the probabilities of the four keyword units "you", "good", "small", and "listen", plus the non-keyword probability. If the wake-up word appears within a time window Dw, the output nodes of the neural network produce signals similar to those shown in FIG. 6, and the probabilities of the four keyword units "you", "good", "small", and "listen" can be observed to rise in turn. By accumulating the probabilities of the four keyword units within this time window, it can be judged whether the wake-up word appears.
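The windowed accumulation over the four keyword units can be sketched as follows. The window length, the per-frame probabilities, the peak-product scoring, and the 0.5 decision threshold are all invented for illustration; the patent only states that probabilities are accumulated over the window Dw:

```python
from collections import deque

def wake_word_score(frame_probs, window=6):
    # frame_probs[t] = (p_you, p_good, p_small, p_listen) for frame t.
    # Track each unit's peak probability inside a sliding window Dw and
    # combine the peaks; the wake word is likely present when every unit
    # peaked somewhere inside the window (as in FIG. 6).
    buf = deque(maxlen=window)
    best = 0.0
    for probs in frame_probs:
        buf.append(probs)
        peaks = [max(f[k] for f in buf) for k in range(4)]  # per-unit peak
        score = 1.0
        for p in peaks:
            score *= p
        best = max(best, score)
    return best

# The four units fire one after another, as in FIG. 6.
frames = [(0.9, 0.1, 0.1, 0.1), (0.1, 0.9, 0.1, 0.1),
          (0.1, 0.1, 0.9, 0.1), (0.1, 0.1, 0.1, 0.9)]
score = wake_word_score(frames)
detected = score > 0.5
```

Because the peaks are taken within the window, the rule rewards the units firing in close succession and stays low when only some of them appear.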
  • Determining the keyword detection result of the collected audio signal according to the keyword detection result of each beam signal includes: inputting the keyword detection probability of each beam signal into a pre-trained classifier, and determining, according to the output of the classifier, whether the collected audio signal includes the keyword.
  • Each speech recognition model outputs the probability of occurrence of the wake-up word in its direction, and a final detection decision is made by a classifier, which includes, but is not limited to, various classification algorithms such as neural networks, SVM (Support Vector Machine), and decision trees.
  • the above-mentioned classifier is also referred to as a post-processing logic module in this embodiment.
  • Determining the speech recognition result of the collected audio signal according to the speech recognition result of each beam signal includes: obtaining a linguistic score and/or an acoustic score of the speech recognition result of each beam signal, and determining the speech recognition result with the highest score as the speech recognition result of the collected audio signal.
  • the speech recognition method can be applied to the field of continuous or discontinuous speech recognition.
  • the outputs of multiple beamformers are simultaneously sent to multiple speech recognition models.
  • The output of the speech recognition model with the best recognition effect is used as the final speech recognition result.
  • the final speech recognition result may be a speech recognition result with a maximum acoustic score or a linguistic score, or a combination of the two.
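Selecting the final result by acoustic score, linguistic score, or a combination of the two reduces to a max over the per-beam hypotheses. The equal 0.5/0.5 weights and the example hypotheses below are assumptions, not values from the patent:

```python
def pick_best(hypotheses, w_acoustic=0.5, w_linguistic=0.5):
    # Each hypothesis: (text, acoustic_score, linguistic_score).
    # The combined score is a weighted sum; set one weight to 0 to rank
    # by the other score alone, as the text above allows.
    return max(hypotheses,
               key=lambda h: w_acoustic * h[1] + w_linguistic * h[2])

hyps = [("turn on the light", 0.62, 0.70),   # one hypothesis per beam
        ("turn on the lied", 0.64, 0.20),
        ("burn on the light", 0.40, 0.55)]
best_text = pick_best(hyps)[0]
```

Here the linguistic score rescues the correct transcript even though another beam's hypothesis had a slightly higher acoustic score, which is the point of combining the two.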
  • the speech recognition method further includes: performing suppression processing on an echo caused by an audio signal output by the speech recognition device.
  • An embodiment of the present application is also provided with an echo cancellation module, which can remove echoes captured by the microphone due to the speech recognition device's own audio playback. As shown in FIG. 7, the echo cancellation module can be placed before or after the beamformer. As an example, when the number of sound channels output by the multi-directional beamformer is smaller than the number of microphones, placing the echo cancellation module after the multi-directional beamformer can effectively reduce the amount of computation.
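The patent does not specify the echo-cancellation algorithm; a common choice for this module is an NLMS (normalized least-mean-squares) adaptive filter that estimates the echo path from the device's own playback reference and subtracts the estimated echo. The sketch below is written under that assumption, with an invented tap count, step size, and signals:

```python
def nlms_echo_cancel(mic, ref, taps=4, mu=0.5, eps=1e-8):
    # NLMS adaptive filter: w models the echo path from the playback
    # reference `ref` to the microphone; the residual e = mic - w*ref
    # is the echo-suppressed output fed to the beamformer/recognizer.
    w = [0.0] * taps
    out = []
    for n in range(len(mic)):
        x = [ref[n - k] if n - k >= 0 else 0.0 for k in range(taps)]
        y = sum(wi * xi for wi, xi in zip(w, x))       # estimated echo
        e = mic[n] - y                                 # residual
        norm = eps + sum(xi * xi for xi in x)          # input power
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        out.append(e)
    return out

# Synthetic echo: the reference scaled by 0.5 with a one-sample delay,
# and no near-end speech, so the residual should converge toward zero.
ref = [1.0, -1.0] * 50
mic = [0.0] + [0.5 * r for r in ref[:-1]]
residual = nlms_echo_cancel(mic, ref)
```

The placement trade-off in the text then follows naturally: one canceller per output channel is cheaper after the beamformer whenever the beamformer outputs fewer channels than there are microphones.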
  • The multiple output signals of the echo cancellation module or the beamformer may further be passed through a channel selection module to reduce the number of output channels, thereby reducing the computation and memory consumption of the multiple subsequent speech recognition modules.
  • the multiple beam signals output by the multi-directional beamformer will be sent to multiple speech recognition models for wake word detection.
  • the multiple wake-up word detection results are output to a post-processing logic module to make a final decision to determine whether wake-up words appear in the current acoustic scene.
  • an electronic device including: a microphone array for collecting audio signals, the microphone array including at least two layers of a ring structure;
  • a processor connected to the microphone array and configured to process the audio signal
  • a memory storing a computer program
  • a housing encapsulating the microphone array and the processor
  • When the computer program is executed by the processor, the processor is caused to execute the speech recognition method according to the foregoing embodiments.
  • the microphones in the ring array can be placed on a standard circle or on an oval circle; the microphones can be evenly distributed on the circle or can be unevenly placed on the circle.
  • The ring-structure microphone array can collect audio signals over 360 degrees, improves sound source direction detection, and is suitable for far-field environments.
  • At least three microphones are provided on each ring structure; that is, at least three microphones are mounted on each ring to form a multilayer ring array.
  • The more microphones on the ring array, the higher the theoretical accuracy of calculating the sound source direction, and the better the enhancement quality of the sound in the target direction. Considering that more microphones bring higher cost and greater computational complexity, 4 to 8 microphones are provided on each ring structure.
  • the microphones on each ring structure are arranged uniformly.
  • The ring structures are concentric circles, and the microphones of two adjacent ring structures are disposed in the same directions; that is, the microphones on the ring structures are set at the same angles.
  • two ring structures are taken as an example, and three microphones are arranged on each ring structure.
  • the inner and outer microphones are set at 0 degrees, 120 degrees, and 240 degrees, respectively.
  • the multi-layer ring structure microphone array increases the number of microphones, so that the array can obtain better directivity.
  • the microphones on any two ring structures have an included angle. That is, the microphones on each ring structure are staggered. As shown in FIG. 9, two ring structures are taken as an example, and three microphones are provided on each ring structure.
  • the inner ring structure has microphones at 0 degrees, 120 degrees, and 240 degrees, and the outer ring structure has microphones at 60 degrees, 180 degrees, and 300 degrees, respectively.
  • the relative positions of the microphones are more diverse, for example with different included angles between the outer and inner microphones, which gives better detection and enhancement of sound sources in certain directions; the denser distribution of microphones also increases spatial sampling, giving better detection and enhancement of sound signals at certain frequencies.
  • a microphone can be placed at the center of the ring array. Placing a microphone at the center increases the number of microphones, which can enhance the directivity of the array. For example, the center microphone can be combined with any microphone on the circumference to form a linear array of two microphones, which helps detect the sound source direction. The center microphone can also be combined with multiple microphones on the circumference to form microphone sub-arrays of different shapes, which helps detect signals of different directions and frequencies.
  • the speech recognition method of the present application can be applied to keyword detection, such as wake word detection, and to continuous or non-continuous arbitrary speech recognition.
  • below, the speech recognition method is described by taking its application to wake word detection as an example. As shown in FIG. 10, the method includes the following steps:
  • S1002 Receive an audio signal collected by a microphone array.
  • the microphone array is arranged in any manner.
  • the microphone array may be linearly arranged.
  • the microphone array may be a ring microphone array.
  • ring microphone array arrangements are shown in FIG. 2, FIG. 8, and FIG. 9.
  • Each microphone collects analog signals of ambient sound, and converts the analog signals into digital audio signals through audio collection equipment such as analog-to-digital converters, gain controllers, and codecs.
  • S1004 Perform beamforming processing on the collected audio signals in multiple different target directions to obtain corresponding multi-channel beam signals.
  • Each beam signal is input to a speech recognition model, and the speech recognition model performs speech recognition on the corresponding beam signal in parallel to obtain the wake word detection result of each beam signal.
  • the efficiency of wake word detection can be improved.
  • Each speech recognition model receives a beam signal output by a corresponding beamformer, detects whether it contains a wake word signal, and outputs a detection result.
  • taking a wake-up word of four characters as an example, as shown in FIG. 5, the feature vector of the beam signal (such as energy and subband features) is computed layer by layer through the pre-trained network parameters, and the output layer finally gives the probability of the wake word, or of the keywords in the wake word.
  • the output layer of the neural network has 5 nodes, which respectively represent the probabilities of the four keywords "You", "Good", "Small", and "Listen", and of the non-keyword class.
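The five-node output layer described above can be illustrated with a small sketch. This is not the patent's trained network, just a hedged toy example: a softmax over five hypothetical output activations (the `logits` values are made up) turns the layer's raw outputs into the per-class probabilities the text describes.

```python
import math

def softmax(logits):
    """Convert raw output-layer activations into class probabilities."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical activations for one audio frame; nodes 0-3 are the keyword
# characters "You", "Good", "Small", "Listen"; node 4 is non-keyword.
logits = [0.2, 0.1, 2.5, 0.3, 0.4]
probs = softmax(logits)
best = max(range(5), key=lambda i: probs[i])  # most likely class this frame
```

Here the third node ("Small") dominates, so this frame would contribute to the detection of that keyword character.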
  • S1008 Obtain a wake-up word detection result of the collected audio signal according to the wake-up word detection result of each beam signal.
  • the wake word detection result can be a binary symbol (for example, output 0 indicates that no wake word is detected, and output 1 indicates that the wake word is detected) or an output probability (a larger probability value indicates a higher probability that the wake word was detected).
  • if the output of a speech recognition model is the probability that the wake word appears, the wake word is considered detected when the output probability of at least one speech recognition model is greater than a preset value.
  • alternatively, each speech recognition model outputs the probability that the wake word appears in its direction, and a classifier makes the final detection decision. That is, the wake word detection probability of each beam signal is input into the classifier, and whether the collected audio signal includes the wake word is determined according to the output of the classifier.
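The two decision strategies described here (a preset threshold on any beam, or a trained classifier over all beams) can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: the classifier is stood in for by a simple logistic model, and the weights, bias, and threshold values are hypothetical.

```python
import math

def decide_by_threshold(beam_probs, threshold=0.5):
    """Rule 1: wake word detected if the probability of at least one
    beam signal exceeds the preset value."""
    return any(p > threshold for p in beam_probs)

def decide_by_classifier(beam_probs, weights, bias, threshold=0.5):
    """Rule 2: feed the per-beam detection probabilities into a
    classifier; here a logistic model stands in for the post-processing
    logic module (which could equally be an SVM or decision tree)."""
    z = sum(w * p for w, p in zip(weights, beam_probs)) + bias
    return 1.0 / (1.0 + math.exp(-z)) > threshold
```

The threshold rule is cheap and needs no training; the classifier rule can learn to combine evidence across directions, for example requiring agreement between adjacent beams before firing.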
  • a microphone array is used for audio signal collection, a microphone array signal is filtered by a multi-directional beamformer to form multiple directional enhancement signals, and multiple speech recognition models are used to monitor wake words in the directional enhancement signals.
  • the final discrimination results are obtained by synthesizing the detection results of wake words output from multiple speech recognition models.
  • This method does not need to consider the direction of the sound source. By performing beamforming processing in different target directions, at least one target direction is close to the actual direction in which the sound is produced, so at least one beam signal enhanced in its target direction is clear; performing wake word detection on each beam signal can therefore improve the accuracy of wake word detection in that direction.
  • a voice recognition device as shown in FIG. 11, includes:
  • a beamformer 1102 configured to perform beamforming processing on the audio signals in multiple different target directions to obtain corresponding multiple beam signals
  • the speech recognition module 1103 is configured to perform speech recognition on each beam signal and obtain a speech recognition result of each beam signal.
  • the processing module 1104 is configured to determine a speech recognition result of the audio signal according to a speech recognition result of each beam signal.
  • the above-mentioned speech recognition apparatus performs beamforming processing on the audio signals collected by the microphone array in multiple different target directions to obtain corresponding multiple beam signals, implementing sound enhancement in each target direction separately, so that the enhanced beam signal of each target direction can be extracted clearly. That is, the apparatus does not need to consider the direction of the sound source: by performing beamforming processing in different target directions, at least one target direction is close to the actual direction in which the sound is produced, so at least one beam signal enhanced in its target direction is clear, and performing speech recognition on each beam signal can therefore improve the accuracy of speech recognition.
  • the processing module is configured to determine a keyword detection result of the audio signal according to a keyword detection result of each beam signal.
  • a processing module is configured to, when a keyword detection result of any beam signal is a keyword detected, determine a keyword detection result of the audio signal as a keyword detected.
  • the keyword detection result includes a keyword detection probability; the processing module is configured to, when the keyword detection probability of at least one of the beam signals is greater than a preset value, determine that the keyword detection result of the audio signal is that a keyword is detected.
  • a processing module is configured to input a keyword detection probability of each of the beam signals into a classifier, and determine whether the audio signal includes a keyword according to an output of the classifier.
  • a processing module is configured to compute a linguistic score and/or an acoustic score of the speech recognition result of each beam signal, and determine the speech recognition result with the highest score as the speech recognition result of the audio signal.
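The score-based selection can be sketched in a few lines. This is an illustrative stand-in, not the patent's scoring code: the hypothesis texts and the (log-domain) score values below are invented for the example.

```python
def select_best_result(results):
    """Pick the recognition hypothesis whose combined linguistic +
    acoustic score is highest; either score may be absent."""
    def combined(r):
        return r.get("linguistic_score", 0.0) + r.get("acoustic_score", 0.0)
    return max(results, key=combined)

# Hypothetical per-beam recognition results (log-domain scores, made up).
hypotheses = [
    {"text": "turn on the light", "acoustic_score": -120.0, "linguistic_score": -8.0},
    {"text": "turn on the lights", "acoustic_score": -118.0, "linguistic_score": -7.5},
    {"text": "burn on the light", "acoustic_score": -130.0, "linguistic_score": -15.0},
]
best_hypothesis = select_best_result(hypotheses)
```

The beam whose enhanced signal is cleanest tends to yield the hypothesis with the best combined score, which is why the highest-scoring result is taken as the result for the whole audio signal.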
  • a voice recognition module is configured to input each beam signal into a corresponding voice recognition model, and each voice recognition model performs voice recognition on the corresponding beam signal in parallel to obtain the voice recognition result of each beam signal.
  • a beamformer corresponds to a speech recognition model.
  • the voice recognition module is configured to input each beam signal into a corresponding voice recognition model, and each voice recognition model performs voice recognition on the corresponding beam signal in parallel to obtain a voice recognition result of each beam signal.
  • the voice recognition apparatus further includes an echo cancellation module, configured to perform echo suppression processing on an audio signal output by the voice recognition device.
  • the voice recognition device further includes a channel selection module; the multi-channel output signals of the echo cancellation module or of the beamformer may pass through the channel selection module to further reduce the number of output channels, thereby reducing the computation and memory consumption of the subsequent multi-channel speech recognition modules.
  • FIG. 12 shows an internal structure diagram of a computer device in one embodiment.
  • the computer device may be a speech recognition device.
  • the computer device includes a processor, a memory, a network interface, an input device, a display screen, a microphone array, and an audio output device connected through a system bus.
  • the microphone array collects audio signals.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and may also store a computer program. When the computer program is executed by the processor, the processor can implement a speech recognition method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor may be caused to perform a speech recognition method.
  • the display screen of a computer device can be a liquid crystal display or an electronic ink display screen.
  • the input device of the computer device can be a touch layer covering the display screen, a button, trackball, or touchpad provided on the computer device casing, or an external keyboard, trackpad, or mouse.
  • the audio output device includes speakers for playing sound.
  • FIG. 12 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer parts than shown in the figure, combine certain parts, or have a different arrangement of parts.
  • the speech recognition apparatus provided in this application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 12.
  • the memory of the computer equipment may store various program modules constituting the voice recognition device, such as an audio signal receiving module, a beamformer, and a voice recognition module shown in FIG. 11.
  • the computer program constituted by each program module causes the processor to execute the steps in the speech recognition method of each embodiment of the present application described in this specification.
  • the computer device shown in FIG. 12 may execute the steps of receiving an audio signal collected by a microphone array through an audio signal receiving module in the voice recognition apparatus shown in FIG. 11.
  • the computer device may perform a step of performing beam forming processing on the audio signal in a plurality of different target directions by using a beamformer to obtain corresponding multiple beam signals.
  • the computer equipment may perform the steps of performing voice recognition according to the beam signals of each channel through a voice recognition module.
  • a computer device includes a memory and a processor.
  • the memory stores a computer program.
  • when the computer program is executed by the processor, the processor is caused to perform the following steps: receiving an audio signal collected by a microphone array; performing beamforming processing on the audio signal in multiple different target directions to obtain corresponding multiple beam signals; performing speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and determining a speech recognition result of the audio signal according to the speech recognition result of each beam signal.
  • the speech recognition result includes a keyword detection result; and determining the speech recognition result of the audio signal based on the speech recognition result of each beam signal includes: according to the keyword of each beam signal The detection result determines a keyword detection result of the audio signal.
  • the determining a keyword detection result of the audio signal according to a keyword detection result of each beam signal includes: when a keyword detection result of any one beam signal is that a keyword is detected, It is determined that a keyword detection result of the audio signal is a keyword detected.
  • the keyword detection result includes a keyword detection probability; and determining the keyword detection result of the audio signal according to the keyword detection result of each beam signal includes: When the keyword detection probability of the beam signal is greater than a preset value, it is determined that the keyword detection result of the audio signal is that a keyword is detected.
  • the determining a keyword detection result of the audio signal according to a keyword detection result of each beam signal includes: inputting the keyword detection probability of each beam signal into a classifier, and determining whether the audio signal includes keywords according to the output of the classifier.
  • determining the speech recognition result of the audio signal according to the speech recognition result of each beam signal includes: obtaining a linguistic score and / or an acoustic score of the speech recognition result of each beam signal; The speech recognition result with the highest score is determined as the speech recognition result of the audio signal.
  • performing speech recognition on each beam signal separately to obtain the speech recognition result of each beam signal includes: inputting each beam signal into a corresponding speech recognition model, and performing, by the speech recognition models in parallel, speech recognition on the corresponding beam signals to obtain the speech recognition result of each beam signal.
  • the speech recognition method further includes: performing suppression processing on an echo of an audio signal output by the speech recognition device.
  • Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), dual data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Otolaryngology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)

Abstract

This application relates to a speech recognition method and apparatus, a computer device, and an electronic device. The method includes: receiving an audio signal collected by a microphone array; performing beamforming processing on the audio signal in multiple different target directions to obtain corresponding multiple beam signals; performing speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and determining a speech recognition result of the audio signal according to the speech recognition results of the beam signals. The method does not need to consider the direction of the sound source: by performing beamforming processing in different target directions, at least one target direction is close to the actual direction in which the sound is produced, so at least one beam signal enhanced in its target direction is clear, and performing speech recognition on each beam signal can therefore improve speech recognition accuracy.

Description

Speech recognition method and apparatus, computer device and electronic device
This application claims priority to Chinese Patent Application No. 201810689667.5, entitled "Speech signal recognition method and apparatus, computer device and electronic device" and filed on June 28, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of speech interaction technologies, and in particular to a speech recognition method and apparatus, a computer device, and an electronic device.
Background
Intelligent speech interaction is a technology that realizes human-computer interaction through voice commands. Embedding speech interaction technology into electronic devices makes them artificially intelligent, and such devices are increasingly popular with users. For example, Amazon's Echo smart speaker has achieved great success in the market.
For an electronic device with embedded speech interaction technology, accurately recognizing the user's voice commands is the basis of human-computer interaction. However, the environment in which the user uses the device is uncertain. When the user is in a scene with heavy environmental noise, how to reduce the influence of that noise on speech recognition and improve the recognition accuracy of the electronic device is an urgent problem.
A typical solution in the related art is as follows: audio signals are first collected by all microphones in a microphone array, a sound source angle is then determined from the collected signals, and the audio signal is collected directionally according to that angle to reduce the interference of uncorrelated noise. This approach is affected by the accuracy of the sound source angle: when the angle is detected incorrectly, speech recognition accuracy drops.
Summary
On this basis, embodiments of this application provide a speech recognition method and apparatus, a computer device, and an electronic device, which can solve the problem of low speech recognition accuracy in the related art.
A speech recognition method includes:
receiving an audio signal collected by a microphone array;
performing beamforming processing on the audio signal in multiple different target directions to obtain corresponding multiple beam signals;
performing speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and
determining a speech recognition result of the audio signal according to the speech recognition results of the beam signals.
A speech recognition apparatus includes:
an audio signal receiving module configured to receive an audio signal collected by a microphone array;
a beamformer configured to perform beamforming processing on the audio signal in multiple different target directions to obtain corresponding multiple beam signals;
a speech recognition module configured to perform speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and
a processing module configured to determine a speech recognition result of the audio signal according to the speech recognition results of the beam signals.
A computer device includes a microphone array, a memory, and a processor. The memory stores a computer program, and when the computer program is executed by the processor, the processor is caused to perform the steps of the foregoing method.
An electronic device includes:
a microphone array for collecting audio signals, the microphone array including at least two layers of a ring structure;
a processor connected to the microphone array and configured to process the audio signal;
a memory storing a computer program; and
a housing encapsulating the microphone array and the processor.
When the computer program is executed by the processor, the processor is caused to perform the foregoing speech recognition method.
The foregoing speech recognition method and apparatus, computer device, and electronic device perform beamforming processing on the audio signal collected by the microphone array in multiple different target directions to obtain corresponding multiple beam signals, implementing sound enhancement in each target direction separately, so that the enhanced beam signal of each target direction can be extracted clearly. That is, the method does not need to consider the direction of the sound source: by performing beamforming processing in different target directions, at least one target direction is close to the actual direction in which the sound is produced, so at least one beam signal enhanced in its target direction is clear, and performing speech recognition on each beam signal can therefore improve speech recognition accuracy.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a speech recognition method in one embodiment;
FIG. 2 is a schematic diagram of a microphone array in one embodiment;
FIG. 3 is a schematic diagram of beam signals obtained by beamforming processing in four target directions in one embodiment;
FIG. 4 is a schematic diagram of the interaction between beamformers and speech recognition models in one embodiment;
FIG. 5 is a schematic structural diagram of a speech recognition model in one embodiment;
FIG. 6 is a schematic diagram of the signals at the neural network nodes of a speech recognition model when a wake word is detected in one embodiment;
FIG. 7 is an architecture diagram of speech recognition in one embodiment;
FIG. 8 is a schematic diagram of a microphone array in one embodiment;
FIG. 9 is a schematic diagram of a microphone array in another embodiment;
FIG. 10 is a schematic flowchart of the steps of a speech recognition method in one embodiment;
FIG. 11 is a structural block diagram of a speech recognition apparatus in one embodiment;
FIG. 12 is a structural block diagram of a computer device in one embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments described here are only used to explain this application and are not intended to limit it.
In one embodiment, a speech recognition method is provided. This embodiment is mainly described by taking the application of the method to a speech recognition device as an example. The speech recognition device may be an electronic device with embedded speech interaction technology, such as a smart terminal, a smart home appliance, or a robot capable of human-computer interaction. As shown in FIG. 1, the speech recognition method includes:
S102: Receive an audio signal collected by a microphone array.
A microphone array is an arrangement of a certain number of microphones. Each microphone collects an analog signal of the ambient sound, which is converted into a digital audio signal by audio collection equipment such as an analog-to-digital converter, a gain controller, and a codec.
Microphone arrays with different arrangements collect audio signals with different effects.
For example, the microphone array may be a one-dimensional microphone array whose element centers lie on the same straight line. Depending on whether the spacing between adjacent elements is equal, it can be further divided into a uniform linear array (ULA) and a nested linear array. The uniform linear array is the simplest array topology, with equal spacing between elements and identical phase and sensitivity. A nested linear array can be regarded as the superposition of several uniform linear arrays and is a special class of non-uniform array. Such a linear microphone array cannot distinguish sound source directions over the entire 360-degree horizontal range, but only within a 180-degree range. It is suitable for 180-degree application environments, for example when the speech recognition device is placed against a wall, or when the sound sources lie within a 180-degree range.
For another example, the microphone array may be a two-dimensional microphone array, that is, a planar microphone array, whose element centers are distributed on a plane. According to the geometry of the array, it can be divided into an equilateral triangle array, a T-shaped array, a uniform circular array, a uniform square array, a coaxial circular array, a circular or rectangular area array, and so on. A planar microphone array can obtain the horizontal azimuth and vertical azimuth of a signal. It is suitable for 360-degree application environments, for example when the speech recognition device needs to receive sound from different directions.
For yet another example, the microphone array may be a three-dimensional microphone array, that is, a stereo microphone array, whose element centers are distributed in three-dimensional space. According to the stereo shape of the array, it can be divided into a tetrahedral array, a cube array, a cuboid array, a spherical array, and so on. A stereo microphone array can obtain three kinds of information: the horizontal azimuth and vertical azimuth of the signal, and the distance between the sound source and the reference point of the array.
The following description takes a ring-shaped microphone array as an example. A ring microphone array of one embodiment is shown in FIG. 2. In this embodiment, six physical microphones are used, placed in turn at azimuths of 0, 60, 120, 180, 240, and 300 degrees on a circle of radius R, forming a ring microphone array. Each microphone collects an analog signal of the ambient sound, which is converted into a digital sound signal by audio collection equipment such as an analog-to-digital converter, a gain controller, and a codec. The ring microphone array can collect sound signals over 360 degrees.
S104: Perform beamforming processing on the collected audio signal in multiple different target directions to obtain corresponding multiple beam signals.
Beamforming performs time-delay or phase compensation and amplitude weighting on the audio signals output by the microphones of the array to form beams pointing in specific directions. For example, beamforming is performed on the audio signal collected by the microphone array in the 0-, 90-, 180-, or 270-degree direction to form beams pointing in those directions.
As an example, beamformers may be used to perform beamforming processing on the audio signal in the set directions. A beamformer is an algorithm designed for a specific microphone array; it can enhance audio signals in one or more specific target directions and suppress audio signals in non-target directions. The beamformer may be of any type whose direction can be set, including but not limited to a superdirective beamformer, or a beamformer based on the MVDR (Minimum Variance Distortionless Response) or MUSIC (Multiple Signal Classification) algorithm.
In this embodiment, multiple beamformers are provided, and each performs beamforming processing in a different direction. As an example, the digital audio signals of the microphones form a microphone array signal that is sent to multiple beamformers. Each beamformer enhances the audio signal in a different set direction and suppresses audio signals in other directions; the farther an audio signal deviates from the set direction, the more it is suppressed, so that the audio signal near the set direction can be extracted.
In one embodiment, four beamformers are provided, which perform beamforming processing on the audio signal at 0, 90, 180, and 270 degrees respectively. A schematic diagram of the multiple beam signals obtained by beamforming in multiple directions is shown in FIG. 3. It can be understood that the audio signal input to each beamformer is not limited by the arrangement of the microphone array that collects it. Since beamforming in multiple target directions enhances the audio signal of the target direction and reduces the interference of audio signals from other directions, as an example the microphone array that collects the audio signal has microphones in at least two different directions.
Taking the microphone array shown in FIG. 2 as an example, as shown in FIG. 3, the digital audio signals of the microphones form a microphone array signal; the sound in the 0-degree direction is kept unchanged (0 dB gain), the sound in the 60- and 330-degree directions is suppressed by more than 9 dB (about -9 dB gain), and the sound in the 90- and 270-degree directions is suppressed by more than 20 dB. The closer a line is to the center of the circle, the more the sound in that direction is suppressed, thereby enhancing the audio signal in the 0-degree direction and reducing interference from other directions.
Still referring to FIG. 3, the digital audio signals of the microphones form a microphone array signal; the sound in the 90-degree direction is kept unchanged (0 dB gain), the sound in the 30- and 150-degree directions is suppressed by more than 9 dB (about -9 dB gain), and the sound in the 0- and 180-degree directions is suppressed by more than 20 dB, thereby enhancing the audio signal in the 90-degree direction and reducing interference from other directions.
Still referring to FIG. 3, the digital audio signals of the microphones form a microphone array signal; the sound in the 180-degree direction is kept unchanged (0 dB gain), the sound in the 120- and 240-degree directions is suppressed by more than 9 dB (about -9 dB gain), and the sound in the 90- and 270-degree directions is suppressed by more than 20 dB, thereby enhancing the audio signal in the 180-degree direction and reducing interference from other directions.
Still referring to FIG. 3, the digital audio signals of the microphones form a microphone array signal; the sound in the 270-degree direction is kept unchanged (0 dB gain), the sound in the 210- and 330-degree directions is suppressed by more than 9 dB (about -9 dB gain), and the sound in the 180- and 0-degree directions is suppressed by more than 20 dB, thereby enhancing the audio signal in the 270-degree direction and reducing interference from other directions.
It can be understood that, to enhance audio signals in other target directions, more or fewer beamformers may be provided in other embodiments to extract beam signals in other directions. By performing beamforming processing separately in multiple set target directions, each beamformer's beam signal enhances the audio signal of its target direction and reduces the interference of audio signals from other directions. Among the audio signals of the multiple target directions, at least one beam signal is close to the actual sound direction; that is, at least one beam signal can reflect the actual sound while reducing the interference of noise from other directions.
In this embodiment, the audio signal collected by the microphone array is beamformed in the multiple set target directions without identifying the sound source direction. The advantage of this is that beam signals in multiple target directions are obtained, at least one of which is necessarily close to the actual sound direction; that is, at least one beam signal can reflect the actual sound. The beamformer of that direction enhances the audio signal of that direction and suppresses audio signals of other directions, so the audio signal at the angle corresponding to the actual sound direction is enhanced, the audio signal in that direction can be clearly extracted, and the interference of audio signals (including noise) from other directions is reduced.
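The directional enhancement performed by each beamformer can be sketched with the simplest member of the family, a delay-and-sum beamformer. This is a minimal illustration under assumptions (far-field plane wave, nearest-sample delays, pure Python lists), not the superdirective or MVDR designs the embodiment may actually use:

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed

def delay_and_sum(frames, mic_angles_deg, radius, fs, target_deg):
    """Nearest-sample delay-and-sum beamformer for a ring array.

    frames:         list of per-microphone sample lists (equal length)
    mic_angles_deg: azimuth of each microphone on the ring
    radius:         ring radius in meters; fs: sample rate in Hz

    Time-aligns the channels for a plane wave arriving from
    `target_deg` and averages them, enhancing that direction.
    """
    n = len(frames[0])
    target = math.radians(target_deg)
    # Per-microphone delay (in samples) relative to the array center.
    delays = [int(round(-radius * math.cos(math.radians(a) - target)
                        / SPEED_OF_SOUND * fs))
              for a in mic_angles_deg]
    base = min(delays)
    out = [0.0] * n
    for ch, d in zip(frames, delays):
        shift = d - base
        for i in range(n):
            j = i - shift
            if 0 <= j < n:
                out[i] += ch[j]
    return [v / len(frames) for v in out]

# Three identical channels on a tiny ring: the aligned average
# reproduces the common signal.
frames = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
enhanced = delay_and_sum(frames, [0.0, 120.0, 240.0], 0.001, 16000, 0.0)
```

Signals from the steered direction add coherently while off-axis signals add with misaligned phases, which is the enhancement/suppression pattern shown in FIG. 3.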
S106: Perform speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal.
In this embodiment, speech recognition is performed on each beam signal separately. Since the audio signal is beamformed in the multiple set target directions to obtain multiple beam signals, each beam signal is obtained by enhancing the audio signal of a set target direction and suppressing the audio signals of the other directions, so each beam signal reflects the sound-enhanced signal of a different direction. Performing speech recognition on the beam signal of each direction improves the recognition accuracy for the sound-enhanced signals that contain human speech.
S108: Determine a speech recognition result of the collected audio signal according to the speech recognition results of the beam signals.
Performing speech recognition on each beam signal improves the recognition accuracy for the audio signal of the corresponding direction. According to the speech recognition results of the beam signals of the various directions, the speech recognition results of audio signals from multiple directions can be obtained; that is, the speech recognition result of the collected audio signal is obtained by combining the recognition results of the sound-enhanced signals of the various directions.
In the foregoing speech recognition method, beamforming processing is performed on the audio signal collected by the microphone array in the multiple set target directions to obtain corresponding multiple beam signals, so that after sound enhancement in the different target directions, the enhanced beam signal of each target direction can be extracted clearly. That is, the method does not need to consider the direction of the sound source: by performing beamforming in different target directions, at least one target direction is close to the actual direction in which the sound is produced, so at least one beam signal enhanced in its target direction is clear, and performing speech recognition on each beam signal can therefore improve speech recognition accuracy.
In another embodiment, performing speech recognition on each beam signal separately to obtain the speech recognition result of each beam signal includes: inputting each beam signal into a corresponding speech recognition model, and performing, by the speech recognition models in parallel, speech recognition on the corresponding beam signals to obtain the speech recognition result of each beam signal.
As an example, the speech recognition model is pre-trained using a neural network model. The feature vector of each beam signal, such as energy and subband features, is computed layer by layer using the pre-trained neural network parameters to perform speech recognition.
In another embodiment, speech recognition models corresponding in number to the beamformers are provided; that is, one beamformer corresponds to one speech recognition model, as shown in FIG. 4. As an example, each beam signal is input into its corresponding speech recognition model, and the speech recognition models perform speech recognition on the corresponding beam signals in parallel to obtain the speech recognition result of each beam signal.
In this embodiment, by providing speech recognition models corresponding in number to the beamformers and performing speech recognition on the beam signals in parallel, the efficiency of speech recognition can be improved.
As an example, one beamformer and one speech recognition model run as a pair on one CPU (Central Processing Unit) or DSP (Digital Signal Processor); that is, multiple beamformer and speech recognition model pairs run on multiple CPUs, and the recognition results of the speech recognition models are then combined to obtain the final speech recognition result. Such parallel computation can greatly speed up software execution.
In this embodiment, different hardware computing units share the computational load, which improves system stability and speech recognition response speed. As an example, N beamformers are divided into M groups, M <= N, and each group performs its computation on a designated hardware computing unit (for example, a DSP or CPU core). Similarly, N speech recognition models are divided into M groups, M <= N, and each group performs its computation on a designated hardware computing unit.
The speech recognition method of this application can be applied to keyword detection (spoken keyword spotting, or spoken term detection).
Keyword detection is a subfield of speech recognition whose goal is to detect all occurrences of specified words in an audio signal. In one embodiment, the keyword detection method can be applied to the field of wake word detection. A wake word is a preset voice command: when the wake word is detected, a speech recognition device in a sleeping or lock-screen state enters a state of awaiting instructions.
The speech recognition result includes a keyword detection result. Determining the speech recognition result of the collected audio signal according to the speech recognition results of the beam signals includes: determining the keyword detection result of the collected audio signal according to the keyword detection results of the beam signals.
Each speech recognition model receives the beam signal output by its corresponding beamformer, detects whether it contains a keyword, and outputs the detection result. That is, the speech recognition models detect, from the received beam signals of the various directions, whether the audio signals from those directions contain keywords. Taking a keyword of four characters as an example, as shown in FIG. 5, the feature vector of the beam signal (for example, energy and subband features) is used to compute the output value of each node layer by layer with the pre-trained network parameters, and the keyword detection result is finally obtained at the output layer.
In one embodiment, the detection result may be a binary symbol; for example, an output of 0 indicates that no keyword was detected, and an output of 1 indicates that a keyword was detected. Determining the keyword detection result of the collected audio signal according to the keyword detection results of the beam signals includes: when the keyword detection result of any beam signal is that a keyword is detected, determining that the keyword detection result of the collected audio signal is that a keyword is detected; that is, a keyword is considered detected when at least one of the speech recognition models detects it.
In addition, the keyword detection result may include a keyword detection probability. Determining the keyword detection result of the collected audio signal according to the keyword detection results of the beam signals includes: when the keyword detection probability of at least one beam signal is greater than a preset value, determining that the keyword detection result of the collected audio signal is that a keyword is detected.
As shown in FIG. 5, assume the keyword is "你好小听" ("Hello, Xiaoting"). The output layer of the neural network has five nodes, which represent the probabilities that the speech segment belongs to the four keyword characters "你" ("You"), "好" ("Good"), "小" ("Small"), and "听" ("Listen"), and to the non-keyword class. If the wake word appears within a time window Dw, the output nodes of the neural network will produce signals similar to those shown in FIG. 6; that is, increases in the probabilities of the four keyword characters can be observed in sequence. By accumulating the probabilities of these four characters of the wake word within the time window, it can be judged whether the keyword has appeared.
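The accumulation over the time window Dw can be sketched as follows. This is a hedged illustration, not the patent's decision logic: it tracks the peak probability of each keyword character inside a sliding window and fires only when every character of the wake phrase has peaked above a threshold, approximating the sequential rises shown in FIG. 6.

```python
from collections import deque

class WakeWordAccumulator:
    """Accumulate per-frame keyword probabilities over a sliding window
    and fire when every keyword character of the wake phrase has peaked."""

    def __init__(self, num_keywords, window, threshold):
        self.window = deque(maxlen=window)  # old frames fall out automatically
        self.num_keywords = num_keywords
        self.threshold = threshold

    def push(self, frame_probs):
        """frame_probs: probability of each keyword character at this frame.
        Returns True once all characters have peaked inside the window."""
        self.window.append(frame_probs)
        peaks = [max(f[k] for f in self.window)
                 for k in range(self.num_keywords)]
        return all(p > self.threshold for p in peaks)
```

For the four-character wake phrase, pushing frames in which the characters peak one after another yields a detection only on the frame where the last character's probability rises, matching the accumulate-then-decide behavior described above.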
In one embodiment, determining the keyword detection result of the collected audio signal according to the keyword detection results of the beam signals includes: inputting the keyword detection probabilities of the beam signals into a pre-trained classifier, and determining, according to the output of the classifier, whether the collected audio signal includes a keyword.
Each speech recognition model outputs the probability that the wake word appears in its direction, and a classifier makes the final detection decision. The classifier includes but is not limited to various classification algorithms such as a neural network, an SVM (Support Vector Machine), and a decision tree. In this embodiment, this classifier is also called the post-processing logic module.
In another embodiment, determining the speech recognition result of the collected audio signal according to the speech recognition results of the beam signals includes: obtaining a linguistic score and/or an acoustic score of the speech recognition result of each beam signal, and determining the speech recognition result with the highest score as the speech recognition result of the collected audio signal.
The speech recognition method can be applied to the field of continuous or non-continuous speech recognition, where the outputs of the multiple beamformers are fed simultaneously into multiple speech recognition models, and the final speech recognition result adopts the output of the speech recognition model with the best recognition performance. As an example, the final speech recognition result may be the one with the highest acoustic score or linguistic score, or a combination of the two.
In another embodiment, the speech recognition method further includes: suppressing the echo caused by the audio signal output by the speech recognition device.
For a speech recognition device with an audio playback function, such as a smart speaker, an echo cancellation module is further provided in the embodiments of this application (see FIG. 7) to avoid interference of the device's own playback with speech recognition. The echo cancellation module can remove the echo of the device's own playback picked up by the microphones. As shown in FIG. 7, the echo cancellation module can be placed before or after the beamformers. As an example, when the number of sound channels output by the multi-directional beamformer is smaller than the number of microphones, placing the echo cancellation module after the beamformer can effectively reduce the amount of computation.
In one embodiment, as shown in FIG. 7, the multi-channel output signals of the echo cancellation module or of the beamformers may pass through a channel selection module to further reduce the number of output channels, thereby reducing the computation and memory consumption of the subsequent speech recognition modules.
Taking wake word detection as an example, the multiple beam signals output by the multi-directional beamformer are sent to multiple speech recognition models for wake word detection. After the speech recognition models obtain multiple wake word detection results, these results are output to the post-processing logic module for a final decision to determine whether the wake word appears in the current acoustic scene.
In one embodiment, an electronic device is provided, including: a microphone array for collecting audio signals, the microphone array including at least two layers of a ring structure;
a processor connected to the microphone array and configured to process the audio signal;
a memory storing a computer program; and
a housing encapsulating the microphone array and the processor.
When the computer program is executed by the processor, the processor is caused to perform the speech recognition method of the foregoing embodiments.
When the microphone array is a ring array, the microphones may be placed on a standard circle or on an elliptical circumference, and may be distributed on the circumference evenly or unevenly. A ring-structure microphone array can collect audio signals over 360 degrees, improves sound source direction detection, and is suitable for far-field environments.
In one embodiment, at least three microphones are provided on each ring structure; that is, three or more microphones are mounted on each ring structure to form a multilayer ring array. The more microphones on the ring array, the higher the theoretical accuracy of computing the sound source direction and the better the enhancement quality of the sound in the target direction. Considering that more microphones mean higher cost and computational complexity, 4 to 8 microphones are provided on each ring structure.
In one embodiment, to reduce the complexity of sound detection, the microphones on each ring structure are arranged uniformly.
In one embodiment, the ring structures are concentric circles, and the microphones of two adjacent ring structures are placed in the same directions; that is, the microphones on the ring structures are set at the same angles. As shown in FIG. 8, taking two ring structures as an example, three microphones are arranged on each ring structure, and the inner and outer microphones are both set at 0, 120, and 240 degrees. The multilayer ring microphone array increases the number of microphones, so that the array can obtain better directivity.
In one embodiment, the microphones on any two ring structures have an included angle; that is, the microphones on the ring structures are staggered. As shown in FIG. 9, taking two ring structures as an example, three microphones are provided on each ring structure: the inner ring structure has microphones at 0, 120, and 240 degrees, and the outer ring structure has microphones at 60, 180, and 300 degrees. In a microphone array of this kind, the relative positions of the microphones are more diverse, for example with different included angles between outer and inner microphones, which gives better detection and enhancement of sound sources in certain directions; the denser distribution of microphones also increases spatial sampling, giving better detection and enhancement of sound signals at certain frequencies.
In another embodiment, a microphone may be placed at the center of the ring array. Placing a microphone at the center increases the number of microphones and can enhance the directivity of the array. For example, the center microphone can be combined with any microphone on the circumference to form a linear array of two microphones, which helps detect the sound source direction. The center microphone can also be combined with multiple microphones on the circumference to form microphone sub-arrays of different shapes, which helps detect signals of different directions and frequencies.
The speech recognition method of this application can be applied to keyword detection, for example wake word detection, and to continuous or non-continuous arbitrary speech recognition. Below, the speech recognition method is described by taking its application to wake word detection as an example. As shown in FIG. 10, the method includes the following steps:
S1002: Receive an audio signal collected by a microphone array.
The arrangement of the microphone array is not limited. For example, when the electronic device is placed against a wall, or the sound sources lie within a 180-degree range, the microphone array may be linear. For another example, when the electronic device needs to receive sound from different directions, for example in a 360-degree application environment, a ring microphone array may be used; ring microphone array arrangements are shown in FIG. 2, FIG. 8, and FIG. 9. Each microphone collects an analog signal of the ambient sound, which is converted into a digital audio signal by audio collection equipment such as an analog-to-digital converter, a gain controller, and a codec.
S1004: Perform beamforming processing on the collected audio signal in multiple different target directions to obtain corresponding multiple beam signals.
S1006: Input each beam signal into a speech recognition model, and perform, by the speech recognition models in parallel, speech recognition on the corresponding beam signals to obtain the wake word detection result of each beam signal.
In this embodiment, by providing speech recognition models corresponding in number to the beamformers and performing speech recognition on the beam signals in parallel, the efficiency of wake word detection can be improved.
The structure of a speech recognition model of one embodiment is shown in FIG. 5. Each speech recognition model receives the beam signal output by its corresponding beamformer, detects whether it contains a wake word signal, and outputs the detection result. Taking a wake word of four characters as an example, as shown in FIG. 5, the feature vector of the beam signal (for example, energy and subband features) is used to compute the output value of each node layer by layer with the pre-trained network parameters, and the probability of the wake word, or of the keyword characters in the wake word, is finally obtained at the output layer. As shown in FIG. 5, assume the wake word is "你好小听" ("Hello, Xiaoting"): the output layer of the neural network has five nodes, which represent the probabilities that the speech segment belongs to the four keyword characters "你" ("You"), "好" ("Good"), "小" ("Small"), and "听" ("Listen"), and to the non-keyword class.
S1008: Obtain the wake word detection result of the collected audio signal according to the wake word detection results of the beam signals.
The wake word detection result may be a binary symbol (for example, an output of 0 indicates that no wake word was detected, and an output of 1 indicates that a wake word was detected) or an output probability (a larger probability value indicates a higher probability that the wake word was detected). As an example, when at least one of the speech recognition models detects the wake word, it is determined that the wake word is detected. If the output of a speech recognition model is the probability that the wake word appears, the wake word is considered detected when the output probability of at least one speech recognition model is greater than a preset value. Alternatively, each speech recognition model outputs the probability that the wake word appears in its direction, and a classifier makes the final detection decision; that is, the wake word detection probabilities of the beam signals are input into the classifier, and whether the collected audio signal includes the wake word is determined according to the output of the classifier.
In the foregoing method, a microphone array is used for audio signal collection, the microphone array signal is filtered by a multi-directional beamformer to form multiple directionally enhanced signals, multiple speech recognition models monitor the wake word in the directionally enhanced signals, and the final decision is obtained by combining the wake word detection results output by the speech recognition models. The method does not need to consider the direction of the sound source: by performing beamforming in different target directions, at least one target direction is close to the actual direction in which the sound is produced, so at least one beam signal enhanced in its target direction is clear, and performing wake word detection on each beam signal can therefore improve the accuracy of wake word detection in that direction.
A speech recognition apparatus, as shown in FIG. 11, includes:
an audio signal receiving module 1101 configured to receive an audio signal collected by a microphone array;
a beamformer 1102 configured to perform beamforming processing on the audio signal in multiple different target directions to obtain corresponding multiple beam signals;
a speech recognition module 1103 configured to perform speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and
a processing module 1104 configured to determine a speech recognition result of the audio signal according to the speech recognition results of the beam signals.
The foregoing speech recognition apparatus performs beamforming processing on the audio signal collected by the microphone array in multiple different target directions to obtain corresponding multiple beam signals, implementing sound enhancement in each target direction separately, so that the enhanced beam signal of each target direction can be extracted clearly. That is, the apparatus does not need to consider the direction of the sound source: by performing beamforming in different target directions, at least one target direction is close to the actual direction in which the sound is produced, so at least one beam signal enhanced in its target direction is clear, and performing speech recognition on each beam signal can therefore improve speech recognition accuracy.
In another embodiment, the processing module is configured to determine the keyword detection result of the audio signal according to the keyword detection results of the beam signals.
In another embodiment, the processing module is configured to, when the keyword detection result of any beam signal is that a keyword is detected, determine that the keyword detection result of the audio signal is that a keyword is detected.
In another embodiment, the keyword detection result includes a keyword detection probability, and the processing module is configured to, when the keyword detection probability of at least one beam signal is greater than a preset value, determine that the keyword detection result of the audio signal is that a keyword is detected.
In another embodiment, the processing module is configured to input the keyword detection probabilities of the beam signals into a classifier and determine, according to the output of the classifier, whether the audio signal includes a keyword.
In another embodiment, the processing module is configured to compute a linguistic score and/or an acoustic score of the speech recognition result of each beam signal and determine the speech recognition result with the highest score as the speech recognition result of the audio signal.
In another embodiment, the speech recognition module is configured to input each beam signal into a corresponding speech recognition model, and the speech recognition models perform speech recognition on the corresponding beam signals in parallel to obtain the speech recognition result of each beam signal.
As shown in FIG. 4, one beamformer corresponds to one speech recognition model. The speech recognition module is configured to input each beam signal into its corresponding speech recognition model, and the speech recognition models perform speech recognition on the corresponding beam signals in parallel to obtain the speech recognition result of each beam signal.
In another embodiment, the speech recognition apparatus further includes an echo cancellation module configured to suppress the echo of the audio signal output by the speech recognition device.
In another embodiment, the speech recognition apparatus further includes a channel selection module. The multi-channel output signals of the echo cancellation module or of the beamformers may pass through the channel selection module to further reduce the number of output channels, thereby reducing the computation and memory consumption of the subsequent multi-channel speech recognition modules.
FIG. 12 shows an internal structure diagram of a computer device in one embodiment. The computer device may be a speech recognition device. As shown in FIG. 12, the computer device includes a processor, a memory, a network interface, an input apparatus, a display screen, a microphone array, and an audio output device connected through a system bus. The microphone array collects audio signals. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the speech recognition method.
The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the speech recognition method. The display screen of the computer device may be a liquid crystal display or an electronic ink display. The input apparatus of the computer device may be a touch layer covering the display screen, a button, trackball, or touchpad provided on the housing of the computer device, or an external keyboard, touchpad, or mouse. The audio output device includes a speaker for playing sound.
A person skilled in the art can understand that the structure shown in FIG. 12 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, the speech recognition apparatus provided in this application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 12. The memory of the computer device may store the program modules constituting the speech recognition apparatus, for example the audio signal receiving module, beamformer, and speech recognition module shown in FIG. 11. The computer program constituted by the program modules causes the processor to perform the steps in the speech recognition methods of the embodiments of this application described in this specification.
For example, the computer device shown in FIG. 12 may perform the step of receiving the audio signal collected by the microphone array through the audio signal receiving module of the speech recognition apparatus shown in FIG. 11. The computer device may perform, through the beamformer, the step of beamforming the audio signal in the multiple set target directions to obtain the corresponding multiple beam signals. The computer device may perform, through the speech recognition module, the step of performing speech recognition according to the beam signals.
A computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
receiving an audio signal collected by a microphone array;
performing beamforming processing on the audio signal in multiple different target directions to obtain corresponding multiple beam signals;
performing speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and
determining a speech recognition result of the audio signal according to the speech recognition results of the beam signals.
In another embodiment, the speech recognition result includes a keyword detection result, and determining the speech recognition result of the audio signal according to the speech recognition results of the beam signals includes: determining the keyword detection result of the audio signal according to the keyword detection results of the beam signals.
In another embodiment, determining the keyword detection result of the audio signal according to the keyword detection results of the beam signals includes: when the keyword detection result of any beam signal is that a keyword is detected, determining that the keyword detection result of the audio signal is that a keyword is detected.
In another embodiment, the keyword detection result includes a keyword detection probability, and determining the keyword detection result of the audio signal according to the keyword detection results of the beam signals includes: when the keyword detection probability of at least one beam signal is greater than a preset value, determining that the keyword detection result of the audio signal is that a keyword is detected.
In another embodiment, determining the keyword detection result of the audio signal according to the keyword detection results of the beam signals includes: inputting the keyword detection probabilities of the beam signals into a classifier, and determining, according to the output of the classifier, whether the audio signal includes a keyword.
In another embodiment, determining the speech recognition result of the audio signal according to the speech recognition results of the beam signals includes: obtaining a linguistic score and/or an acoustic score of the speech recognition result of each beam signal, and determining the speech recognition result with the highest score as the speech recognition result of the audio signal.
In another embodiment, performing speech recognition on each beam signal separately to obtain the speech recognition result of each beam signal includes: inputting each beam signal into a corresponding speech recognition model, and performing, by the speech recognition models in parallel, speech recognition on the corresponding beam signals to obtain the speech recognition result of each beam signal.
In another embodiment, the speech recognition method further includes: suppressing the echo of the audio signal output by the speech recognition device.
A person of ordinary skill in the art can understand that all or part of the processes of the methods of the foregoing embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the foregoing methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the foregoing embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the foregoing embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The foregoing embodiments only express several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the patent scope of this application. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (15)

  1. A speech recognition method, comprising:
    receiving an audio signal acquired by a microphone array;
    performing beamforming processing on the audio signal in a plurality of different target directions to obtain a corresponding plurality of beam signals;
    performing speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and
    determining a speech recognition result of the audio signal according to the speech recognition results of the beam signals.
  2. The method according to claim 1, wherein the speech recognition result comprises a keyword detection result; and
    the determining a speech recognition result of the audio signal according to the speech recognition results of the beam signals comprises: determining a keyword detection result of the audio signal according to the keyword detection results of the beam signals.
  3. The method according to claim 2, wherein the determining a keyword detection result of the audio signal according to the keyword detection results of the beam signals comprises:
    determining that the keyword detection result of the audio signal is that a keyword is detected, in a case that the keyword detection result of any one of the beam signals is that a keyword is detected.
  4. The method according to claim 2, wherein the keyword detection result comprises a keyword detection probability; and
    the determining a keyword detection result of the audio signal according to the keyword detection results of the beam signals comprises:
    determining that the keyword detection result of the audio signal is that a keyword is detected, in a case that the keyword detection probability of at least one of the beam signals is greater than a preset value.
  5. The method according to claim 2, wherein the keyword detection result comprises a keyword detection probability; and
    the determining a keyword detection result of the audio signal according to the keyword detection results of the beam signals comprises:
    inputting the keyword detection probabilities of the beam signals into a classifier, and determining, according to an output of the classifier, whether the audio signal comprises a keyword.
  6. The method according to claim 1, wherein the determining a speech recognition result of the audio signal according to the speech recognition results of the beam signals comprises:
    obtaining a linguistic score and/or an acoustic score of the speech recognition result of each beam signal; and
    determining the speech recognition result with the highest score as the speech recognition result of the audio signal.
  7. The method according to claim 1, wherein the performing speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal comprises:
    inputting each beam signal into a corresponding speech recognition model, and performing, by the speech recognition models in parallel, speech recognition on the corresponding beam signals to obtain the speech recognition results of the beam signals.
  8. The method according to claim 1, further comprising: performing suppression processing on an echo of an audio signal output by a speech recognition device.
  9. A speech recognition apparatus, comprising:
    an audio signal receiving module, configured to receive an audio signal acquired by a microphone array;
    a beamformer, configured to perform beamforming processing on the audio signal in a plurality of different target directions to obtain a corresponding plurality of beam signals;
    a speech recognition module, configured to perform speech recognition on each beam signal separately to obtain a speech recognition result of each beam signal; and
    a processing module, configured to determine a speech recognition result of the audio signal according to the speech recognition results of the beam signals.
  10. A computer device, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 8.
  11. An electronic device, comprising:
    a microphone array configured to acquire an audio signal, the microphone array comprising at least two layers of annular structures;
    a processor connected to the microphone array and configured to process the audio signal;
    a memory storing a computer program; and
    a housing enclosing the microphone array and the processor;
    the computer program, when executed by the processor, causing the processor to perform the speech recognition method according to any one of claims 1 to 8.
  12. The electronic device according to claim 11, wherein at least three microphones are evenly arranged on each annular structure.
  13. The electronic device according to claim 11, wherein the annular structures are concentric circles.
  14. The electronic device according to claim 13, wherein microphones on two adjacent annular structures are respectively arranged in the same directions.
  15. The electronic device according to claim 13, wherein microphones on any two annular structures form an included angle.
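The multi-beam pipeline recited in claims 1, 3, and 4 can be illustrated with a minimal sketch: beamform the microphone-array signal toward several target directions, score each beam for the keyword, and report a detection if any beam's probability exceeds a preset value. This is an illustrative reconstruction only, not the patented implementation: the delay-and-sum beamformer and the energy-based scorer stand in for the beamforming processing and the per-beam keyword detection models of the claims, and all function names are hypothetical.

```python
import numpy as np

def delay_and_sum(frames, mic_positions, direction, fs=16000, c=343.0):
    """Steer a delay-and-sum beam toward a far-field source direction.

    frames: (num_mics, num_samples) time-aligned microphone signals.
    mic_positions: (num_mics, 3) coordinates in meters; direction: unit vector.
    """
    delays = mic_positions @ direction / c      # per-microphone delay in seconds
    shifts = np.round(delays * fs).astype(int)  # integer-sample approximation
    out = np.zeros(frames.shape[1])
    for m in range(frames.shape[0]):
        out += np.roll(frames[m], -shifts[m])   # np.roll wraps; acceptable for a sketch
    return out / frames.shape[0]

def keyword_probability(beam):
    """Stand-in for a per-beam keyword detection model (the probability of claim 4).

    A real system would run an acoustic model here; this mock maps beam energy
    into (0, 1) purely so the pipeline is runnable end to end.
    """
    return float(np.tanh(np.mean(beam ** 2)))

def detect_keyword(frames, mic_positions, directions, preset_value=0.5):
    """Claims 1, 3 and 4 as a pipeline: beamform in several target directions,
    score each beam, and report a keyword if ANY beam exceeds the preset value."""
    probs = [keyword_probability(delay_and_sum(frames, mic_positions, d))
             for d in directions]
    return max(probs) > preset_value, probs
```

For example, with a four-microphone planar array and six target directions spaced 60 degrees apart, `detect_keyword` returns the OR-combined decision together with the per-beam probabilities, which could equally be fed to the classifier of claim 5 instead of thresholded.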
PCT/CN2019/085625 2018-06-28 2019-05-06 Speech recognition method and apparatus, computer device and electronic device WO2020001163A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2020570624A JP7109852B2 (ja) 2018-06-28 2019-05-06 Speech recognition method, apparatus, computer device, electronic device, and computer program
EP19824812.2A EP3816995A4 (en) 2018-06-28 2019-05-06 VOICE RECOGNITION METHOD AND DEVICE, COMPUTER DEVICE AND ELECTRONIC DEVICE
US16/921,537 US11217229B2 (en) 2018-06-28 2020-07-06 Method and apparatus for speech recognition, and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810689667.5A CN110164446B (zh) 2018-06-28 2018-06-28 Speech signal recognition method and apparatus, computer device and electronic device
CN201810689667.5 2018-06-28

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/921,537 Continuation US11217229B2 (en) 2018-06-28 2020-07-06 Method and apparatus for speech recognition, and electronic device

Publications (1)

Publication Number Publication Date
WO2020001163A1 true WO2020001163A1 (zh) 2020-01-02

Family

ID=67645021

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/085625 WO2020001163A1 (zh) 2018-06-28 2019-05-06 Speech recognition method and apparatus, computer device and electronic device

Country Status (5)

Country Link
US (1) US11217229B2 (zh)
EP (1) EP3816995A4 (zh)
JP (1) JP7109852B2 (zh)
CN (2) CN110364166B (zh)
WO (1) WO2020001163A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2022017171A (ja) * 2020-07-20 2022-01-25 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Speech recognition method, speech recognition apparatus, electronic device, computer-readable storage medium, and computer program

Families Citing this family (20)

Publication number Priority date Publication date Assignee Title
CN110503970B (zh) * 2018-11-23 2021-11-23 腾讯科技(深圳)有限公司 一种音频数据处理方法、装置及存储介质
CN110517682B (zh) * 2019-09-02 2022-08-30 腾讯科技(深圳)有限公司 语音识别方法、装置、设备及存储介质
US11521599B1 (en) * 2019-09-20 2022-12-06 Amazon Technologies, Inc. Wakeword detection using a neural network
CN110751949A (zh) * 2019-10-18 2020-02-04 北京声智科技有限公司 一种语音识别方法、装置及计算机可读存储介质
CN111276143B (zh) * 2020-01-21 2023-04-25 北京远特科技股份有限公司 声源定位方法、装置、语音识别控制方法和终端设备
CN111429905B (zh) * 2020-03-23 2024-06-07 北京声智科技有限公司 语音信号处理方法、装置、语音智能电梯、介质和设备
US11322160B2 (en) * 2020-04-24 2022-05-03 Darrell Poirier Audio collection system and method for sound capture, broadcast, analysis, and presentation
CN115605953A 2023-01-13 纽奥斯通讯有限公司(Us) Systems and methods for data augmentation for multi-microphone signal processing
CN113645542B (zh) * 2020-05-11 2023-05-02 阿里巴巴集团控股有限公司 语音信号处理方法和系统及音视频通信设备
CN111833867B (zh) * 2020-06-08 2023-12-05 北京嘀嘀无限科技发展有限公司 语音指令识别方法、装置、可读存储介质和电子设备
CN111883162B (zh) * 2020-07-24 2021-03-23 杨汉丹 唤醒方法、装置和计算机设备
CN112365883B (zh) * 2020-10-29 2023-12-26 安徽江淮汽车集团股份有限公司 座舱系统语音识别测试方法、装置、设备及存储介质
CN112562681B (zh) * 2020-12-02 2021-11-19 腾讯科技(深圳)有限公司 语音识别方法和装置、存储介质
CN112770222A (zh) * 2020-12-25 2021-05-07 苏州思必驰信息科技有限公司 音频处理方法和装置
CN113095258A (zh) * 2021-04-20 2021-07-09 深圳力维智联技术有限公司 定向信号提取方法、系统、装置及存储介质
CN113299307B (zh) * 2021-05-21 2024-02-06 深圳市长丰影像器材有限公司 麦克风阵列信号处理方法、系统、计算机设备及存储介质
CN113539260A (zh) * 2021-06-29 2021-10-22 广州小鹏汽车科技有限公司 一种基于车辆的语音交流方法和装置
CN113555033A (zh) * 2021-07-30 2021-10-26 乐鑫信息科技(上海)股份有限公司 语音交互系统的自动增益控制方法、装置及系统
CN113744752A (zh) * 2021-08-30 2021-12-03 西安声必捷信息科技有限公司 语音处理方法及装置
CN114257684A (zh) * 2021-12-17 2022-03-29 歌尔科技有限公司 一种语音处理方法、系统、装置及电子设备

Citations (4)

Publication number Priority date Publication date Assignee Title
JPH04273298A (ja) * 1991-02-28 1992-09-29 Fujitsu Ltd 音声認識装置
CN104936091A (zh) * 2015-05-14 2015-09-23 科大讯飞股份有限公司 基于圆形麦克风阵列的智能交互方法及系统
CN105765650A (zh) * 2013-09-27 2016-07-13 亚马逊技术公司 带有多向解码的语音辨识器
CN109272989A (zh) * 2018-08-29 2019-01-25 北京京东尚科信息技术有限公司 语音唤醒方法、装置和计算机可读存储介质

Family Cites Families (25)

Publication number Priority date Publication date Assignee Title
JP2000148185A (ja) * 1998-11-13 2000-05-26 Matsushita Electric Ind Co Ltd 認識装置及び認識方法
WO2004038697A1 (en) * 2002-10-23 2004-05-06 Koninklijke Philips Electronics N.V. Controlling an apparatus based on speech
JP3632099B2 (ja) * 2002-12-17 2005-03-23 独立行政法人科学技術振興機構 ロボット視聴覚システム
KR100493172B1 (ko) * 2003-03-06 2005-06-02 삼성전자주식회사 마이크로폰 어레이 구조, 이를 이용한 일정한 지향성을갖는 빔 형성방법 및 장치와 음원방향 추정방법 및 장치
US7415117B2 (en) * 2004-03-02 2008-08-19 Microsoft Corporation System and method for beamforming using a microphone array
US9026444B2 (en) * 2009-09-16 2015-05-05 At&T Intellectual Property I, L.P. System and method for personalization of acoustic models for automatic speech recognition
US9076450B1 (en) * 2012-09-21 2015-07-07 Amazon Technologies, Inc. Directed audio for speech recognition
US10229697B2 (en) * 2013-03-12 2019-03-12 Google Technology Holdings LLC Apparatus and method for beamforming to obtain voice and noise signals
EP2981097B1 (en) * 2013-03-29 2017-06-07 Nissan Motor Co., Ltd Microphone support device for sound source localization
US9640179B1 (en) * 2013-06-27 2017-05-02 Amazon Technologies, Inc. Tailoring beamforming techniques to environments
US9747899B2 (en) * 2013-06-27 2017-08-29 Amazon Technologies, Inc. Detecting self-generated wake expressions
WO2015151131A1 (ja) * 2014-03-31 2015-10-08 パナソニックIpマネジメント株式会社 指向性制御装置、指向性制御方法、記憶媒体及び指向性制御システム
US10510343B2 (en) * 2014-06-11 2019-12-17 Ademco Inc. Speech recognition methods, devices, and systems
JP6450139B2 (ja) * 2014-10-10 2019-01-09 株式会社Nttドコモ 音声認識装置、音声認識方法、及び音声認識プログラム
CN104810021B (zh) * 2015-05-11 2017-08-18 百度在线网络技术(北京)有限公司 应用于远场识别的前处理方法和装置
US10013981B2 (en) * 2015-06-06 2018-07-03 Apple Inc. Multi-microphone speech recognition systems and related techniques
CN105206281B (zh) * 2015-09-14 2019-02-15 胡旻波 基于分布式麦克风阵列网络的语音增强方法
US9875081B2 (en) * 2015-09-21 2018-01-23 Amazon Technologies, Inc. Device selection for providing a response
US9930448B1 (en) * 2016-11-09 2018-03-27 Northwestern Polytechnical University Concentric circular differential microphone arrays and associated beamforming
KR102457667B1 (ko) 2017-02-17 2022-10-20 쇼와덴코머티리얼즈가부시끼가이샤 접착제 필름
CN107123430B (zh) * 2017-04-12 2019-06-04 广州视源电子科技股份有限公司 回声消除方法、装置、会议平板及计算机存储介质
CN107316649B (zh) * 2017-05-15 2020-11-20 百度在线网络技术(北京)有限公司 基于人工智能的语音识别方法及装置
US10311872B2 (en) * 2017-07-25 2019-06-04 Google Llc Utterance classifier
CN107680594B (zh) * 2017-10-18 2023-12-15 宁波翼动通讯科技有限公司 一种分布式智能语音采集识别系统及其采集识别方法
CN107785029B (zh) * 2017-10-23 2021-01-29 科大讯飞股份有限公司 目标语音检测方法及装置


Non-Patent Citations (1)

Title
See also references of EP3816995A4 *

Cited By (3)

Publication number Priority date Publication date Assignee Title
JP2022017171A (ja) * 2020-07-20 2022-01-25 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Speech recognition method, speech recognition apparatus, electronic device, computer-readable storage medium, and computer program
US11735168B2 2020-07-20 2023-08-22 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for recognizing voice
JP7355776B2 2020-07-20 2023-10-03 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Speech recognition method, speech recognition apparatus, electronic device, computer-readable storage medium, and computer program

Also Published As

Publication number Publication date
US20200335088A1 (en) 2020-10-22
CN110164446A (zh) 2019-08-23
CN110164446B (zh) 2023-06-30
US11217229B2 (en) 2022-01-04
EP3816995A4 (en) 2021-08-25
JP2021515281A (ja) 2021-06-17
EP3816995A1 (en) 2021-05-05
CN110364166A (zh) 2019-10-22
JP7109852B2 (ja) 2022-08-01
CN110364166B (zh) 2022-10-28

Similar Documents

Publication Publication Date Title
WO2020001163A1 (zh) Speech recognition method and apparatus, computer device and electronic device
CN109712626B (zh) Speech data processing method and apparatus
WO2020103703A1 (zh) Audio data processing method, apparatus, device, and storage medium
US11218802B1 (en) Beamformer rotation
JP2012523731A (ja) Optimal modal beamformer for sensor arrays
Liu et al. Deep learning assisted sound source localization using two orthogonal first-order differential microphone arrays
Pujol et al. BeamLearning: An end-to-end deep learning approach for the angular localization of sound sources using raw multichannel acoustic pressure data
Kumatani et al. Multi-geometry spatial acoustic modeling for distant speech recognition
He et al. Closed-form DOA estimation using first-order differential microphone arrays via joint temporal-spectral-spatial processing
Bai et al. Audio enhancement and intelligent classification of household sound events using a sparsely deployed array
Li et al. Reverberation robust feature extraction for sound source localization using a small-sized microphone array
SongGong et al. Acoustic source localization in the circular harmonic domain using deep learning architecture
Leng et al. A new method to design steerable first-order differential beamformers
Wu et al. Sound source localization based on multi-task learning and image translation network
CN113223552B (zh) Speech enhancement method, apparatus, device, storage medium, and program
Sakavičius et al. Estimation of sound source direction of arrival map using convolutional neural network and cross-correlation in frequency bands
Salvati et al. Iterative diagonal unloading beamforming for multiple acoustic sources localization using compact sensor arrays
Zhu et al. IFAN: An Icosahedral Feature Attention Network for Sound Source Localization
Bai et al. Tracking of Moving Sources in a reverberant environment using evolutionary algorithms
US11950062B1 (en) Direction finding of sound sources
WO2024016793A1 (zh) Speech signal processing method, apparatus, device, and computer-readable storage medium
CN113068101B (zh) Ring array sound pickup control method and apparatus, storage medium, and ring array
Zhu et al. A Deep Learning Based Sound Event Location and Detection Algorithm Using Convolutional Recurrent Neural Network
Luo et al. On the Design of Planar Differential Microphone Arrays with Specified Beamwidth or Sidelobe Level
Wang et al. Robust superdirective beamforming with sidelobe constraints for circular sensor arrays via oversteering

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19824812; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2020570624; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
WWE Wipo information: entry into national phase (Ref document number: 2019824812; Country of ref document: EP)
ENP Entry into the national phase (Ref document number: 2019824812; Country of ref document: EP; Effective date: 20210128)